Nick Frosst is the co-founder and CEO of Cohere, a company that provides cutting-edge natural language processing models for a wide variety of use cases. Prior to founding Cohere, Nick was one of the first hires on the Google Brain Toronto team, where he worked on Capsule Networks and adversarial examples.

Nick recently sat down with Uplimit (formerly CoRise) co-founder Sourabh Bajaj to discuss his experience building large-scale NLP models, his hopes for the future of the field, and his advice for anyone interested in a career in NLP. 

The following excerpts from their conversation have been edited and condensed for clarity.

 Nick, you started a company in the natural language processing space—what about language AI excites you? Why are you bullish on this space?

 I think language is a really great area for machine learning, for a number of reasons. First of all, language is arguably our best technology. We solve a ton of problems with language, which means there’s tremendous power in being able to understand and work with language in a systematic way.

On top of that, it turns out that transformers in particular are just really good at language. A single 100-billion-parameter language model is probably better at 10 separate tasks than 10-billion-parameter models built specifically for those tasks. I don’t know if that’s true for many other modalities. If you just keep scaling up and throwing more and better data at your model, you keep getting better results.


Awesome. Just to double-click on that, why are transformers so effective at language?

 That's a good question. I think there are a few reasons. There’s the way they work and the things they focus on. Transformers focus on coincidence detection, and looking for agreement between predictions. They model sequences with a correct attention mask, and then they look for agreement between the predictions and the tokens of that sequence. I think that's a really good prior for language.

The other thing is that transformers are really easy to scale. They’re similar to Capsule Networks in a lot of ways, but Capsule Networks never really scaled because it was really hard to get them to work on anything big. The procedures that we had in them, like iterative routing, took up a lot of memory. But transformers can work at a massive scale, which is great.
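As a rough illustration of the attention mask and "agreement" idea Nick describes, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It assumes the mask in question is a causal one (each token sees only itself and earlier tokens), and it reads the query-key dot product as the agreement score; both are interpretations for illustration, not a description of Cohere's models. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of equal-length float vectors, one per token.
    Position i may only attend to positions 0..i (assumed mask).
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Agreement between this token's query and each visible key.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)  # causal mask: future tokens are skipped
        ]
        weights = softmax(scores)
        # Weighted average of the visible value vectors.
        out = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
        # Pad with zeros so every row has one weight per token.
        all_weights.append(weights + [0.0] * (len(queries) - i - 1))
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1, and the entries for future positions are exactly zero, which is all the mask does in this sketch. A real transformer would also learn separate projections for queries, keys, and values and run many heads in parallel.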

How do you think about access to large datasets? Will that become a bottleneck?

 I don't think so. There's enough text on the web that you can create a dataset of pretty much any size. The tricky part is that you are going to get a lot of stuff in there that you don't want to train on. Some of it’s just poor quality. Some of it’s too different from the type of text you want your model to be able to generate. Maybe you get some text in there that isn’t even human-readable. You're going to do a lot of filtration for safety and things like that. So I think finding good data can be challenging, or at least time-consuming. But I wouldn’t say it’s a bottleneck. The data is out there. We just have to do a better job of finding and curating it.

Could you touch a bit on how you think about safety? How do you decide whether data is safe to train on?

 For sure. We've been thinking about that. It starts with just acknowledging that language models are a really powerful technology. They can be used to solve all kinds of super important problems, but they can also be used to cause harm. So we’ve thought a lot about what we want our models to be bad at. We're the creator of a tool, and we want to make that tool really good for things we're proud of, and really bad for things we would not be proud of.

For example, language models could be used to generate hate speech. We don’t want that. So we've done a lot of work to filter our dataset so the model doesn’t see examples of hate speech. Some of that work is really simple—word-level filtration, domain-level filtration. Some of it’s more complicated. We published a paper a little while back that demonstrated a method where we used previous versions of the model to calculate the conditional probability of phrases, and then if that conditional probability for a document was high, we would throw that document out.

I think safety is something we’ll be working on forever. I think as we continue to build better models, we'll continue to identify ways that we can make them better aligned to the applications we want to support, and worse aligned with the applications we don't want to support.
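The three kinds of filtering mentioned above (word-level, domain-level, and a model-based score with a threshold) could be combined along these lines. This is a minimal sketch, not the method from the paper: the block lists are placeholder values, the `url` and `text` document fields are assumed, and `harm_score` is a hypothetical caller-supplied function standing in for the conditional probability that a previous model version would compute.

```python
from urllib.parse import urlparse

# Placeholder lists for illustration; a real pipeline would curate these.
BLOCKED_WORDS = {"badword"}
BLOCKED_DOMAINS = {"spam.example"}


def passes_word_filter(text, blocked_words=BLOCKED_WORDS):
    """Word-level filtration: reject text containing any blocked word."""
    tokens = {token.strip(".,!?;:\"'()").lower() for token in text.split()}
    return tokens.isdisjoint(blocked_words)


def passes_domain_filter(url, blocked_domains=BLOCKED_DOMAINS):
    """Domain-level filtration: reject documents from blocked domains."""
    host = urlparse(url).hostname or ""
    return host.lower() not in blocked_domains


def filter_documents(documents, harm_score, threshold=0.5):
    """Keep only documents that pass all three checks.

    `harm_score` maps text to a score in [0, 1]; it is a stand-in for a
    model-based conditional probability (an assumption of this sketch).
    """
    kept = []
    for doc in documents:
        if not passes_domain_filter(doc["url"]):
            continue
        if not passes_word_filter(doc["text"]):
            continue
        if harm_score(doc["text"]) >= threshold:
            continue  # score too high: throw the document out
        kept.append(doc)
    return kept
```

The cheap checks run first so that the expensive model-based score is computed only for documents that survive them. The threshold of 0.5 is arbitrary and would need tuning against labeled examples.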

 One question to deep dive on the technical side. Let's say you had some notion of how to make a dataset safe, and you did one run of the training. As you’re fine-tuning your hypothesis of what a safe dataset looks like, do you consider re-training the model from scratch?

 We actually looked at this in the paper I just mentioned. We tested how much improvement you get from taking a model and just fine-tuning it on this new filtered dataset, as compared to training a new model, and it was pretty much the same improvement. That said, we do train new models every now and then. We change the tokenizer, or we make some other big change, and we do have to redo the whole thing. But we did find that you can get big improvements from just fine-tuning as well.


We’ve talked a lot about safety. What are some other interesting problems you’re working on at Cohere?

 We're pushing on embedding models and generation models. We shipped a new product called Classify recently, which allows developers to go into our platform and fine-tune a classification model. We're working on things like that, figuring out how to apply these large language models to really well-formulated problems that lots of people have, like entity extraction and summarization. There’s a lot of exciting work to be done.


 Given the rapid progress in this field, are there any resources you recommend to someone who’s trying to stay ahead of the curve on NLP?

 This may not be a super satisfying answer, but I would actually say don’t worry too much about staying ahead of the field. My advice would be to just work on what is interesting to you. That could be something that everybody is excited about, or it might be something that very few people are excited about. I've been in NLP long enough to see a few hype cycles, and I’ve realized that it’s just such a fast-moving field, you can't really know what's going to be the next big thing. So working on what you’re most interested in is the best way to make a unique contribution.


 What are some NLP applications that you're excited to see being worked on in the next few years? Are there things that you want to see happen?

 Totally. The first thing I'm really excited about is using large language models to solve all of these—I'd say almost boring problems. There's a lot of stuff that’s currently done with much less powerful technologies that can be done really well with large language models. Things like classification or entity extraction or error correction.

The other thing I really hope we see is that language becomes the default interface for computing. That's the thing I really want to get done. We spend our whole lives developing language as the primary tool of communication. Then you sit in front of a computer and you use almost none of it. That's really weird to me. You should use the tool you've developed to communicate to a machine.


 We have an interesting audience question here. How do you fine-tune large language models for different demographics? For example, how do you deal with the fact that, say, a millennial will converse in a way that’s different from someone from another generation?

 Language models are definitely affected by the differences between idiolects. Every community has its own way of speaking, and if you take a model that’s been trained on one style of language and try to apply it to a different style, it’s not going to work nearly as well. The way you deal with that is just by fine-tuning. You start with the biggest, broadest model that’s been trained on everything. Then you fine-tune it on a specific idiolect, and you’ll see a big improvement.


 One more audience question. We have someone tuning in from Nigeria who’s asking about job opportunities in NLP. In particular, there aren’t as many jobs in this field in their region, so how should they think about it?

 That's a great question. I think it points to a broader issue in computer science and the tech economy as a whole, which is that there are tons of really talented, amazing people who are in places where finding a job is difficult.

One of the things that has been good about the pandemic is that people have gotten a lot more comfortable with remote work, so that’s helping open up some opportunities. For somebody who's working in NLP in a place like Nigeria where there isn't a Google office down the street, I think the thing to do would be to just start applying to companies. Don’t restrict yourself to one time zone or even to one continent. Go ahead and apply around the world. There are tons of smaller companies that are looking for great people, and a lot of them are really open to remote work. That can be a way to get a foot in and launch your career.
