thrilled to be back @Google in the @GoogleDeepMind team! The technical breadth and expertise across the whole stack (hardware->infra->deep learning->products) is truly mind-blowing. Great to see a lot of familiar faces and meet new friends. Look forward to learning a lot!
Excited to join @AIatMeta! The past 4.5 years at @OpenAI, working on embeddings, GPT-3 & 4, API and ChatGPT, have been career highlights. Now, I'm thrilled to work on the next generations of Llama and contribute to its impact on the developer ecosystem and billions of users!🚀 1/2
We explore a simple approach to task-oriented dialog. A single neural network consumes conversation history and external knowledge as input and generates the next turn text response along with the action (when necessary) as output. Paper: arxiv.org/pdf/1910.14613.pdf 1/4
We develop a non-autoregressive machine translation model whose accuracy almost matches a strong greedy autoregressive baseline Transformer, while being 3.3 times faster at inference. Joint work with @ashVaswani, @nikiparmar09, and Aurko Roy: arxiv.org/abs/1805.11063
A thread on how we evaluate our embedding models in OpenAI’s API. We achieve state-of-the-art results in linear probe classification, text search and code search. It’s not fine-tuned, so it works great in the real world — and our customers love it. 1/7
Zero-shot results of OpenAI API’s embeddings on the FIQA search dataset. Evaluation script: github.com/arvind-neural/bei…
We zero-shot evaluated on 14 text search datasets; our embeddings outperform keyword search and previous dense embedding methods on 11 of them!
We are excited to release Taskmaster-1, a new task-oriented dialog dataset. We explore two methods for data collection: two-person dialogs and self-dialogs. Surprisingly, self-dialogs are an effective way to collect dialog data.
Paper accepted to @emnlp2019 : arxiv.org/pdf/1909.05358v1.p…
🔥New Video🔥
OpenAI now offers embeddings for text similarity and search, but are they holding up? We look at the release, the paper, the criticism, and most important: the price! Are the embeddings worth it? Watch here to find out:
piped.video/5skIqoO3ku0
Small models specifically fine-tuned on a dataset can do well on a narrow benchmark, but they far underperform in real-world settings, as many of our customers are discovering. This study from @FineTuneLearn shows our API performance. 7/7
OpenAI embeddings work on a very broad set of use cases. Here, Viable gets a 7.7% absolute improvement in clustering quality using OpenAI embeddings when compared to previous methods!
The cost to run this experiment with text-search-ada, embedding both documents and queries, is ~$80. text-search-ada achieves a 62% relative improvement over keyword search here!
We describe a simple technique to parallelize Scheduled Sampling across time that allows us to apply Scheduled Sampling for problems that involve generating very long sequences. We get better sample quality and train almost as fast as teacher-forcing.
arxiv.org/abs/1906.04331
We've trained embedding models to produce high quality text and code embeddings. Our general purpose embeddings achieve top results in classification, text search, and code search. The models are now available in the @OpenAI API: openai.com/blog/introducing-…
We're introducing embeddings, a new feature of our API that distills relationships between concepts, sentences, and even code in a simple numerical representation — for more powerful search, classification, and recommendations. openai.com/blog/introducing-…
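A minimal sketch of how embedding-based search works in principle, assuming each text has already been mapped to a vector. The three-dimensional vectors, document names, and the `rank_documents` helper below are made up for illustration; they are not outputs of the OpenAI API or code from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy vectors standing in for real document embeddings.
docs = {
    "refund_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.1],
    "python_sorting": [0.0, 0.1, 0.9],
}
# Hypothetical embedding of a query about refunds.
query = [0.8, 0.2, 0.1]

ranking = rank_documents(query, docs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In a real setup the vectors would come from an embedding model and have hundreds or thousands of dimensions, but the ranking step stays the same.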
My team and I trained the model. We look at 33 datasets across four different categories: linear probe classification, sentence similarity, text search, and code search. All these results and figures were in our paper, released this week. arxiv.org/pdf/2201.10005.pdf 2/7
We do a large-scale human study to compare different decoding methods for language generation and develop a globally normalized decoding method that optimally traverses the quality-diversity curve.
How does one trade-off sample quality and diversity in a language model? Which decoding method is best? We introduce a multi-objective framework maximizing human judgement score subject to a constraint on diversity (entropy). arxiv.org/abs/2004.10450 (1/7)
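To make the quality-diversity trade-off concrete, here is a small sketch that measures diversity as the entropy of a next-token distribution and varies it with sampling temperature. Temperature is just one common knob, used here as an assumption for illustration; the logits are made up, and this is not the decoding method proposed in the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a more diverse distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-token logits over a 4-word vocabulary.
logits = [2.0, 1.0, 0.1, -1.0]

# Lower temperature concentrates mass on the top token (less diverse);
# higher temperature spreads it out (more diverse).
entropies = {t: entropy(softmax(logits, t)) for t in (0.5, 1.0, 2.0)}
for t, h in entropies.items():
    print(f"temperature={t}: entropy={h:.3f} nats")
```

A constraint on entropy, as in the tweet, would amount to fixing a target diversity level and asking which decoding method gives the best human judgement score at that level.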
in case people are counting, I forgot to share the results for text search from 3 more datasets (apart from the 11 text search results already reported) 🙂
In our experiments we find that: 1) our model is able to incorporate external knowledge and generate factual text responses with a weak supervision signal; 2) our model can incorporate medium-size knowledge bases with only 8K training examples over multiple verticals.
Our method actually zero-shot transfers better than BM25 to 11 search tasks on average, as shown in the entire table. Even our smallest models are better than BM25. While it is not the only way to exploit training data with BM25, we perform better than one such method, docT5query.
The code for FIQA experiments to reproduce the results in the paper using the API: nitter.app/arvind_io/status/14882… . There's no discrepancy AFAIK. 2/4
We leave out 6, not 7, BEIR datasets. Results on MSMARCO, NQ, TriviaQA are in a separate table (Table 5 in the paper). NQ is part of BEIR too and we didn't want to repeat it. The 6 datasets we leave out are not readily available, and it is common to leave them out in prior work too. 3/4
Agree! But I think the once widely used Brown clusters (e.g., wing.comp.nus.edu.sg/~antho/…) should also be given credit. They use a language model pre-training objective on unlabeled data and transfer the word clusters to supervised tasks. They are not "contextual" though.
Data: ai.google/tools/datasets/tas…
Work done with many awesome colleagues at Google Assistant team and @GoogleAI along with student researcher Chinnadhurai Shankar
Also, the inductive bias of the Transformer makes it easier to skip words and learn long-range dependencies compared to RNNs. This paper arxiv.org/abs/1801.10198 has some supporting experiments.
I think it's a little harsh to call that work flag-planting. They performed experiments on 4 real-world datasets that AFAIK were widely used by the NLP community. In comparison, there were many novel methods during that period that were only evaluated on toy data.
I've noticed some of these similarities as well, and @paulg explains it well: "A startup founder is in effect an economic research scientist." (paulgraham.com/growth.html)
The conversation is annotated with accept/reject. At test time we would want the third-party business to implement a boolean function that returns whether the transaction can be completed. Neural Assistant will learn to work with the response as it has been annotated at training time.
The model is trained at the turn level, where the dialog history fed into the model as input consists of the previous ground-truth turns of the dialog. In the conversations here, the actual text responses generated by the model itself are used as the assistant’s side of the dialog history fed as input.
Thanks for the interest. I think Neural Assistant + Taskmaster (ai.google/tools/datasets/tas…) + Google search results as source for external knowledge can work really well for task-oriented dialog!