We advance state-of-the-art #AI techniques paving the path for innovative products at @Salesforce. Focus areas: #AIAgents, #EnterpriseAI, #EGI, and #TrustedAI.

Palo Alto, CA
Looking for the cutting-edge of AI research? Follow Salesforce AI Research to see how we're transforming enterprise technology through advanced innovations. From world models to agentic systems, discover the future of AI before it hits the market.
Meet our “Tiny Giant.” Our 1B parameter model xLAM-1B is now the best micro model for function calling, outperforming models 7x its size, including GPT-3.5 & Claude. On-device agentic AI is here. #AIResearch #SLM #TinyButMighty Paper: arxiv.org/pdf/2406.18518 Github: apigen-pipeline.github.io/
Introducing SFR-Judge, our new family of three judge models (8B, 12B, and 70B parameters), a game-changer for auto-evaluation and reward modeling. Blog: bit.ly/3Y12mTI Paper: arxiv.org/pdf/2409.14664 GitHub (code coming soon!): bit.ly/4do1KvL 💥 Trained to perform pairwise comparison, direct scoring, and classification judgments 💥 Outperformed many open-source judges on 10/13 benchmarks 💥 Broke the 90% accuracy barrier on RewardBench, a first for generative models 💥 Showed less bias across 6 key metrics than many other judge models 💥 Matched or outperformed GPT-4o on most pairwise comparison, direct scoring, and classification tasks. Accelerate your own model evaluation with SFR-Judge!
Time-series forecasting methods perform poorly on long sequences when data changes over time. DeepTime overcomes this issue by using forecasting-as-meta-learning on deep time-index models. Result: state-of-the-art performance and a highly efficient model. blog.salesforceairesearch.co…
Our CodeGen models are now available at @huggingface! (Model size variants: 350M, 2B, 6B, and 16B.) Clone the latest transformers repository and try it out! Paper: arxiv.org/abs/2203.13474 Models: huggingface.co/models?search…
Releasing 🚀 CodeGen2.5 🚀, a small but mighty LLM for code. - On par with models twice its size - Trained on 1.5T tokens - Features fast infill sampling Blog: blog.salesforceairesearch.co… Paper: arxiv.org/abs/2305.02309 Code: github.com/salesforce/CodeGe… Model: huggingface.co/Salesforce/co…
We’re thrilled to announce that Silvio Savarese (@silviocinguetta), former associate professor of Computer Science at Stanford University, has joined @salesforce as our new EVP and Chief Scientist of Salesforce Research!
🚀 Introducing Moirai-MoE 🚀: the first mixture-of-experts time series foundation model, a breakthrough in universal forecasting! Moirai-MoE achieves token-level model specialization autonomously, delivering an impressive 17% performance boost over its predecessor Moirai at the same model size. Plus, it outperforms other foundation models with up to 65x fewer activated parameters! 💪Dive deeper: 📄 Paper: bit.ly/3O1yiRQ 💻 Code: bit.ly/48FAF6i 🤗 Models: bit.ly/3YNozDY 🔬 Blog: sforce.co/3YXYaEy 🧵 Technical details: 👇 (1/6) Unlike our previous model Moirai, which uses multi-heuristic-defined input/output projection layers to model time series with different frequencies, Moirai-MoE uses a single input/output projection layer and delegates the task of capturing diverse time series patterns to the sparse mixture-of-experts transformers. With this design, the specialization of Moirai-MoE is achieved in a data-driven manner and operates at the token level.
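To make "token-level" sparse routing concrete, here is a minimal sketch of generic softmax top-k gating for a single token. It is an illustration only, not Moirai-MoE's actual gating function: the names `route_token` and `moe_layer`, the toy scalar "experts", and the gate scores are all assumptions made up for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_scores, top_k=2):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(gate_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

def moe_layer(token, gate_scores, experts, top_k=2):
    """Combine the outputs of the selected experts only (the rest stay idle)."""
    weights = route_token(gate_scores, top_k)
    return sum(w * experts[i](token) for i, w in weights.items())

# Hypothetical scalar "experts"; a real model would use feed-forward blocks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
gate_scores = [0.1, 2.0, -1.0, 2.0]  # assumed per-token gate logits

selected = route_token(gate_scores)
output = moe_layer(3.0, gate_scores, experts)
```

Only two of the four experts run for this token, which is where the "fewer activated parameters" saving comes from in sparse MoE layers generally.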
🌮 Introducing 🌮 TACO - our new family of multimodal action models that combine reasoning with real-world actions to solve complex visual tasks! 📊Results: 20% gains on MMVet 3.9% average improvement across 8 benchmarks 1M+ synthetic CoTA traces in training 🔓 🔓🔓Fully open-sourced! 🔓🔓🔓 Get started with: 📄 Paper: bit.ly/3PufThl 💻 Code: bit.ly/3Pw8azw 📱 Demo: bit.ly/3PwrEE2 🤖 Models: bit.ly/4j2ZG0h 📚 Datasets: bit.ly/3Pxtzbv 🧵 ...and our Technical deep-dive starts here ⤵️ (1/4) How does TACO work? 🤔 ⛓️TACO answers complex questions by generating Chains-of-Thought-and-Action (CoTA), executing intermediate actions with external tools such as OCR, calculator, and depth estimation, then integrating both the thoughts and action outputs to produce final responses. We generate the synthetic CoTA data with two approaches: model-based generation (top) and programmatic generation (bottom).
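The thought-then-action loop described above can be sketched roughly as follows. This is a toy under stated assumptions, not TACO's implementation: the steps are scripted by hand rather than generated by a model, and `calculator`, `TOOLS`, and `run_cota` are hypothetical names invented for illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def calculator(expression):
    """Toy tool: evaluates 'a op b' where op is one of + - * /."""
    left, op, right = expression.split()
    return OPS[op](float(left), float(right))

# Hypothetical tool registry; OCR or depth estimation would be further entries.
TOOLS = {"calculator": calculator}

def run_cota(steps):
    """Execute each step's action (if any) and record the observation."""
    trace = []
    for step in steps:
        observation = None
        if step.get("action"):
            observation = TOOLS[step["action"]](step["input"])
        trace.append({"thought": step["thought"], "observation": observation})
    return trace

# Scripted steps standing in for what a model would generate.
steps = [
    {"thought": "The receipt shows 3 items at 4.5 each.",
     "action": "calculator", "input": "3 * 4.5"},
    {"thought": "The total is the calculator output.", "action": None},
]

trace = run_cota(steps)
final_answer = f"Total: {trace[0]['observation']}"
```

The final response is built from both the thoughts and the tool outputs, which is the integration step the post refers to.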
Breaking news! ➡️➡️➡️ We just released the MINT-1T 🍃dataset! One trillion tokens. Multimodal. Interleaved. Open-source. Perfect for training multimodal models and advancing their pre-training. Try it today! Blog: bit.ly/3YikQPP Dataset: bit.ly/3YikQiN
🚨🚨🚨Just released!🚨🚨🚨 🚀Introducing the Salesforce Code Embedding Model Family (SFR-Embedding-Code), ranked #1 on CoIR Benchmark! 🚀 Available in 2 sizes: 2B, 400M. Key Highlights: 1️⃣ 2B Model: Achieves #1 on CoIR. 2️⃣400M Model: Best-performing model under 0.5B parameters. 3️⃣ Multi-lingual, multi-task unified training framework for code retrieval 4️⃣ Supports 12 programming languages, including Python, Java, C++, JavaScript, C#, and more! 🧑‍💻✨Empower your next AI Coding Agent with the best code embedding models! 🧑‍💻✨ Join us in advancing #AccurateAI: 📎Paper: bit.ly/4gSZteu 🤗400M Model: bit.ly/4jhDRdp 🤗2B Model: bit.ly/3PCqxmp #CodeAI #MLResearch #SOTA #OpenScience @Salesforce Big thanks to our research team for SFR-Embedding Code: Ye Liu @YeLiu918 Rui Meng @RuiMeng_ Shafiq Joty @JotyShafiq Silvio Savarese @silviocinguetta Yingbo Zhou @yingbozhou_ai Caiming Xiong @CaimingXiong Semih Yavuz @semih__yavuz
Discover CodeGen - an AI model that turns simple natural-language requests into executable code. Learn more about this breakthrough in conversational AI programming. Paper: arxiv.org/abs/2203.13474 Blog: blog.salesforceairesearch.co… Code: github.com/salesforce/CodeGe…
🌳🌳🌳Introducing "CodeTree"🌳🌳🌳 The first unified framework combining tree-based strategy exploration + execution feedback + LLM agent guidance for code generation. 🖇️ Paper: arxiv.org/pdf/2411.04329 📈 Setting new standards with GPT-4: 95.1% HumanEval 98.7% MBPP 43.0% CodeContests Why CodeTree works: 🌳 Tree structure unifies strategy planning, implementation & refinement 🌳 Novel Critic Agent guides search & pruning 🌳 Combines execution feedback + LLM reasoning 🌳 Breakthrough on complex tasks (27.6% on SWEBench) Our framework enables efficient exploration of coding strategies and multi-stage refinement, achieving SOTA across 7 benchmarks. Dive in: arxiv.org/pdf/2411.04329
Open science wins again! Introducing Salesforce Research DEI, our AI software engineering agents org, achieving a 34.3% resolve rate on SWE-Bench Lite, crushing closed-source systems! GitHub: salesforce-research-dei-agen… Paper: arxiv.org/abs/2408.07060 #OpenScience #AIForAll
💡 Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models 💡 📄 Paper: bit.ly/44IAvuO 💻 Code: bit.ly/4lLjQgd 😵‍💫 Have a task but experiencing prompt engineering existential dread? Few-shot or zero-shot? Chain-of-thought or ReAct? Where do I get examples? Should I label data? How do I evaluate? What metrics? Manual feedback or auto-looping? Why does one word change everything? Promptomatix eliminates the entire decision tree. Describe task → receive optimized prompt → question nothing. Sanity restored ✨ #LLMs #LargeLanguageModels #FutureOfAI #EnterpriseAI
📣 Introducing Text2Data, open-sourced for the research community! 🖇️ Paper: arxiv.org/abs/2402.10941 ⌨️ Code: bit.ly/3DzKW90 🧪 A major advancement in multimodal AI: a low-resource, universal text-to-anything framework capable of bridging text with diverse modalities (molecules, motion sequences, time series) without costly human annotations! 🎬 Text2Data in action: 🎬 Our framework first learns general data patterns from unlabeled data (blue), then fine-tunes with limited labeled examples (red) using constraint optimization to prevent forgetting. At bottom, you see molecules generated with increasing polarizability levels from 'very low' to 'very high', demonstrating precise textual control. Our #AAAI2025-accepted approach (🎉!) uses unlabeled data distribution mastery + constraint optimization to move beyond traditional annotated data methods. Try it and share your results! #MultimodalAI #OpenScience
🔥Introducing XGen-7B, a new 7B LLM trained on 8K seq. length for 1.5T tokens. Better or comparable results with MPT, Falcon, LLaMA, OpenLLaMA in text & code tasks. Blog: blog.salesforceairesearch.co… Code: github.com/salesforce/xgen Training cost ~$150K for 1T token
🤩 📣 Announcing the 2nd Annual @Salesforce Research Deep Learning Grant 🤩 📣 We're looking for diverse individuals with innovative ideas who can join us in shaping the future of AI. Apply today, and earn up to $50,000! einstein.ai/research/grants
👇UPDATED DATASET👇Fineweb training dataset just got leaner! We've tackled the ~70% duplication issue in this valuable 93.4TB dataset. Same great data, now more efficient and cost-effective. bit.ly/3XI3wlB #AIResearch #DataEfficiency
🔬🔬🔬Introducing ProVision: A new system for transforming images into verified instruction data for multimodal language models (MLMs) at massive scale! Scene graphs + programmatic synthesis generate 10M+ diverse, automated Q&A pairs. Fully verifiable. Training MLMs? Dive in: 📰Blog: sforce.co/3WazqHi 🗞️Paper: bit.ly/4jkoocL 💻Dataset: bit.ly/4j2IojR 👇Researcher’s 🧵👇 (1/6) Why build ProVision? Training multimodal LMs demands massive instruction datasets that pair images with Q&As. Manual creation is costly, while using existing models risks hallucinations. ProVision's novel solution? Scene graphs + human-written programs. We represent images as structured graphs capturing objects, attributes & relationships. Using Python programs and textual templates, our data generators then synthesize instruction data by creating questions and answers from the scene graph. 👇🧵 for more...
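As a rough picture of what "scene graph + templates" means in practice, the sketch below turns a tiny hand-written scene graph into question-answer pairs. The graph schema, the two template strings, and the function names `attribute_qa` and `relation_qa` are assumptions for illustration, not ProVision's actual data format or generators.

```python
# Hypothetical scene graph: objects with attributes, plus subject-relation-object triples.
scene_graph = {
    "objects": {
        "o1": {"name": "dog", "attributes": {"color": "brown"}},
        "o2": {"name": "sofa", "attributes": {"color": "red"}},
    },
    "relations": [("o1", "sitting on", "o2")],
}

def attribute_qa(graph):
    """One question per (object, attribute) pair, answered from the graph."""
    pairs = []
    for obj in graph["objects"].values():
        for attr_name, value in obj["attributes"].items():
            pairs.append((f"What is the {attr_name} of the {obj['name']}?", value))
    return pairs

def relation_qa(graph):
    """One question per relation triple; the answer is the object's name."""
    pairs = []
    for subj_id, relation, obj_id in graph["relations"]:
        subj = graph["objects"][subj_id]["name"]
        obj = graph["objects"][obj_id]["name"]
        pairs.append((f"What is the {subj} {relation}?", obj))
    return pairs

qa_pairs = attribute_qa(scene_graph) + relation_qa(scene_graph)
```

Because every answer is read directly from the graph, each pair can be checked against its source, which is the sense in which such data is verifiable.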
Did you know most #NLP models are not designed to handle code-mixing, where each sentence contains multiple languages? Learn how @samsontmr @SFResearch is changing that. Blog: blog.salesforceairesearch.co… Paper: aclweb.org/anthology/2021.na… Code: github.com/salesforce/advers…
(1/12) Can different LLMs give you unique and novel ideas? Very likely NO! 🤖 "𝗦𝗵𝗮𝗿𝗲𝗱 𝗜𝗺𝗮𝗴𝗶𝗻𝗮𝘁𝗶𝗼𝗻: 𝗟𝗟𝗠𝘀 𝗛𝗮𝗹𝗹𝘂𝗰𝗶𝗻𝗮𝘁𝗲 𝗔𝗹𝗶𝗸𝗲" reveals: LLMs often 𝗮𝗴𝗿𝗲𝗲 on purely imaginary and hallucinated contents! Explore 🧵or full paper: arxiv.org/abs/2407.16604
So excited to be at @emnlp2019 this week in Hong Kong, with 7 papers accepted. Check out our booth with the amazing @SalesforceEdu team, and stay tuned throughout the week for updates on all of our different sessions!
Thank you to everyone who submitted a proposal to our third annual Salesforce AI Research Grant. We’re proud to announce our 2020 round of winners. Congratulations!! @bluevincent @Diyi_Yang @mutembesa @danqi_chen Read More: blog.einstein.ai/celebrating…
⚡ Meet BOLT: A novel approach to develop long chain-of-thought reasoning in LLMs without relying on knowledge distillation or extensive human annotations. 📄 arXiv.org/abs/2502.03860v1 Three key stages: 1️⃣ LongCoT data bootstrapping via in-context learning 2️⃣ Supervised fine tuning 3️⃣ Online refinement Achieves 40%+ gains on Arena-Hard & strong results across MT-Bench, WildBench, & MATH500 - all with just 10 examples. *Shout out to @_akhaliq for sharing it!
👁️ Looking for VLMs that go beyond generators to transform multimodal embeddings? Meet "VLM2VEC: Training Vision-Language Models for Massive Multimodal Embedding Tasks" 📎 Paper: arxiv.org/pdf/2410.05160 💻 Website: tiger-ai-lab.github.io/VLM2V… Our #ICLR25-featured paper shows how vision language models transform into powerful embedders for classification, VQA, retrieval, and visual grounding. We unlock strong emergent capabilities by deeply fusing vision and language rather than shallow combinations. 🇸🇬 Visit us in Singapore to see how we're redefining multimodal representation learning! #MultimodalAI #VLMs
Do you want to launch your career in machine learning research? Our new AI Residency Program lets you do just that. Set yourself up for success in applying to PhD programs w/ real-world experience at one of the industry's top AI research programs. sforce.co/AIResTwitter
CodeRL advances program synthesis by integrating pretrained language models + deep reinforcement learning. Using unit test feedback in model training and inference + an improved CodeT5 model, it achieves SOTA results on competition-level programming tasks. blog.salesforceairesearch.co…
Introducing Enterprise Deep Research (EDR): A steerable multi-agent system that transforms complex enterprise research into comprehensive, actionable reports 📊 EDR combines 5 key components: 🧠 Master Planning Agent for adaptive query decomposition 🔍 4 specialized search agents (General, Academic, GitHub, LinkedIn) 🛠️ Extensible MCP-based tools (NL2SQL, file analysis, enterprise workflows) 📈 Visualization Agent for data-driven insights 🔄 Reflection mechanism with optional human-in-the-loop guidance Results on open benchmarks: ✅ Outperforms SOTA on DeepResearch Bench (49.86 score) ✅ 71.57% win rate on DeepConsult vs OpenAI DeepResearch ✅ 68.5% coverage on ResearchQA across 7 research domains We're releasing EDR-200 dataset with complete research trajectories from 201 benchmark evaluations 📂 📄 Paper: bit.ly/49in6fp 💻 Code: bit.ly/4huXiPq 📊 Dataset: bit.ly/3LbHcOt Authors: @aksh_555 @shoonyaka1 @zxchen @iscreamnearby @huan__wang at @Salesforce AI Research #MultiAgent #EnterpriseAI #DeepResearch #OpenScience
🔉 New advances in LLM reasoning capabilities accepted for oral presentation at #ICLR2025! 📎 Paper: arxiv.org/abs/2410.02108 ReGenesis introduces a novel approach where models self-improve their reasoning through abstraction-to-concrete progression - no human supervision needed. Key findings: ▶️ Self-synthesized reasoning paths ▶️ Superior generalization to new tasks ▶️ 6.1% improvement in OOD performance ▶️ Validated across multiple model architectures Our work opens new possibilities for developing more robust and generalizable AI systems. Stay tuned for the full presentation and see you in Singapore! #AIResearch #AIReasoning @iclr_conf
Introducing XGen-Image-1, our first foray into training large text-to-image models. Trained for $75K using TPUs on the LAION dataset, XGen-Image-1 matches the performance of Stable Diffusion 1.5/2.1. blog.salesforceairesearch.co…
🚨 Introducing CRMArena-Pro: The first multi-turn, enterprise-grade benchmark for LLM agents ✍️Blog: sforce.co/4dKBRIq 🖇️Paper: bit.ly/3T0AY4E 🤗Dataset: bit.ly/4kiRlG3 🖥️Code: bit.ly/4fkrZVM Most AI benchmarks test isolated, single-turn tasks. Enterprise work is messy, multi-step, and demands both capability AND confidentiality. 🔬Built with our exclusive synthetic dataset: Live Salesforce Org sandboxes with realistic, expert-crafted CRM data, giving enterprise complexity without customer exposure. What makes CRMArena-Pro different: 🎯 Multi-domain: Sales, service, CPQ workflows 🔄 Multi-turn conversations vs single exchanges 🔒 Confidentiality awareness testing 🏢 Live CRM environment with real API calls 📊 Complex, interconnected business records The results? Even the best LLMs struggle significantly on #EnterpriseAI scenarios. The gap between AI demos and business reality is wider than most realize. CRMArena-Pro is the first evaluation infrastructure for true Enterprise General Intelligence. #EnterpriseAI #AgenticAI #EGI
Just in! Our “Tiny Giant”, xLAM-1B-fc, has officially arrived on @huggingface with a few friends!🎉 Check out our suite of small agentic models, including xLAM-1B-fc and xLAM-7B-fc, with mobile-ready, quantized versions available now!⚡️#LAM #AIModels #AI 🤗:bit.ly/4faoYaQ
Check out Diffusion-DPO🌟 Bridging the gap between StableDiffusion & closed models like Midjourney v5. Our #TextToImage model uses human feedback for state-of-the-art alignment, marking a new era in AI creativity! Code: sforce.co/4ab7p7J Blog: sforce.co/3VHYQg3
🚀 Supercharge your RAG pipeline! 🚀 Introducing LlamaRank, our SOTA reranker, outperforming leading APIs in general document ranking and code search across diverse datasets! Blog: bit.ly/3MmHDTu Try it out on @togethercompute: bit.ly/3SZHybZ Built on Llama3-8B-Instruct and with linear and calibrated scoring for easy interpretation, LlamaRank isn't just powerful, it's blazingly fast.
📊 Meet LaTent Reasoning Optimization (LaTRO):📊 A principled variational approach to optimize LLM reasoning: 💥 Paper: bit.ly/3YUoP43 💥 Code: bit.ly/3YUoQVF By treating reasoning as sampling from a latent distribution, LaTRO improves zero-shot math accuracy by 12.5% over base models—no external rewards needed. Implement self-rewarding reasoning in your models today! #AIResearch #DeepLearning
📢📢📢Introducing xGen-MM-Vid (BLIP-3-Video)! This highly efficient multimodal language model is laser-focused on video understanding. Compared to other models, xGen-MM-Vid represents a video with a fraction of the visual tokens (e.g., 32 vs. 4608 tokens). Paper: arxiv.org/abs/2410.16267 Website: bit.ly/3Yvyqiy Researcher’s 🧵:👇
Announcing the Third Annual AI Research Grant! For more details and how to apply: Blog: blog.einstein.ai/announcing-… Website: einstein.ai/outreach/grants Good luck to our future applicants!
We’re thrilled to announce that @MetaMindIO has been acquired by @Salesforce! metamind.io/salesforce-acqui…
(1/4) Foundation models are revolutionizing time series analysis—but their success depends on large, diverse, high-quality datasets, which poses a major challenge. Enter synthetic data, reshaping Time Series Foundation Models (TSFMs) & Time Series LLMs (TSLLMs). Our survey explores how it tackles data scarcity, improves model training & unlocks new research directions. 🧵⬇️ 📝 Paper: arxiv.org/abs/2503.11411
We're thrilled to announce this year's @SFResearch Deep Learning Grant winners @ChenhaoTan @gregd_nlp @pulkitology Christopher Ré and Hung-yi Lee! 🎉👏 We're excited to work together to advance the state of AI. Read more about the winning proposals: blog.einstein.ai/celebrating…
🌟 Meet #Moirai: Revolutionizing time-series forecasting with universal models! Say goodbye to dataset-specific models and hello 👋 to accurate forecasts across domains! Code: sforce.co/4aADhSM LOTSA data: sforce.co/4axHHtQ Blog post: sforce.co/3TCMDqu
🚨Introducing "Elastic Reasoning"🚨 Our novel framework solves LLM inference budget constraints without sacrificing performance. Open and available to the research community: 📄 Paper: bit.ly/4kygc8p 💻 Code: bit.ly/3ZwjFfo 🤗 Models: bit.ly/44RmICG Key insight: Separate "thinking" and "solution" phases with independent token budgets, plus budget-constrained rollout training. Research results: 👉 E1-Math-1.5B: 35% accuracy on AIME2024 with 32% fewer tokens 👉 E1-Code-14B: Codeforces rating of 1987 (96th percentile) 👉 Models generalize to ANY budget without retraining The framework (shown) combines GRPO training under constraints + separate budgeting at inference. This means reliable reasoning even when thinking gets cut short. 🎯 Why this matters: Real deployments need predictable costs. Most reasoning models generate uncontrolled token lengths. Elastic Reasoning gives you the dial to tune compute vs performance. #OpenScience #LLM #ReasoningModels
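The key insight, two phases with independent token budgets, can be pictured with a small sketch. It only illustrates the budgeting idea at inference time: pre-generated token lists stand in for model output, the `</think>` marker and the name `elastic_decode` are assumptions, and the budget-constrained training the post mentions is not covered.

```python
END_THINK = "</think>"  # assumed marker that closes the thinking phase

def elastic_decode(thinking, solution, think_budget, solution_budget):
    """Cap each phase with its own token budget.

    `thinking` and `solution` are lists of tokens standing in for what a
    model would emit; a real system would stop generation at the budget
    and then continue from the marker instead of slicing finished lists.
    """
    kept_thinking = thinking[:think_budget]
    kept_solution = solution[:solution_budget]
    return kept_thinking + [END_THINK] + kept_solution

thinking = ["try", "x=2", "check", "retry", "x=3"]
solution = ["answer:", "3"]

# Thinking is cut short at 3 tokens, but the solution budget is untouched.
output = elastic_decode(thinking, solution, think_budget=3, solution_budget=4)
```

The total length is bounded by the two budgets plus the marker, which is what makes the cost predictable regardless of how long the model would otherwise think.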
🚨 New Survey Alert! 🚨 🧠 “A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems” 📘 Paper: bit.ly/4cnAhvq 🧠 Project Page: bit.ly/3E6ROv6 🧵 Researcher's thread: 👇 (1/6) Reasoning is the key to unlocking true AI intelligence.🔑 Two factors that affect reasoning capabilities are: 1⃣ Regime: how and at what stage is reasoning achieved? 2⃣ Architecture: what components are involved in the reasoning process? ⚡️We present a comprehensive survey along these two dimensions, summarizing recent progress and covering: Regimes, from inference scaling (e.g., OpenAI o1) to learning to reason (e.g., DeepSeek-R1), including learning algorithms for both the reasoner and the verifier; Architectures, ranging from standalone LLMs to agentic systems (e.g., OpenAI’s deep research). We also unify techniques from input and output perspectives, clarifying what must be customized or designed when building reasoning systems.
🌟 Excited to present our work at Empirical Methods in Natural Language Processing @emnlpmeeting - a leading conference in NLP and AI research! 📄 Our accepted papers:
Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization 👥Authors: Chuyuan Li @ChuyuanLi, Austin Xu @austinsxu, Shafiq Joty @JotyShafiq, and Giuseppe Carenini @careninigiusepp 📝Paper: bit.ly/47L4bsV
Demystifying Domain-adaptive Post-training for Financial LLMs 👥Authors: Zixuan Ke @KeZixuan, Yifei Ming @ming5_alvin, Xuan-Phi Nguyen, Caiming Xiong @CaimingXiong, Shafiq Joty @JotyShafiq 📝Paper: bit.ly/4hK6O0e
CEMTM: Contextual Embedding-based Multimodal Topic Modeling 👥Authors: Amirhossein Abaskohi @AmirAbaskohi, Raymond Li, Chuyuan Li @ChuyuanLi, Shafiq Joty @JotyShafiq, and Giuseppe Carenini @careninigiusepp 📝Paper: bit.ly/3JsZFFy
From Charts to Fair Narratives: Uncovering and Mitigating Geo-Economic Biases in Chart-to-Text 👥Authors: Ridwan Mahbub @mahbub_ridwan, Mohammed Saidul Islam, Mir Tafseer Nayeem @mtnayeem, Md Tahmid Rahman Laskar, Mizanur Rahman, Shafiq Joty @JotyShafiq, Enamul Hoque @Enamul_Hoque 📝Paper: bit.ly/3HNIgXC
Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text 👥Authors: Mizanur Rahman, Md Tahmid Rahman Laskar, Shafiq Joty @JotyShafiq, Enamul Hoque @Enamul_Hoque 📝Paper: bit.ly/41HuI6X
Direct Judgement Preference Optimization 👥Authors: Peifeng Wang @PeifengWang3, Austin Xu @austinsxu, Yilun Zhou @YilunZhou, Caiming Xiong @CaimingXiong, Shafiq Joty @JotyShafiq 📝Paper: bit.ly/40fLkkv
MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models 👥Authors: Shrey Pandit @ShreyPandit2001, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding 📝Paper: bit.ly/3JtFGGD
ActionStudio: A Lightweight Framework for Data and Training of Large Action Models 👥Authors: Jianguo Zhang @JianguoZhang3, Thai Hoang, Ming Zhu @ming_zhu0527, Zuxin Liu @LiuZuxin, Shiyu Wang @shiyu04490786, Tulika Awalgaonkar @tulika614, Akshara Prabhakar @aksh_555, Haolin Chen @HaolinChen11, Weiran Yao @iscreamnearby, Zhiwei Liu, Juntao Tan, Juan Carlos Niebles @jcniebles, Shelby Heinecke @shelbyh_ai, Huan Wang @huan__wang, Silvio Savarese @silviocinguetta, Caiming Xiong @CaimingXiong 📝Paper: bit.ly/4fSEF79
LATTE: Learning to Think with Vision Specialists 👥Authors: Zixian Ma, Jianguo Zhang @JianguoZhang3, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles @jcniebles, Shelby Heinecke @shelbyh_ai, Huan Wang @huan__wang, Caiming Xiong @CaimingXiong, Ranjay Krishna @RanjayKrishna, Silvio Savarese @silviocinguetta 📝Paper: bit.ly/3UEXlO7
Contra4: Evaluating Contrastive Cross-Modal Reasoning in Audio, Video, Image, and 3D 👥Authors: Artemis Panagopoulou @artemispng, Le Xue @Le_Xue01, Honglu Zhou @zhou_honglu 📝Paper: bit.ly/4oXgEQH
#EMNLP2025 #FutureOfAI #EnterpriseAI #LanguageModels #NLP
📣 From efficient key caches and multimodal embeddings to self-improving reasoning and faithful context adherence... we're thrilled to present a broad range of powerful new research at #ICLR2025! 🎉 Bookmark our accepted papers below, and we'll see you in Singapore, @iclr_conf!
🔖 REGENESIS: LLMs can grow into reasoning generalists via self improvement 👉 arxiv.org/html/2410.02108v1 🧠 Becky Xiangyu Peng, Congying Xia, Xinyi Yang, Caiming Xiong, Jason Wu, Chen Xing
🔖 SiReRAG: Indexing Similar and Related Information for Multihop Reasoning 👉 arxiv.org/abs/2412.06206 🧠 Nan Zhang, Prafulla Choubey, Alexander Fabbri, Gabriel Bernadett-Shapiro, Jason Wu
🔖 FaithEval: Can Your Language Model Stay Faithful to Context, Even If “The Moon is Made of Marshmallows” 👉 arxiv.org/abs/2410.03727 🧠 Yifei Ming, Senthil Purushwalkam, Shrey Pandit, Zixuan Ke, Xuan Phi Nguyen, Caiming Xiong, Shafiq Joty
🔖 Preference Optimization for Reasoning with Pseudo Feedback 👉 arxiv.org/abs/2411.16345 🧠 Fangkai Jiao, Geyang Guo, Xingxing Zhang, Nancy F. Chen, Shafiq Joty, Furu Wei
🔖 ThinK: Thinner Key Cache by Query-Driven Pruning 👉 arxiv.org/abs/2407.21018 🧠 Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo
🔖 Automatic Curriculum Expert Iteration for Reliable LLM Reasoning 👉 arxiv.org/abs/2410.07627 🧠 Zirui Zhao, Hanze Dong, Amrita Saha, Caiming Xiong, Doyen Sahoo
🔖 VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks 👉 arxiv.org/abs/2410.05160 🧠 Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, Wenhu Chen
🔖 Integrating Expertise of Software Engineering Agents 👉 arxiv.org/abs/2408.07060 🧠 Kexun Zhang, Weiran Yao, Zuxin Liu, Yihao Feng, Zhiwei Liu, Rithesh Murthy, Tian Lan, Lei Li, Renze Lou, Jiacheng Xu, Bo Pang, Yingbo Zhou, Shelby Heinecke, Silvio Savarese, Huan Wang, Caiming Xiong
Congrats to our researchers for the incredible body of work! #MachineLearning #AIResearch
Introducing COVID-19 Search, a new AI-powered search tool that equips scientists and researchers with the most relevant information about COVID-19. Learn more about this tool at sfdc.co/covid19search
💥 xLAM-7b beats #GPT-4 in function calling according to the Berkeley Function Calling Leaderboard, second only to #Claude 3.5-Sonnet. Our "Tiny Giant" models are ranking [2] and [26]. Check it out: bit.ly/3WIZdY3! #tinybutmighty #SLM (and congrats, team!)
⚡ NEW COMPUTER-USE AI RESEARCH ⚡ Introducing: 1️⃣ Our paper, Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis 2️⃣OSWORLD-G benchmark covering fine-grained manipulation and layout understanding 3️⃣JEDI dataset, our GUI grounding dataset series with 4M examples, 3B and 7B model variants 🔗Paper: bit.ly/45GvDHc 🧑‍💻Code & Sample Usage: bit.ly/454TWyA 💻Website: bit.ly/43lLRnN 🤗Dataset: bit.ly/45CYVGP Key contributions: → 564 expertly annotated samples across 5 capability dimensions → Multi-perspective task decomposition (icons, components, layouts) → SOTA performance: 91.7% on ScreenSpot-v2, 54.1% on OSWORLD-G → Direct impact: 5% → 27% success rate improvement on OSWorld The gap between "click this button" and true computer interaction is closing. Our models now handle everything from precise text cursor placement to complex software navigation across desktop environments. Open sourced for the research community in partnership with The University of Hong Kong @HKUniversity #ComputerUse #GUI #AgenticAI
🚨MODEL RELEASE! We're thrilled to announce our powerful, compact xGen-small model family, now available for the research community. 🤗Download xGen-Small model: bit.ly/3ZeEUCk Key highlights: ▶️ xGen-9B: highly competitive on long-context understanding up to 128K tokens ▶️ Exceptional math reasoning: 95.3% GSM8K, 91.6% MATH, 50.0% AIME 2024 ▶️ Superior code generation: 50.6% on LiveCodeBench Our "small but long" approach proves strategic engineering beats brute-force scaling. ▶️ Full breakdown in our blog: sforce.co/3YrM5a2 ▶️ Technical report available here: bit.ly/3GUUnBp Advance your research today, and tell us what you think! #SLMs #EnterpriseAI
We introduce the Salesforce CausalAI Library, an open source library for causal analysis of time series and tabular data. GitHub: github.com/salesforce/causal… GitHub Documentation: opensource.salesforce.com/ca… Tech Report: arxiv.org/abs/2301.10859 Blog: blog.salesforceairesearch.co…
Marking our long-awaited return to Twitter with some big news; we're expanding to Singapore 🇸🇬! Excited to partner w/universities across the country in shaping the future of #AI, #NLP, #ML and beyond. salesforce.com/au/blog/2019/…
🚀 Introducing GIFT-Eval: 🎁The new gold standard in time series forecasting evaluation! 144K+ time series. 23 datasets. One benchmark to rule them all. Dive in: Paper 📄:bit.ly/4fIJSxe Blog: 🧠: sforce.co/3ALOcwE Github 🖥️:bit.ly/3YICI59 Dataset 💾:bit.ly/3ANNTBv Leaderboard 🏎️:bit.ly/3ACSshZ Our comprehensive GIFT-Eval tests models across ALL domains, frequencies & prediction lengths - from zero-shot to full-shot scenarios. Help us advance innovation in AI time series research! #TimeSeries #Forecasting #DataScience ...and dive into our researcher's technical thread here: 🧵👇
Using both natural and artificial abilities, the human relationship with tools has drastically evolved. The best tools are powerful because they’re easy to use. This is where our skill of language and AI meet. Learn more on how conversation can power AI > blog.salesforceairesearch.co…
🧠 Breaking Research! 🧠 Solving the LLM "Goldilocks Problem" Introducing Auto-CEI: A breakthrough training method that helps LLMs find the “sweet spot” between overconfident (plausible but incorrect) hallucinations and overcautious (“I don’t know”) refusals. 🔗 Full paper: ar5iv.org/abs/2410.07627 🧵 Research review: 👇 #LLMResearch #TrustedAI
🆕Excited to announce SWERank, our code ranking framework for software issue localization. ➡️Paper: bit.ly/3S0x1fV ➡️GitHub Project Page: bit.ly/42SESm3 ➡️AI-Generated Podcast: bit.ly/3GMF51H ➡️Code, Data and Models: Coming soon! (1/3) 🧵 Pinpointing the exact location of a software issue in code is a critical but often time-consuming part of software development. Current agentic approaches to localization can be slow and expensive, relying on complex steps and often closed-source models. We introduce SWERank, a retrieve-and-rerank framework that comprises SWERankEmbed, a bi-encoder code retriever, and SWERankLLM, a listwise LLM code reranker. SWERank is significantly more cost-effective and considerably more performant than other agent-based approaches. Our 7B SWERankEmbed retriever even outperforms LocAgent running with Claude-3.5! #TrustedAI
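The two-stage shape of a retrieve-and-rerank pipeline can be sketched as below. Everything here is a stand-in: word overlap replaces the bi-encoder's embedding similarity, Jaccard similarity replaces the listwise LLM reranker, and the function names and sample data are hypothetical.

```python
def tokenize(text):
    """Lowercased word set; a crude substitute for a learned embedding."""
    return set(text.lower().replace("_", " ").split())

def retrieve(issue, functions, top_k):
    """Stage 1: cheap score over every candidate, keep the top_k."""
    query = tokenize(issue)
    ranked = sorted(functions,
                    key=lambda name: len(query & tokenize(functions[name])),
                    reverse=True)
    return ranked[:top_k]

def rerank(issue, candidates, functions):
    """Stage 2: costlier score, applied to the shortlist only."""
    query = tokenize(issue)

    def jaccard(name):
        doc = tokenize(functions[name])
        return len(query & doc) / len(query | doc)

    return sorted(candidates, key=jaccard, reverse=True)

# Hypothetical code base: function name -> short description of its body.
functions = {
    "parse_date": "parse date string into datetime object with timezone "
                  "handling and fallback formats",
    "format_date": "format date",
    "send_email": "send email to user",
}
issue = "date parse fails on timezone"

shortlist = retrieve(issue, functions, top_k=2)
ranked = rerank(issue, shortlist, functions)
```

The expensive scorer never sees `send_email`, which is the cost argument for retrieving first and reranking second.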
Editing an image using AI but want to keep the details? Check out our work EDICT (🎆CVPR 2023🎆): Gradio Demo: huggingface.co/spaces/Salesf… Code: huggingface.co/spaces/Salesf… Arxiv: arxiv.org/abs/2211.12446 Authors: @bram_wallace @nikhil_ai
Do you want to make your dog look like a golden retriever? Or get a picture of a cat surfing? Researchers at Salesforce recently developed a new editing algorithm called EDICT - here's a thread on the results and details 🧵
Our xLAM (#LargeActionModels) family just got an upgrade! 1️⃣ Multi-turn, natural conversation support 2️⃣ Smarter multi-step reasoning 3️⃣ Models from 1B to 70B for ultimate flexibility 🤗 HuggingFace: bit.ly/4jyj2tu 👑 BFCL Leaderboard: bit.ly/3WIZdY3 Our research models xLAM-70B-r and xLAM-32B-r rank #1 and #2 on the BFCL function-calling leaderboard, beating GPT-4o, Gemini, Qwen & more. xLAM-8B-r lands at #4, ahead of GPT-4o. And our Tiny Giant, xLAM-1B-r, plus xLAM-3B-r, outperform much larger models like Mistral-Large and DeepSeek-V3. This is just the beginning: we're building even stronger xLAM models internally to inspire future Salesforce innovation. Stay tuned!
🎉 ✍️ Our research on advancing AI-generated writing accepted to #CHI2025! ✍️ 🎉 Our paper reveals how expert edits fix AI text issues—from clichés to purple prose— creating better data for Reinforcement Learning from Human Feedback (RLHF) alignment. Thanks @acm_chi, we'll see you in Yokohama! 🇯🇵 Check it out! arxiv.org/pdf/2409.14509 #RLHFdata #AIforWriting
Our paper "Can AI writing be salvaged? Mitigating Idiosyncrasies and Improving Human-AI Alignment in the Writing Process through Edits" has been awarded a Best Paper Honorable Mention and is in the Top 5% of submissions for #CHI2025! 🎉 Check it out here: arxiv.org/pdf/2409.14509 #AI #Research #AIWriting @jasonwu0731 @TuhinChakr @PhilippeLaban
1
5
50
9,404
❓Beyond “right” or “wrong”: Introducing a novel RAG evaluation framework based on sub-question coverage. How do we measure if RAG systems are giving complete answers to complex questions? Enter: “Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage” #AccurateAI 📎Paper: arxiv.org/abs/2410.15531 🧵starts here 👇 1) We propose decomposing questions into sub-questions and classifying them into three types—core, background, and follow-up—to reflect their roles and importance. 💠 Core sub-questions are central to addressing the main query. 💠 Background sub-questions provide necessary context or supplementary information. 💠 Follow-up sub-questions may emerge from the original question but are optional to formulate a satisfactory answer.
1
9
48
7,417
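As a rough illustration of the sub-question coverage idea in the post above, the sketch below computes a per-type coverage rate. It assumes each sub-question has already been labeled with its type and with whether the answer covers it; how those labels are produced is outside this sketch, and the function name is hypothetical rather than taken from the paper.

```python
from collections import defaultdict


def coverage_by_type(sub_questions):
    """sub_questions: list of (kind, covered) pairs, where kind is one of
    "core", "background", or "follow-up" and covered is a bool.
    Returns the fraction of covered sub-questions for each kind present."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for kind, covered in sub_questions:
        totals[kind] += 1
        hits[kind] += int(covered)
    return {kind: hits[kind] / totals[kind] for kind in totals}


labels = [
    ("core", True),
    ("core", False),
    ("background", True),
    ("follow-up", False),
]
scores = coverage_by_type(labels)
# scores == {"core": 0.5, "background": 1.0, "follow-up": 0.0}
```

Reporting the three rates separately keeps a response that answers only optional follow-ups from looking as complete as one that covers the core sub-questions.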
Meet CodeT5 - the first code-aware encoder-decoder pre-trained model that achieves SoTA on 14 sub-tasks in CodeXGLUE! Learn how it’s disrupting software development. Blog: blog.einstein.ai/codet5/ Paper: arxiv.org/abs/2109.00859 GitHub: github.com/salesforce/CodeT5 #codeintelligence
3
20
47
Introducing APIGen-MT: Our agentic pipeline for multi-turn synthetic data generation that produces high-quality training data for tuning AI agents! Try our open-sourced dataset today! 📊 Paper: bit.ly/44tORzx 🤗 Dataset: bit.ly/3GHuQM5 We used APIGen-MT to train our xLAM-2 model family, including xLAM-2-70b-fc-r — still #1 on the BFCL leaderboard with 78.2% accuracy, outperforming frontier models like GPT-4o and Claude 3.5 in function-calling tasks —especially in challenging multi-turn scenarios. 🤝 We're open-sourcing 5K high-quality trajectories and trained models to advance AI agent research. 🧠 xLAM Model Family: bit.ly/4jyj2tu 🔍 BFCL: bit.ly/3WIZdY3
9
41
10,660
Exciting news! 🎊 Our models, xLAM-7B-fc and xLAM-1B-fc, are ranked #3 and #25 on the Berkeley Function Calling leaderboard. Notably, they are the smallest models on the leaderboard!🚀📊 #AI #AIModels #AIresearch Check out our suite of small agentic models, including mobile-ready, quantized versions. 🤗 @huggingface: bit.ly/4faoYaQ
1
7
41
2,778
We built SFR-Embedding-Code to bridge a critical gap: While text retrieval has advanced rapidly, code retrieval needed specialized attention. Our open-source models achieve SOTA results by learning from diverse code and text tasks and supporting 12 programming languages. See why SFR-Embedding is the Top-1 model on the CoIR Leaderboard! 🥇 #CodeRetrieval #AIforDevelopers 📖 Read more in our latest blog: sforce.co/40V3Eks For the models and more: 🤗400M Model: bit.ly/4jhDRdp 🤗2B Model: bit.ly/3PCqxmp 🏆CoIR Leaderboard: bit.ly/3CkgRKj 📄Technical Report: bit.ly/4gSZteu
1
8
37
4,651
🔬Advanced agent systems, RAG evaluation, instruction-following and more. Our team's accepted papers at #NAACL2025 span from professional CRM research to parallel in-context learning. 🎉A huge congrats to our researchers and thanks to @naacl — we're excited to share and discuss with the community this spring! 💫 👇📑Bookmark and explore the research below! 📑👇 📎CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments: ➡️arxiv.org/abs/2411.02305 👏Steeve Huang, Akshara Prabhakar, Sidharth Dhawan, Yixin Mao, Huan Wang, Silvio Savarese, Caiming Xiong, Philippe Laban, Chien-Sheng (Jason) Wu 📎Evaluating Cultural and Social Awareness in LLM Agents ➡️arxiv.org/abs/2410.23252 👏Haoyi Qiu, Alexander R. Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng (Jason) Wu 📎Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage ➡️arxiv.org/abs/2410.15531 👏Kaige Xie, Philippe Laban, Prafulla Kumar Choubey, Caiming Xiong, Chien-Sheng (Jason) Wu 📎Measuring Progress in Evaluating Instruction Following with Large Language Models ➡️arxiv.org/abs/2410.07069 👏Yixin Liu, Kejian Shi, Alex Fabbri, Yilun Zhao, Peifeng Wang, Chien-Sheng (Jason) Wu, Shafiq Rayhan Joty, Arman Cohan 📎CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models ➡️arxiv.org/abs/2411.04329 👏Jierui Li, Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, Doyen Sahoo 📎On Positional Bias of Faithfulness for Long-form Summarization ➡️arxiv.org/abs/2410.23609 👏David Wan, Jesse Vig, Mohit Bansal, Shafiq Joty 📎LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs ➡️arxiv.org/abs/2408.08656 👏Do Xuan Long, Hai Nguyen Ngoc, Tiviatis Sim, Hieu Dao, Shafiq Joty, Kenji Kawaguchi, Nancy F. Chen, Min-Yen Kan 📎ParaICL: Towards Parallel In-Context Learning ➡️arxiv.org/abs/2404.00570 👏Li Xingxuan, Xuan Phi Nguyen, Shafiq Joty, Lidong Bing 📎xLAM: A Family of Large Action Models to Empower AI Agent Systems ➡️Paper: arxiv.org/abs/2409.03215 ✍️Blog: salesforce.com/blog/ai-agent… 🧠Models: bit.ly/4faoYaQ 👏Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, Zhiwei Liu, Yihao Feng, Tulika Awalgaonkar, Rithesh Murthy, Zeyuan Chen, Ran Xu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong 📎Tutorial: Adaptation of Large Language Models 🖥️Website coming soon! 👏Zixuan Ke, Yifei Ming, Shafiq Rayhan Joty Congrats again to our talented team!
6
40
4,252
Check out our #ICLR2024 Accepted Papers. Congratulations to all of our authors!
5
39
5,096
🔄 PerfCodeGen: When LLMs learn from their own code execution. Our training-free framework outperforms human solutions in up to 67% of coding tasks by doing what great developers do - test, analyze, refine, repeat. 📊 Paper: bit.ly/4jmH5wb 🧑‍💻 Code: bit.ly/4akP20J 📰 MarkTechPost: bit.ly/4jEVGDp 🧵 Researcher's walk-through👇 #EfficientAI #CodeGeneration
2
6
38
2,983
📣 Meet: "From AI-Slop to AI-Polish," tackling the elephant in the room 🐘 —AI writing quality is "mid" at best. Despite LLMs crushing coding, their creative writing feels pedestrian. We introduce: 1️⃣ Writing Quality Benchmark (WQ): First comprehensive testbed for writing quality assessment 2️⃣ Writing Quality Reward Models (WQRM): Outperforming GPT-4o & Claude with 74% accuracy on WQ 3️⃣ Test-time compute strategies yielding text preferred by experts 66% of the time 🖇️ Paper: bit.ly/3Gs6LbR Time to raise the bar on AI-generated text beyond "coherent but clichéd." #AIResearch #NLP #WritingQuality
1
16
38
3,199
🌳🌳🌳 Take a closer look at CodeTree! 🌳🌳🌳 1/6 Dive deep into our new framework for code generation with large language models (LLMs), combining multi-agent collaboration with an efficient tree search strategy. Code: bit.ly/3Vo0AKw Paper: bit.ly/3Vo0Au0 Technical thread :👇
1
8
37
3,998
🎉Just Announced: "ViUniT: Visual Unit Tests for More Robust Visual Programming" has been accepted at #CVPR2025! Paper Link: arxiv.org/pdf/2412.08859 Project Page: artemisp.github.io/viunit/ Researcher’s walk-through 👇 In collaboration with @UPenn, we introduce ViUniT, a framework that enhances the reliability of visual programs by automatically generating unit tests by leveraging #LLMs and #DiffusionModels. Our approach: 📊 Boosts model performance by 11.4% and outperforms gpt-4o-mini by 7.7%. 🔄 Reduces “right-for-wrong-reasons” errors by 40%. 💡 Introduces innovative applications like best program selection, answer refusal, and unsupervised reward design. Dive into the 🧵👇 (1/5) Method Overview ViUniT leverages the power of language models and diffusion models to create robust visual unit tests: 📝 Unit Test Generation: Language models generate image descriptions and expected answers for visual queries. 🎨 Image Creation: Diffusion models then synthesize corresponding images, ensuring logical coverage and diversity. ✅ Logical Verification: Visual programs are tested for logical correctness against these generated unit tests, catching errors beyond output accuracy.
2
4
37
9,276
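One way to picture the "best program selection" use of visual unit tests mentioned in the ViUniT post above is a simple pass-rate comparison. This is a sketch under loud assumptions: images are replaced by plain dictionaries, candidate programs are ordinary Python callables, and `pass_rate` and `select_best` are hypothetical names, not the ViUniT API.

```python
def pass_rate(program, unit_tests):
    """Fraction of (image, expected_answer) tests the program answers correctly."""
    passed = sum(1 for image, expected in unit_tests if program(image) == expected)
    return passed / len(unit_tests)


def select_best(programs, unit_tests):
    """Pick the candidate program name with the highest unit-test pass rate."""
    return max(programs, key=lambda name: pass_rate(programs[name], unit_tests))


# Toy "images": dictionaries standing in for generated test images.
unit_tests = [
    ({"dogs": 2}, 2),
    ({"dogs": 0}, 0),
    ({"dogs": 3}, 3),
]

programs = {
    # Right on one test for the wrong reason: it ignores the image entirely.
    "always_two": lambda image: 2,
    # Actually inspects the (toy) image content.
    "count_dogs": lambda image: image["dogs"],
}

best = select_best(programs, unit_tests)
```

A single test image with two dogs would not separate these programs; the extra tests are what expose the one that is right for the wrong reason.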
Trying out a snazzy new LLM tonight? Take a break, download and try our Tiny Giant xLAM-1B and xLAM-7B, now on @huggingface. Your agentic AI workflows will thank you! bit.ly/4faoYaQ #tinybutmighty
12
37
10,779
We're thrilled to announce BLIP3-o, a breakthrough in unified multimodal models that excels at both image understanding and generation in a single autoregressive architecture! 💫 📊 Paper: bit.ly/3Saybpo 🤗 Models: bit.ly/4jhFaYM 🧠 Code: bit.ly/43id1uB 📽️ Learn on the go (AI Generated): bit.ly/3EWDZQp Our research reveals that using CLIP features with diffusion transformer and flow matching creates superior performance while reducing computational complexity. Most importantly, we're making this model family available to the AI Research community: ▶️ Complete model implementations ▶️ Model weights ▶️ 25M+ detailed caption pretrain dataset ▶️ 60K high-quality instruction tuning dataset Advance your multimodal AI research and share your findings in the comments. (And thanks for the shout, @_akhaliq!)
1
9
37
4,578
Discover CTRLsum, a generic summarization framework that enables users to control the content of the generated summaries along multiple dimensions. Blog: blog.einstein.ai/ctrlsum/ Paper: arxiv.org/abs/2012.04281 Code: github.com/salesforce/ctrl-s… #NLP #summarization
2
5
35
Read our blog on #ACL2022. Congrats to all our authors for their accepted papers! blog.salesforceairesearch.co…
1
5
36
📣 Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels 📣 RL for LLMs faces a critical data bottleneck: existing RL datasets are <10B tokens while pretraining uses >1T tokens. Our Webscale-RL pipeline solves this by automatically converting pretraining documents into 1.2M verifiable QA pairs across 9+ domains. 📄 Paper: bit.ly/3IFuMhf 💻 Code: bit.ly/42AVpdX 📊 Dataset: bit.ly/4h5lVBS Results: 100× more token-efficient than continual pretraining with significant performance gains on MMLU-pro, BigBench, and mathematical reasoning benchmarks 📈 Work by Zhepeng Cen (@zhepengcen), Haolin Chen (@HaolinChen11), Shiyu Wang (@shiyu04490786), Zuxin Liu (@LiuZuxin), Zhiwei Liu, Ding Zhao, Silvio Savarese, Caiming Xiong (@CaimingXiong), Huan Wang (@huan__wang), Weiran Yao (@iscreamnearby) #FutureOfAI #EnterpriseAI #ReinforcementLearning #MachineLearning
1
7
36
16,612
🏆 #ICML2025 Best Paper Award: AI Safety Should Prioritize the Future of Work 📄 Paper: arxiv.org/abs/2504.13959 🎉 Congratulations to Sanchaita Hazra @hsanchaita, Bodhisattwa Prasad Majumder @mbodhisattwa, and Tuhin Chakrabarty @TuhinChakr for winning the Outstanding Award — one of 15 top papers out of 3,260 accepted submissions! Key insights: 🔸 Comprehensive worker transition support needed 🔸 AI exacerbates income inequality through labor disruption 🔸 International copyright reforms & collective licensing required 🔸 Pro-worker AI governance for shared prosperity @icmlconf #AIethics #FutureOfWork #AIgovernance #FutureOfAI #EnterpriseAI
1
7
36
1,837
Can #AI language models learn from evolution to design proteins? Learn how Salesforce is taking a step towards enabling solutions to cure disease and clean our planet. Blog: blog.einstein.ai/learning-fr… Paper: biorxiv.org/content/10.1101/…
2
6
35
Our blog for Diffusion-DPO is now live!🚀 In this project we brought the benefits of Reinforcement Learning from Human Feedback (RLHF) to text-to-image diffusion models at scale for the first time. blog.salesforceairesearch.co…
2
9
34
3,610
Meet BLIP: Bootstrapping Language-Image Pre-training for unified Vision-Language understanding/generation. New model architecture + Dataset bootstrapping = SoTA results on a wider range of V+L tasks than other models! blog.salesforceairesearch.co…
7
34
🔬We’re so excited about TACO, our new open-source multimodal model family that excels at complex visual reasoning tasks requiring multiple steps and external tools! 📊 The results speak for themselves: 🌮30-50% accuracy boost vs. few-shot CoTA prompting 🌮Up to 20% improvement on MMVet benchmark 🌮Consistent outperformance across 8 benchmarks …but check out our new blog that brings it to life! salesforce.com/blog/taco-mul… …and a great write-up by @Marktechpost 👊bit.ly/3E0twTa Ready to get started? 📄 Paper: bit.ly/3PufThl 💻 Code: bit.ly/3Pw8azw 📱 Demo: bit.ly/3PwrEE2 🤖 Models: bit.ly/4j2ZG0h 📚 Datasets: bit.ly/3Pxtzbv 🧵 Research thread ⤵️
🌮 Introducing 🌮 TACO - our new family of multimodal action models that combine reasoning with real-world actions to solve complex visual tasks! 📊Results: 20% gains on MMVet 3.9% average improvement across 8 benchmarks 1M+ synthetic CoTA traces in training 🔓 🔓🔓Fully open-sourced! 🔓🔓🔓 Get started with: 📄 Paper: bit.ly/3PufThl 💻 Code: bit.ly/3Pw8azw 📱 Demo: bit.ly/3PwrEE2 🤖 Models: bit.ly/4j2ZG0h 📚 Datasets: bit.ly/3Pxtzbv 🧵 ...and our Technical deep-dive starts here ⤵️ (1/4) How does TACO work? 🤔 ⛓️TACO answers complex questions by generating Chains-of-Thought-and-Action (CoTA), executing intermediate actions with external tools such as OCR, calculator, and depth estimation, then integrating both the thoughts and action outputs to produce final responses. We generate the synthetic CoTA data with two approaches: model-based generation (top) and programmatic generation (bottom).
11
33
9,934
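The Chains-of-Thought-and-Action loop described in the TACO thread above (think, call a tool, fold the observation back in, then answer) can be sketched as follows. The policy, the tool registry, and the `"answer"` sentinel are all hypothetical stand-ins chosen for illustration; in a real system the policy would be the multimodal model itself.

```python
def run_cota(policy, tools, max_steps=5):
    """Run a thought-and-action loop.

    policy(history) returns (thought, action, argument). The action "answer"
    (an assumed sentinel) ends the loop with argument as the final response;
    any other action names a tool to call with argument."""
    history = []
    for _ in range(max_steps):
        thought, action, argument = policy(history)
        if action == "answer":
            return argument, history
        observation = tools[action](argument)
        history.append((thought, action, observation))
    return None, history  # step budget exhausted without an answer


# Toy tools standing in for OCR and a calculator.
tools = {
    "ocr": lambda image_path: "12 + 30",
    "calculator": lambda expr: sum(int(part) for part in expr.split("+")),
}


def toy_policy(history):
    """Scripted policy: read the image, add the numbers, then answer."""
    if not history:
        return ("Read the numbers in the image.", "ocr", "receipt.png")
    if len(history) == 1:
        return ("Add them up.", "calculator", history[0][2])
    return ("I have the total.", "answer", history[1][2])


answer, trace = run_cota(toy_policy, tools)
# answer == 42, and trace holds the two tool calls that produced it
```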
⚡ The era of AI agents that just chat is over. @Salesforce just introduced GTA1 - Computer Use Agents that actually CLICK, SCROLL, and WORK in your enterprise software like a human would. 👉 salesforce.com/blog/computer… 🎯 The results are game-changing: ➡️ 50.1% success on enterprise UIs ➡️ Outperforms models 10x larger ➡️ Beats OpenAI's CUA in half the steps ➡️ Built with enterprise trust & security No more "sorry, I can't click that button" - these agents navigate CRMs, update records, and complete real workflows. The future of work isn't just AI that thinks. It's AI that ACTS. #EnterpriseAI #FutureOfAI
1
4
35
2,527
We have 8 accepted papers at NAACL this year! Congratulations to all of our authors on their work!
4
33
2,131
Want to build bots better? Try Converse: a new Task-Oriented Dialogue System that simplifies chatbot building while handling complex tasks and conversations. #NLP #AI Code: github.com/salesforce/conver… Paper: arxiv.org/abs/2203.12187 Blog: blog.salesforceairesearch.co…
1
11
32
🏆 Introducing MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision 🏆 💻 Project Page: bit.ly/43DLns8 📄 Paper: bit.ly/3Ftbaeo 🔗 Code: bit.ly/43wJoFR 📚 Explore 1,000+ Discovered MAS designs: bit.ly/43AoaqH 🧵 Technical walk-through 👇 (1/6) Multi-Agent Systems (MAS) can outperform single-agent approaches; however, designing MAS manually is difficult, especially when LLM preferences differ from human intuition, and manually designed MAS are hard to adapt to new tasks. ❓Can we automate MAS design—even better, can we make it self-evolving without relying on a validation set? Meet MAS-Zero: a meta-level, inference-time, self-evolving framework for automatic MAS design. 🔥 Consistent outperformance of existing automatic MAS methods ⚡ Beats popular manually designed MAS baselines where other methods fail 🧠 Sets a new performance-cost frontier across multiple domains
2
12
34
18,582
Introducing the full xLAM family, our groundbreaking suite of Large Action Models! 🚀 From the 'Tiny Giant' to industrial powerhouses, xLAM is revolutionizing AI efficiency! #AIResearch #AIEfficiency 🤗 Hugging Face Collection: bit.ly/4faoYaQ 🤩 Research Blog bit.ly/3MxliCZ 🗞️ Press Release: sforce.co/3XzaOt9 Meet the family: • xLAM-1B / TINY: Our 1B parameter marvel, ideal for on-device AI. Outperforms larger models despite its compact size • xLAM-7B / SMALL: Perfect for swift academic exploration with limited GPU resources. • xLAM-8x7B / MEDIUM: Mixture-of-experts model balancing latency, resources, and performance for industrial applications. • xLAM-8x22B / LARGE: Our large-scale model for optimal performance in high-resource environments. 🎉 Huge congrats to the team of AI scientists who brought xLAM series to life! Zuxin Liu @LiuZuxin Shirley Kokane @KokaneShirley Ming Zhu @ming_zhu0527 Tian Lan @TLan001 Jianguo Zhang @JianguoZhang3 Thai Hoang @TeeH912. Caiming Xiong @CaimingXiong Silvio Savarese @silviocinguetta
3
15
31
15,235
🚀Just dropped: Reward-Guided Speculative Decoding (RSD) - our breakthrough approach that makes LLM inference up to 4.4× faster while IMPROVING accuracy. 📄Paper: arxiv.org/abs/2501.19324 💻Code: bit.ly/4gJ7Ect 👇 Key innovations in RSD: 👇 1⃣ Biased Acceleration - Unlike traditional speculative decoding methods that enforce unbiasedness, RSD incorporates a controlled bias to prioritize high-reward outputs. 2⃣ Dynamic Quality Control - Process Reward Model (PRM) acts as real-time quality gate, only engaging costly target model when needed 3⃣ Proven Optimality - Mathematically derived threshold strategy ensures optimal efficiency-performance balance 4⃣ Efficient Architecture - Demonstrated ability to merge PRM with main models, further reducing computational overhead ⚡Results: Up to 4.4× faster compared to serving the target model alone. Up to +3.5 average accuracy improvement over parallel decoding baselines—even outperforming the target model on average! Research code above with detailed experimental configurations. Join us! #EfficientAI #AccurateAI
6
31
4,134
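The "quality gate" idea in the RSD post above, where a reward model decides when the costly target model is needed, can be sketched as a simple threshold rule. This is a simplified illustration, not the paper's algorithm: the fixed step count, the callable signatures, and the constant threshold are all assumptions made for the example.

```python
def rsd_generate(draft_step, target_step, reward, threshold, max_steps):
    """Generate max_steps reasoning steps.

    Each step is first proposed by the cheap draft model. If the reward model
    scores the proposal below threshold, the step is regenerated by the
    expensive target model instead. Returns the steps and the number of
    target-model calls."""
    steps = []
    target_calls = 0
    for _ in range(max_steps):
        candidate = draft_step(steps)
        if reward(steps, candidate) < threshold:
            candidate = target_step(steps)
            target_calls += 1
        steps.append(candidate)
    return steps, target_calls


# Toy stand-ins: the draft is judged "good" on even positions only.
draft = lambda steps: f"d{len(steps)}"
target = lambda steps: f"t{len(steps)}"
prm = lambda steps, candidate: 0.9 if len(steps) % 2 == 0 else 0.2

steps, target_calls = rsd_generate(draft, target, prm, threshold=0.5, max_steps=4)
# steps == ["d0", "t1", "d2", "t3"], target_calls == 2
```

Raising the threshold sends more steps to the target model (slower, closer to target-only quality); lowering it accepts more draft steps.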
🚨We’re proud to announce our #ACL2025NLP-accepted papers. Preview and bookmark the research below, and we’ll look forward to seeing you in Vienna. Thanks, @aclmeeting! 👉 Turning Conversations into Workflows: A Framework to Extract and Evaluate Dialog Workflows for Service AI Agents 🖇️arxiv.org/abs/2502.17321 👉 Unanswerability Evaluation for Retrieval Augmented Generation 🖇️arxiv.org/abs/2412.12300 👉 Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding 🖇️arxiv.org/abs/2502.11492 👉 Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings 🖇️arxiv.org/abs/2503.15620 👉 What Makes a Good Natural Language Prompt? Paper coming soon! 👉 Can We Further Elicit Reasoning in LLMs? Critic-Guided Planning with Retrieval-Augmentation for Solving Challenging Tasks 🖇️arxiv.org/abs/2410.01428 👉 Beyond Output Matching: Bidirectional Alignment for Enhanced In-Context Learning Paper coming soon! 👉 Learning Auxiliary Tasks Improves Reference-Free Hallucination Detection in Open-Domain Long-Form Generation Paper coming soon! 👉 ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering 🖇️arxiv.org/abs/2504.05506 👉 Beyond In-Context Learning: Aligning Long-form Generation of Large Language Models via Task-Inherent Attribute Guidelines Paper coming soon! 👉 Relevant or Random: Can LLMs Truly Perform Analogical Reasoning? 🖇️arxiv.org/abs/2404.12728 👉 Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? 🖇️arxiv.org/abs/2505.08468 👉 Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models 🖇️arxiv.org/abs/2403.20331 👉 LAM Simulator: Advancing Data Generation for Large Action Models Trainings via Online Exploration and Feedback Simulation Paper coming soon!
1
5
32
3,462
Congrats to our ACL 2021 Accepted Paper Authors @CaimingXiong @JotyShafiq @baxterkb @jasonwu0731 @owenhaoliu @Wenpeng_Yin @huan__wang and all of our amazing collaborators!
4
30
ETSformer is a time-series forecasting model that combines the classical intuition of seasonal-trend decomposition and exponential smoothing with the Transformer framework, introducing novel exponential smoothing and frequency attention mechanisms. blog.salesforceairesearch.co…
3
30
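For readers unfamiliar with the classical technique that the ETSformer post above builds on, here is simple exponential smoothing in a few lines. This shows only the textbook recurrence s_t = alpha * x_t + (1 - alpha) * s_(t-1); it is not ETSformer's attention mechanism, and the function name is chosen for illustration.

```python
def exponential_smoothing(series, alpha):
    """Classical simple exponential smoothing.

    alpha in (0, 1] controls how quickly older observations are forgotten:
    higher alpha weights recent points more heavily."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        return []
    smoothed = [series[0]]  # initialise with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


level = exponential_smoothing([10, 20, 30], alpha=0.5)
# level == [10, 15.0, 22.5]
```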
In Loving Memory of Dragomir Radev. You will be missed. ♥️ @dragomir_radev blog.salesforceairesearch.co…
2
30
3,811
🔬 NEW BLOG DROP! Our complete technical breakdown on small language models is now available on our research blog: Read here: sforce.co/4jWlLhb 🔍 Discover our research on enterprise-ready AI that delivers powerful performance without the bloat 👀 See breakthrough results on long-context understanding at 128K tokens. Math prowess revealed: 95% on GSM8K, 92.5% on MATH, 46.7% on AIME 2024! 💻 Code mastery: 41.1% on LiveCodeBench, best-in-class performance! 💪 Our "small but long" approach proves deliberate engineering beats brute-force scaling—offering predictable costs, enhanced privacy, and reduced environmental impact. Dive into the full research today: sforce.co/4jWlLhb #SmallButLong #EnterpriseAI
4
6
30
9,045
Another amazing #EMNLP2024 comes to a close, but we at @SFReseach #NeverStopLearning. Missed us in Miami? Bookmark, save and explore the research below. Thanks @emnlpmeeting -- what an incredible week! #Salesforce #AIResearch #NLP ----- Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems 🔖 Paper: bit.ly/48QI5E7 Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing 🔖 Paper: bit.ly/48UiREM DataNarrative: Automated Data-Driven Storytelling with Visualizations and Texts 🔖 Paper: bit.ly/4hQgVkM FOLIO: Natural Language Reasoning with First-Order Logic 🔖 Paper: bit.ly/3Ctw2Az Evaluating Psychological Safety of Large Language Models 🔖 Paper: bit.ly/48SYoQB Traffic Light or Light Traffic? Investigating Phrasal Semantics in Large Language Models 🔖 Paper: bit.ly/48W84tB 💻 Code: bit.ly/490V3PK P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains 🔖 Paper: bit.ly/3CAiyDa 💻 Code: bit.ly/48Sb4Ya 🤗 Training Data: bit.ly/4i36Px8 Open-RAG: Enhanced Retrieval Augmented Reasoning with Open-Source Large Language Models 🔖 Paper: bit.ly/48XnuOy Unlocking Anticipatory Text Generation: A Constrained Approach for Large Language Models Decoding 🔖 Paper: bit.ly/4frX4qD A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations 🔖 Paper: bit.ly/4hOIbAb Industry track: Prompt Leakage effect and defense strategies for multi-turn LLM interactions 🔖 Paper: bit.ly/4fxBHEc
2
2
29
2,534
🏆 🏆 🏆 Our groundbreaking research on prompt leakage in multi-turn LLM interactions is amongst the top-50% industry-track papers accepted to #EMNLP2024! We propose a novel threat model, uncover social engineering vulnerabilities, measure fine-grained leakage, and apply different mitigation techniques. Learn how to build more #SecureAI systems: arxiv.org/abs/2404.16251 #LLMSecurity #AISafety #TrustedAI
1
6
29
2,703
For time-series forecasting, deep learning doesn’t scale well to streaming data, and non-stationary data makes the problem harder. FSNet learns deep forecasting models on the fly and handles non-stationary data + concept drift. Learn more > blog.salesforceairesearch.co…
8
27