The fastest AI dev news on X — model releases, tools, and what they actually mean

What is trending in AI?
This AI Paper Introduces MAETok: A Masked Autoencoder-Based Tokenizer for Efficient Diffusion Models A research team from Carnegie Mellon University, The University of Hong Kong, Peking University, and AMD introduced a novel tokenizer, Masked Autoencoder Tokenizer (MAETok), to address these challenges. MAETok employs masked modeling within an autoencoder framework to develop a more structured latent space while ensuring high reconstruction fidelity. The researchers designed MAETok to leverage the principles of Masked Autoencoders (MAE), optimizing the balance between generation quality and computational efficiency. The methodology behind MAETok involves training an autoencoder with a Vision Transformer (ViT)-based architecture, incorporating both an encoder and a decoder. The encoder receives an input image divided into patches and processes them along with a set of learnable latent tokens. During training, a portion of the input tokens is randomly masked, forcing the model to infer the missing data from the remaining visible regions. This mechanism enhances the ability of the model to learn discriminative and semantically rich representations. Additionally, auxiliary shallow decoders predict the masked features, further refining the quality of the latent space. Unlike traditional VAEs, MAETok eliminates the need for variational constraints, simplifying training while improving efficiency...... Read the full article here: marktechpost.com/2025/02/08/… Paper: arxiv.org/abs/2502.03444 GitHub Page: github.com/Hhhhhhao/continuo…
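The random masking step described above can be sketched in a few lines. This is only an illustration of choosing which patch tokens to hide, not the MAETok code: the function name `mask_patches`, the 50% mask ratio, and the 16-patch image are assumptions made for the example.

```python
import random


def mask_patches(num_patches, mask_ratio, seed=0):
    """Split patch indices into (visible, masked) for one image.

    mask_ratio is a hypothetical setting; the post does not state the value used.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    # Sample without replacement so each patch is masked at most once.
    masked = sorted(rng.sample(range(num_patches), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_patches) if i not in masked_set]
    return visible, masked


visible, masked = mask_patches(num_patches=16, mask_ratio=0.5)
print("visible:", visible)
print("masked:", masked)
```

In a real training loop the encoder would see only the visible patches plus the learnable latent tokens, and the auxiliary decoders would be trained to predict features at the masked positions.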
ScrapeGraphAI: A Web Scraping Python Library that Uses LLMs to Create Scraping Pipelines for Websites, Documents, and XML Files Quick read: marktechpost.com/2024/04/30/… Github: github.com/VinciGit00/Scrape… Colab Notebook: colab.research.google.com/dr… @langchain #artificalintelligence
Researchers from MIT, Sakana AI, OpenAI and Swiss AI Lab IDSIA Propose a New Algorithm Called Automated Search for Artificial Life (ASAL) to Automate the Discovery of Artificial Life Using Vision-Language Foundation Models This innovative algorithm leverages vision-language foundation models (FMs) to automate the discovery of artificial lifeforms. Rather than designing every rule manually, researchers can define the simulation space, and ASAL explores it autonomously. ASAL integrates vision-language FMs, such as CLIP, to align visual outputs with textual prompts, enabling the evaluation of simulations in a human-like representation space. Simply describe the space of simulations to search over, and ASAL will automatically discover the most interesting and open-ended artificial lifeforms! Because of the generality of foundation models, ASAL can discover new lifeforms across a diverse range of seminal ALife simulations, including Boids, Particle Life, Game of Life, Lenia, and Neural Cellular Automata. ASAL even discovered novel cellular automata rules that are more open-ended and expressive than the original Conway’s Game of Life....... Read the full article here: marktechpost.com/2024/12/29/… Paper: arxiv.org/abs/2412.17799 GitHub Page: github.com/SakanaAI/asal/ Project Page: pub.sakana.ai/asal/ @SakanaAILabs
LAMBDA: A New Open-Source, Code-Free Multi-Agent Data Analysis System to Bridge the Gap Between Domain Experts and Advanced AI Models A team of researchers from Hong Kong Polytechnic University has introduced LAMBDA, a new open-source and code-free multi-agent data analysis system developed to overcome the lack of effective communication between domain experts and advanced AI models. LAMBDA provides an essential medium that allows smooth interaction between domain knowledge and AI capabilities in data science. This method solves numerous problems like removing coding barriers, integrating human intelligence with AI, and reshaping data science education, promising reliability and portability. Reliability means LAMBDA can address the tasks of data analysis stably and correctly. Portability means it is compatible with various LLMs, allowing it to be enhanced by the latest state-of-the-art models. Full read: marktechpost.com/2024/07/28/… Paper: arxiv.org/abs/2407.17535 Project: polyu.edu.hk/ama/cmfai/lambd…
Sea AI Lab Researchers Introduce Dr. GRPO: A Bias-Free Reinforcement Learning Method that Enhances Math Reasoning Accuracy in Large Language Models Without Inflating Responses Researchers from Sea AI Lab, the National University of Singapore, and Singapore Management University introduced a new approach called Dr. GRPO (Group Relative Policy Optimization Done Right) to address these issues. This method removes the problematic normalization terms from the GRPO formulation. Specifically, it eliminates the response length and standard deviation scaling factors that caused imbalances in model updates. The revised algorithm computes gradients more fairly across different responses and question types. They applied this method to train Qwen2.5-Math-7B, an open-source base model and demonstrated its effectiveness on multiple benchmarks. The training process used 27 hours of computing on 8× A100 GPUs, a relatively modest setup considering the results achieved. The researchers tested their method on prominent math reasoning benchmarks, including AIME 2024, AMC, MATH500, Minerva Math, and OlympiadBench. The model trained with Dr. GRPO achieved 43.3% accuracy on AIME 2024, significantly outperforming SimpleRL-Zero-7B (36.0%), Prime-Zero-7B (27.6%), and OpenReasoner-Zero-7B (16.7%). It also demonstrated strong average performance across all tasks: 40.9% on MATH500, 45.8% on Minerva, and 62.7% on OlympiadBench. These results validate the effectiveness of the bias-free RL method. Importantly, the model performed better and showed more efficient token usage. Incorrect responses became shorter and more focused, a notable shift from previous training methods encouraging overextended answers regardless of correctness....... Read full article: marktechpost.com/2025/03/22/… Paper: github.com/sail-sg/understan… GitHub Page: github.com/sail-sg/understan… @zzlccc
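To make the two removed terms concrete, here is a rough sketch contrasting the usual group-relative advantage with the version the post describes. It is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, the population standard deviation and the `eps` constant are choices made for the example, and the rewards are invented.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage with standard-deviation scaling."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Same centering, but without the standard-deviation scaling factor."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def token_weight(response_length, length_normalized):
    """Per-token loss weight: 1/length with length scaling, constant without it."""
    return 1.0 / response_length if length_normalized else 1.0


# Invented rewards for four sampled responses to one question.
rewards = [1.0, 0.0, 0.0, 0.0]
print(grpo_advantages(rewards))
print(dr_grpo_advantages(rewards))
# With length scaling, a 1000-token wrong answer is penalized less per token
# than a 100-token wrong answer, which is the imbalance the post refers to.
print(token_weight(1000, True), token_weight(100, True))
```

Dropping the `1/length` factor means a long incorrect response no longer receives a smaller per-token penalty than a short one, which is consistent with the post's observation that incorrect responses became shorter.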
NVIDIA AI Releases HOVER: A Breakthrough AI for Versatile Humanoid Control in Robotics Researchers from NVIDIA, Carnegie Mellon University, UC Berkeley, UT Austin, and UC San Diego introduced HOVER, a unified neural controller aimed at enhancing humanoid robot capabilities. This research proposes a multi-mode policy distillation framework, integrating different control strategies into one cohesive policy, thereby making a notable advancement in humanoid robotics. The researchers formulate humanoid control as a goal-conditioned reinforcement learning task where the policy is trained to track real-time human motion. The state includes the robot’s proprioception and a unified target goal state. Using these inputs, they define a reward function for policy optimization. The actions represent target joint positions that are fed into a PD controller. The system employs Proximal Policy Optimization (PPO) to maximize cumulative discounted rewards, essentially training the humanoid to follow target commands at each timestep..... Read full article here: marktechpost.com/2025/04/04/… Paper: pxl.to/ds6aqqk8 GitHub Page: pxl.to/ds6aqqk8 @nvidia @NVIDIAAI #ArtificialIntelligence #Robotics
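The last stage of that pipeline, a PD controller turning target joint positions into torques, is simple enough to sketch. This is a generic textbook PD law on a single joint with unit inertia, not HOVER's controller: the gains, the time step, and the `simulate_joint` helper are assumptions chosen for the example.

```python
def pd_torque(q_target, q, q_dot, kp, kd):
    """PD law: pull toward the target position and damp the joint velocity."""
    return kp * (q_target - q) - kd * q_dot


def simulate_joint(q_target, steps=5000, dt=0.001, kp=100.0, kd=20.0):
    """Integrate one joint with unit inertia (an assumption) under PD control."""
    q, q_dot = 0.0, 0.0
    for _ in range(steps):
        torque = pd_torque(q_target, q, q_dot, kp, kd)
        q_dot += torque * dt  # unit inertia: acceleration equals torque
        q += q_dot * dt
    return q


print(simulate_joint(q_target=1.0))
```

In the setup the post describes, the learned policy would supply `q_target` at each timestep, and the PD loop would handle the low-level tracking.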
Lavita AI Introduces Medical Benchmark for Advancing Long-Form Medical Question Answering with Open Models and Expert-Annotated Datasets A team of researchers from Lavita AI, Dartmouth Hitchcock Medical Center, and Dartmouth College introduced a publicly accessible benchmark designed to evaluate long-form medical QA systems comprehensively. The benchmark includes over 1,298 real-world consumer medical questions annotated by medical professionals. This initiative incorporates various performance criteria, such as correctness, helpfulness, reasoning, harmfulness, efficiency, and bias, to assess the capabilities of both open and closed-source models. The benchmark ensures a diverse and high-quality dataset by including annotations from human experts and utilizing advanced clustering techniques. The researchers also employed GPT-4 and other LLMs for semantic deduplication and question curation, resulting in a robust resource for model evaluation. The creation of this benchmark involved a multi-phase approach. The researchers collected over 4,271 user queries across 1,693 conversations from Lavita Medical AI Assist, filtering and deduplicating them to produce 1,298 high-quality medical questions. Using semantic similarity analysis, they reduced redundancy and ensured that the dataset represented a wide range of scenarios. Queries were categorized into three difficulty levels, basic, intermediate, and advanced, based on the complexity of the questions and the medical knowledge required to answer them. The researchers then created annotation batches, each containing 100 questions, with answers generated by various models for pairwise evaluation by human experts.... Read the full article here: marktechpost.com/2024/12/09/… Paper: arxiv.org/abs/2411.09834 GitHub Page: github.com/lavita-ai/medical… @LavitaAI
NVIDIA AI Introduces Omni-RGPT: A Unified Multimodal Large Language Model for Seamless Region-level Understanding in Images and Videos Researchers from NVIDIA and Yonsei University developed Omni-RGPT, a novel multimodal large language model designed to achieve seamless region-level comprehension in images and videos to address these challenges. This model introduces Token Mark, a groundbreaking method that embeds region-specific tokens into visual and text prompts, establishing a unified connection between the two modalities. The Token Mark system replaces traditional RoI-based approaches by defining a unique token for each target region, which remains consistent across frames in a video. This strategy prevents temporal drift and reduces computational costs, enabling robust reasoning for static and dynamic inputs. Including a Temporal Region Guide Head further enhances the model’s performance on video data by classifying visual tokens to avoid reliance on complex tracking mechanisms. Omni-RGPT leverages a newly created large-scale dataset called RegVID-300k, which contains 98,000 unique videos, 214,000 annotated regions, and 294,000 region-level instruction samples. This dataset was constructed by combining data from ten public video datasets, offering diverse and fine-grained instructions for region-specific tasks. The dataset supports visual commonsense reasoning, region-based captioning, and referring expression comprehension. Unlike other datasets, RegVID-300k includes detailed captions with temporal context and mitigates visual hallucinations through advanced validation techniques..... Read the full article here: marktechpost.com/2025/01/17/… Paper: arxiv.org/abs/2501.08326 Project Page: miranheo.github.io/omni-rgpt… @miran_heo @RHachiuma @subhashree_r @deanh_tw @CMHungSteven @nvidia @NVIDIAAI
Ant Group Releases Ling 2.0: A Reasoning-First MoE Language Model Series Built on the Principle that Each Activation Enhances Reasoning Capability How do you build a language model that grows in capacity but keeps the computation for each token almost unchanged? The Inclusion AI team from the Ant Group is pushing sparse large models in a methodical way by releasing Ling 2.0. Ling 2.0 is a reasoning-based language model family built on the idea that each activation should translate directly into stronger reasoning behavior. It is one of the latest approaches to show how to keep activation small while scaling from 16B to 1T without rewriting the recipe. The series has three versions: Ling mini 2.0 at 16B total with 1.4B activated, Ling flash 2.0 in the 100B class with 6.1B activated, and Ling 1T with 1T total and about 50B active per token...... Full analysis: marktechpost.com/2025/10/30/… Paper: pxllnk.co/khvhb2h Model weights: pxllnk.co/viv0tgm Repo: pxllnk.co/7zl4f8o @AntGroup
Researchers at NVIDIA AI Introduce ‘VILA’: A Vision Language Model that can Reason Among Multiple Images, Learn in Context, and Even Understand Videos Quick read: marktechpost.com/2024/05/04/… Researchers from NVIDIA and MIT have introduced a novel visual language model (VLM) pre-training framework, VILA, which emphasizes effective embedding alignment and utilizes dynamic neural network architectures. This research differs by leveraging a combination of interleaved corpora and joint supervised fine-tuning (SFT) to enhance visual and textual learning capabilities. The VILA framework is distinct for its emphasis on preserving in-context learning abilities while improving generalization, ensuring that models retain the ability to handle complex tasks efficiently. To improve visual and textual alignment, the methodology involved pre-training VILA on large-scale datasets, such as Coyo-700m. Researchers used a base LLaVA model to test different pre-training strategies, comparing freezing and updating the large language model (LLM) during training. They introduced Visual Instruction Tuning to fine-tune the models using visual language datasets with prompt-based instruction tuning. The evaluation process included testing the pre-trained models on benchmarks like OKVQA and TextVQA to assess visual question-answering capabilities, specifically measuring VILA’s accuracy and context-learning ability. Paper: arxiv.org/abs/2312.07533 GitHub: github.com/Efficient-Large-M… @NVIDIAAI #Artificiallntelligence
Exclusive Talk with Joey Conway of NVIDIA on Llama Nemotron Ultra and Open Source Models MarkTechPost team had the pleasure of interviewing Joey Conway from NVIDIA to discuss their exciting work on open-source large language models, including Llama Nemotron Ultra & Parakeet. Watch the full interview here: piped.video/watch?v=Q-iJiiUW… Read the full interview article: marktechpost.com/2025/05/15/… @nvidia @NVIDIAAI
Fin-R1: A Specialized Large Language Model for Financial Reasoning and Decision-Making Researchers from Shanghai University of Finance & Economics, Fudan University, and FinStep have developed Fin-R1, a specialized LLM for financial reasoning. With a compact 7-billion-parameter architecture, Fin-R1 reduces deployment costs while addressing key economic challenges: fragmented data, lack of reasoning control, and weak generalization. It is trained on Fin-R1-Data, a high-quality dataset containing 60,091 chain-of-thought (CoT) entries sourced from authoritative financial data. Through a two-stage training approach, Supervised Fine-Tuning (SFT) followed by RL, Fin-R1 improves accuracy and interpretability. It performs well in financial benchmarks, excelling in financial compliance and robo-advisory applications. The study presents a two-stage framework for constructing Fin-R1. The data generation phase involves creating a high-quality financial reasoning dataset, Fin-R1-Data, through data distillation with DeepSeek-R1 and filtering using an LLM-as-judge approach. In the model training phase, Fin-R1 is fine-tuned from Qwen2.5-7B-Instruct using SFT and Group Relative Policy Optimization (GRPO) to enhance reasoning and output consistency. The dataset combines open-source and proprietary financial data, refined through rigorous filtering. Training integrates supervised learning and reinforcement learning, incorporating structured prompts and reward mechanisms to improve financial reasoning accuracy and standardization....... Read full article: marktechpost.com/2025/03/22/… Paper: arxiv.org/abs/2503.16252 Model on Hugging Face: huggingface.co/SUFE-AIFLM-La…
This AI Paper Proposes a NeRF-based Mapping Method that Enables Higher-Quality Reconstruction and Real-Time Capability Even on Edge Computers Quick Read: marktechpost.com/2023/10/15/… Paper: arxiv.org/abs/2306.03207 Github: github.com/SYSU-STAR/H2-Mapp… If you like our work, you will love our newsletter: marktechpost-newsletter.beeh… #ArtificialIntelligence #MachineLearning
UC Berkeley Researchers Introduce Learnable Latent Codes as Bridges (LCB): A Novel AI Approach that Combines the Abstract Reasoning Capabilities of Large Language Models with Low-Level Action Policies Researchers from the University of California, Berkeley, introduced Latent Codes as Bridges (LCB), a robust policy architecture for control. LCB combines the strengths of modular hierarchical architectures with end-to-end learning. It allows direct utilization of LLMs for high-level reasoning alongside pre-trained skills for low-level control, enhancing them through end-to-end learning. By incorporating a <ACT> token at the interface layer to modulate low-level policies, LCB surpasses the limitations of relying solely on language, which struggles to describe certain behaviors. Also, by employing a separate <ACT> token, LCB preserves the core language generation and reasoning capabilities of LLMs during fine-tuning. Quick read: marktechpost.com/2024/05/11/… Paper: fredshentu.github.io/LCB_sit… @philippswu @YideShentu #ArtificialIntelligence
1/4 🧵 New research introduces AttrPrompt, an approach that uses a language model as a training data generator. This is a game-changer for Zero-Shot Learning, a paradigm that allows AI to understand tasks it's never seen before. 🚀 @yue___yu Quick Read: marktechpost.com/2023/07/02/…
LLMs No Longer Require Powerful Servers: Researchers from MIT, KAUST, ISTA, and Yandex Introduce a New AI Approach to Rapidly Compress Large Language Models without a Significant Loss of Quality The Yandex Research team, together with researchers from the Massachusetts Institute of Technology (MIT), the Austrian Institute of Science and Technology (ISTA) and the King Abdullah University of Science and Technology (KAUST), developed a method to rapidly compress large language models without a significant loss of quality. Previously, deploying large language models on mobile devices or laptops required a quantization process that took anywhere from hours to weeks and had to be run on industrial servers to maintain good quality. Now, quantization can be completed in a matter of minutes right on a smartphone or laptop without industry-grade hardware or powerful GPUs. The method, called HIGGS, lowers the barrier to entry for testing and deploying new models on consumer-grade devices, like home PCs and smartphones, by removing the need for industrial computing power....... Read full article: marktechpost.com/2025/04/11/… Paper: arxiv.org/abs/2411.17525
Google AI Introduces CodecLM: A Machine Learning Framework for Generating High-Quality Synthetic Data for LLM Alignment Quick read: marktechpost.com/2024/04/13/… Researchers at Google Cloud AI have developed CodecLM, an innovative framework designed to align LLMs with specific user instructions through tailored synthetic data generation. CodecLM distinguishes itself by utilizing an encode-decode mechanism to produce highly customized instructional data, ensuring that LLMs perform optimally across diverse tasks. This methodology leverages Self-Rubrics and Contrastive Filtering techniques, enhancing the relevance and quality of synthetic instructions and significantly improving the models’ ability to follow complex instructions accurately. CodecLM employs an encode-decode approach, transforming initial seed instructions into concise metadata that captures essential instruction characteristics. This metadata then guides the generation of synthetic instructions tailored to specific user tasks. To enhance instruction quality and relevance, the framework utilizes Self-Rubrics to add complexity and specificity and Contrastive Filtering to select the most effective instruction-response pairs based on performance metrics. The effectiveness of CodecLM is validated across several open-domain instruction-following benchmarks, demonstrating significant improvements in LLM alignment compared to traditional methods without relying on extensive manual data annotation. @GoogleAI #artificalintelligence #DataScientist #computer
LLMs Can Now Learn without Labels: Researchers from Tsinghua University and Shanghai AI Lab Introduce Test-Time Reinforcement Learning (TTRL) to Enable Self-Evolving Language Models Using Unlabeled Data Researchers from Tsinghua University and Shanghai AI Lab introduced Test-Time Reinforcement Learning (TTRL). TTRL is a training framework that applies RL during inference, using only unlabeled test data. It leverages the intrinsic priors of pre-trained language models to estimate pseudo-rewards through majority voting across sampled outputs. Instead of relying on explicit labels, TTRL constructs reward functions by aggregating multiple model-generated responses to a given query. A consensus answer, obtained via majority voting, is treated as a pseudo-label. Model responses that align with this pseudo-label are positively reinforced. This formulation transforms test-time inference into an adaptive, self-supervised learning process, allowing LLMs to improve over time without additional supervision...... Read full article: marktechpost.com/2025/04/22/… Paper: arxiv.org/abs/2504.16084 GitHub Page: github.com/PRIME-RL/TTRL
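The majority-voting step can be sketched as follows. This is a minimal illustration of turning sampled answers into pseudo-rewards, not the TTRL code: the function name and the 1.0/0.0 reward values are assumptions, and the sampled answers are invented.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Treat the most common sampled answer as a pseudo-label and score agreement.

    The 1.0 / 0.0 reward values are an assumption for this sketch.
    """
    # On a tie, Counter.most_common keeps the answer that was seen first.
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Invented final answers extracted from five sampled responses to one query.
sampled = ["42", "42", "41", "42", "7"]
label, rewards = majority_vote_rewards(sampled)
print(label, rewards)
```

The resulting rewards would then feed an RL update in place of rewards computed from ground-truth labels.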
ZeroSearch from Alibaba Uses Reinforcement Learning and Simulated Documents to Teach LLMs Retrieval Without Real-Time Search Researchers from Tongyi Lab at Alibaba Group introduced an innovative solution called ZeroSearch. This reinforcement learning framework removes the need for live API-based search entirely. Instead, it uses another language model to simulate the behavior of a search engine. The simulation model is fine-tuned through supervised training to generate documents that either help or mislead the policy model, depending on whether the content is designed to be relevant or noisy. This allows complete control over the document quality and cost while enabling a realistic retrieval training experience. A key innovation lies in using curriculum-based learning during training, which means gradually introducing harder retrieval tasks by adjusting how much noise is present in the generated documents. This progression helps the policy model develop resilience and better reasoning skills over time without ever making a real search query..... Read full article: marktechpost.com/2025/05/10/… Paper: arxiv.org/abs/2505.04588 Model on Hugging Face: huggingface.co/collections/s… Also, don't forget to check miniCON Agentic AI 2025- free registration: minicon.marktechpost.com
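One way to picture the curriculum is a schedule that raises the chance of serving a noisy simulated document as training progresses. The sketch below uses a linear ramp purely as an assumption; the post does not give the actual schedule, and the function names, the 0.0 to 0.5 range, and the step counts are all hypothetical.

```python
import random


def noise_probability(step, total_steps, p_start=0.0, p_end=0.5):
    """Linear ramp (an assumption) from p_start to p_end over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + (p_end - p_start) * frac


def choose_document_kind(step, total_steps, rng):
    """Decide whether the simulated search engine should emit a noisy document."""
    if rng.random() < noise_probability(step, total_steps):
        return "noisy"
    return "relevant"


rng = random.Random(0)
for step in (0, 50, 100):
    print(step, noise_probability(step, 100), choose_document_kind(step, 100, rng))
```

Early in training the policy model sees mostly relevant documents, and later batches mix in more misleading ones, matching the gradual increase in difficulty the post describes.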
How Can Robots Make Better Decisions? MIT and Stanford Researchers Introduce Diffusion-CCSP for Advanced Robotic Reasoning and Planning Quick Read: marktechpost.com/2023/09/09/… Paper: arxiv.org/abs/2309.00966 Project: diffusion-ccsp.github.io/ #DataScience #MachineLearning
Microsoft Researchers Introduce PromptBench: A PyTorch-based Python Package for Evaluation of Large Language Models (LLMs) Quick read: marktechpost.com/2023/12/23/… Paper: arxiv.org/abs/2312.07910v1 Github: github.com/microsoft/promptb… #ArtificialInteligence #MachineLearning #neural #DataScience @MSFTResearch
NVIDIA AI Releases UltraLong-8B: A Series of Ultra-Long Context Language Models Designed to Process Extensive Sequences of Text (up to 1M, 2M, and 4M tokens) Researchers from UIUC and NVIDIA have proposed an efficient training recipe for building ultra-long context LLMs from aligned instruct models, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens. The method uses efficient continued pretraining strategies to extend the context window, along with instruction tuning to maintain instruction-following and reasoning abilities. Moreover, their UltraLong-8B model achieves state-of-the-art performance across diverse long-context benchmarks. Models trained with this approach maintain competitive performance on standard benchmarks, showing balanced improvements for long and short context tasks. The research provides an in-depth analysis of key design choices, highlighting the impacts of scaling strategies and data composition. The proposed method consists of two key stages: continued pretraining and instruction tuning. Together, these stages enable the effective processing of ultra-long inputs while maintaining strong performance across tasks. A YaRN-based scaling approach, with hyperparameters fixed at α = 1 and β = 4, is adopted for context extension rather than NTK-aware scaling strategies. The scale factors are computed based on the target context length, and larger scaling factors are used for RoPE embeddings to accommodate extended sequences and mitigate performance degradation at maximum lengths. For training data, the researchers subsample high-quality SFT datasets spanning general, mathematics, and code domains, and further use GPT-4o and GPT-4o-mini to refine responses and perform rigorous data decontamination...... Read full article: marktechpost.com/2025/04/12/… Paper: arxiv.org/abs/2504.06214 Models on Hugging Face: huggingface.co/collections/n… @nvidia @_weiping @xuchejian
Cerebras Introduces the Bittensor Language Model Named BTLM-3B-8K: A New State-of-The-Art 3B Parameter Open-Source Language Model Quick Read: marktechpost.com/2023/09/30/… Paper: arxiv.org/abs/2309.11568 Project: huggingface.co/cerebras/btlm… #artificialintelligence #DataScience #NeuralNetwork
LightRAG: A Dual-Level Retrieval System Integrating Graph-Based Text Indexing to Tackle Complex Queries and Achieve Superior Performance in Retrieval-Augmented Generation Systems A research team from Beijing University of Posts and Telecommunications and the University of Hong Kong introduced LightRAG in response to these challenges. This novel framework integrates graph structures into RAG systems. The core innovation of LightRAG is its use of a graph-based text indexing paradigm combined with a dual-level retrieval system. Graph structures allow the model to capture complex relationships between different entities in the data, providing a more comprehensive understanding of the information. By adding graph representations, LightRAG can retrieve related entities and their relationships efficiently, improving the retrieval process’s speed and accuracy. This approach also reduces computational costs, eliminating the need to rebuild entire data structures when incorporating new data. LightRAG combines detailed (low-level) and conceptual (high-level) information retrieval. Low-level retrieval retrieves specific entities and their attributes, ensuring precise and focused information. Meanwhile, high-level retrieval captures broader topics and themes, enabling the system to understand the bigger picture. This dual-level strategy allows LightRAG to answer complex queries by combining detailed and abstract information. Also, LightRAG includes an incremental update algorithm that facilitates real-time updates without reprocessing the entire database. This feature makes the system more responsive and capable of handling fast-paced changes in data, a vital capability in dynamic environments... Read full article here: marktechpost.com/2024/10/12/… Paper: arxiv.org/abs/2410.05779 GitHub: github.com/HKUDS/LightRAG?ta…
MedGraphRAG: An AI Framework for Improving the Performance of LLMs in the Medical Field through Graph Retrieval Augmented Generation (RAG) A team of researchers from the University of Oxford has developed a unique AI framework called MedGraphRAG to improve Large Language Models’ performance in the medical field. The evidence-based outcomes that this framework produces are essential for enhancing the security and dependability of LLMs when handling sensitive medical data. Hybrid static-semantic document chunking is a unique document processing approach that forms the basis of the MedGraphRAG system. This strategy records context better than standard techniques. Rather than just dividing documents into fixed-size sections or pieces, this method considers the semantic content, making context preservation more successful. This is a crucial step in domains such as medicine since correct information retrieval and response production depend on a thorough grasp of context..... Read our full take: marktechpost.com/2024/08/12/… Paper: arxiv.org/abs/2408.04187
Why Small Language Models (SLMs) Are Poised to Redefine Agentic AI: Efficiency, Cost, and Practical Deployment Small language models (SLMs) are emerging as a compelling alternative to large language models (LLMs) in agentic AI systems. Researchers from NVIDIA and Georgia Tech demonstrate that SLMs can handle the majority of repetitive and specialized tasks performed by AI agents, offering significant advantages in efficiency, cost, and deployment flexibility. These models can operate on consumer devices, reducing latency, energy consumption, and reliance on costly cloud infrastructure. By leveraging SLMs for targeted agentic operations, organizations can build more modular, maintainable, and sustainable AI systems without sacrificing core performance for focused use cases. While LLMs still hold value for complex reasoning and open-domain conversational needs, the paper highlights that a hybrid approach—using SLMs for routine tasks and reserving LLMs for higher-level operations—maximizes both efficiency and capability. The transition to SLM-based architectures requires careful data collection, task clustering, and specialized fine-tuning, but promises to democratize access to AI and enable broader innovation. The authors argue that shifting to SLMs not only cuts operational costs but also drives a more responsible, resource-conscious AI ecosystem for the future...... 📄 Full breakdown here: marktechpost.com/2025/06/18/… 📝 Paper: arxiv.org/abs/2506.02153
Researchers from the University of Maryland and Adobe Introduce DynaSaur: The LLM Agent that Grows Smarter by Writing its Own Functions Researchers from the University of Maryland and Adobe introduce DynaSaur: an LLM agent framework that enables the dynamic creation and composition of actions online. Unlike traditional systems that rely on a fixed set of predefined actions, DynaSaur allows agents to generate, execute, and refine new Python functions in real-time whenever existing functions prove insufficient. The agent maintains a growing library of reusable functions, enhancing its ability to respond to diverse scenarios. This dynamic ability to create, execute, and store new tools makes AI agents more adaptable to real-world challenges.... Read the full article here: marktechpost.com/2024/11/23/… Paper: arxiv.org/abs/2411.01747
NVIDIA Research Introduces ChipAlign: A Novel AI Approach that Utilizes a Training-Free Model Merging Strategy, Combining the Strengths of a General Instruction-Aligned LLM with a Chip-Specific LLM NVIDIA’s ChipAlign merges the strengths of a general instruction-aligned LLM and a chip-specific LLM. This approach avoids the need for extensive retraining and instead employs a training-free model merging strategy. At its core is geodesic interpolation, a method that treats model weights as points on a geometric space, enabling smooth integration of their capabilities. Unlike traditional multi-task learning, which requires large datasets and computational resources, ChipAlign directly combines pre-trained models. This method ensures that the resulting model retains the strengths of both inputs, offering a practical solution for integrating specialized knowledge with instruction alignment. Benchmark results demonstrate the effectiveness of ChipAlign: ✅ On the IFEval benchmark, ChipAlign shows a 26.6% improvement in instruction alignment. ✅ In domain-specific tasks, such as the OpenROAD QA benchmark, it achieves up to 6.4% higher ROUGE-L scores compared to other model-merging techniques. ✅ In industrial chip QA, ChipAlign outperforms baseline models by up to 8.25%, excelling in both single-turn and multi-turn scenarios....... Read the full article here: marktechpost.com/2025/01/02/… Paper: arxiv.org/abs/2412.19819 @nvidia @NVIDIAAI
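Geodesic interpolation between two weight vectors can be illustrated with spherical linear interpolation (slerp). The sketch below is not ChipAlign's implementation: treating each weight tensor as a flat list, projecting it onto the unit sphere, and restoring scale with a geometric blend of the two norms are assumptions made for the example, and `lam` is a hypothetical merge coefficient.

```python
import math


def geodesic_merge(w_a, w_b, lam):
    """Interpolate along the great circle between two weight vectors.

    lam = 1.0 returns w_a and lam = 0.0 returns w_b.
    """
    norm_a = math.sqrt(sum(x * x for x in w_a))
    norm_b = math.sqrt(sum(x * x for x in w_b))
    u_a = [x / norm_a for x in w_a]
    u_b = [x / norm_b for x in w_b]
    # Angle between the two unit vectors, clamped for numerical safety.
    cos_theta = max(-1.0, min(1.0, sum(a * b for a, b in zip(u_a, u_b))))
    theta = math.acos(cos_theta)
    if theta < 1e-8:
        merged = u_a  # nearly parallel: nothing to interpolate
    else:
        s = math.sin(theta)
        coef_a = math.sin(lam * theta) / s
        coef_b = math.sin((1.0 - lam) * theta) / s
        merged = [coef_a * a + coef_b * b for a, b in zip(u_a, u_b)]
    # Restore a scale between the two original norms (an assumption).
    scale = (norm_a ** lam) * (norm_b ** (1.0 - lam))
    return [scale * m for m in merged]


print(geodesic_merge([1.0, 0.0], [0.0, 1.0], 0.5))
```

Unlike plain averaging, the slerp result stays on the sphere, so the merged vector keeps unit length when both inputs have unit length.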
20
61
2,454
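A rough sketch of what interpolating along a geodesic between two weight vectors can look like, using spherical interpolation on flattened weights. The function name `slerp`, the mixing parameter `lam`, and the norm-rescaling rule are assumptions for illustration, not ChipAlign's published code.

```python
import math


def slerp(w_a, w_b, lam):
    """Interpolate along the great-circle arc between two weight vectors."""
    norm_a = math.sqrt(sum(x * x for x in w_a))
    norm_b = math.sqrt(sum(x * x for x in w_b))
    u_a = [x / norm_a for x in w_a]
    u_b = [x / norm_b for x in w_b]
    # Angle between the two unit vectors, clamped for numerical safety.
    cos_theta = max(-1.0, min(1.0, sum(a * b for a, b in zip(u_a, u_b))))
    theta = math.acos(cos_theta)
    if math.sin(theta) < 1e-8:
        # Nearly parallel or anti-parallel: fall back to plain linear mixing.
        return [(1 - lam) * a + lam * b for a, b in zip(w_a, w_b)]
    c_a = math.sin((1 - lam) * theta) / math.sin(theta)
    c_b = math.sin(lam * theta) / math.sin(theta)
    unit = [c_a * a + c_b * b for a, b in zip(u_a, u_b)]
    # Assumed rescaling: geometric blend of the two original norms.
    scale = norm_a ** (1 - lam) * norm_b ** lam
    return [scale * x for x in unit]


merged = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

With `lam=0` the first model's weights come back unchanged, and with `lam=1` the second's; values in between trace the arc rather than the straight chord that linear averaging would follow.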
Google DeepMind Introduces Tandem Transformers for Inference-Efficient Large Language Models (LLMs) Quick read: marktechpost.com/2024/03/02/… Paper: arxiv.org/abs/2402.08644 #ArtificialIntelligence
2
22
63
4,260
Transform Your Understanding of Attention: EPFL’s Cutting-Edge Research Unlocks the Secrets of Transformer Efficiency! Quick read: marktechpost.com/2024/02/21/… A groundbreaking study conducted by researchers from the Statistical Physics of Computation Laboratory and the Information Learning & Physics Laboratory at EPFL, Switzerland, sheds new light on the dynamics of dot-product attention layers. The team meticulously examines how these layers learn to prioritize input tokens based on their positional relationships or semantic connections. This exploration is particularly significant as it taps into the foundational aspects of learning mechanisms within transformers, offering insights into their adaptability and efficiency in handling diverse tasks. Paper: arxiv.org/abs/2402.03902 #ArtificialIntelligence @EPFL_en @zdeborova
18
64
5,290
This AI Paper by Meta FAIR Introduces MoMa: A Modality-Aware Mixture-of-Experts Architecture for Efficient Multimodal Pre-training Researchers at Meta introduced MoMa, a novel modality-aware mixture-of-experts (MoE) architecture designed to pre-train mixed-modal, early-fusion language models. MoMa processes text and images in arbitrary sequences by dividing expert modules into modality-specific groups. Each group exclusively handles designated tokens, employing learned routing within each group to maintain semantically informed adaptivity. This architecture significantly improves pre-training efficiency, with empirical results showing substantial gains. The research, conducted by a team at Meta, showcases the potential of MoMa to advance mixed-modal language models. The technology behind MoMa involves a combination of mixture-of-experts (MoE) and mixture-of-depths (MoD) techniques. In MoE, tokens are routed across a set of feed-forward blocks (experts) at each layer. These experts are divided into text-specific and image-specific groups, allowing for specialized processing pathways. This approach, termed modality-aware sparsity, enhances the model’s ability to capture features specific to each modality while maintaining cross-modality integration through shared self-attention mechanisms. Furthermore, MoD allows tokens to selectively skip computations at certain layers, further optimizing the processing efficiency. Read our take on this: marktechpost.com/2024/08/03/… @AIatMeta
1
21
65
2,631
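Modality-aware routing can be pictured as ordinary learned routing restricted to one group of experts. The sketch below is a toy: the expert names and the hand-written logits are hypothetical, and a real model would produce the logits from a learned router.

```python
# Hypothetical expert groups, one per modality.
EXPERT_GROUPS = {
    "text": ["text_expert_0", "text_expert_1"],
    "image": ["image_expert_0", "image_expert_1"],
}


def route(modality, router_logits):
    """Pick the highest-scoring expert, restricted to the token's modality group."""
    group = EXPERT_GROUPS[modality]
    best = max(range(len(group)), key=lambda i: router_logits[i])
    return group[best]


choice_text = route("text", [0.2, 1.5])
choice_image = route("image", [0.7, -0.3])
```

A text token can never land on an image expert here, while the choice inside each group still depends on the router scores.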
Aloe: A Family of Fine-tuned Open Healthcare LLMs that Achieves State-of-the-Art Results through Model Merging and Prompting Strategies Researchers from the Barcelona Supercomputing Center (BSC) and Universitat Politècnica de Catalunya – Barcelona Tech (UPC) have developed the Aloe models, a new series of healthcare LLMs. These models employ innovative strategies such as model merging and instruct tuning, leveraging the best features of existing models and enhancing them through sophisticated training regimens on both public and proprietary synthesized datasets. The Aloe models are trained using a novel dataset that includes a mixture of public data sources and synthetic data generated through advanced Chain of Thought (CoT) techniques. The technological backbone of the Aloe models involves integrating various new data processing and training strategies. For instance, they use an alignment phase with Direct Preference Optimization (DPO) to align the models ethically, and their performance is tested against numerous bias and toxicity metrics. The models also undergo a rigorous red teaming process to assess potential risks and ensure their safety in deployment. Quick read: marktechpost.com/2024/05/11/… Paper: arxiv.org/abs/2405.01886 Model: huggingface.co/HPAI-BSC/Llam… #ai #ArtificialIntelligence @hpai_bsc #NeuralNetworks
3
15
62
4,756
Researchers at Stanford University Introduce Octopus v2: Empowering On-Device Language Models for Super Agent Functionality Quick read: marktechpost.com/2024/04/06/… Researchers from Stanford University have introduced Octopus v2, an advanced on-device language model aimed at addressing the prevalent issues of latency, accuracy, and privacy concerns associated with current LLM applications. Unlike previous models, Octopus v2 significantly reduces latency and enhances accuracy for on-device applications. Its uniqueness lies in the fine-tuning method with functional tokens, enabling precise function calling and surpassing GPT-4 in efficiency and speed while dramatically cutting the context length by 95%. The methodology for Octopus v2 involved fine-tuning a 2 billion parameter model derived from Google DeepMind’s Gemma 2B on a tailored dataset focusing on Android API calls. This dataset was constructed with positive and negative examples to enhance function calling precision. The training incorporated full model and Low-Rank Adaptation (LoRA) techniques to optimize performance for on-device execution. The key innovation was the introduction of functional tokens during fine-tuning, significantly reducing latency and context length requirements. This process allowed Octopus v2 to achieve high accuracy and efficiency in function calling on edge devices without extensive computational resources. #ArtificialInteligence @Stanford
1
10
60
4,851
AgentLite by Salesforce AI Research: Transforming LLM Agent Development with an Open-Source, Lightweight, Task-Oriented Library for Enhanced Innovation Quick read: marktechpost.com/2024/03/24/… A research team from Salesforce AI Research presents AgentLite, an open-source AI Agent library that simplifies the design and deployment of LLM agents. This innovative tool strips away the complexity that has previously troubled the development process, offering a streamlined path for researchers to pioneer new strategies and architectures in LLM agent systems. Although advanced, traditional frameworks often have a steep learning curve and a bulky codebase that can stifle creativity and slow experimentation. By contrast, AgentLite stands out with its lean code architecture and task-oriented design, encouraging rapid prototyping and iterative testing. At fewer than 1,000 lines of code, it contrasts starkly with existing libraries, which range from 8,966 to 248,650 lines according to comparisons made within the research. This compact yet powerful approach enables researchers to focus more on innovation and less on navigating the intricacies of the tool they are using. #ArtificialIntelligence #DataScientist @SFResearch
1
18
62
4,411
Meet LocAgent: Graph-Based AI Agents Transforming Code Localization for Scalable Software Maintenance A team of researchers from Yale University, University of Southern California, Stanford University, and All Hands AI developed LocAgent, a graph-guided agent framework to transform code localization. Rather than depending on lexical matching or static embeddings, LocAgent converts entire codebases into directed heterogeneous graphs. These graphs include nodes for directories, files, classes, and functions, and edges that capture relationships like function invocation, file imports, and class inheritance. This structure allows the agent to reason across multiple levels of code abstraction. The system then applies tools like SearchEntity, TraverseGraph, and RetrieveEntity to allow LLMs to explore the system step-by-step. The use of sparse hierarchical indexing ensures rapid access to entities, and the graph design supports multi-hop traversal, which is essential for finding connections across distant parts of the codebase. LocAgent performs indexing within seconds and supports real-time usage, making it practical for developers and organizations. The researchers fine-tuned two open-source models, Qwen2.5-7B and Qwen2.5-32B, on a curated set of successful localization trajectories. These models performed impressively on standard benchmarks. For instance, on the SWE-Bench-Lite dataset, LocAgent achieved 92.7% file-level accuracy using Qwen2.5-32B, compared to 86.13% with Claude-3.5 and lower scores from other models. On the newly introduced Loc-Bench dataset, which contains 660 examples across bug reports (282), feature requests (203), security issues (31), and performance problems (144), LocAgent again showed competitive results, achieving 84.59% Acc@5 and 87.06% Acc@10 at the file level. Even the smaller Qwen2.5-7B model delivered performance close to high-cost proprietary models while costing only $0.05 per example, a stark contrast to the $0.66 cost of Claude-3.5...... Read full article: marktechpost.com/2025/03/23/… Paper: arxiv.org/abs/2503.09089 GitHub: github.com/gersteinlab/LocAg… @XiangruTang
1
16
63
3,537
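To make the graph idea concrete, here is a minimal sketch of a typed code graph with a bounded multi-hop walk. The `GRAPH` contents and the `traverse` function are hypothetical stand-ins; they are not LocAgent's TraverseGraph tool or its index format.

```python
from collections import deque

# Hypothetical toy code graph: node -> list of (edge_type, target).
GRAPH = {
    "src/": [("contains", "src/app.py"), ("contains", "src/db.py")],
    "src/app.py": [("contains", "app.handle"), ("imports", "src/db.py")],
    "src/db.py": [("contains", "db.query")],
    "app.handle": [("invokes", "db.query")],
    "db.query": [],
}


def traverse(start, edge_types, max_hops):
    """Breadth-first walk following only the given edge types, up to max_hops."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        node, hops = queue.popleft()
        order.append(node)
        if hops == max_hops:
            continue  # hop budget spent on this branch
        for edge_type, target in GRAPH[node]:
            if edge_type in edge_types and target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return order


reachable = traverse("src/app.py", {"contains", "imports", "invokes"}, max_hops=2)
calls_only = traverse("app.handle", {"invokes"}, max_hops=1)
```

Filtering by edge type is what lets a single walk answer different questions, for example "what does this function call" versus "what lives in this file".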
Researchers from SynthLabs and Stanford Propose Meta Chain-of-Thought (Meta-CoT): An AI Framework for Improving LLM Reasoning Researchers from SynthLabs and Stanford have proposed Meta Chain-of-Thought (Meta-CoT), a framework designed to model the latent steps necessary for solving complex problems. Unlike classical CoT, which focuses on linear reasoning, Meta-CoT incorporates a structured approach inspired by cognitive science’s dual-process theory. This framework seeks to emulate deliberate, logical, and reflective thinking, often referred to as “System 2” reasoning. Meta-CoT integrates instruction tuning, synthetic data generation, and reinforcement learning to help models internalize these reasoning processes. By doing so, it bridges the gap between conventional reasoning methods and the complexities of real-world problem-solving. The framework employs algorithms such as Monte Carlo Tree Search (MCTS) and A* search to generate synthetic data that reflects latent reasoning processes. This data, combined with process supervision, enables models to move beyond simplistic left-to-right token prediction and better approximate the true reasoning pathways required for complex tasks...... Read the full article here: marktechpost.com/2025/01/08/… Paper: arxiv.org/abs/2501.04682 @synth_labs @Stanford
16
60
2,674
Apple Researchers Propose a Multimodal AI Approach to Device-Directed Speech Detection with Large Language Models Quick read: marktechpost.com/2024/03/24/… Paper: arxiv.org/abs/2403.14438 #ArtificialIntelligence
1
18
61
2,002
Meta AI Introduces Searchformer for Improving Planning Efficiency: A Transformer Model for Complex Decision-Making Tasks Quick read: marktechpost.com/2024/03/02/… The research team at Meta has introduced Searchformer, a novel Transformer model that significantly improves planning efficiency in complex tasks like Sokoban puzzles. Unlike traditional approaches, Searchformer combines the strengths of Transformers with the structured search dynamics of symbolic planners, leading to a more efficient planning process. Searchformer can solve complex planning tasks more efficiently than traditional planning algorithms like A* search. It is trained in two steps: first, it is trained to imitate the search procedure of A* search using synthetic datasets generated from randomly generated planning task instances. In the second step, the model is further improved using expert iteration, encouraging the Transformer to generate fewer search steps while finding optimal solutions. Two token sequences were produced: one with augmented search dynamics and another focusing solely on solutions. By training Transformer models to predict these sequences, researchers aimed to capture the computational process of A*. Further improvements involved fine-tuning these models on datasets of progressively shorter sequences that still led to optimal outcomes, significantly enhancing efficiency by reducing the necessary search steps for problem-solving. Paper: arxiv.org/abs/2402.14083 #ArtificialIntelligence #DataScience @AIatMeta
23
60
5,445
OpenResearcher: An Open-Source Project that Harnesses AI to Accelerate Scientific Research Researchers from Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory, Fudan University, The Hong Kong Polytechnic University, Hong Kong University of Science and Technology, Westlake University, Tsinghua University, and Generative AI Research Lab (GAIR) have proposed OpenResearcher, an open-source project designed to accelerate scientific research through AI. This unified application handles diverse researcher questions, competing with industry tools while remaining open-source. OpenResearcher differentiates itself as an active assistant, asking guiding questions to understand user queries better. It uses retrieval augmentation from the Internet and the arXiv corpus to deliver current, domain-specific knowledge. The system also features custom tools, such as one for refining initial results, and supports in-depth discussions through follow-up questions, generating a complete solution for AI-assisted research. The performance of OpenResearcher is evaluated using a diverse set of 109 research questions gathered from over 20 graduate students. These questions spanned various research areas, including scientific paper recommendation, scientific text summarization, multimodal learning, agent systems, LLM alignment, tool learning, LLM safety, and RAG. The evaluation used pairwise comparison rather than annotated ground truths, because the questions call for long, complex answers that often require reviewing multiple papers. The comparison included recent industry applications like Perplexity AI, iAsk, You.com, and Phind, and a basic RAG system that only used hybrid retrieval and LLM generation tools. Read our full take on this: marktechpost.com/2024/08/17/… Paper: arxiv.org/abs/2408.06941 GitHub: github.com/GAIR-NLP/OpenRese…
1
20
58
3,999
We have just released our latest magazine report on the hottest topic for the year 2024: 'Small Language Models' Download the E-Copy here: embeds.beehiiv.com/f2b6898f-… @activeloop @predibase @arcee_ai @AporiaAI @arizeai @GoogleAI @fiddlerlabs @itsArthurAI @AMD @huggingface @LaminiAI @LambdaAPI @neuralmagic @runailabs @nebiusai
2
13
56
206,148
MIT Researchers Propose Boltz-1: The First Open-Source AI Model Achieving AlphaFold3-Level Accuracy in Biomolecular Structure Prediction A team of MIT researchers has introduced Boltz-1, the first open-source and commercially accessible model that matches AlphaFold3-level accuracy in predicting biomolecular complexes. Unlike its predecessors, Boltz-1 is fully accessible to the public, with the model weights, training, and inference code released under the MIT license. This openness aims to foster global collaboration and advance biomolecular modeling. Boltz-1 follows the general framework used in AlphaFold3 but introduces several architectural and procedural innovations, including new multiple sequence alignment (MSA) pairing algorithms, a unified cropping approach for efficient training, and an enhanced confidence model. These innovations allow Boltz-1 to deliver high accuracy while remaining accessible and significantly lowering the computational burden. The researchers demonstrated Boltz-1’s capabilities through various benchmarks. On CASP15, a competition for protein structure prediction, Boltz-1 showcased strong performance in protein-ligand and protein-protein prediction tasks, achieving an LDDT-PLI of 65%, compared to Chai-1’s 40%. Moreover, Boltz-1 had a DockQ success rate of 83%, surpassing Chai-1’s 76%. These results highlight Boltz-1’s reliability and robustness in predicting biomolecular interactions, especially in protein-ligand complex prediction, where it excelled in aligning small molecules with their respective binding pockets.... Read the full article here: marktechpost.com/2024/11/17/… Technical report: gcorso.github.io/assets/bolt… Code/Model: github.com/jwohlwend/boltz @GabriCorso @jeremyWohlwend @pas_saro @KenLeidal @WojTechnology @Itamarchinn @BarzilayRegina @enfeinberg , @alshedivat , @sokrypton , @HannesStaerk , @json_yim , @WengongJin
20
59
4,542
Comparative Evaluation of SAM2 and SAM1 for 2D and 3D Medical Image Segmentation: Performance Insights and Transfer Learning Potential Researchers from the University Health Network and the University of Toronto have comprehensively evaluated the Segment Anything Model 2 (SAM2) across 11 medical image modalities and videos. They compared SAM2 with SAM1 and MedSAM, identifying both strengths and weaknesses. They developed a transfer learning pipeline to adapt SAM2 for medical use and successfully fine-tuned the model. Additionally, they integrated SAM2 into a 3D Slicer plugin. They implemented a Gradio API, enabling efficient 3D image and video segmentation for medical data like CT, MR, and PET, which the official SAM2 interface does not support. The study used public datasets from the CVPR 2024 Medical Image Segmentation on Laptop Challenge for evaluation, excluding any data from the MedSAM training set. CT images were preprocessed with intensity cutoffs, MR and PET images were clipped and normalized, while other modalities remained unchanged. All images were converted to npz format for batch inference. SAM2, an extension of SAM1, incorporates Hiera for multi-scale feature extraction and a memory attention module for consistent video segmentation across frames. The fine-tuning of SAM2-Tiny involved freezing the prompt encoder, updating the image encoder and mask decoder, and using Dice and cross-entropy losses for robust segmentation.... Read our full take on this: marktechpost.com/2024/08/08/… Paper: arxiv.org/abs/2408.03322 GitHub: github.com/bowang-lab/MedSAM… @UHN @UofTCompSci @UofT_LMP @UofT_TCAIREM @VectorInst @BoWang87 @BaharoonMS
1
15
55
7,917
Llama-3-based OpenBioLLM-Llama3-70B and 8B: Outperforming GPT-4, Gemini, Meditron-70B, Med-PaLM-1 and Med-PaLM-2 in Medical-Domain Quick read: marktechpost.com/2024/04/29/… Open Medical-LLM Leaderboard: huggingface.co/spaces/openli… OpenBioLLM-70B project page: huggingface.co/aaditya/Llama… OpenBioLLM-8B project page: huggingface.co/aaditya/Llama… #ArtificialIntelligence
16
57
3,350
Google DeepMind Presents Mixture-of-Depths: Optimizing Transformer Models for Dynamic Resource Allocation and Enhanced Computational Sustainability Quick read: marktechpost.com/2024/04/06/… Researchers from Google DeepMind, McGill University, and Mila have introduced a groundbreaking method called Mixture-of-Depths (MoD), which diverges from the traditional uniform resource allocation model. MoD empowers transformers to dynamically distribute computational resources, focusing on the most pivotal tokens within a sequence. This method represents a paradigm shift in managing computational resources and promises substantial efficiency and performance improvements. MoD’s innovation lies in its ability to adjust computational focus within a transformer model dynamically, applying more resources to parts of the input sequence that are deemed more critical for the task at hand. The technique operates under a fixed computational budget, strategically selecting tokens for processing based on a routing mechanism that evaluates their significance. This approach drastically reduces unnecessary computations, effectively slashing the transformer’s operational demands while maintaining or enhancing its performance. @GoogleDeepMind
1
11
54
2,417
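The fixed-budget routing described above can be sketched in a few lines. This is a simplification under stated assumptions: `mod_layer`, `capacity`, and the scalar "tokens" are made up for illustration, and a real layer would add the block's output to a residual stream instead of replacing the token.

```python
def mod_layer(tokens, router_scores, capacity, block):
    """Apply `block` only to the `capacity` highest-scoring tokens; the rest pass through."""
    ranked = sorted(range(len(tokens)), key=lambda i: router_scores[i], reverse=True)
    selected = set(ranked[:capacity])
    return [block(t) if i in selected else t for i, t in enumerate(tokens)]


tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.9, 0.3, 0.8]
# Budget of two tokens per layer, regardless of sequence content.
out = mod_layer(tokens, scores, capacity=2, block=lambda x: x * 10)
```

Because `capacity` is fixed ahead of time, the compute cost of the layer is known in advance even though which tokens get processed changes per input.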
This AI Paper Introduces DSPy: A Programming Model that Abstracts Language Model Pipelines as Text Transformation Graphs Quick Read: marktechpost.com/2023/10/14/… Paper: arxiv.org/abs/2310.03714 Github: github.com/stanfordnlp/dspy If you like our work, you will love our newsletter: marktechpost-newsletter.beeh… #ArtificialInteligence #DataScientists #programming
1
20
56
11,159
Ovis-1.6: An Open-Source Multimodal Large Language Model (MLLM) Architecture Designed to Structurally Align Visual and Textual Embeddings A research team from Alibaba Group and Nanjing University introduced Ovis 1.6, a new version of Ovis, a multimodal large language model (MLLM) that structurally aligns visual and textual embeddings. Ovis employs a unique visual embedding look-up table, similar to the one used for textual embeddings, to create structured visual representations. This table enables the visual encoder to produce embeddings compatible with textual embeddings, resulting in more effective visual and textual information integration. The model also utilizes probabilistic tokens for visual patches mapped into the visual embedding table multiple times. This approach mirrors the structured representation used in textual data, facilitating a coherent combination of visual and textual inputs. Ovis’s core innovation lies in using a visual embedding table that aligns visual tokens with their textual counterparts. A probabilistic token represents each image patch and indexes the visual embedding table multiple times to generate a final visual embedding. This process captures the rich semantics of each visual patch and results in embeddings structurally similar to textual tokens. In contrast to conventional methods, which rely on linear projections to map visual embeddings into a joint space, Ovis adopts a probabilistic approach to generate more meaningful visual embeddings. This method enables Ovis to overcome the limitations of connector-based architectures and achieve better performance in multimodal tasks... Read our full take on this: marktechpost.com/2024/09/29/… Paper: arxiv.org/abs/2405.20797 HF Model: huggingface.co/AIDC-AI/Ovis1…
14
54
2,105
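One way to read "a probabilistic token indexes the table multiple times" is as a probability-weighted sum over the rows of the visual embedding table. The sketch below assumes that reading; the table values and function names are hypothetical, and the real model works on much larger learned tensors.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def visual_embedding(patch_logits, table):
    """Expected embedding: probability-weighted sum of visual embedding table rows."""
    probs = softmax(patch_logits)
    dim = len(table[0])
    return [sum(p * row[d] for p, row in zip(probs, table)) for d in range(dim)]


# Hypothetical 3-entry visual vocabulary with 2-dimensional embeddings.
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
emb = visual_embedding([0.0, 0.0, 0.0], table)
```

A text token is the special case where the distribution is one-hot, so both modalities end up as look-ups into a table of the same shape.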
Microsoft Researchers Combine Small and Large Language Models for Faster, More Accurate Hallucination Detection Researchers from Microsoft Responsible AI present a robust workflow to address the challenges of hallucination detection in LLMs. This approach aims to balance latency and interpretability by combining a small classification model, specifically a small language model (SLM), with a downstream LLM module called a “constrained reasoner.” The SLM performs initial hallucination detection, while the LLM module explains the detected hallucinations. This method utilizes the relatively infrequent occurrence of hallucinations in practical use, making the average time cost of using LLMs for reasoning on hallucinated texts manageable. Additionally, the approach capitalizes on LLMs’ pre-existing reasoning and explanation capabilities, eliminating the need for extensive domain-specific data and the significant computational cost associated with fine-tuning. This framework mitigates a potential issue in combining SLMs and LLMs: inconsistency between the SLM’s decisions and the LLM’s explanations. This problem is particularly relevant in hallucination detection, where alignment between detection and explanation is crucial. The study focuses on resolving this issue within the two-stage hallucination detection framework. Additionally, the researchers analyze LLM reasonings about SLM decisions and ground truth labels, exploring the potential of LLMs as feedback mechanisms for improving detection processes. The study makes two primary contributions: introducing a constrained reasoner for hallucination detection that balances latency and interpretability and providing a comprehensive analysis of upstream-downstream consistency, along with practical solutions to enhance alignment between detection and explanation. The effectiveness of this approach is demonstrated across multiple open-source datasets..... 
Read our full take on this: marktechpost.com/2024/08/31/… Paper: arxiv.org/abs/2408.12748 @Microsoft @MSFTnews
9
54
2,523
1/ F5-TTS: A Fully Non-Autoregressive Text-to-Speech System based on Flow Matching with Diffusion Transformer (DiT) Researchers from Shanghai Jiao Tong University, the University of Cambridge, and Geely Automobile Research Institute introduced F5-TTS, a non-autoregressive text-to-speech (TTS) system that utilizes flow matching with a Diffusion Transformer (DiT). Unlike many conventional TTS models, F5-TTS does not require complex elements like duration modeling, phoneme alignment, or a dedicated text encoder. Instead, it introduces a simplified approach where text inputs are padded to match the length of the speech input, leveraging flow matching for effective synthesis. F5-TTS is designed to address the shortcomings of its predecessor, E2 TTS, which faced slow convergence and alignment issues between speech and text. Notable improvements include a ConvNeXt architecture to refine text representation and a novel Sway Sampling strategy during inference, significantly enhancing performance without retraining. Structurally, F5-TTS leverages ConvNeXt and DiT to overcome alignment challenges between the text and generated speech. The input text is first processed by ConvNeXt blocks to prepare it for in-context learning with speech, allowing smoother alignment. The character sequence, padded with filler tokens, is fed into the model alongside a noisy version of the input speech. The Diffusion Transformer (DiT) backbone is used for training, employing flow matching to map a simple initial distribution to the data distribution effectively. Additionally, F5-TTS includes an innovative inference-time Sway Sampling technique that helps control flow steps, prioritizing early-stage inference to improve the alignment of generated speech with the input text.... 1/ ⤵️ ⤵️
1
12
53
2,885
Google DeepMind Researchers Introduce Diffusion Augmented Agents: A Machine Learning Framework for Efficient Exploration and Transfer Learning Researchers from Imperial College London and Google DeepMind have introduced the Diffusion Augmented Agents (DAAG) framework to address these challenges. This framework integrates large language models, vision language models, and diffusion models to enhance sample efficiency and transfer learning. The research team developed this framework to operate autonomously, minimizing the need for human supervision. By combining these advanced models, DAAG aims to make RL more practical and effective for real-world applications, particularly in robotics and complex task environments. The DAAG framework utilizes a large language model to orchestrate the agent’s behavior and interactions with vision and diffusion models. The diffusion models transform the agent’s past experiences by modifying video data to align with new tasks. This process, called Hindsight Experience Augmentation, allows the agent to repurpose its experiences effectively, improving learning efficiency and enabling the agent to tackle new tasks more rapidly. The vision language model, CLIP, is fine-tuned using this augmented data, allowing it to act as a more accurate reward detector. The large language model breaks down tasks into manageable subgoals, guiding the diffusion model in creating relevant data modifications. Read the full Article: marktechpost.com/2024/08/02/… Paper: arxiv.org/abs/2407.20798
18
52
1,916
Meta AI Introduces Collaborative Reasoner (Coral): An AI Framework Specifically Designed to Evaluate and Enhance Collaborative Reasoning Skills in LLMs Meta AI introduces Collaborative Reasoner (Coral)—a framework specifically designed to evaluate and enhance collaborative reasoning skills in LLMs. Coral reformulates traditional reasoning problems into multi-agent, multi-turn tasks, where two agents must not only solve a problem but reach consensus through natural conversation. These interactions emulate real-world social dynamics, requiring agents to challenge incorrect conclusions, negotiate conflicting viewpoints, and arrive at joint decisions. The framework spans five domains, including mathematics (MATH), STEM multiple-choice (MMLU-Pro, GPQA), and social cognition (ExploreToM, HiToM). These tasks serve as testbeds for evaluating whether models can apply their reasoning abilities in a cooperative, dialogue-driven context....... Read full article: marktechpost.com/2025/04/19/… Paper: ai.meta.com/research/publica… @AIatMeta @Meta
1
18
55
2,177
Adaptive-RAG: Enhancing Large Language Models by Question-Answering Systems with Dynamic Strategy Selection for Query Complexity Quick read: marktechpost.com/2024/03/30/… Researchers from the School of Computing and Graduate School of AI, Korea Advanced Institute of Science and Technology, propose a novel adaptive QA framework, Adaptive-RAG, designed to bridge this gap. Adaptive-RAG utilizes a classifier to predict the complexity level of incoming queries, allowing the model to select the most apt strategy for information retrieval and integration. This adaptability streamlines the process for simpler questions, eliminating undue computational overhead and ensuring that complex queries receive the meticulous attention required. The model’s classifier, trained on a dataset with automatically assigned complexity labels, is the linchpin in this adaptive approach. Adaptive-RAG’s efficacy was validated on various open-domain QA datasets that spanned a wide range of query complexities. It demonstrated a notable enhancement in the efficiency and accuracy of QA systems across the board. For instance, in benchmarks involving the FLAN-T5 series models, Adaptive-RAG achieved a striking balance between computational efficiency and response accuracy. It outperformed traditional methods by reducing the time per query by up to 27.18 seconds for the most complex queries while ensuring high accuracy across simple, single-step, and multi-step questions. #ArtificialIntelligence
1
16
53
2,703
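The routing step can be sketched as a classifier feeding a strategy table. Everything named here is an assumption for illustration: `classify` is a crude word-count heuristic standing in for the trained classifier, and the three strategies just return strings instead of calling a retriever or model.

```python
def classify(query):
    """Stand-in for the trained complexity classifier (hypothetical heuristic)."""
    words = len(query.split())
    if words <= 4:
        return "simple"
    if " and " in query or words > 12:
        return "multi_step"
    return "single_step"


# One handler per complexity level; real handlers would call a retriever and an LLM.
STRATEGIES = {
    "simple": lambda q: f"answer directly: {q}",
    "single_step": lambda q: f"retrieve once, then answer: {q}",
    "multi_step": lambda q: f"retrieve iteratively, then answer: {q}",
}


def answer(query):
    return STRATEGIES[classify(query)](query)


easy = classify("Capital of France?")
medium = classify("Who wrote the novel that inspired Blade Runner?")
hard = classify("Who directed Alien and what year was the director born?")
```

The saving comes from the first branch: queries labeled simple skip retrieval entirely, so only the hard ones pay for iterative retrieval.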
Crab Framework Released: An AI Framework for Building LLM Agent Benchmark Environments in a Python-Centric Way Researchers from KAUST, Eigent.AI, UTokyo, CMU, Stanford, Harvard, Tsinghua, SUSTech, and Oxford have developed the Crab framework, a novel benchmarking tool designed to evaluate cross-environment tasks. This framework stands out by supporting functions that span multiple devices and platforms, such as desktops and mobile phones, and by incorporating a graph-based evaluation method that offers a more detailed and nuanced assessment of an agent’s performance. Unlike traditional benchmarks, the Crab framework allows for the simultaneous operation of agents across different environments, making it more reflective of the complexities agents face in real-world scenarios. The Crab framework introduces an innovative approach to task evaluation by decomposing complex tasks into smaller, manageable sub-tasks, each represented as nodes in a directed acyclic graph (DAG). This graph-based structure enables the sequential and parallel execution of sub-tasks, evaluated at multiple points rather than just at the end. This approach allows for assessing an agent’s performance at each task step, providing a more accurate picture of how well the agent functions across different environments. The flexibility of this method also accommodates multiple valid pathways to completing a task, ensuring a fairer and more comprehensive evaluation. Read our full take on CRAB: marktechpost.com/2024/08/10/… GitHub: github.com/camel-ai/crab Paper: arxiv.org/abs/2407.01511 @CamelAIOrg @CamelAIOrg @TianqiXu233 @zechengzh @Zhiqiang_Xie @YongchaoC @atasteoff @philiptorr @BernardSGhanem @guohao_li
1
12
50
4,450
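Graph-based evaluation of this kind can be sketched with the standard library's `graphlib`. This is not Crab's API: `completion_ratio`, the sub-task names, and the `state` dictionary are hypothetical, and the scoring rule (a node counts only if its prerequisites also count) is an assumed simplification.

```python
from graphlib import TopologicalSorter


def completion_ratio(dependencies, checks, state):
    """Fraction of sub-tasks whose check passes and whose prerequisites are all complete."""
    done = set()
    # static_order() yields every node after all of its prerequisites.
    for node in TopologicalSorter(dependencies).static_order():
        prerequisites = dependencies.get(node, ())
        if all(p in done for p in prerequisites) and checks[node](state):
            done.add(node)
    return len(done) / len(checks)


# Hypothetical cross-device task: node -> set of prerequisite nodes.
dependencies = {
    "open_calendar": set(),
    "read_event": {"open_calendar"},
    "send_message": {"read_event"},
}
checks = {
    "open_calendar": lambda s: s["calendar_open"],
    "read_event": lambda s: s["event"] is not None,
    "send_message": lambda s: s["message_sent"],
}
state = {"calendar_open": True, "event": "standup 10:00", "message_sent": False}
ratio = completion_ratio(dependencies, checks, state)
```

An agent that got two of three steps right scores two thirds here, where an end-state-only check would report plain failure.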
Meta AI Introduces AdaCache: A Training-Free Method to Accelerate Video Diffusion Transformers (DiTs) Researchers from Meta AI and Stony Brook University introduced an innovative solution called Adaptive Caching (AdaCache), which accelerates video diffusion transformers without additional training. AdaCache is a training-free technique that can be integrated into various video DiT models to streamline processing times by dynamically caching computations. By adapting to the unique needs of each video, this approach allows AdaCache to allocate computational resources where they are most effective. AdaCache is built to optimize latency while preserving video quality, making it a flexible, plug-and-play solution for improving performance across different video generation models. AdaCache operates by caching certain residual computations within the transformer architecture, allowing these calculations to be reused across multiple steps. This approach is particularly efficient because it avoids redundant processing steps, a common bottleneck in video generation tasks. The model uses a caching schedule tailored for each video to determine the best points for recomputing or reusing residual data. This schedule is based on a metric that assesses the data change rate across frames. Further, the researchers incorporated a Motion Regularization (MoReg) mechanism into AdaCache, which allocates more computational resources to high-motion scenes that require finer attention to detail. By using a lightweight distance metric and a motion-based regularization factor, AdaCache balances the trade-off between speed and quality, adjusting computational focus based on the video’s motion content.... Read the full article: marktechpost.com/2024/11/06/… Paper: arxiv.org/abs/2411.02397 Code: github.com/AdaCache-DiT/AdaC… Project: adacache-dit.github.io/ @AIatMeta @Meta @kkahatapitiy @HaoZhe65347 @menglin_jia @ryoo_michael
Google DeepMind Researchers Propose GenRM: Training Verifiers with Next-Token Prediction to Leverage the Text Generation Capabilities of LLMs Researchers from Google DeepMind, University of Toronto, MILA and UCLA have introduced a novel approach called Generative Reward Modeling (GenRM). This method redefines the verification process by framing it as a next-token prediction task, a fundamental capability of LLMs. Unlike traditional discriminative RMs, GenRM integrates the text-generation strengths of LLMs into the verification process, allowing the model to generate and evaluate potential solutions simultaneously. This approach also supports Chain-of-Thought (CoT) reasoning, where the model generates intermediate reasoning steps before arriving at a final decision. The GenRM method, therefore, not only assesses the correctness of solutions but also enhances the overall reasoning process by enabling more detailed and structured evaluations. The GenRM methodology employs a unified training approach combining solution generation and verification. This is achieved by training the model to predict the correctness of a solution through next-token prediction, a technique that leverages the inherent generative abilities of LLMs. In practice, the model generates intermediate reasoning steps—CoT rationales—which are then used to verify the final solution. This process integrates seamlessly with existing AI training techniques, allowing for the simultaneous improvement of generation and verification capabilities. Furthermore, the GenRM model benefits from additional inference-time computation, such as majority voting aggregating multiple reasoning paths to arrive at the most accurate solution.... Read our full take on this: marktechpost.com/2024/09/02/… Paper: arxiv.org/abs/2408.15240 @GoogleDeepMind
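The inference-time majority voting mentioned in this post can be sketched as follows. The rationale strings, the "Verdict: Yes/No" ending, and the function names are assumptions made for illustration; a real verifier would sample these rationales from an LLM.

```python
def verdict(rationale):
    """Read the final Yes/No token from a sampled verification rationale."""
    tokens = rationale.strip().rstrip(".").split()
    return bool(tokens) and tokens[-1].lower() == "yes"


def vote_score(rationales):
    """Fraction of sampled rationales that judge the solution correct."""
    return sum(verdict(r) for r in rationales) / len(rationales)


def pick_best(candidates):
    """Choose the candidate solution with the highest vote score."""
    return max(candidates, key=lambda name: vote_score(candidates[name]))


# Hypothetical sampled rationales for two candidate solutions.
samples = {
    "solution_a": [
        "Step 2 drops a sign. Verdict: No",
        "All steps check out. Verdict: Yes",
        "The final sum is off by one. Verdict: No",
    ],
    "solution_b": [
        "Each step follows. Verdict: Yes",
        "The arithmetic is consistent. Verdict: Yes",
        "Step 3 looks unjustified. Verdict: No.",
    ],
}
best = pick_best(samples)
```

Averaging several sampled verdicts trades extra inference compute for a less noisy correctness estimate, which is the trade-off the post describes.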
CMU Researchers Propose New Web AI Agents that Use APIs Instead of Traditional Browsers Researchers from Carnegie Mellon University have introduced two innovative types of agents to enhance web task performance: ✅ API-calling agent: The API-calling agent completes tasks solely through APIs, interacting directly with data in formats like JSON or XML, which bypasses the need for human-like browsing actions. ✅ Hybrid Agent: Due to the limitations of API-only methods, the team also developed a Hybrid Agent, which can seamlessly alternate between API calls and traditional web browsing based on task requirements. This hybrid approach allows the agent to leverage APIs for efficient, direct data retrieval when available and switch to browsing when API support is limited or incomplete. By integrating both methods, this flexible model enhances speed, precision, and adaptability, allowing agents to navigate the web more effectively and tackle various tasks across diverse online environments. The technology behind the hybrid agent is engineered to optimize data retrieval. By relying on API calls, agents can bypass traditional navigation sequences, retrieving structured data directly. This method also supports dynamic switching, where agents transition to GUI navigation when encountering unstructured or undocumented online content. This adaptability is particularly useful on websites with inconsistent API support, as the agent can revert to browsing to perform actions where APIs are absent. The dual-action capability improves agent versatility, enabling it to handle a wider array of web tasks by adapting its approach based on the available interaction formats....
Read the full article here: marktechpost.com/2024/10/25/… Paper: arxiv.org/abs/2410.16464 Project: yueqis.github.io/API-Based-A… Code: github.com/yueqis/API-Based-… Listen to the podcast on this research, created with the help of NotebookLM and, of course, with the help of our team, who generated the prompts and entered the right information: piped.video/watch?v=YM7zwcSl… @yueqi_song @frankxu2004 @shuyanzhxyc @gneubig #AI @CarnegieMellon
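The API-first, browse-as-fallback behavior of the Hybrid Agent can be sketched in a few lines. Everything here is hypothetical: the `ApiUnavailable` exception, the registry of per-site API wrappers, and the `browse` stub stand in for components the post does not specify, and in the actual research the agent itself decides when to switch rather than following a fixed rule.

```python
class ApiUnavailable(Exception):
    """Raised by a hypothetical API wrapper when no endpoint covers the request."""


def hybrid_fetch(task, api_registry, browse):
    """Prefer a structured API call; fall back to browsing when none works."""
    api_call = api_registry.get(task["site"])
    if api_call is not None:
        try:
            return {"via": "api", "data": api_call(task["query"])}
        except ApiUnavailable:
            pass  # the API exists but does not cover this request
    return {"via": "browser", "data": browse(task["site"], task["query"])}


def search_api(query):
    """Stub for a site with full API support; returns structured data."""
    return {"results": [query.upper()]}


def partial_api(query):
    """Stub for a site whose API does not cover this kind of request."""
    raise ApiUnavailable("no endpoint for this request")


def browse(site, query):
    """Stub for GUI navigation; returns unstructured page text."""
    return f"page text from {site} for '{query}'"


registry = {"shop.example": search_api, "forum.example": partial_api}
via_api = hybrid_fetch({"site": "shop.example", "query": "lamp"}, registry, browse)
via_fallback = hybrid_fetch({"site": "forum.example", "query": "lamp"}, registry, browse)
no_api = hybrid_fetch({"site": "blog.example", "query": "lamp"}, registry, browse)
```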
MegaAgent: A Practical AI Framework Designed for Autonomous Cooperation in Large-Scale LLM Agent Systems Researchers from the National University of Singapore, Shanghai Jiao Tong University, the University of California, Berkeley, and the South China University of Technology introduced MegaAgent—a framework designed to revolutionize LLM-MA systems by enhancing their autonomy and scalability. MegaAgent distinguishes itself by enabling dynamic task splitting and parallel execution among agents, a significant departure from the traditional sequential models. This framework operates without predefined SOPs, allowing it to adapt to the needs of each task and manage a much larger number of agents effectively. By introducing system-level parallelism, MegaAgent facilitates real-time communication and coordination among agents, ensuring that even complex tasks are completed efficiently. MegaAgent’s architecture is built around a hierarchical structure that divides tasks into smaller sub-tasks, each managed by different agent groups. The framework employs a ‘boss’ agent responsible for receiving the main task, dividing it into sub-tasks, and assigning these to ‘admin’ agents. These admin agents then generate groups of agents to complete the sub-tasks, ensuring that each task is handled with a high degree of specialization. This multi-level approach allows MegaAgent to operate in parallel, significantly reducing the time required to complete tasks. For instance, in one experiment, MegaAgent successfully generated and coordinated 590 agents within 3000 seconds to simulate national policy development, a feat unmatched by other existing models.... Read our full take on this: marktechpost.com/2024/08/21/… Paper: arxiv.org/abs/2408.09955 Code: anonymous.4open.science/r/Me…
Optimizing Agent Planning: A Parametric AI Approach to World Knowledge Quick read: marktechpost.com/2024/05/27/… Paper: arxiv.org/abs/2405.14205
Internet of Agents (IoA): A Novel Artificial Intelligence AI Framework for Agent Communication and Collaboration Inspired by the Internet 🌐 Internet-Inspired Architecture: Just like how the internet connects people, IoA can connect different AI agents across different environments. 🤝 Autonomous Nested Team Formation: Agents can form teams and sub-teams on their own, adapting to complex tasks. 🧩 Heterogeneous Agent Integration: Brings together agents with different skills and backgrounds, kind of like assembling an all-star team. ⏳ Asynchronous Task Execution: Agents can multitask, making the whole system more efficient. 🗣️ Adaptive Conversation Flow: The conversation flow is autonomously managed to keep agent conversations structured but flexible. 🔄 Scalable and Extensible: Easy to add new types of agents or tackle different kinds of tasks. Researchers from Tsinghua University, Peking University, Beijing University of Posts and Telecommunications, and Tencent propose the Internet of Agents (IoA) framework to enhance LLM-based multi-agent collaboration. IoA overcomes existing limitations by integrating diverse third-party agents across multiple devices, using an instant messaging-like architecture for dynamic teaming and flexible communication. Inspired by Speech Act Theory, IoA employs a finite-state machine for conversation flow control. Experiments show IoA outperforms state-of-the-art baselines in general tasks, embodied AI, and retrieval-augmented generation benchmarks, achieving superior performance and highlighting its potential for sophisticated, distributed multi-agent systems. Article: marktechpost.com/2024/07/11/… Paper: arxiv.org/abs/2407.07061
1/2 Critic-RM: A Self-Critiquing AI Framework for Enhanced Reward Modeling and Human Preference Alignment in LLMs Critic-RM, developed by researchers from GenAI, Meta, and Georgia Institute of Technology, enhances reward models through self-generated critiques, eliminating the need for strong LLM teachers. It employs a two-stage process: generating critiques with discrete scores and filtering them using consistency-based methods aligned with human preferences. A weighted training strategy balances critique modeling and reward prediction, ensuring accuracy and robustness. Critic-RM improves reward modeling accuracy by 3.7%–7.3% on benchmarks like RewardBench and CrossEval and enhances reasoning accuracy by 2.5%–3.2%. This framework demonstrates strong performance across diverse tasks, leveraging high-quality critiques to refine predictions and correct flawed reasoning. The Critic-RM framework enhances reward model training by incorporating critiques as intermediate variables between responses and final rewards. It involves critique generation using an instruction-finetuned LLM, followed by filtering and refinement to ensure high-quality critiques aligned with human preferences. The reward model is trained on preference modeling and critique generation objectives, with a dynamic weighting scheme to balance both during training. During inference, the model generates critiques and predicts rewards based on responses augmented with these critiques. Inference-time scaling improves performance by averaging rewards over multiple generated critiques with non-zero temperatures. Read the full article here: marktechpost.com/2024/12/08/… Paper: arxiv.org/abs/2411.16646 @AIatMeta @yue___yu @astonzhangAZ @ChenguangZhu2 @yzpang_ @ssgrn @chaozhangcs @magpie_rayhou
Google DeepMind Researchers Introduce TacticAI: A New Deep Learning System that is Reinventing Football Strategy Quick read: marktechpost.com/2024/03/23/… Football has always been a game of tactical brilliance and strategic genius. From the dugouts of your local parks to the hallowed turf of the biggest stadiums, coaches are constantly tinkering with formations, set-piece routines, and game plans – all in pursuit of that elusive winning edge. But in the modern era, the battle for footballing supremacy is no longer just about the intuition of brilliant minds. It’s being reshaped by an unexpected force: artificial intelligence. For years, football clubs at the highest levels have turned to data analytics to squeeze every advantage from reams of match footage and player tracking data. AI researchers are taking the game to a new level with geometric deep learning. DeepMind Researchers introduce TacticAI, an AI assistant designed to optimize one of football’s biggest set-piece weapons: the corner kick. To the untrained eye, a corner kick is organized chaos – players swarming the box, bodies jostling for position, the whipped delivery causing a brief movement. However, for the algorithms of TacticAI, it’s a complex physics problem that is just waiting to be solved through data and prediction. By analyzing countless examples of corner kick situations and outcomes, TacticAI’s deep learning models have learned to predict multiple vital factors, such as where attackers are likely to dart towards to receive the ball, which opponents pose the biggest threat for a counter-attack, and perhaps most crucially – where the attacking team’s players should position themselves for the optimal chance of scoring. #ArtificialInteligence @GoogleDeepMind
Meet MAGVIT: A Novel Masked Generative Video Transformer To Address AI Video Generation Tasks Quick Read: marktechpost.com/2023/01/22/… #artificalintelligence #ArtificialIntelligence #bigdata #MachineLearning #TechNews #Trending
🚀 Exciting news from the #AI world! Researchers from UC Berkeley and Google have introduced a groundbreaking AI framework that reimagines visual question answering as modular code generation. 📖 Quick read: marktechpost.com/2023/06/16/… 🔬 Dive deeper into the paper: arxiv.org/abs/2306.05392 💻 Explore the code on Github: github.com/sanjayss34/codevq… For more cool AI tools, don't forget to visit aitoolsclub.com. Stay curious, stay informed! 🧠💡 #ArtificialIntelligence #MachineLearning @sanjayssub
Gradformer: A Machine Learning Method that Integrates Graph Transformers (GTs) with the Intrinsic Inductive Bias by Applying an Exponential Decay Mask to the Attention Matrix Quick read: marktechpost.com/2024/04/30/… Researchers from Wuhan University (China), JD Explore Academy (China), The University of Melbourne, and Griffith University, Brisbane, proposed Gradformer, a novel method that integrates GTs with inductive bias. Gradformer introduces an exponential decay mask into the GT self-attention architecture. The mask is multiplied with the attention scores, which controls how much attention each node gives to other nodes. Because the attention weights shrink gradually under exponential decay, the mask guides the learning process within the self-attention framework. Gradformer achieves state-of-the-art results on five datasets, highlighting the efficiency of the proposed method. When tested on small datasets like NCI1 and PROTEINS, it outperforms all 14 compared methods with improvements of 2.13% and 2.28%, respectively. This shows that Gradformer effectively incorporates inductive biases into the GT model, which becomes important when available data is limited. Moreover, it performs well on big datasets such as ZINC, which shows that it applies to datasets of different sizes. #artificialintelligence #ai #datascience
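A minimal sketch of an exponential decay mask applied to attention scores. It assumes the decay is driven by hop distance between nodes (mask entry `gamma ** distance`) and that unreachable pairs get a mask of 0; the post only says the mask decays exponentially and is multiplied with the attention score, so both choices, along with the function names and `gamma=0.5`, are illustrative.

```python
from collections import deque


def hop_distances(adjacency, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def decay_mask(adjacency, gamma=0.5):
    """mask[i][j] = gamma ** hops(i, j); unreachable pairs get 0.0 (an assumption)."""
    nodes = sorted(adjacency)
    mask = []
    for i in nodes:
        dist = hop_distances(adjacency, i)
        mask.append([gamma ** dist[j] if j in dist else 0.0 for j in nodes])
    return mask


def apply_mask(scores, mask):
    """Elementwise product of attention scores and the decay mask."""
    return [[s * m for s, m in zip(s_row, m_row)] for s_row, m_row in zip(scores, mask)]


# Path graph 0 - 1 - 2 with uniform attention scores.
graph = {0: [1], 1: [0, 2], 2: [1]}
mask = decay_mask(graph, gamma=0.5)
masked = apply_mask([[1.0, 1.0, 1.0]] * 3, mask)
```

With uniform scores, the masked matrix shows the inductive bias directly: each node weights itself most, its neighbors less, and distant nodes least.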
iAsk Ai Outperforms ChatGPT and All Other AI Models on MMLU Pro Test iAsk Ai has quickly become a leader in AI search. iAsk Ai’s search engine is powered by iAsk Pro, their latest model that has outperformed top competitors like OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini Pro, as shown by its record-breaking results on the MMLU Pro benchmark test. In less than two years, iAsk Ai has processed 325 million searches and now handles 1.5 million searches daily, proving its efficiency in delivering fast and accurate answers. One of iAsk Ai’s most significant achievements is its outstanding performance on the MMLU Pro benchmark test, where its Pro version scored an impressive 85.85% accuracy. This result outperformed the previous best score set by GPT-4o by 12 percentage points, showcasing iAsk Pro’s superiority. Additionally, iAsk Pro achieved a superhuman performance of 93.89% on the traditional MMLU benchmark, surpassing the accuracy of the top 10% of human experts..... Read our full take on this: marktechpost.com/2024/08/28/… Details: iask.ai/
Meet Open Deep Search (ODS): A Plug-and-Play Framework Democratizing Search with Open-source Reasoning Agents Researchers from the University of Washington, Princeton University, and UC Berkeley have introduced Open Deep Search (ODS)—an open-source search AI framework designed for seamless integration with any user-selected LLM in a modular manner. ODS comprises two central components: the Open Search Tool and the Open Reasoning Agent. Together, these components substantially improve the capabilities of the base LLM by enhancing content retrieval and reasoning accuracy. The Open Search Tool distinguishes itself through an advanced retrieval pipeline, featuring an intelligent query rephrasing method that better captures user intent by generating multiple semantically related queries. This approach notably improves the accuracy and diversity of search results. Furthermore, the tool employs refined chunking and re-ranking techniques to systematically filter search results according to relevance. Complementing the retrieval component, the Open Reasoning Agent operates through two distinct methodologies: the Chain-of-thought ReAct agent and the Chain-of-code CodeAct agent. These agents interpret user queries, manage tool usage—including searches and calculations—and produce comprehensive, contextually accurate responses..... Read full article: marktechpost.com/2025/03/27/… Paper: arxiv.org/abs/2503.20201 GitHub Page: github.com/sentient-agi/Open…
Turing-Complete-RAG (TC-RAG): A Breakthrough Framework Enhancing Accuracy and Reliability in Medical LLMs Through Dynamic State Management and Adaptive Retrieval Researchers from Peking University, Zhongnan University of Economics and Law, University of Chinese Academy of Sciences, and University of Electronic Science and Technology of China have introduced a novel Turing-Complete-RAG (TC-RAG) framework. This system is designed to address the shortcomings of traditional RAG methods by incorporating a Turing Complete approach to manage state variables dynamically. This innovation allows the system to control and halt the retrieval process effectively, preventing the accumulation of erroneous knowledge. By leveraging a memory stack system with adaptive retrieval and reasoning capabilities, TC-RAG ensures that the retrieval process reliably converges on an optimal conclusion, even in complex medical scenarios. The TC-RAG system employs a sophisticated memory stack that monitors and manages the retrieval process through actions like push and pop, which are integral to its adaptive retrieval and reasoning capabilities. This stack-based approach allows the system to selectively remove irrelevant or harmful information, thereby avoiding the accumulation of errors. By maintaining a dynamic and responsive memory system, TC-RAG enhances the LLM’s ability to plan and reason effectively, similar to how medical professionals approach complex cases. The system’s ability to adapt to the evolving context of a query and make real-time decisions based on the current state of knowledge marks a significant improvement over existing methods. Read our full take on this: marktechpost.com/2024/08/24/… Paper: arxiv.org/abs/2408.09199
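The push/pop memory stack and the halting behavior described in this post can be sketched as below. This is an illustration, not the TC-RAG implementation: the numeric "uncertainty" attached to each snippet, the threshold, and the relevance predicate are hypothetical stand-ins for state that the real system derives from the LLM.

```python
class MemoryStack:
    """A stack of (text, uncertainty) pairs with a readable top-of-stack state."""

    def __init__(self):
        self.items = []

    def push(self, text, uncertainty):
        self.items.append((text, uncertainty))

    def pop(self):
        return self.items.pop() if self.items else None

    def state(self):
        """Uncertainty of the top item; infinite when the stack is empty."""
        return self.items[-1][1] if self.items else float("inf")


def retrieve_until_confident(steps, is_relevant, threshold=0.3, max_steps=10):
    """Push each retrieved snippet, pop irrelevant ones, halt once state is low."""
    stack = MemoryStack()
    for text, uncertainty in steps[:max_steps]:
        stack.push(text, uncertainty)
        if not is_relevant(text):
            stack.pop()  # discard noise before it can accumulate
            continue
        if stack.state() <= threshold:
            break  # halting condition reached
    return [text for text, _ in stack.items]


# Hypothetical retrieval trace with made-up uncertainty values.
trace = [
    ("dosing guideline", 0.8),
    ("unrelated forum post", 0.9),
    ("contraindication list", 0.2),
    ("extra note", 0.1),
]
kept = retrieve_until_confident(trace, lambda text: "unrelated" not in text)
```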
LLaMA-Omni: A Novel AI Model Architecture Designed for Low-Latency and High-Quality Speech Interaction with LLMs Researchers from the University of Chinese Academy of Sciences introduced LLaMA-Omni, a model architecture designed to achieve low-latency, high-quality speech interaction with LLMs. This approach integrates a speech encoder, speech adaptor, LLM, and streaming speech decoder to enable seamless speech-to-speech communication. The model processes speech input directly through the encoder and adaptor before feeding it into the LLM, bypassing the need for intermediate text transcription. A non-autoregressive streaming Transformer serves as the speech decoder, utilizing connectionist temporal classification to predict discrete units corresponding to the speech response. This architecture allows for the simultaneous generation of text and speech outputs, significantly reducing response latency. To support the development and evaluation of this model, researchers created the InstructS2S-200K dataset, tailored specifically for speech interaction scenarios. LLaMA-Omni’s architecture consists of four main components: a speech encoder, a speech adaptor, an LLM, and a speech decoder. The speech encoder, based on Whisper-large-v3, extracts meaningful representations from the user’s speech input. These representations are then processed by the speech adaptor, which maps them into the LLM’s embedding space through downsampling and a two-layer perceptron. The LLM, based on Llama-3.1-8B-Instruct, generates text responses directly from the speech instruction. The speech decoder, a non-autoregressive streaming Transformer, takes the LLM’s output hidden states and uses connectionist temporal classification (CTC) to predict discrete units corresponding to the speech response.... Read our take on this: marktechpost.com/2024/09/15/… Paper: arxiv.org/abs/2409.06666
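For readers unfamiliar with CTC, here is a minimal sketch of standard greedy CTC decoding, the generic step that turns per-frame scores into a sequence of discrete units. It is not LLaMA-Omni's decoder: the blank index of 0, the three-class score rows, and the function name are assumptions for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    collapsed = [unit for unit, _ in groupby(best)]  # merge consecutive repeats
    return [unit for unit in collapsed if unit != BLANK]


# Made-up per-frame scores over {blank, unit 1, unit 2}.
frames = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.0, 0.3, 0.7],
    [0.6, 0.2, 0.2],
]
units = ctc_greedy_decode(frames)
```

The blank between the second and fourth frames is what lets the same unit appear twice in a row in the output; without it the two runs of unit 1 would merge into one.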
NVIDIA Researchers Introduce Flextron: A Network Architecture and Post-Training Model Optimization Framework Supporting Flexible AI Model Deployment Researchers from NVIDIA and the University of Texas at Austin introduced FLEXTRON, a novel flexible model architecture and post-training optimization framework. FLEXTRON is designed to support adaptable model deployment without requiring additional fine-tuning, thus addressing the inefficiencies of traditional methods. This architecture employs a nested elastic structure, allowing it to adjust dynamically to specific latency and accuracy targets during inference. This adaptability makes using a single pre-trained model across various deployment scenarios possible, significantly reducing the need for multiple model variants. FLEXTRON transforms a pre-trained LLM into an elastic model through a sample-efficient training method and advanced routing algorithms. The transformation process includes ranking and grouping network components and training routers that manage sub-network selection based on user-defined constraints such as latency and accuracy. This innovative approach enables the model to automatically select the optimal sub-network during inference, ensuring efficient and accurate performance across different computational environments. Quick read: marktechpost.com/2024/07/17/… Paper: arxiv.org/abs/2406.10260 @nvidia @ccccrs_0908 @srv_m @gLeHeinrich @yin_hongxu @VITAGroupUT @jankautz @PavloMolchanov
Researchers from Yale and Google Introduce HyperAttention: An Approximate Attention Mechanism Accelerating Large Language Models for Efficient Long-Range Sequence Processing Quick Read: marktechpost.com/2023/10/15/… Paper: arxiv.org/abs/2310.05869 If you like our work, you will love our newsletter: marktechpost-newsletter.beeh… #ArtificialIntelligence
Prometheus 2: An Open Source Language Model that Closely Mirrors Human and GPT-4 Judgements in Evaluating Other Language Models The research team from KAIST AI, LG AI Research, Carnegie Mellon University, MIT, Allen Institute for AI, and the University of Illinois Chicago introduced Prometheus 2, a novel open-source evaluator designed to assess other language models. This model was developed to provide transparent, scalable, and controllable assessments while matching the evaluation quality of proprietary models. Prometheus 2 was developed by merging two evaluator LMs: one trained exclusively for direct assessment and another for pairwise ranking. The merging of these models created a unified evaluator that excels in both evaluation formats. The researchers utilized the newly developed Preference Collection dataset, which features 1,000 evaluation criteria, to refine the model’s capabilities further. By effectively combining the two training formats, Prometheus 2 can evaluate LM responses using direct assessment and pairwise ranking methods. The merged model leverages a linear merging approach to blend the strengths of both evaluation formats, achieving high performance across evaluation tasks. Quick read: marktechpost.com/2024/05/04/… Paper: arxiv.org/abs/2405.01535 GitHub Page: github.com/prometheus-eval/p… #artificalintelligence @seungonekim @scott_sjy @ShayneRedford @billyuchenlin @jay_shin @wellecks @gneubig @Kyungjae__Lee @seo_minjoon
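Linear merging, as mentioned in this post, amounts to a weighted average of two models' parameters. The sketch below uses plain Python lists as stand-ins for weight tensors; the parameter name, the values, and the default `alpha=0.5` are made up for illustration and are not the settings used for Prometheus 2.

```python
def linear_merge(weights_a, weights_b, alpha=0.5):
    """Blend two models' weights: alpha * A + (1 - alpha) * B, per parameter."""
    if weights_a.keys() != weights_b.keys():
        raise ValueError("models must share the same parameter names")
    return {
        name: [alpha * a + (1.0 - alpha) * b
               for a, b in zip(weights_a[name], weights_b[name])]
        for name in weights_a
    }


# Toy "weights" for a direct-assessment and a pairwise-ranking evaluator.
direct = {"layer.w": [1.0, 3.0]}
pairwise = {"layer.w": [3.0, 1.0]}
merged = linear_merge(direct, pairwise)
skewed = linear_merge(direct, pairwise, alpha=0.25)
```

This only works when both models share the same architecture and parameter names, which is why the sketch rejects mismatched key sets.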
Symbolic Chain-of-Thought ‘SymbCoT’: A Fully LLM-based Framework that Integrates Symbolic Expressions and Logic Rules with CoT Prompting Researchers from the National University of Singapore, the University of California, and the University of Auckland introduce the Symbolic Chain-of-Thought (SymbCoT) framework, which combines symbolic expressions with CoT prompting to enhance logical reasoning in LLMs. SymbCoT overcomes the challenges of existing methods by incorporating symbolic representation and rules, leading to significant reasoning enhancement. The innovative design of SymbCoT offers a more versatile and efficient solution for complex reasoning tasks, surpassing existing baselines like CoT and Logic-LM in performance metrics. SymbCoT uses symbolic structures and rules to guide reasoning processes, enhancing the model’s ability to tackle complex logical tasks. The framework employs a plan-then-solve approach, dividing questions into smaller components for efficient reasoning. It details the computational resources required for implementation, showcasing the scalability and practicality of the proposed method. Quick read: marktechpost.com/2024/06/02/… Paper: arxiv.org/abs/2405.18357
Meet CodeMind: A Machine Learning Framework Designed to Gauge the Code Reasoning Abilities of LLMs Quick read: marktechpost.com/2024/03/03/… A team of researchers from the University of Illinois at Urbana-Champaign introduced CodeMind, a groundbreaking framework meticulously designed to evaluate the code reasoning abilities of LLMs. CodeMind diverges from the traditional test-passing rate benchmarks, offering a nuanced approach to assess models’ proficiency in understanding complex code structures, debugging, and optimization. This framework heralds a new era in the computational assessment of LLMs, emphasizing the importance of reasoning in programming tasks beyond mere code generation. Paper: arxiv.org/abs/2402.09664 #ArtificialIntelligence #DataScience
Multimodal AI on Developer GPUs: Alibaba Releases Qwen2.5-Omni-3B with 50% Lower VRAM Usage and Nearly-7B Model Performance Alibaba has released Qwen2.5-Omni-3B, a 3-billion parameter variant of its Qwen2.5-Omni model family. Designed for use on consumer-grade GPUs—particularly those with 24GB of memory—this model introduces a practical alternative for developers building multimodal systems without large-scale computational infrastructure. Qwen2.5-Omni-3B is a transformer-based model that supports multimodal comprehension across text, images, and audio-video input. It shares the same design philosophy as its 7B counterpart, utilizing a modular approach where modality-specific input encoders are unified through a shared transformer backbone. Notably, the 3B model reduces memory overhead substantially, achieving over 50% reduction in VRAM consumption when handling long sequences (~25,000 tokens)..... Read full article here: marktechpost.com/2025/04/30/… GitHub: github.com/QwenLM/Qwen2.5-Om… Hugging Face Page: huggingface.co/Qwen/Qwen2.5-… Modelscope: modelscope.cn/models/Qwen/Qw… @Alibaba_Qwen
Microsoft AI Proposes CoT-Influx: A Novel Machine Learning Approach that Pushes the Boundary of Few-Shot Chain-of-Thoughts (CoT) Learning to Improve LLM Mathematical Reasoning Quick read: marktechpost.com/2024/03/26/… A research team from Hong Kong University and Microsoft has proposed CoT-Influx. This novel approach introduces a more effective use of few-shot learning to boost LLM math reasoning capabilities. Leveraging a coarse-to-fine pruning mechanism, CoT-Influx aims to maximize the input of effective and concise CoT examples within the confines of existing context windows. This approach allows for more helpful CoT examples and ensures that each example comprises informative tokens. The development of CoT-Influx involved the creation of a specialized math reasoning dataset, MRD3, featuring problems that span over a wide range of difficulty levels and reasoning steps. This dataset is the foundation for training a specialized pruner tailored for math reasoning tasks. The pruner operates in two pivotal stages—initially selecting the quintessential CoT examples from a vast pool and subsequently pruning the superfluous tokens to conform to the original context window’s constraints. By adopting this dual-phase pruning strategy, CoT-Influx effectively doubles the context window’s capacity for useful CoT examples without incurring additional computational overhead or complexity. #ArtificialIntelligence
Microsoft Introduces AutoDev: A Fully Automated Artificial Intelligence-Driven Software Development Framework Quick read: marktechpost.com/2024/03/19/… Microsoft researchers present AutoDev, which empowers AI agents to tackle a broad spectrum of software engineering tasks autonomously, from intricate code editing and comprehensive testing to advanced git operations. This framework is designed to focus on autonomy, efficiency, and security. By housing operations within Docker containers, AutoDev ensures that development processes are streamlined and secure, safeguarding user privacy and project integrity through meticulously designed guardrails. AutoDev’s approach is underpinned by its capacity to delegate complex software engineering objectives to AI agents. These agents, equipped with diverse tools and operations, navigate through tasks with remarkable autonomy. Whether it involves editing files, compiling code, or executing tests, AutoDev’s AI agents manage these operations seamlessly, providing a comprehensive solution that addresses the multifaceted needs of modern software development. This level of automation introduces a new paradigm in software engineering, where AI takes on a more central role, enabling developers to concentrate on higher-level strategic tasks. #ArtificialIntelligence #SoftwareEngineering #SoftwareDevelopment @MSFTResearch
Meta AI Proposes Multi-Token Attention (MTA): A New Attention Method which Allows LLMs to Condition their Attention Weights on Multiple Query and Key Vectors MTA integrates convolution operations over queries, keys, and attention heads, thus enhancing the precision and efficiency of contextual information retrieval. Specifically, the MTA framework consists of two convolutional components: key-query convolution, which aggregates multiple token signals within individual attention heads, and head mixing convolution, which facilitates information sharing among different attention heads. Additionally, the implementation employs group normalization with depth-dependent scaling to stabilize gradient flow, further improving model training stability and efficacy. At a technical level, MTA modifies conventional attention calculations by incorporating a two-dimensional convolution operation on the attention logits prior to softmax normalization. This convolution allows adjacent queries and keys to influence attention scores mutually, thus enabling the attention mechanism to identify contextual relationships involving multiple tokens more precisely. Consequently, the model efficiently aggregates local token interactions without substantially increasing the number of parameters or the dimensionality of attention vectors. Moreover, head convolution promotes effective knowledge transfer among attention heads, selectively amplifying relevant context signals while mitigating less pertinent information. Collectively, these enhancements yield a more robust attention mechanism capable of capturing complex multi-token interactions....... Read full article: marktechpost.com/2025/04/01/… Paper: arxiv.org/abs/2504.00927 @AIatMeta @Meta
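The key-query convolution step described in this post can be sketched in pure Python as a 2D convolution over the attention logits followed by a row-wise softmax. This is a simplified illustration: it uses a centered kernel with zero padding and omits causal masking, head mixing, and group normalization, and the kernel values and function names are assumptions rather than details from the paper.

```python
import math


def conv2d_same(logits, kernel):
    """2D convolution over a [query][key] logit matrix with zero padding."""
    rows, cols = len(logits), len(logits[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0.0
            for a in range(k_rows):
                for b in range(k_cols):
                    qi, kj = i + a - k_rows // 2, j + b - k_cols // 2
                    if 0 <= qi < rows and 0 <= kj < cols:
                        total += kernel[a][b] * logits[qi][kj]
            out[i][j] = total
    return out


def softmax_rows(matrix):
    """Numerically stable softmax applied to each row."""
    result = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def multi_token_attention_weights(logits, kernel):
    """Mix neighboring query/key logits, then normalize each query row."""
    return softmax_rows(conv2d_same(logits, kernel))


logits = [[2.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 3.0]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
smoothing = [[0.0, 0.1, 0.0], [0.1, 0.6, 0.1], [0.0, 0.1, 0.0]]
plain = multi_token_attention_weights(logits, identity)
mixed = multi_token_attention_weights(logits, smoothing)
```

With the identity kernel the result is ordinary softmax attention, so the convolution can be read as a generalization: any other kernel lets each score depend on the scores of nearby queries and keys.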
Protein Annotation-Improved Representations (PAIR): A Flexible Fine-Tuning Framework that Employs a Text Decoder to Guide the Fine-Tuning Process of the Encoder Researchers from the University of Toronto and the Vector Institute conducted a study that enhanced PLMs by fine-tuning them with text annotations from UniProt, focusing on nineteen types of expert-curated data. They introduced the Protein Annotation-Improved Representations (PAIR) framework, which uses a text decoder to guide the model’s training. PAIR significantly improved the models’ performance on function prediction tasks, even outperforming the BLAST search algorithm, especially for proteins with low sequence similarity to training data. This approach highlights the potential of incorporating diverse text-based annotations to advance protein representation learning. The PAIR framework enhances protein function prediction by fine-tuning pre-trained transformer models, like ESM and ProtT5, on high-quality annotations from databases like Swiss-Prot. By integrating a cross-attention module, PAIR allows text tokens to attend to amino acid sequences, improving the relationship between protein sequences and their annotations. PAIR significantly outperforms traditional methods like BLAST, especially for proteins with low sequence similarity, and shows strong generalization to new tasks. Its ability to handle limited data scenarios makes it a valuable tool in bioinformatics and protein function prediction. Read our full take on PAIR: marktechpost.com/2024/08/04/… Paper: biorxiv.org/content/10.1101/… Model: huggingface.co/h4duan/PAIR-e… @cjmaddison @SergeiIakhnin @KevinKaichuang @kchonyc @mmbronstein @befcorreia @andrewwhite01 @nc_frey @jchodera @ianfoster @kmjablonka @pschwllr @VectorInst @TorontoSRI @UofT @acceleration_c @UofTCompSci
6
40
1,597
NVIDIA AI Released DiffusionRenderer: An AI Model for Editable, Photorealistic 3D Scenes from a Single Video In a groundbreaking new paper, researchers at NVIDIA, University of Toronto, Vector Institute and the University of Illinois Urbana-Champaign have unveiled a framework that directly tackles this challenge. DiffusionRenderer represents a revolutionary leap forward, moving beyond mere generation to offer a unified solution for understanding and manipulating 3D scenes from a single video. It effectively bridges the gap between generation and editing, unlocking the true creative potential of AI-driven content. DiffusionRenderer treats the “what” (the scene’s properties) and the “how” (the rendering) in one unified framework built on the same powerful video diffusion architecture that underpins models like Stable Video Diffusion..... Read full article here: marktechpost.com/2025/07/10/… Paper: pxl.to/wpq77e8 GitHub Page: pxl.to/911aijj @nvidia @NVIDIAAI @nvidianewsroom @NVIDIAAIDev
1
13
43
105,950
TamGen: A Generative AI Framework for Target-Based Drug Discovery and Antibiotic Development Researchers from Microsoft Research AI for Science and other institutions developed TamGen, a target-aware molecular generation method using a GPT-like chemical language model. TamGen generates drug-like compounds by representing molecules in a sequential SMILES format, integrating modules for target protein-encoding and compound refinement. Applied to tuberculosis drug discovery, TamGen identified 14 compounds targeting the ClpP protease, with the most effective showing an IC50 of 1.9 μM. This approach improves molecular quality, balancing pharmacological activity and synthetic accessibility, demonstrating TamGen’s potential to generate novel candidates for antibiotic development and therapeutic innovation. TamGen is a framework designed to map protein binding pockets, represented by amino acid sequences and their 3D coordinates, to ligand SMILES strings. The model processes 3D input using embedding layers for amino acids and their coordinates, incorporating data augmentation for rotation and translation invariance. A protein encoder, utilizing distance-aware attention, generates continuous representations, while a contextual encoder based on VAE facilitates diverse ligand generation. Pretrained chemical language models refine the outputs. Training minimizes ligand generation error and enforces latent space regularization. Experiments with datasets like CrossDocked and PDB validated its effectiveness in generating compounds, including tuberculosis inhibitors.... Read the full article here: nature.com/articles/s41467-0… Details: microsoft.com/en-us/research… @Microsoft @MSFTnews @MSFTResearch
12
41
1,665
🧵🧵 Meet IntellAgent: An Open-Source Multi-Agent Framework to Evaluate Complex Conversational AI Systems IntellAgent is an advanced multi-agent framework that transforms the evaluation and optimization of conversational agents. By simulating thousands of realistic, challenging interactions, IntellAgent stress-tests agents to uncover hidden failure points. These insights enhance agent performance, reliability, and user experience. Key Features: 🔬 Generate Thousands of Edge-Case Scenarios: Automatically generate highly realistic edge-case scenarios tailored specifically to your agent. 🤖 Simulate Diverse User Interactions: Evaluate your agent across a wide spectrum of scenarios with varying complexity levels. 📊 Comprehensive Performance Evaluations: Access detailed analysis to identify performance gaps, prioritize improvements, and compare outcomes across experiments. 💪 Simple Integration: Integrates easily with your conversational agent. GitHub Page: pxl.to/82homag
15
40
1,703
Meet ResFields: A Novel AI Approach to Overcome the Limitations of Spatiotemporal Neural Fields in Effectively Modeling Long and Complex Temporal Signals Quick Read: marktechpost.com/2023/09/10/… Paper: arxiv.org/abs/2309.03160 Github: github.com/markomih/ResField… Project: markomih.github.io/ResFields… #computersciencestudents #ArtificialInteligence
10
41
3,074
Theory of Mind Meets LLMs: Hypothetical Minds for Advanced Multi-Agent Tasks Building AI systems that can collaborate effectively in dynamic environments remains a significant challenge. Multi-agent reinforcement learning (MARL) has been a key focus, aiming to teach agents to interact and adapt in such settings. However, these methods often struggle with complexity and adaptability, particularly when faced with new situations or unfamiliar agents. In response to these challenges, this paper from Stanford introduces a novel approach: the ‘Hypothetical Minds’ model. The model leverages large language models (LLMs) to enhance performance in multi-agent environments by simulating how humans understand and predict others’ behaviors. Traditional MARL techniques often find it hard to deal with ever-changing environments because the actions of one agent can unpredictably affect others. This instability makes learning and adaptation challenging. Existing solutions, like using LLMs to guide agents, have shown some promise in understanding goals and making plans but still lack the nuanced ability to interact effectively with multiple agents..... Quick read: marktechpost.com/2024/07/26/… Paper: arxiv.org/abs/2407.07086 @Stanford
1
18
39
1,623
FastGen: Cutting GPU Memory Costs Without Compromising on LLM Quality Researchers from the University of Illinois Urbana-Champaign and Microsoft proposed FastGen, a technique that improves the inference efficiency of LLMs without visible loss in quality, using lightweight model profiling and adaptive key-value caching. FastGen constructs the KV cache adaptively, evicting long-range contexts on attention heads that do not need them. Lightweight attention profiling guides the construction of the adaptive KV cache, so no resource-intensive fine-tuning or re-training is required. FastGen reduces GPU memory usage with negligible loss in generation quality. Quick read: marktechpost.com/2024/05/12/… Paper: arxiv.org/abs/2310.01801 #ai #ArtificialIntelligence #LLMs
10
41
2,425
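To make the idea of profiling-guided, per-head KV caching in the FastGen post above more concrete, here is a deliberately simplified sketch. It only distinguishes two policies, a local window and the full cache, and the names `choose_head_policy`, `compress_kv`, and the 0.95 threshold are hypothetical choices for this example rather than anything taken from the FastGen code.

```python
def choose_head_policy(attn_row, recent_window, threshold=0.95):
    """Pick a cache policy for one head from one profiled attention row.

    attn_row holds the attention weights of the latest query over all
    cached positions (assumed to sum to 1). If the most recent tokens
    already carry almost all of the attention mass, long-range entries
    can be evicted for this head.
    """
    recent_mass = sum(attn_row[-recent_window:])
    return "local" if recent_mass >= threshold else "full"

def compress_kv(cache, policy, recent_window):
    """Keep only the recent window for 'local' heads, everything otherwise."""
    if policy == "local":
        return cache[-recent_window:]
    return list(cache)
```

A head whose profiled row is `[0.01, 0.01, 0.48, 0.5]` with `recent_window=2` would be assigned the local policy and keep two cache entries, while a head that attends broadly keeps its full cache.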
ProVision: A Scalable Programmatic Approach to Vision-Centric Instruction Data for Multimodal Language Models Researchers from the University of Washington, Salesforce Research, and the University of Southern California introduced PROVISION. This scalable programmatic system uses scene graphs as symbolic image representations to generate vision-centric instruction data. By combining human-written programs with automatically or manually created scene graphs, PROVISION ensures interpretability, accuracy, and scalability while avoiding hallucinations and licensing constraints common in LLM/MLM-driven methods. The system generates over 10 million data points (PROVISION-10M) from Visual Genome and DataComp, covering diverse tasks like object, attribute, and depth-based queries. This data improves MLM performance, yielding up to 8% gains on benchmarks like CVBench, QBench2, and Mantis-Eval across pretraining and fine-tuning stages. The study introduces a method for generating vision-centric instruction data using augmented scene graphs enhanced with depth and segmentation labels. For single-image scenarios, 24 generators create diverse question-answer pairs using pre-defined templates, focusing on object attributes, relations, and spatial depth. Multi-image generators enable advanced reasoning tasks like comparison and aggregation across scene graphs. The scene graph generation pipeline integrates object detection (YOLO-world), segmentation (SAM-2), attribute detection (finetuned CoCa and LLaVA-1.5), relation extraction (Osprey), and depth estimation (Depth Anything V2). The modular framework supports customization, enabling users to create diverse data for visual reasoning and multimodal AI applications...... Read the full article: marktechpost.com/2025/01/11/… Paper: arxiv.org/abs/2412.07012
10
40
1,533
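The template-driven generation described in the ProVision post above can be illustrated with a toy scene graph. Everything here is a hypothetical sketch: the dictionary layout, the three generator functions, and the question templates are invented for this example and do not mirror the 24 generators in the actual system.

```python
from itertools import combinations

# Toy scene graph: objects with attributes and depth, plus relations.
scene_graph = {
    "objects": {
        "o1": {"name": "dog", "attributes": ["brown"], "depth": 2.0},
        "o2": {"name": "sofa", "attributes": ["red"], "depth": 3.5},
    },
    "relations": [("o1", "sitting on", "o2")],
}

def attribute_qa(graph):
    """One question per (object, attribute) pair."""
    return [(f"Which object is {attr}?", obj["name"])
            for obj in graph["objects"].values()
            for attr in obj["attributes"]]

def relation_qa(graph):
    """One question per (subject, relation, object) triple."""
    objs = graph["objects"]
    return [(f"What is the {objs[s]['name']} {rel}?", objs[o]["name"])
            for s, rel, o in graph["relations"]]

def depth_qa(graph):
    """One question per object pair; smaller depth means closer."""
    pairs = []
    for a, b in combinations(graph["objects"].values(), 2):
        closer = a if a["depth"] < b["depth"] else b
        question = f"Which is closer to the camera, the {a['name']} or the {b['name']}?"
        pairs.append((question, closer["name"]))
    return pairs

def generate_instruction_data(graph):
    """Run every generator over the graph and collect the QA pairs."""
    return attribute_qa(graph) + relation_qa(graph) + depth_qa(graph)
```

Because each answer is read directly from the graph, the output is only as accurate as the graph itself, which is why the pipeline's detection, segmentation, and depth models matter.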
Google AI Announces Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters Researchers from UC Berkeley and Google DeepMind propose an adaptive “compute-optimal” strategy for scaling test-time computing in LLMs. This approach selects the most effective method for utilizing additional computation based on the specific prompt and question difficulty. By utilizing a measure of question difficulty from the base LLM’s perspective, the researchers can predict the efficacy of test-time computation and implement this compute-optimal strategy in practice. This adaptive allocation of test-time compute significantly improves scaling performance, surpassing best-of-N baselines while using approximately 4 times less computation for both revision and search methods. The researchers then compare the effectiveness of their improved test-time compute scaling strategy against the alternative of pretraining larger models. The use of additional test-time computation in LLMs can be viewed through a unified perspective of modifying the model’s predicted distribution adaptively at test-time. This modification can be achieved through two main approaches: altering the proposal distribution and optimizing the verifier. To improve the proposal distribution, researchers have explored methods such as RL-inspired finetuning (e.g., STaR, ReSTEM) and self-critique techniques. These approaches enable the model to enhance its own outputs at test time by critiquing and revising its initial responses iteratively. Finetuning models on on-policy data with Best-of-N guided improvements has shown promise in complex reasoning tasks. Read our full take on this: marktechpost.com/2024/08/17/… Paper: arxiv.org/abs/2408.03314 @GoogleDeepMind
9
40
1,335
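For readers unfamiliar with the best-of-N baseline mentioned in the test-time compute post above, a minimal sketch follows. It shows only the baseline, not the paper's compute-optimal strategy. The `generate` and `verifier` callables are placeholders that a real system would back with an LLM sampler and a learned reward model.

```python
def best_of_n(prompt, generate, verifier, n):
    """Sample n candidate answers and return the one the verifier scores highest.

    generate(prompt, i) -> candidate answer for sample index i (placeholder)
    verifier(prompt, candidate) -> numeric score, higher is better (placeholder)
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt, i) for i in range(n)]
    return max(candidates, key=lambda c: verifier(prompt, c))
```

Test-time cost grows linearly with `n` here, which is the budget the adaptive strategy in the post aims to spend more selectively.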
The AI Scientist: The World’s First AI System for Automating Scientific Research and Open-Ended Discovery Researchers from Sakana AI, FLAIR, the University of Oxford, the University of British Columbia, Vector Institute, and Canada CIFAR have developed “The AI Scientist,” a framework that aims to fully automate scientific discovery. The system leverages large language models (LLMs) to autonomously generate research ideas, conduct experiments, and produce scientific manuscripts. The AI Scientist represents a significant advancement in the quest for fully autonomous research, integrating all aspects of the scientific process into a single, seamless workflow. This approach enhances efficiency and democratizes access to scientific research, making it possible for cutting-edge studies to be conducted at a fraction of the traditional cost.... Read our full take: marktechpost.com/2024/08/14/… Paper: arxiv.org/abs/2408.06292 @SakanaAILabs
12
43
1,727
Researchers at NTU Singapore Propose PointHPS: An AI Framework for Accurate Human Pose and Shape Estimation from 3D Point Clouds Quick Read: marktechpost.com/2023/09/03/… Paper: arxiv.org/abs/2308.14492 Project Page: caizhongang.github.io/projec… Github: github.com/caizhongang/Point… If you like our work, you will love our newsletter: pxl.to/nmngmk @liuziwei7 #ArtificialInteligence #DataScience #Trending
14
41
5,129
MIT and Harvard Researchers Propose (FAn): A Comprehensive AI System that Bridges the Gap between SOTA Computer Vision and Robotic Systems- Providing an End-to-End Solution for Segmenting, Detecting, Tracking, and Following any Object Quick Read: marktechpost.com/2023/08/20/… Paper: arxiv.org/abs/2308.05737 Github: github.com/alaamaalouf/Follo… #ArtificialIntelligence #MachineLearning #DataScience
16
40
3,001
Jina AI Introduced ‘Late Chunking’: A Simple AI Approach to Embed Short Chunks by Leveraging the Power of Long-Context Embedding Models The Late Chunking method represents a significant advancement in utilizing the rich contextual information provided by 8192-length embedding models. This innovative technique offers a more effective way to embed chunks, potentially bridging the gap between the capabilities of long-context models and the practical needs of various applications. By exploring this approach, researchers seek to demonstrate the untapped potential of extended context lengths in embedding models. The conventional RAG pipeline, which involves chunking, embedding, retrieving, and generating, faces significant challenges. One of the most pressing issues is the destruction of long-distance contextual dependencies. This problem arises when relevant information is distributed across multiple chunks, causing text segments to lose their context and become ineffective when taken in isolation..... Read our full take on this: marktechpost.com/2024/08/27/… Details: jina.ai/news/late-chunking-i… Colab Notebook: colab.research.google.com/dr… @JinaAI_
1
7
38
1,123
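The Late Chunking post above does not spell out the mechanics, so here is a minimal sketch of the idea as we understand it. The whole document is passed through the long-context model once, and the resulting token embeddings are then mean-pooled per chunk, so each chunk vector still reflects its surrounding context. The function name, the span format, and the assumption that token embeddings are already available as plain lists are all illustrative.

```python
def late_chunk_embeddings(token_embeddings, chunk_spans):
    """Mean-pool contextualized token embeddings over each chunk span.

    token_embeddings: one vector per token, produced by a single forward
        pass over the full document (assumed to be computed elsewhere).
    chunk_spans: (start, end) token index pairs, end exclusive.
    """
    chunks = []
    for start, end in chunk_spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty chunk span: ({start}, {end})")
        dim = len(window[0])
        chunks.append([sum(vec[d] for vec in window) / len(window)
                       for d in range(dim)])
    return chunks
```

The contrast with the conventional pipeline is the order of operations: chunking first and embedding each piece in isolation discards cross-chunk context, whereas pooling after the full pass keeps it.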
Fine-Tuning Llama 3.2 3B Instruct for Python Code: A Comprehensive Guide with Unsloth (Colab Notebook Included) In this tutorial, we’ll walk through how to set up and perform fine-tuning on the Llama 3.2 3B Instruct model using a specially curated Python code dataset. By the end of this guide, you’ll have a better understanding of how to customize large language models for code-related tasks and practical insight into the tools and configurations needed to leverage Unsloth for fine-tuning.... Full Tutorial: marktechpost.com/2025/02/04/… Colab Notebook: colab.research.google.com/dr… @UnslothAI
10
41
2,001
Teaching AI to Say ‘I Don’t Know’: A New Dataset Mitigates Hallucinations from Reinforcement Finetuning Researchers from the University of Southern California developed the Synthetic Unanswerable Math (SUM) dataset. SUM introduces implicitly unanswerable math problems by modifying existing questions through criteria such as missing key information or creating logical inconsistencies. The researchers used DeepScaleR as the base dataset and employed the o3-mini model to generate high-quality unanswerable questions. This synthetic dataset aims to teach models to recognize when a problem lacks sufficient information and respond accordingly. SUM’s core technique is to mix answerable and unanswerable problems during training. Questions are modified to become ambiguous or unsolvable while maintaining plausibility. The training prompts instruct models to say “I don’t know” for unanswerable inputs. By introducing only 10% of the SUM data into reinforcement finetuning, models begin to leverage inference-time reasoning to evaluate uncertainty. This structure allows them to refuse answers more appropriately without impairing their performance on solvable problems..... Read full article: marktechpost.com/2025/06/05/… Paper: arxiv.org/abs/2505.13988 Dataset on Hugging Face: huggingface.co/datasets/lime… @linxins2
8
43
3,563
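The data-mixing step described in the SUM post above can be sketched as follows. This assumes that "10%" means unanswerable problems make up 10% of the final training mixture, which is one reading of the post. The function name, the `(question, target)` tuple format, and the exact refusal string are hypothetical.

```python
import random

IDK = "I don't know"  # refusal target for unanswerable problems (assumed wording)

def mix_sum_data(answerable, unanswerable, unanswerable_fraction=0.10, seed=0):
    """Blend answerable (question, answer) pairs with unanswerable questions.

    Unanswerable questions receive the refusal string as their target and
    make up roughly `unanswerable_fraction` of the returned mixture.
    """
    if not 0.0 <= unanswerable_fraction < 1.0:
        raise ValueError("unanswerable_fraction must be in [0, 1)")
    rng = random.Random(seed)
    target = round(len(answerable) * unanswerable_fraction
                   / (1.0 - unanswerable_fraction))
    picked = rng.sample(unanswerable, min(target, len(unanswerable)))
    mixed = list(answerable) + [(q, IDK) for q in picked]
    rng.shuffle(mixed)
    return mixed
```

With 90 answerable problems and a pool of 50 unanswerable ones, the default setting yields a 100-item mixture containing 10 refusal targets.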