AI research paper tweets, ML @Gradio (acq. by @HuggingFace 🤗). DM for promo; submit papers here: huggingface.co/papers/submit

ChatGPT playing rock paper scissors
171
3,593
68,321
3,575,943
Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold paper page: huggingface.co/papers/2305.1…
310
5,428
20,685
3,407,502
Fixing things with AI
142
2,060
17,449
1,884,029
ChatGPT, Bro just kept going?
206
1,624
15,986
1,647,681
72
1,997
15,559
777,016
Hey ChatGPT, finish this building...
105
1,219
12,755
1,440,155
chatgpt gives a random youtube link
91
354
12,358
1,221,465
Fake Apple Products, midjourney AI 1. Apple Jetpack
141
1,239
11,898
2,163,478
AI is taking over
161
1,209
11,020
1,608,148
SpatialLM just dropped on Hugging Face Large Language Model for Spatial Understanding
115
1,215
10,375
673,799
AI Generative fill with memes
39
1,353
9,316
991,540
real life Simpsons, midjourney AI 1. Flanders
343
724
8,364
6,166,341
Dogs being Human, midjourney AI 1. Golden Retriever
127
792
7,025
2,020,032
Harry Potter Anime using stable diffusion by u/Inner-Reflections
86
1,428
6,352
730,151
DeepSeek-R1 write a script for a bouncing yellow ball within a Rhombicosidodecahedron, make sure to handle collision detection properly. make the Rhombicosidodecahedron slowly rotate. make sure ball stays within the Rhombicosidodecahedron. implement it in p5.js
113
679
6,163
1,478,656
o3-mini prompt: make a app called chatgpt ad maker that takes in a image and does a black and white dotted image effect with sliders to adjust dot size
What do you want to create next?
86
263
6,090
1,395,000
What did you do with generative AI?
93
276
5,430
523,887
stable diffusion img2img web UI + workflow video github: github.com/hlky/stable-diffu… reddit thread: teddit.net/r/StableDiffusion…
35
927
5,283
stylegan2 finetuning ffhq to metfaces
39
1,134
5,014
ADOP: Approximate Differentiable One-Pixel Point Rendering abs: arxiv.org/abs/2110.06635
62
973
4,912
Celebrities if They Worked Normal Jobs, Midjourney AI 1. Tom Cruise
101
281
4,213
2,149,336
Mubert-Text-to-Music 🎵🎵🎵 Colab notebooks demonstrating prompt-based music generation via Mubert API GitHub: github.com/MubertAI/Mubert-T…
76
1,140
4,318
OpenAI o3-mini just one shotted this prompt: write a script for 100 bouncing yellow balls within a sphere, make sure to handle collision detection properly. make the sphere slowly rotate. make sure balls stays within the sphere. implement it in p5.js
137
399
4,237
814,801
6. Apple Orange
47
439
4,102
368,330
This is HUGE. The AI App Store is here. Ask anything you want to do with AI. With ~400k apps, this is the best place to find the AI apps you need: developers can build apps, and users can try them out and find new apps with AI search.
151
550
4,170
662,764
Monster Mash: A Single-View Approach to Casual 3D Modeling and Animation pdf: dcgi.fel.cvut.cz/home/sykora… project page: dcgi.fel.cvut.cz/home/sykora…
36
1,024
3,899
AI generative fill extending scenes, movies shot in portrait format by @Alex_Cerrato
94
763
3,977
683,527
Text to image with midjourney and image to video with gen2 by @commonstyle
71
758
3,782
668,122
Meme Legends, Photoshop generative fill AI by savvydone
29
593
3,722
458,049
Fork blender, call it cursor for 3D Raise $100M
76
84
3,526
432,729
Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields abs: arxiv.org/abs/2304.06706 project page: jonbarron.info/zipnerf/
91
588
3,391
942,495
BREAKING: OpenAI released an implementation of Consistency Models, a new family of generative models that achieve high sample quality without adversarial training. They support fast one-step generation by design, while still allowing for few-step sampling to trade compute for sample quality. github: github.com/openai/consistenc…
29
755
3,265
1,379,809
Stable Diffusion AI Deepfake De-Aged Harrison Ford SD+ControlNet+EbSynth+Fusion reddit thread: teddit.net/r/StableDiffusion…
81
619
3,148
551,548
StarVector is out on Hugging Face StarVector is a foundation model for generating Scalable Vector Graphics (SVG) code from images and text. It utilizes a Vision-Language Modeling architecture to understand both visual and textual inputs, enabling high-quality vectorization and text-guided SVG creation.
55
492
3,287
254,290
Scaling Transformer to 1M tokens and beyond with RMT Recurrent Memory Transformer retains information across up to 2 million tokens. During inference, the model effectively utilized memory for up to 4,096 segments with a total length of 2,048,000 tokens—significantly exceeding the largest input size reported for transformer models (64K tokens for CoLT5 (Ainslie et al., 2023), and 32K tokens for GPT-4 (OpenAI, 2023)). This augmentation maintains the base model’s memory size at 3.6 GB in our experiments abs: arxiv.org/abs/2304.11062 github: github.com/booydar/t5-experi…
89
748
3,192
1,722,421
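The figures quoted in the RMT post above (4,096 segments, 2,048,000 tokens in total) imply a segment length of 500 tokens. A minimal sketch of segment-wise processing with carried state follows. The `step` function is a hypothetical stand-in for the model and only counts tokens; it is not the paper's memory mechanism.

```python
# Figures taken from the post: 4,096 segments covering 2,048,000 tokens.
TOTAL_TOKENS = 2_048_000
NUM_SEGMENTS = 4_096
SEGMENT_LEN = TOTAL_TOKENS // NUM_SEGMENTS  # implied tokens per segment


def split_into_segments(tokens, segment_len):
    """Cut a token sequence into consecutive chunks of at most segment_len."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def process_with_memory(segments, step, initial_memory):
    """Run segments in order, passing the state returned by one call into the next.

    `step` is a hypothetical callable (memory, segment) -> new memory.
    """
    memory = initial_memory
    for segment in segments:
        memory = step(memory, segment)
    return memory


# Toy usage: the "memory" is just a running token count.
toy_segments = split_into_segments(list(range(2_000)), SEGMENT_LEN)
toy_result = process_with_memory(toy_segments, lambda m, s: m + len(s), 0)
print(SEGMENT_LEN, len(toy_segments), toy_result)
```

Only one segment is held at a time, so the working input size stays fixed regardless of the total sequence length.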
midjourney version 5.2 zoom out feature: Unleashing the Potential of A Broader View
49
443
3,108
502,037
Training AI to Play Pokemon with Reinforcement Learning by @computerender github: github.com/PWhiddy/PokemonRe… youtube: piped.video/watch?v=DcYLT37I…
32
660
3,097
837,741
Celebrity Mortal Kombat, Midjourney AI + gen2 + ElevenLabs by u/fignewtgingrich
76
693
2,997
457,289
5. Apple Teleport
53
159
2,943
411,085
Eyes Tell All: Irregular Pupil Shapes Reveal GAN-generated Faces pdf: arxiv.org/pdf/2109.00162.pdf abs: arxiv.org/abs/2109.00162
28
703
2,962
DALL·E: Introducing Outpainting Extend creativity and tell a bigger story with DALL-E images of any size blog: openai.com/blog/dall-e-intro…
22
715
2,961
3D-aware Conditional Image Synthesis abs: arxiv.org/abs/2302.08509 project page: cs.cmu.edu/~pix2pix3D/
16
632
2,914
301,477
Tencent presents GameGen-O Open-world Video Game Generation We introduce GameGen-O, the first diffusion transformer model tailored for the generation of open-world video games. This model facilitates high-quality, open-domain generation by simulating a wide array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, thus allowing for gameplay simulation. The development of GameGen-O involves a comprehensive data collection and processing effort from scratch. We collect and build the first Open-World Video Game Dataset (OGameData), amassing extensive data from over a hundred next-generation open-world games, employing a proprietary data pipeline for efficient sorting, scoring, filtering, and decoupled captioning. This robust and extensive OGameData forms the foundation of our model's training process. GameGen-O undergoes a two-stage training process, consisting of foundation model pretraining and instruction tuning. In the first phase, the model is pre-trained on the OGameData via text-to-video and video continuation, endowing GameGen-O with the capability for open-domain video game generation. In the second phase, the pre-trained model is frozen and fine-tuned using a trainable InstructNet, which enables the production of subsequent frames based on multimodal structural instructions. This whole training process imparts the model with the ability to generate and interactively control content. In summary, GameGen-O represents a notable initial step forward in the realm of open-world video game generation via generative models. It underscores the potential of generative models to serve as an alternative to rendering techniques, which can efficiently combine creative generation with interactive capabilities.
97
557
2,914
366,980
vibe coding AI apps for free has never been easier 100% open source app, DeepSite on Hugging Face
53
319
2,945
395,071
41
299
2,684
289,726
DeepSeek-V3-0324 is next level 🤯 Someone made DeepSite, letting you vibe code your own AI app or game and host it for FREE ⬇️ Results are insane, it's like Cursor in the browser
59
353
2,731
423,945
Midjourney AI recreating the Original 151 Pokémon - Part 1: The Starters by u/OfficialKnockout 1. Bulbasaur #001
33
266
2,517
917,765
Agent Laboratory Using LLM Agents as Research Assistants
20
363
2,568
204,538
MDM: Human Motion Diffusion Model abs: arxiv.org/abs/2209.14916 project page: guytevet.github.io/mdm-page/
16
590
2,500
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding project page: gweb-research-imagen.appspot… SOTA FID (7.27 on COCO) without ever training on COCO; human raters find Imagen samples to be on par with the COCO data itself in image-text alignment
28
635
2,501
Another Meme Legends, Photoshop generative fill AI by @SavvyDone
24
433
2,431
373,649
Apple announces LLM in a flash: Efficient Large Language Model Inference with Limited Memory paper page: huggingface.co/papers/2312.1… Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their intensive computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters on flash memory but bringing them on demand to DRAM. Our method involves constructing an inference cost model that harmonizes with the flash memory behavior, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this flash memory-informed framework, we introduce two principal techniques. First, "windowing" strategically reduces data transfer by reusing previously activated neurons, and second, "row-column bundling", tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches in CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.
25
448
2,475
698,084
grok 3 prompt: I’d like to make a p5.js simulation of a sphere made up of ASCII numbers, rotating. The closest numbers should be pure white, and the farthest ones should fade to gray, on a black background
⚡ Simulation with o3-mini high! Ever since I was 9 and messing around with BASIC on my MSX, I’ve loved any kind of visual simulation. Now, with o3, we have the power to create anything that comes to mind with just a couple of prompts. It’s mind blowing 🤯 The prompt was: "I’d like to make a JS simulation of a sphere made up of ASCII numbers, rotating. The closest numbers should be pure white, and the farthest ones should fade to gray, on a black background." After just a few interactions, this is the result. What a time to be alive!
76
203
2,460
1,325,639
DeepSeek-R1 Coder: it's like Cursor in the browser
12
190
2,470
285,606
Riffusion, real-time music generation with stable diffusion @huggingface model: huggingface.co/riffusion/rif… project page: riffusion.com/about
53
577
2,415
Microsoft presents The Era of 1-bit LLMs All Large Language Models are in 1.58 Bits Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
43
578
2,398
433,763
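The "1.58 bits" in the BitNet b1.58 post above matches the information content of one three-valued weight, log2(3) ≈ 1.585. The sketch below checks that figure and estimates weight storage. The 7-billion-parameter model size is a hypothetical example, and the estimate ignores activations, packing overhead, and any layers kept at higher precision.

```python
import math


def bits_per_weight(num_levels: int) -> float:
    """Information content, in bits, of one weight with num_levels equally likely values."""
    return math.log2(num_levels)


def weight_memory_gb(num_params: float, bits: float) -> float:
    """Idealized storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


ternary_bits = bits_per_weight(3)  # weights drawn from {-1, 0, 1}

# Hypothetical 7B-parameter model, used only to illustrate the ratio.
fp16_gb = weight_memory_gb(7e9, 16)
ternary_gb = weight_memory_gb(7e9, ternary_bits)

print(f"{ternary_bits:.3f} bits per ternary weight")
print(f"FP16: {fp16_gb:.1f} GB, ternary (ideal): {ternary_gb:.2f} GB")
```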
Google presents Genie Generative Interactive Environments introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It is comprised of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.
75
499
2,284
684,303
Track Anything: Segment Anything Meets Videos Track-Anything is a flexible and interactive tool for video object tracking and segmentation suitable for: - Video object tracking and segmentation with shot changes. - Visualized development and data annotation for video object tracking and segmentation. - Object-centric downstream video tasks, such as video inpainting and editing. abs: arxiv.org/abs/2304.11968 github: github.com/gaomingqi/Track-A…
35
459
2,295
578,613
Generative Agents: Interactive Simulacra of Human Behavior abs: arxiv.org/abs/2304.03442 project page: reverie.herokuapp.com/arXiv_…
61
474
2,253
902,565
Block-NeRF: Scalable Large Scene Neural View Synthesis abs: arxiv.org/abs/2202.05263 project page: waymo.com/research/block-ner…
28
511
2,275
TikTok presents Depth Anything Unleashing the Power of Large-Scale Unlabeled Data paper page: huggingface.co/papers/2401.1… demo: huggingface.co/spaces/LiheYo… Depth Anything is trained on 1.5M labeled images and 62M+ unlabeled images jointly, providing the most capable Monocular Depth Estimation (MDE) foundation models with the following features: zero-shot relative depth estimation, better than MiDaS v3.1 (BEiTL-512); zero-shot metric depth estimation, better than ZoeDepth; optimal in-domain fine-tuning and evaluation on NYUv2 and KITTI
32
385
2,265
600,083
Alibaba just dropped TaoAvatar on Hugging Face Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting
54
296
2,252
329,263
Wan2.2-Animate-14B just dropped on Hugging Face Unified Character Animation and Replacement with Holistic Replication
53
303
2,295
157,398
Apple presents Ferret-UI Grounded Mobile UI Understanding with Multimodal LLMs Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with
30
375
2,187
680,537
stylegan3-projector Mario github: github.com/ouhenio/stylegan3…
61
342
2,057
alpaca-lora: Code for reproducing the Stanford Alpaca InstructLLaMA result on consumer hardware github: github.com/tloen/alpaca-lora
24
447
2,114
1,363,903
chatgpt search vs perplexity
97
107
2,088
375,405
Microsoft just dropped OmniParser V2, looks incredible Turning Any LLM into a Computer Use Agent
48
319
2,131
226,353
Google presents Diffusion Models Are Real-Time Game Engines discuss: huggingface.co/papers/2408.1… We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality. GameNGen can interactively simulate the classic game DOOM at over 20 frames per second on a single TPU. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation. GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations enable stable auto-regressive generation over long trajectories.
59
448
2,095
493,090
Meta releases Llama 2: Open Foundation and Fine-Tuned Chat Models paper: ai.meta.com/research/publica… blog: ai.meta.com/llama/ We develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
32
544
2,086
637,638
everyone on ML twitter right now
22
160
2,028
Vid2Player: Controllable Video Sprites that Behave and Appear like Professional Tennis Players pdf: arxiv.org/pdf/2008.04524.pdf abs: arxiv.org/abs/2008.04524 project page: cs.stanford.edu/~haotianz/re…
38
515
2,017
this looks insane, MatAnyone Stable Video Matting with Consistent Memory Propagation
39
202
2,068
310,421
make-a-video: text-to-video generation without text-video data paper: makeavideo.studio/Make-A-Vid… project page: makeavideo.studio/
32
460
2,010
AI will take over the world?
63
279
1,922
222,557
stylegan3 is out github: github.com/NVlabs/stylegan3
7
377
1,981
3. Apple Jeans
22
87
1,851
251,132
One is Midjourney 5.1, the other is real. Which one is which? reddit thread: teddit.net/r/midjourney/comm…
441
204
1,892
1,029,024
10. Marge
196
83
1,813
5,552,719
Language Modeling Is Compression paper page: huggingface.co/papers/2309.1… It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.
41
364
1,930
774,788
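The prediction-compression link described in the post above can be made concrete: an entropy coder driven by a predictive model spends about -log2 p bits on each symbol, where p is the probability the model assigned to it. The sketch below computes that ideal code length. It uses a simple adaptive byte-frequency model as a stand-in for a language model (an assumption for illustration) and does not implement the coder itself.

```python
import math
from collections import Counter


def ideal_code_length_bits(data: bytes) -> float:
    """Sum of -log2 p(byte | bytes seen so far) under an adaptive order-0 model.

    The model is Laplace-smoothed byte counts, a toy stand-in for an LLM.
    """
    counts = Counter()
    total_bits = 0.0
    for seen, byte in enumerate(data):
        p = (counts[byte] + 1) / (seen + 256)  # add-one smoothing over 256 byte values
        total_bits += -math.log2(p)
        counts[byte] += 1
    return total_bits


def compression_rate(data: bytes) -> float:
    """Ideal compressed size as a fraction of the raw size (lower is better)."""
    return ideal_code_length_bits(data) / (8 * len(data))


repetitive_rate = compression_rate(b"a" * 1000)
varied_rate = compression_rate(bytes(range(256)))
print(f"repetitive: {repetitive_rate:.3f}, varied: {varied_rate:.3f}")
```

Predictable input scores well below 1.0, while input the model cannot predict scores at or above 1.0. The percentages in the post are the same kind of ratio.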
Tracking Anything with Decoupled Video Segmentation paper page: huggingface.co/papers/2309.0… Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation.
19
398
1,945
305,622
34
332
1,910
188,849
.@Gradio Demo for AnimeGANv2 Face Portrait v2 now on @huggingface Spaces demo: huggingface.co/spaces/akhali… github: github.com/bryandlee/animega…
45
298
1,903
#StyleGAN2 interps
18
364
1,874
6. Apu
20
50
1,774
682,725
ByteDance just announced InfiniteYou available on Hugging Face Flexible Photo Recrafting While Preserving Your Identity
24
220
1,959
222,303
Dreamix: Video Diffusion Models are General Video Editors abs: arxiv.org/abs/2302.01329 project page: dreamix-video-editing.github… presents a diffusion-based method that is able to perform text-based motion and appearance editing of general videos
33
414
1,868
398,162
Got married 💍
221
18
1,882
138,184
JPMorgan announces DocLLM A layout-aware generative language model for multimodal document understanding paper page: huggingface.co/papers/2401.0… Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar records, often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a crucial role in comprehending these documents effectively. In this paper, we present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents, taking into account both textual semantics and spatial layout. Our model differs from existing multimodal LLMs by avoiding expensive image encoders and focuses exclusively on bounding box information to incorporate the spatial layout structure. Specifically, the cross-alignment between text and spatial modalities is captured by decomposing the attention mechanism in classical transformers to a set of disentangled matrices. Furthermore, we devise a pre-training objective that learns to infill text segments. This approach allows us to address irregular layouts and heterogeneous content frequently encountered in visual documents. The pre-trained model is fine-tuned using a large-scale instruction dataset, covering four core document intelligence tasks. We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
23
341
1,896
352,903
An implementation of text-to-3D dreamfusion, powered by stable diffusion github: github.com/ashawkey/stable-d…
21
397
1,829
Imagic: Text-Based Real Image Editing with Diffusion Models abs: arxiv.org/abs/2210.09276
18
396
1,850
DreamFusion: Text-to-3D using 2D Diffusion paper: openreview.net/pdf?id=FjNys5… abs: openreview.net/forum?id=FjNy… project page: dreamfusionpaper.github.io/ DeepDream on a pretrained 2D diffusion model enables text-to-3D synthesis
27
382
1,808
20 second tutorial on making apps with Grok 3 and deploying on Hugging Face example showing gradio app with halftone effect
83
310
1,799
4,896,039
ByteDance announces Doubao-1.5-pro - Includes a "Deep Thinking" mode, surpassing O1-preview and O1 models on the AIME benchmark. - Outperforms deepseek-v3, gpt4o, and llama3.1-405B on popular benchmarks. - Built on a MoE architecture, with activated parameters far fewer than those in the above models. - Achieves a 7x MoE performance leverage, delivering dense model performance with just 1/7 of the activated parameters (e.g., 20B activated params = 140B dense performance). - Engineering-wise, features heterogeneous system design for prefill-decode and attn-ffn, maximizing throughput under low-latency requirements.
52
267
1,831
389,914
Meta just released MusicGen, a simple and controllable model for music generation. MusicGen is a single-stage auto-regressive Transformer model trained over a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike existing methods like MusicLM, MusicGen doesn't require a self-supervised semantic representation, and it generates all 4 codebooks in one pass. By introducing a small delay between the codebooks, it can predict them in parallel, thus having only 50 auto-regressive steps per second of audio. Try out the @Gradio demo: huggingface.co/spaces/facebo… Models on @huggingface: huggingface.co/models?other=… github: github.com/facebookresearch/…
40
386
1,772
627,444
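The codebook delay mentioned in the MusicGen post above can be sketched as follows. It assumes a delay of one step per codebook, which the post does not specify, and uses a hypothetical `PAD` placeholder for positions with no token. With 4 codebooks at 50 Hz, predicting every token in sequence would take 200 steps per second of audio. Shifting codebook k right by k steps lets one step emit a token for every codebook, at a cost of only 3 extra steps per clip.

```python
PAD = -1  # hypothetical placeholder meaning "no token at this step"


def apply_delay_pattern(codes, pad=PAD):
    """Shift codebook k right by k steps (assumed delay of one step per codebook).

    codes: list of K rows, each holding T tokens. Returns K rows of length T + K - 1.
    """
    num_codebooks = len(codes)
    return [
        [pad] * k + list(row) + [pad] * (num_codebooks - 1 - k)
        for k, row in enumerate(codes)
    ]


def decoding_steps(num_frames, num_codebooks, delayed):
    """Autoregressive steps needed: one per column if delayed, one per token if flattened."""
    if delayed:
        return num_frames + num_codebooks - 1
    return num_frames * num_codebooks


# One second of audio with the figures from the post: 4 codebooks at 50 Hz.
flattened_steps = decoding_steps(50, 4, delayed=False)
delayed_steps = decoding_steps(50, 4, delayed=True)
print(flattened_steps, delayed_steps)
print(apply_delay_pattern([[1, 2, 3], [4, 5, 6]]))
```

For longer clips the fixed overhead of K - 1 steps becomes negligible, which gives the roughly 50 steps per second quoted in the post.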
7. Apple Ship
9
64
1,712
288,774
Alibaba just released LHM on Hugging Face Large Animatable Human Reconstruction Model from a Single Image in Seconds
25
242
1,754
170,407
GeoCode: Interpretable Shape Programs abs: arxiv.org/abs/2212.11715 project page: threedle.github.io/GeoCode/
16
267
1,720
1,429,189