COO at UniCourt | Structured Litigation Data for law, insurance and more. Interested in all things data, LLM and AI

@ChatGPTapp @OpenAI @tszzl @emollick @voooooogel Wild result. gpt-4-turbo over the API produces (statistically significant) shorter completions when it "thinks" its December vs. when it thinks its May (as determined by the date in the system prompt). I took the same exact prompt over the API (a code completion task asking to implement a machine learning task without libraries). I created two system prompts, one that told the API it was May and another that it was December and then compared the distributions. For the May system prompt, mean = 4298 For the December system prompt, mean = 4086 N = 477 completions in each sample from May and December t-test p < 2.28e-07 To reproduce this you can just vary the date number in the system message. Would love to see if this reproduces for others.
131
319
2,231
2,245,706
Small but important clarification. The distribution is labeled tokens but the measure, and analysis, is actually done on character length *not* tokens. Thought for an effect this size I think it seems to be a good proxy.
3
3
108
41,473
I wanted to do a month by month comparison but the effect is such that you need a fairly high N (because the standard deviation in completion lengths is already pretty high in the first place), and it gets pricey fast, haha. I published my code though, so others can try!
2
1
32
15,824
Posted code here. Please note the analysis was done at N=477 and based on character count, not tokens as in the label: github.com/robalynch1122/Ope… And also please note its parallelized so it'll run fast an not cheap 😅 Around $28 per run to exactly repro!
19
7,822
Wow, so cool. I'm glad someone else has seen it in the wild! Thanks for checking it out @voooooogel
Reproduced! There were some small bugs in the original test code (lack of zero padding for May (h/t @gwern) and one of those pervasive """-string indentation issues), but still reproduces without them, to the best of my stats knowledge 🎉
1
13
1,387
Yes, needs to be reproduced for sure to make sure I'm not missing something or going crazy!
1
12
16,382
Replying to @voooooogel
DMs opened, give it another try!
1
11
19,725
Interesting, I ran at N = 80 again just now and got p-value of 0.089 (two-tailed) but I did it on character count, not tokens (see my clarification to the original post). I ran it a few times over the weekend (making sure *not* to p-hack), and the effect definitely grew as samples did. I wonder why that would be impacted by tokenization though?
4
11
12,238
I deny any influence from the May December PR team on my experiment 🤣
We live in a simulation
1
9
1,730
Are @yoheinakajima’s BabyAGI and other paired loop LLM experiments like @SigGravitas Auto-GPT a precursor to truly agentic “conscious” AI? Julian Jaynes might think so. In 1976, Jaynes wrote a book called “The Origin of Consciousness in the Breakdown of the Bicameral Mind". Jaynes posited ancient humans operated under a 'bicameral' mind model, where one part 'spoke' and another obeyed, mimicking internal dialogue. Jaynes theorized that the 'bicameral' structure in humans started as an external “voice of God” but over time this structure broke down and the instructing voice was internalized and evolved towards consciousness, where dialogues within became introspective, self-aware thoughts. This theory intersects intriguingly with modern Large Language Models (LLMs) experiments like Auto-GPT and BabyAGI. Consider an LLM as an 'executor', processing and responding to inputs, paired with another as a 'critic' or 'instructor', guiding or evaluating responses. We know that pairing LLMs in an executor-critic loop like this can lead to a very rudimentary form of 'agenticness' or self-guided operation. As models scale and have modalities added, could such loops “breakdown” as Jaynes’ suggested where executor and critic roles transition over time towards autonomous, conscious-like behavior? Ultimately, Jaynes theories are controversial and not broadly accepted but the parallels between a ‘bicameral’ mind and rudimentary agenticness in paired LLMs is intriguing at the least. en.m.wikipedia.org/wiki/The_…
1
3
9
2,846
Adding a link to the GitHub repo with code I'm using to (hopefully) train an LLM to play text-adventure games (github.com/robalynch1122/Zor…). Starting with the core of a PPO loop (no policy model yet) to interact with Zork in terminal and train a reward model. And also a logit inspector to see what Mistral 7B's initial generations would be when put into the text environment.
2
1
7
676
@voooooogel was able to repro on character counts, I retweeted. Looks like a higher N is needed.
1
5
1,282
Looking at the logit outputs of Mistral 7B (not instruct) when "placed" into a text-adventure environment is really encouraging. Verbs like look, touch, take, go, open and get were in the top ten, and petting the dog was in position 24. Lots of potential here to be tuned.
3
209
Answering the question objectively and accurately of whether or not an AI model (or system of AI models) is independently and successfully goal-seeking seems of pretty key concern to ⏸️/⏹️-ers, ⏩-ers, people who believe we're close, people who believe we're far off and everyone in-between. I'd like to propose a potentially useful AGI test called the "Zork-Like test". What is Zork and why Zork-Like? Zork (en.wikipedia.org/wiki/Zork) is a text adventure game and a defined environment with an achievable goal that is not stated but needs to be discovered by the player, and conveniently also has a score that can be maximized and a score/turn ratio that can be measured. It's the perfect "toy" test of human-level goal discovery, recognition and action in a simple simulated text-only environment accessible to LLMs. But we can't use Zork itself as there are walkthroughs of it in the training set, so the Zork-Like game is of similar complexity but with a different map and some different actions that don't make it into training sets, or can be recreated easily if one suspects it has. Here's the Zork-Like Test: Before "using superhuman persuasion to build a bioweapon using human agents" or "inventing new physics and solving antibiotic resistance", an AI needs to be able to understand the goal of a Zork-Like game (not in the training set) and max out its Zork-Like score efficiently to human level without human feedback and only bootstrapping from interacting with the text environment. [Side note: I'm curious on the thoughts of anyone including @sama, @ylecun, @roon, @emollick, and others in e/acc (@beffjezos) and Safety (@robertskmiles) circles on this test and if the p(+bootstrapped Zork completion by AI models) - > p(+doom) / p(+utopia) logic rings true. I do at least think even the most dedicated and imaginative⏹️-ers (@AISafetyMemes) would concede that evil AGI induced bioweapons manufacturing in the real world is a greater challenge than getting high Zork scores in a narrower/defined world and is therefore at very least useful as a canary of goal-seeking AGI.] The Zork test (nevermind about Zork-like test) fails miserably today, even with a full Zork walkthrough apparently in the GPT-4 training set. State of the art ML Zork score is about 50, and human unassisted GPT-4 is about 10 (arxiv.org/pdf/2304.02868.pdf, arxiv.org/pdf/2107.08408.pdf). This shouldn't be too shocking, LLMs are missing a lot of key components like memory, self-reflection and the ability to take independent action without prompting, but on the other hand, an LLM seems to be the perfect "type" of model to be able to play text games. I think the reason for the failure reduces to one of the key problems in AI research in general, that of out-of-distribution generalization. It's becoming clear (arxiv.org/abs/2312.16337) that even if large scale LLMs do have emergent abilities, that far more of their success is down to incredibly good generalization on the training set which at this point for GPT-4 is approaching "the entire set of written human knowledge (up until the training knowledge cutoff)". So why can't LLMs play Zork, what's missing in training set and does that show potential paths forward? I think that a lot of LLMs goal-directed weaknesses (even when placed in loops/pairs or "teams" and given in-context "memory" like @yoheinakajima's BabyAGI and AutoGPT) are due to the fact that pre-trained LLMs have "read" everything but they have "done" nothing. Their training on knowledge is deep but their first-hand training on taking action is non-existent. Likely the biggest leap in capabilities on the language-side in the AI Spring came from RLHF which led to instruction tuned models. It seems like RL, without the HF is a good bet on the way forward, i.e. reinforcement learning, with environmental (not human) feedback happening online/actively as the model attempts to do something and gets feedback. If its possible for current stage LLMs to build a world model (a big if) it may come from a setup like this. (@ylecun has spoken about how embodiment is a likely needed for true human-level abilities). If the above is true this seems to lead to another rule of thumb: a system (consisting of some number of AI models interacting) cannot be called an AGI unless it is able to dynamically update its own weights as it acts in a way that improves its later actions vs. prior actions without any human labeling or intervention (except as a part of the environment). This is because the real world will always contain out-of-distribution to text training data obstacles which no fixed weight model could overcome. Our brains do this in real-time, LLMs do not (today) at all. It could also mean that "hallucinations" (and models that have baked in knowledge like LLMs) may actually have outsized value, they provide a form of knowledge directed exploration that can only be refined with environmental feedback on which ones are good (advance the overall goal) and which ones are bad (do not advance the overall goal). This seems to complete the circle of why the Zork-Like test "works", why models fail it today and why current LLMs cannot be thought of as having any risk on goal-seeking without passing this test first. In light of the above, I'll lay out my (almost certainly failed) path to moving forward the Zork-Like test and 2024 pet project (which will proceed extremely slowly, or potentially not at all, due to time and cost constraints). 1) Play a Python Zork reference (like github.com/KadenBiel/Python-…) multiple times to set a human baseline of total number of playthroughs allowed and score/move ratio (or find the same) 2) Take two 7B LLMs, one to be used as a policy/player model and one to be used as a reward model 3) Create a Zork environment that can interact with the policy and reward LLM 4) Use RL techniques like DPO/PPO and techniques borrowed from RLHF (but with no human labeling) to fine-tune both models as they play. 5) See if any "sparks" of generalization occur Super interested in anyone's thoughts, holes in my logic, ways to add rigor or greater definition or interest in participating! Happy New Year all.
5
712
If others can reproduce, I'd definitely be encouraged to start looking for experiments that would disprove that hypothesis. It's a big claim!!
1
5
3,809
Text adventure games are said to have “sparse rewards” which is one of the things make it hard for RL algorithms to solve. However, they’re very rewarding to play. Where is the reward coming from? It seems to me like discovering new states (rooms you can visit, things you can pick up), is intrinsically rewarding. This is a great paper about using “curiosity” (i.e. finding new states to visit) as a reward in sparse reward environments (pathak22.github.io/noreward-…)
4
282
Exactly, and I am not a statistician by far!! It's a big claim with many possible explanations!
1
4
1,235
Super interesting development from @a_karvonen. A 50M parameter model trained on chess move sequences not only learns how to play chess (making sequences of moves not in the training set) but can also be shown with probing to have developed a world model of the board. So any question of at least the possibility of the emergence of world models in LLMs seems answered.
I trained Chess-GPT, a 50M parameter LLM, to play at 1500 ELO. We can visualize its internal state of the board. In addition, to better predict the next character it estimates the ELO of the players involved. 🧵
3
472
I reproduced the system prompt from ChatGPT in the system message over the API. That was the only difference. One said the current date was in May, one in December.
3
126
Wondering what other context window enhancers and diminishers might be out there and how to quantify them (without breaking the OpenAI budget!) h/t @voooooogel
so a couple days ago i made a shitpost about tipping chatgpt, and someone replied "huh would this actually help performance" so i decided to test it and IT ACTUALLY WORKS WTF
3
856
Takeaways: -- OpenAI's Ada embeddings can underperform open-source embeddings and are way more expensive -- Embeddings capture so much information that even the simplest of models are able to do solid text/sentiment classification -- Here's some code to compare embedding models for your task: github.com/robalynch1122/Com…
Embedding models are so good at capturing content and semantics of text that even a basic logistic regression model trained on them can get surprisingly good results on text classification and sentiment analysis tasks (saving the need for heavy model training and loading). Even though a fine-tuned BERT model or LLM will likely exceed them, 90% f1-scores are totally achievable. However, not all embedding models are created equal and OpenAI's Ada embeddings can underperform other fully open-source embeddings like BERT and Sentence Transformers (while being a lot more expensive). Check out the performance of three embedding models on a text categorization and sentiment analysis task in the attached image. Incidentally, experiments like these can be useful for quickly benchmarking different embedding models on categorization tasks so you can choose the best (not sure if this extends to other types of tasks like clustering). I included the code so you can benchmark and compare your own tasks quickly. github.com/robalynch1122/Com…
1
3
744
Wow. This blew up. Thanks @arstechnica and @benjedwards. Thanks too to @IanArawjo efforts in trying (and failing) to replicate! Super interested to see what (if anything) others find on this and on similar experiments like @voooooogel’s tipping finding.
Is ChatGPT becoming lazier because it’s December? People run tests to find out trib.al/F7m0zSc
2
750
Embedding models are so good at capturing content and semantics of text that even a basic logistic regression model trained on them can get surprisingly good results on text classification and sentiment analysis tasks (saving the need for heavy model training and loading). Even though a fine-tuned BERT model or LLM will likely exceed them, 90% f1-scores are totally achievable. However, not all embedding models are created equal and OpenAI's Ada embeddings can underperform other fully open-source embeddings like BERT and Sentence Transformers (while being a lot more expensive). Check out the performance of three embedding models on a text categorization and sentiment analysis task in the attached image. Incidentally, experiments like these can be useful for quickly benchmarking different embedding models on categorization tasks so you can choose the best (not sure if this extends to other types of tasks like clustering). I included the code so you can benchmark and compare your own tasks quickly. github.com/robalynch1122/Com…
1
3
1,177
I didn’t and @gwern caught this too, but @voooooogel fixed and was able to repro on character count (higher N than 80 was needed), I retweeted that. Looks like it really is there.
3
1,495
7/ Check out this paper for more depth on this topic: Attention in Psychology, Neuroscience, and Machine Learning buff.ly/3V7NsHU
2
759
I am very much not a statistician. 😅That's why I threw out the distribution and repro steps too.
1
48
Spent some time playing Zork on my phone (see prior very long tweet), shout out to Frotz on App Store for making classic text adventure games accessible on mobile. First takeaway, it’s not easy out of the gate at all. Spent time stuck in a maze and building a picture of the map in your own head takes a while. Good news is that the state, action, new state, … sequence provides the perfect type of data for LLM fine tuning and RL in general. Some adjustments to the original proposed path of action, the rewards given by the inbuilt game score are way too sparse relative to moving around in the environment as a whole. So will need to find a way to model progress and rewards with no human labeling/bootstrapping only if I’m going to start with PPO. This is a challenge…
1
2
294
9/ Want to learn more about attention mechanisms in the brain? Check out these resources: Brain mechanisms associated with internally directed attention and self-generated thought buff.ly/41VMxxb The Dorsal Attention Network buff.ly/448e8MQ
1
2
401
Robot soma 🤣 Using PPO reinforcement learning I fine-tuned Llama2 13B (with only 20GB of VRAM!) to produce hugely more positive responses using a BERT sentiment measure as a reward function. Blue is distribution of sentiment before training, red after. Next will put the same base model into the text-adventure game and train with a reward function that attempts to maximize curious exploration. (Code example from here: github.com/huggingface/trl/b…, modified to use Llama2 13B with PEFT)
1
411
9/ Want to dive deeper into transformers and self-attention? Check out these resources: "Attention is All You Need" (original paper): buff.ly/2H7XB0v Illustrated Transformer: buff.ly/2AJhvNq
1
2
251
1/ Are there connections between AI models like transformers and the human brain? We previously discussed self-attention in transformers and attention mechanisms in the brain. Now, let's focus on the fascinating similarities between these systems!
1
1
1,175
1/ Have you ever wondered how AI can understand and generate human-like text? The secret lies in an architecture called transformers, which use a powerful mechanism called self-attention to learn context and dependencies in the input data. Let's dive in!
1
1
304
8/ In summary, the idea of life as an unbroken algorithmic run, driven by energy gradients, free energy minimization, and entropy, provides an interesting perspective on how our complex world came to be.
1
1
208
1/ Is all life on Earth a single long running algoritm?: Ever considered that life on Earth might be an unbroken, continuous process from its origin to the present day? Let's dive into this intriguing idea and explore how energy gradients might have driven the emergence of life.
1
237
We’ll come back to coordinates a bit later but now we’re going to jump over to machine learning. Specifically, machine learning on words which is a type of “natural language processing”.
1
55
6/ The brain also relies on Hebbian learning to strengthen synaptic connections over time. This principle, often summed up as "neurons that fire together wire together," allows our brains to learn, adapt, and create associations between related stimuli.
1
1
18
7/ The attention mechanisms in our brains are intricate and interconnected, enabling us to process the vast amounts of sensory information we encounter every day. These neural networks play a critical role in our perception, decision-making, and learning.
1
1
56
Yes way more research needed as it seems totally odd and unlikely to me too, but after I reproed it three times over the weekend with no p-hacking, I figured I should throw it out so others can try and replicate (or fail to!)
1
53
Replying to @highlightmv_
Rob Thomas - Lonely No More or Ace of Base - Beautiful Life is the closest I’ve found. It’s been driving me nuts too.
1
57
tl;dr summary: The Turing Test is likely no longer a useful measure of human-level AI capabilities, but being able to complete a text adventure game (like Zork) to human-level could be a good goal/canary of goal-seeking AGI. LLMs seem unable do this as they're deeply trained on knowledge but not action and "text-embodiment" combined with traditional forms of RL on LLMs may be a path to overcoming this.
Answering the question objectively and accurately of whether or not an AI model (or system of AI models) is independently and successfully goal-seeking seems of pretty key concern to ⏸️/⏹️-ers, ⏩-ers, people who believe we're close, people who believe we're far off and everyone in-between. I'd like to propose a potentially useful AGI test called the "Zork-Like test". What is Zork and why Zork-Like? Zork (en.wikipedia.org/wiki/Zork) is a text adventure game and a defined environment with an achievable goal that is not stated but needs to be discovered by the player, and conveniently also has a score that can be maximized and a score/turn ratio that can be measured. It's the perfect "toy" test of human-level goal discovery, recognition and action in a simple simulated text-only environment accessible to LLMs. But we can't use Zork itself as there are walkthroughs of it in the training set, so the Zork-Like game is of similar complexity but with a different map and some different actions that don't make it into training sets, or can be recreated easily if one suspects it has. Here's the Zork-Like Test: Before "using superhuman persuasion to build a bioweapon using human agents" or "inventing new physics and solving antibiotic resistance", an AI needs to be able to understand the goal of a Zork-Like game (not in the training set) and max out its Zork-Like score efficiently to human level without human feedback and only bootstrapping from interacting with the text environment. [Side note: I'm curious on the thoughts of anyone including @sama, @ylecun, @roon, @emollick, and others in e/acc (@beffjezos) and Safety (@robertskmiles) circles on this test and if the p(+bootstrapped Zork completion by AI models) - > p(+doom) / p(+utopia) logic rings true. I do at least think even the most dedicated and imaginative⏹️-ers (@AISafetyMemes) would concede that evil AGI induced bioweapons manufacturing in the real world is a greater challenge than getting high Zork scores in a narrower/defined world and is therefore at very least useful as a canary of goal-seeking AGI.] The Zork test (nevermind about Zork-like test) fails miserably today, even with a full Zork walkthrough apparently in the GPT-4 training set. State of the art ML Zork score is about 50, and human unassisted GPT-4 is about 10 (arxiv.org/pdf/2304.02868.pdf, arxiv.org/pdf/2107.08408.pdf). This shouldn't be too shocking, LLMs are missing a lot of key components like memory, self-reflection and the ability to take independent action without prompting, but on the other hand, an LLM seems to be the perfect "type" of model to be able to play text games. I think the reason for the failure reduces to one of the key problems in AI research in general, that of out-of-distribution generalization. It's becoming clear (arxiv.org/abs/2312.16337) that even if large scale LLMs do have emergent abilities, that far more of their success is down to incredibly good generalization on the training set which at this point for GPT-4 is approaching "the entire set of written human knowledge (up until the training knowledge cutoff)". So why can't LLMs play Zork, what's missing in training set and does that show potential paths forward? I think that a lot of LLMs goal-directed weaknesses (even when placed in loops/pairs or "teams" and given in-context "memory" like @yoheinakajima's BabyAGI and AutoGPT) are due to the fact that pre-trained LLMs have "read" everything but they have "done" nothing. Their training on knowledge is deep but their first-hand training on taking action is non-existent. Likely the biggest leap in capabilities on the language-side in the AI Spring came from RLHF which led to instruction tuned models. It seems like RL, without the HF is a good bet on the way forward, i.e. reinforcement learning, with environmental (not human) feedback happening online/actively as the model attempts to do something and gets feedback. If its possible for current stage LLMs to build a world model (a big if) it may come from a setup like this. (@ylecun has spoken about how embodiment is a likely needed for true human-level abilities). If the above is true this seems to lead to another rule of thumb: a system (consisting of some number of AI models interacting) cannot be called an AGI unless it is able to dynamically update its own weights as it acts in a way that improves its later actions vs. prior actions without any human labeling or intervention (except as a part of the environment). This is because the real world will always contain out-of-distribution to text training data obstacles which no fixed weight model could overcome. Our brains do this in real-time, LLMs do not (today) at all. It could also mean that "hallucinations" (and models that have baked in knowledge like LLMs) may actually have outsized value, they provide a form of knowledge directed exploration that can only be refined with environmental feedback on which ones are good (advance the overall goal) and which ones are bad (do not advance the overall goal). This seems to complete the circle of why the Zork-Like test "works", why models fail it today and why current LLMs cannot be thought of as having any risk on goal-seeking without passing this test first. In light of the above, I'll lay out my (almost certainly failed) path to moving forward the Zork-Like test and 2024 pet project (which will proceed extremely slowly, or potentially not at all, due to time and cost constraints). 1) Play a Python Zork reference (like github.com/KadenBiel/Python-…) multiple times to set a human baseline of total number of playthroughs allowed and score/move ratio (or find the same) 2) Take two 7B LLMs, one to be used as a policy/player model and one to be used as a reward model 3) Create a Zork environment that can interact with the policy and reward LLM 4) Use RL techniques like DPO/PPO and techniques borrowed from RLHF (but with no human labeling) to fine-tune both models as they play. 5) See if any "sparks" of generalization occur Super interested in anyone's thoughts, holes in my logic, ways to add rigor or greater definition or interest in participating! Happy New Year all.
327
So this is why people are excited about, and will to pay for embeddings. They allow us to do math and geometry with meaning that allow all kind of powerful features like semantic search.
1
194
6/ Although there are fundamental differences between self-attention in transformers and attention mechanisms in the human brain, the intriguing similarities between these systems can shed light on the nature of intelligence.
1
1
1,076