Announcing our BFCL benchmark results for OpenAI's structured output, which also tests the *contents* of generated outputs.
1. @OpenAI "strict" function-calling (FC) is slightly worse on the 2024-08-06 model, but better on every prior model.
2. OpenAI handily beats Anthropic at function calling (+15%).
3. Prompt engineering on Anthropic's Claude 3.5 Sonnet is just as good as OpenAI FC.
4. @boundaryML's BAML is still SOTA at FC on every model. With BAML, even GPT-3.5/Haiku perform on par with GPT-4o/Sonnet (-2%).
The models are not bad. The way you prompt and JSON.parse is. Here's our breakdown, and how BAML outperforms other structured generation techniques.
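To make the JSON.parse point concrete, here is a minimal Python sketch (using `json.loads` as the analogue of `JSON.parse`). It is an illustration only, not BAML's actual parser: the reply text, `parse_strict`, and `parse_lenient` are all hypothetical names invented for this example.

```python
import json
import re

# Hypothetical model reply: the data is right, but it is wrapped in prose
# and contains a trailing comma, which strict JSON forbids.
RAW_REPLY = (
    "Sure! Here is the function call: "
    '{"name": "get_weather", "arguments": {"city": "Paris",}} '
    "Let me know if you need anything else."
)


def parse_strict(text: str) -> dict:
    """Parse the whole reply as JSON; any surrounding prose makes this fail."""
    return json.loads(text)


def parse_lenient(text: str) -> dict:
    """Illustrative lenient parse: isolate the outermost {...} and repair
    trailing commas before handing the candidate to json.loads."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    candidate = text[start : end + 1]
    # Drop commas that sit directly before a closing brace or bracket.
    # Caveat: this naive regex would also touch such a sequence inside a string.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    return json.loads(candidate)


if __name__ == "__main__":
    try:
        parse_strict(RAW_REPLY)
    except json.JSONDecodeError as err:
        print("strict parse failed:", err.msg)
    print("lenient parse:", parse_lenient(RAW_REPLY))
```

The model produced the right function name and arguments in both cases; only the strict parser counts the reply as a failure.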