Austin ML Journal Club

The Austin ML Journal Club brings together ML/AI practitioners for monthly deep dives into seminal research. We read papers in advance, meet virtually under the Chatham House Rule, and critically examine methodologies, experimental design, and real-world applicability.


LLMs Get Lost in Multi-Turn Conversation

The authors shard fully specified benchmark instructions into pieces, reveal one piece per simulated turn, and measure an average 39% performance drop across 15 LLMs, driven mostly by a collapse in reliability rather than ability. We valued the rigor of the within-instruction design but found the sharded user simulation rigid and unrealistic, and we wanted more from the paper on how models fail once degradation begins.

Why Do Multi-Agent LLM Systems Fail?

MAST is an empirically grounded taxonomy of 14 failure modes for multi-agent LLM systems, built from over a thousand annotated traces. We questioned how many of these modes are genuinely multi-agent, since most are reducible to single-agent pathologies. The dataset and taxonomy are carefully built, though the methodology has caveats and the case for multi-agent systems is thin.

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

This paper compares supervised fine-tuning (SFT) and reinforcement learning (RL) for foundation model post-training, showing that RL with outcome-based rewards generalizes across rule-based textual and visual tasks, while SFT tends to memorize training data and struggles with out-of-distribution scenarios.

Efficient GPT-4V Level Multimodal Large Language Model for Deployment on Edge Devices

MiniCPM-V presents a series of lightweight multimodal large language models that achieve GPT-4V level performance while being deployable on edge devices like mobile phones, addressing computational costs and privacy concerns of cloud-based models.

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Apple Research tests reasoning capabilities using controlled logic puzzles, finding dramatic performance drops and models failing even when given exact solutions. We examine methodological concerns about difficulty definitions, missing qualitative analysis, and troubling reliability implications for practitioners.

AI as Normal Technology

This article challenges both AI doomsday narratives and hype by framing AI as a normal transformative technology comparable to electricity or aviation. Drawing from historical technology adoption patterns, the authors argue for pragmatic policy approaches focused on deployment-level safeguards, sector-specific regulation, and institutional resilience rather than speculative fears about autonomous superintelligence. A refreshing, evidence-based perspective for practitioners tired of extremes.

On the Theoretical Limitations of Embedding-Based Retrieval

This preprint explores theoretical limits of single-vector embeddings through communication complexity theory, proving that embedding dimension bounds the number of representable top-k document combinations. Our discussion focuses on the gap between elegant theory and problematic experiments, embedding truncation, and tasks designed for memory not semantics. An interesting theoretical framework that needs more careful empirical validation.

Modeling Tabular Data using Conditional GAN

Tabular data synthesis (data augmentation) is under-studied compared to the synthesis of unstructured data. This paper uses a conditional GAN to model properties unique to tabular data, such as mixed data types and class imbalance. The technique has considerable potential for improving models and preserving privacy. It is currently available in the Synthetic Data Vault library in Python.

Gorilla: Large Language Model Connected with Massive APIs

The power of LLMs in a commercial setting comes from their ability to use other tools and integrate business domain knowledge. Unfortunately, it is still challenging to get LLMs to work well with custom APIs that interface with custom processes and data. This paper is interesting largely because it may be a step toward getting LLMs to use custom tooling accurately.

Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations

Amidst the LLM hype, algorithmic bias in a critical domain such as healthcare continues to be overlooked. This algorithm-audit paper found racial bias in a widely used healthcare system and discussed the problem of using the wrong target variable. The paper is a few years old but the message is still relevant, and we discussed what’s happened since then.

Are Emergent Abilities of Large Language Models a Mirage?

Recent research observed that the largest models show dramatic jumps in performance on a wide variety of tasks compared to smaller models. This paper argues that such so-called emergence largely reflects the evaluation metric used. Switching to metrics that are known to scale smoothly with the per-token error reveals a much more predictable picture.
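The metric argument can be illustrated with a toy calculation. This is a minimal sketch, not the paper's experiment: it assumes token errors are independent, and the per-token accuracies and the answer length are made-up values chosen only for illustration. Under that assumption, an all-or-nothing exact-match score is the per-token accuracy raised to the answer length, so a steady improvement per token can look like a sudden jump.

```python
# Toy illustration only: the numbers below are hypothetical, not from the paper.

def exact_match_accuracy(per_token_accuracy: float, answer_length: int) -> float:
    """Chance that every token in the answer is correct.

    Assumes each token is correct independently with the same probability.
    """
    return per_token_accuracy ** answer_length


ANSWER_LENGTH = 10  # hypothetical answer length in tokens
per_token_accuracies = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]  # hypothetical model sizes

for p in per_token_accuracies:
    em = exact_match_accuracy(p, ANSWER_LENGTH)
    print(f"per-token accuracy {p:.2f} -> exact match {em:.4f}")
```

The per-token column rises in even steps while the exact-match column stays near zero for the weaker settings and then climbs quickly, which is the kind of curve that can be read as emergence.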

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

This paper proposes a new technique to align LLMs with human preferences without using RL. The authors report that the method is more robust than RLHF and yields better-performing models.

Kshitij Aggarwal

Why Do Tree-based Models Still Outperform Deep Learning on Typical Tabular Data?

This paper compares the performance of deep learning techniques with that of traditional tree-based methods on a new set of 45 tabular datasets. The authors analyze the inductive biases of tree-based versus neural network models to guide the development of improved tabular-specific neural network models.

Kathrine Behrman

Neural Machine Translation by Jointly Learning to Align and Translate

This paper marks an important step in the development of machine translation (MT). It came out just as Neural Machine Translation (NMT) was taking off as a successor to Statistical Machine Translation (SMT), and it is a milestone on the way to Transformer-based NMT. The authors introduce a novel attention mechanism for MT and show that it improves on prior recurrent neural network NMT approaches, particularly on long sentences.

Meghann Agarwal, Hongsup Shin, Claude

Constitutional AI: Harmlessness from AI Feedback

There is an arms race of large language models (LLMs) in industry, with companies using different approaches and techniques. Anthropic claims to take a more cautious approach than others, one that minimizes harm from LLMs. Let’s look into constitutional AI, the core algorithm of their LLM, to understand how this harm mitigation works.

AutoKnow: Self-Driving Knowledge Collection for Products of Thousands of Types

In e-commerce, it is challenging to organize and categorize products that merchants describe in various ways. Finding a unified language and taxonomy has always been an underlying effort in commerce. This paper uses various ML algorithms to address this challenge.

Saina Lajevardi, Hongsup Shin

Let’s Verify Step by Step

This is an interesting new paper by OpenAI that discusses how the principles we use to solve math problems can be applied to AI. The paper evaluates different approaches to solving a dataset of math problems. With this approach, they trained the model not only to get the right answer but also to “think” through the problem to arrive at it.

Akshata Mohan, Hongsup Shin

Visualization in Bayesian workflow

This paper summarizes types of data visualization that we can use in Bayesian modeling and inference. It also provides a good overview of how to do Bayesian data analysis properly, including model validation such as prior and posterior predictive checks.

Reviewing a Case Study About Real Estate Market Prediction

The paper reviewed here attempts to predict real estate market outcomes using binary classification. Though the paper’s research design and results were lacking, it gave us a chance to discuss good practices for experimental design.

Athula Pudhiyidath

Leakage and the Reproducibility Crisis in ML-based Science

Data leakage is a common problem in ML-based science, leading to reproducibility failures and overly optimistic conclusions. We discussed 8 types of data leakage and the use of model info sheets to identify and reduce all leakage types.

Zero-Shot Text-to-Image Generation

There is so much hype in generative AI. But how does it actually work? We discuss OpenAI’s DALL-E paper to understand the model architecture and, more importantly, whether their model validation is solid and reasonable.

“Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI

Garbage in, garbage out. It seems like a lot of people in the ML community still don’t understand this logic. We discuss poor data-handling practices and their critical ramifications.

Why ML Journal Club

Welcome to our journal club! I talk about why I organized an in-person journal club with my fellow ML practitioner friends in Austin, TX.