Causality & Tabular Foundation Modeling
Causal reasoning meets tabular-scale foundation models — understanding and predicting structured, heterogeneous real-world data with causal guarantees.
Landscape
Two fields converging: causal inference — the formal study of cause-and-effect under intervention — and tabular foundation modeling — scaling transformers and in-context learners to the structured, heterogeneous data that most real-world problems actually run on. The bet: combining causal structure with foundation model scale unlocks the kind of reliable prediction and decision-making that purely correlational models can't deliver.
Sub-areas
- Structural Causal Models (SCMs) — DAG-based frameworks for encoding cause-effect relationships (Pearl's do-calculus; the Neyman–Rubin potential-outcomes framework is the non-graphical counterpart)
- Causal discovery — learning causal graphs from observational or interventional data (PC algorithm, GES, NOTEARS, DiBS)
- Tabular deep learning — attention-based models for heterogeneous structured data: FT-Transformer, SAINT, TabNet, TabPFN
- In-context causal reasoning — can transformer-based in-context learners perform causal inference at inference time without explicit graph estimation?
- Treatment effect estimation — CATE, ITE, and counterfactual prediction (TARNet, DragonNet, Causal Forests)
- Foundation models for tabular data — cross-dataset pretraining, zero/few-shot generalization, unified tokenization of heterogeneous features (XTab, AnyLearn, CARTE)
- Distribution shift and invariant learning — IRM, DRO, causal representation as OOD defense
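The observational/interventional distinction underlying the SCM entries above is easiest to see in code. A minimal sketch (a hypothetical three-variable SCM of my own, with numbers chosen purely for illustration): conditioning on X absorbs the confounder Z, while simulating do(X=x) severs the Z → X edge and recovers the true causal effect.

```python
import numpy as np

# Toy SCM: Z -> X, Z -> Y, X -> Y. Regressing Y on X mixes in the
# confounder Z; intervening with do(X=x) breaks the Z -> X edge.
rng = np.random.default_rng(0)
n = 200_000

def sample(do_x=None):
    z = rng.normal(size=n)                                  # confounder
    x = (z + rng.normal(size=n)) if do_x is None else np.full(n, do_x)
    y = 2.0 * x + 3.0 * z + rng.normal(size=n)              # true effect of X is 2
    return x, y

# Observational: the regression slope of Y on X is biased upward by Z.
x, y = sample()
obs_slope = np.cov(x, y)[0, 1] / np.var(x)

# Interventional: E[Y | do(X=1)] - E[Y | do(X=0)] recovers the causal effect.
_, y1 = sample(do_x=1.0)
_, y0 = sample(do_x=0.0)
ate = y1.mean() - y0.mean()

print(f"observational slope ≈ {obs_slope:.2f}")  # ≈ 3.5, confounded
print(f"interventional ATE  ≈ {ate:.2f}")        # ≈ 2.0, the true effect
```

Same data-generating process, two different questions, two different answers: this gap is exactly what "causal guarantees" are supposed to close.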
Landmark papers
- Elements of Causal Inference — Peters, Mooij, Schölkopf, 2017. The textbook.
- Revisiting Deep Learning Models for Tabular Data (FT-Transformer) — Gorishniy et al., 2021. Honest benchmark; attention can match GBDTs at scale.
- TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second — Hollmann et al., 2022. In-context learning applied to tabular; landmark for the "tabular prior" idea.
- SAINT: Improved Neural Networks for Tabular Data — Somepalli et al., 2021. Row- and column-attention for tabular structure.
- Why do tree-based models still outperform deep learning on tabular data? — Grinsztajn et al., 2022. Calibrated skepticism — most DL still loses to XGBoost on real-world tabular data without careful engineering.
- Towards Foundational Models for Tabular Data — survey of the emerging field, 2024.
- Causally-motivated Shortcut Removal Using Auxiliary Labels — empirical causality applied to neural nets.
Key figures
Bernhard Schölkopf (causal representation learning, SCMs), Judea Pearl (do-calculus, counterfactuals), Jonas Peters (causal discovery, invariant causal prediction), Mihaela van der Schaar (ML for healthcare + causality), Léo Grinsztajn (tabular DL benchmarks), Noah Hollmann (TabPFN, in-context tabular learning).
Open Problems
- Can foundation models perform in-context causal reasoning? TabPFN shows in-context learning on tabular classification. The open question: can you preload a causal prior into a foundation model so it reasons about interventions — not just correlations — at inference time?
- Causal discovery at scale. Classical structure learning (PC, GES) breaks down above a few hundred variables. High-dimensional tabular data (EHR, financial, genomic) needs scalable, approximate causal graph estimation. Differentiable structure learning (NOTEARS, DiBS) is promising but fragile.
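The differentiable trick NOTEARS is built on is compact enough to show directly. Its acyclicity characterization (Zheng et al., 2018): a weighted adjacency matrix W encodes a DAG iff h(W) = tr(e^{W∘W}) − d = 0, which turns discrete graph search into smooth constrained optimization. The sketch below uses the identity tr(exp(M)) = Σᵢ exp(λᵢ(M)) to stay NumPy-only.

```python
import numpy as np

def notears_h(W: np.ndarray) -> float:
    """NOTEARS acyclicity penalty: 0 iff W is a DAG, > 0 if cycles exist."""
    d = W.shape[0]
    M = W * W  # elementwise square keeps the penalty non-negative
    # tr(exp(M)) computed via eigenvalues, avoiding scipy.linalg.expm
    return float(np.exp(np.linalg.eigvals(M)).sum().real) - d

# A DAG (upper-triangular weights) satisfies the constraint...
W_dag = np.array([[0.0, 1.5, 0.0],
                  [0.0, 0.0, 2.0],
                  [0.0, 0.0, 0.0]])
print(notears_h(W_dag))   # ≈ 0.0, no directed cycles

# ...while a two-node cycle is penalized.
W_cyc = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
print(notears_h(W_cyc))   # ≈ 1.09 > 0
```

The fragility noted above lives partly in this penalty: it is numerically tiny for long cycles with small weights, which is one reason the method degrades at scale.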
- Heterogeneous schema generalization. Real tabular foundation models must handle feature spaces that differ across datasets — different column names, types, semantics. How do you build a unified tokenization that transfers meaningfully?
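One way to make the tokenization question concrete (a toy sketch of my own, in the spirit of CARTE/XTab but not their actual schemes): each cell becomes a token fusing a column-name embedding with the cell value, so rows from different schemas land in one shared token space with no fixed column order or dataset-specific vocabulary.

```python
import hashlib
import numpy as np

DIM = 16

def name_embed(col: str) -> np.ndarray:
    # Stand-in for a text encoder over the column name: a deterministic
    # hash-seeded random vector (an assumption; real systems use an LM).
    seed = int.from_bytes(hashlib.sha256(col.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).normal(size=DIM)

def cell_token(col: str, value: float) -> np.ndarray:
    # Fuse semantics (which column) with content (standardized value).
    return name_embed(col) * value

def tokenize_row(row: dict) -> np.ndarray:
    # Rows from arbitrary schemas map into the same (num_cells, DIM) space;
    # a downstream set transformer would treat tokens as an unordered set.
    return np.stack([cell_token(c, v) for c, v in row.items()])

tokens = tokenize_row({"age": 0.4, "systolic_bp": -1.1})
print(tokens.shape)  # (2, 16)
```

Whether "transfers meaningfully" survives contact with real column-name semantics (synonyms, units, coded categories) is exactly the open problem.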
- OOD robustness for tabular data. Tabular distributions shift aggressively (hospital policies change, markets move, populations drift). Causal invariance (IRM, anchor regression) is theoretically sound but underperforms in practice. Why, and how to fix it?
- When does tabular DL beat GBDTs? Grinsztajn et al. show GBDTs win on most real benchmarks. The conditions under which transformers win (very large N, text-heavy features, cross-table pretraining) are not yet fully mapped.
- Treatment effect estimation with foundation models. CATE estimation (TARNet, DragonNet) works in relatively clean experimental setups. Can a pretrained tabular foundation model generalize to new treatment effect estimation tasks without retraining?
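As a baseline for what "without retraining" has to beat: the classic two-model (T-learner) recipe for CATE, here on synthetic data with a known heterogeneous effect. Loosely, TARNet and DragonNet refine this idea by replacing the two separate models with a shared learned representation and two outcome heads.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)
t = rng.integers(0, 2, size=n)            # randomized treatment, for simplicity
# True heterogeneous effect: tau(x) = 1 + 2x
y = x + t * (1.0 + 2.0 * x) + rng.normal(size=n)

def fit_linear(xs, ys):
    # Ordinary least squares via lstsq; returns (intercept, slope).
    A = np.column_stack([np.ones_like(xs), xs])
    coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return coef

b1 = fit_linear(x[t == 1], y[t == 1])     # treated-arm outcome model
b0 = fit_linear(x[t == 0], y[t == 0])     # control-arm outcome model

def cate(xq):
    # T-learner CATE: difference of the two fitted outcome surfaces.
    return (b1[0] - b0[0]) + (b1[1] - b0[1]) * xq

print(round(cate(0.0), 2))  # ≈ 1.0
print(round(cate(1.0), 2))  # ≈ 3.0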
Questions & Ideas
- Does in-context learning in TabPFN encode anything like causal structure, or is it purely pattern matching over training priors?
- Can you construct a "causal tabular prior" for pretraining — synthetic SCM-generated data with known causal graphs — and transfer causal reasoning to real datasets?
- Is there a natural way to encode interventional data (do(X=x)) alongside observational data in a shared tabular token space?
- What is the right inductive bias for a tabular foundation model that needs to generalize across hospitals, cohorts, or geographies under distribution shift?
- Can causal discovery algorithms be replaced by attention: does the attention matrix in a trained transformer on tabular data approximate the causal adjacency matrix?
- How should a tabular model handle missing-not-at-random data — where the missingness mechanism is itself causally structured?
- What role does scale play in tabular foundation models? The "scaling laws for tabular data" question is completely open.
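On the do(X=x) encoding question above, the simplest scheme I can think of (pure speculation on my part, not an established method): append a per-cell regime flag marking whether each value was observed or forced by intervention, so both kinds of rows live in one token space.

```python
import numpy as np

def encode_row(values: dict, intervened: set) -> np.ndarray:
    # Each cell -> (value, regime flag): flag 1.0 if the value was set
    # by do(), 0.0 if passively observed. A transformer could then
    # condition on regime per variable rather than per dataset.
    return np.array([[v, 1.0 if c in intervened else 0.0]
                     for c, v in values.items()])

obs = encode_row({"dose": 0.3, "outcome": 1.2}, intervened=set())
exp = encode_row({"dose": 0.3, "outcome": 0.9}, intervened={"dose"})
print(obs.tolist())  # [[0.3, 0.0], [1.2, 0.0]]
print(exp.tolist())  # [[0.3, 1.0], [0.9, 0.0]]
```

Whether a model trained on such flags actually learns do-semantics, rather than treating the flag as just another feature, is itself an open question.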
My Take
This section evolves as thinking develops.
The tabular domain is chronically underserved by deep learning research — most benchmark work focuses on vision and language, and the practical reality (XGBoost still often wins on structured data) has made it intellectually unfashionable. I think this is a mistake.
The intersection with causality is where the real prize sits. Most tabular applications — clinical decisions, policy evaluation, economic forecasting — fundamentally need causal answers, not just predictive accuracy. A model that achieves 94% AUC on a held-out test set is not the same as a model that correctly answers "what would happen if we changed X?" These are different questions, and confusing them is expensive in high-stakes settings.
The TabPFN line of work excites me most: if you can build a strong enough tabular prior through synthetic causal data, you might get causal reasoning essentially for free at inference time — without ever running a causal discovery algorithm on the target dataset. That's a genuinely new approach.
The practical angle: tabular foundation models could be the "BERT moment" for structured data in industry. Getting there requires solving the schema generalization problem and the causal reliability problem simultaneously — hard, but tractable.
Journal
2026-02-27 — Adding this research area as a fifth cluster alongside RL, world models, continual learning, and representation learning. The motivation: most AI research defaults to unstructured data (images, text). My background and interests pull toward structured, real-world data — EHR, financial signals, policy evaluation — where the causal question is unavoidable.
The plan: start with TabPFN and the Grinsztajn benchmark paper to get an honest picture of where tabular DL actually stands, then work backwards from the causal inference side (Peters et al., IRM) to understand what "causal guarantees" would even mean in a foundation model context.