HEP-JEPA: A foundation model for collider physics using joint embedding predictive architecture

Abstract

We present a transformer architecture-based foundation model for tasks at high-energy particle colliders such as the Large Hadron Collider. We train the model to classify jets using a self-supervised strategy inspired by the Joint Embedding Predictive Architecture. We use the JetClass dataset containing 100M jets of various known particles to pre-train the model with a data-centric approach — the model uses a fraction of the jet constituents as the context to predict the embeddings of the unseen target constituents. Our pre-trained model fares well with other datasets for standard classification benchmark tasks. We test our model on two additional downstream tasks: top tagging and differentiating light-quark jets from gluon jets. We also evaluate our model with task-specific metrics and baselines and compare it with state-of-the-art models in high-energy physics.

Experiments

Few-Shot Learning Evaluations

The model was evaluated on the JetClass dataset, where it matched or outperformed models trained from scratch, with the largest gains in low-label regimes:

Two regimes were evaluated: frozen (pre-trained backbone not updated) and fine-tuned
Evaluated at label fractions: 0.05%, 0.5%, 2%, 10%, and 100%
Compared pre-trained HEP-JEPA model with model trained from scratch
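
The label fractions listed above can be realised by drawing a class-balanced subset of the labelled training set. The sketch below shows one way to do this and is not necessarily the sampling used by the authors; the function `stratified_subset`, its arguments, and the toy label list are hypothetical.

```python
import random
from collections import defaultdict

def stratified_subset(labels, fraction, seed=0):
    """Return sorted indices of a class-balanced subset of the data.

    Assumption: every class keeps the same fraction of its samples,
    with at least one sample per class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for indices in by_class.values():
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

# Toy stand-in for JetClass: 10 classes with 2000 labelled jets each.
labels = [c for c in range(10) for _ in range(2000)]
fractions = [0.0005, 0.005, 0.02, 0.10, 1.0]  # 0.05%, 0.5%, 2%, 10%, 100%
subset_sizes = {f: len(stratified_subset(labels, f)) for f in fractions}
```

Fixing the seed keeps the subset identical for the from-scratch and fine-tuned runs, so both models see the same labelled jets at each fraction.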

Downstream Task Evaluations

The model was tested on two critical tasks:

Top tagging using the Top Tagging Reference dataset
Quark-gluon jet differentiation using the quark-gluon tagging dataset

Ablation Studies

The study explored various design choices, including:

Masking strategies (random vs. contiguous token selection)
Number of target tokens to predict
Physics bias in the attention mechanism
Integration of register tokens
Impact of physics-inspired data augmentations
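
The two masking strategies in the list above can be illustrated on a sequence of token indices. This is a minimal sketch, assuming the tokens already have a fixed one-dimensional ordering and that "contiguous" means one consecutive run in that ordering; the ordering actually used by the model may differ, and `select_targets` and `target_len` are hypothetical names.

```python
import random

def select_targets(num_tokens, target_len, strategy, rng):
    """Split token indices into (context, targets).

    strategy="random":     target_len indices drawn anywhere in the sequence.
    strategy="contiguous": one consecutive run of target_len indices.
    """
    if not 0 < target_len < num_tokens:
        raise ValueError("target_len must leave at least one context token")
    if strategy == "random":
        targets = sorted(rng.sample(range(num_tokens), target_len))
    elif strategy == "contiguous":
        start = rng.randrange(num_tokens - target_len + 1)
        targets = list(range(start, start + target_len))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    target_set = set(targets)
    # Everything that is not a target stays visible as context.
    context = [i for i in range(num_tokens) if i not in target_set]
    return context, targets

rng = random.Random(0)
ctx_r, tgt_r = select_targets(16, 4, "random", rng)
ctx_c, tgt_c = select_targets(16, 4, "contiguous", rng)
```

In both cases the context encoder would only see the `context` indices, and the predictor is asked for the embeddings at the `targets` indices.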

Results and Findings

Key Findings

Physics bias improved performance by approximately 2%
Register tokens increased performance by around 2%
The contiguous masking strategy with one target token performed best
Physics-inspired augmentations did not significantly improve performance
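
This page does not spell out the form of the physics bias. As a rough illustration only, the sketch below assumes it resembles the pairwise interaction features of Particle Transformer, where logarithms of quantities such as the angular distance Δ, the transverse momentum scale k_T, and the momentum fraction z are passed through a learned embedding and added to the pre-softmax attention scores. The function name, the `eps` clamp, and the toy particles are hypothetical.

```python
import math

def pairwise_features(a, b, eps=1e-8):
    """Log pairwise features for two particles given as (pt, rapidity, phi).

    Assumption: features follow the Particle Transformer style definitions
    delta = sqrt(dy^2 + dphi^2), k_t = min(pt) * delta, z = min(pt) / sum(pt).
    """
    pt_a, y_a, phi_a = a
    pt_b, y_b, phi_b = b
    # Wrap the azimuthal difference into [-pi, pi).
    dphi = (phi_a - phi_b + math.pi) % (2 * math.pi) - math.pi
    delta = math.hypot(y_a - y_b, dphi)
    pt_min = min(pt_a, pt_b)
    k_t = pt_min * delta
    z = pt_min / (pt_a + pt_b)
    # Clamp before the log so coincident particles do not produce -inf.
    return {
        "ln_delta": math.log(max(delta, eps)),
        "ln_kt": math.log(max(k_t, eps)),
        "ln_z": math.log(max(z, eps)),
    }

# Toy constituents: (pt [GeV], rapidity, phi); values are made up.
p1 = (100.0, 0.0, 3.0)
p2 = (50.0, 0.0, -3.0)
feats = pairwise_features(p1, p2)
```

Evaluating this for every pair of constituents gives an N x N matrix per feature, which is the shape needed for an additive bias on an N x N attention map.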

JetClass Metrics

% of Labels (Size)	Model	Accuracy
0.05% (5K)	From Scratch	0.505
0.05% (5K)	HEP-JEPA, Fine-Tuning	0.564
0.5% (50K)	From Scratch	0.586
0.5% (50K)	HEP-JEPA, Fine-Tuning	0.624
2% (2M)	From Scratch	0.668
2% (2M)	HEP-JEPA, Fine-Tuning	0.669
10% (10M)	From Scratch	0.683
10% (10M)	HEP-JEPA, Fine-Tuning	0.685
100% (100M)	From Scratch	0.698
100% (100M)	HEP-JEPA, Fine-Tuning	0.698
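
To make the trend in the table easier to read, the short snippet below computes the accuracy gain of the fine-tuned HEP-JEPA model over the from-scratch model at each label fraction. The numbers are copied from the table; the variable names are arbitrary.

```python
# Accuracies copied from the table above: (from scratch, HEP-JEPA fine-tuned).
accuracy = {
    "0.05%": (0.505, 0.564),
    "0.5%": (0.586, 0.624),
    "2%": (0.668, 0.669),
    "10%": (0.683, 0.685),
    "100%": (0.698, 0.698),
}

# Absolute accuracy gain from pre-training, rounded to the table's precision.
gain = {frac: round(ft - scratch, 3) for frac, (scratch, ft) in accuracy.items()}

for frac, g in gain.items():
    print(f"{frac:>6} labels: {g:+.3f}")
```

The gain is 0.059 at 0.05% of the labels and 0.038 at 0.5%, and it shrinks to at most 0.002 from 2% upwards, vanishing at 100%.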

Validation Loss Performance

Validation loss vs. training step for the two benchmark models trained in a few-shot learning setting for jet classification on the JetClass dataset with 0.5% labels (i.e., 50000 training samples). One model is trained from scratch, whereas the pre-trained HEP-JEPA model is fine-tuned.

The validation loss falls quickly for the HEP-JEPA model: it reaches the same minimum validation loss as the model trained from scratch, but three times faster.

Visualization

We visualise the representation learned by HEP-JEPA on 50k samples of JetClass sampled uniformly from each class. We construct the embedding for a sample by concatenating the max and mean pooling of the outputs of the context encoder and apply t-SNE on the pooled embedding.

We observe that events that contain lepton(s) are pushed to the right, while hadronic events are more towards the left.
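
The pooling step used to build the per-jet embedding can be sketched as follows. This is a minimal illustration on plain Python lists, assuming the context encoder returns one vector per token and that no padded tokens are present; `pooled_embedding` and the toy values are hypothetical.

```python
def pooled_embedding(token_embeddings):
    """Concatenate max pooling and mean pooling over the token axis.

    token_embeddings: list of per-token vectors (equal length) for one jet.
    Returns a single vector of length 2 * embedding_dim.
    """
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    columns = list(zip(*token_embeddings))  # one tuple per embedding dimension
    max_pool = [max(col) for col in columns]
    mean_pool = [sum(col) / len(col) for col in columns]
    return max_pool + mean_pool

# Toy jet with 3 tokens and embedding_dim = 2 (values are made up).
tokens = [[1.0, -2.0], [3.0, 0.0], [2.0, 5.0]]
jet_vector = pooled_embedding(tokens)
```

Stacking one such vector per jet gives the matrix that is handed to t-SNE, for example through scikit-learn's `sklearn.manifold.TSNE`, to obtain the two-dimensional map.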

Related Links

Several recent works have explored foundation models and self-supervised learning in high-energy physics (HEP).

OmniLearn and Particle Transformer use transformer-based architectures for HEP tasks, relying on supervised learning with simulated data and generative modelling.

Masked Particle Modelling (MPM) and OmniJet-α adapt masked modeling and generative pre-training from natural language processing to collider physics.

Concurrent with our work, J-JEPA adapts the JEPA paradigm for the task of top tagging: the authors pre-train the model on 1% of the top jet and light jet samples from JetClass and evaluate downstream performance on the Top Tagging Reference dataset. However, unlike our data-centric approach, the authors generate context and target tokens by clustering subjets. We also present more comprehensive evaluations on the entire JetClass dataset and on downstream applications, with better performance on these tasks.

Contrastive learning methods, such as those in Dillon et al. (2022), follow frameworks like SimCLR, but require carefully selected negative samples. In contrast, Joint Embedding Predictive Architectures (JEPA) have shown promising results in images, videos, and point clouds by learning in latent space without a decoder.

For an extensive survey on foundation models in HEP, see this and this.

BibTeX

@misc{bardhan2025hepjepafoundationmodelcollider,
      title={HEP-JEPA: A foundation model for collider physics using joint embedding predictive architecture}, 
      author={Jai Bardhan and Radhikesh Agrawal and Abhiram Tilak and Cyrin Neeraj and Subhadip Mitra},
      year={2025},
      eprint={2502.03933},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.03933}, 
}