Coffea accelerates the analysis of particle physics data
Analyzing the mountains of data generated by the Large Hadron Collider at the European CERN laboratory takes so long that even computers need coffee. Or rather, Coffea — Columnar Object Framework for Effective Analysis.
A package in the Python programming language, Coffea (pronounced like the stimulating drink) speeds up the analysis of massive datasets in high-energy physics research. Although Coffea streamlines the calculations, the main objective of the software is to optimize the scientists’ time.
“A human’s efficiency in producing scientific results is of course affected by the tools you have,” said Matteo Cremonesi, postdoctoral fellow at the US Department of Energy’s Fermi National Accelerator Laboratory. “If it takes me more than a day to get a single number out of a calculation – which often happens in high-energy physics – it’s going to hurt my effectiveness as a scientist.”
Frustrated by the tedious manual labor they faced when writing computer code to analyze LHC data, Cremonesi and Fermilab scientist Lindsey Gray assembled a team of Fermilab researchers in 2018 to adapt the techniques of big data to solve the most difficult questions of high-energy physics. Since then, a dozen research groups on the CMS experiment, one of the two large general-purpose detectors at the LHC, have adopted Coffea for their work.
Using information about particles generated in collisions, Coffea enables broad statistical analyses that refine researchers’ understanding of the underlying physics. (The LHC data processing facilities perform the initial conversion of the raw data into a format that particle physicists can use for analysis.) A typical analysis of the current LHC dataset involves processing approximately 10 billion particle events that can total over 50 terabytes of data. That’s the data equivalent of about 25,000 hours of streaming video on Netflix.
At the heart of Fermilab’s analysis tool is the move from a method known as event loop analysis to one called column analysis.
“You have a choice: do you iterate over each row and perform an operation across the columns, or do you iterate over the operations and apply each one to all the rows at once,” explained Fermilab postdoctoral researcher Nick Smith, the main developer of Coffea. “It’s essentially a question of the order of operations.”
For example, imagine that for each row, you wanted to sum the numbers in three columns. In event loop analysis, you would start by adding the three numbers in the first row. Then you would add the three numbers in the second row, move on to the third row, and so on. With a columnar approach, on the other hand, you would start by adding the first and second columns across all rows at once. Then you would add that result to the third column, again for all rows.
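The two strategies above can be sketched with NumPy; the three-column table and variable names here are illustrative, not taken from Coffea itself.

```python
import numpy as np

# A small table: each row is one "event", each column a quantity to sum.
table = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
])

# Event-loop style: visit each row in turn and sum its three columns.
loop_sums = []
for row in table:
    loop_sums.append(row[0] + row[1] + row[2])

# Columnar style: add column 0 to column 1 for all rows at once,
# then add column 2 to that intermediate result.
columnar_sums = (table[:, 0] + table[:, 1]) + table[:, 2]

print(loop_sums)      # [6.0, 15.0, 24.0]
print(columnar_sums)  # [ 6. 15. 24.]
```

Both produce the same sums; the columnar version simply hands the whole column to optimized array machinery instead of stepping through rows in interpreted Python.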
“Either way, the end result would be the same,” Smith said. “But there are trade-offs you make under the hood, in the machine, that have a big impact on efficiency.”
In datasets with many rows, columnar analysis performs about 100 times faster than event loop analysis in Python. Yet before Coffea, particle physicists primarily used event loop analysis in their work, even for datasets with millions or billions of collisions.
The Fermilab researchers decided to pursue a columnar approach, but they faced a daunting challenge: high-energy physics data cannot easily be represented in tabular form with rows and columns. One particle collision might produce a multitude of muons and few electrons, while the next might produce no muons and many electrons. Using a library of Python code called Awkward Array, the team devised a way to convert the jagged, nested structure of LHC data into arrays compatible with columnar analysis. Generally, each row corresponds to a collision, and each column corresponds to a property of a particle created during the collision.
The benefits of Coffea extend beyond faster execution times (minutes instead of hours or days for interpreted Python code) and more efficient use of computing resources. The software takes mundane coding decisions out of the hands of scientists, allowing them to work at a more abstract level with less chance of making mistakes.
“Researchers aren’t here to be programmers,” Smith said. “They are there to be data scientists.”
Cremonesi, who researches dark matter at CMS, was among the first researchers to use Coffea without a backup system. At first, he and the rest of the Fermilab team actively sought to persuade other groups to try the tool. Now researchers frequently approach them asking how to apply Coffea to their own work.
Soon, the use of Coffea will extend beyond CMS. Researchers at the Institute for Research and Innovation in Software for High Energy Physics, supported by the US National Science Foundation, plan to integrate Coffea into future analysis systems for both CMS and ATLAS, the LHC’s other large general-purpose detector. An LHC upgrade known as the High-Luminosity LHC, due for completion in the mid-2020s, will record around 100 times more data, making the efficient data analysis offered by Coffea even more valuable for international collaborators on the LHC experiments.
Going forward, the Fermilab team also plans to break Coffea down into multiple Python packages, allowing researchers to use only the components relevant to their work. For example, some scientists use Coffea primarily for its histogramming functionality, Gray said.
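Histogramming is itself a naturally columnar operation, which is one reason it is a popular entry point. The hypothetical sketch below uses plain NumPy rather than Coffea's own histogram tools, only to illustrate filling a histogram from an entire column of fabricated values at once.

```python
import numpy as np

# Fake muon transverse-momentum values in GeV, for illustration only.
rng = np.random.default_rng(0)
muon_pt = rng.exponential(scale=30.0, size=10_000)

# Fill a 20-bin histogram from the whole column in a single call,
# rather than looping over events one at a time.
counts, edges = np.histogram(muon_pt, bins=20, range=(0.0, 200.0))
```

Coffea's histogram tools add labeled, multi-dimensional axes on top of this basic idea, but the single-call, whole-column fill pattern is the same.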
For the Fermilab researchers, the success of Coffea reflects a necessary shift in the mindset of particle physicists.
“Historically, the way we do science has focused a lot on the material component of creating an experiment,” Cremonesi said. “But we have reached an era in physics research where managing the software component of our scientific process is just as important.”
Coffea promises to synchronize high-energy physics with recent advances in big data in other scientific fields. This cross-pollination may prove to be Coffea’s most important benefit.
“I think it’s important for us as a high-energy physics community to think about what kind of skills we’re imparting to the people we’re training,” Gray said. “Ensuring that we, as a field, are relevant to the rest of the world when it comes to data science is a good thing to do.”
US participation in CMS is supported by the Department of Energy Office of Science.
Fermilab is supported by the US Department of Energy’s Office of Science. The Office of Science is the largest supporter of basic physical science research in the United States and works to address some of the most pressing challenges of our time. For more information, visit science.energy.gov.