Computing the LHC Data
With 15 petabytes of data (that's 15,000,000 gigabytes) gathered by the LHC detectors every year, scientists have an enormous task ahead of them. How do you process that much information? How do you know you're looking at something significant within such a large data set? Even using a supercomputer, processing that much information could take thousands of hours. Meanwhile, the LHC would continue accumulating even more data.
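To get a feel for the scale, the yearly figure can be turned into an average data rate. The short Python sketch below assumes decimal units (1 petabyte = 10^15 bytes) and assumes the data arrives evenly across a 365-day year. Both are simplifications made for illustration, and the function name is our own.

```python
# Back-of-the-envelope arithmetic: average data rate implied by 15 PB per year.
# Assumptions (ours, for illustration): decimal units and an even spread
# of data over a full 365-day year.
PETABYTE = 10**15                      # bytes
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def average_rate_bytes_per_second(petabytes_per_year):
    """Return the average rate in bytes per second for a yearly volume."""
    return petabytes_per_year * PETABYTE / SECONDS_PER_YEAR


rate = average_rate_bytes_per_second(15)
print(f"{rate / 10**6:.0f} megabytes per second on average")
```

Under those assumptions, the average works out to a little under half a gigabyte every second, around the clock.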
CERN's solution to this problem is the LHC Computing Grid. The grid is a network of computers, each of which can analyze a chunk of data on its own. Once a computer completes its analysis, it can send the findings on to a centralized computer and accept a new chunk of data. As long as scientists can divide the data up into chunks, the system works well. Within the computer industry this approach is called grid computing.
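The divide-and-collect idea described above can be sketched in a few lines of Python. This is only an illustration, not CERN's software: threads on one machine stand in for the separate computers of a grid, and the function names, the threshold and the sample readings are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(events, chunk_size):
    """Divide the full data set into independent chunks."""
    return [events[i:i + chunk_size] for i in range(0, len(events), chunk_size)]


def analyze_chunk(chunk, threshold=100.0):
    """Hypothetical analysis: count the readings above a threshold."""
    return sum(1 for energy in chunk if energy > threshold)


def run_grid(events, chunk_size=4, workers=3):
    """Hand chunks to workers, then combine their findings centrally."""
    chunks = split_into_chunks(events, chunk_size)
    # Each worker analyzes one chunk on its own and picks up a new chunk
    # as soon as it finishes the previous one.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(analyze_chunk, chunks))
    # The "centralized computer" step: merge the partial results.
    return sum(partial_counts)


readings = [12.0, 150.5, 99.9, 230.1, 45.0, 101.0, 88.8, 300.0, 5.5, 120.0]
print(run_grid(readings))
```

The sketch works only because each chunk can be analyzed without knowing anything about the others, which is the same condition the grid depends on.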
The scientists at CERN decided to use relatively inexpensive equipment to perform their calculations. Instead of purchasing cutting-edge data servers and processors, CERN concentrates on off-the-shelf hardware that works well in a network. The approach is very similar to the strategy Google employs: it's more cost-efficient to purchase lots of average hardware than a few advanced pieces of equipment.
Using a special kind of software called middleware, the network of computers will be able to store and analyze data for every experiment conducted at the LHC. The system is organized into tiers:
- Tier 0 is CERN's computing system, which will first process information and divide it into chunks for the other tiers.
- Twelve Tier 1 sites located in several countries will accept data from CERN over dedicated computer connections. These connections will be able to transmit data at 10 gigabits per second. The Tier 1 sites will further process data and divide it up to send further down the grid.
- More than 100 Tier 2 sites will connect with the Tier 1 sites. Most of these sites are universities or scientific institutions. Each site will have multiple computers available to process and analyze data. As each processing job completes, the sites will push data back up the tier system. The connection between Tier 1 and Tier 2 is a standard network connection.
Any Tier 2 site can access any Tier 1 site. This gives research institutions and universities the chance to focus on the specific information their research requires.
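The tier structure above can be pictured as data flowing down a hierarchy and results flowing back up. The Python sketch below is a simplified model under our own assumptions: the `Site` class, the site names and the round-robin split are hypothetical, and each Tier 2 site is attached to a single Tier 1 parent to keep the example small, whereas in the real grid any Tier 2 site can reach any Tier 1 site.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Site:
    """One computing site in the (simplified) tier hierarchy."""
    name: str
    tier: int
    children: List["Site"] = field(default_factory=list)


def split(data, parts):
    """Divide data into `parts` slices, dealing items out round-robin."""
    return [data[i::parts] for i in range(parts)]


def process(site, data):
    """Send data down the tiers and pass the combined result back up."""
    if not site.children:
        # Lowest tier: do the actual work (a sum stands in for real analysis).
        return sum(data)
    shares = split(data, len(site.children))
    # Each child handles its share; the parent merges what comes back.
    return sum(process(child, share) for child, share in zip(site.children, shares))


tier2_sites = [Site(f"university-{n}", tier=2) for n in range(4)]
tier1_sites = [
    Site("tier1-a", tier=1, children=tier2_sites[:2]),
    Site("tier1-b", tier=1, children=tier2_sites[2:]),
]
tier0 = Site("CERN", tier=0, children=tier1_sites)

print(process(tier0, list(range(1, 11))))
```

Because every item lands in exactly one share at each level, the total computed at the top matches what a single machine would produce from the whole data set.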
One challenge with such a large network is data security. CERN determined that the network couldn't rely on firewalls because of the amount of data traffic on the system. Instead, the system relies on identification and authorization procedures to prevent unauthorized access to LHC data.
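The difference between identification (proving who you are) and authorization (checking what you're allowed to see) can be shown with a small Python sketch. It is a generic illustration, not the grid's actual mechanism: the secret key, the user names, the data set names and the token scheme are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical values for illustration only; a real system would never
# hard-code a secret or keep its permissions in the source code.
SECRET_KEY = b"demo-only-secret"
PERMISSIONS = {
    "alice": {"atlas-run-1"},
    "bob": {"cms-run-1"},
}


def issue_token(user):
    """Create a token that only the holder of SECRET_KEY could produce."""
    return hmac.new(SECRET_KEY, user.encode(), hashlib.sha256).hexdigest()


def is_identified(user, token):
    """Identification: does the token really belong to this user?"""
    return hmac.compare_digest(issue_token(user), token)


def can_read(user, token, dataset):
    """Authorization: an identified user may read only permitted data sets."""
    return is_identified(user, token) and dataset in PERMISSIONS.get(user, set())


alice_token = issue_token("alice")
print(can_read("alice", alice_token, "atlas-run-1"))
print(can_read("alice", alice_token, "cms-run-1"))
```

Both checks have to pass: a valid identity with no permission for a data set is turned away, and so is a request with the right user name but a forged token.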
Some people say that worrying about data security is a moot point. That's because they think the LHC will end up destroying the entire world.
Is it really possible? Find out in the next section.