Chances are good that you've either sent or received a message that was garbled thanks to autocorrect or autocomplete. There are hundreds of memes in which these errors cause miscommunication, often with hilarious results. Like this one. Or this one. Or this one.
But it's not so funny when it happens to your scientific paper. Academic papers frequently include supplemental files with data, charts and graphs supporting the paper's conclusion. Many of those files are spreadsheets.
Since 2004, scientists have noticed that Microsoft Excel, the popular spreadsheet program, has a bad habit of changing some gene names into other types of data. For example, geneticists use MARCH1 as shorthand for membrane associated ring-CH-type finger 1. But Excel interprets MARCH1 as a date, automatically converting it to 1-Mar or another designation for the first of March.
Excel misidentifies some other gene names as coordinates or floating points. You might be able to suss out that 1-Mar is actually MARCH1, but how about 2.31E+13? That's how Excel converts the RIKEN identifier 2310009E13. RIKEN is a massive research institute in Japan that, among other things, conducts genome projects. (For the record, we tried plugging in the RIKEN identifier into an Excel spreadsheet and had it automatically converted to 2.31E+19, still wrong and still not what's intended in those academic papers.)
Despite the fact that researchers first identified this problem more than a decade ago, it persists today. Researchers Mark Ziemann, Yotam Eren and Assam El-Osta screened more than 35,000 supplemental files to see how widespread the problem actually was. They developed automated software that searched for any data that looked like it could be a list of gene names. They identified 7,467 gene lists representing the supplemental data for 3,597 published papers in 18 different journals. It turned out 704 of those published papers included files with Excel format errors in them. That's a 19.6 percent failure rate. In other words, nearly one in five published papers include supplemental files with Excel formatting errors in them.
This isn't just an inconvenience. Researchers rely upon published works to guide their own projects. If the files include errors, it can be difficult and time-consuming to sort out the issue and find the relevant data.
Why not just turn the auto-formatting features off? According to the research team, there's no way to permanently disable the features in Excel. You might be able to manually change each option every time you create a new file, but that's inefficient. They did point out, however, that Google Sheets doesn't follow the same approach to formatting. What's more, if you first put all your data into Google Sheets and then copy it into a different spreadsheet program, the formatting remains untouched.
It's a bit comical that a feature intended to help people use spreadsheets is causing such a headache in academic circles. Perhaps future versions of Excel will include an option to disable auto-formatting features indefinitely. Until then, geneticists should double-check their work or consider switching to Google Sheets.