Unraveling financial crimes like money laundering is a notoriously difficult task, especially when criminals purposely cover their tracks. It gets a little easier when you have advanced tools, such as text analytics, machine learning, and a graph database, which is what the International Consortium of Investigative Journalists (ICIJ) used in its latest investigation, dubbed the FinCEN Files.
The FinCEN Files is based on the leak of about 2,100 suspicious activity reports, known as SARS, that were sent to the U.S. Treasury’s Financial Crimes Enforcement Network, or FinCEN, between 2011 and 2017. SARS are written by compliance officials at banks in the US whenever fraud is suspected in a transaction. Anytime a transaction involves US dollars, it must go through a US bank, which means these reports pile up at the US Treasury Department.
Somebody with access to the SARS, which have rarely been revealed to the public, sent them to BuzzFeed News, and that publication in turn brought in ICIJ to help analyze them. That investigation was completed recently, and BuzzFeed and ICIJ have begun publishing reports based on the investigation.
The ICIJ has experience in this sort of thing. You will recall the ICIJ made headlines in 2015 with Swiss Leaks, in 2016 with the Panama Papers report, and in 2018 with its West Africa Leaks.
But the FinCEN files differed from previous ICIJ efforts in one major aspect: the volume of the data. The ICIJ had just 2,100 SARS (in PDF format) and about 400 spreadsheets. For comparison, the Panama Papers totaled 11.5 million source documents, comprising 2.6TB of data.
What the FinCEN files lacked in size, they made up for in complexity. “The volumes was not in millions of records like Panama Papers, but the complexity was very high, so we had to incorporate different approaches to do the analysis,” says Emilia Struck, the ICIJ research editor who oversaw the data and research teams that worked on the FinCEN files.
The data in the spreadsheets was fairly structured, so it could go straight into the Neo4j graph database that ICIJ journalists used to conduct cluster analysis and find the links connecting parties in transactions, including individuals, companies, and banks.
The information in the PDFs, however, required a bit more work. The SARS basically are written accounts of suspicious activity, as typed by bank compliance officers. ICIJ attempted to use natural language processing (NLP) algorithms to automatically extract the relevant information so that it could be used for downstream analysis tools, including the graph database. However, the NLP approach did not pan out.
“It didn’t work very well, even though we had done a lot of [automated extractions], so we had to use the power of collaboration,” she told Datanami. “We had more than 85 journalists in 35 countries to extract the data and structure it in a format so we could actually get the corresponding connections from the banks around the world, plus the transactions and other details, that the computer wasn’t able to get in the best way.”
Some compliance officers are better at communicating information than others, so there was that to contend with. According to Struck, about half the addresses listed in the SARS had some sort of classification error. For example, bank transactions coming from China would often be given the country code “CH.” Unfortunately, CH is the country code for Switzerland.
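The country-code mix-up described above is the kind of error a simple consistency check can surface for human review. The following is a minimal sketch, not ICIJ's actual method: the record fields (`address`, `country_code`), the tiny lookup table, and the function name are all assumptions made for illustration.

```python
# Hypothetical lookup table; a real pipeline would load the full ISO 3166-1 list.
ISO_ALPHA2 = {
    "CH": "Switzerland",
    "CN": "China",
    "US": "United States",
    "GB": "United Kingdom",
}


def flag_country_mismatch(record):
    """Return True when the address text names a different country than the code implies."""
    code = record["country_code"].strip().upper()
    address = record["address"].lower()
    expected = ISO_ALPHA2.get(code)
    if expected is None:
        # Unknown code: always route to a human reviewer.
        return True
    # Collect every known country name that appears in the free-text address.
    mentioned = {name for name in ISO_ALPHA2.values() if name.lower() in address}
    # Flag only when some country is named and it is not the one the code implies.
    return bool(mentioned) and expected not in mentioned


# Made-up example records, not taken from the leak.
records = [
    {"address": "12 Nanjing Road, Shanghai, China", "country_code": "CH"},
    {"address": "Bahnhofstrasse 1, Zurich, Switzerland", "country_code": "CH"},
]
flags = [flag_country_mismatch(r) for r in records]
```

A check like this only narrows the pile; an address that names no country at all still needs a person to read it.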
“The language was not standardized across all the reports,” Struck said. “The reports are full of banks names. But even if you extract the banks’ names, it doesn’t mean they’re always part of the transaction. So you would need to combine and create a set of rules to extract the information of the transactions.”
ICIJ journalists spent seven months working the PDF data, correcting addresses, and essentially re-creating the underlying transactions. It succeeded in turning that into about 100,000 transactions.
At the same time, it was basically working the other way on the spreadsheet data, which represented about 100,000 transactions but lacked details about the entities (which the PDFs had). Here, the Neo4j database and Linkurious reporting tools helped to work backwards and identify entities in this transactional data. There was some overlap between these two data sets, but not much, Struck said.
ICIJ also used an optical character recognition (OCR) and document search tool called DataShare. This offering, which was developed in-house, allows ICIJ journalists to digitize documents and then conduct batch searches against them.
In addition to the roughly 2,500 FinCEN files that it received, ICIJ brought other sets of data to bear on the investigation. One of these was a collection of nearly 18,000 documents that identified companies registered with the UK government as limited partnerships and limited liability partnerships (LPs and LLPs). Armed with this publicly accessible data, ICIJ was able to spot flagrant discrepancies between the bank transactions and reported revenue.
“You could have, for instance, a company registered in the UK reporting to the UK a turnover of $1,000, and in data we have, we see a transaction for that company for the same time period for more than $1 million,” Struck said. “So we calculated the excess that was not reported to the UK company registry.”
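The excess Struck describes boils down to summing a company's transactions for a period and subtracting the turnover it reported. Here is a rough sketch of that arithmetic, assuming hypothetical field names (`company`, `year`, `amount_usd`) and made-up figures; it is an illustration, not ICIJ's code.

```python
from collections import defaultdict


def unreported_excess(transactions, reported_turnover):
    """Map (company, year) to the amount by which transactions exceed reported turnover."""
    totals = defaultdict(float)
    for t in transactions:
        totals[(t["company"], t["year"])] += t["amount_usd"]

    excess = {}
    for key, total in totals.items():
        reported = reported_turnover.get(key)
        # Skip companies with no filing on record; only keep genuine overshoots.
        if reported is not None and total > reported:
            excess[key] = total - reported
    return excess


# Illustrative data only.
transactions = [
    {"company": "Example LLP", "year": 2014, "amount_usd": 700_000.0},
    {"company": "Example LLP", "year": 2014, "amount_usd": 400_000.0},
]
reported_turnover = {("Example LLP", 2014): 1_000.0}
excess = unreported_excess(transactions, reported_turnover)
```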
It was not uncommon for individuals to conceal their transactions by getting family members involved, by using a series of shell companies to send and receive money, and by using multiple hops across banks. These types of moves often stymie financial regulators armed with Excel and a pad of paper, but these patterns pop right out in a graph database.
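To see why multi-hop chains are natural in a graph, consider a breadth-first search over sender-to-receiver edges. This is a plain-Python sketch with invented entity names; ICIJ worked in Neo4j, and nothing here reflects its actual queries.

```python
from collections import deque


def trace_money_path(transfers, source, target):
    """Return the shortest chain of hops from source to target, or None if there is none."""
    graph = {}
    for sender, receiver in transfers:
        graph.setdefault(sender, []).append(receiver)

    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


# Invented transfers: money moves through two shell companies before reaching a relative.
transfers = [
    ("Person X", "Shell 1"),
    ("Shell 1", "Shell 2"),
    ("Shell 2", "Relative Y"),
    ("Person X", "Bank Z"),
]
path = trace_money_path(transfers, "Person X", "Relative Y")
```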
“Where we have a lot of public data and combined it with the structured data, we used the graph database for the analysis,” Struck said. “You could have one signature, one person tied to hundreds of companies. They’re not the real owners, but they appear as the connection point for those companies. And that would allow us to [identify who] are the information agents that enable the registration of those companies.”
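The "one person tied to hundreds of companies" pattern amounts to finding nodes with an unusually high number of connections. A minimal sketch in plain Python follows, with a hypothetical threshold and invented names; a graph database would express the same idea as a degree query.

```python
from collections import defaultdict


def find_hub_signatories(edges, min_companies=3):
    """Return people linked to at least min_companies distinct companies."""
    companies_by_person = defaultdict(set)
    for person, company in edges:
        companies_by_person[person].add(company)
    return {
        person: sorted(companies)
        for person, companies in companies_by_person.items()
        if len(companies) >= min_companies
    }


# Invented (signatory, company) pairs.
edges = [
    ("Agent A", "Alpha LP"),
    ("Agent A", "Beta LLP"),
    ("Agent A", "Gamma LLP"),
    ("Owner B", "Delta LP"),
]
hubs = find_hub_signatories(edges)
```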
A graph database is an ideal tool for this type of work, says Michael Hunger, who is on Neo4j’s developer relations team. “It’s purpose built to handle relationship and managing…large amounts of relationships,” Hunger said. “If you use a relational database, you would have to use a lot of joins, and the performance of the database will drop, especially with larger volumes. But a graph database was built for that.”
A relational database can find connections between pieces of data, but the user must know what question to ask. In a graph database, especially one equipped with machine learning algorithms, users are empowered to ask more open-ended questions, which helps flesh out unknown unknowns, says Rik Van Bruggen, who is Neo4j’s regional VP in Belgium.
“In a case like this, with anti-money laundering or fraud or an investigative context, it’s super interesting because you don’t know what you don’t know. You don’t know what the criminals have come up with,” Van Bruggen said. “With graph data science, it really helps investigators, people who are not familiar with the data yet, to immediately give them a little bit of a hint on where to start. It will identify, semi-automatically, clusters. It will tell you these parts are more connected, more important. It will immediately tell you that, without you having to browse and spend hours looking for something.”
Between 2011 and 2017, there were 12 million SARS filed, according to the Treasury Department, which means the SARS analyzed by ICIJ represented just 0.02% of the total number of SARS during that period. What would these tools reveal if they were pointed at the whole dataset? “It would be an interesting challenge,” Struck said.
First published at Datanami