Using data science and big data analytics against tax evasion and tax fraud


Big data is a term applied to datasets whose volume exceeds what conventional database technology and traditional software can handle. Such datasets can mix many kinds of information, depending on the business and on what needs to be stored. Data at this scale is used in many business areas; the one this report focuses on is the detection of tax evasion and tax fraud.

Fig 1. Tax evasion infographic [13].

Tax evasion is a huge cost to governments. Governments around the world are reported to lose more than 3.1 trillion annually, with Canada accounting for an estimated 81 billion of that total [2]. Tax collection practices that are out of date, or so complicated that corporations and individuals cannot follow them, also contribute to these losses [3]. Data science and big data analytics can help detect tax evasion and tax fraud [3]. One problem organizations run into is differentiating between law-abiding taxpayers and taxpayers who try to cheat the system [3].

One way to solve this problem is to apply data classification, clustering and pattern recognition to distinguish law-abiding taxpayers from those who are trying to cheat the system [3]. Tax fraud occurs in part because there is so much information to process that it cannot be reviewed in a timely manner, which allows fraudsters to take advantage of loopholes [3]. Data science and big data analytics have reduced this problem through the techniques discussed in this report. The report covers several aspects of data science and big data analytics in tax evasion and tax fraud detection: the types of data collected for this application and why they are considered big data, the big data analytics techniques currently in use, and the insights gained from the data.

What types of data are collected for this application, and why it is considered big data

Uncovering tax violations is a challenging task since many corporations have justifiable expenses and exemptions that are complex and hard to track [4]. Big data analytics can be used to detect anomalies and loopholes in the system and to predict where such anomalies might occur in the future [4]. Potential tax evaders are detected by analyzing taxpayer behavior in data collected from various sources [4]. It is therefore important to know what type of data is collected and where it comes from [5].

The IRS, for instance, collects data by tracking the online activities of individuals and businesses, such as online purchases made with electronic or card transactions, and by using social media sites such as Facebook and Twitter to access taxpayers’ information [4]. Tax agencies also observe taxpayers’ offline purchases to detect possible evasion, considering parameters such as waybills versus monthly returns, the percentage of purchases compared to sales, and the proportion of exempt sale items to taxable ones [6]. They also collect data from third parties, including banks, online marketplaces and government organizations [6]. The IRS compares the information gathered from third parties with the tax returns that taxpayers file to make sure the filed information is accurate [5]. It also purchases data from data brokers and adds it to its own datasets [5].
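
The comparison of third-party information with filed returns can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the taxpayer identifiers, the income figures, the function name `find_mismatches` and the 5 percent tolerance are hypothetical and do not come from the cited sources or from any agency's actual procedure.

```python
# Hypothetical data: identifiers, amounts and the 5% tolerance are illustrative only.
filed_income = {"TP-001": 52_000, "TP-002": 87_500, "TP-003": 40_000}
third_party_income = {"TP-001": 52_300, "TP-002": 121_000, "TP-003": 40_000}


def find_mismatches(filed, reported, tolerance=0.05):
    """Return taxpayers whose filed income falls short of third-party totals.

    The value stored for each flagged taxpayer is the relative shortfall,
    i.e. the share of third-party-reported income missing from the return.
    """
    flagged = {}
    for taxpayer_id, reported_total in reported.items():
        if reported_total <= 0:
            continue  # nothing to compare against
        declared = filed.get(taxpayer_id, 0)
        shortfall = (reported_total - declared) / reported_total
        if shortfall > tolerance:
            flagged[taxpayer_id] = round(shortfall, 3)
    return flagged


mismatches = find_mismatches(filed_income, third_party_income)
print(mismatches)
```

With this made-up data only TP-002 is flagged, because roughly 28 percent of the income reported by third parties is missing from the filed return. A flag of this kind would only be a reason to look closer, not proof of evasion.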

However, there are no clear predefined rules governing the information provided by third parties. This can be considered a privacy issue, since most of the regulations allowing tax agencies to access third-party information are old and have not been updated to reflect current technology [5]. Tax agencies obtain information from external sources in various ways, including technologies that track taxpayers’ phone conversations or personal emails [5]. According to a report by the American Civil Liberties Union (ACLU), the IRS started using a phone tracking technology called Stingray in 2009 to record data from phone conversations and messages and even to track users’ locations [5].

Tax agencies also use online activity trackers to check publicly available information that people post on social media sites [5]. The CRA, for example, uses social network analysis to check the consistency of information between individuals and businesses [7]. Tax agencies now have access to millions of records, thanks to modern technology and an internet that records users’ online activities. The result is a volume of data large enough to be called big data. The main features of big data are volume, velocity, variety and veracity [1]. Volume indicates the size of the data gathered from different sources [1]. Velocity refers to how quickly the data changes [1]. Variety refers to the different sources (social media, phones, etc.) from which the data is collected [1]; it can also refer to the various types of data [1]. Veracity refers to the quality of the data [1].

Fig 2. The 4 V’s of big data [10].

As mentioned earlier, tax agencies track taxpayers’ online activities and purchase data from data brokers to check the accuracy of the information individuals provide in their tax returns. In other words, they draw on various data sources to obtain an adequate amount of information. Formerly, an audit would check an individual or a business only by analyzing financial statements or industry norms [1]. This gave auditors very limited access to information; today, big data is a requirement in data analysis [1]. Audits now analyze large datasets from internal and external sources that record various types of information, such as demographics, taxpayer or business profiles, filing history, online data and call center data [1]. The collected data includes historical information on individuals over many years, from both internal and external sources, which indicates the volume and variety of the data available [1].

Another reason tax agencies need big data to detect fraudulent activity is the velocity of the information, especially from online sources. Information on online platforms changes rapidly, which requires modern analytics tools for the data analysis process [1]. Take China, for instance: there are more than 450 million taxpayers and 37,000 taxation administration offices. The number of tax-related monthly financial reports has reached 2.5 billion annually, the daily number of transaction records has reached 2 billion, and the total amount of data collected per year is 200 TB, which is considered big data [9]. Traditional auditing techniques cannot handle data of this size, so new tax evasion detection methods are needed [9].

What kinds of big data analytics techniques are employed in the analysis

The vast amount of data collected for tax purposes is cross-referenced using pattern recognition algorithms to identify trends and relationships that the human eye cannot recognize [11]. Data mining techniques used to detect fraud fall into two large subgroups: predictive tasks and descriptive tasks [1]. Predictive tasks, or predictive modelling, combine machine learning and similar technologies to produce a prediction as output [1]. Descriptive tasks combine association rules and cluster analysis to describe the data being examined [1]. Examples of advanced techniques in use are anomaly detection, advanced clustering and neural networks [11].

Fig 3. Different machine learning techniques used in tax evasion detection [12].

As mentioned earlier, one of the more popular techniques in tax fraud detection is predictive modelling, which is also called supervised learning. It allows tax agencies to use fraud and audit cases worked on in the past to predict which new cases resemble a successful one [12]. Supervised learning is heavily used to find cases that tax agencies have missed before. For this, the system needs past cases that were both successful and unsuccessful, so that the machine learning algorithm can learn to tell the two apart [12]. Supervised learning also removes emotion from audit selection [12]. Auditors often rely on gut feeling when picking whom to audit, whereas a supervised model detects likely tax evaders based on patterns learned from previously audited cases, whether those audits succeeded or failed [12].
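
To make the idea of learning from past audit outcomes concrete, here is a minimal sketch using a k-nearest-neighbours classifier. The choice of k-nearest neighbours is an assumption made for brevity and is not the method named in [12]; the two features, the labelled cases and the function name `predict` are likewise hypothetical.

```python
import math
from collections import Counter

# Hypothetical past audit cases: (deduction ratio, exempt sales share) and the
# outcome of the audit. Features and values are made up for illustration.
past_cases = [
    ((0.10, 0.05), "compliant"),
    ((0.12, 0.08), "compliant"),
    ((0.15, 0.10), "compliant"),
    ((0.55, 0.60), "evasion"),
    ((0.60, 0.70), "evasion"),
    ((0.65, 0.55), "evasion"),
]


def predict(features, cases, k=3):
    """Label a new return by majority vote among its k nearest audited cases."""
    nearest = sorted(cases, key=lambda case: math.dist(features, case[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(predict((0.58, 0.62), past_cases))  # resembles the past evasion cases
print(predict((0.11, 0.07), past_cases))  # resembles the past compliant cases
```

The sketch shows why both successful and unsuccessful audits are needed: without examples of each outcome, the vote among neighbours could never return more than one label.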

Another technique used in tax evasion detection is unsupervised learning [12]. It is used when no previously labeled cases are available and the tax agency does not know exactly what it is looking for. It helps the agency answer the question of what it is not seeing and which fraud techniques it does not yet know about [12]. This technique looks at big data without knowing what the final output should be [12]. Clustering is an example of unsupervised learning: it groups cases or accounts that are closely similar, and the ones that fall outside these clusters are outliers that need to be looked into further [12].
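
The clustering idea described above can be sketched with a plain k-means loop followed by a distance check. This is an illustrative sketch under stated assumptions: the two features per account, the data points, the fixed starting centroids and the distance threshold of 2.5 are all hypothetical, and a production system would more likely rely on an established library than on hand-written code.

```python
import math

# Hypothetical accounts described by two made-up, already scaled features.
accounts = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),  # one group of similar accounts
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.2),  # a second group
    (9.0, 1.0),                                      # an account unlike either group
]


def mean_point(points):
    """Component-wise mean of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


def find_outliers(points, centroids, max_distance):
    """Flag points that lie farther than max_distance from every centroid."""
    return [p for p in points
            if min(math.dist(p, c) for c in centroids) > max_distance]


# Starting centroids are fixed (one point from each visible group) so the
# example is deterministic; real k-means usually picks them at random.
centroids = k_means(accounts, centroids=[accounts[0], accounts[4]])
outliers = find_outliers(accounts, centroids, max_distance=2.5)
print(outliers)
```

No labels are used anywhere: the account at (9.0, 1.0) is flagged only because it sits far from both groups, which is the sense in which unsupervised methods can surface cases the agency did not know to look for.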

That said, tax agencies are only slowly moving towards machine learning techniques to detect tax fraud. The most common tools used to detect fraud are rule-based systems [8]. Rule-based systems draw on expert knowledge and historical fraud cases to create rules, which are then tested against new cases to detect a fraudulent taxpayer [8]. This technique can become obsolete, since government laws change and implementing new rules can be delayed for a long time [8].
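
A rule-based check can be sketched as a list of named conditions applied to each new return. The rules, thresholds and field names below are hypothetical illustrations and are not taken from [8] or from any real tax authority.

```python
# Hypothetical rules: names, thresholds and field names are illustrative only.
RULES = [
    ("deductions exceed 50% of income",
     lambda r: r["deductions"] > 0.5 * r["income"]),
    ("exempt sales exceed taxable sales",
     lambda r: r["exempt_sales"] > r["taxable_sales"]),
    ("third-party income exceeds filed income",
     lambda r: r["third_party_income"] > r["income"]),
]


def triggered_rules(tax_return):
    """Return the names of all rules that the given return violates."""
    return [name for name, check in RULES if check(tax_return)]


sample_return = {
    "income": 80_000,
    "deductions": 45_000,
    "exempt_sales": 10_000,
    "taxable_sales": 30_000,
    "third_party_income": 80_000,
}
print(triggered_rules(sample_return))
```

The sketch also shows the weakness noted above: every threshold is hard-coded, so when the law changes someone has to rewrite the rules by hand before the system catches up.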

Many other techniques have also been used. For instance, Taiwan has used association rules to enhance VAT evasion detection by searching for patterns among the attributes associated with VAT evasion [8]. Fuzzy association rules are extracted from a dataset in which both fraudulent and genuine transactions have occurred [8]. Another approach used ensemble iteration learning self-generating neural networks, a form of neural network, to help detect fraud in tax declarations [8]. This method uses marked data samples for weight adjustment and result validation, which makes it a supervised rather than an unsupervised approach [8]. A technique that does rely on unsupervised learning is the detection of atypical activities, which groups activities based on historical data and then checks whether a new activity is normal [8].
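
The association-rule idea can be illustrated by computing the support and confidence of a single candidate rule. Note the simplification: this sketch uses ordinary (crisp) association rules rather than the fuzzy rules mentioned in [8], and the records and attribute names are made up for illustration.

```python
# Hypothetical audited VAT records, each a set of observed attributes.
records = [
    {"cash_only", "no_invoice", "evasion"},
    {"cash_only", "no_invoice", "evasion"},
    {"cash_only", "invoice"},
    {"card", "invoice"},
    {"cash_only", "no_invoice"},
    {"card", "invoice"},
    {"card", "no_invoice", "evasion"},
    {"card", "invoice"},
]


def support(itemset, data):
    """Fraction of records that contain every attribute in itemset."""
    return sum(itemset <= record for record in data) / len(data)


def confidence(antecedent, consequent, data):
    """Of the records matching the antecedent, the fraction that also match the consequent."""
    return support(antecedent | consequent, data) / support(antecedent, data)


# Candidate rule: {cash_only, no_invoice} -> {evasion}
antecedent = {"cash_only", "no_invoice"}
consequent = {"evasion"}
rule_support = support(antecedent | consequent, records)
rule_confidence = confidence(antecedent, consequent, records)
print(f"support={rule_support:.3f} confidence={rule_confidence:.3f}")
```

In this toy data the rule covers 2 of 8 records (support 0.25), and 2 of the 3 records matching the antecedent were evasion cases (confidence about 0.667). Rule mining in practice searches over many candidate attribute combinations and keeps those above chosen support and confidence thresholds.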

The academic literature notes that unsupervised learning has produced limited results in tax fraud detection, since most work requires historical datasets that are clearly marked [8]. In summary, many approaches have been applied to tax fraud detection, but most of them are supervised learning techniques, which need historical data and audit results [8]. Supervised learning has had satisfactory results, even though marked data is very hard to come by across all the different types of taxes [8].

What insights are gained from the data (and what actions are taken, as the case may be)

Thanks to recent advancements in technology, tax agencies today have access to a continuous influx of data from various sources [15]. To make good use of this data, tax agencies are applying big data analytics tool sets to integrate and analyze information from multiple sources [15]. Consistent, integrated data helps tax administrations optimize their data management: they use big data mining tools to detect potential tax evaders and implement predictive models to detect new tax evasion behaviors and patterns, which helps prevent future fraudulent activity [15]. Big data has helped reduce the number of tax evaders considerably. The IRS, for example, has reported an increase of up to 400 percent in the detection of fraudulent activities and up to 1000 percent more profit from detecting potential tax evasion [16].

Traditional auditing methods fail to keep up with current data growth. Tax agencies are therefore using data analytics tools to optimize their compliance techniques by designing more intelligent auditing strategies, which work mainly from the risk factors in a taxpayer’s profile [15]. This model gives tax agencies a more dynamic framework for their auditing process, moves them towards a risk-based compliance strategy, and makes better use of the data and resources available [15]. Intelligent audits also help tax administrations use their massive datasets to identify behaviors with a high potential for tax evasion and to derive patterns that help prevent future evasion [15]. Because big data is collected from multiple structured and unstructured sources and changes quickly, tax agencies need smart data mining techniques to optimize their auditing process and check the consistency of taxpayers’ information.
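
A risk-based audit selection of the kind described above can be sketched as a weighted score over profile risk factors, with audits assigned to the highest-scoring taxpayers. The factors, weights, profiles and audit capacity below are hypothetical; [15] does not specify a scoring formula, so this is only one plausible shape such a strategy could take.

```python
# Hypothetical risk factors (each scaled 0..1) and weights; none come from [15].
WEIGHTS = {"income_mismatch": 0.5, "late_filings": 0.2, "cash_intensity": 0.3}

profiles = {
    "TP-001": {"income_mismatch": 0.1, "late_filings": 0.0, "cash_intensity": 0.2},
    "TP-002": {"income_mismatch": 0.8, "late_filings": 0.5, "cash_intensity": 0.9},
    "TP-003": {"income_mismatch": 0.4, "late_filings": 1.0, "cash_intensity": 0.1},
}


def risk_score(profile):
    """Weighted sum of the profile's risk factors."""
    return sum(weight * profile[factor] for factor, weight in WEIGHTS.items())


def select_for_audit(taxpayer_profiles, capacity):
    """Return the `capacity` taxpayers with the highest risk scores, highest first."""
    ranked = sorted(taxpayer_profiles,
                    key=lambda tp: risk_score(taxpayer_profiles[tp]),
                    reverse=True)
    return ranked[:capacity]


audit_list = select_for_audit(profiles, capacity=2)
print(audit_list)
```

With these made-up numbers the scores are about 0.11, 0.77 and 0.43, so TP-002 and TP-003 are selected. Ranking by score is what lets limited audit resources go to the profiles with the most risk rather than to a random or intuition-based sample.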

Conclusions, views and recommendations

Tax evasion is a problem across the entire world, and tax agencies are currently looking at the benefits of data science and big data [14]. To get effective results from big data analytics projects, organizations must have the skills to analyze this complex data [14]. Big data and data mining have also allowed organizations to work faster and find patterns that the human eye cannot see [14]. Statistical and computing skills are much needed in analytics to study big data and find previously undiscovered patterns that could help in tax fraud detection [1].

Big data analysis for tax fraud detection usually takes one of two forms, predictive or descriptive tasks. Ordinary accounting and taxation skills cannot meet that need on their own; statistical and computing skills must also be present in firms [1]. Firms and government agencies must continue to expand their use of big data analytics in tax auditing work [1]. Skills and infrastructure must grow as well: accounting skills and infrastructure remain necessary, but they need to be combined with statistical and computing skills [1]. Staff members must be trained to use different tools to analyze big data, and can take statistics and computing courses at universities, online or through certifications to grow their skill set [1]. In addition to training current employees, firms can hire new employees with statistics and computing backgrounds [1].

Another option to help firms grow in the field of big data analytics is cloud computing [1]. Many software and cloud services for fraud detection are available for businesses to take advantage of [1]. Software vendors whose products help in fraud detection include OpenText, SAS and Teradata. These vendors can help build the required infrastructure, such as cloud computing tools [1]. Businesses should have a plan before acquiring either the skill set or the infrastructure described above [1]. A firm plan lets a business know what direction it is heading in [1]. It is also important to note that acquiring the skill set is much more important than acquiring the infrastructure [1]. Big data analytics projects need careful planning, particularly tax fraud detection projects [1].