How Do I Choose the
Best Tools for My Project?
So far we have looked at the analytics stack in purely abstract terms. In this section we give you some advice on how to select tools for your stack. As with most things, the right tool in one setting won’t be right in another, so this advice covers the things you should consider when selecting a particular tool. For simplicity, we will divide the tools into four areas: data storage, data privacy, data analysis and visualisation.
Data Storage
Once upon a time, companies had very few choices for how to store their data.
They could use some form of structured database (often MySQL, DB2 or Oracle), they could use spreadsheets (amazingly, this is still done by many companies) or they could use some form of roll-your-own unstructured data store. The advent of the cloud, coupled with the growth in big data, has seen a plethora of new storage approaches emerge.
Broadly, these approaches fall into three categories:
- highly structured databases,
- lightweight key-value stores, and
- unstructured storage (such as NoSQL document stores).
Which approach is right for your analytics stack will depend on a number of factors which we discuss below.
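To make the three categories concrete, the sketch below stores the same hypothetical customer record in each style, using only Python’s standard library (a relational table stands in for a structured database, a dict for a key-value store, and JSON documents for unstructured storage):

```python
import json
import sqlite3

# Hypothetical customer record used to illustrate the three storage styles.
customer = {"id": 1, "name": "Ada Lovelace", "city": "London"}

# 1. Highly structured: a relational table with a fixed schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
db.execute("INSERT INTO customers VALUES (:id, :name, :city)", customer)

# 2. Lightweight key-value: records are looked up by a single key.
kv_store = {f"customer:{customer['id']}": customer}

# 3. Unstructured / document style: each record is serialised whole,
#    so different records may have entirely different shapes.
doc_store = [json.dumps(customer)]

name = db.execute("SELECT name FROM customers WHERE id = 1").fetchone()[0]
print(name)                            # from the relational table
print(kv_store["customer:1"]["city"])  # from the key-value store
print(json.loads(doc_store[0])["id"])  # from the document store
```

The right choice depends on how rigid your schema is and how you need to query the data, which is what the factors below help you decide.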
Volume of Data
The first thing to consider is the volume of data. This will directly influence your choice of storage. Really huge datasets may need to be stored in some form of unstructured data lake, probably using a modern NoSQL protocol, or they may require you to use a proprietary database such as Microsoft Azure SQL Data Warehouse.
By contrast, a customer database for a small business can often be stored on a single server (with appropriate backups, of course).
Nature of the Data
The nature of the data will directly influence the nature of the storage. Some data has a very obvious natural structure that has to be preserved. A good example is health records, which will always contain certain items of data such as date of birth, details of any allergies or medical conditions, lists of vaccinations, what medication has been prescribed, and so on.
Other data may naturally lend itself to key-value style storage. Data with little or very irregular structure may not be suitable for an SQL database, in which case you will need to choose an alternative.
Use Case for Data
In many cases, analysis may be a secondary use case for your data. Take the example of a bank. Here, the primary use case for customer account data is to provide banking services. This means that your data storage must be suitable for your customer-facing and in-branch systems. Considerations here may include required speed of access, API constraints and indexing or search requirements. You may even decide that it is better to have two separate versions of the data, one for analytics and the other for the main business use.
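The idea of keeping two versions of the data can be sketched with an in-memory SQLite database (the table and column names here are hypothetical): the operational table serves the customer-facing systems, while a bulk-refreshed copy holding only the columns analysts need serves the analytics stack.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Hypothetical operational table used by customer-facing systems.
db.execute("CREATE TABLE accounts (id INTEGER, owner TEXT, balance REAL)")
db.executemany("INSERT INTO accounts VALUES (?, ?, ?)",
               [(1, "alice", 120.0), (2, "bob", 75.5)])

# A separate analytics copy: refreshed in bulk and holding only the
# columns analysts need, so analytical queries never touch the live table.
db.execute("CREATE TABLE accounts_analytics AS SELECT id, balance FROM accounts")

rows = db.execute("SELECT COUNT(*), SUM(balance) FROM accounts_analytics").fetchone()
print(rows)  # (2, 195.5)
```

In a real deployment the copy would typically live in a separate database or warehouse and be refreshed on a schedule, but the separation of concerns is the same.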
Data Privacy
One thing you need to be cognisant of is any restrictions on where your data may be located. For instance, your company policies may state that all data must be held on company premises. If that is the case, then you can’t use a cloud-based data warehouse tool such as Google’s BigQuery.
It’s worth highlighting here that GDPR does not ban you from storing personal data outside the EU. However, it does require you to ensure that the data is appropriately protected and that this protection is legally binding.
GDPR mandates that you take reasonable steps to ensure your data is stored securely by means of “appropriate technical and organisational measures”. It is deliberately vague about what these measures constitute, but it does make certain suggestions. When considering what measures are appropriate you need to assess the risks and use a combination of organisational policies and physical or technical security measures.
You must also consider the current state of the art in security. This means being aware of the latest approaches such as two-factor authentication, TLS and encryption. However, you are also allowed to take cost into account: for a small shop, spending tens of thousands on a hardware firewall would not be reasonable.
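As one illustration of a “technical measure”, the sketch below stores salted password hashes rather than plaintext passwords, using only Python’s standard library; the iteration count is an illustrative choice, not a recommendation:

```python
import hashlib
import os

def hash_password(password, salt=None):
    """Derive a salted hash so plaintext passwords are never stored."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password, salt, expected):
    """Re-derive the hash with the stored salt and compare."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000) == expected

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True
print(verify_password("wrong guess", salt, digest))                   # False
```

The same principle, never holding sensitive values in recoverable form unless you have to, applies across the stack, from credentials to backups.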
One of the important requirements that is easily overlooked is that you must be able to restore full access to your data and systems in a “timely manner” following any serious incident. In effect this means you must have a disaster recovery plan. This is part of the requirement to ensure the “confidentiality, integrity and availability” of your data. Therefore, you should also consider whether you need to include anonymization tools right in your analytics stack, or whether you will use other approaches to ensure data privacy is maintained.
Data Analysis
Data analysis is a very broad term – Wikipedia defines it as “… a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.” There are a huge number of tools that can be used to help perform data analysis tasks. These include tools for extracting and finding data, tools for modelling the data and tools to extract information from the data.
Locating the correct data is a key part of data analysis.
There are a number of approaches to data mining and searching. Some analysts might need the ability to run SQL queries in order to access the data (in which case your stack must support SQL). Others may want to use a natural language query tool to search the data. Yet others may use simple keyword searching. Whatever your analysts’ requirements, your analytics stack needs to offer the right support, while remaining cognisant of the requirements for data privacy.
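Two of these access styles can be sketched against the same toy dataset (the table and product names are invented): a raw SQL query for analysts who want direct access, and a simple keyword filter for those who do not write SQL.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, product TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, "red widget", 10.0), (2, "blue widget", 15.0), (3, "gadget", 4.5)])

# SQL access: the analyst writes the query directly.
total = db.execute(
    "SELECT SUM(amount) FROM orders WHERE product LIKE '%widget%'").fetchone()[0]

# Keyword search: the stack translates a bare keyword into a filter.
def keyword_search(keyword):
    return [row for row in db.execute("SELECT id, product FROM orders")
            if keyword in row[1]]

print(total)                     # 25.0
print(keyword_search("widget"))  # [(1, 'red widget'), (2, 'blue widget')]
```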
Building models that use the underlying data to predict future events is a key task for many data analysts. These models may be simple VBA scripts in an Excel spreadsheet, or they may be more complex models created in R or SPSS. Again, your stack will need to offer suitable hooks and access to allow these models to be created and to produce usable data.
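As a minimal sketch of such a model, written here in plain Python rather than R or VBA, the function below fits a straight line by ordinary least squares and uses it to predict the next value (the sales figures are invented):

```python
# Fit y = slope * x + intercept by ordinary least squares (closed form).
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical monthly sales figures; predict month 6.
months = [1, 2, 3, 4, 5]
sales = [100, 120, 140, 160, 180]  # perfectly linear, so the fit is exact
slope, intercept = fit_line(months, sales)
prediction = slope * 6 + intercept
print(round(prediction))  # 200
```

Real models are usually far richer than this, but whatever tool produces them, the stack’s job is the same: expose the underlying data to the modelling tool and accept its output.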
Visualisation
Often data needs to be visualised as an integral part of the analysis. And almost certainly, visualisations are necessary to disseminate the results of the analysis. Once, data visualisation was limited to simple things like graphs plotted in R or charts created in Excel. Over recent years there has been an explosion in the field of data visualisation, with new chart types being invented and systems that are able to display dynamic dashboards built with point-and-click interfaces.
Building a Privacy-Preserving Analytics Stack
In the rest of this section we look at how to comply with the requirements imposed by GDPR while still leveraging data analysis.
Since GDPR only relates to personal data, any data that is not personal is not covered by the regulation. This means that if you are able to completely remove any personal identifiers from the data, that data is no longer subject to the rules. This is where anonymization comes in.
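A minimal sketch of removing direct identifiers is shown below (the field names and the secret are hypothetical). Note that replacing an identifier with a keyed hash only pseudonymises the data under GDPR; quasi-identifiers such as age and city may still permit re-identification, which is why dedicated anonymization tools exist:

```python
import hashlib

# Hypothetical record containing direct identifiers.
record = {"name": "Ada Lovelace", "email": "ada@example.com",
          "age": 36, "city": "London"}

DIRECT_IDENTIFIERS = {"name", "email"}

def strip_identifiers(record, secret=b"rotate-me"):
    """Drop direct identifiers, keeping a keyed hash so rows belonging to
    the same person can still be linked. The result is pseudonymised, not
    fully anonymised: quasi-identifiers like age and city remain."""
    token = hashlib.sha256(secret + record["email"].encode()).hexdigest()[:12]
    return {"token": token,
            **{k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}}

cleaned = strip_identifiers(record)
print(cleaned["age"], cleaned["city"])        # 36 London
print("name" in cleaned, "email" in cleaned)  # False False
```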
Aircloak Insights is the first solution to offer real-time database anonymization that allows analysts to query anonymized data exactly as if it were the original raw data. In this section we will explain how Aircloak’s technology works and show why it is the first GDPR-compliant tool for database anonymization.