Part V

How Do I Choose the Best Tools for My Project?

So far we have looked at the analytics stack in a purely abstract manner. In this section we give you some advice on how to select tools for your stack. As with most things, the right tool for one setting won’t be right in another setting.

The advice below therefore covers what you should consider when selecting a particular tool. For simplicity, we divide the tools into three groups: data storage, data security and privacy, and data analysis and visualisation.

Data Storage Tools

Once upon a time, companies had very few choices for how to store their data. They could use some form of structured database (often MySQL, DB2 or Oracle), they could use spreadsheets (amazingly, this is still done by many companies) or they could use some form of roll-your-own unstructured data store. The advent of the cloud, coupled with the growth in big data, has seen a plethora of new storage approaches emerge.

Broadly, these approaches fall into three groups:

  • highly structured (relational) databases,
  • lightweight key-value stores,
  • and unstructured storage (such as NoSQL document stores or data lakes).

Which approach is right for your analytics stack will depend on a number of factors, which we discuss below.

Volume of Data

The first thing to consider is the volume of data, because it directly influences your choice of storage. Very large datasets may need to be stored in some form of unstructured data lake, probably using a modern NoSQL technology, or they may require a proprietary data warehouse such as Microsoft Azure SQL Data Warehouse.

By contrast, a customer database for a small business may fit comfortably on a single server (with appropriate backups, of course).

Nature of the Data

The nature of the data will directly influence the nature of the storage. Some data has a very obvious natural structure that has to be preserved. A good example is health records, which will always contain certain items such as date of birth, details of any allergies or medical conditions, lists of vaccinations and what medication has been prescribed.

Other data may naturally lend itself to key-value style storage. Data with little structure or relatively random structure may not be suitable for storage in an SQL database, in which case you need to choose an alternative.
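To make the contrast concrete, here is a minimal sketch using only the Python standard library. The table, field names and values are invented for illustration, and the in-memory dictionary merely stands in for a real key-value store. It shows a fixed schema rejecting an incomplete record, while the key-value side accepts records of any shape.

```python
import json
import sqlite3

# Structured storage: a fixed schema states which fields every record must have.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE patients (
        patient_id    INTEGER PRIMARY KEY,
        date_of_birth TEXT NOT NULL,
        allergies     TEXT NOT NULL
    )
    """
)
conn.execute(
    "INSERT INTO patients (patient_id, date_of_birth, allergies) VALUES (?, ?, ?)",
    (1, "1980-04-12", "penicillin"),
)

# A row that omits a required field is rejected by the database itself.
try:
    conn.execute(
        "INSERT INTO patients (patient_id, allergies) VALUES (?, ?)",
        (2, "none"),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Key-value storage: each value is an opaque blob, so records can differ in shape.
# (A plain dict is used here as a stand-in for a real key-value store.)
kv_store = {}
kv_store["device:17"] = json.dumps({"firmware": "2.1", "tags": ["lab", "test"]})
kv_store["device:18"] = json.dumps({"last_seen": "2024-01-05"})

structured_row = conn.execute(
    "SELECT date_of_birth, allergies FROM patients WHERE patient_id = 1"
).fetchone()
loose_record = json.loads(kv_store["device:18"])
```

The trade-off is the usual one: the schema gives you guarantees you can rely on during analysis, while the key-value layout leaves it to the reader of the data to cope with missing or unexpected fields.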

Use Case for Data

In many cases, analysis may be a secondary use case for your data. Take the example of a bank. Here, the primary use case for customer account data is to provide banking services. This means that your data storage must be suitable for your customer-facing and in-branch systems. Considerations here may include required speed of access, API constraints and indexing or search requirements. You may even decide that it is better to have two separate versions of the data, one for analytics and the other for the main business use.
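One way to picture the "two versions" option is a periodic job that rebuilds an analytics copy containing only the columns analysts need. The sketch below is an assumption-laden illustration: the `accounts` schema, the sample rows and the `refresh_analytics_copy` helper are all hypothetical, and both tables live in one in-memory SQLite database purely to keep the example self-contained. In practice the copy would normally sit in a separate system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Operational table serving the customer-facing systems (hypothetical schema).
conn.execute(
    "CREATE TABLE accounts ("
    "account_id INTEGER PRIMARY KEY, holder_name TEXT, branch TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?, ?)",
    [
        (1, "A. Example", "north", 120.0),
        (2, "B. Example", "north", 80.0),
        (3, "C. Example", "south", 50.0),
    ],
)


def refresh_analytics_copy(connection):
    """Rebuild the analytics table from the operational one, leaving names out."""
    connection.execute("DROP TABLE IF EXISTS accounts_analytics")
    connection.execute(
        "CREATE TABLE accounts_analytics AS "
        "SELECT account_id, branch, balance FROM accounts"
    )


refresh_analytics_copy(conn)

# Analysts query the copy, so heavy queries never touch the operational table.
totals = conn.execute(
    "SELECT branch, SUM(balance) FROM accounts_analytics "
    "GROUP BY branch ORDER BY branch"
).fetchall()
analytics_columns = [
    row[1] for row in conn.execute("PRAGMA table_info(accounts_analytics)")
]
```

Keeping the copy narrow has a side benefit: fields that analysis does not need, such as the account holder's name, never reach the analytics side at all.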

Location Constraints

You also need to be aware of any restrictions on where your data may be held. For instance, your company policies may state that all data must be held on company premises. If this is the case, then you can’t use a cloud-based data warehouse tool such as Google’s BigQuery.

It’s worth highlighting here that GDPR does not ban you from storing personal data outside the EU. However, it does require you to ensure that the data is appropriately protected and that this protection is legally binding.

Data Security and Privacy Tools

GDPR mandates that you take reasonable steps to ensure your data is stored securely by means of “appropriate technical and organisational measures”. It is deliberately vague about what these measures constitute, but it does make certain suggestions. When considering what measures are appropriate you need to assess the risks and use a combination of organisational policies and physical or technical security measures.

You must make sure you consider the current state of the art regarding security. This means being aware of current approaches such as two-factor authentication, TLS and encryption. However, you are also allowed to take cost into account: for a small shop, paying tens of thousands for a hardware firewall would not be reasonable.

One important requirement that is easily overlooked is that you must be able to restore full access to your data and systems in a “timely manner” following any serious incident. In effect, this means you must have a disaster recovery plan. This is part of the requirement to ensure the “confidentiality, integrity and availability” of your data. On the confidentiality side, you should also consider whether you need to include anonymisation tools directly in your analytics stack, or whether you will use other approaches to ensure data privacy is maintained.
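As a rough illustration of what a privacy step inside the stack can look like, the sketch below replaces a direct identifier with a keyed hash and coarsens a date of birth to a year. Note the assumption loudly: this is pseudonymisation, a weaker measure than anonymisation, and pseudonymised data still counts as personal data under GDPR. The field names, the sample record and the placeholder key are all invented for the example.

```python
import hashlib
import hmac

# Placeholder only: a real key would come from a secret store, never from source code.
SECRET_KEY = b"replace-with-a-key-from-your-secret-store"


def pseudonymise(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest of the identifier as a hex string."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def generalise_birth_date(iso_date: str) -> str:
    """Keep only the year from a YYYY-MM-DD date string."""
    return iso_date[:4]


# Invented sample record (hypothetical field names).
record = {"customer_id": "C-1001", "date_of_birth": "1980-04-12", "balance": 120.0}

safe_record = {
    "customer_ref": pseudonymise(record["customer_id"]),
    "birth_year": generalise_birth_date(record["date_of_birth"]),
    "balance": record["balance"],
}
```

Because the same identifier always maps to the same digest, analysts can still join and count by `customer_ref`. Anyone holding the key can recompute the mapping, which is exactly why this does not amount to anonymisation.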

Data Analysis and Visualisation Tools

Data analysis is a very broad term – Wikipedia defines it as “… a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.” There are a huge number of tools that can be used to help perform data analysis tasks. These include tools for extracting and finding data, tools for modelling the data and tools to extract information from the data.

Locating the correct data is a key part of data analysis. There are a number of approaches to data mining and searching. Some analysts might need the ability to run SQL queries in order to access the data (in which case your system must be able to accept SQL). Others may want to use a natural language query tool to search the data. Yet others may use simple keyword searching. Whatever your analysts’ requirements, your analytics stack needs to offer the right support while still meeting the requirements for data privacy.
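The difference between two of these access styles can be sketched in a few lines. The example below uses SQLite from the Python standard library; the `tickets` table, its rows and the two helper functions are hypothetical and exist only to show the shape of each query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickets (ticket_id INTEGER PRIMARY KEY, region TEXT, summary TEXT)"
)
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?, ?)",
    [
        (1, "north", "Card payment declined at checkout"),
        (2, "south", "Cannot reset online banking password"),
        (3, "north", "Payment taken twice"),
    ],
)


def count_by_region(connection):
    """Structured access: an analyst writes SQL and gets an aggregate back."""
    return connection.execute(
        "SELECT region, COUNT(*) FROM tickets GROUP BY region ORDER BY region"
    ).fetchall()


def keyword_search(connection, keyword):
    """Keyword access: return the ids of tickets whose summary contains the keyword.

    The keyword is passed as a bound parameter rather than pasted into the SQL.
    Any % or _ characters in the keyword still act as LIKE wildcards.
    """
    pattern = f"%{keyword}%"
    rows = connection.execute(
        "SELECT ticket_id FROM tickets WHERE summary LIKE ? ORDER BY ticket_id",
        (pattern,),
    )
    return [row[0] for row in rows]


region_counts = count_by_region(conn)
payment_tickets = keyword_search(conn, "payment")
```

Either style can be offered on top of the same store. What matters for the stack is that each route passes through whatever privacy controls you have chosen.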

Building models that use the underlying data to predict future events is a key task for many data analysts. These models may be simple VBA scripts in an Excel spreadsheet, or they may be more complex models created in R or SPSS. Again, your stack will need to offer suitable hooks and access to allow these models to be created and to produce usable data.

Often data needs to be visualised as an integral part of the analysis. And almost certainly, visualisations are necessary to disseminate the results of the analysis. Once, data visualisation was limited to simple things like graphs plotted in R or charts created in Excel. Over recent years there has been an explosion in the field of data visualisation, with new chart types being invented and systems that are able to display dynamic dashboards built with point-and-click interfaces.
