Ensuring Data Security
One of the main focuses of this paper is data privacy. However, if you are collecting personal data then you can’t achieve data privacy without data security. In this section we will explore the Data Security and Privacy function in detail, explaining what we mean by each term and describing some of the techniques used.
Data Security refers to the process of keeping your data safe from unauthorised access. By that we mean access by people or machines that have no reason to access the data, even if they may be technically able to do so (for instance because they have access to your internal network).
Frequently, someone may be in a position where they can view personal data, but if they have no good reason to view it then they shouldn’t be accessing it.
There are a number of ways to secure your data, including access controls, data encryption and physical security. We will now look at each of these in turn.
The primary form of security for most data is access control. Access control is the process of ensuring that only authorised people and machines can download or view the data.
The first step is authentication. Authentication means checking that someone is who they claim to be. Typically, this is done using password protection or, more securely, some form of two-factor authentication.
Next you need to check that the person is authorised to access the data. This may require a secure list of authorised users and may also add checks such as IP address filtering to ensure they are accessing the data from a known location.
Finally, you should record all data accesses. This will allow you to trace any unauthorised access, or any occasion where an authorised person accessed the data inappropriately.
In practice, a person’s identity is often verified only periodically, for instance when they first log into the system. After that they are generally given some form of token, such as a session token or API key, which the system accepts as proof of their identity on later requests.
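The three steps above (authenticate, authorise, record), together with the token hand-off, can be sketched as follows. This is a minimal illustration using only the Python standard library, not a production design: the user list, password, network range, iteration count and function names are all hypothetical.

```python
import hashlib
import hmac
import ipaddress
import secrets
from datetime import datetime, timezone

# Everything below is hypothetical example data, not a real configuration.
SALT = secrets.token_bytes(16)


def hash_password(password, salt=SALT):
    # PBKDF2 from the standard library; the iteration count is an assumption.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


USERS = {"alice": hash_password("correct horse")}      # stored password hashes
AUTHORISED = {"alice": {"customers"}}                   # user -> datasets they may read
ALLOWED_NETWORK = ipaddress.ip_network("10.0.0.0/8")    # known internal range
SESSIONS = {}                                           # token -> user
AUDIT_LOG = []                                          # one entry per access attempt


def log_in(user, password):
    """Step 1: authenticate once, then hand back a session token."""
    stored = USERS.get(user)
    if stored is None or not hmac.compare_digest(stored, hash_password(password)):
        return None
    token = secrets.token_urlsafe(32)
    SESSIONS[token] = user
    return token


def read_dataset(token, dataset, ip):
    """Steps 2 and 3: authorise the request and record the outcome."""
    user = SESSIONS.get(token)
    allowed = (
        user is not None
        and dataset in AUTHORISED.get(user, set())
        and ipaddress.ip_address(ip) in ALLOWED_NETWORK
    )
    # Denied attempts are logged too, so unauthorised access can be traced.
    AUDIT_LOG.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "dataset": dataset,
        "ip": ip,
        "allowed": allowed,
    })
    return allowed
```

Note that the audit entry is written whether or not access is granted. A real system would also expire tokens and keep the log somewhere the accessing user cannot modify.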
The next form of data security is encryption. Data may be encrypted at rest (in the storage itself) or in transit. If the storage itself is encrypted, then this will protect against physical theft of hard drives and other direct attacks on the storage. If the data is encrypted in transit (for instance by using Transport Layer Security (TLS), the protocol that underpins HTTPS), then this will protect it from interception attacks. In both cases, the encryption is only secure so long as the keys remain secure. Key security therefore often goes hand in hand with access control.
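As an illustration of encryption in transit, the sketch below builds a client-side TLS configuration with Python's standard `ssl` module. The function name and the choice of TLS 1.2 as the minimum version are assumptions made for this example. Encryption at rest is not shown, since it normally relies on the storage system or a dedicated cryptography library.

```python
import ssl


def make_tls_context():
    """Build a client TLS context that verifies the server's certificate."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    # Refuse older protocol versions (the minimum chosen here is an assumption).
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

A context like this can be passed to standard-library clients, for example through the `context` argument of `urllib.request.urlopen`, so that data is only sent over a verified, encrypted connection.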
As mentioned above, one of the reasons to encrypt data is to protect it against physical attacks. Physical security is a key element of data security. Due to the volumes of data and processing required, most companies nowadays store their data in remote locations, typically large data centres. These data centres operate extremely high levels of physical security. Only authorised personnel are allowed to access the data centre. Racks are generally locked, with keys only being released to authorised people. A group of racks (known as a pod) may be further secured by being enclosed in a cage.
The idea is to prevent people stealing the hardware or being able to connect unauthorised machines to the servers to access the data. While this security is partly because of the intrinsic value in the hardware, it’s also a key element in the data security.
In this section we give a broad overview of what is meant by data privacy, what data is covered (and what data isn’t) and how this relates to our abstract model. Within Europe, data privacy is enshrined in law by the General Data Protection Regulation (GDPR). We will discuss the impact of GDPR later in the paper.
What Do We Mean by Data Privacy?
In the context of this paper, data privacy refers to the process of protecting sensitive personal data. By this we mean preventing that data from being released without the informed consent of the individual, except where other legal obligations require its release. For many organisations, this data forms an integral part of their customer data. Often, they need to be able to access this data to perform their legitimate business functions. However, they should not access it for other purposes or share it without permission.
What Data Is Covered by Data Privacy?
Generally, data privacy is concerned with personal data, in other words data that relates to a specific individual. The GDPR states “‘personal data’ means any information relating to an identified or identifiable natural person”. This is quite a broad definition and covers a wide range of data. The following is a far from exhaustive list of the things that are covered: name, gender, sexual orientation, disability, address, phone number, email address, physical location, identification numbers, employment details, phone records, usernames, social media handles and passwords. Data privacy techniques can also extend to protecting companies and other entities, for instance protecting confidential or privileged information, trade secrets, etc.
The simple test to decide if something is covered is whether knowing that data might make it easier to identify the individual or entity involved.
The problem is that whether the data makes a person more identifiable may depend on the circumstances. This means there are grey areas such as knowing an individual’s employer – if they work for a large company then knowing this information won’t help identify them, but if they work for a company with only 2 or 3 employees then it clearly does help identify them.
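The employer example can be made concrete by counting how many people share a given set of known attributes: the smaller the count, the more identifying the data. The sketch below is purely illustrative. The records, field names and function name are invented for this example, and it is not a legal test of whether data is covered.

```python
# Hypothetical records, invented for illustration only.
PEOPLE = [
    {"employer": "BigCorp", "town": "Leeds"},
    {"employer": "BigCorp", "town": "Leeds"},
    {"employer": "BigCorp", "town": "York"},
    {"employer": "BigCorp", "town": "Leeds"},
    {"employer": "TinyCo", "town": "Leeds"},
    {"employer": "TinyCo", "town": "York"},
]


def matching_count(records, known):
    """Count the records consistent with everything in `known`."""
    return sum(
        all(record.get(key) == value for key, value in known.items())
        for record in records
    )


big = matching_count(PEOPLE, {"employer": "BigCorp"})                    # 4
tiny = matching_count(PEOPLE, {"employer": "TinyCo"})                    # 2
tiny_york = matching_count(PEOPLE, {"employer": "TinyCo", "town": "York"})  # 1
```

Knowing only that someone works for the larger employer leaves four candidates, whereas the smaller employer leaves two, and adding a second attribute narrows it to a single person. The same attribute can therefore be harmless in one context and identifying in another.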
What Data Is Not Covered?
As already mentioned, whether or not data is covered comes down to whether it can be used to identify the individual or entity involved. This often depends on context as in the case of someone’s employer mentioned above. The interesting thing is that even apparently obvious personal identifiers may not be covered if they don’t make it possible to identify someone.
The classic example here is someone’s name. Within the UK there are about half a million people with the surname ‘Smith’, so knowing someone is British and has that surname is not a personal identifier. Other things may not be covered as personal data but might well be included under the requirements for data privacy.
This includes certain transactional data such as purchasing history or bank balance. Of itself, knowing there is an individual with a current balance of €4,500 is not going to allow you to identify that individual, though combining it with other data may do so.
Another major exemption for data privacy is when an individual has explicitly allowed the data to be shared. For instance, a customer might have allowed you to share their email address with another company. Or a customer may have left a public review on your website which reveals their username. The important thing here is that such consent should be properly informed, meaning it can’t be hidden in the small print or rely on a pre-ticked checkbox. The fact that consent was given should also be stored as part of the data.
How Does This Affect the Abstract Model?
The two key things with data privacy are to know when a piece of data is covered and to keep records of all permissions given relating to the data. This means that data privacy has to be considered in all three entities in the model.
Data Loaders need to know whether they have the rights to a piece of data (e.g. they can’t necessarily scrape data relating to an individual and assume they are allowed to use it). Where they do have the right to use the data, they need to know if the owner of the data has given consent for it to be shared.
The Data Warehouse needs to store the data along with all the consents that have been given.
Finally, the Data Consumer must be sure that any data that is being released doesn’t breach data privacy.
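One way to picture how consent travels through the three entities is sketched below. It is only an illustration under simple assumptions: the class names, field names, purposes and the in-memory dictionary standing in for the Data Warehouse are all hypothetical, and real consent handling would need to cover withdrawal, expiry and legal bases other than consent.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Consent:
    field_name: str       # which piece of data the consent covers
    purpose: str          # what the individual agreed it may be used for
    granted_at: datetime  # when consent was given, kept as evidence


@dataclass
class StoredRecord:
    subject_id: str
    data: dict
    consents: list = field(default_factory=list)


WAREHOUSE = {}  # hypothetical stand-in for the Data Warehouse


def load(subject_id, data, consents):
    """Data Loader: store the data together with the consents given."""
    now = datetime.now(timezone.utc)
    record = StoredRecord(
        subject_id,
        dict(data),
        [Consent(name, purpose, now) for name, purpose in consents],
    )
    WAREHOUSE[subject_id] = record
    return record


def release(subject_id, purpose):
    """Data Consumer: return only the fields consented to for this purpose."""
    record = WAREHOUSE[subject_id]
    return {
        c.field_name: record.data[c.field_name]
        for c in record.consents
        if c.purpose == purpose and c.field_name in record.data
    }
```

For example, after `load("c1", {"email": "a@example.com", "phone": "555-0100"}, [("email", "partner_sharing")])`, a call to `release("c1", "partner_sharing")` returns only the email address, while any other purpose returns nothing.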
Building a Privacy-preserving analytics stack – better understand how to comply with the requirements imposed by GDPR while still leveraging data analysis.
Before explaining how to choose the best analytics tool stack, we first need to create an abstract model for the stack. This allows us to discuss the required functionality without being wedded to preconceived ideas about the capabilities and limitations of specific tools such as Postgres.
GDPR is one of the most far-reaching data protection laws anywhere in the world, and as such it has had a huge impact globally. This is because, unlike many national data protection laws, GDPR applies to any company that deals with EU residents, wherever in the world that company is based. In this section we will look at the specific impact GDPR has had on analytics.