A reminder of what the Prepare stage means, as defined on the Data Analysis Roadmap page:
The prepare phase ensures that you have all of the data you need for your analysis and that you have credible, useful data.
Preparing your data, or knowing how it was prepared, is key, so let’s go through a few data scenarios.
You will decide what data you need to collect in order to answer your questions and how to organize it so that it is useful. You might use your business task to decide:
Questions to ask yourself in this step:
It all started with solid preparation. The group built a timeline of three months and decided how they wanted to relay their progress to interested parties. Also during this step, the analysts identified what data they needed to achieve the successful result they identified in the previous step - in this case, the analysts chose to gather the data from an online survey of new employees. These were the things they did to prepare:
They developed specific questions to ask about employee satisfaction with different business processes, such as hiring and onboarding, and their overall compensation.
They established rules for who would have access to the data collected - in this case, anyone outside the group wouldn’t have access to the raw data, but could view summarized or aggregated data. For example, an individual’s compensation wouldn’t be available, but salary ranges for groups of individuals would be viewable.
They finalized what specific information would be gathered, and how best to present the data visually. The analysts brainstormed possible project- and data-related issues and how to avoid them.
| Data format classification | Definition |
|---|---|
| Primary data | Collected by a researcher from first-hand sources |
| Secondary data | Gathered by other people or from other research |
| Data format classification | Definition |
|---|---|
| Internal data | Data that is stored inside a company’s own systems |
| External data | Data that is stored outside of a company or organization |
| Data format classification | Definition |
|---|---|
| Continuous data | Data that is measured and can have almost any numeric value |
| Discrete data | Data that is counted and has a limited number of values |
| Data format classification | Definition |
|---|---|
| Qualitative | A subjective and explanatory measure of a quality or characteristic |
| Quantitative | A specific and objective measure, such as a number, quantity, or range |
| Data format classification | Definition |
|---|---|
| Nominal | A type of qualitative data that is categorized without a set order |
| Ordinal | A type of qualitative data with a set order or scale |
Two general categories of data are:
Structured data: Organized in a certain format, such as rows and columns.
Unstructured data: Not organized in any easy-to-identify way.
For example, when you rate your favorite restaurant online, you’re creating structured data. But when you use Google Earth to check out a satellite image of a restaurant location, you’re using unstructured data.
Here’s a refresher on the characteristics of structured and unstructured data:
As we described earlier, structured data is organized in a certain format. This makes it easier to store and query for business needs. If the data is exported, the structure goes along with the data.
Unstructured data can’t be organized in any easily identifiable manner. And there is much more unstructured than structured data in the world. Video and audio files, text files, social media content, satellite imagery, presentations, PDF files, open-ended survey responses, and websites all qualify as types of unstructured data.
| Data format classification | Definition |
|---|---|
| Structured data | Data organized in a certain format, like rows and columns |
| Unstructured data | Data that cannot be stored as columns and rows in a relational database |
The fairness issue
The lack of structure makes unstructured data difficult to search, manage, and analyze. But recent advancements in artificial intelligence and machine learning algorithms are beginning to change that. Now, the new challenge facing data scientists is making sure these tools are inclusive and unbiased. Otherwise, certain elements of a dataset will be more heavily weighted and/or represented than others.
An unfair dataset does not accurately represent the population, causing skewed outcomes, low accuracy levels, and unreliable analysis.
Data models help keep data consistent and enable people to map out how data is organized. A basic understanding makes it easier for analysts and other stakeholders to make sense of their data and use it in the right ways.
Data modeling is the process of creating diagrams that visually represent how data is organized and structured. These visual representations are called data models. You can think of data modeling as a blueprint of a house. At any point, there might be electricians, carpenters, and plumbers using that blueprint. Each one of these builders has a different relationship to the blueprint, but they all need it to understand the overall structure of the house. Data models are similar; different users might have different data needs, but the data model gives them an understanding of the structure as a whole.
Data modeling can help you explore the high-level details of your data and how it is related across the organization’s information systems. Data modeling sometimes requires data analysis to understand how the data is put together; that way, you know how to map the data. And finally, data models make it easier for everyone in your organization to understand and collaborate with you on your data. This is important for you and everyone on your team!
Each level of data modeling has a different level of detail.
Conceptual data modeling gives a high-level view of the data structure, such as how data interacts across an organization. For example, a conceptual data model may be used to define the business requirements for a new database. A conceptual data model doesn’t contain technical details.
Logical data modeling focuses on the technical details of a database such as relationships, attributes, and entities. For example, a logical data model defines how individual records are uniquely identified in a database. But it doesn’t spell out actual names of database tables. That’s the job of a physical data model.
Physical data modeling depicts how a database operates. A physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database.
Wide data is a dataset in which every data subject has a single row, with multiple columns holding the values of various attributes of that subject. In other words, each row contains multiple data points for the items identified in the columns. This format is helpful for comparing specific attributes across different subjects.
Long data is a dataset in which each row represents one observation per subject, so each subject is represented by multiple rows, and each row contains a single data point for a particular item. This format is useful for comparing changes over time or making other comparisons across subjects. In the long-data example sketched below, individual stock prices (data points) are collected for Apple (AAPL), Amazon (AMZN), and Google (GOOGL) (particular items) on the given dates.
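Here is a minimal sketch of the two formats, assuming Python with pandas (the tool choice is an assumption; the same reshaping can be done in spreadsheets or SQL). The ticker symbols match the example above, but the prices are placeholder values, not real quotes.

```python
# Minimal wide-vs-long sketch (pandas assumed; prices are placeholders).
import pandas as pd

# Wide format: one row per date, one column per stock ticker.
wide = pd.DataFrame({
    "date":  ["2021-01-04", "2021-01-05"],
    "AAPL":  [100.0, 101.5],    # placeholder prices, not real quotes
    "AMZN":  [150.0, 149.0],
    "GOOGL": [120.0, 121.2],
})

# Long format: one row per (date, ticker) observation.
long_format = pd.melt(wide, id_vars="date", var_name="ticker", value_name="price")
print(long_format)
```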
Data transformation is the process of changing the data’s format, structure, or values. As a data analyst, there is a good chance you will need to transform data at some point to make it easier to analyze; a short sketch of a few of these steps appears after the lists below.
Data transformation usually involves:
Adding, copying, or replicating data
Deleting fields or records
Standardizing the names of variables
Renaming, moving, or combining columns in a database
Joining one set of data with another
Saving a file in a different format. For example, saving a spreadsheet as a comma-separated values (.csv) file.
Goals for data transformation might be:
Data organization: better organized data is easier to use
Data compatibility: different applications or systems can then use the same data
Data migration: data with matching formats can be moved from one system to another
Data merging: data with the same organization can be merged together
Data enhancement: data can be displayed with more detailed fields
Data comparison: apples-to-apples comparisons of the data can then be made
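As a concrete illustration, here is a minimal sketch of a few of the steps above, assuming Python with pandas (an assumption; the column names and values are hypothetical).

```python
# Minimal data-transformation sketch (pandas assumed; names and values are hypothetical).
import pandas as pd

sales = pd.DataFrame({
    "Cust Name": ["Ada", "Grace"],
    "amt":       ["100", "250"],   # amounts stored as text in the source system
})

# Standardize variable names by renaming columns.
sales = sales.rename(columns={"Cust Name": "customer_name", "amt": "amount"})

# Change the data type of a field so it can be used in calculations.
sales["amount"] = sales["amount"].astype(float)

# Join one set of data with another (a hypothetical region lookup).
regions = pd.DataFrame({"customer_name": ["Ada", "Grace"], "region": ["West", "East"]})
sales = sales.merge(regions, on="customer_name", how="left")

# Save the result in a different format (.csv).
sales.to_csv("sales_clean.csv", index=False)
```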
Before you work with data, you must confirm that it is unbiased and credible. After all, if you start your analysis with unreliable data, you won’t be able to trust your results. Next we’ll learn to identify bias in data and to ensure your data is credible. You’ll also explore open data and the importance of data ethics and data privacy.
We’ll learn how to analyze data for bias and credibility. This is very important because even the most sound data can be skewed or misinterpreted. Then we’ll learn about the difference between good and bad data sources. After that, we’ll learn more about the world of data ethics, privacy, and access.
As more and more data becomes available, and the algorithms we create to use this data become more complex and sophisticated, new issues keep popping up. We need to ask questions like, who owns all this data? How much control do we have over the privacy of data? Can we use and reuse data however we want to?
Bias is a preference in favor of or against a person, group of people, or thing. It can be conscious or subconscious.
Data bias is a type of error that systematically skews results in a certain direction. Maybe the questions on a survey had a particular slant to influence answers, or maybe the sample group wasn’t truly representative of the population being studied.
Bias can also happen if a sample group lacks inclusivity. For example, people with disabilities tend to be under-identified, under-represented, or excluded in mainstream health research.
The way you collect data can also bias a data set. Sampling bias is when a sample isn’t representative of the population as a whole. You can avoid this by making sure the sample is chosen at random, so that all parts of the population have an equal chance of being included. If you don’t use random sampling during data collection, you end up favoring one outcome.
Observer bias, sometimes referred to as experimenter bias or research bias, is the tendency for different people to observe things differently.
You might remember that earlier, we learned that scientists use observations a lot in their work, like when they’re looking at bacteria under a microscope to gather data. If two scientists looking into the same microscope see different things, that’s observer bias.
Another time observer bias might happen is during manual blood pressure readings. Because the pressure meter is so sensitive, health care workers often get quite different results. Usually, they’ll just round up to the nearest whole number to compensate for the margin of error. But if doctors consistently round their patients’ blood pressure readings up or down, health conditions may be missed, and any studies involving their patients wouldn’t have precise and accurate data.
Interpretation bias is the tendency to always interpret ambiguous situations in a positive or negative way.
Let’s say you’re having lunch with a colleague, when you get a voicemail from your boss, asking you to call her back. You put the phone down in a huff, certain that she’s angry, and you’re on the hot seat for something. But when you play the message for your friend, he doesn’t hear anger at all, he actually thinks she sounds calm and straightforward.
Interpretation bias can lead to two people seeing or hearing the exact same thing and interpreting it in a variety of different ways, because they have different backgrounds and experiences. Your history with your boss made you interpret the call one way, while your friend interpreted it in another way, because they’re strangers. Add these interpretations to a data analysis, and you can get biased results.
People see what they want to see.
That pretty much sums up confirmation bias in a nutshell. Confirmation bias is the tendency to search for or interpret information in a way that confirms preexisting beliefs. Someone might be so eager to confirm a gut feeling that they only notice things that support it, ignoring all other signals. This happens all the time in everyday life.
What about good data sources? Are those subjective, too? In some ways they are, but luckily, there are some best practices to follow that will help you measure the reliability of datasets before you use them.
Like a good friend, good data sources are reliable. With this data you can trust that you’re getting accurate, complete and unbiased information that’s been vetted and proven fit for use.
There’s a good chance you’ll discover data through a second or third party source. To make sure you’re dealing with good data, be sure to validate it with the original source.
The best data sources contain all critical information needed to answer the question or find the solution. Think about it like this. You wouldn’t want to work for a company just because you found one great online review about it. You’d research every aspect of the organization to make sure it was the right fit. It’s important to do the same for your data analysis.
The usefulness of data decreases as time passes. If you wanted to invite all current clients to a business event, you wouldn’t use a 10-year-old client list. The same goes for data.
The best data sources are current and relevant to the task at hand.
If you’ve ever told a friend where you heard that a new movie sequel was in the works, you’ve cited a source. Citing makes the information you’re providing more credible.
When you’re choosing a data source, think about three things.
Who created the data set?
Is it part of a credible organization?
When was the data last refreshed?
If you have original data from a reliable organization and it’s comprehensive, current, and cited, it ROCCCs!
Bad data sources are not reliable, original, comprehensive, current, or cited. Even worse, they could be flat-out wrong or filled with human error. Basically, they’re the opposite of ROCCC.
When we analyze data, we’re also faced with questions, challenges, and opportunities, but we have to rely on more than just our personal code of ethics to address them.
As we learned earlier, we all have our own personal biases, not to mention subconscious biases that make ethics even more difficult to navigate. That’s why we have data ethics, an important aspect of analytics.
Just like humans, data has standards to live up to as well. Data ethics refers to well-founded standards of right and wrong that dictate how data is collected, shared, and used.
Since the ability to collect, share and use data in such large quantities is relatively new, the rules that regulate and govern the process are still evolving.
The General Data Protection Regulation (GDPR) of the European Union was created to do just this. The concept of data ethics and issues related to transparency and privacy are part of the process.
Data ethics tries to get to the root of the accountability companies have in protecting and responsibly using the data they collect. There are lots of different aspects of data ethics, but we’ll cover six: ownership, transaction transparency, consent, currency, privacy, and openness.
Ownership answers the question: who owns data? It isn’t the organization that invested time and money collecting, storing, processing, and analyzing it. It’s the individuals who own the raw data they provide, and they have primary control over its usage, how it’s processed, and how it’s shared.
Transaction transparency is the idea that all data-processing activities and algorithms should be completely explainable and understood by the individual who provides the data. This is in response to concerns over data bias, which, as we discussed earlier, is a type of error that systematically skews results in a certain direction.
Biased outcomes can lead to negative consequences. To avoid them, it’s helpful to provide transparent analysis especially to the people who share their data. This lets people judge whether the outcome is fair and unbiased and allows them to raise potential concerns.
Consent is an individual’s right to know explicit details about how and why their data will be used before agreeing to provide it. They should know the answers to questions like:
Why is the data being collected?
How will it be used?
How long will it be stored?
The best way to give consent is probably a conversation between the person providing the data and the person requesting it. But with so much activity happening online these days, consent usually just looks like a terms and conditions checkbox with links to more details.
Currency means individuals should be aware of financial transactions resulting from the use of their personal data and the scale of those transactions. If your data is helping to fund a company’s efforts, you should know what those efforts are all about and be given the opportunity to opt out.
Privacy is all about the access, use, and collection of data. It also covers a person’s legal right to their data. This means someone like you or me should have protection from unauthorized access to our private data, freedom from inappropriate use of our data, the right to inspect, update, or correct our data, the ability to give consent to use our data, and the legal right to access our data.
For companies, it means putting privacy measures in place to protect the individuals’ data. Data privacy is important, even if you’re not someone who thinks about it on a day-to-day basis.
You have been learning about the importance of privacy in data analytics. Now, it is time to talk about data anonymization and what types of data should be anonymized. Personally identifiable information, or PII, is information that can be used by itself or with other data to track down a person’s identity.
Data anonymization is the process of protecting people’s private or sensitive data by eliminating that kind of information. Typically, data anonymization involves blanking, hashing, or masking personal information, often by using fixed-length codes to represent data columns, or hiding data with altered values.
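For illustration, here is a minimal sketch of hashing and masking, assuming Python’s standard library; the record and field names are hypothetical, and real anonymization should follow your organization’s policies.

```python
# Minimal anonymization sketch (standard library only; record is hypothetical).
import hashlib

record = {"name": "Avery Lee", "email": "avery@example.com", "phone": "555-0123"}

def hash_value(value: str) -> str:
    """Replace a value with a fixed-length code (a SHA-256 hex digest)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def mask_phone(phone: str) -> str:
    """Hide all but the last four digits of a phone number."""
    return "***-" + phone[-4:]

anonymized = {
    "name": hash_value(record["name"]),
    "email": hash_value(record["email"]),
    "phone": mask_phone(record["phone"]),
}
print(anonymized)
```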
What types of data should be anonymized?
Healthcare and financial data are two of the most sensitive types of data. These industries rely a lot on data anonymization techniques. After all, the stakes are very high. That’s why data in these two industries usually goes through de-identification, which is a process used to wipe data clean of all personally identifying information.
Data anonymization is used in just about every industry. That is why it is so important for data analysts to understand the basics. Here is a list of data that is often anonymized:
Telephone numbers
Names
License plates and license numbers
Social security numbers
IP addresses
Medical records
Email addresses
Photographs
Account numbers
For some people, it just makes sense that this type of data should be anonymized. For others, we have to be very specific about what needs to be anonymized.
Openness refers to free access, usage, and sharing of data. But for data to be considered open, it has to:
Be available and accessible to the public as a complete dataset
Be provided under terms that allow it to be reused and redistributed
Allow universal participation so that anyone can use, reuse, and redistribute the data
In data analytics, open data is part of data ethics, which has to do with using data ethically. Data can only be considered open when it meets all three of these standards.
The open data debate: What data should be publicly available?
One of the biggest benefits of open data is that credible databases can be used more widely. Basically, this means that all of that good data can be leveraged, shared, and combined with other data. This could have a huge impact on scientific collaboration, research advances, analytical capacity, and decision-making. But it is important to think about the individuals being represented by public, open data, too.
Third-party data is collected by an entity that doesn’t have a direct relationship with the data.
Personally identifiable information (PII) is data that is reasonably likely to identify a person and make information known about them. It is important to keep this data safe. PII can include a person’s address, credit card information, social security number, medical records, and more.
A relational database is a database that contains a series of tables that can be connected to form relationships. Basically, they allow data analysts to organize and link data based on what the data has in common.
In a non-relational table, you will find all of the possible variables you might be interested in analyzing grouped together in a single table. This can make it really hard to sort through. This is one reason why relational databases are so common in data analysis: they simplify a lot of analysis processes and make data easier to find and use across an entire database.
Normalization is a process of organizing data in a relational database. For example, creating tables and establishing relationships between those tables. It is applied to eliminate data redundancy, increase data integrity, and reduce complexity in a database.
Primary key is an identifier that references a column in which each value is unique. In other words, it’s a column of a table that is used to uniquely identify each record within that table. The value assigned to the primary key in a particular row must be unique within the entire table. For example, if customer_id is the primary key for the customer table, no two customers will ever have the same customer_id.
Foreign Key is a field within a table that is a primary key in another table. A table can have only one primary key, but it can have multiple foreign keys. These keys are what create the relationships between tables in a relational database, which helps organize and connect data across multiple tables in the database.
Some tables don’t require a primary key. For example, a revenue table can have multiple foreign keys and not have a primary key. A primary key may also be constructed using multiple columns of a table. This type of primary key is called a composite key. For example, if customer_id and location_id are two columns of a composite key for a customer table, the values assigned to those fields in any given row must be unique within the entire table.
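To make keys concrete, here is a minimal sketch using Python’s built-in sqlite3 module (the choice of SQLite is an assumption; the table and column names follow the examples above).

```python
# Minimal primary/foreign/composite key sketch using SQLite (assumption: sqlite3).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- customer_id is the primary key: no two customers share the same value.
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT
    );

    -- A revenue table with a foreign key and no primary key of its own.
    CREATE TABLE revenue (
        customer_id INTEGER REFERENCES customer (customer_id),
        amount      REAL
    );

    -- A composite key: customer_id and location_id together must be unique.
    CREATE TABLE customer_location (
        customer_id INTEGER,
        location_id INTEGER,
        PRIMARY KEY (customer_id, location_id)
    );
""")
conn.close()
```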
Metadata is data about data. In database management, metadata provides information about other data and helps data analysts interpret the contents of the data within a database. In essence, metadata tells the who, what, when, where, which, why, and how of data.
As a data analyst, there are three common types of metadata that you’ll come across: descriptive, structural, and administrative.
Descriptive metadata is metadata that describes a piece of data and can be used to identify it at a later point in time.
For instance, the descriptive metadata of a book in a library would include the code you see on its spine, known as a unique International Standard Book Number, also called the ISBN. It would also include the book’s author and title.
Structural metadata, which is metadata that indicates how a piece of data is organized and whether it’s part of one or more than one data collection.
Let’s head back to the library. An example of structural metadata would be how the pages of a book are put together to create different chapters. It’s important to note that structural metadata also keeps track of the relationship between two things. For example, it can show us that the digital document of a book manuscript was actually the original version of a now printed book.
Administrative metadata is metadata that indicates the technical source of a digital asset.
When we looked at the metadata inside the photo, that was administrative metadata. It shows you the type of file it was, the date and time it was taken, and much more.
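As a small illustration, the sketch below reads administrative-style metadata (file size and last-modified time) with Python’s standard library; it creates a placeholder file first so it runs as-is.

```python
# Minimal administrative-metadata sketch (standard library; placeholder file).
import datetime
import os

path = "example.txt"              # placeholder file, not a real digital asset
with open(path, "w") as f:
    f.write("sample contents")

info = os.stat(path)
print("File type (extension):", os.path.splitext(path)[1])
print("Size in bytes:", info.st_size)
print("Last modified:", datetime.datetime.fromtimestamp(info.st_mtime))
```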
It’s important to understand what type of information metadata typically provides:
File or document type: What type of file or document are you examining?
Date, time, and creator: When was it created? Who created it? When was it last modified?
Title and description: What is the name of the item you are examining? What type of content does it contain?
Geolocation: If you’re examining a photo, where was it taken?
Tags and categories: What is the general overview of the item that you have? Is it indexed or described in a specific way?
Who last modified it and when: Were any changes made to the file? If yes, when were the most recent modifications made?
Who can access or update it: If you’re examining a dataset, is it public? Are special permissions needed to customize or modify it?
If the data being used to solve a problem or to make a data-driven decision is unreliable, there’s a good chance the results will be unreliable as well. Metadata helps data analysts confirm their data is reliable by making sure it is:
Accurate
Precise
Relevant
Timely
It does this by helping analysts ensure that they’re working with the right data and that the data is described correctly. For example, a data analyst completing a project with data from 2022 can use metadata to easily determine if they should use data from a particular file.
Data analysts thrive on consistency and aim for uniformity in their data and databases, and metadata helps make this possible. For example, to use survey data from two different sources, data analysts use metadata to make sure the same collection methods were applied in the survey so that both datasets can be compared reliably.
When a database is consistent, it’s easier to discover relationships between the data inside the database and data that exists elsewhere. When data is uniform, it is:
Organized: Data analysts can easily find tables and files, monitor the creation and alteration of assets, and store metadata.
Classified: Data analysts can categorize data when it follows a consistent format, which is beneficial in cleaning and processing data.
Stored: Consistent and uniform data can be efficiently stored in various data repositories. This streamlines storage management tasks such as managing a database.
Accessed: Users, applications, and systems can efficiently locate and use data.
Metadata repositories help data analysts ensure their data is reliable and consistent. Metadata repositories are specialized databases specifically created to store and manage metadata. They can be kept in a physical location or a virtual environment—like data that exists in the cloud.
Metadata repositories describe where the metadata came from and store that data in an accessible form with a common structure. This provides data analysts with quick and easy access to the data. If data analysts didn’t use a metadata repository, they would have to select each file to look up its information and compare the data manually, which would waste a lot of time and effort.
Data analysts also use metadata repositories to bring together multiple sources for data analysis. Metadata repositories do this by describing the state and location of the data, the structure of the tables inside the data, and who has accessed the data, according to user logs.
Data analysts use both second-party and third-party data to gain valuable insights and make strategic, data-driven decisions. Second-party data is data that’s collected by a group directly from the group’s audience and then sold. Third-party data is provided by outside sources that didn’t collect it directly. The providers of this data are not its original collectors and do not have a direct relationship with any individuals to whom the data belongs. The outside providers get the data from websites or other programs that pull it from the various platforms where it was originally generated.
Data analysts should understand the metadata of external databases to confirm that it is consistent and reliable. In some cases, they should also contact the owner of the third-party data to confirm that it is accessible and available for purchase. Confirming that the data is reliable and that the proper permissions to use it have been obtained are best practices when using data that comes from another organization.
As you’ve learned, you can import data from some data sources, like .csv files into a Google spreadsheet from the File menu. Keep in mind that, when you use this method, data that is updated in the .csv will not automatically be updated in the Google Sheet. Instead, it will need to be manually—and continually—updated in the Google Sheet. In some situations, such as when you want to be able to keep track of changes you’ve made, this method is ideal. In other situations, you might need to keep the data the same in both places, and using data that doesn’t update automatically can be time-consuming and tedious. Further, trying to maintain the same dataset in multiple places can cause errors later on.
Fortunately, there are tools to help you automate data imports so you don’t need to continually update the data in your current spreadsheet. Take a small general store as an example. The store has three cash registers handled by three clerks. At the end of each day, the owner wants to determine the total sales and the amount of cash in each register. Each clerk is responsible for counting their money and entering their sales total into a spreadsheet. The owner has the spreadsheets set up to import each clerk’s data into another spreadsheet, which automatically calculates the total sales for all three registers. Without this automation, each clerk would have to take turns entering their data into the owner’s spreadsheet. This is an example of a dynamic method of importing data, which saves the owner and clerks time and energy. When data is dynamic, it is interactive and automatically changes and updates over time.
IMPORT functions in Google Sheets
In Google Sheets, the IMPORTRANGE function can import all or part of a dataset from another Google Sheet.
To use this function, you need two pieces of information:
The URL of the Google Sheet from which you’ll import data.
The name of the sheet and the range of cells you want to import into your Google Sheet.
Once you have this information, open the Google Sheet into which you want to import data and select the cell into which the first cell of data should be copied. Enter = to indicate you will enter a function, then complete the IMPORTRANGE function with the URL and range you identified in the following manner: =IMPORTRANGE(“URL”, “sheet_name!cell_range”). Note that an exclamation point separates the sheet name and the cell range in the second part of this function.
An example of this function is:
=IMPORTRANGE(“https://docs.google.com/thisisatestabc123”, “sheet1!A1:F13”)
Note: This URL is for syntax purposes only. It is not meant to be entered into your own spreadsheet.
Once you’ve completed the function, a box will pop up to prompt you to allow access to the Google Sheet from which you’re importing data. You must allow access to the spreadsheet containing the data the first time you import it into Google Sheets. To try this out, replace the URL with that of a spreadsheet you have created, so you can control access to it by selecting the Allow access button.
Importing HTML tables is a basic method to extract data from public web pages. This process is often called “scraping.” Web scraping made easy introduces how to do this with Google Sheets or Microsoft Excel.
In Google Sheets, you can use the IMPORTHTML function to import the data from an HTML table (or list) on a web page. This function is similar to the IMPORTRANGE function.
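For example, IMPORTHTML takes the page’s URL, whether you are importing a “table” or a “list”, and the position of that table or list on the page. A hypothetical example (the URL is for syntax purposes only) is:
=IMPORTHTML(“https://example.com/page”, “table”, 1)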
Sometimes data displayed on the web is in the form of a comma- or tab-delimited file. You can use the IMPORTDATA function to import such a file directly into a Google Sheet. This function is similar to the IMPORTRANGE function.
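For example, a hypothetical call (the URL is for syntax purposes only) is:
=IMPORTDATA(“https://example.com/data.csv”)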
File-naming conventions help you organize, access, process, and analyze data because they act as quick reference points to identify what’s in a file. You should align your project’s file names with your team’s or company’s existing file-naming conventions. You don’t want to spend time learning a new file-naming convention each time you look up a file in a new project!
It’s also critical to ensure that file names are meaningful, consistent, and easy-to-read. File names should include:
The project’s name
The file creation date
Revision version
Consistent style and order
Further, file-naming conventions should act as quick reference points to identify what is in the file. Because of this, they should be short and to the point, for example: SalesReport_20231125_v02.
To keep your files organized, create folders and subfolders—in a logical hierarchy—to ensure related files are stored together and can be found easily later. A hierarchy is a way of organizing files and folders. Broader-topic folders are located at the top of the hierarchy, and more specific subfolders and files are contained within those folders. Each folder can contain other folders and files. This allows you to group related files together and makes it easier to find the files you need. In addition, it’s a best practice to store completed files separately from in-progress files so the files you need are easy to find. Archive older files in a separate folder or in an external storage location.
Data security means protecting data from unauthorized access or corruption by putting safety measures in place. Usually the purpose of data security is to keep unauthorized users from accessing or viewing sensitive data.
Encryption uses a unique algorithm to alter data and make it unusable by users and applications that don’t know the algorithm. This algorithm is saved as a “key” which can be used to reverse the encryption; so if you have the key, you can still use the data in its original form.
Tokenization replaces the data elements you want to protect with randomly generated data referred to as a “token.” The original data is stored in a separate location and mapped to the tokens. To access the complete original data, the user or application needs to have permission to use the tokenized data and the token mapping. This means that even if the tokenized data is hacked, the original data is still safe and secure in a separate location.
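Here is a minimal sketch of the tokenization idea, assuming Python’s standard library; it is only an illustration, and production systems rely on dedicated, audited tokenization services with the mapping stored in a separate, secured location.

```python
# Minimal tokenization sketch (standard library; for illustration only).
import secrets

token_vault = {}  # in a real system, this mapping lives in a separate, secured store

def tokenize(value: str) -> str:
    """Replace a sensitive value with a randomly generated token."""
    token = secrets.token_hex(8)
    token_vault[token] = value
    return token

def detokenize(token: str) -> str:
    """Recover the original value; requires access to the token mapping."""
    return token_vault[token]

card_token = tokenize("4111 1111 1111 1111")  # a well-known test card number
print(card_token)               # safe to store or share
print(detokenize(card_token))   # only possible with access to the vault
```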
Encryption and tokenization are just some of the data security options out there. There are a lot of others, like using authentication devices for AI technology.
Version control enables all collaborators within a file to track changes over time. You can understand who made what changes to a file, when they were made, and why.