A reminder of what the Prepare stage means, as defined on the Data Analysis Roadmap page:
The prepare phase ensures that you have all of the data you need for your analysis and that you have credible, useful data.
Preparing your data, or knowing how it was prepared, is key, so let’s go through a few data scenarios.
You will decide what data you need to collect in order to answer your questions and how to organize it so that it is useful. You might use your business task to decide:
Questions to ask yourself in this step:
It all started with solid preparation. The group built a timeline of three months and decided how they wanted to relay their progress to interested parties. Also during this step, the analysts identified what data they needed to achieve the successful result they identified in the previous step - in this case, the analysts chose to gather the data from an online survey of new employees. These were the things they did to prepare:
They developed specific questions to ask about employee satisfaction with different business processes, such as hiring and onboarding, and their overall compensation.
They established rules for who would have access to the data collected - in this case, anyone outside the group wouldn’t have access to the raw data, but could view summarized or aggregated data. For example, an individual’s compensation wouldn’t be available, but salary ranges for groups of individuals would be viewable.
They finalized what specific information would be gathered, and how best to present the data visually. The analysts brainstormed possible project- and data-related issues and how to avoid them.
| Data format classification | Definition |
|---|---|
| Primary data | Collected by a researcher from first-hand sources |
| Secondary data | Gathered by other people or from other research |
| Data format classification | Definition |
|---|---|
| Internal data | Data that is stored inside a company’s own systems |
| External data | Data that is stored outside of a company or organization |
| Data format classification | Definition |
|---|---|
| Continuous data | Data that is measured and can have almost any numeric value |
| Discrete data | Data that is counted and has a limited number of values |
| Data format classification | Definition |
|---|---|
| Qualitative | A subjective and explanatory measure of a quality or characteristic |
| Quantitative | A specific and objective measure, such as a number, quantity, or range |
| Data format classification | Definition |
|---|---|
| Nominal | A type of qualitative data that is categorized without a set order |
| Ordinal | A type of qualitative data with a set order or scale |
Two general categories of data are:
Structured data: Organized in a certain format, such as rows and columns.
Unstructured data: Not organized in any easy-to-identify way.
For example, when you rate your favorite restaurant online, you’re creating structured data. But when you use Google Earth to check out a satellite image of a restaurant location, you’re using unstructured data.
Here’s a refresher on the characteristics of structured and unstructured data:
As we described earlier, structured data is organized in a certain format. This makes it easier to store and query for business needs. If the data is exported, the structure goes along with the data.
Unstructured data can’t be organized in any easily identifiable manner. And there is much more unstructured than structured data in the world. Video and audio files, text files, social media content, satellite imagery, presentations, PDF files, open-ended survey responses, and websites all qualify as types of unstructured data.
| Data format classification | Definition |
|---|---|
| Structured data | Data organized in a certain format, like rows and columns |
| Unstructured data | Data that cannot be stored as columns and rows in a relational database |
The fairness issue
The lack of structure makes unstructured data difficult to search, manage, and analyze. But recent advancements in artificial intelligence and machine learning algorithms are beginning to change that. Now, the new challenge facing data scientists is making sure these tools are inclusive and unbiased. Otherwise, certain elements of a dataset will be more heavily weighted and/or represented than others.
An unfair dataset does not accurately represent the population, causing skewed outcomes, low accuracy levels, and unreliable analysis.
Data models help keep data consistent and enable people to map out how data is organized. A basic understanding makes it easier for analysts and other stakeholders to make sense of their data and use it in the right ways.
Data modeling is the process of creating diagrams that visually represent how data is organized and structured. These visual representations are called data models. You can think of data modeling as a blueprint of a house. At any point, there might be electricians, carpenters, and plumbers using that blueprint. Each one of these builders has a different relationship to the blueprint, but they all need it to understand the overall structure of the house. Data models are similar; different users might have different data needs, but the data model gives them an understanding of the structure as a whole.
Data modeling can help you explore the high-level details of your data and how it is related across the organization’s information systems. Data modeling sometimes requires data analysis to understand how the data is put together; that way, you know how to map the data. And finally, data models make it easier for everyone in your organization to understand and collaborate with you on your data. This is important for you and everyone on your team!
Each level of data modeling has a different level of detail.
Conceptual data modeling gives a high-level view of the data structure, such as how data interacts across an organization. For example, a conceptual data model may be used to define the business requirements for a new database. A conceptual data model doesn’t contain technical details.
Logical data modeling focuses on the technical details of a database such as relationships, attributes, and entities. For example, a logical data model defines how individual records are uniquely identified in a database. But it doesn’t spell out actual names of database tables. That’s the job of a physical data model.
Physical data modeling depicts how a database operates. A physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database.
Wide data is a dataset in which every data subject has a single row, with multiple columns holding the values of various attributes of that subject. In other words, each row contains multiple data points for the items identified in the columns. This format is helpful for comparing specific attributes across different subjects.
Long data is a dataset in which each row represents one observation per subject, so each subject is represented by multiple rows, and each row contains a single data point for a particular item. This format is useful for comparing changes over time or making other comparisons across subjects. In the long-data example sketched below, individual stock prices (data points) are collected for Apple (AAPL), Amazon (AMZN), and Google (GOOGL) (particular items) on the given dates.
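Here is a minimal sketch of the two formats, assuming Python with pandas (the tool choice is an assumption; the same reshaping can be done in spreadsheets or SQL). The ticker symbols match the example above, but the prices are placeholder values, not real quotes.

```python
# Minimal wide-vs-long sketch (pandas assumed; prices are placeholders).
import pandas as pd

# Wide format: one row per date, one column per stock ticker.
wide = pd.DataFrame({
    "date":  ["2021-01-04", "2021-01-05"],
    "AAPL":  [100.0, 101.5],    # placeholder prices, not real quotes
    "AMZN":  [150.0, 149.0],
    "GOOGL": [120.0, 121.2],
})

# Long format: one row per (date, ticker) observation.
long_format = pd.melt(wide, id_vars="date", var_name="ticker", value_name="price")
print(long_format)
```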
Data transformation is the process of changing the data’s format, structure, or values. As a data analyst, there is a good chance you will need to transform data at some point to make it easier to analyze; a short sketch of a few of these steps appears after the lists below.
Data transformation usually involves:
Adding, copying, or replicating data
Deleting fields or records
Standardizing the names of variables
Renaming, moving, or combining columns in a database
Joining one set of data with another
Saving a file in a different format. For example, saving a spreadsheet as a comma-separated values (.csv) file.
Goals for data transformation might be:
Data organization: better organized data is easier to use
Data compatibility: different applications or systems can then use the same data
Data migration: data with matching formats can be moved from one system to another
Data merging: data with the same organization can be merged together
Data enhancement: data can be displayed with more detailed fields
Data comparison: apples-to-apples comparisons of the data can then be made
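As a concrete illustration, here is a minimal sketch of a few of the steps above, assuming Python with pandas (an assumption; the column names and values are hypothetical).

```python
# Minimal data-transformation sketch (pandas assumed; names and values are hypothetical).
import pandas as pd

sales = pd.DataFrame({
    "Cust Name": ["Ada", "Grace"],
    "amt":       ["100", "250"],   # amounts stored as text in the source system
})

# Standardize variable names by renaming columns.
sales = sales.rename(columns={"Cust Name": "customer_name", "amt": "amount"})

# Change the data type of a field so it can be used in calculations.
sales["amount"] = sales["amount"].astype(float)

# Join one set of data with another (a hypothetical region lookup).
regions = pd.DataFrame({"customer_name": ["Ada", "Grace"], "region": ["West", "East"]})
sales = sales.merge(regions, on="customer_name", how="left")

# Save the result in a different format (.csv).
sales.to_csv("sales_clean.csv", index=False)
```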
Before you work with data, you must confirm that it is unbiased and credible. After all, if you start your analysis with unreliable data, you won’t be able to trust your results. Next we’ll learn to identify bias in data and to ensure your data is credible. You’ll also explore open data and the importance of data ethics and data privacy.
We’ll learn how to analyze data for bias and credibility. This is very important because even the most sound data can be skewed or misinterpreted. Then we’ll learn about the difference between good and bad data sources. After that, we’ll learn more about the world of data ethics, privacy, and access.
As more and more data becomes available, and the algorithms we create to use this data become more complex and sophisticated, new issues keep popping up. We need to ask questions like, who owns all this data? How much control do we have over the privacy of data? Can we use and reuse data however we want to?
Bias is a preference in favor of or against a person, group of people, or thing. It can be conscious or subconscious.
Data bias is a type of error that systematically skews results in a certain direction. Maybe the questions on a survey had a particular slant to influence answers, or maybe the sample group wasn’t truly representative of the population being studied.
Bias can also happen if a sample group lacks inclusivity. For example, people with disabilities tend to be under-identified, under-represented, or excluded in mainstream health research.
The way you collect data can also bias a data set. Sampling bias is when a sample isn’t representative of the population as a whole. You can avoid this by making sure the sample is chosen at random, so that all parts of the population have an equal chance of being included. If you don’t use random sampling during data collection, you end up favoring one outcome.
Observer bias, sometimes referred to as experimenter bias or research bias, is the tendency for different people to observe things differently.
You might remember that earlier, we learned that scientists use observations a lot in their work, like when they’re looking at bacteria under a microscope to gather data. If two scientists looking into the same microscope see different things, that’s observer bias.
Another time observer bias might happen is during manual blood pressure readings. Because the pressure meter is so sensitive, health care workers often get quite different results. Usually, they’ll just round up to the nearest whole number to compensate for the margin of error. But if doctors consistently round their patients’ blood pressure readings up or down, health conditions may be missed, and any studies involving their patients wouldn’t have precise and accurate data.
Interpretation bias is the tendency to always interpret ambiguous situations in a positive or negative way.
Let’s say you’re having lunch with a colleague, when you get a voicemail from your boss, asking you to call her back. You put the phone down in a huff, certain that she’s angry, and you’re on the hot seat for something. But when you play the message for your friend, he doesn’t hear anger at all, he actually thinks she sounds calm and straightforward.
Interpretation bias can lead to two people seeing or hearing the exact same thing and interpreting it in a variety of different ways, because they have different backgrounds and experiences. Your history with your boss made you interpret the call one way, while your friend interpreted it in another way, because they’re strangers. Add these interpretations to a data analysis, and you can get biased results.
People see what they want to see.
That pretty much sums up confirmation bias in a nutshell. Confirmation bias is the tendency to search for or interpret information in a way that confirms preexisting beliefs. Someone might be so eager to confirm a gut feeling that they only notice things that support it, ignoring all other signals. This happens all the time in everyday life.
What about good data sources? Are those subjective, too? In some ways they are, but luckily, there are some best practices to follow that will help you measure the reliability of datasets before you use them.
Like a good friend, good data sources are reliable. With this data you can trust that you’re getting accurate, complete and unbiased information that’s been vetted and proven fit for use.
There’s a good chance you’ll discover data through a second or third party source. To make sure you’re dealing with good data, be sure to validate it with the original source.
The best data sources contain all critical information needed to answer the question or find the solution. Think about it like this. You wouldn’t want to work for a company just because you found one great online review about it. You’d research every aspect of the organization to make sure it was the right fit. It’s important to do the same for your data analysis.
The usefulness of data decreases as time passes. If you wanted to invite all current clients to a business event, you wouldn’t use a 10-year-old client list. The same goes for data.
The best data sources are current and relevant to the task at hand.
If you’ve ever told a friend where you heard that a new movie sequel was in the works, you’ve cited a source. Citing makes the information you’re providing more credible.
When you’re choosing a data source, think about three things.
Who created the data set?
Is it part of a credible organization?
When was the data last refreshed?
If you have original data from a reliable organization and it’s comprehensive, current, and cited, it ROCCCs!
Bad data sources are not reliable, original, comprehensive, current, or cited. Even worse, they could be flat-out wrong or filled with human error. Basically, they’re the opposite of ROCCC.
When we analyze data, we’re also faced with questions, challenges, and opportunities, but we have to rely on more than just our personal code of ethics to address them.
As we learned earlier, we all have our own personal biases, not to mention subconscious biases that make ethics even more difficult to navigate. That’s why we have data ethics, an important aspect of analytics.
Just like humans, data has standards to live up to as well. Data ethics refers to well-founded standards of right and wrong that dictate how data is collected, shared, and used.
Since the ability to collect, share and use data in such large quantities is relatively new, the rules that regulate and govern the process are still evolving.
The General Data Protection Regulation (GDPR) of the European Union was created to do just this. The concept of data ethics and issues related to transparency and privacy are part of the process.
Data ethics tries to get to the root of the accountability companies have in protecting and responsibly using the data they collect. There are lots of different aspects of data ethics, but we’ll cover six: ownership, transaction transparency, consent, currency, privacy, and openness.
Ownership answers the question: who owns data? It isn’t the organization that invested time and money collecting, storing, processing, and analyzing it. It’s the individuals who own the raw data they provide, and they have primary control over its usage, how it’s processed, and how it’s shared.
Transaction transparency is the idea that all data-processing activities and algorithms should be completely explainable and understood by the individual who provides the data. This is in response to concerns over data bias, which, as we discussed earlier, is a type of error that systematically skews results in a certain direction.
Biased outcomes can lead to negative consequences. To avoid them, it’s helpful to provide transparent analysis especially to the people who share their data. This lets people judge whether the outcome is fair and unbiased and allows them to raise potential concerns.
Consent is an individual’s right to know explicit details about how and why their data will be used before agreeing to provide it. They should know the answers to questions like:
Why is the data being collected?
How will it be used?
How long will it be stored?
The best way to give consent is probably a conversation between the person providing the data and the person requesting it. But with so much activity happening online these days, consent usually just looks like a terms and conditions checkbox with links to more details.
Currency means individuals should be aware of financial transactions resulting from the use of their personal data and the scale of those transactions. If your data is helping to fund a company’s efforts, you should know what those efforts are all about and be given the opportunity to opt out.
Privacy is all about the access, use, and collection of data. It also covers a person’s legal right to their data. This means someone like you or me should have protection from unauthorized access to our private data, freedom from inappropriate use of our data, the right to inspect, update, or correct our data, the ability to give consent to use our data, and the legal right to access our data.
For companies, it means putting privacy measures in place to protect the individuals’ data. Data privacy is important, even if you’re not someone who thinks about it on a day-to-day basis.
You have been learning about the importance of privacy in data analytics. Now, it is time to talk about data anonymization and what types of data should be anonymized. Personally identifiable information, or PII, is information that can be used by itself or with other data to track down a person’s identity.
Data anonymization is the process of protecting people’s private or sensitive data by eliminating that kind of information. Typically, data anonymization involves blanking, hashing, or masking personal information, often by using fixed-length codes to represent data columns, or hiding data with altered values.
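For illustration, here is a minimal sketch of hashing and masking, assuming Python’s standard library; the record and field names are hypothetical, and real anonymization should follow your organization’s policies.

```python
# Minimal anonymization sketch (standard library only; record is hypothetical).
import hashlib

record = {"name": "Avery Lee", "email": "avery@example.com", "phone": "555-0123"}

def hash_value(value: str) -> str:
    """Replace a value with a fixed-length code (a SHA-256 hex digest)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def mask_phone(phone: str) -> str:
    """Hide all but the last four digits of a phone number."""
    return "***-" + phone[-4:]

anonymized = {
    "name": hash_value(record["name"]),
    "email": hash_value(record["email"]),
    "phone": mask_phone(record["phone"]),
}
print(anonymized)
```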
What types of data should be anonymized?
Healthcare and financial data are two of the most sensitive types of data. These industries rely a lot on data anonymization techniques. After all, the stakes are very high. That’s why data in these two industries usually goes through de-identification, which is a process used to wipe data clean of all personally identifying information.
Data anonymization is used in just about every industry. That is why it is so important for data analysts to understand the basics. Here is a list of data that is often anonymized:
Telephone numbers
Names
License plates and license numbers
Social security numbers
IP addresses
Medical records
Email addresses
Photographs
Account numbers
For some people, it just makes sense that this type of data should be anonymized. For others, we have to be very specific about what needs to be anonymized.
Openness refers to free access, usage, and sharing of data. But for data to be considered open, it has to:
Be available and accessible to the public as a complete dataset
Be provided under terms that allow it to be reused and redistributed
Allow universal participation so that anyone can use, reuse, and redistribute the data
In data analytics, open data is part of data ethics, which has to do with using data ethically. Data can only be considered open when it meets all three of these standards.
The open data debate: What data should be publicly available?
One of the biggest benefits of open data is that credible databases can be used more widely. Basically, this means that all of that good data can be leveraged, shared, and combined with other data. This could have a huge impact on scientific collaboration, research advances, analytical capacity, and decision-making. But it is important to think about the individuals being represented by public, open data, too.
Third-party data is collected by an entity that doesn’t have a direct relationship with the data.
Personally identifiable information (PII) is data that is reasonably likely to identify a person and make information known about them. It is important to keep this data safe. PII can include a person’s address, credit card information, social security number, medical records, and more.
A relational database is a database that contains a series of tables that can be connected to form relationships. Basically, they allow data analysts to organize and link data based on what the data has in common.
In a non-relational table, you will find all of the possible variables you might be interested in analyzing grouped together in a single table. This can make it really hard to sort through. This is one reason why relational databases are so common in data analysis: they simplify a lot of analysis processes and make data easier to find and use across an entire database.
Normalization is a process of organizing data in a relational database. For example, creating tables and establishing relationships between those tables. It is applied to eliminate data redundancy, increase data integrity, and reduce complexity in a database.
Primary key is an identifier that references a column in which each value is unique. In other words, it’s a column of a table that is used to uniquely identify each record within that table. The value assigned to the primary key in a particular row must be unique within the entire table. For example, if customer_id is the primary key for the customer table, no two customers will ever have the same customer_id.
Foreign Key is a field within a table that is a primary key in another table. A table can have only one primary key, but it can have multiple foreign keys. These keys are what create the relationships between tables in a relational database, which helps organize and connect data across multiple tables in the database.
Some tables don’t require a primary key. For example, a revenue table can have multiple foreign keys and not have a primary key. A primary key may also be constructed using multiple columns of a table. This type of primary key is called a composite key. For example, if customer_id and location_id are two columns of a composite key for a customer table, the values assigned to those fields in any given row must be unique within the entire table.
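To make keys concrete, here is a minimal sketch using Python’s built-in sqlite3 module (the choice of SQLite is an assumption; the table and column names follow the examples above).

```python
# Minimal primary/foreign/composite key sketch using SQLite (assumption: sqlite3).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- customer_id is the primary key: no two customers share the same value.
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT
    );

    -- A revenue table with a foreign key and no primary key of its own.
    CREATE TABLE revenue (
        customer_id INTEGER REFERENCES customer (customer_id),
        amount      REAL
    );

    -- A composite key: customer_id and location_id together must be unique.
    CREATE TABLE customer_location (
        customer_id INTEGER,
        location_id INTEGER,
        PRIMARY KEY (customer_id, location_id)
    );
""")
conn.close()
```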
Metadata is data about data. In database management, metadata provides information about other data and helps data analysts interpret the contents of the data within a database. In essence, metadata tells the who, what, when, where, which, why, and how of data.
As a data analyst, there are three common types of metadata that you’ll come across: descriptive, structural, and administrative.
Descriptive metadata is metadata that describes a piece of data and can be used to identify it at a later point in time.
For instance, the descriptive metadata of a book in a library would include the code you see on its spine, known as a unique International Standard Book Number, also called the ISBN. It would also include the book’s author and title.
Structural metadata, which is metadata that indicates how a piece of data is organized and whether it’s part of one or more than one data collection.
Let’s head back to the library. An example of structural metadata would be how the pages of a book are put together to create different chapters. It’s important to note that structural metadata also keeps track of the relationship between two things. For example, it can show us that the digital document of a book manuscript was actually the original version of a now printed book.
Administrative metadata is metadata that indicates the technical source of a digital asset.
When we looked at the metadata inside the photo, that was administrative metadata. It shows you the type of file it was, the date and time it was taken, and much more.
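As a small illustration, the sketch below reads administrative-style metadata (file size and last-modified time) with Python’s standard library; it creates a placeholder file first so it runs as-is.

```python
# Minimal administrative-metadata sketch (standard library; placeholder file).
import datetime
import os

path = "example.txt"              # placeholder file, not a real digital asset
with open(path, "w") as f:
    f.write("sample contents")

info = os.stat(path)
print("File type (extension):", os.path.splitext(path)[1])
print("Size in bytes:", info.st_size)
print("Last modified:", datetime.datetime.fromtimestamp(info.st_mtime))
```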
It’s important to understand what type of information metadata typically provides:
File or document type: What type of file or document are you examining?
Date, time, and creator: When was it created? Who created it? When was it last modified?
Title and description: What is the name of the item you are examining? What type of content does it contain?
Geolocation: If you’re examining a photo, where was it taken?
Tags and categories: What is the general overview of the item that you have? Is it indexed or described in a specific way?
Who last modified it and when: Were any changes made to the file? If yes, when were the most recent modifications made?
Who can access or update it: If you’re examining a dataset, is it public? Are special permissions needed to customize or modify it?
If the data being used to solve a problem or to make a data-driven decision is unreliable, there’s a good chance the results will be unreliable as well. Metadata helps data analysts confirm their data is reliable by making sure it is:
Accurate
Precise
Relevant
Timely
It does this by helping analysts ensure that they’re working with the right data and that the data is described correctly. For example, a data analyst completing a project with data from 2022 can use metadata to easily determine if they should use data from a particular file.
Data analysts thrive on consistency and aim for uniformity in their data and databases, and metadata helps make this possible. For example, to use survey data from two different sources, data analysts use metadata to make sure the same collection methods were applied in the survey so that both datasets can be compared reliably.
When a database is consistent, it’s easier to discover relationships between the data inside the database and data that exists elsewhere. When data is uniform, it is:
Organized: Data analysts can easily find tables and files, monitor the creation and alteration of assets, and store metadata.
Classified: Data analysts can categorize data when it follows a consistent format, which is beneficial in cleaning and processing data.
Stored: Consistent and uniform data can be efficiently stored in various data repositories. This streamlines storage management tasks such as managing a database.
Accessed: Users, applications, and systems can efficiently locate and use data.
Metadata repositories help data analysts ensure their data is reliable and consistent. Metadata repositories are specialized databases specifically created to store and manage metadata. They can be kept in a physical location or a virtual environment—like data that exists in the cloud.
Metadata repositories describe where the metadata came from and store that data in an accessible form with a common structure. This provides data analysts with quick and easy access to the data. If data analysts didn’t use a metadata repository, they would have to select each file to look up its information and compare the data manually, which would waste a lot of time and effort.
Data analysts also use metadata repositories to bring together multiple sources for data analysis. Metadata repositories do this by describing the state and location of the data, the structure of the tables inside the data, and who has accessed the data, according to user logs.
Data analysts use both second-party and third-party data to gain valuable insights and make strategic, data-driven decisions. Second-party data is data that’s collected by a group directly from the group’s audience and then sold. Third-party data is provided by outside sources that didn’t collect it directly. The providers of this data are not its original collectors and do not have a direct relationship with any individuals to whom the data belongs. The outside providers get the data from websites or other programs that pull it from the various platforms where it was originally generated.
Data analysts should understand the metadata of external databases to confirm that it is consistent and reliable. In some cases, they should also contact the owner of the third-party data to confirm that it is accessible and available for purchase. Confirming that the data is reliable and that the proper permissions to use it have been obtained are best practices when using data that comes from another organization.
As you’ve learned, you can import data from some data sources, like .csv files into a Google spreadsheet from the File menu. Keep in mind that, when you use this method, data that is updated in the .csv will not automatically be updated in the Google Sheet. Instead, it will need to be manually—and continually—updated in the Google Sheet. In some situations, such as when you want to be able to keep track of changes you’ve made, this method is ideal. In other situations, you might need to keep the data the same in both places, and using data that doesn’t update automatically can be time-consuming and tedious. Further, trying to maintain the same dataset in multiple places can cause errors later on.
Fortunately, there are tools to help you automate data imports so you don’t need to continually update the data in your current spreadsheet. Take a small general store as an example. The store has three cash registers handled by three clerks. At the end of each day, the owner wants to determine the total sales and the amount of cash in each register. Each clerk is responsible for counting their money and entering their sales total into a spreadsheet. The owner has the spreadsheets set up to import each clerk’s data into another spreadsheet, which automatically calculates the total sales for all three registers. Without this automation, each clerk would have to take turns entering their data into the owner’s spreadsheet. This is an example of a dynamic method of importing data, which saves the owner and clerks time and energy. When data is dynamic, it is interactive and automatically changes and updates over time.
IMPORT functions in Google Sheets
In Google Sheets, the IMPORTRANGE function can import all or part of a dataset from another Google Sheet.
To use this function, you need two pieces of information:
The URL of the Google Sheet from which you’ll import data.
The name of the sheet and the range of cells you want to import into your Google Sheet.
Once you have this information, open the Google Sheet into which you want to import data and select the cell into which the first cell of data should be copied. Enter = to indicate you will enter a function, then complete the IMPORTRANGE function with the URL and range you identified in the following manner: =IMPORTRANGE(“URL”, “sheet_name!cell_range”). Note that an exclamation point separates the sheet name and the cell range in the second part of this function.
An example of this function is:
=IMPORTRANGE(“https://docs.google.com/thisisatestabc123”, “sheet1!A1:F13”)
Note: This URL is for syntax purposes only. It is not meant to be entered into your own spreadsheet.
Once you’ve completed the function, a box will pop up to prompt you to allow access to the Google Sheet from which you’re importing data. You must allow access to the spreadsheet containing the data the first time you import it into Google Sheets. To try this out, replace the URL with that of a spreadsheet you have created, so you can control access to it by selecting the Allow access button.
Importing HTML tables is a basic method to extract data from public web pages. This process is often called “scraping.” Web scraping made easy introduces how to do this with Google Sheets or Microsoft Excel.
In Google Sheets, you can use the IMPORTHTML function to import the data from an HTML table (or list) on a web page. This function is similar to the IMPORTRANGE function.
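For example, IMPORTHTML takes the page’s URL, whether you are importing a “table” or a “list”, and the position of that table or list on the page. A hypothetical example (the URL is for syntax purposes only) is:
=IMPORTHTML(“https://example.com/page”, “table”, 1)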
Sometimes data displayed on the web is in the form of a comma- or tab-delimited file. You can use the IMPORTDATA function to import such a file directly into a Google Sheet. This function is similar to the IMPORTRANGE function.
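For example, a hypothetical call (the URL is for syntax purposes only) is:
=IMPORTDATA(“https://example.com/data.csv”)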
File-naming conventions help you organize, access, process, and analyze data because they act as quick reference points to identify what’s in a file. You should align your project’s file names with your team’s or company’s existing file-naming conventions. You don’t want to spend time learning a new file-naming convention each time you look up a file in a new project!
It’s also critical to ensure that file names are meaningful, consistent, and easy-to-read. File names should include:
The project’s name
The file creation date
Revision version
Consistent style and order
Further, file-naming conventions should act as quick reference points to identify what is in the file. Because of this, they should be short and to the point, for example: SalesReport_20231125_v02.
To keep your files organized, create folders and subfolders—in a logical hierarchy—to ensure related files are stored together and can be found easily later. A hierarchy is a way of organizing files and folders. Broader-topic folders are located at the top of the hierarchy, and more specific subfolders and files are contained within those folders. Each folder can contain other folders and files. This allows you to group related files together and makes it easier to find the files you need. In addition, it’s a best practice to store completed files separately from in-progress files so the files you need are easy to find. Archive older files in a separate folder or in an external storage location.
Data security means protecting data from unauthorized access or corruption by putting safety measures in place. Usually the purpose of data security is to keep unauthorized users from accessing or viewing sensitive data.
Encryption uses a unique algorithm to alter data and make it unusable by users and applications that don’t know the algorithm. This algorithm is saved as a “key” which can be used to reverse the encryption; so if you have the key, you can still use the data in its original form.
Tokenization replaces the data elements you want to protect with randomly generated data referred to as a “token.” The original data is stored in a separate location and mapped to the tokens. To access the complete original data, the user or application needs to have permission to use the tokenized data and the token mapping. This means that even if the tokenized data is hacked, the original data is still safe and secure in a separate location.
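Here is a minimal sketch of the tokenization idea, assuming Python’s standard library; it is only an illustration, and production systems rely on dedicated, audited tokenization services with the mapping stored in a separate, secured location.

```python
# Minimal tokenization sketch (standard library; for illustration only).
import secrets

token_vault = {}  # in a real system, this mapping lives in a separate, secured store

def tokenize(value: str) -> str:
    """Replace a sensitive value with a randomly generated token."""
    token = secrets.token_hex(8)
    token_vault[token] = value
    return token

def detokenize(token: str) -> str:
    """Recover the original value; requires access to the token mapping."""
    return token_vault[token]

card_token = tokenize("4111 1111 1111 1111")  # a well-known test card number
print(card_token)               # safe to store or share
print(detokenize(card_token))   # only possible with access to the vault
```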
Encryption and tokenization are just some of the data security options out there. There are a lot of others, like using authentication devices for AI technology.
Version control enables all collaborators within a file to track changes over time. You can understand who made what changes to a file, when they were made, and why.