How to Deal with Missing Data

Missing data is exactly what the name implies: data that is not captured for a variable in the observation in question. Missing data can skew all kinds of tasks for data scientists, from economic analyses to clinical trials. After all, any analysis is only as good as the data behind it. A data scientist doesn’t want to produce biased estimates that lead to invalid results. Fortunately, there are proven techniques to deal with missing data.

Imputation vs. Removing Data

When dealing with missing data, data scientists can use two primary methods to address the problem: imputation or data removal.

The imputation method substitutes reasonable guesses for missing data. It’s most useful when the percentage of missing data is low. If the proportion of missing data is too high, the imputed results lack the natural variation needed to produce an effective model.

The other option is to remove data. When data is missing at random, the entire observation with missing information can be deleted to help reduce bias. Removing data may not be the best option if there are not enough remaining observations to support a reliable analysis. In some situations, observations of specific events or factors may be required, even if incomplete.

Before deciding which approach to employ, it helps to understand why the data is missing.

Missing at Random (MAR)

Missing at Random (MAR) means the probability that data is missing depends only on variables for which there is complete information, not on the missing values themselves. The data is not missing across all observations but only within sub-samples of the data, and the missingness can be predicted from the observed data. For example, in a survey of the general population, men might be less likely to answer a particular question; the missingness is then related to gender, which is observed, rather than to the answer itself.

Missing Completely at Random (MCAR) 

When data is MCAR, the data is missing across all observations regardless of the expected value or other variables. Data scientists can compare two sets of observations, one with missing values and one without. If a t-test shows no significant difference between the two groups on the fully observed variables, the missingness is consistent with MCAR.

Data may be missing due to test design, equipment failure or errors in recording observations. This type of data is seen as MCAR because the reasons for its absence are external and not related to the value of the observation.

It is typically safe to remove MCAR data because the results will be unbiased. The test may not be as powerful, but the results can still be reliable.
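
As a rough check, one can compare a fully observed variable between cases with and without missing values. Below is a minimal sketch using hypothetical survey data and SciPy’s two-sample t-test; the column names and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical survey data: "income" has missing values; "age" is complete.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 44, 38, 61, 27, 33],
    "income": [48000, np.nan, 61000, 52000, np.nan,
               75000, 58000, 90000, np.nan, 50000],
})

# Split the complete variable by whether "income" is missing.
missing = df.loc[df["income"].isna(), "age"]
observed = df.loc[df["income"].notna(), "age"]

# Two-sample t-test: a non-significant result is consistent with MCAR.
# (It does not prove MCAR, but a significant difference argues against it.)
t_stat, p_value = stats.ttest_ind(observed, missing, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```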

Missing Not at Random (MNAR)

The MNAR category applies when the probability that data is missing depends on the unobserved or missing values themselves. For example, if people with weaker opinions are less likely to answer, then that data is MNAR. Unlike MAR, the missingness cannot be predicted from the observed data, because it depends on the missing information itself. Data scientists must model the missing-data mechanism to develop an unbiased estimate. Simply removing observations with missing data could result in a biased model.

Deletion

There are three primary methods for deleting data when dealing with missing data: listwise, pairwise and dropping variables.

Listwise

In this method, all data for an observation that has one or more missing values are deleted. The analysis is run only on observations that have a complete set of data. If the data set is small, it may be the most efficient method to eliminate those cases from the analysis. However, in many cases, the data are not missing completely at random (MCAR). Deleting the instances with missing observations can result in biased parameters and estimates and reduce the statistical power of the analysis.
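
A minimal sketch of listwise deletion with pandas, using a small hypothetical data set:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0],
    "y": [10.0, np.nan, 30.0, 40.0],
})

# Listwise deletion: keep only the rows with a complete set of values.
complete_cases = df.dropna()
print(complete_cases)
```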

Pairwise

Pairwise deletion assumes data are missing completely at random (MCAR); each statistic is computed from all cases that have values for the variables involved, even if other values are missing. Pairwise deletion allows data scientists to use more of the data. However, the resulting statistics may vary because each one is based on a different subset of cases, and the results may be impossible to duplicate with a complete data set.
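
pandas applies pairwise deletion by default when computing correlations, which makes the idea easy to illustrate. A small sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.0, np.nan, 6.0, 8.0, 11.0],
    "z": [1.0, 3.0, 5.0, np.nan, 9.0],
})

# pandas computes each pairwise correlation from the rows where *both*
# columns are present, so each cell of the matrix may be based on a
# different subset of the data -- the hallmark of pairwise deletion.
print(df.corr())
```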

Dropping Variables

If data is missing for a large proportion of the observations, it may be best to discard the variable entirely if it is not significant to the analysis.
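
One way to apply this is to drop any column whose share of missing values exceeds a chosen cutoff. A sketch with pandas; the 50% threshold here is an illustrative assumption, not a standard rule.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mostly_complete": [1.0, 2.0, 3.0, np.nan, 5.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0, np.nan],
})

# Keep only the columns where at most 50% of the values are missing.
df_reduced = df.loc[:, df.isna().mean() <= 0.5]
print(df_reduced.columns.tolist())  # ['mostly_complete']
```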

Imputation

When data is missing, it may make sense to delete it, as mentioned above. However, that may not be the most effective option. For example, if too much information is discarded, it may not be possible to complete a reliable analysis. Or there may be insufficient data to generate a reliable prediction for observations that have missing values.

Instead of deletion, data scientists have multiple solutions to impute the value of missing data. Depending on why the data are missing, imputation methods can deliver reasonably reliable results. The following are common single imputation methods, which replace each missing value with one estimate.

Mean, Median and Mode

This is one of the most common methods of imputing values when dealing with missing data. In cases where there are a small number of missing observations, data scientists can calculate the mean or median of the existing observations and insert it in place of the missing values. However, when many values are missing, mean or median imputation can cause a loss of variation in the data. This method does not use time-series characteristics or depend on the relationship between the variables.
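
A minimal sketch of mean, median and mode imputation with pandas, using a hypothetical score column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [4.0, np.nan, 7.0, 5.0, np.nan, 6.0]})

# Single-value imputation: fill every gap with one summary statistic.
df["mean_filled"] = df["score"].fillna(df["score"].mean())
df["median_filled"] = df["score"].fillna(df["score"].median())
df["mode_filled"] = df["score"].fillna(df["score"].mode()[0])
print(df)
```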

Time-Series Specific Methods 

Another option, when appropriate, is to impute data using time-series-specific methods.

Time-series imputation methods assume that observations adjacent to the gap will resemble the missing data. They work well when that assumption holds. However, these methods won’t always produce reasonable results, particularly when the series has strong seasonality.

Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB)

These options are used to analyze longitudinal repeated-measures data, in which follow-up observations may be missing. Longitudinal data track the same instance at different points along a timeline. In these methods, every missing value is replaced with either the last observed value (LOCF) or the next one (NOCB). This approach is easy to understand and implement. However, it may introduce bias when the data has a visible trend, because it assumes the value is unchanged across the gap.
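
In pandas, LOCF and NOCB correspond to forward fill and backward fill. A minimal sketch on a hypothetical series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

locf = s.ffill()  # Last Observation Carried Forward
nocb = s.bfill()  # Next Observation Carried Backward
print(pd.DataFrame({"raw": s, "locf": locf, "nocb": nocb}))
```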

Linear Interpolation

Linear interpolation is often used to approximate a value of some function by using two known values of that function at other points: given known points (x₀, y₀) and (x₁, y₁), the value at x is y = y₀ + (y₁ − y₀)(x − x₀)/(x₁ − x₀). This formula can also be understood as a weighted average, where the weights are inversely related to the distance from each endpoint to the unknown point: the closer point has more influence than the farther one.

When dealing with missing data, you might use this method in a time series that exhibits a trend line, but it’s not appropriate for seasonal data.
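
A minimal sketch using pandas, which fills each gap along the straight line between its known endpoints:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan, 20.0])

# Interior gaps are filled by the weighted average described above.
print(s.interpolate(method="linear"))
```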

Seasonal Adjustment with Linear Interpolation

When dealing with data that exhibits both trend and seasonality characteristics, consider using seasonal adjustment with linear interpolation. First, perform the seasonal adjustment by computing a centered moving average or by averaging multiple averages (for example, two one-year averages offset by one period relative to each other). You can then complete the data smoothing with linear interpolation as discussed above.
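
A rough sketch of this idea with pandas, assuming a hypothetical monthly series with a period of 12. The seasonal component here is estimated from a simple centered moving average, which is one reasonable choice among several; the data and gap positions are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series (period = 12) with trend and seasonality.
idx = pd.date_range("2021-01-01", periods=36, freq="MS")
values = 100 + np.arange(36) + 10 * np.sin(2 * np.pi * np.arange(36) / 12)
s = pd.Series(values, index=idx)
s.iloc[15:18] = np.nan  # introduce a gap of missing observations

period = 12

# 1. Estimate the seasonal component from the observed data: average each
#    calendar month's deviation from a centered moving average.
trend = s.rolling(window=period, center=True, min_periods=1).mean()
seasonal = (s - trend).groupby(s.index.month).transform("mean")

# 2. Deseasonalize, interpolate the gap linearly, then restore the season.
deseasonalized = s - seasonal
filled = deseasonalized.interpolate(method="linear") + seasonal

print(filled.iloc[14:19])  # values around the former gap
```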

Multiple Imputation

Multiple imputation is considered a good approach for data sets with a large amount of missing data. Instead of substituting a single value for each missing data point, each missing value is replaced with a set of plausible values that reflect the natural variability and uncertainty of the true values. The process is repeated to create multiple imputed data sets. Each set is then analyzed using standard analytical procedures, and the results are combined to produce an overall result.

Because the imputations incorporate natural variability into the missing values, they support valid statistical inference. Multiple imputation can produce statistically valid results even when there is a small sample size or a large amount of missing data.
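
scikit-learn’s IterativeImputer can approximate this workflow: drawing from the posterior (sample_posterior=True) and varying the random seed yields several distinct imputed data sets whose results can then be pooled. A sketch with hypothetical data; a full analysis would also pool the variances, for example with Rubin’s rules.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "x2": [2.1, 4.2, 6.1, np.nan, 9.9, 12.2],
})

# Draw several imputed data sets; sampling from the posterior adds the
# between-imputation variability that multiple imputation relies on.
imputed_sets = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    imputed_sets.append(imputed)

# Analyze each set separately, then combine the estimates.
estimates = [d["x1"].mean() for d in imputed_sets]
print(np.mean(estimates))
```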

K-Nearest Neighbors 

In this method, data scientists identify a data point’s nearest neighbors based on the other variables and estimate the missing value from those neighbors’ values. The data scientist must select the number of nearest neighbors and the distance metric. KNN can impute the most frequent value among the neighbors for categorical variables, or the mean of the nearest neighbors for numerical ones.
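
A minimal sketch using scikit-learn’s KNNImputer on hypothetical height and weight data; it fills each gap with the mean of the nearest neighbors, found by a Euclidean distance computed over the observed features.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "height": [160.0, 172.0, np.nan, 181.0, 158.0],
    "weight": [55.0, 70.0, 74.0, np.nan, 52.0],
})

# Fill each missing value with the mean of its 2 nearest neighbors.
imputer = KNNImputer(n_neighbors=2)
print(pd.DataFrame(imputer.fit_transform(df), columns=df.columns))
```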

Learn More About Data Science

When working as a data scientist, you will often face imperfect data sets. Analyzing data with missing information is an important part of the job. Advancing your education in data science can help you learn how to tackle these issues and more.

Last updated October 2023