9 Python Libraries for Managing Missing Data Efficiently

python
data science
machine learning
data wrangling
data cleaning
Author

Adejumo Ridwan Suleiman

Data cleaning is one of the most tedious tasks a data scientist faces before building any model, especially when it comes to imputing missing data. Most real-world datasets have missing observations, which can arise from various factors.

It’s your role as a data scientist to figure out how to handle the missing data problem in your dataset, because building a model without imputing missing data can lead to wrong predictions.

There are various ways to impute missing data, from simple techniques such as mean/median imputation to advanced techniques like multiple imputation or KNN imputation. In this article, you will learn about the various libraries you can use for missing data imputation in Python.

Key Criteria for Selection

When selecting a library for missing data imputation, consider the following criteria.

  • Ease of use and documentation quality: Choose a library that is easy to use, with well-written documentation that explains each imputation algorithm and the functions that implement it. This makes it easier to know the right scenario for a particular imputation method.
  • Support for different imputation methods: Some data might warrant simpler methods, while others will need complex imputation methods. Prefer libraries that support a wide range of simple and advanced methods; this keeps your workflow lean and avoids juggling multiple libraries.
  • Integration with popular data science workflows: Some popular Python libraries already support various imputation methods, removing the need for an external library. Other imputation libraries integrate well with classes and functions from the broader Python ecosystem, making them easy to add to your workflow.
  • Community support and performance: Make sure the library has an active community and is actively maintained, so that whenever you get stuck, you can easily reach out for help.

Python Libraries for Missing Data Imputation

Let’s look at some of the libraries you can use for missing data imputation.

1. Pandas

Pandas is one of the most popular libraries among data professionals working with Python. Designed for data wrangling, it comes with various imputation methods, mostly for simple use cases. It’s easy to use, fast, and fits naturally into exploratory data analysis.

Here are some of its built-in methods for missing data imputation:

  • fillna(): replaces missing values in a dataframe with a value you supply, such as the column mean, median, or a custom constant.
  • interpolate(): fills missing values by interpolating between existing data points, essentially estimating values between known ones.
  • dropna(): removes rows or columns that contain missing (NaN) values.

Examples

Here are some examples demonstrating the above functions.

  • fillna():

    import pandas as pd

    df = pd.DataFrame({
        'A': [1, 2, None, 4],
        'B': [None, 2, 3, None]
    })
    
    print("Original DataFrame:")
    print(df)
    
    # Replace NaN with 0
    df_filled = df.fillna(0)
    print("\nAfter fillna(0):")
    print(df_filled)
    
    # Forward fill (propagate last valid value forward)
    df_ffill = df.ffill()
    print("\nAfter forward fill:")
    print(df_ffill)
    Original DataFrame:
         A    B
    0  1.0  NaN
    1  2.0  2.0
    2  NaN  3.0
    3  4.0  NaN
    
    After fillna(0):
         A    B
    0  1.0  0.0
    1  2.0  2.0
    2  0.0  3.0
    3  4.0  0.0
    
    After forward fill:
         A    B
    0  1.0  NaN
    1  2.0  2.0
    2  2.0  3.0
    3  4.0  3.0

    In the first imputation, the missing values are filled with 0, while in the second, each missing value is filled with the value from the previous row. Note that the first value of column B remains NaN after the forward fill because there is no earlier row to propagate from.

  • interpolate():

    import pandas as pd
    
    df = pd.DataFrame({
        'A': [1, None, 3, None, 5]
    })
    
    print("Original DataFrame:")
    print(df)
    
    # Linear interpolation
    df_interp = df.interpolate()
    print("\nAfter interpolation:")
    print(df_interp)
    Original DataFrame:
         A
    0  1.0
    1  NaN
    2  3.0
    3  NaN
    4  5.0
    
    After interpolation:
         A
    0  1.0
    1  2.0
    2  3.0
    3  4.0
    4  5.0

    The .interpolate() method supports various types of interpolation. The example above uses linear interpolation, which fills each missing value under the assumption that the values change at a constant rate between the two known points around it.

  • dropna():

    import pandas as pd
    
    df = pd.DataFrame({
        'A': [1, None, 3, 4],
        'B': [None, 2, None, 4]
    })
    
    print("Original DataFrame:")
    print(df)
    
    # Drop rows with any NaN values
    df_drop_any = df.dropna()
    print("\nAfter dropna() (any NaN):")
    print(df_drop_any)
    
    # Drop columns with any NaN values
    df_drop_col = df.dropna(axis=1)
    print("\nAfter dropna(axis=1):")
    print(df_drop_col)
    Original DataFrame:
         A    B
    0  1.0  NaN
    1  NaN  2.0
    2  3.0  NaN
    3  4.0  4.0
    
    After dropna() (any NaN):
         A    B
    3  4.0  4.0
    
    After dropna(axis=1):
    Empty DataFrame
    Columns: []
    Index: [0, 1, 2, 3]

    As shown in the example above, you can apply dropna() row-wise or column-wise to drop missing observations from your data. Be careful with axis=1 here: since every column contains at least one NaN, all columns are dropped, leaving an empty DataFrame.

2. Scikit-learn

Scikit-learn, which is popular for machine learning, also has various classes for handling missing data imputation. This makes it easy to apply the same imputation to both the train and test data when building your model workflow. Some of its imputation classes are:

  • SimpleImputer: for filling missing values using a simple rule like mean, median, most frequent value, and so on.
  • KNNImputer: fills missing values using the values of nearest neighbors.
  • IterativeImputer: fills missing values by modeling each feature with missing values as a function of the other features.

Examples

  • SimpleImputer:

    from sklearn.impute import SimpleImputer
    import numpy as np
    import pandas as pd
    
    # Original data with missing values
    data = np.array([[1, 2],
                     [np.nan, 3],
                     [7, np.nan]])
    
    print("Original data:")
    print(pd.DataFrame(data, columns=["A", "B"]))
    
    # Imputer replaces missing values with column mean
    imputer = SimpleImputer(strategy='mean')
    imputed_data = imputer.fit_transform(data)
    
    print("\nImputed data (SimpleImputer - mean):")
    print(pd.DataFrame(imputed_data, columns=["A", "B"]))
    Original data:
         A    B
    0  1.0  2.0
    1  NaN  3.0
    2  7.0  NaN
    
    Imputed data (SimpleImputer - mean):
         A    B
    0  1.0  2.0
    1  4.0  3.0
    2  7.0  2.5

    The above example shows mean imputation, which is typically used for numerical variables; for categorical variables, the most frequent value (mode) is used instead.

  • KNNImputer:

    from sklearn.impute import KNNImputer
    import numpy as np
    import pandas as pd
    
    data = np.array([[1, 2],
                     [np.nan, 3],
                     [7, 6]])
    
    print("Original data:")
    print(pd.DataFrame(data, columns=["A", "B"]))
    
    imputer = KNNImputer(n_neighbors=2)
    imputed_data = imputer.fit_transform(data)
    
    print("\nImputed data (KNNImputer):")
    print(pd.DataFrame(imputed_data, columns=["A", "B"]))
    Original data:
         A    B
    0  1.0  2.0
    1  NaN  3.0
    2  7.0  6.0
    
    Imputed data (KNNImputer):
         A    B
    0  1.0  2.0
    1  4.0  3.0
    2  7.0  6.0

    In the code above, the n_neighbors argument of the KNNImputer() class sets the number of nearest neighbors used for the imputation. Here, the missing value in column A is filled with the mean of its two neighbors' values (1 and 7), which gives 4.

  • IterativeImputer:

    from sklearn.experimental import enable_iterative_imputer
    from sklearn.impute import IterativeImputer
    import numpy as np
    import pandas as pd
    
    data = np.array([[1, 2],
                     [np.nan, 3],
                     [7, np.nan]])
    
    print("Original data:")
    print(pd.DataFrame(data, columns=["A", "B"]))
    
    imputer = IterativeImputer(max_iter=10, random_state=0)
    imputed_data = imputer.fit_transform(data)
    
    print("\nImputed data (IterativeImputer):")
    print(pd.DataFrame(imputed_data, columns=["A", "B"]))
    Original data:
         A    B
    0  1.0  2.0
    1  NaN  3.0
    2  7.0  NaN
    
    Imputed data (IterativeImputer):
               A    B
    0   1.000000  2.0
    1  12.999998  3.0
    2   7.000000  2.5

    Note that IterativeImputer is still experimental in scikit-learn, which is why the enable_iterative_imputer import is required. Because each feature is modeled as a function of the others, the imputed values can differ noticeably from simple column means, as the output above shows.

3. Fancyimpute

Fancyimpute is a standalone library for missing data imputation that includes various algorithms. Some of these, such as SimpleFill, KNN, and IterativeImputer, are also implemented in scikit-learn. Its more complex algorithms include:

  • SoftImpute: implements the spectral regularization algorithm for learning large incomplete matrices.
  • IterativeSVD: implements iterative low-rank SVD decomposition for imputing missing data.
  • MatrixFactorization: uses direct matrix factorization to fill in missing data.
  • BiScaler: uses iterative estimation of row/column means and standard deviations to obtain a doubly normalized matrix.

Example

Here is a simple example implementing the SoftImpute() class on missing data.

import numpy as np
import pandas as pd
from fancyimpute import SoftImpute

# Sample data with missing values
data = np.array([
    [5, 3, np.nan, 1],
    [4, np.nan, np.nan, 1],
    [1, 1, np.nan, 5],
    [1, np.nan, np.nan, 4],
    [np.nan, 1, 5, 4]
])

print("Original data:")
print(pd.DataFrame(data, columns=["Item1", "Item2", "Item3", "Item4"]))

# SoftImpute: matrix factorization-based imputation
soft_imputer = SoftImpute()
data_imputed = soft_imputer.fit_transform(data)

print("\nImputed data (SoftImpute):")
print(pd.DataFrame(data_imputed, columns=["Item1", "Item2", "Item3", "Item4"]))
Original data:
   Item1  Item2  Item3  Item4
0    5.0    3.0    NaN    1.0
1    4.0    NaN    NaN    1.0
2    1.0    1.0    NaN    5.0
3    1.0    NaN    NaN    4.0
4    NaN    1.0    5.0    4.0

Imputed data (SoftImpute):
      Item1     Item2     Item3  Item4
0  5.000000  3.000000  0.270492    1.0
1  4.000000  2.270316  0.120304    1.0
2  1.000000  1.000000  2.475254    5.0
3  1.000000  0.896283  1.941855    4.0
4  0.455978  1.000000  5.000000    4.0


4. Missingno

Missingno is a visualization library for missing observations that lets you study the pattern of missingness in your dataset. Through these visualizations, you can understand why and how the observations in your data are missing. It’s easy to use and works seamlessly with Pandas. Missingno supports visualizations such as the matrix plot, bar plot, heatmap, and dendrogram.

Example

import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt

# Sample dataset with missing values
data = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, np.nan, np.nan, 4, 5],
    'D': [1, 2, 3, 4, 5]
})

print("Original Data:")
print(data)

# Matrix plot
msno.matrix(data, figsize=(6, 3))
plt.show()

Missing data visualization with missingno. Image by Author.

The above example produces a matrix plot of the missing observations in the dataset. The matrix plot is a visual display of the data, where the rows are observations and the columns are variables. Shaded regions represent present values, while white regions represent missing values. The sparkline on the right summarizes how complete each row is.

5. MICEforest

MICEforest is a Python library for missing data imputation that implements Multiple Imputation by Chained Equations (MICE) using an iterative series of predictive models, each imputing a variable based on the other variables in the dataset. This method of imputation works better when you are handling complex missingness patterns or missing data of mixed types.

Example

import numpy as np
import pandas as pd
import miceforest as mf

# --- Create sample data ---
np.random.seed(42)
data = pd.DataFrame({
    'Age': [25, 30, np.nan, 40, 22, np.nan, 35],
    'Salary': [50000, 60000, 55000, np.nan, 52000, 58000, np.nan],
    'Experience': [1, 3, 2, 5, np.nan, 4, np.nan]
})
print("Original data with missing values:")
print(data)

# --- Create the imputation kernel ---
kernel = mf.ImputationKernel(
    data,
    num_datasets=1,
    random_state=42
)

# --- Run MICE for 3 iterations ---
kernel.mice(3)

# --- Get completed data ---
imputed_data = kernel.complete_data(dataset=0)
print("\nImputed data:")
print(imputed_data)
Original data with missing values:
    Age   Salary  Experience
0  25.0  50000.0         1.0
1  30.0  60000.0         3.0
2   NaN  55000.0         2.0
3  40.0      NaN         5.0
4  22.0  52000.0         NaN
5   NaN  58000.0         4.0
6  35.0      NaN         NaN

Imputed data:
    Age   Salary  Experience
0  25.0  50000.0         1.0
1  30.0  60000.0         3.0
2  40.0  55000.0         2.0
3  40.0  50000.0         5.0
4  22.0  52000.0         2.0
5  35.0  58000.0         4.0
6  35.0  55000.0         4.0

From the code above, ImputationKernel() creates the imputation models; it takes the data, the number of imputed datasets to create (num_datasets), and a random_state for reproducibility.

The .mice() method then runs the given number of iterations, 3 in our case. It first imputes the missing values, then refits the models on the previously imputed data, repeating until the number of iterations is reached.

How miceforest works. Source: MICEforest documentation.

6. Autoimpute

Autoimpute is another Python library that supports a wide range of imputation methods such as univariate, multivariate, and interpolation methods. It works well with scikit-learn and scipy, and also supports missing data visualization using missingno.

Example

import pandas as pd
from autoimpute.imputations import SingleImputer

# Example data
df = pd.DataFrame({
    "age": [25, None, 30, 22, None],
    "salary": [50000, 60000, None, None, 70000],
    "gender": ["M", "F", "F", None, "M"]
})

imputer = SingleImputer(
    strategy={
        "age": "mean", 
        "salary": "median", 
        "gender": "mode"  # use mode instead of logistic regression
    }
)

df_imputed = imputer.fit_transform(df)

print(df)
print(df_imputed)
    age   salary gender
0  25.0  50000.0      M
1   NaN  60000.0      F
2  30.0      NaN      F
3  22.0      NaN   None
4   NaN  70000.0      M
         age   salary gender
0  25.000000  50000.0      M
1  25.666667  60000.0      F
2  30.000000  60000.0      F
3  22.000000  60000.0      F
4  25.666667  70000.0      M

The above is a simple imputation using the SingleImputer class, where each column is imputed with a different measure of central tendency: the mean for age, the median for salary, and the mode for gender.


7. H2O.ai

Popularly known for its AutoML capabilities, H2O can also handle missing observations in both static and streaming data. You can use its methods to impute missing data, whether with simple imputation like mean, median, or mode, or with more advanced techniques. H2O is also great in production: if you have incoming data that can contain missing values, you can use the H2O platform to handle them, ensuring that your ML model does not break or give wrong predictions in production.
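
Example

Here is a minimal sketch using the impute() method of an H2OFrame; the data and column names are made up, and it assumes a local H2O cluster can be started on your machine (H2O requires Java).

import h2o
import pandas as pd

# Start (or connect to) a local H2O cluster
h2o.init()

# Hypothetical data with missing values
df = pd.DataFrame({
    "age": [25, None, 30, 22, None],
    "salary": [50000, 60000, None, None, 70000]
})

# Convert the pandas DataFrame to an H2OFrame
hf = h2o.H2OFrame(df)

# Impute in place: mean for 'age', median for 'salary'
hf.impute("age", method="mean")
hf.impute("salary", method="median")

print(hf)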


8. TensorFlow Data Validation (TFDV)

TensorFlow Data Validation (TFDV) is a powerful library designed to analyze, validate, and monitor data used in machine learning (ML) pipelines. It is useful in production environments, where data consistency and quality are critical for maintaining reliable model performance.

TFDV allows you to detect issues such as missing features, data drift, and anomalies that may occur as new data flows into your production pipeline. If a particular observation or feature is missing, TFDV can flag it and help you decide whether to impute, correct, or exclude it based on your defined schema.
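
Example

TFDV doesn’t impute missing values itself; it flags them. Here is a minimal sketch with made-up data showing the typical flow: infer a schema from training data, then validate new data against it. Depending on the presence constraints TFDV infers, the missing age value may be reported as an anomaly.

import pandas as pd
import tensorflow_data_validation as tfdv

# Training data with no missing values
train_df = pd.DataFrame({
    "age": [25, 30, 35, 40],
    "income": [50000, 60000, 65000, 70000]
})

# New serving data with a missing value
serving_df = pd.DataFrame({
    "age": [28, None],
    "income": [55000, 62000]
})

# Compute statistics and infer a schema from the training data
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
schema = tfdv.infer_schema(train_stats)

# Validate the serving data's statistics against the schema
serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)
anomalies = tfdv.validate_statistics(serving_stats, schema)
print(anomalies)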


9. Datawig (Amazon)

Datawig is an open-source library developed by Amazon for data imputation. It offers advanced imputation by learning from the other columns in your dataset. While offering complex missing data imputation, it’s easy to use and not verbose. It also handles both numerical and categorical missing data very well. Under the hood, Datawig uses deep-learning-based imputation methods to fill missing data.
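
Example

Here is a minimal sketch of Datawig’s SimpleImputer API with made-up data. Note that Datawig depends on MXNet, so it may require an older Python environment than the other libraries covered here.

import pandas as pd
from datawig import SimpleImputer

# Hypothetical data: 'description' helps predict the missing 'category'
df = pd.DataFrame({
    "description": ["red cotton shirt", "blue denim jeans",
                    "green cotton shirt", "black denim jeans"],
    "category": ["shirt", "jeans", None, "jeans"]
})

# Train on the rows where the target column is present
train_df = df[df["category"].notnull()]

imputer = SimpleImputer(
    input_columns=["description"],  # columns used as predictors
    output_column="category",       # column to impute
    output_path="imputer_model"     # where the trained model is stored
)
imputer.fit(train_df=train_df, num_epochs=5)

# Predict missing values; the result includes a 'category_imputed' column
imputed = imputer.predict(df)
print(imputed)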


Comparison Table

You might wonder, with all the options above, which is the best for your use case. Here is a table summarizing the above Python libraries and what they are best for.

| Library      | Approach       | Best For                                               | Learning Curve |
|--------------|----------------|--------------------------------------------------------|----------------|
| Pandas       | Simple         | Small data with few missing observations               | Low            |
| Scikit-learn | Statistical/ML | Integrating missing data imputation into ML pipelines  | Low            |
| Fancyimpute  | Advanced       | High-dimensional data                                  | Low            |
| Missingno    | Visualization  | Missing data visualization and diagnostics             | Low            |
| MICEforest   | Random forest  | Robust imputation                                      | Low            |
| Autoimpute   | Statistical    | Reproducibility                                        | Low            |
| H2O.ai       | ML/AutoML      | Big data                                               | High           |
| TFDV         | Validation     | Production-scale ML pipelines                          | High           |
| Datawig      | Deep learning  | Complex patterns                                       | High           |

Best Practices for Missing Data Imputation

When imputing missing data, here are some best practices you should adhere to:

  • Always visualize missingness patterns first: Visualizing missing data lets you see the nature of the missing observations in your dataset and their relationship with other variables. This tells you what kind of missingness you are dealing with and which imputation technique to go for.
  • Avoid blind imputation without domain knowledge: Don’t just drop or impute missing data blindly; try to understand why it’s missing. Some values are missing by design: in a survey given to both men and women, for example, questions about pregnancy or menstruation will naturally be unanswered by male respondents. Imputing those values would imply that a male respondent has been pregnant or menstruating, which is impossible.
  • Compare multiple methods for robustness: Test various imputation methods, since some are more robust than others. Where possible, prefer multiple imputation to get more accurate results.
  • Validate imputation impact on model performance: If possible, run your model with and without imputed values to see the impact of the imputation on the model and whether it’s worth doing; see the sketch after this list.
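
As a sketch of the last two points, here is one way to compare imputation strategies by their downstream effect on model performance, using scikit-learn with synthetic data. The strategies and model shown are illustrative choices, not recommendations.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with roughly 10% of values knocked out at random
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Compare imputation strategies by cross-validated model score
for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("knn", KNNImputer(n_neighbors=3))]:
    pipe = make_pipeline(imputer, RandomForestRegressor(random_state=0))
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")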

Conclusion

Choosing the right imputation library is important; some libraries offer more advanced imputation methods, while others offer integration into production ML systems. Your choice depends on your use case.

You can also combine these libraries to achieve better results. For example, you can use Missingno with Scikit-learn: Missingno handles the diagnostics of missing observations, while Scikit-learn’s diverse set of imputation algorithms handles the imputation, as sketched below.
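
Here is a minimal sketch of that combined workflow, with made-up data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import missingno as msno
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "A": [1, np.nan, 3, 4, 5],
    "B": [2, 2, np.nan, 4, np.nan]
})

# Step 1: diagnose the missingness pattern with Missingno
msno.matrix(df)
plt.show()

# Step 2: impute with scikit-learn
imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df),
    columns=df.columns
)
print(imputed)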

Looking ahead, we hope imputing missing data becomes even easier with the rise of AI-based imputation and causal inference methods.

Need Help with Data? Let’s Make It Simple.

At LearnData.xyz, we’re here to help you solve tough data challenges and make sense of your numbers. Whether you need custom data science solutions or hands-on training to upskill your team, we’ve got your back.

📧 Shoot us an email at admin@learndata.xyz—let’s chat about how we can help you make smarter decisions with your data.

Your next breakthrough could be one email away. Let’s make it happen!