• 0 Vote(s) - 0 Average
• 1
• 2
• 3
• 4
• 5
[Tut] Pandas NaN — Working With Missing Data

Pandas NaN — Working With Missing Data

Pandas is Excel on steroids—the powerful Python library allows you to analyze structured and tabular data with surprising efficiency and ease. Pandas is one of the reasons why master coders reach 100x the efficiency of average coders. In today’s article, you’ll learn how to work with missing data—in particular, how to handle NaN values in Pandas DataFrames.

You’ll learn about all the different reasons why NaNs appear in your DataFrames—and how to handle them. Let’s get started!

## Checking Series for NaN Values

Problem: How to check a series for NaN values?

Have a look at the following code:

```import pandas as pd
import numpy as np data = pd.Series([0, np.NaN, 2])
result = data.hasnans print(result)
# True```

Series can contain `NaN`-values—an abbreviation for Not-A-Number—that describe undefined values.

To check if a Series contains one or more `NaN` value, use the attribute `hasnans`. The attribute returns `True` if there is at least one `NaN` value and `False` otherwise.

There’s a `NaN` value in the Series, so the output is `True`.

## Filtering Series Generates NaN

Problem: When filtering a Series with `where()` and no element passes the filtering condition, what’s the result?

```import pandas as pd xs = pd.Series([5, 1, 4, 2, 3])
xs.where(xs > 2, inplace=True)
result = xs.hasnans print(result)
# True```

The method `where()` filters a Series by a condition. Only the elements that satisfy the condition remain in the resulting Series. And what happens if a value doesn’t satisfy the condition? Per default, all rows not satisfying the condition are filled with `NaN`-values.

This is why our Series contains `NaN`-values after filtering it with the method `where()`.

## Working with Multiple Series of Different Lengths

Problem: If you element-wise add two Series objects with a different number of elements—what happens with the remaining elements?

```import pandas as pd s = pd.Series(range(0, 10))
t = pd.Series(range(0, 20))
result = (s + t)[1] print(result)
# 2```

To add two Series element-wise, use the default addition operator `+`. The Series do not need to have the same size because once the first Series ends, the subsequent element-wise results are `NaN` values.

At index `1` in the resulting Series, you get the result of `1 + 1 = 2`.

## Create a DataFrame From a List of Dictionaries with Unequal Keys

Problem: How to create a DataFrame from a list of dictionaries if the dictionaries have unequal keys? A DataFrame expects the same columns to be available for each row!

```import pandas as pd data = [{'Car':'Mercedes', 'Driver':'Hamilton, Lewis'}, {'Car':'Ferrari', 'Driver':'Schumacher, Michael'}, {'Car':'Lamborghini'}] df = pd.DataFrame(data, index=['Rank 2', 'Rank 1', 'Rank 3'])
df.sort_index(inplace=True)
result = df['Car'].iloc[0] print(result)
# Ferrari```

You can create a DataFrame from a list of dictionaries. The dictionaries’ keys define the column labels, and the values define the columns’ entries. Not all dictionaries must contain the same keys. If a dictionary doesn’t contain a particular key, this will be interpreted as a `NaN`-value.

This code snippet uses string labels as index values to sort the DataFrame. After sorting the DataFrame, the row with index label `Rank 1` is at location `0` in the DataFrame and the value in the column `Car` is `Ferrari`.

## Sorting a DataFrame by Column with NaN Values

Problem: What happens if you sort a DataFrame by column if the column contains a `NaN` value?

```import pandas as pd df = pd.read_csv("Cars.csv") # Dataframe "df"
# ----------
# make fuel aspiration body-style price engine-size
# 0 audi gas turbo sedan 30000 2.0
# 1 dodge gas std sedan 17000 1.8
# 2 mazda diesel std sedan 17000 NaN
# 3 porsche gas turbo convertible 120000 6.0
# 4 volvo diesel std sedan 25000 2.0
# ---------- selection = df.sort_values(by="engine-size")
result = selection.index.to_list()[0]
print(result)
# 1```

In this code snippet, you sort the rows of the DataFrame by the values of the column `engine-size`.

The main point is that `NaN` values are always moved to the end in Pandas sorting. Thus, the first value is `1.8`, which belongs to the row with index value `1`.

## Count Non-NaN Values

Problem: How to count the number of elements in a dataframe column that are not `Nan`?

```import pandas as pd df = pd.read_csv("Cars.csv") # Dataframe "df"
# ----------
# make fuel aspiration body-style price engine-size
# 0 audi gas turbo sedan 30000 2.0
# 1 dodge gas std sedan 17000 1.8
# 2 mazda diesel std sedan 17000 NaN
# 3 porsche gas turbo convertible 120000 6.0
# 4 volvo diesel std sedan 25000 2.0
# ---------- df.count()[5]
print(result)
# 4```

The method `count()` returns the number of non-`NaN` values for each column. The DataFrame `df` has five rows. The fifth column
contains one `NaN` value. Therefore, the count of the fifth column is `4`.

## Drop NaN-Values

Problem: How to drop all rows that contain a `NaN` value in any of its columns—and how to restrict this to certain columns?

```import pandas as pd df = pd.read_csv("Cars.csv") # Dataframe "df"
# ----------
# make fuel aspiration body-style price engine-size
# 0 audi gas turbo sedan 30000 2.0
# 1 dodge gas std sedan 17000 1.8
# 2 mazda diesel std sedan 17000 NaN
# 3 porsche gas turbo convertible 120000 6.0
# 4 volvo diesel std sedan 25000 2.0
# ---------- selection1 = df.dropna(subset=["price"])
selection2 = df.dropna()
print(len(selection1), len(selection2))
# 5 4```

The DataFrame’s `dropna()` method drops all rows that contain a `NaN` value in any of its columns. But how to restrict the columns to be scanned for `NaN` values?

By passing a list of column labels to the optional parameter `subset`, you can define which columns you want to consider.

The call of `dropna()` without restriction, drops line `2` because of the `NaN` value in the column `engine-size`. When you restrict the columns only to `price`, no rows will be dropped, because no `NaN` value is present.

## Drop Nan and Reset Index

Problem: What happens to indices after dropping certain rows?

```import pandas as pd df = pd.read_csv("Cars.csv") # Dataframe "df"
# ----------
# make fuel aspiration body-style price engine-size
# 0 audi gas turbo sedan 30000 2.0
# 1 dodge gas std sedan 17000 1.8
# 2 mazda diesel std sedan 17000 NaN
# 3 porsche gas turbo convertible 120000 6.0
# 4 volvo diesel std sedan 25000 2.0
# ---------- df.drop([0, 1, 2], inplace=True)
df.reset_index(inplace=True)
result = df.index.to_list()
print(result)
# [0, 1]```

The method `drop()` on a DataFrame deletes rows or columns by index. You can either pass a single value or a list of values.

By default the `inplace` parameter is set to `False`, so that modifications won’t affect the initial DataFrame object. Instead, the method returns a modified copy of the DataFrame. In the puzzle, you set `inplace` to `True`, so the deletions are performed directly on the DataFrame.

After deleting the first three rows, the first two index labels are 3 and 4. You can reset the default indexing by calling the method `reset_index()` on the DataFrame, so that the index starts at 0 again. As there are only two rows left in the DataFrame, the result is `[0, 1]`.

## Concatenation of Dissimilar DataFrames Filled With NaN

Problem: How to concatenate two DataFrames if they have different columns?

```import pandas as pd df = pd.read_csv("Cars.csv")
df2 = pd.read_csv("Cars2.csv") # Dataframe "df"
# ----------
# make fuel aspiration body-style price engine-size
# 0 audi gas turbo sedan 30000 2.0
# 1 dodge gas std sedan 17000 1.8
# 2 mazda diesel std sedan 17000 NaN
# 3 porsche gas turbo convertible 120000 6.0
# 4 volvo diesel std sedan 25000 2.0
# ---------- # Additional Dataframe "df2"
# ----------
# make origin
# 0 skoda Czechia
# 1 toyota Japan
# 2 ford USA
# ---------- try: result = pd.concat([df, df2], axis=0, ignore_index=True) print("Y")
except Exception: print ("N") # Y```

Even if DataFrames have different columns, you can concatenate them.

If DataFrame 1 has columns A and B and DataFrame 2 has columns C and D, the result of concatenating DataFrames 1 and 2 is a DataFrame with columns A, B, C, and D. Missing values in the rows are filled with `NaN`.

## Outer Merge

Problem: When merging (=joining) two DataFrames—what happens if there are missing values?

```import pandas as pd df = pd.read_csv("Cars.csv")
df2 = pd.read_csv("Cars2.csv") # Dataframe "df"
# ----------
# make fuel aspiration body-style price engine-size
# 0 audi gas turbo sedan 30000 2.0
# 1 dodge gas std sedan 17000 1.8
# 2 mazda diesel std sedan 17000 NaN
# 3 porsche gas turbo convertible 120000 6.0
# 4 volvo diesel std sedan 25000 2.0
# ---------- # Additional dataframe "df2"
# ----------
# make origin
# 0 skoda Czechia
# 1 mazda Japan
# 2 ford USA
# ---------- result = pd.merge(df, df2, how="outer", left_on="make", right_on="make")
print(len(result["fuel"]))
print(result["fuel"].count())
# 7
# 5```

With Panda’s function `merge()` and the parameter `how` set to `outer`, you can perform an outer join.

The resulting DataFrame of an outer join contains all values from both input DataFrames; missing values are filled with `NaN`.

In addition, this puzzle shows how `NaN` values are counted by the `len()` function whereas the method `count()` does not include `NaN` values.

## Replacing NaN

Problem: How to Replace all `NaN` values in a DataFrame with a given value?

```import pandas as pd df = pd.read_csv("Cars.csv") # Dataframe "df"
# ----------
# make fuel aspiration body-style price engine-size
# 0 audi gas turbo sedan 30000 2.0
# 1 dodge gas std sedan 17000 1.8
# 2 mazda diesel std sedan 17000 NaN
# 3 porsche gas turbo convertible 120000 6.0
# 4 volvo diesel std sedan 25000 2.0
# ---------- df.fillna(2.0, inplace=True)
result = df["engine-size"].sum()
print(result)
# 13.8```

The method `fillna()` replaces `NaN` values with a new value. Thus, the sum of all values in the column `engine-size` is 13.8.

## Length vs. Count Difference — It’s NaN!

Problem: What’s the difference between the `len()` and the `count()` functions?

```import pandas as pd df = pd.read_csv("Cars.csv")
df2 = pd.read_csv("Cars2.csv") # Dataframe "df"
# ----------
# make fuel aspiration body-style price engine-size
# 0 audi gas turbo sedan 30000 2.0
# 1 dodge gas std sedan 17000 1.8
# 2 mazda diesel std sedan 17000 NaN
# 3 porsche gas turbo convertible 120000 6.0
# 4 volvo diesel std sedan 25000 2.0
# ---------- # Additional dataframe "df2"
# ----------
# make origin
# 0 skoda Czechia
# 1 mazda Japan
# 2 ford USA
# ---------- result = pd.merge(df2, df, how="left", left_on="make", right_on="make")
print(len(result["fuel"]))
print(result["fuel"].count())
# 3
# 1```

In a left join, the left DataFrame is the master, and all its values are included in the resulting DataFrame.

Therefore, the result DataFrame contains three rows, yet, since `skoda` and `ford` don’t appear in DataFrame `df`, only one the row for `mazda` contains value.

Again, we see the difference between using the function `len()` which also includes `NaN` values and the method `count()` which does not count `NaN` values.

## Equals() vs. == When Comparing NaN

Problem:

```import pandas as pd df = pd.read_csv("Cars.csv") # Dataframe "df"
# ----------
# make fuel aspiration body-style price engine-size
# 0 audi gas turbo sedan 30000 2.0
# 1 dodge gas std sedan 17000 1.8
# 2 mazda diesel std sedan 17000 NaN
# 3 porsche gas turbo convertible 120000 6.0
# 4 volvo diesel std sedan 25000 2.0
# ---------- df["engine-size_copy"] = df["engine-size"]
check1 = (df["engine-size_copy"] == df["engine-size"]).all()
check2 = df["engine-size_copy"].equals(df["engine-size"])
print(check1 == check2)
# False```

This code snippet shows how to compare columns or entire DataFrames regarding the shape and the elements.

The comparison using the operator `==` returns `False` for our DataFrame because the comparing `NaN`-values with `==` always yields `False`.

On the other hand, `df.equals()` allows comparing two Series or DataFrames. In this case, `NaN`-values in the same location are considered to be equal.

The column headers do not need to have the same type, but the elements within the columns must be of the same `dtype`.

Since the result of `check1` is `False` and the result of `check2` yields `True`, the final output is `False`.

## Where to Go From Here?

Enough theory, let’s get some practice!

To become successful in coding, you need to get out there and solve real problems for real people. That’s how you can become a six-figure earner easily. And that’s how you polish the skills you really need in practice. After all, what’s the use of learning theory that nobody ever needs?

Practice projects is how you sharpen your saw in coding!

Do you want to become a code master by focusing on practical code projects that actually earn you money and solve problems for people?

Then become a Python freelance developer! It’s the best way of approaching the task of improving your Python skills—even if you are a complete beginner.

Join my free webinar “How to Build Your High-Income Skill Python” and watch how I grew my coding business online and how you can, too—from the comfort of your own home.

Join the free webinar now!

The post Pandas NaN — Working With Missing Data first appeared on Finxter.

https://www.sickgaming.net/blog/2020/11/...sing-data/

 Possibly Related Threads… Thread Author Replies Views Last Post [Tut] How Does Pandas Concat Work? xSicKxBot 0 8 11-23-2020, 09:03 AM Last Post: xSicKxBot [Tut] [Ultimate Guide] Freelancing as a Data Scientist xSicKxBot 0 26 11-15-2020, 02:12 AM Last Post: xSicKxBot [Tut] How to Create a DataFrame in Pandas? xSicKxBot 0 30 11-14-2020, 01:31 AM Last Post: xSicKxBot [Tut] How Much Can You Earn as a Data Science Freelancer? xSicKxBot 0 32 11-10-2020, 08:37 PM Last Post: xSicKxBot [Tut] How to Fix “ImportError: No module named pandas” [Mac/Linux/Windows/PyCharm] xSicKxBot 0 27 11-08-2020, 04:36 AM Last Post: xSicKxBot [Tut] A Simple Recommendation System With Pandas xSicKxBot 0 23 10-31-2020, 06:58 AM Last Post: xSicKxBot [Tut] Tilde Python Pandas DataFrame xSicKxBot 0 155 05-17-2020, 06:40 AM Last Post: xSicKxBot [Tut] Pandas to_csv() xSicKxBot 0 194 04-30-2020, 03:56 AM Last Post: xSicKxBot [Tut] [PDF Collection] 7 Beautiful Pandas Cheat Sheets — Post Them to Your Wall xSicKxBot 0 251 04-29-2020, 03:23 AM Last Post: xSicKxBot [Tut] How to Convert List of Lists to a Pandas Dataframe xSicKxBot 0 176 04-28-2020, 02:37 PM Last Post: xSicKxBot

Forum Jump:

 Active Threads News - Nintendo Shares Tips On How To Ke... Last Post: xSicKxBot Today 02:43 PM » Replies: 0 » Views: 1 News - Random: Amateur Dev Releases Reim... Last Post: xSicKxBot Today 02:43 PM » Replies: 0 » Views: 1 [Tut] Python abs() Function Last Post: xSicKxBot Today 09:45 AM » Replies: 0 » Views: 4 News - It Looks Like Kadabra Can Finally... Last Post: xSicKxBot Today 07:46 AM » Replies: 0 » Views: 7 News - Bungie Cancels Destiny 2's Trials... Last Post: xSicKxBot Today 07:46 AM » Replies: 0 » Views: 5 (Indie Deal) ?U+Me Black Friday Adult Bu... Last Post: xSicKxBot Today 06:42 AM » Replies: 0 » Views: 4 [1.06] BO4 Zombie Trainer (Call of duty ... Last Post: theolikeappels Today 01:00 AM » Replies: 31 » Views: 1188 Blender 2.91 Released Last Post: xSicKxBot Today 12:49 AM » Replies: 0 » Views: 4 News - Feature: What’s The Best Way To P... Last Post: xSicKxBot Today 12:48 AM » Replies: 0 » Views: 4 News - How Cyberpunk 2077 Cross-Saves Wo... Last Post: xSicKxBot Today 12:48 AM » Replies: 0 » Views: 5