Create an account


Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
[Tut] pd.agg() – Aggregating Data in Pandas

#1
pd.agg() – Aggregating Data in Pandas

<div><div class="kk-star-ratings kksr-valign-top kksr-align-left " data-payload="{&quot;align&quot;:&quot;left&quot;,&quot;id&quot;:&quot;443671&quot;,&quot;slug&quot;:&quot;default&quot;,&quot;valign&quot;:&quot;top&quot;,&quot;reference&quot;:&quot;auto&quot;,&quot;count&quot;:&quot;1&quot;,&quot;readonly&quot;:&quot;&quot;,&quot;score&quot;:&quot;5&quot;,&quot;best&quot;:&quot;5&quot;,&quot;gap&quot;:&quot;5&quot;,&quot;greet&quot;:&quot;Rate this post&quot;,&quot;legend&quot;:&quot;5\/5 - (1 vote)&quot;,&quot;size&quot;:&quot;24&quot;,&quot;width&quot;:&quot;142.5&quot;,&quot;_legend&quot;:&quot;{score}\/{best} - ({count} {votes})&quot;}">
<div class="kksr-stars">
<div class="kksr-stars-inactive">
<div class="kksr-star" data-star="1" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" data-star="2" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" data-star="3" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" data-star="4" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" data-star="5" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
</p></div>
<div class="kksr-stars-active" style="width: 142.5px;">
<div class="kksr-star" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
</p></div>
</div>
<div class="kksr-legend"> 5/5 – (1 vote) </div>
</div>
<p>The name <em>agg </em>is short for <em>aggregate</em>. To aggregate is to summarize many observations into a single value that represents a certain aspect of the observed data. </p>
<p>The <code>.agg()</code> function can process a dataframe, a series, or a grouped dataframe. It can execute many aggregation functions, e.g. ‘<code><a rel="noreferrer noopener" href="https://blog.finxter.com/pandas-dataframe-mean-method/" data-type="post" data-id="343522" target="_blank">mean</a></code>’, ‘<code><a rel="noreferrer noopener" href="https://blog.finxter.com/pandas-dataframe-max-method/" data-type="post" data-id="343520" target="_blank">max</a></code>’,… in a single call along one of the axis. It can also execute <a href="https://blog.finxter.com/a-simple-introduction-of-the-lambda-function-in-python/" data-type="post" data-id="2701" target="_blank" rel="noreferrer noopener">lambda functions</a>. Read on for examples.    </p>
<p>We will use a dataset of FIFA players. Find the dataset <a href="https://data.world/raghav333/fifa-players" data-type="URL" data-id="https://data.world/raghav333/fifa-players" target="_blank" rel="noreferrer noopener">here</a>.</p>
<h2>Basic Setup using Jupyter Notebook</h2>
<p>Let’s start by importing pandas and loading our dataset.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
df_fifa_soccer_players = pd.read_csv('fifa_cleaned.csv')
df_fifa_soccer_players.head()
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="968" height="314" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-174.png" alt="" class="wp-image-443676" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-174.png 968w, https://blog.finxter.com/wp-content/uplo...300x97.png 300w, https://blog.finxter.com/wp-content/uplo...68x249.png 768w" sizes="(max-width: 968px) 100vw, 968px" /></figure>
</div>
<p>To increase readability, we will work with a subset of the data. Let’s create the subset by selecting the columns we want to have in our subset and create a new dataframe.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df_fifa_soccer_players_subset = df_fifa_soccer_players[['nationality', 'age', 'height_cm', 'weight_kgs', 'overall_rating', 'value_euro', 'wage_euro']]
df_fifa_soccer_players_subset.head()
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="524" height="155" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-175.png" alt="" class="wp-image-443678" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-175.png 524w, https://blog.finxter.com/wp-content/uplo...300x89.png 300w" sizes="(max-width: 524px) 100vw, 524px" /></figure>
</div>
<h2>Basic Aggregation</h2>
<p><a rel="noreferrer noopener" href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank">Pandas</a> provides a variety of built-in aggregation functions. For example, <code>pandas.DataFrame.describe</code>. When applied to a dataset, it returns a summary of statistical values. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df_fifa_soccer_players_subset.describe()</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="588" height="235" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-176.png" alt="" class="wp-image-443679" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-176.png 588w, https://blog.finxter.com/wp-content/uplo...00x120.png 300w" sizes="(max-width: 588px) 100vw, 588px" /></figure>
</div>
<p>To understand aggregation and why it is helpful, let’s have a closer look at the data returned.&nbsp;</p>
<p class="has-base-2-background-color has-background"><strong>Example</strong>: Our dataset contains records for 17954 players. The youngest player is 17 years of age and the oldest player is 46 years old. The mean age is 25 years. We learn that the tallest player is 205 cm tall and the average player’s height is around 175 cm. With a single line of code, we can answer a variety of statistical questions about our data. The <code>describe</code> function identifies numeric columns and performs the statistical aggregation for us. Describe also excluded the column <code>nationality</code> that contains string values.</p>
<p>To aggregate is to summarize many observations into a single value that represents a certain aspect of the observed data.</p>
<p>Pandas provides us with a variety of pre-built aggregate functions.</p>
<figure class="wp-block-table is-style-stripes">
<table>
<tbody>
<tr>
<td><strong>Functions</strong></td>
<td><strong>Description</strong></td>
</tr>
<tr>
<td><code><a href="https://blog.finxter.com/pandas-dataframe-mean-method/" data-type="post" data-id="343522" target="_blank" rel="noreferrer noopener">mean()</a></code></td>
<td>returns the mean of a set of values</td>
</tr>
<tr>
<td><code>sum()</code></td>
<td>returns the sum of a set of values</td>
</tr>
<tr>
<td><code><a href="https://blog.finxter.com/pandas-dataframe-count-method/" data-type="post" data-id="343443" target="_blank" rel="noreferrer noopener">count()</a></code></td>
<td>returns the count of a set of values</td>
</tr>
<tr>
<td><code><a href="https://blog.finxter.com/how-to-get-the-standard-deviation-of-a-python-list/" data-type="post" data-id="7507" target="_blank" rel="noreferrer noopener">std()</a></code></td>
<td>returns the standard deviation of a set of values</td>
</tr>
<tr>
<td><code><a href="https://blog.finxter.com/pandas-dataframe-min-method/" data-type="post" data-id="343521" target="_blank" rel="noreferrer noopener">min()</a></code></td>
<td>returns the smallest value of a set of values</td>
</tr>
<tr>
<td><code><a href="https://blog.finxter.com/pandas-dataframe-max-method/" data-type="post" data-id="343520" target="_blank" rel="noreferrer noopener">max()</a></code></td>
<td>returns the largest value of a set of values</td>
</tr>
<tr>
<td><code><a href="https://blog.finxter.com/pandas-dataframe-describe-method/" data-type="post" data-id="343482" target="_blank" rel="noreferrer noopener">describe()</a></code></td>
<td>returns a collection of statistical values of a set of values</td>
</tr>
<tr>
<td><code>size()</code></td>
<td>returns the size of a set of values</td>
</tr>
<tr>
<td><code><a href="https://blog.finxter.com/pandas-dataframe-first-method/" data-type="post" data-id="343656" target="_blank" rel="noreferrer noopener">first()</a></code></td>
<td>returns the first value of a set of values</td>
</tr>
<tr>
<td><code><a href="https://blog.finxter.com/pandas-dataframe-last-method/" data-type="post" data-id="343657" target="_blank" rel="noreferrer noopener">last()</a></code></td>
<td>returns the last value of a set of values</td>
</tr>
<tr>
<td><code>nth()</code></td>
<td>returns the nth value of a set of values</td>
</tr>
<tr>
<td><code>sem()</code></td>
<td>returns the standard error of the mean of a set of value</td>
</tr>
<tr>
<td><code><a href="https://blog.finxter.com/how-to-calculate-the-column-variance-of-a-dataframe-in-python-pandas/" data-type="post" data-id="7499" target="_blank" rel="noreferrer noopener">var()</a></code></td>
<td>returns the variance of a set of values</td>
</tr>
<tr>
<td><code>nunique()</code></td>
<td>returns the count of unique values of a set of values</td>
</tr>
</tbody>
</table>
</figure>
<p>Let’s use another function from the list above. We can be more specific and request the ‘<code>sum</code>’ for the ‘<code>value_euro</code>’ series. This column contains the market value of a player. We select the column or series ‘<code>value_euro</code>’ and execute the pre-build <code>sum()</code> function.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df_fifa_soccer_players_subset['value_euro'].sum()
# 43880780000.0</pre>
<p>Pandas returned us the requested value. Let’s get to know an even more powerful pandas method for aggregating data.</p>
<h2>The ‘pandas.DataFrame.agg’ Method</h2>
<h3>Function Syntax</h3>
<p>The <code>.agg()</code> function can take in many input types. The output type is, to a large extent, determined by the input type. We can pass in many parameters to the <code>.agg()</code> function. </p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="721" height="264" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-177.png" alt="" class="wp-image-443680" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-177.png 721w, https://blog.finxter.com/wp-content/uplo...00x110.png 300w" sizes="(max-width: 721px) 100vw, 721px" /></figure>
</div>
<p>The “<code>func</code>” parameter:</p>
<ul>
<li>is by default set to <code><strong>None</strong> </code></li>
<li>contains one or many functions that aggregate the data</li>
<li>supports pre-defined pandas aggregate functions</li>
<li>supports lambda expressions</li>
<li>supports the <code><a href="https://blog.finxter.com/pandas-apply-a-helpful-illustrated-guide/" data-type="post" data-id="18235" target="_blank" rel="noreferrer noopener">dataframe.apply()</a></code> method for specific function calls</li>
</ul>
<p>The “<code>axis</code>” parameter:</p>
<ul>
<li>is by default set to <strong>0 </strong>and applies functions to each column</li>
<li>if set to <strong>1</strong> applies functions to rows</li>
<li>can hold values:
<ul>
<li><code>0</code> or ‘<code>index</code>’</li>
<li><code>1</code> or ‘<code>columns</code>’</li>
</ul>
</li>
</ul>
<p>What about <code>*args</code> and <code>**kwargs</code>:</p>
<ul>
<li>we use these placeholders, if we do not know in advance how many arguments we will need to pass into the function</li>
<li>when arguments are of the same type, we use <code>*args</code></li>
<li>When arguments are of different types, we use <code>**kwargs</code>.</li>
</ul>
<h3>Agg method on a Series</h3>
<p>Let’s see the <code>.agg()</code> function in action. We request some of the pre-build aggregation functions for the ‘<code>wage_euro</code>’ series. We use the function parameter and provide the aggregate functions ‌we want to execute as a list. And let’s save the resulting series in a variable. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">wage_stats = df_fifa_soccer_players_subset['wage_euro'].agg(['sum', 'min', 'mean', 'std', 'max'])
print(wage_stats)
</pre>
<h6></h6>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="274" height="104" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-178.png" alt="" class="wp-image-443681"/></figure>
</div>
<p>Pandas uses scientific notation for large and small floating-point numbers. To convert the output to a familiar format, we must move the floating point to the right as shown by the plus sign. The number behind the plus sign represents the amount of steps.</p>
<p>Let’s do this together for some values.</p>
<p>The sum of all wages is 175,347,000€ (1.753470e+08)</p>
<p>The mean of the wages is 9902.135€ (9.902135e+03)</p>
<p>We executed many functions on a series input source. Thus our variable ‘<code>wage_stats</code>’ is of the type <code>Series</code> because. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">type(wage_stats)
# pandas.core.series.Series</pre>
<p>See below how to extract, for example, the ‘<code>min</code>’ value from the variable and the data type returned.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">wage_stats_min = wage_stats['min']
print(wage_stats_min)
# 1000.0 print(type(wage_stats_min))
# numpy.float64
</pre>
<p>The data type is now a scalar.</p>
<p>If we execute a single function on the same data source (series), the type returned is a scalar.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">wage_stats_max = df_fifa_soccer_players_subset['wage_euro'].agg('max')
print(wage_stats_max)
# 565000.0 print(type(wage_stats_max))
# numpy.float64
</pre>
<p>Let’s use one more example to understand the relation between the input type and the output type. </p>
<p>We will use the function “<code>nunique</code>” which will give us the count of unique nationalities. Let’s apply the function in two code examples. We will reference the series ‘<code>nationality</code>’ both times. The only difference will be the way we pass the function “<code>nunique</code>” into our <code>agg()</code> function.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">nationality_unique_series = df_fifa_soccer_players_subset['nationality'].agg({'nationality':'nunique'})
print(nationality_unique_series)
# nationality 160
# Name: nationality, dtype: int64 print(type(nationality_unique_series))
# pandas.core.series.Series
</pre>
<p>When we use a <a href="https://blog.finxter.com/python-dictionary/" data-type="post" data-id="5232" target="_blank" rel="noreferrer noopener">dictionary</a> to pass in the “<code>nunique</code>” function, the output type is a series.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">nationality_unique_int = df_fifa_soccer_players_subset['nationality'].agg('nunique')
print(nationality_unique_int)
# 160 print(type(nationality_unique_int))
# int
</pre>
<p>When we pass the “<code>nunique</code>” function directly into <code>agg()</code> the output type is an integer.</p>
<h3>Agg method on a DataFrame</h3>
<h4>Passing the aggregation functions as a Python list</h4>
<p>One column represents a series. We will now select two columns as our input and so work with a dataframe. </p>
<p>Let’s select the columns ‘<code>height_cm</code>’ and ‘<code>weight_kgs</code>’. </p>
<p>We will execute the functions <code>min()</code>, <code>mean()</code> and <code>max()</code>. To select a two-dimensional data (dataframe), we need to use double brackets. We will round the results to two decimal points. </p>
<p>Let’s store the result in a variable.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">height_weight = df_fifa_soccer_players_subset[['height_cm', 'weight_kgs']].agg(['min', 'mean', 'max']).round(2)
print(height_weight)</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="196" height="108" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-179.png" alt="" class="wp-image-443682"/></figure>
</div>
<p>We get a data frame containing rows and columns. Let’s confirm this observation by checking the type of the ‘<code>height_weight</code>’ variable.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print(type(height_weight))
# pandas.core.frame.DataFrame
</pre>
<p>We will now use our newly created dataframe named ‘<code>height_weight</code>’ to use the ‘<code>axis</code>’ parameter. The entire dataframe contains numeric values. </p>
<p>We define the functions and pass in the <code>axis</code> parameter. I used the <code>count()</code> and <code>sum()</code> functions to show the effect of the <code>axis</code> parameter. The resulting values make little sense. This is also the reason why I do not rename the headings to restore the lost column names.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">height_weight.agg(['count', 'sum'], axis=1)</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="145" height="108" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-180.png" alt="" class="wp-image-443683"/></figure>
</div>
<p>We aggregated along the rows. Returning the count of items and the sum of item values in each row.</p>
<h4>Passing the aggregation functions as a python dictionary</h4>
<p>Now let’s apply different functions to the individual sets in our dataframe. We select the sets ‘<code>overall_rating</code>’ and ‘<code>value_euro</code>’. We will apply the functions <code>std()</code>, <code>sem()</code> and <code>mean()</code> to the ‘<code>overall_rating</code>’ series, and the functions <code>min()</code> and <code>max()</code> to the ‘<code>value_euro</code>’ series.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">rating_value_euro_dict = df_fifa_soccer_players_subset[['overall_rating', 'value_euro']].agg({'overall_rating':['std', 'sem', 'mean'], 'value_euro':['min', 'max']})
print(rating_value_euro_dict)
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="218" height="156" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-181.png" alt="" class="wp-image-443684"/></figure>
</div>
<p>The dataframe contains calculated and empty (<a href="https://blog.finxter.com/check-for-nan-values-in-python/" data-type="post" data-id="273492" target="_blank" rel="noreferrer noopener">NaN</a>) values. Let’s quickly confirm the type of our output.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print(type(rating_value_euro_dict))
# pandas.core.frame.DataFrame
</pre>
<h4>Passing the aggregation functions as a Python tuple</h4>
<p>We will now repeat the previous example. </p>
<p>We will use <a href="https://blog.finxter.com/the-ultimate-guide-to-python-tuples/" data-type="post" data-id="12043" target="_blank" rel="noreferrer noopener">tuples</a> instead of a dictionary to pass in the aggregation functions. Tuple have limitations. We can only pass one aggregation function within a tuple. We also have to name each tuple. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">rating_value_euro_tuple = df_fifa_soccer_players_subset[['overall_rating', 'value_euro']].agg(overall_r_std=('overall_rating', 'std'),overall_r_sem=('overall_rating', 'sem'),overall_r_mean=('overall_rating', 'mean'),value_e_min=('value_euro', 'min'),value_e_max=('value_euro', 'max'))
print(rating_value_euro_tuple)
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="277" height="160" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-182.png" alt="" class="wp-image-443685"/></figure>
</div>
<h3>Agg method on a grouped DataFrame</h3>
<h4>Grouping by a single column</h4>
<p>The ‘<code><a rel="noreferrer noopener" href="https://blog.finxter.com/pd-dataframe-groupby-a-simple-illustrated-guide/" data-type="post" data-id="340015" target="_blank">groupby</a></code>’ method creates a grouped dataframe. We will now select the columns ‘<code>age</code>’ and ‘<code>wage_euro</code>’ and group our dataframe using the column ‘<code>age</code>’. On our grouped dataframe we will apply the <code>agg()</code> function using the functions <code>count()</code>, <code>min()</code>, <code>max()</code> and <code>mean()</code>.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">age_group_wage_euro = df_fifa_soccer_players_subset[['age', 'wage_euro']].groupby('age').aggage(['count', 'min', 'max', 'mean'])
print(age_group_wage_euro)</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="277" height="282" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-183.png" alt="" class="wp-image-443686"/></figure>
</div>
<p>Every row represents an age group. The count value shows how many players fall into the age group. The min, max and mean values aggregate the data of the age-group members.</p>
<h4>Multiindex</h4>
<p>One additional aspect of a grouped dataframe is the resulting hierarchical index. We also call it <em><strong>multiindex</strong></em>. </p>
<p>We can see that the individual columns of our grouped dataframe are at different levels. Another way to view the hierarchy is to request the columns for the particular dataset.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print(age_group_wage_euro.columns)</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="309" height="94" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-184.png" alt="" class="wp-image-443687" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-184.png 309w, https://blog.finxter.com/wp-content/uplo...300x91.png 300w" sizes="(max-width: 309px) 100vw, 309px" /></figure>
</div>
<p>Working with a multiindex is a topic for another blog post. To use the tools that we have discussed, let’s flatten the multiindex and reset the index. We need the following functions:</p>
<ul>
<li><code>droplevel()</code></li>
<li><code>reset_index()</code></li>
</ul>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">age_group_wage_euro_flat = age_group_wage_euro.droplevel(axis=1, level=0).reset_index()
print(age_group_wage_euro_flat.head())</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="294" height="158" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-185.png" alt="" class="wp-image-443688"/></figure>
</div>
<p>The resulting dataframe columns are now flat. We lost some information during the flattening process. Let’s rename the columns and return some of the lost context.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">age_group_wage_euro_flat.columns = ['age', 'athlete_count', 'min_wage_euro', 'max_wage_euro', 'mean_wage_euro']
print(age_group_wage_euro_flat.head())</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="465" height="160" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-186.png" alt="" class="wp-image-443689" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-186.png 465w, https://blog.finxter.com/wp-content/uplo...00x103.png 300w" sizes="(max-width: 465px) 100vw, 465px" /></figure>
</div>
<h4>Grouping by multiple columns</h4>
<p>Grouping by multiple columns creates even more granular subsections. </p>
<p>Let’s use ‘<code>age</code>’ as the first grouping parameter and ‘<code>nationality</code>’ as the second. We will aggregate the resulting group data using the columns ‘<code>overall_rating</code>’ and ‘<code>height_cm</code>’. We are by now familiar with the aggregation functions used in this example.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df_fifa_soccer_players_subset.groupby(['age', 'nationality']).agg({'overall_rating':['count', 'min', 'max', 'mean'], 'height_cm':['min', 'max', 'mean']})</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="470" height="403" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-187.png" alt="" class="wp-image-443690" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-187.png 470w, https://blog.finxter.com/wp-content/uplo...00x257.png 300w" sizes="(max-width: 470px) 100vw, 470px" /></figure>
</div>
<p>Every age group contains nationality groups. The aggregated athletes data is within the nationality groups.</p>
<h3>Custom aggregation functions</h3>
<p>We can write and execute custom aggregation functions to answer very specific questions.</p>
<p>Let’s have a look at the inline lambda functions.</p>
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f4a1.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <a rel="noreferrer noopener" href="https://blog.finxter.com/a-simple-introduction-of-the-lambda-function-in-python/" data-type="post" data-id="2701" target="_blank">Lambda functions</a> are so-called <em>anonymous functions</em>. They are called this way because they do not have a name. Within a lambda function, we can execute multiple expressions. We will go through several examples to see lambda functions in action.</p>
<p>In pandas lambda functions live inside the “<code>DataFrame.apply()</code>” and the “<code>Series.appy()</code>” methods. We will use the <code>DataFrame.appy()</code> method to execute functions along both axes. Let’s have a look at the basics first.</p>
<h3>Function Syntax</h3>
<p>The <code>DataFrame.apply()</code> function will execute a function along defined axes of a DataFrame. The functions that we will execute in our examples will work with Series objects passed into our custom functions by the <code>apply()</code> method. Depending on the axes that we will select, the Series will comprise out of a row or a column or our data frame.</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="721" height="371" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-188.png" alt="" class="wp-image-443691" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-188.png 721w, https://blog.finxter.com/wp-content/uplo...00x154.png 300w" sizes="(max-width: 721px) 100vw, 721px" /></figure>
</div>
<p>The “<code>func</code>” parameter:</p>
<ul>
<li>contains a function applied to a column or a row of the data frame</li>
</ul>
<p>The “<code>axis</code>” parameter:</p>
<ul>
<li>is by default set to <strong>0 </strong>and will pass a series of column data</li>
<li>if set to <strong>1</strong> will pass a series of the row data</li>
<li>can hold values:
<ul>
<li>0 or ‘<code>index</code>’</li>
<li>1 or ‘<code>columns</code>’</li>
</ul>
</li>
</ul>
<p>The “<code>raw</code>” parameter:</p>
<ul>
<li>is a boolean value</li>
<li> is by default set to<strong> <code>False</code></strong></li>
<li>can hold values:
<ul>
<li><strong><code>False</code> </strong>-> a Series object is passed to the function</li>
<li><strong><code>True</code></strong> -> a <code>ndarray</code> object is passed to the function</li>
</ul>
</li>
</ul>
<p>The “<code>result_type</code>” parameter:</p>
<ul>
<li>can only apply when the axis is 1 or ‘<code>columns</code>’</li>
<li>can hold values:
<ul>
<li>‘<code>expand</code>’</li>
<li><code>‘reduce’</code></li>
<li>‘<code>broadcast</code>’</li>
</ul>
</li>
</ul>
<p> The “<code>args()</code>” parameter:</p>
<ul>
<li>additional parameters for the function as tuple</li>
</ul>
<p>The <code>**kwargs</code> parameter:</p>
<ul>
<li>additional parameters for the function as key-value pairs</li>
</ul>
<h4>Filters</h4>
<p>Let’s have a look at filters. They will be very handy as we explore our data. </p>
<p>In this code example, we create a filter named <code>filt_rating</code>. We select our dataframe and the column <code>overall_rating</code>. The condition <code>>= 90</code> returns <code>True</code> if the value in the <code>overall_rating</code> column is 90 or above. </p>
<p>Otherwise, the filter returns <code>False</code>.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">filt_rating = df_fifa_soccer_players_subset['overall_rating'] >= 90
print(filt_rating)</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="373" height="191" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-189.png" alt="" class="wp-image-443692" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-189.png 373w, https://blog.finxter.com/wp-content/uplo...00x154.png 300w" sizes="(max-width: 373px) 100vw, 373px" /></figure>
</div>
<p>The result is a Series object containing the index, and the correlated value of <code>True</code> or <code>False</code>.</p>
<p>Let’s apply the filter to our dataframe. We call the <code><a rel="noreferrer noopener" href="https://blog.finxter.com/slicing-data-from-a-pandas-dataframe-using-loc-and-iloc/" data-type="post" data-id="230997" target="_blank">.loc</a></code> method and pass in the filter’s name as a list item. The filter works like a mask. It covers all rows that have the value <code>False</code>. The remaining rows match our filter criteria of <code>overall_rating >= 90</code>.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df_fifa_soccer_players_subset.loc[filt_rating]</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="504" height="388" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-190.png" alt="" class="wp-image-443693" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-190.png 504w, https://blog.finxter.com/wp-content/uplo...00x231.png 300w" sizes="(max-width: 504px) 100vw, 504px" /></figure>
</div>
<h4>Lambda functions</h4>
<p>Let’s recreate the same filter using a lambda function. We will call our filter <code>filt_rating_lambda</code>. </p>
<p>Let’s go over the code. We specify the name of our filter and call our dataframe. Pay attention to the double square brackets. We use them to pass a dataframe and not a Series object to the <code>.appy()</code> method.</p>
<p>Inside <code>.apply()</code> we use the keyword ‘<code>lambda</code>’ to show that we are about to define our anonymous function. The ‘<code>x</code>’ represents the Series passed into the lambda function. </p>
<p>The series contains the data from the <code>overall_rating</code> column. After the semicolumn, we use the placeholder <code>x</code> again. Now we apply a method called <code>ge()</code>. It represents the same condition we used in our first filter example “<code>>=</code>” (greater or equal). </p>
<p>We define the integer value 90 and close the brackets on our apply function. The result is a dataframe that contains an index and only one column of boolean values. To convert this dataframe to a Series we use the <code>squeeze()</code> method.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">filt_rating_lambda = df_fifa_soccer_players_subset[['overall_rating']].apply(lambda x:x.ge(90)).squeeze()
print(filt_rating_lambda)</pre>
<h6 class="has-text-align-center"><img loading="lazy" src="https://lh4.googleusercontent.com/kNUHc8X_7CwsJUknkIanlptXicNjg7SExuUtgZ-cmxrnUFkHhM3h08ts8XNOixaqdPMlP5oy5jS4Opo07loO5-mXucFLRDIHV2pVpA-JM3UixZL0LHQQYwOlGoENDVwq-Qll7NM-7RyFADQwVg" width="373" height="192"></h6>
<p>Let’s use our filter. Great, we get the same result as in our first filter example.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df_fifa_soccer_players_subset.loc[filt_rating_lambda]</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="499" height="396" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-191.png" alt="" class="wp-image-443694" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-191.png 499w, https://blog.finxter.com/wp-content/uplo...00x238.png 300w" sizes="(max-width: 499px) 100vw, 499px" /></figure>
</div>
<p>We now want to know how many players our filter returned. Let’s first do it without a lambda function and then use a lambda function to see the same result. We are counting the lines or records.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df_fifa_soccer_players_subset.loc[filt_rating_lambda].count()</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="167" height="129" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-192.png" alt="" class="wp-image-443695"/></figure>
</div>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df_fifa_soccer_players_subset.apply(lambda x:x.loc[filt_rating_lambda]).count()</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="167" height="129" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-193.png" alt="" class="wp-image-443696"/></figure>
</div>
<p>Great. Now let’s put us in a place where we actually need to use the <code>apply()</code> method and a lambda function. We want to use our filter on a grouped data-frame. </p>
<p>Let’s group by nationality to see the distribution of these amazing players. The output will contain all columns. This makes the code easier to read.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df_fifa_soccer_players_subset.groupby('nationality').loc[filt_rating_lambda]</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="877" height="225" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-194.png" alt="" class="wp-image-443697" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-194.png 877w, https://blog.finxter.com/wp-content/uplo...300x77.png 300w, https://blog.finxter.com/wp-content/uplo...68x197.png 768w" sizes="(max-width: 877px) 100vw, 877px" /></figure>
</div>
<p>Pandas tells us in this error message that we can not use the ‘<code>loc</code>’ method on a grouped dataframe object. </p>
<p>Let’s now see how we can solve this problem by using a lambda function. Instead of using the ‘<code>loc</code>’ function on the grouped dataframe we use the <code>apply()</code> function. Inside the <code>apply()</code> function we define our lambda function. Now we use the ‘<code>loc</code>’ method on the variable ‘<code>x</code>’ and pass our filter. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df_fifa_soccer_players_subset.groupby('nationality').apply(lambda x:x.loc[filt_rating_lambda])</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="563" height="412" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-195.png" alt="" class="wp-image-443698" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-195.png 563w, https://blog.finxter.com/wp-content/uplo...00x220.png 300w" sizes="(max-width: 563px) 100vw, 563px" /></figure>
</div>
<h4>Axis parameter of the apply() function</h4>
<p>Now let’s use the <code>axis</code> parameter to calculate the Body-Mass-Index (BMI) for these players. Until now we have used the lambda functions on the columns of our data. </p>
<p>The ‘<code>x</code>’ variable was a representation of the individual column. We set the axis parameter to ‘<code>1</code>’. The ‘<code>x</code>’ variable in our lambda function will now represent the individual rows of our data.</p>
<p>Before we calculate the BMI let’s create a new dataframe and define some columns. We will call our new dataframe ‘<code>df_bmi</code>’. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df_bmi = df_fifa_soccer_players_subset.groupby('nationality')[['age', 'height_cm', 'weight_kgs']].apply(lambda x:x.loc[filt_rating_lambda])
print(df_bmi)</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="270" height="409" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-196.png" alt="" class="wp-image-443699" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-196.png 270w, https://blog.finxter.com/wp-content/uplo...98x300.png 198w" sizes="(max-width: 270px) 100vw, 270px" /></figure>
</div>
<p>Now let’s reset the index.&nbsp;</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df_bmi = df_bmi.reset_index()
print(df_bmi)</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="301" height="388" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-197.png" alt="" class="wp-image-443700" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-197.png 301w, https://blog.finxter.com/wp-content/uplo...33x300.png 233w" sizes="(max-width: 301px) 100vw, 301px" /></figure>
</div>
<p>We calculate the BMI as follows. We divide the weight in kilogram by the square of the height in meters.&nbsp;</p>
<p>Let’s have a closer look at the lambda function. We define the ‘<code>axis</code>’ to be ‘<code>1</code>’. The ‘<code>x</code>’ variable now represents a row. We need to use specific values in each row. To define these values, we use the variable ‘<code>x</code>’ and specify a column name. At the beginning of our code example, we define a new column named ‘<code>bmi</code>’. And at the very end, we round the results.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df_bmi['bmi'] = df_bmi.apply(lambda x:x['weight_kgs']/((x['height_cm']/100)**2), axis=1).round()
print(df_bmi)</pre>
<h6 class="has-text-align-center"><img loading="lazy" src="https://lh4.googleusercontent.com/WSKAoXXDPrtx3_iL31-NJvRau8ssl5JWfZ0muxUNe26IBS4OqpCWtIgYEzxSegDWYjV5TS2T1iTHJ9HUKcM5wDT-vK5_PfqtVcB9iZRDLcSe41A2p8LxAcY8brA9O4V2ZzexZn-2IdooAkwcww" width="333" height="388"></h6>
<p>Great! Our custom function worked. The new BMI column contains calculated values.</p>
<h2>Conclusion</h2>
<p>Congratulations on finishing the tutorial. I wish you many great and small insights for your future data projects. I include the Jupyter-Notebook <a rel="noreferrer noopener" href="https://colab.research.google.com/drive/1kb2wPlnkYGek0M9oCPFvTO9sq_aYD-Dh?usp=sharing" data-type="URL" data-id="https://colab.research.google.com/drive/1kb2wPlnkYGek0M9oCPFvTO9sq_aYD-Dh?usp=sharing" target="_blank">file</a>, so you can experiment and tweak the code.</p>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
<h2>Nerd Humor</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="733" height="913" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-72.png" alt="" class="wp-image-410345" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-72.png 733w, https://blog.finxter.com/wp-content/uplo...41x300.png 241w" sizes="(max-width: 733px) 100vw, 733px" /><figcaption><em>Oh yeah, I didn’t even know they renamed it the Willis Tower in 2009, because I know a normal amount about skyscrapers.</em> — <a rel="noreferrer noopener" href="https://imgs.xkcd.com/comics/or_whatever_2x.png" data-type="URL" data-id="https://imgs.xkcd.com/comics/or_whatever_2x.png" target="_blank">xkcd</a> (source)</figcaption></figure>
</div>
</div>


https://www.sickgaming.net/blog/2022/06/...in-pandas/
Reply



Forum Jump:


Users browsing this thread:
2 Guest(s)

Forum software by © MyBB Theme © iAndrew 2016