Create an account


Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
[Tut] How Does Pandas Concat Work?

#1
How Does Pandas Concat Work?

<div><p>The <code>pandas.concat( )</code> function combines the data from multiple Series and/or DataFrames fast and in an intuitive manner. It is one of the most basic data wrangling operations used in Pandas. In general, we draw some conclusions from the data by analyzing it. The confidence in our conclusions increases as we include more variables or meta-data about our data. This is achieved by combining data from a variety of different data sources. The basic Pandas objects, Series, and DataFrames are created by keeping these relational operations in mind. For example, <code>pd.concat([df1, df2])</code> concatenates two DataFrames <code>df1</code>, <code>df2</code> together horizontally and results in a new DataFrame.</p>
<h1>Pandas Concat Two or More DataFrames</h1>
<p>The most important and widely used use-case of Pandas concat – <code>pd.concat( )</code> is to concatenate DataFrames.</p>
<p>For example, when you’re buying a new smartphone, often you might like to compare the specifications and price of the phones. This makes you take an informed decision. Such a comparison can be viewed below as an example from the amazon website for recent OnePlus phones.</p>
<figure class="wp-block-image size-large"><img loading="lazy" width="624" height="329" src="https://blog.finxter.com/wp-content/uploads/2020/11/image-34.png" alt="" class="wp-image-17173" srcset="https://blog.finxter.com/wp-content/uploads/2020/11/image-34.png 624w, https://blog.finxter.com/wp-content/uplo...00x158.png 300w, https://blog.finxter.com/wp-content/uplo...150x79.png 150w" sizes="(max-width: 624px) 100vw, 624px" /></figure>
<p>In the above image, the data about four different smartphones are concatenated with their features as an index.</p>
<p>Let us construct two DataFrames and combine them to see how it works.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import pandas as pd
>>> df1 = pd.DataFrame(
... {"Key": ["A", "B", "A", "C"], "C1":[1, 2, 3, 4], "C2": [10, 20, 30, 40]})
>>> df1.index = ["L1", "L2", "L3", "L4"]
>>> print(df1) Key C1 C2
L1 A 1 10
L2 B 2 20
L3 A 3 30
L4 C 4 40
>>> df2 = pd.DataFrame(
... {"Key": ["A", "B", "C", "D"], "C3": [100, 200, 300, 400]})
>>> df2.index = ["R1", "R2", "R3", "R4"]
>>> print(df2) Key C3
R1 A 100
R2 B 200
R3 C 300
R4 D 400</pre>
<p>From the official Pandas documentation of Pandas concat;</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="624" height="178" src="https://blog.finxter.com/wp-content/uploads/2020/11/image-35.png" alt="" class="wp-image-17174" srcset="https://blog.finxter.com/wp-content/uploads/2020/11/image-35.png 624w, https://blog.finxter.com/wp-content/uplo...300x86.png 300w, https://blog.finxter.com/wp-content/uplo...150x43.png 150w" sizes="(max-width: 624px) 100vw, 624px" /></figure>
</div>
<p>The two major arguments used in <code>pandas.concat( )</code> from the above image are,</p>
<ul>
<li><strong>objs</strong> – A sequence of Series and/or DataFrame objects</li>
<li><strong>axis</strong> – Axis along which <strong>objs</strong> are concatenated</li>
</ul>
<p>Out of the two arguments, <code>objs</code> remains constant. But, based on the value of the axis, the concatenation operation differs. Possible values of the axis are,</p>
<ul>
<li><code>axis = 0</code> – Concatenate or stack the DataFrames down the rows</li>
<li><code>axis = 1</code> – Concatenate or stack the DataFrames along the columns</li>
</ul>
<p>Remember this <code>axis</code> argument functionality, because it comes in many other <a href="https://blog.finxter.com/pandas-cheat-sheets/" target="_blank" rel="noreferrer noopener" title="[PDF Collection] 7 Beautiful Pandas Cheat Sheets — Post Them to Your Wall">Pandas </a>functions. Let us see them in action using the above created Dataframes.</p>
<h2>1. Row-Wise Concatenation (axis = 0 / ’index’)</h2>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> df3 = pd.concat([df1, df2], axis=0)
>>> print(df3) Key C1 C2 C3
L1 A 1.0 10.0 NaN
L2 B 2.0 20.0 NaN
L3 A 3.0 30.0 NaN
L4 C 4.0 40.0 NaN
R1 A NaN NaN 100.0
R2 B NaN NaN 200.0
R3 C NaN NaN 300.0
R4 D NaN NaN 400.0
>>> df3_dash = pd.concat([df1, df2])
>>> print(df3_dash) Key C1 C2 C3
L1 A 1.0 10.0 NaN
L2 B 2.0 20.0 NaN
L3 A 3.0 30.0 NaN
L4 C 4.0 40.0 NaN
R1 A NaN NaN 100.0
R2 B NaN NaN 200.0
R3 C NaN NaN 300.0
R4 D NaN NaN 400.0
>>> print(len(df3) == len(df1) + len(df2))
True</pre>
<p>Any number of DataFrames can be given in the first argument which has a list of DataFrames like <code>[df1, df2, df3, ..., dfn]</code>.</p>
<p>Some observations from the above results:</p>
<ul>
<li>Note the outputs of <code>df3</code> and <code>df3_dash</code> are the same. So, we need not explicitly mention the axis when we want to concatenate down the rows.</li>
<li>The number of rows in the output DataFrame = Total number of rows in all the input DataFrames.</li>
<li>The columns of the output DataFrame = Combination of distinct columns of all the input DataFrames.</li>
<li>There are unique columns present in the input DataFrames. The corresponding values at the row labels of different input DataFrames are filled with <code>NaN</code>s (<a href="https://blog.finxter.com/pandas-nan/" target="_blank" rel="noreferrer noopener" title="Pandas NaN — Working With Missing Data">Not a Number</a> – missing values) in the output DataFrame.</li>
</ul>
<p>Let’s visualize the above process in the following animation:</p>
<figure class="wp-block-video aligncenter"><video autoplay loop muted src="https://blog.finxter.com/wp-content/uploads/2020/11/Pd_Concat_Axis_0.mp4"></video></figure>
</p>
<h2>2. Column-Wise Concatenation (axis = 1 / ’columns’)</h2>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> df3 = pd.concat([df1, df2], axis=1)
>>> print(df3) Key C1 C2 Key C3
L1 A 1.0 10.0 NaN NaN
L2 B 2.0 20.0 NaN NaN
L3 A 3.0 30.0 NaN NaN
L4 C 4.0 40.0 NaN NaN
R1 NaN NaN NaN A 100.0
R2 NaN NaN NaN B 200.0
R3 NaN NaN NaN C 300.0
R4 NaN NaN NaN D 400.0
>>> print("The unique row indexes of df1 and df2:", '\n\t', df1.index.append(df2.index).unique())
The unique row indexes of df1 and df2: Index(['L1', 'L2', 'L3', 'L4', 'R1', 'R2', 'R3', 'R4'], dtype='object')
>>> print("The row indexes of df3:", "\n\t", df3.index)
The row indexes of df3: Index(['L1', 'L2', 'L3', 'L4', 'R1', 'R2', 'R3', 'R4'], dtype='object')
>>> print("The column indexes of df1 and df2:", "\n\t", df1.columns.append(df2.columns))
The column indexes of df1 and df2: Index(['Key', 'C1', 'C2', 'Key', 'C3'], dtype='object')
>>> print("The column indexes of df3:", "\n\t", df3.columns)
The column indexes of df3: Index(['Key', 'C1', 'C2', 'Key', 'C3'], dtype='object')</pre>
<p>Some observations from the above results:</p>
<ul>
<li>The DataFrames are concatenated side by side.</li>
<li>The columns in the output DataFrame = Total columns in all the input DataFrames.</li>
<li>Rows in the output DataFrame = Unique rows in all the input DataFrames.</li>
<li>There are unique rows present in all the input DataFrames. The corresponding values at the column labels of different input DataFrames are filled with <code>NaN</code>s (Not a Number – missing values) in the output DataFrame.</li>
</ul>
<p>Let’s visualize the above process in the following animation:</p>
<figure class="wp-block-video aligncenter"><video autoplay loop muted src="https://blog.finxter.com/wp-content/uploads/2020/11/Pd_Concat_Axis_1.mp4"></video></figure>
<h1>Pandas Concat Columns</h1>
<p>Please take a look at the initial OnePlus phones comparison table from the amazon website. A column in that table constitutes all the specifications of a given smartphone. Such all equivalent specifications (row labels) of all varieties (phones – column labels) are concatenated as columns to form the final comparison table.</p>
<p>So, to concatenate columns, we should have the same row indexes. In Pandas, the Series data structure is exactly designed to represent the columns and their combination forms the DataFrame data structure.</p>
<p>Let us construct two Series and concatenate them as columns to form a resultant DataFrame.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> ser1 = pd.Series([10, 20, 30, 40], name='C1')
>>> ser2 = pd.Series([100, 200, 300, 400], name='C2')
>>> print("Series 1:", "\n", ser1, "\n\n", "Series 2:", "\n", ser2)
Series 1:
0 10
1 20
2 30
3 40
Name: C1, dtype: int64 Series 2:
0 100
1 200
2 300
3 400
Name: C2, dtype: int64
>>> df = pd.concat([ser1, ser2], axis=1)
>>> print("DataFrame:", "\n", df)
DataFrame: C1 C2
0 10 100
1 20 200
2 30 300
3 40 400</pre>
<h1>Pandas Concat MultiIndex</h1>
<p>Let us consider a use-case where we have hourly weather data for 4 hours about two cities. The data that we have are only the temperature (degC) and wind speed (kmph). One way of storing their data is to store them in different DataFrames per city. It can be done the following way,</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> Date_Hourly = pd.date_range(start = '2020-11-20', periods = 4, freq = 'H')
>>> df_city1 = pd.DataFrame(
... {"temp(degC)": [27, 24, 22, 20],
... "windspeed(kmph)": [18, 17, 17, 18]},
... index = Date_Hourly
... )
>>> df_city2 = pd.DataFrame(
... {"temp(degC)": [30, 33, 33, 34],
... "windspeed(kmph)": [23, 25, 27, 30]},
... index = Date_Hourly
... )
>>> print("Weather Data of City 1:", "\n", df_city1)
Weather Data of City 1: temp(degC) windspeed(kmph)
2020-11-20 00:00:00 27 18
2020-11-20 01:00:00 24 17
2020-11-20 02:00:00 22 17
2020-11-20 03:00:00 20 18
>>> print("Weather Data of City 2:", "\n", df_city2)
Weather Data of City 2: temp(degC) windspeed(kmph)
2020-11-20 00:00:00 30 23
2020-11-20 01:00:00 33 25
2020-11-20 02:00:00 33 27
2020-11-20 03:00:00 34 30</pre>
<p>Now, we might want to collect data of two cities into one DataFrame for easier analysis. MultiIndex keys serve as identifiers to specify the source of the data. This can be achieved by MultiIndex concatenation.</p>
<p>Multi-Index Concatenation is done in two ways;</p>
<h2>1. Row-Wise Concatenation (axis = 0 / ’index’)</h2>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> df_concat_rowwise = pd.concat([df_city1, df_city2], axis=0, keys=['City1', 'City2'])
>>> print("Row-Wise Multi-Index Concatenation:", "\n", df_concat_rowwise)
Row-Wise Multi-Index Concatenation: temp(degC) windspeed(kmph)
City1 2020-11-20 00:00:00 27 18 2020-11-20 01:00:00 24 17 2020-11-20 02:00:00 22 17 2020-11-20 03:00:00 20 18
City2 2020-11-20 00:00:00 30 23 2020-11-20 01:00:00 33 25 2020-11-20 02:00:00 33 27 2020-11-20 03:00:00 34 30</pre>
<h2>2. Column-Wise Concatenation (axis = 1 / ’columns’)</h2>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> df_concat_rowwise = pd.concat([df_city1, df_city2], axis=1, keys=['City1', 'City2']) >>> print("Column-Wise Multi-Index Concatenation:", "\n", df_concat_colwise)
Column-Wise Multi-Index Concatenation: City1 City2 temp(degC) windspeed(kmph) temp(degC) windspeed(kmph)
2020-11-20 00:00:00 27 18 30 23
2020-11-20 01:00:00 24 17 33 25
2020-11-20 02:00:00 22 17 33 27
2020-11-20 03:00:00 20 18 34 30</pre>
<p>The same can be achieved for many cities. After concatenation, all of the data is in one single DataFrame. This makes us analyze the weather efficiently instead of fetching data from multiple sources.</p>
<h1>Pandas concat vs append</h1>
<p>Concatenation along the rows (<code>axis = 0</code>) is very common. If you observe the weather data scenario after each hour data gets appended in the next row. So, for that purpose, a method called <code>append( )</code> is built on top of DataFrame to append another DataFrame row-wise. This makes you achieve the same results as <code>pd.concat( )</code> with few keystrokes.</p>
<p>It can be implemented as follows,</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> df1 = pd.DataFrame({'C1': ['A', 'B', 'C', 'D']})
>>> df2 = pd.DataFrame({'C1': ['E', 'F', 'G', 'H']})
>>> print("DataFrame 1:", "\n", df1)
DataFrame 1: C1
0 A
1 B
2 C
3 D
>>> print("DataFrame 2:", "\n", df2)
DataFrame 2: C1
0 E
1 F
2 G
3 H
>>> pd.concat([df1, df2]) C1
0 A
1 B
2 C
3 D
0 E
1 F
2 G
3 H
>>> df1.append(df2) C1
0 A
1 B
2 C
3 D
0 E
1 F
2 G
3 H</pre>
<p>You can observe above the same results for <code>pd.concat([df1, df2])</code> and <code>df1.append(df2)</code>.</p>
<h1>Pandas concat slow</h1>
<p>Each and every time we do a concatenation operation, it creates a new DataFrame. DataFrame concatenation operates equivalent to an SQL join operation. So, the output DataFrame’s index is formed first by join operation. Resolving all the mismatches between indexes of input DataFrames makes it slow. In some scenarios, indexes might not be of importance. In such cases, we can ignore indexes to make the concat operation faster.</p>
<p>Ignoring the index is done by the following way,</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> df = pd.DataFrame({'C1': [10, 20, 30, 40]}, index=['R1', 'R2', 'R3', 'R4'])
>>> df C1
R1 10
R2 20
R3 30
R4 40
>>> df.reset_index(drop=True) C1
0 10
1 20
2 30
3 40</pre>
<p>Along with <code>concat</code>, all other Pandas functions are executed by utilizing only a single core in the CPU. Operations on smaller datasets run in a seamless manner. As the dataset size increases, the functions of Pandas start to throttle because they do only one operation at once.</p>
<p><a href="https://github.com/modin-project/modin">Modin</a> is the python package created to speed up the execution of Pandas functions. It distributes the computation load to all the available cores. It does so by fragmenting the DatFrame and making the function run on DataFrame fragments in other cores parallelly. Please look after this <a href="https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html#:~:text=But%20there%20is%20one%20drawback,of%20a%20difference%20in%20speed.">article</a> to know about it in detail.</p>
<p>The post <a href="https://blog.finxter.com/how-does-pandas-concat-work/" target="_blank" rel="noopener noreferrer">How Does Pandas Concat Work?</a> first appeared on <a href="https://blog.finxter.com/" target="_blank" rel="noopener noreferrer">Finxter</a>.</p>
</div>


https://www.sickgaming.net/blog/2020/11/...ncat-work/
Reply



Forum Jump:


Users browsing this thread:
1 Guest(s)

Forum software by © MyBB Theme © iAndrew 2016