[Tut] Pandas qcut() – A Simple Guide with Video


In this tutorial, we learn about the Pandas function qcut(). It is a quantile-based binning function: it sorts the values into bins that generally have unequal widths but contain, as nearly as possible, the same number of samples.




Here are the parameters from the official documentation:


Parameter | Type | Description
x | 1d ndarray or Series | The input data to bin.
q | int or list of float values | Number of quantiles. Alternatively: an array of quantiles.
labels | array or False, default: None | Used as the labels for the resulting bins. Must be of the same length as the resulting bins. If False: returns only integer indicators of the bins. If True: raises an error.
retbins | bool, optional | Whether to return the bins/labels.
precision | int, optional | The precision at which to store and display the bin labels.
duplicates | {default 'raise', 'drop'}, optional | If the bin edges are not unique: raise a ValueError or drop the non-unique edges.

Returns | Type | Description
out | Categorical or Series, or array of integers if labels is set to False | The return type depends on the input: a Series of type Category if the input is a Series, else Categorical. Bins are represented as categories when categorical data is returned.
bins | ndarray of floats | Only returned if retbins is set to True.
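The “duplicates” parameter from the table is not used in the examples of this tutorial. As a minimal sketch with made-up data (the Series “values” is an assumption chosen purely for illustration), many repeated values can make several quantile edges collapse to the same number:

```python
import pandas as pd

# Hypothetical data with many repeated values (not the tutorial's data frame)
values = pd.Series([1, 1, 1, 1, 2, 3])

# With the default duplicates='raise', non-unique bin edges cause a ValueError
try:
    pd.qcut(values, q=4)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# duplicates='drop' removes the repeated edges, so fewer than four bins remain
binned = pd.qcut(values, q=4, duplicates='drop')
print(binned.cat.categories)
```

Note that dropping duplicate edges means we may end up with fewer bins than we asked for.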

Basic Example


Let’s create a data frame that we will be using throughout the tutorial:

import pandas as pd

df = pd.DataFrame({'Competitor': ['Alice', 'Mary', 'John', 'Ann', 'Bob', 'Jane', 'Tom', 'Vincent', 'Ella'],
                   'Score': [1, 6, 11, 2, 9, 16, 5, 2, 19]})
print(df)

Competitor Score
0 Alice 1
1 Mary 6
2 John 11
3 Ann 2
4 Bob 9
5 Jane 16
6 Tom 5
7 Vincent 2
8 Ella 19

We import the Pandas library and create a data frame, which we assign to the variable “df”. The printed data frame lists several competitors and the score that each competitor reached.

Now, we apply the qcut() function:

pd.qcut(x = df['Score'], q = 3)

0 (0.999, 4.0]
1 (4.0, 9.667]
2 (9.667, 19.0]
3 (0.999, 4.0]
4 (4.0, 9.667]
5 (9.667, 19.0]
6 (4.0, 9.667]
7 (0.999, 4.0]
8 (9.667, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]]

Inside the function, we pass “df['Score']” as the value for the parameter “x”, which is the column we want to bin. The second argument is “3”, which we assign to the “q” parameter. This is the number of quantiles.

The output assigns each score to an interval. There are a few things to observe here.

First, the bottom of the output lists the intervals in order (“(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]”). Each interval opens with a parenthesis and closes with a square bracket. That means that the left value is not included in the interval, but the right one is. For example, “0.999” is not included, whereas “4.0” is included. The lowest edge is shown as “0.999” rather than “1” so that the minimum score of 1 still falls inside the first interval.
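The open and closed ends can be checked directly. The following is a small sketch that rebuilds the first interval by hand with pd.Interval (the variable name “interval” is just for illustration):

```python
import pandas as pd

# Rebuild the first bin manually: open on the left, closed on the right
interval = pd.Interval(left=0.999, right=4.0, closed='right')

print(0.999 in interval)  # the left edge is excluded
print(1 in interval)      # the minimum score lies inside the interval
print(4.0 in interval)    # the right edge is included
```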

Additionally, we can see that the intervals do not have the same size. The first interval has a size of 3, the second has a size of 5.667 and the third one has a size of 9.333. Why are the intervals these particular sizes?

To answer that, we have to take a look at the number of values in each interval:

pd.qcut(x = df['Score'], q = 3).value_counts()

(0.999, 4.0] 3
(4.0, 9.667] 3
(9.667, 19.0] 3
Name: Score, dtype: int64

We use the value_counts() method to find out. We can see that each bin contains the same number of values. By assigning “3” to the “q” parameter, we ask for three intervals that each hold as many values as the others. So, the interval widths adjust to the data.
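To see where the edges 4.0 and 9.667 come from, we can compute the quantiles ourselves. This is only a sketch: it assumes that an integer “q” of 3 corresponds to the evenly spaced quantiles 0, 1/3, 2/3, and 1, and it uses Series.quantile() with its default linear interpolation. The names “scores”, “quantiles”, and “edges” are illustrative.

```python
import numpy as np
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19], name='Score')

# q=3 corresponds to the quantiles 0, 1/3, 2/3 and 1
quantiles = np.linspace(0, 1, 4)

# Series.quantile() interpolates linearly between neighbouring values by default
edges = scores.quantile(quantiles)
print(edges)
```

The printed values should match the interval edges shown above, apart from the slightly lowered first edge.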

To see more clearly which score belongs to which interval, we create a new column in the data frame:

df['Category'] = pd.qcut(x = df['Score'], q = 3)
print(df)

Competitor Score Category
0 Alice 1 (0.999, 4.0]
1 Mary 6 (4.0, 9.667]
2 John 11 (9.667, 19.0]
3 Ann 2 (0.999, 4.0]
4 Bob 9 (4.0, 9.667]
5 Jane 16 (9.667, 19.0]
6 Tom 5 (4.0, 9.667]
7 Vincent 2 (0.999, 4.0]
8 Ella 19 (9.667, 19.0]

We create a new column called “Category” which contains the intervals and we add it to the existing data frame.

The “q” parameter


In the previous example, we set the “q” parameter equal to “3”. Of course, we can also assign other values here. Apart from an integer value, we can assign this parameter a list:

pd.qcut(x = df['Score'], q = [0, .25, .5, .75, 1.])

Output:

0 (0.999, 2.0]
1 (2.0, 6.0]
2 (6.0, 11.0]
3 (0.999, 2.0]
4 (6.0, 11.0]
5 (11.0, 19.0]
6 (2.0, 6.0]
7 (0.999, 2.0]
8 (11.0, 19.0]
Name: Score, dtype: category
Categories (4, interval[float64, right]): [(0.999, 2.0] < (2.0, 6.0] < (6.0, 11.0] < (11.0, 19.0]]

This way, we directly specify what share of the values falls into each interval. For example, the first interval (0.999, 2.0] covers the lowest 25% of the score values. Since each of the intervals spans 25% of the data, we should get roughly the same number of values in each interval.

Let’s see if that’s the case:

pd.qcut(x = df['Score'], q = [0, .25, .5, .75, 1.]).value_counts()

Output:

(0.999, 2.0] 3
(2.0, 6.0] 2
(6.0, 11.0] 2
(11.0, 19.0] 2
Name: Score, dtype: int64

We use the value_counts() method again. As we can see, the first interval contains one value more than the others. That’s because we have nine scores in total, and nine is not evenly divisible by four. Consequently, the intervals cannot all contain the same number of values.

The quantiles in the list do not have to be evenly spaced:

pd.qcut(x = df['Score'], q = [0, .5, .7, .85, 1.])

Output:

0 (0.999, 6.0]
1 (0.999, 6.0]
2 (10.2, 15.0]
3 (0.999, 6.0]
4 (6.0, 10.2]
5 (15.0, 19.0]
6 (0.999, 6.0]
7 (0.999, 6.0]
8 (15.0, 19.0]
Name: Score, dtype: category
Categories (4, interval[float64, right]): [(0.999, 6.0] < (6.0, 10.2] < (10.2, 15.0] < (15.0, 19.0]]

The first interval covers 50% of the data and is therefore much larger than the others. Thus, the values are not evenly distributed across the intervals:

pd.qcut(x = df['Score'], q = [0, .5, .7, .85, 1.]).value_counts()

Output:

(0.999, 6.0] 5
(15.0, 19.0] 2
(6.0, 10.2] 1
(10.2, 15.0] 1
Name: Score, dtype: int64

As we can observe, the first interval contains the most score values.
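With so few scores, the share of values per interval only roughly follows the requested quantiles. As a small sketch (the names “scores”, “binned”, and “shares” are illustrative), value_counts(normalize=True) shows the actual fraction of scores in each interval:

```python
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19], name='Score')
binned = pd.qcut(scores, q=[0, .5, .7, .85, 1.])

# normalize=True returns fractions instead of counts;
# sort_index() lists the bins from lowest to highest
shares = binned.value_counts(normalize=True).sort_index()
print(shares.round(3))
```

We requested 50%, 20%, 15%, and 15%, but nine values can only be split in steps of one ninth, so the actual shares differ from that.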

Determine the Interval Precision


So far, the intervals we created all had the same default precision:

pd.qcut(x = df['Score'], q = 3)

Output:

0 (0.999, 4.0]
1 (4.0, 9.667]
2 (9.667, 19.0]
3 (0.999, 4.0]
4 (4.0, 9.667]
5 (9.667, 19.0]
6 (4.0, 9.667]
7 (0.999, 4.0]
8 (9.667, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]]

As we can see, the interval edges are shown with three decimal places, except for whole-number edges, which only show “.0”.

We can change that precision using the “precision” parameter. This parameter expects an integer value which determines how many decimal places we want to get.

Let’s assign “5” here to get five decimal places:

pd.qcut(x = df['Score'], q = 3, precision=5)

Output:

0 (0.99999, 4.0]
1 (4.0, 9.66667]
2 (9.66667, 19.0]
3 (0.99999, 4.0]
4 (4.0, 9.66667]
5 (9.66667, 19.0]
6 (4.0, 9.66667]
7 (0.99999, 4.0]
8 (9.66667, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.99999, 4.0] < (4.0, 9.66667] < (9.66667, 19.0]]

In this manner, we get interval edges that are stored and displayed more precisely. How much precision we need depends on the use case.
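As far as we can tell, “precision” changes how the edges are rounded in the interval labels, not which bin a score ends up in. The following sketch checks this for our data by comparing the default precision with “precision=1”. It uses the “retbins” parameter, which is covered in the next section, and the variable names are illustrative.

```python
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19], name='Score')

default_bins, default_edges = pd.qcut(scores, q=3, retbins=True)
coarse_bins, coarse_edges = pd.qcut(scores, q=3, precision=1, retbins=True)

# The interval labels are rounded to one decimal place ...
print(coarse_bins.cat.categories)

# ... but each score still lands in the same bin number
same_assignment = bool((default_bins.cat.codes == coarse_bins.cat.codes).all())
print(same_assignment)

# The returned bin edges are not rounded
print(coarse_edges)
```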

Print out the bins


If we want to print out the bins that we created, we set the “retbins” parameter to “True”:

pd.qcut(x = df['Score'], q = 3, retbins=True)

Output:

(0 (0.999, 4.0]
1 (4.0, 9.667]
2 (9.667, 19.0]
3 (0.999, 4.0]
4 (4.0, 9.667]
5 (9.667, 19.0]
6 (4.0, 9.667]
7 (0.999, 4.0]
8 (9.667, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]],
array([1., 4., 9.66666667, 19.]))

Compared to the call without the “retbins” parameter, qcut() now returns a tuple: the same binned Series as before, followed by an array that holds the bin edges (the “array” line at the bottom of the output).

This is especially useful when we assign the “q” parameter an integer, as we did here, instead of a list, because the bin edges are then computed for us and not known in advance.
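One possible use of the returned edges is to sort new values into the same bins later on. This is only a sketch: “new_scores” is hypothetical data that does not appear in the tutorial, and it uses the cut() function (compared with qcut() in the final section), which accepts explicit bin edges.

```python
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19], name='Score')

# Unpack the tuple returned by qcut() when retbins=True
binned, edges = pd.qcut(scores, q=3, retbins=True)

# Hypothetical new scores that were not part of the original data
new_scores = pd.Series([3, 10, 18])

# cut() accepts explicit bin edges; include_lowest=True keeps
# the minimum edge inside the first bin
new_bins = pd.cut(new_scores, bins=edges, labels=False, include_lowest=True)
print(new_bins.tolist())
```

New values that lie outside the original range (below 1 or above 19 here) do not fall into any bin and come back as NaN.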

Define labels for the categories


We already saw how to add a new column to our data frame to see which score belongs to which interval:

df['Category'] = pd.qcut(x = df['Score'], q = 3)
print(df)

Output:

Competitor Score Category
0 Alice 1 (0.999, 4.0]
1 Mary 6 (4.0, 9.667]
2 John 11 (9.667, 19.0]
3 Ann 2 (0.999, 4.0]
4 Bob 9 (4.0, 9.667]
5 Jane 16 (9.667, 19.0]
6 Tom 5 (4.0, 9.667]
7 Vincent 2 (0.999, 4.0]
8 Ella 19 (9.667, 19.0]

This way, we get a great overview of our data. However, assigning the intervals to the scores can be a bit confusing as we do not clearly see what a good score is and what isn’t.

This is where the “labels” parameter comes into play. We can give each interval a label to categorize our data:

df['Category'] = pd.qcut(x = df['Score'], q = 3, labels=['bad', 'good', 'exceptional'])
print(df)

Output:

Competitor Score Category
0 Alice 1 bad
1 Mary 6 good
2 John 11 exceptional
3 Ann 2 bad
4 Bob 9 good
5 Jane 16 exceptional
6 Tom 5 good
7 Vincent 2 bad
8 Ella 19 exceptional

The “labels” parameter expects a list of labels, one per interval. We choose the labels "bad", "good", and "exceptional". The labels are assigned in order, so the lowest interval gets the label "bad", the middle interval gets the label "good", and the highest interval gets the label "exceptional".

Thus, we can categorize our data in a more user-friendly way.
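According to the parameter table, “labels” also accepts “False”, which returns only integer indicators of the bins. A short sketch (the names “scores” and “codes” are illustrative):

```python
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19], name='Score')

# labels=False returns the bin number (0 = lowest bin) instead of an interval
codes = pd.qcut(scores, q=3, labels=False)
print(codes.tolist())
```

These integers can be handy when the bins are needed as numeric input for further calculations.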

Comparison with the cut() function


If you work with the qcut() function, chances are that you have come across the cut() function as well.

In this final section, we will see the difference between the qcut() and the cut() function.

Let’s refer to our initial example of the qcut() function where we assigned the “q” parameter the value “3”:

pd.qcut(x = df['Score'], q = 3)

Output:

0 (0.999, 4.0]
1 (4.0, 9.667]
2 (9.667, 19.0]
3 (0.999, 4.0]
4 (4.0, 9.667]
5 (9.667, 19.0]
6 (4.0, 9.667]
7 (0.999, 4.0]
8 (9.667, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]]

We created three quantiles in a way that each interval now contains the same amount of score values:

pd.qcut(x = df['Score'], q = 3).value_counts()

Output:

(0.999, 4.0] 3
(4.0, 9.667] 3
(9.667, 19.0] 3
Name: Score, dtype: int64

Now we do essentially the same with the cut() function:

pd.cut(x = df['Score'], bins = 3)

Output:

0 (0.982, 7.0]
1 (0.982, 7.0]
2 (7.0, 13.0]
3 (0.982, 7.0]
4 (7.0, 13.0]
5 (13.0, 19.0]
6 (0.982, 7.0]
7 (0.982, 7.0]
8 (13.0, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.982, 7.0] < (7.0, 13.0] < (13.0, 19.0]]

The cut() function does not provide a “q” parameter. Instead, it has the “bins” parameter, to which we also assign the value “3” to create three bins.

As we can see, the intervals are different from the ones produced by the qcut() function. In contrast to the qcut() intervals, these intervals all have the same size: each one is about six units long.

However, the number of values in each interval is different:

pd.cut(x = df['Score'], bins = 3).value_counts()

Output:

(0.982, 7.0] 5
(7.0, 13.0] 2
(13.0, 19.0] 2
Name: Score, dtype: int64

Thus, qcut() creates intervals that are not equally long but all contain the same number of values (or as close to that as the data allows), whereas cut() creates equal-sized intervals that don’t necessarily contain the same number of values.
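The comparison can be condensed into one small sketch that prints the bin widths and the number of scores per bin for both functions. The variable names are illustrative, and “retbins=True” is used only to get hold of the bin edges.

```python
import numpy as np
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19], name='Score')

equal_width, cut_edges = pd.cut(scores, bins=3, retbins=True)
equal_count, qcut_edges = pd.qcut(scores, q=3, retbins=True)

# Bin widths: difference between neighbouring edges
print(np.diff(cut_edges))   # cut(): (almost) identical widths
print(np.diff(qcut_edges))  # qcut(): widths follow the data

# Number of scores per bin, listed from the lowest bin to the highest
cut_counts = equal_width.value_counts().sort_index().tolist()
qcut_counts = equal_count.value_counts().sort_index().tolist()
print(cut_counts)
print(qcut_counts)
```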

Summary


In this tutorial, we learned about the qcut() function. We saw how to create intervals in several ways, how to set the precision of the intervals, how to label our categories, and how qcut() differs from the cut() function.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.

Happy Coding!



https://www.sickgaming.net/blog/2021/12/...ith-video/