Question
Why the sorting using pd.Series.sort_index does not seem to work when using categorical index? How would one sort the index of the multi-index pd.Series using other sort order than alphabetical/numerical?
MWE
Setup code
import pandas as pd import numpy as np d = { 'Card': [ 'Visa', 'Visa', 'Master Card', 'Master Card', 'Visa', 'Master Card', 'Visa', 'Visa', 'Master Card', 'Visa', 'Master Card', 'Visa', 'Visa', 'Master Card', 'Master Card', 'Visa', 'Master Card', 'Visa', 'Visa', 'Master Card', 'Visa', 'Master Card', 'Master Card', 'Master Card', 'Master Card', 'Master Card', 'Master Card', 'Visa', 'Visa' ], 'Year': [ 'Three', 'Three', 'Seven', 'Three', 'Three', 'Seven', 'Seven', 'Seven', 'Three', 'Seven', 'Three', 'Three', 'Three', 'Seven', 'Three', 'Three', 'Seven', 'Seven', 'Seven', 'Three', 'Seven', 'Three', 'Five', 'One', 'One', 'Two', 'Four', 'Six', 'Six' ], 'Value': [ 45, 13, 52, 321, 31, 1231, 876, 231, 4, 213, 123, 45, 321, 1, 123, 52, 736, 35, 900, 301, 374, 9, 294, 337, 4465, 321, 755, 22, 8 ] } df = pd.DataFrame(d) grp_cols = ['Card', 'Year'] ser_val = df.groupby(grp_cols)['Value'].mean()
Using naively just
sort_index
, the data looks like this:In [2]: ser_val.sort_index() Out[2]: Card Year Master Card Five 294.000000 Four 755.000000 One 2401.000000 Seven 505.000000 Three 146.833333 Two 321.000000 Visa Seven 438.166667 Six 15.000000 Three 84.500000 Name: Value, dtype: float64
you can see that the columns are sorted alphabetically. Now, I want to force the ordering. For that, I try:
categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven'] categories = pd.Categorical(ser_val.index.levels[1].values, categories=categories_order, ordered=True) ser_val.index.set_levels(categories, level='Year', inplace=True)
again, after sorting, the data looks like this (again, alphabetical order)
In [3]: ser_val.sort_index() Out[3]: Card Year Master Card Five 294.000000 Four 755.000000 One 2401.000000 Seven 505.000000 Three 146.833333 Two 321.000000 Visa Seven 438.166667 Six 15.000000 Three 84.500000 Name: Value, dtype: float64
I know that if I convert the data into pandas.DataFrame and sort there, it works, like this:
df_val = ser_val.reset_index().sort_values(grp_cols) df_val['Year'] = pd.Categorical(df_val['Year'].values, categories_order, ordered=True) df_val = df_val.sort_values(grp_cols).set_index(grp_cols) In [5]: df_val Out[5]: Value Card Year Master Card One 2401.000000 Two 321.000000 Three 146.833333 Four 755.000000 Five 294.000000 Seven 505.000000 Visa Three 84.500000 Six 15.000000 Seven 438.166667
Why won't the pd.Series sort with categorical data?
I am using pandas 1.0.5 in Python 3.7.3 64-bit
score:5
Actually, I think you found a bug or two!
Bug #1 - Changing dtype with set_levels using pd.Categorical doesn't work.
import pandas as pd
import numpy as np
d = {
'Card': [
'Visa', 'Visa', 'Master Card', 'Master Card', 'Visa', 'Master Card',
'Visa', 'Visa', 'Master Card', 'Visa', 'Master Card', 'Visa', 'Visa',
'Master Card', 'Master Card', 'Visa', 'Master Card', 'Visa', 'Visa',
'Master Card', 'Visa', 'Master Card', 'Master Card', 'Master Card',
'Master Card', 'Master Card', 'Master Card', 'Visa', 'Visa'
],
'Year': [
'Three', 'Three', 'Seven', 'Three', 'Three', 'Seven', 'Seven', 'Seven',
'Three', 'Seven', 'Three', 'Three', 'Three', 'Seven', 'Three', 'Three',
'Seven', 'Seven', 'Seven', 'Three', 'Seven', 'Three', 'Five', 'One',
'One', 'Two', 'Four', 'Six', 'Six'
],
'Value': [
45, 13, 52, 321, 31, 1231, 876, 231, 4, 213, 123, 45, 321, 1, 123, 52,
736, 35, 900, 301, 374, 9, 294, 337, 4465, 321, 755, 22, 8
]
}
df = pd.DataFrame(d)
grp_cols = ['Card', 'Year']
ser_val = df.groupby(grp_cols)['Value'].mean()
#---------------------------------------------
#Attempt 1st
categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categories = pd.Categorical(ser_val.index.levels[1].values,
categories=categories_order,
ordered=True)
ser_val.index.set_levels(categories, level=1, inplace=True)
print(ser_val.index.levels[1].dtype)
#--------------------------------------------
#Attempt 2nd
categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categories = pd.Categorical(ser_val.index.levels[1].values,
categories=categories_order,
ordered=True)
ser_val.index = ser_val.index.set_levels(categories, level='Year')
print(ser_val.index.levels[1].dtype)
#----------------------------------
#Attempt 3rd and success
categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categoriesDtype = pd.CategoricalDtype(categories_order, ordered=True)
ser_val.index = ser_val.index.set_levels(ser_val.index.levels[1].astype(categoriesDtype), level='Year')
print(ser_val.index.levels[1].dtype)
Output:
object *FAILED change type using inplace*
object *FAILED change type using reassignment*
category *SUCCESS change type using pd.CategoricalDtype*
Bug #2 - sort_index using categorical for MultiIndex level 1 doesn't work
This may be found already here as an open issue #24271
After a successful change of dype for index level 1:
ser_val.index.levels[1]
Output:
CategoricalIndex(['Five', 'Four', 'One', 'Seven', 'Six', 'Three', 'Two'], categories=['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven'], ordered=True, name='Year', dtype='category')
Now, let's sort the dataframe using sort_index:
ser_val.sort_index()
Output (FAIL):
Card Year
Master Card Five 294.000000
Four 755.000000
One 2401.000000
Seven 505.000000
Three 146.833333
Two 321.000000
Visa Seven 438.166667
Six 15.000000
Three 84.500000
Name: Value, dtype: float64
Now, just for kicks and testing, let's swap index levels and try sort_index again.
ser_val.swaplevel(0,1).sort_index()
Output(SUCCESS):
Year Card
One Master Card 2401.000000
Two Master Card 321.000000
Three Master Card 146.833333
Visa 84.500000
Four Master Card 755.000000
Five Master Card 294.000000
Six Visa 15.000000
Seven Master Card 505.000000
Visa 438.166667
Name: Value, dtype: float64
However, if we explicitly set the sorting levels... FAIL again.
ser_val.swaplevel(0,1).sort_index(level=[0,1])
Output:
Year Card
Five Master Card 294.000000
Four Master Card 755.000000
One Master Card 2401.000000
Seven Master Card 505.000000
Visa 438.166667
Six Visa 15.000000
Three Master Card 146.833333
Visa 84.500000
Two Master Card 321.000000
Name: Value, dtype: float64
score:6
TLDR: You need to set sort=False
in your groupby
, and you need to change your Categorical
to a CategoricalIndex
. Here's the full working example:
df = pd.DataFrame(d)
grp_cols = ['Card', 'Year']
ser_val = df.groupby(grp_cols, sort=False)['Value'].mean()
categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categories = pd.CategoricalIndex(ser_val.index.levels[1].values,
categories=categories_order,
ordered=True)
ser_val.index.set_levels(categories, level='Year', inplace=True)
ser_val.sort_index(inplace=True)
And ser_val
is now:
Card Year
Master Card One 2401.000000
Two 321.000000
Three 146.833333
Four 755.000000
Five 294.000000
Seven 505.000000
Visa Three 84.500000
Six 15.000000
Seven 438.166667
Name: Value, dtype: float64
Longer Ramblings: You ask why what you did doesn't work, which I certainly cannot explain (I'm sure that involves some rooting around in the source), but here is how I arrived at my solution.
See that the following example works, constructing a toy MultiIndex Series from scratch:
lets = ['a','b','c']*3
ids = ['MALE']*4 + ['FEMALE']*5
s = pd.Series(range(9), index=[ids,lets])
categories_order = ['b','a','c']
categories = pd.CategoricalIndex(s.index.levels[1].values,
categories=categories_order,
ordered=True)
s.index.set_levels(categories, level=1,inplace=True)
s.sort_index(inplace=True)
s
is sorted as we want:
FEMALE b 4
b 7
a 6
c 5
c 8
MALE b 1
a 0
a 3
c 2
dtype: int64
The only significant difference between your example and mine (that I could tell) is that yours starts from groupby
. There is a sort
parameter of groupby
:
sort : bool, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
So it seems like the groupby
sort is enforcing some order that is not being overridden by your new categorical order.
But still with sort=False
, your code as is didn't work. Just through googling I found that there were separate classes for Categorical
and CategoricalIndex
, and apparently the latter is what you need here. And sure enough my example will also fail if Categorical
is used instead of CategoricalIndex
.
So the groupby
thing seems like the more curious issue; again the underlying rules here I can't tell you, but maybe someone can elaborate.
Credit To: stackoverflow.com
Related Query
- Sort a multi-index Series on a particular level using Categorical Index values
- Converting a pandas multi-index series to a dataframe by using second index as columns
- pandas divide two multi index series
- Finding the intersection between two series in Pandas using index
- Multi index series into pandas dataframe
- PANDAS vlookup against series with common index using map
- How to transform a time series pandas dataframe using the index attributes?
- Pandas using loc for assignment in a Multi Index DataFrame
- Sorting a table values based on index and non-indexed columns using python
- Sorting Multi Index in Pandas differently for each level
- python pandas series loc value from multi index
- Retain DataFrame's index when using groupby apply to generate Series
- using cumsum method on multi level index in pandas
- TypeError: unhashable type: 'dict' when using pandas Multi Index
- Pandas Series replacement values using index
- Using agg with multi index in pandas
- How to indicate the multi index columns using read_sql_query (pandas dataframes)
- Filter Pandas Series using both index and value
- how to groupby a Multi Index time series data?
- Multiplying Multi Index Pandas Series and DataFrame
- using rolling functions, in pandas, with a series where the time index is very sparse
- How to have multi index on both rows and columns of a dataframe without using tuples?
- Slicing Multi Index Header DataFrames in Python With Custom Sorting
- Sort multi index dataframe using a reference list
- Reordering heatmap rows in Python using a custom categorical index order
- Capturing DataFrame values using row index and columns names in Series
- Map a Pandas Series of strings using index position in another array
- Select subset of dataframe using condition on multi index
- Sorting Multi level dataframe index level0, based on value of index level1 at specific column
- Accessing elements of series object without using index python, accessing values of correlation or other matrices
More Query from same tag
- Given an index label, how would you extract the index position in a dataframe?
- IndexError: indices are out-of-bounds when accessing pandas.DataFrame
- How to combine two histograms python
- return the index using pandas series.sample()?
- Convert Timezone in Dataframe with pandas in python
- pandas.read_csv not partitioning data at semicolon delimiter
- Build pandas dataframe in for loop
- Avoid pandas change plot colors when using **kwds
- Calculating Exponential Moving Average using pandas
- Pandas read_csv get rid of enclosing double quotes
- Pandas - Using a list of values to create a smaller frame
- How to identify unique ID's that have only 1 true condition?
- Taking next x values for each item in pandas column and making them into adjacent row
- Add a column to a dataframe from another dataframe based on conditions in both dataframes
- Python Pandas: Merge or Filter DataFrame by Another. Is there a Better Way?