A demonstration of simple uses of MultiIndex

Pandas Dataframes generally have an "index", one column of a dataset that gives the name for each row. It works like a primary key in a database table. But Pandas also supports a MultiIndex, in which the index for a row is some composite key of several columns. It's quite confusing at first, here's a simple demo of creating a multi-indexed DataFrame and then querying subsets with various syntax.

In [1]:
import pandas, io

Create an unindexed DataFrame from a CSV file

In [2]:
data = io.StringIO('''Fruit,Color,Count,Price
Apple,Red,3,$1.29
Apple,Green,9,$0.99
Pear,Red,25,$2.59
Pear,Green,26,$2.79
Lime,Green,99,$0.39
''')
df_unindexed = pandas.read_csv(data)
df_unindexed
Out[2]:
Fruit Color Count Price
0 Apple Red 3 $1.29
1 Apple Green 9 $0.99
2 Pear Red 25 $2.59
3 Pear Green 26 $2.79
4 Lime Green 99 $0.39

Add a multi-index based on two columns

Note, set_index() creates a new DataFrame

In [3]:
df = df_unindexed.set_index(['Fruit', 'Color'])
df
Out[3]:
Count Price
Fruit Color
Apple Red 3 $1.29
Green 9 $0.99
Pear Red 25 $2.59
Green 26 $2.79
Lime Green 99 $0.39

Querying the data frame with .xs()

This works pretty simply, but the resulting DataFrames no longer have the multi-index. Also .xs() is not the most powerful way to subset a DataFrame.

Find all Apples

In [4]:
df.xs('Apple')
Out[4]:
Count Price
Color
Red 3 $1.29
Green 9 $0.99

Find all red fruits

In [5]:
df.xs('Red', level='Color')
Out[5]:
Count Price
Fruit
Apple 3 $1.29
Pear 25 $2.59

Querying the data frame with .loc[] for single values

.loc[] is the preferred way to subset a DataFrame based on labels.

Find all rows with the label "Apple". Extract all columns

In [6]:
df.loc['Apple', :]        # the , : means "all columns"; we could name a subset of columns here
Out[6]:
Count Price
Color
Red 3 $1.29
Green 9 $0.99

Find all red apples, using a tuple

In [7]:
df.loc[('Apple', 'Red'), :]
Out[7]:
Count        3
Price    $1.29
Name: (Apple, Red), dtype: object

Find all apples using a tuple. This prints a warning in Pandas 0.18.1, see below

In [8]:
df.loc[('Apple', ), :]
/home/nelson/src/lol-winrate-by-tier/venv/lib/python3.4/site-packages/pandas/core/indexing.py:1294: PerformanceWarning: indexing past lexsort depth may impact performance.
  return self._getitem_tuple(key)
Out[8]:
Count Price
Color
Red 3 $1.29
Green 9 $0.99

Lexsorting

Unfortunately we've been cheating with our MultiIndex all along. Pandas really wants your DataFrame to be sorted if you are doing complicated queries with a MultiIndex. Slice queries require sorting. This is documented as the need for sortedness with MultiIndex. Sometimes this is called "lexsorting", because lexicographic sorting of the index keys is common.

If your DataFrame is not sorted, you will see several possible errors or warnings.

PerformanceWarning: indexing past lexsort depth may impact performance.
KeyError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (1)'

The simple solution is to sort the DataFrame using sortlevel(), which sorts lexicographically. If you don't like that sorting order, more control is available with sort_index(). I believe all that Pandas cares about is that your DataFrame was sorted; it tracks whether that happened with one flag per MultiIndex level. It doesn't care what order they were actually sorted in, although presumably it affects the semantics of label slicing.

Anyway, let's sort our DataFrame.

In [9]:
df.sortlevel(inplace=True)

💥 Ta-Da!💥

Note that the dataframe is now reordered, with Lime in the middle and Green before Red.

In [10]:
df
Out[10]:
Count Price
Fruit Color
Apple Green 9 $0.99
Red 3 $1.29
Lime Green 99 $0.39
Pear Green 26 $2.79
Red 25 $2.59

Now querying with a tuple won't print a warning about lexsort

In [11]:
df.loc[('Apple', ), :]
Out[11]:
Count Price
Color
Green 9 $0.99
Red 3 $1.29

Querying the data frame with .loc[] for ranges of values

We can uses slices to find ranges of things in our DataFrame. Note the full MultiIndex is preserved in the result sets, which is nice.

Slicing with index labels is a bit weird. As the docs say, "contrary to usual python slices, both the start and the stop are included!" Also the order implied by slicing is a bit ambiguous (see discussion above about Lexsorting).

Find all Apples with an explicit slice construction

In [12]:
df.loc[slice('Apple', 'Apple'), :]
Out[12]:
Count Price
Fruit Color
Apple Green 9 $0.99
Red 3 $1.29

Find everything between Limes and Pears

In [13]:
df.loc[slice('Lime', 'Pear'), :]
Out[13]:
Count Price
Fruit Color
Lime Green 99 $0.39
Pear Green 26 $2.79
Red 25 $2.59

So far we've only been querying by the first level of the index. How do we query by the second, the color?
First, let's make the "all apples" a bit more explicit, using a tuple of two slices.

In [14]:
df.loc[(slice('Apple', 'Apple'), slice(None)), :]
Out[14]:
Count Price
Fruit Color
Apple Green 9 $0.99
Red 3 $1.29

slice(None) is a wildcard of sorts. How about all red fruits?

In [15]:
df.loc[(slice(None), slice('Red', 'Red')), :]
Out[15]:
Count Price
Fruit Color
Apple Red 3 $1.29
Pear Red 25 $2.59

Typing slice() a lot is tedious, so Pandas has a (poorly documented) helper called IndexSlice that allows for some syntactic sugar. Here's all red fruits using that syntax.

In [16]:
idx = pandas.IndexSlice

df.loc[idx[:,['Red']], :]
Out[16]:
Count Price
Fruit Color
Apple Red 3 $1.29
Pear Red 25 $2.59

And finally all green Limes and Pears

In [17]:
df.loc[idx['Lime':'Pear','Green'],:]
Out[17]:
Count Price
Fruit Color
Lime Green 99 $0.39
Pear Green 26 $2.79

Slicing DataFrames can get quite complex. For more examples, see the Pandas docs