point

 

 Remember me

Register  |   Lost password?

 

« »

Quant Pythonista Blog Header

A blog about scientific Python, quant finance, statistics, data analysis, and hacking of all kinds


Even easier frequency tables in pandas 0.7.0

Sun, 22 Jan 2012 14:00:02 GMT

I put in a little work on a new crosstab function in the main pandas namespace. It’s basically a convenient shortcut to calling pivot_table to make it easy to compute cross-tabulations for a set of factors using pandas DataFrame or even vanilla NumPy arrays!

Here’s an example:

In [1]: from pandas import * In [2]: df = DataFrame({'a' : np.random.randint(0, 2, size=20),    ...:                 'b' : np.random.randint(0, 5, size=20),    ...:                 'c' : np.random.randint(0, 3, size=20)}) In [3]: df Out[3]:      a  b  c 0   0  1  2 1   1  4  1 2   1  4  2 3   1  2  2 4   0  0  0 5   0  0  2 6   1  2  2 7   1  2  0 8   1  1  1 9   1  2  0 10  1  4  1 11  1  1  2 12  0  1  2 13  1  1  0 14  1  4  1 15  1  4  0 16  0  4  1 17  0  3  1 18  1  0  1 19  0  4  1 In [4]: crosstab(df['b'], [df['a'], df['c']]) Out[4]:  a  0        1       c  0  1  2  0  1  2 b                   0  1  0  1  0  1  0 1  0  0  2  1  1  1 2  0  0  0  2  0  2 3  0  1  0  0  0  0 4  0  2  0  1  3  1

This makes it very easy to produce an easy-on-the-eyes frequency table. crosstab can also take NumPy arrays. Suppose we had 1 million draws from a normal distribution, and we wish to produce a histogram-like table showing the number of draws whose absolute values fall into the bins defined by [0, 1, 2, 3]. Also, let’s divide things up by sign. Using crosstab this becomes dead simple.

In [19]: arr = np.random.randn(1000000) In [20]: buckets = [0, 1, 2, 3] In [22]: crosstab(np.digitize(np.abs(arr), buckets), np.sign(arr),                                                        ....:          rownames=['bucket'], colnames=['sign'])    ....: Out[22]:  sign    -1.0    1.0    bucket                 1       341678  340498 2       136104  135999 3       21424   21607  4       1334    1356

Of course since the result is a DataFrame you can customize the indexes:

In [28]: table = crosstab(np.digitize(np.abs(arr), buckets),    ....:                  np.sign(arr), rownames=['bucket'],     ....:                  colnames=['sign'])    ....: In [29]: table.index = Index(['[0, 1)', '[1, 2)', '[2, 3)',     ....:                      '[3, inf)'], name="bin")    ....:                   In [30]: table.columns = Index(['negative', 'positive'], name="sign") In [31]: table Out[31]:  sign      negative  positive bin                          [0, 1)    341678    340498   [1, 2)    136104    135999   [2, 3)    21424     21607    [3, inf)  1334      1356

, , , , , , , ,