A blog about scientific Python, quant finance, statistics, data analysis, and hacking of all kinds
Even easier frequency tables in pandas 0.7.0
Sun, 22 Jan 2012 14:00:02 GMT
I put in a little work on a new crosstab function in the main pandas namespace. It’s basically a convenient shortcut to calling pivot_table to make it easy to compute cross-tabulations for a set of factors using pandas DataFrame or even vanilla NumPy arrays!
Here’s an example:
In [1]: from pandas import * In [2]: df = DataFrame({'a' : np.random.randint(0, 2, size=20), ...: 'b' : np.random.randint(0, 5, size=20), ...: 'c' : np.random.randint(0, 3, size=20)}) In [3]: df Out[3]: a b c 0 0 1 2 1 1 4 1 2 1 4 2 3 1 2 2 4 0 0 0 5 0 0 2 6 1 2 2 7 1 2 0 8 1 1 1 9 1 2 0 10 1 4 1 11 1 1 2 12 0 1 2 13 1 1 0 14 1 4 1 15 1 4 0 16 0 4 1 17 0 3 1 18 1 0 1 19 0 4 1 In [4]: crosstab(df['b'], [df['a'], df['c']]) Out[4]: a 0 1 c 0 1 2 0 1 2 b 0 1 0 1 0 1 0 1 0 0 2 1 1 1 2 0 0 0 2 0 2 3 0 1 0 0 0 0 4 0 2 0 1 3 1
This makes it very easy to produce an easy-on-the-eyes frequency table. crosstab can also take NumPy arrays. Suppose we had 1 million draws from a normal distribution, and we wish to produce a histogram-like table showing the number of draws whose absolute values fall into the bins defined by [0, 1, 2, 3]. Also, let’s divide things up by sign. Using crosstab this becomes dead simple.
In [19]: arr = np.random.randn(1000000) In [20]: buckets = [0, 1, 2, 3] In [22]: crosstab(np.digitize(np.abs(arr), buckets), np.sign(arr), ....: rownames=['bucket'], colnames=['sign']) ....: Out[22]: sign -1.0 1.0 bucket 1 341678 340498 2 136104 135999 3 21424 21607 4 1334 1356
Of course since the result is a DataFrame you can customize the indexes:
In [28]: table = crosstab(np.digitize(np.abs(arr), buckets), ....: np.sign(arr), rownames=['bucket'], ....: colnames=['sign']) ....: In [29]: table.index = Index(['[0, 1)', '[1, 2)', '[2, 3)', ....: '[3, inf)'], name="bin") ....: In [30]: table.columns = Index(['negative', 'positive'], name="sign") In [31]: table Out[31]: sign negative positive bin [0, 1) 341678 340498 [1, 2) 136104 135999 [2, 3) 21424 21607 [3, inf) 1334 1356
