Pivot Table and Counting - python

I have a data set indicating who has shopped at which stores.
ID  Store
1   C
1   A
2   A
2   B
3   A
3   B
3   C
Can I use a pivot table to determine how often shoppers who visit one store also visit another? I'm thinking of something like a 3x3 matrix where each row/column cell indicates how many people went to both stores.
Desired output:
   A  B  C
A  3  2  2
B  2  3  1
C  2  1  3

You can create a contingency table of ID and Store with pd.crosstab() and then calculate the matrix product of its transpose and itself, which should produce what you need:
import pandas as pd

mat = pd.crosstab(df.ID, df.Store)
mat.T.dot(mat)
#Store  A  B  C
#Store
#A      3  2  2
#B      2  2  1
#C      2  1  2
Note: since only two IDs visited store B and only two visited store C, the corresponding diagonal cells should be 2 instead of 3.
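To see why the matrix product works, look at the intermediate crosstab for the sample data (an illustration I've added): each row is one shopper's 0/1 visit vector, so entry (X, Y) of mat.T.dot(mat) counts the shoppers who visited both X and Y.
print(pd.crosstab(df.ID, df.Store))
#Store  A  B  C
#ID
#1      1  0  1
#2      1  1  0
#3      1  1  1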

Another, faster solution with groupby, unstack and dot:
df = df.groupby(['ID','Store']).size().unstack(fill_value=0)
df = df.T.dot(df)
print (df)
Store  A  B  C
Store
A      3  2  2
B      2  2  1
C      2  1  2
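As a small aside (my addition, not part of the original answer), on Python 3.5+ with a recent pandas the matrix product can equivalently be written with the @ operator:
df = df.T @ df  # same result as df.T.dot(df)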
Timings:
In [119]: %timeit (jez(df))
1000 loops, best of 3: 1.72 ms per loop
In [120]: %timeit (psi(df))
100 loops, best of 3: 7.07 ms per loop
Code for timings:
import numpy as np
import pandas as pd

N = 1000
df = pd.DataFrame({'ID': np.random.choice(5, N),
                   'Store': np.random.choice(list('ABCDEFGHIJK'), N)})
print(df)

def jez(df):
    df = df.groupby(['ID','Store']).size().unstack(fill_value=0)
    return df.T.dot(df)

def psi(df):
    mat = pd.crosstab(df.ID, df.Store)
    return mat.T.dot(mat)

print(jez(df))
print(psi(df))

When slicing a 1 row pandas dataframe the slice becomes a series

Why, when I slice a pandas DataFrame containing only one row, does the slice become a pandas Series?
How can I keep it a DataFrame?
import pandas as pd

df = pd.DataFrame(data=[[1,2,3]], columns=['a','b','c'])
df
Out[37]:
   a  b  c
0  1  2  3
a = df.iloc[0]
a
Out[39]:
a    1
b    2
c    3
Name: 0, dtype: int64
To avoid the intermediate step of re-converting back to a DataFrame, use double brackets when indexing:
a = df.iloc[[0]]
print(a)
   a  b  c
0  1  2  3
Speed:
%timeit df.iloc[[0]]
192 µs per loop
%timeit df.loc[0].to_frame().T
468 µs per loop
Or you can select with a boolean mask on the index:
a = df.iloc[df.index == 0]
a
Out[1782]:
   a  b  c
0  1  2  3
Use to_frame() and T to transpose:
df.loc[0].to_frame()
   0
a  1
b  2
c  3
and
df.loc[0].to_frame().T
   a  b  c
0  1  2  3
Or, option 2: use double brackets [[]]:
df.iloc[[0]]
   a  b  c
0  1  2  3
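One more note for completeness (my addition): positional slicing with a range also preserves the DataFrame type, since a slice, like a list, selects a sub-frame rather than a single row:
a = df.iloc[0:1]
print(type(a))  # <class 'pandas.core.frame.DataFrame'>
print(a)
#    a  b  c
# 0  1  2  3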

Compare two columns ( string formats) in two data frames while length of columns is not the same

The following are two data frames:
Data frame A:
index codes
1     A
2     B
3     C
4     D
Data frame B:
index cym
1     A
2     L
3     F
4     B
5     N
6     X
The length of A and B is not equal. I want to compare column "codes" (data frame A) with column "cym" (data frame B) and return the rows of B whose cym value does not appear in codes, together with the data in B's index column. The output should look like this:
index cym
2     L
3     F
5     N
6     X
I tried to solve it using the merge and equals functions, but I could not generate this output.
You can use isin:
B[~B.cym.isin(A.codes)]
#  index cym
#1     2   L
#2     3   F
#4     5   N
#5     6   X
The more verbose but faster version of @Psidom's answer:
mask = ~np.in1d(B.cym.values, A.codes.values)
pd.DataFrame(
    B.values[mask],
    B.index[mask],
    B.columns
)
  index cym
1     2   L
2     3   F
4     5   N
5     6   X
Timing
%timeit B[~B.cym.isin(A.codes)]
1000 loops, best of 3: 348 µs per loop
%%timeit
mask = ~np.in1d(B.cym.values, A.codes.values)
pd.DataFrame(
    B.values[mask],
    B.index[mask],
    B.columns
)
10000 loops, best of 3: 194 µs per loop
For the sake of completeness:
In [22]: B.query("cym not in @A.codes")
Out[22]:
  index cym
0     2   L
1     3   F
2     5   N
3     6   X
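Since the question mentions trying merge, here is a sketch of how that route could look, using merge's indicator flag (my addition, not taken from the answers above):
# merge B against just A's codes column so the two 'index' columns don't collide
out = B.merge(A[['codes']], left_on='cym', right_on='codes',
              how='left', indicator=True)
out = out.loc[out['_merge'] == 'left_only', B.columns]
print(out)
#   index cym
# 1     2   L
# 2     3   F
# 4     5   N
# 5     6   X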

Pandas: Keep rows if at least one of them contains certain value

I have the following dataframe in Pandas
letter  number
------  ------
a       2
a       0
b       1
b       5
b       2
c       1
c       0
c       2
I'd like to keep all rows for a letter if at least one of that letter's numbers is 0.
Result would be:
letter  number
------  ------
a       2
a       0
c       1
c       0
c       2
since b has no number equal to 0.
What is the best way to do this? Thanks!
You need filtration with groupby and filter:
df = df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
print(df)
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2
Another solution uses transform to count the 0 rows per group, then filters by boolean indexing:
print(df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()))
0    1
1    1
2    0
3    0
4    0
5    1
6    1
7    1
Name: number, dtype: int64
df = df[df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()) > 0]
print(df)
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2
EDIT:
It is faster to avoid groupby; loc with isin performs better:
df1 = df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
print(df1)
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2
Comparing with another solution:
In [412]: %timeit df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 815 µs per loop
In [413]: %timeit df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
1000 loops, best of 3: 657 µs per loop
You can also do this without the groupby by working out which letters to keep then using isin. I think this is a bit neater personally:
>>> letters_to_keep = df[df['number'] == 0]['letter']
>>> df_reduced = df[df['letter'].isin(letters_to_keep)]
>>> df_reduced
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2
I suspect this would be faster than doing a groupby, though that may not be relevant here! A simple timeit indicates this is the case:
>>> %%timeit
... df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
100 loops, best of 3: 2.26 ms per loop
>>> %%timeit
... df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 820 µs per loop
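One more variant for completeness (my addition, not from the answers above): build the boolean mask with groupby plus transform, which avoids the Python-level lambda. This assumes a pandas version recent enough for transform to accept the 'any' kernel:
# True for every row whose letter group contains at least one 0
mask = df['number'].eq(0).groupby(df['letter']).transform('any')
print(df[mask])
#   letter  number
# 0      a       2
# 1      a       0
# 5      c       1
# 6      c       0
# 7      c       2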

Python Pandas, aggregate multiple columns from one

I'm new to pandas and I have a DataFrame of this kind:
    name value
0  alpha     a
1   beta     b
2  gamma     c
3  alpha     a
4   beta     b
5   beta     a
6  gamma     a
7  alpha     c
which I would like to turn into one of this kind:
    name  a  b  c
0  alpha  2  0  1
1   beta  1  2  0
2  gamma  1  0  1
That is to say I would like to group by "name" and "value", then count them, and create a column for each value of "value" I find.
It is just a cross tabulation:
In [78]: print(pd.crosstab(df.name, df.value))
value  a  b  c
name
alpha  2  0  1
beta   1  2  0
gamma  1  0  1
If you use groupby:
In [90]: print(df.groupby(['name', 'value']).agg(len).unstack().fillna(0))
value  a  b  c
name
alpha  2  0  1
beta   1  2  0
gamma  1  0  1
The latter might be faster:
In [92]:
%timeit df.groupby(['name', 'value']).agg(len).unstack().fillna(0)
100 loops, best of 3: 3.26 ms per loop
In [93]:
%timeit pd.crosstab(df.name, df.value)
100 loops, best of 3: 7.5 ms per loop
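For completeness, a variant of the groupby approach I'd add (not from the original answers) uses size with unstack(fill_value=0), which keeps integer dtype and skips the fillna step; reset_index then restores "name" as a column to match the desired output:
out = df.groupby(['name', 'value']).size().unstack(fill_value=0).reset_index()
print(out)
# value   name  a  b  c
# 0      alpha  2  0  1
# 1       beta  1  2  0
# 2      gamma  1  0  1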

Split a data frame using groupby and merge the subsets into columns

I have a large pandas.DataFrame that looks something like this:
import numpy
import pandas

test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = list(range(3)) + list(range(3)) + list(range(4))
id score name
0 -0.652909 A
1 0.100885 A
2 0.410907 A
0 0.304012 B
1 -0.198157 B
2 -0.054764 B
0 0.358484 C
1 0.616415 C
2 0.389018 C
3 1.164172 C
So the index is non-unique but is unique if I group by the column name. I would like to split the data frame into subsections by name and then assemble (by means of an outer join) the score columns into one big new data frame and change the column names of the scores to the respective group key. What I have at the moment is:
df = pandas.DataFrame()
for (key, sub) in test.groupby("name"):
    df = df.join(sub["score"], how="outer")
    df.columns.values[-1] = key
this yields the expected result:
id A B C
0 -0.652909 0.304012 0.358484
1 0.100885 -0.198157 0.616415
2 0.410907 -0.054764 0.389018
3 NaN NaN 1.164172
but seems not very pandas-ic. Is there a better way?
Edit: Based on the answers I ran some simple timings.
%%timeit
df = pandas.DataFrame()
for (key, sub) in test.groupby("name"):
    df = df.join(sub["score"], how="outer")
    df.columns.values[-1] = key
100 loops, best of 3: 2.46 ms per loop
%%timeit
test.set_index([test.index, "name"]).unstack()
1000 loops, best of 3: 1.04 ms per loop
%%timeit
test.pivot_table("score", test.index, "name")
100 loops, best of 3: 2.54 ms per loop
So unstack seems the method of choice.
The function you are looking for is unstack. In order for pandas to know what to unstack, we first create a MultiIndex where we add the column as the last index level. unstack() will then unstack (by default) based on the last index level, so we get exactly what you want:
In[152]: test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = list(range(3)) + list(range(3)) + list(range(4))
In[153]: test
Out[153]:
score name
0 -0.208392 A
1 -0.103659 A
2 1.645287 A
0 0.119709 B
1 -0.047639 B
2 -0.479155 B
0 -0.415372 C
1 -1.390416 C
2 -0.384158 C
3 -1.328278 C
In[154]: test.set_index([test.index, 'name'], inplace=True)
test.unstack()
Out[154]:
score
name A B C
0 -0.208392 0.119709 -0.415372
1 -0.103659 -0.047639 -1.390416
2 1.645287 -0.479155 -0.384158
3 NaN NaN -1.328278
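A small follow-up I'll add: the result above keeps 'score' as an extra column level. Since test now carries the (index, name) MultiIndex from the set_index call above, selecting the score column before unstacking yields flat A/B/C columns directly:
test['score'].unstack()
# name         A         B         C
# 0    -0.208392  0.119709 -0.415372
# 1    -0.103659 -0.047639 -1.390416
# 2     1.645287 -0.479155 -0.384158
# 3          NaN       NaN -1.328278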
I recently came across a similar problem, which I solved by using a pivot_table:
a = """id score name
0 -0.652909 A
1 0.100885 A
2 0.410907 A
0 0.304012 B
1 -0.198157 B
2 -0.054764 B
0 0.358484 C
1 0.616415 C
2 0.389018 C
3 1.164172 C"""
from io import StringIO
df = pd.read_csv(StringIO(a), sep=r"\s+")
df = df.pivot_table('score', 'id', 'name')
print(df)
Output:
name A B C
id
0 -0.652909 0.304012 0.358484
1 0.100885 -0.198157 0.616415
2 0.410907 -0.054764 0.389018
3 NaN NaN 1.164172
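One caveat worth adding (my note, not part of the original answer): pivot_table aggregates duplicate (index, columns) pairs, using the mean by default. Each (id, name) pair here is unique, so the scores pass through unchanged, but with real duplicates you would want to set aggfunc explicitly:
# keyword form of the call above; aggfunc only matters when (id, name) pairs repeat
df.pivot_table(values='score', index='id', columns='name', aggfunc='mean')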
