When slicing a 1-row pandas DataFrame the slice becomes a Series - python

Why, when I slice a pandas DataFrame containing only one row, does the slice become a pandas Series?
How can I keep it a DataFrame?
df=pd.DataFrame(data=[[1,2,3]],columns=['a','b','c'])
df
Out[37]:
a b c
0 1 2 3
a=df.iloc[0]
a
Out[39]:
a 1
b 2
c 3
Name: 0, dtype: int64
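
This is pandas' intended behavior: a scalar indexer reduces dimensionality, so selecting a single row returns a Series. A quick sketch to verify the returned types:

import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3]], columns=['a', 'b', 'c'])
# A scalar indexer drops a dimension: one row comes back as a Series.
print(type(df.iloc[0]))    # <class 'pandas.core.series.Series'>
# A list indexer preserves the dimension: the result stays a DataFrame.
print(type(df.iloc[[0]]))  # <class 'pandas.core.frame.DataFrame'>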

To avoid the intermediate step of re-converting back to a DataFrame, use double brackets when indexing:
a = df.iloc[[0]]
print(a)
a b c
0 1 2 3
Speed:
%timeit df.iloc[[0]]
192 µs per loop
%timeit df.loc[0].to_frame().T
468 µs per loop

Or you can select by a boolean mask on the index:
a = df.iloc[df.index == 0]
a
Out[1782]:
a b c
0 1 2 3

Use to_frame() and T to transpose:
df.loc[0].to_frame()
0
a 1
b 2
c 3
and
df.loc[0].to_frame().T
a b c
0 1 2 3
Or, option #2: use double brackets [[]]
df.iloc[[0]]
a b c
0 1 2 3
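
A plain positional slice preserves the DataFrame as well; a minimal sketch:

# Slicing (rather than scalar indexing) keeps the single row as a DataFrame.
a = df.iloc[0:1]
print(a)
#    a  b  c
# 0  1  2  3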

Related

Compare two columns (string format) in two data frames when the column lengths are not the same

The following are two data frames:
Data frame A:
index codes
1 A
2 B
3 C
4 D
Data frame B
index cym
1 A
2 L
3 F
4 B
5 N
6 X
The lengths of A and B are not equal. I want to compare column "codes" (data frame A) with column "cym" (data frame B) and return the difference between these two columns, plus the data in the index column of data frame B. The output looks like this:
index cym
2 L
3 F
5 N
6 X
I tried to solve it using the merge and equals functions, but I could not generate the desired output.
You can use isin:
B[~B.cym.isin(A.codes)]
#index cym
#1 2 L
#2 3 F
#4 5 N
#5 6 X
A more verbose but faster version of @Psidom's answer:
mask = ~np.in1d(B.cym.values, A.codes.values)
pd.DataFrame(
    B.values[mask],
    B.index[mask],
    B.columns
)
index cym
1 2 L
2 3 F
4 5 N
5 6 X
Timing
%timeit B[~B.cym.isin(A.codes)]
1000 loops, best of 3: 348 µs per loop
%%timeit
mask = ~np.in1d(B.cym.values, A.codes.values)
pd.DataFrame(
B.values[mask],
B.index[mask],
B.columns
)
10000 loops, best of 3: 194 µs per loop
For the sake of completeness:
In [22]: B.query("cym not in @A.codes")
Out[22]:
index cym
0 2 L
1 3 F
2 5 N
3 6 X
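
Since the question mentions merge: a left merge with indicator=True is a standard pandas anti-join pattern that also produces this result (a sketch, not taken from the answers above):

# Left-merge B against A; rows with no match in A are tagged 'left_only'.
merged = B.merge(A, left_on='cym', right_on='codes', how='left', indicator=True)
print(merged.loc[merged['_merge'] == 'left_only', list(B.columns)])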

Pandas: Keep rows if at least one of them contains a certain value

I have the following dataframe in Pandas
letter number
------ -------
a 2
a 0
b 1
b 5
b 2
c 1
c 0
c 2
I'd like to keep all rows of a letter if at least one of that letter's numbers is 0.
Result would be:
letter number
------ -------
a 2
a 0
c 1
c 0
c 2
since b has no number equal to 0.
What is the best way to do this? Thanks!
You need groupby filtration:
df = df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
print (df)
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
Another solution uses transform to count the 0 rows per group, then filters by boolean indexing:
print (df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()))
0 1
1 1
2 0
3 0
4 0
5 1
6 1
7 1
Name: number, dtype: int64
df = df[df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()) > 0]
print (df)
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
EDIT:
It is faster to avoid groupby; use loc with isin instead:
df1 = df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
print (df1)
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
Comparing with another solution:
In [412]: %timeit df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 815 µs per loop
In [413]: %timeit df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
1000 loops, best of 3: 657 µs per loop
You can also do this without the groupby by working out which letters to keep and then using isin. I think this is a bit neater personally:
>>> letters_to_keep = df[df['number'] == 0]['letter']
>>> df_reduced = df[df['letter'].isin(letters_to_keep)]
>>> df_reduced
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
I suspect this would be faster than doing a groupby, though that may not be relevant here! A simple timeit indicates this is the case:
>>> %%timeit
... df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
100 loops, best of 3: 2.26 ms per loop
>>> %%timeit
... df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 820 µs per loop
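
Another idiomatic variant, assuming a reasonably recent pandas (it does not appear in the answers above): build the group-wise condition with transform('any') on a boolean Series, which avoids the Python-level lambda:

# True for every row whose letter group contains at least one 0.
keep = df['number'].eq(0).groupby(df['letter']).transform('any')
print(df[keep])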

Pivot Table and Counting

I have a data set indicating who has shopped at which stores.
ID Store
1 C
1 A
2 A
2 B
3 A
3 B
3 C
Can I use a pivot table to determine the frequency of a shopper going to other stores? I'm thinking of a 3x3 matrix where the rows and columns would indicate how many people went to both stores.
Desired output
A B C
A 3 2 2
B 2 3 1
C 2 1 3
You can create a contingency table of ID and Store with pd.crosstab() and then calculate the matrix product of its transpose and itself, which produces what you need:
mat = pd.crosstab(df.ID, df.Store)
mat.T.dot(mat)
#Store A B C
#Store
# A 3 2 2
# B 2 2 1
# C 2 1 2
Note: since only two IDs visited store B and only two visited store C, the corresponding diagonal cells should be 2 instead of 3.
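To see why the matrix product counts co-visits, it helps to print the intermediate incidence matrix; a small sketch on the sample data:

mat = pd.crosstab(df.ID, df.Store)
print(mat)
# Store  A  B  C
# ID
# 1      1  0  1
# 2      1  1  0
# 3      1  1  1
# Entry (X, Y) of mat.T.dot(mat) adds 1 for every ID that visited
# both X and Y, i.e. the number of shoppers the two stores share.
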
Another faster solution with groupby, unstack and dot:
df = df.groupby(['ID','Store']).size().unstack(fill_value=0)
df = df.T.dot(df)
print (df)
Store A B C
Store
A 3 2 2
B 2 2 1
C 2 1 2
Timings:
In [119]: %timeit (jez(df))
1000 loops, best of 3: 1.72 ms per loop
In [120]: %timeit (psi(df))
100 loops, best of 3: 7.07 ms per loop
Code for timings:
N = 1000
df = pd.DataFrame({'ID': np.random.choice(5, N),
                   'Store': np.random.choice(list('ABCDEFGHIJK'), N)})
print (df)

def jez(df):
    df = df.groupby(['ID','Store']).size().unstack(fill_value=0)
    return df.T.dot(df)

def psi(df):
    mat = pd.crosstab(df.ID, df.Store)
    return mat.T.dot(mat)
print (jez(df))
print (psi(df))

Pandas: vectorized operations on maximum values per row

I have the following pandas dataframe df:
index A B C
1 1 2 3
2 9 5 4
3 7 12 8
... ... ... ...
I want the maximum value of each row to remain unchanged and all the other values to become -1. The output would thus look like this:
index A B C
1 -1 -1 3
2 9 -1 -1
3 -1 12 -1
... ... ... ...
By using df.max(axis=1), I get a pandas Series with the maximum values per row. However, I'm not sure how to use these maximums optimally to create the result I need. I'm looking for a vectorized, fast implementation.
Consider using where:
>>> df.where(df.eq(df.max(1), 0), -1)
A B C
index
1 -1 -1 3
2 9 -1 -1
3 -1 12 -1
Here df.eq(df.max(1), 0) is a boolean DataFrame marking the row maximums; True values (the maximums) are left untouched whereas False values become -1. You can also use a Series or another DataFrame instead of a scalar if you like.
The operation can also be done inplace (by passing inplace=True).
You can create a boolean mask by comparing with eq against the row-wise max, then assign -1 through the inverted mask:
print df
A B C
index
1 1 2 3
2 9 5 4
3 7 12 8
print df.max(axis=1)
index
1 3
2 9
3 12
dtype: int64
mask = df.eq(df.max(axis=1), axis=0)
print mask
A B C
index
1 False False True
2 True False False
3 False True False
df[~mask] = -1
print df
A B C
index
1 -1 -1 3
2 9 -1 -1
3 -1 12 -1
All together:
df[~df.eq(df.max(axis=1), axis=0)] = -1
print df
A B C
index
1 -1 -1 3
2 9 -1 -1
3 -1 12 -1
Create a new DataFrame the same size as df, filled with -1. Then use enumerate to find the first maximum in each row, copying it over with integer-based scalar getting/setting (iat).
df2 = pd.DataFrame(-np.ones(df.shape), columns=df.columns, index=df.index)
for row, col in enumerate(np.argmax(df.values, axis=1)):
    df2.iat[row, col] = df.iat[row, col]
>>> df2
          A     B     C
index
1      -1.0  -1.0   3.0
2       9.0  -1.0  -1.0
3      -1.0  12.0  -1.0
Timings
df = pd.DataFrame(np.random.randn(10000, 10000))
%%timeit
df2 = pd.DataFrame(-np.ones(df.shape))
for row, col in enumerate(np.argmax(df.values, axis=1)):
    df2.iat[row, col] = df.iat[row, col]
1 loops, best of 3: 1.19 s per loop
%timeit df.where(df.eq(df.max(1), 0), -1)
1 loops, best of 3: 6.27 s per loop
# Using inplace=True
%timeit df.where(df.eq(df.max(1), 0), -1, inplace=True)
1 loops, best of 3: 5.58 s per loop
%timeit df[~df.eq(df.max(axis=1), axis=0)] = -1
1 loops, best of 3: 5.65 s per loop
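
A further vectorized option (my own sketch, not from the answers above) is to build the result with np.where and wrap it back into a DataFrame, which sidesteps pandas' masking overhead:

# np.where picks values from df where the mask is True and -1 elsewhere.
mask = df.eq(df.max(axis=1), axis=0).values
df2 = pd.DataFrame(np.where(mask, df.values, -1),
                   index=df.index, columns=df.columns)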

Python Pandas: Accessing values from the second index in a multi-indexed dataframe

I am not really sure how multi-indexing works, so I may simply be trying to do the wrong thing here. If I have a dataframe with
Value
A B
1 1 5.67
1 2 6.87
1 3 7.23
2 1 8.67
2 2 9.87
2 3 10.23
If I want to access the elements where B=2, how would I do that? df.ix[2] gives me the rows where A=2. It seems df.ix[(1,2)] gets a particular value, but what is the purpose of the B index if you can't access it directly?
You can use xs:
In [11]: df.xs(2, level='B')
Out[11]:
Value
A
1 6.87
2 9.87
alternatively:
In [12]: df.xs(1, level=1)
Out[12]:
Value
A
1 5.67
2 8.67
Just as an alternative, you could use df.loc:
>>> df.loc[(slice(None),2),:]
Value
A B
1 2 6.87
2 2 9.87
The tuple accesses the indexes in order. So slice(None) grabs all values from index 'A'; the second position limits based on the second-level index, where 'B' == 2 in this example. The : specifies that you want all columns, but you could subset the columns there as well.
If you only want to return a cross-section, use xs (as mentioned by @Andy Hayden).
However, if you want to overwrite some values in the original dataframe, use pd.IndexSlice (with .loc) instead. Given a dataframe df:
In [73]: df
Out[73]:
col_1 col_2
index_1 index_2
1 1 5 6
1 5 6
2 5 6
2 2 5 6
if you want to overwrite with 0 all elements in col_1 where index_2 == 2 do:
In [75]: df.loc[pd.IndexSlice[:, 2], 'col_1'] = 0
In [76]: df
Out[76]:
col_1 col_2
index_1 index_2
1 1 5 6
1 5 6
2 0 6
2 2 0 6
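
pd.IndexSlice also makes plain selection easier to read than the raw slice(None) tuple; a small sketch on the original frame:

idx = pd.IndexSlice
# Equivalent to df.loc[(slice(None), 2), :], but more legible.
print(df.loc[idx[:, 2], :])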
