How do I output a pandas groupby result -- including zero cross-terms -- to a CSV file?
A toy example of exactly what I'm looking for:
I have a pandas dataframe that can be approximated as:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice(['A', 'B', 'C'], (10, 2)),
                  columns=['one', 'two'])
Which gave me the following:
one two
0 C C
1 C A
2 A B
3 B A
4 B C
5 B B
6 C C
7 A C
8 C B
9 C C
When I run groupby it works as expected:
grouped = df.groupby(['one', 'two']).size()
grouped
one two
A B 1
C 1
B A 1
B 1
C 1
C A 1
B 1
C 3
dtype: int64
However, I would like the "A A 0" term to be included as well, because I write this to a CSV file:
grouped.to_csv("test1.csv", header=True)
!cat test1.csv
one,two,0
A,B,1
A,C,1
B,A,1
B,B,1
B,C,1
C,A,1
C,B,1
C,C,3
And I want the file to include the line: A,A,0.
You can do this with unstack, which exposes the missing cross-terms as NaN; fill those with 0, cast back to int (fillna leaves the counts as floats), and stack again:
grouped.unstack('two').fillna(0).astype(int).stack()
which, for the data above, gives:
one two
A A 0
B 1
C 1
B A 1
B 1
C 1
C A 1
B 1
C 3
dtype: int64
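Putting it together end-to-end, here is a minimal reproducible sketch that hard-codes the frame shown above (instead of np.random.choice, so the counts match) and writes the CSV including the A,A,0 row:

```python
import pandas as pd

# the exact frame from the question, hard-coded for reproducibility
df = pd.DataFrame({'one': list('CCABBBCACC'),
                   'two': list('CABACBCCBC')})

grouped = df.groupby(['one', 'two']).size()

# unstack exposes the missing (A, A) cell as NaN; fill it, restore the
# integer dtype that fillna converted to float, and stack back
full = grouped.unstack('two').fillna(0).astype(int).stack()

full.to_csv("test1.csv", header=True)
```

The first data line of test1.csv is then A,A,0.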
Related
I have a data frame and an array as follows:
df = pd.DataFrame({'x': range(0,5), 'y' : range(1,6)})
s = np.array(['a', 'b', 'c'])
I would like to attach the array to every row of the data frame, so that each row gains columns a, b, and c holding the array's values.
What would be the most efficient way to do this?
Just plain assignment:
# replace the first `s` with your desired column names
df[s] = [s]*len(df)
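Run against the question's data, the plain assignment gives the expected frame (a quick end-to-end sketch; nothing here is specific to the string values in s):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(0, 5), 'y': range(1, 6)})
s = np.array(['a', 'b', 'c'])

# one new column per element of `s`, each row filled with the whole array
df[s] = [s] * len(df)
print(df)
```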
Try this:
for i in s:
df[i] = i
Output:
x y a b c
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
You could use pandas.concat:
pd.concat([df, pd.DataFrame(s).T], axis=1).ffill()
Output:
x y 0 1 2
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
You can try using df.loc here.
df.loc[:, s] = s
print(df)
x y a b c
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
It has been a long time since I last worked with the pandas library. I searched but could not come up with an efficient way; perhaps a function for this already exists in the library.
Let's say I have the dataframe below:
df1 = pd.DataFrame({'V1':['A','A','B'],
'V2':['B','C','C'],
'Value':[4, 1, 5]})
df1
And I would like to extend this dataset and populate all the combinations of categories and put its corresponding value as exactly the same.
df2 = pd.DataFrame({'V1':['A','B','A', 'C', 'B', 'C'],
'V2':['B','A','C','A','C','B'],
'Value':[4, 4 , 1, 1, 5, 5]})
df2
In other words, in df1, A and B have a Value of 4, and I also want the second dataframe to contain a row where B and A have a Value of 4. It is very similar to melting. I also do not want to use a for loop; I am looking for a more efficient way.
Use:
df = pd.concat([df1, df1.rename(columns={'V2':'V1', 'V1':'V2'})]).sort_index().reset_index(drop=True)
Output:
V1 V2 Value
0 A B 4
1 B A 4
2 A C 1
3 C A 1
4 B C 5
5 C B 5
Or np.vstack:
>>> pd.DataFrame(np.vstack((df1.to_numpy(), df1.iloc[:, np.r_[1:-1:-1, -1]].to_numpy())), columns=df1.columns)
V1 V2 Value
0 A B 4
1 A C 1
2 B C 5
3 B A 4
4 C A 1
5 C B 5
For correct order:
>>> pd.DataFrame(np.vstack((df1.to_numpy(), df1.iloc[:, np.r_[1:-1:-1, -1]].to_numpy())), columns=df1.columns, index=[*df1.index, *df1.index]).sort_index()
V1 V2 Value
0 A B 4
0 B A 4
1 A C 1
1 C A 1
2 B C 5
2 C B 5
And index reset:
>>> pd.DataFrame(np.vstack((df1.to_numpy(), df1.iloc[:, np.r_[1:-1:-1, -1]].to_numpy())), columns=df1.columns, index=[*df1.index, *df1.index]).sort_index().reset_index(drop=True)
V1 V2 Value
0 A B 4
1 B A 4
2 A C 1
3 C A 1
4 B C 5
5 C B 5
You can use assign together with concat (DataFrame.append was removed in pandas 2.0):
pd.concat([df1, df1.assign(V1=df1.V2, V2=df1.V1)], ignore_index=True)
Output:
V1 V2 Value
0 A B 4
1 A C 1
2 B C 5
3 B A 4
4 C A 1
5 C B 5
I'm conducting an experiment (using Python 2.7, pandas 0.23.4) in which I have three levels of a stimulus {a, b, c} and present all the different combinations to participants, who have to choose which one was rougher (example: Stimulus 1 = a, Stimulus 2 = b; the participant chooses 1, indicating stimulus 1 was rougher).
After the experiment, I have a data frame with three columns like this:
import pandas as pd
d = {'Stim1': ['a', 'b', 'a', 'c', 'b', 'c'],
'Stim2': ['b', 'a', 'c', 'a', 'c', 'b'],
'Answer': [1, 2, 2, 1, 2, 1]}
df = pd.DataFrame(d)
Stim1 Stim2 Answer
0 a b 1
1 b a 2
2 a c 2
3 c a 1
4 b c 2
5 c b 1
For my analysis, the order in which the stimuli were presented doesn't matter: Stim1 = a, Stim2 = b is the same as Stim1 = b, Stim2 = a. I'm trying to figure out how I can swap Stim1 and Stim2 and flip their Answer, to get this:
Stim1 Stim2 Answer
0 a b 1
1 a b 1
2 a c 2
3 a c 2
4 b c 2
5 b c 2
I read that np.where can be used, but it would do one thing at a time, whereas I want to do two things (swap and flip).
Is there some way to do the swap and the flip at the same time?
Can you try whether this works for you?
import pandas as pd
import numpy as np
df = pd.DataFrame(d)
# keep a copy of the original Stim1 column
s = df['Stim1'].copy()
# sort the values
df[['Stim1', 'Stim2']] = np.sort(df[['Stim1', 'Stim2']].values)
# exchange the Answer if the order has changed
df['Answer'] = df['Answer'].where(df['Stim1'] == s, df['Answer'].replace({1:2,2:1}))
Output:
Stim1 Stim2 Answer
0 a b 1
1 a b 1
2 a c 2
3 a c 2
4 b c 2
5 b c 2
You can start by building a boolean series that indicates which rows should be swapped or not:
>>> swap = df['Stim1'] > df['Stim2']
>>> swap
0 False
1 True
2 False
3 True
4 False
5 True
dtype: bool
Then build the fully swapped dataframe as follows:
>>> swapped_df = pd.concat([
... df['Stim1'].rename('Stim2'),
... df['Stim2'].rename('Stim1'),
... 3 - df['Answer'],
... ], axis='columns')
>>> swapped_df
Stim2 Stim1 Answer
0 a b 2
1 b a 1
2 a c 1
3 c a 2
4 b c 1
5 c b 2
Finally, use .mask() to select initial rows or swapped rows:
>>> df.mask(swap, swapped_df)
Stim1 Stim2 Answer
0 a b 1
1 a b 1
2 a c 2
3 a c 2
4 b c 2
5 b c 2
NB: .mask is roughly the inverse of .where: it replaces the rows where the series is True instead of keeping the rows that are True. The following is exactly equivalent:
>>> swapped_df.where(swap, df)
Stim2 Stim1 Answer
0 b a 1
1 b a 1
2 c a 2
3 c a 2
4 c b 2
5 c b 2
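For what it's worth, the np.where approach mentioned in the question can do both operations, as two vectorized assignments driven by the same mask (a sketch using the question's data; nothing beyond plain numpy/pandas is assumed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Stim1': ['a', 'b', 'a', 'c', 'b', 'c'],
                   'Stim2': ['b', 'a', 'c', 'a', 'c', 'b'],
                   'Answer': [1, 2, 2, 1, 2, 1]})

swap = (df['Stim1'] > df['Stim2']).to_numpy()  # rows whose pair is out of order

# swap the stimuli where needed; the tuple on the right is fully evaluated
# before assignment, so both np.where calls see the original columns
df['Stim1'], df['Stim2'] = (np.where(swap, df['Stim2'], df['Stim1']),
                            np.where(swap, df['Stim1'], df['Stim2']))

# flip 1 <-> 2 on exactly those rows
df['Answer'] = np.where(swap, 3 - df['Answer'], df['Answer'])
print(df)
```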
Let's say I have a dataframe that looks like this:
df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df
Out[92]:
A B
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Assuming that this dataframe already exists, how can I simply add a level 'C' to the column index, so I get this:
df
Out[92]:
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I saw SO answers like python/pandas: how to combine two dataframes into one with hierarchical column index?, but they concatenate different dataframes instead of adding a column level to an already existing dataframe.
As suggested by @StevenG himself, a better answer:
df.columns = pd.MultiIndex.from_product([df.columns, ['C']])
print(df)
# A B
# C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
option 1
set_index and T
df.T.set_index(np.repeat('C', df.shape[1]), append=True).T
option 2
pd.concat, keys, and swaplevel
pd.concat([df], axis=1, keys=['C']).swaplevel(0, 1, 1)
A solution which adds a name to the new level and is easier on the eyes than other answers already presented:
df['newlevel'] = 'C'
df = df.set_index('newlevel', append=True).unstack('newlevel')
print(df)
# A B
# newlevel C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
You could just assign the columns like:
>>> df.columns = [df.columns, ['C', 'C']]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Or for an unknown number of columns:
>>> df.columns = [df.columns.get_level_values(0), np.repeat('C', df.shape[1])]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Another way for columns that are already a MultiIndex (inserting a level 'E' between the existing levels):
df.columns = pd.MultiIndex.from_tuples(map(lambda x: (x[0], 'E', x[1]), df.columns))
A B
E E
C D
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I like it explicit (using MultiIndex) and chain-friendly (.set_axis):
df.set_axis(pd.MultiIndex.from_product([df.columns, ['C']]), axis=1)
This is particularly convenient when merging DataFrames with different column level numbers, where Pandas (1.4.2) raises a FutureWarning (FutureWarning: merging between different levels is deprecated and will be removed ... ):
import pandas as pd
df1 = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df2 = pd.DataFrame(index=list('abcde'), data=range(10, 15), columns=pd.MultiIndex.from_tuples([("C", "x")]))
# df1:
A B
a 0 0
b 1 1
# df2:
C
x
a 10
b 11
# merge while giving df1 another column level:
pd.merge(df1.set_axis(pd.MultiIndex.from_product([df1.columns, ['']]), axis=1),
df2,
left_index=True, right_index=True)
# result:
A B C
x
a 0 0 10
b 1 1 11
Another method, but using a list comprehension of tuples as the arg to pandas.MultiIndex.from_tuples():
df.columns = pd.MultiIndex.from_tuples([(col, 'C') for col in df.columns])
df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
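Whichever variant you pick, the added level is easy to strip again with droplevel, which is convenient when the MultiIndex is only needed temporarily (a quick sketch):

```python
import pandas as pd

df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})

# add the level 'C' ...
df.columns = pd.MultiIndex.from_product([df.columns, ['C']])

# ... and remove it again (level 1 is the level that was just added)
flat = df.copy()
flat.columns = flat.columns.droplevel(1)
```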
Let me simplify my problem for ease of explanation.
I have a pandas DataFrame table with the below format:
a b c
0 1 3 2
1 3 1 2
2 3 2 1
The numbers in each row represent ranks of the columns.
For example, the order of the first row is {a, c, b}.
How can I convert the above to the below ?
1 2 3
0 a c b
1 c a b
2 c b a
I googled all day long but couldn't find a solution.
Looks like you are just mapping one value to another and renaming the columns, e.g.:
>>> df = pd.DataFrame({'a':[1,3,3], 'b':[3,1,2], 'c':[2,2,1]})
>>> df = df.applymap(lambda x: df.columns[x-1])
>>> df.columns = [1,2,3]
>>> df
1 2 3
0 a c b
1 c a b
2 c b a
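One caveat: DataFrame.applymap is deprecated since pandas 2.1 in favour of the elementwise DataFrame.map. The same idea, sketched so it runs on both old and new versions (the column labels are captured in cols before the frame is overwritten):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 3], 'b': [3, 1, 2], 'c': [2, 2, 1]})
cols = df.columns  # capture the names before they are replaced

# value v in any cell becomes the name of the v-th column (1-based)
elementwise = df.map if hasattr(df, 'map') else df.applymap  # pandas >= 2.1 vs older
out = elementwise(lambda v: cols[v - 1])
out.columns = range(1, len(cols) + 1)
print(out)
```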