I have a pandas DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame([['a', 2, 3], ['a,b', 5, 6], ['c', 8, 9]])
0 1 2
0 a 2 3
1 a,b 5 6
2 c 8 9
I want to spread the first column into n columns (where n is the number of unique comma-separated values, in this case 3). Each resulting column should be 1 if the value is present and 0 otherwise. The expected result is:
1 2 a c b
0 2 3 1 0 0
1 5 6 1 0 1
2 8 9 0 1 0
I came up with the following code, but it seems a bit circuitous to me.
>>> import re
>>> dfSpread = pd.get_dummies(df[0].str.split(',', expand=True)).\
rename(columns=lambda x: re.sub('.*_','',x))
>>> pd.concat([df.iloc[:,1:], dfSpread], axis = 1)
Is there a built-in function that does just that, which I wasn't able to find?
Using get_dummies
df.set_index([1,2])[0].str.get_dummies(',').reset_index()
Out[229]:
1 2 a b c
0 2 3 1 0 0
1 5 6 1 1 0
2 8 9 0 0 1
You can use pop + concat here for an alternative version of Wen's answer.
pd.concat([df, df.pop(df.columns[0]).str.get_dummies(sep=',')], axis=1)
1 2 a b c
0 2 3 1 0 0
1 5 6 1 1 0
2 8 9 0 0 1
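Note that pop mutates df in place, so column 0 is gone from df once the concat runs. If you need df intact, a non-mutating sketch of the same idea:
out = pd.concat([df.drop(columns=0), df[0].str.get_dummies(sep=',')], axis=1)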
SQL: SELECT MAX(A), MIN(B), C FROM Table GROUP BY C
I want to do the same operation in pandas on a dataframe. The closest I got was:
DF2 = DF1.groupby(by=['C']).max()
but there I end up getting the max of both columns. How do I apply more than one aggregation while grouping?
You can use the agg function:
DF2 = DF1.groupby('C').agg({'A': max, 'B': min})
Sample:
print(DF1)
A B C D
0 1 5 a a
1 7 9 a b
2 2 10 c d
3 3 2 c c
DF2 = DF1.groupby('C').agg({'A': max, 'B': min})
print(DF2)
A B
C
a 7 5
c 3 2
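On pandas 0.25 or later you can also spell this with named aggregation, which reads a bit more cleanly (a sketch over the same sample data):
DF2 = DF1.groupby('C').agg(A=('A', 'max'), B=('B', 'min'))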
GroupBy-fu: improvements in grouping and aggregating data in pandas - nice explanations.
Try the agg() function:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=list('ABC'))
print(df)
print(df.groupby('C').agg({'A': max, 'B':min}))
Output:
A B C
0 2 3 0
1 2 2 1
2 4 0 1
3 0 1 4
4 3 3 2
5 0 4 3
6 2 4 2
7 3 4 0
8 4 2 2
9 3 2 1
10 2 3 1
11 4 1 0
12 4 3 2
13 0 0 1
14 3 1 1
15 4 1 1
16 0 0 0
17 4 0 1
18 3 4 0
19 0 2 4
A B
C
0 4 0
1 4 0
2 4 2
3 0 4
4 0 1
Alternatively, you may want to check the pandas.read_sql_query() function...
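For completeness, a hypothetical sketch of that route; the table name my_table and the SQLAlchemy engine are placeholders, not part of the question:
import pandas as pd
# engine is assumed to be an existing SQLAlchemy engine/connection
DF2 = pd.read_sql_query(
    'SELECT C, MAX(A) AS A, MIN(B) AS B FROM my_table GROUP BY C',
    con=engine,
)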
You can use the agg function:
import pandas as pd
import numpy as np
df.groupby('something').agg({'column1': np.max, 'column2': np.min})
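String aliases work too and save the numpy import (an equivalent sketch):
df.groupby('something').agg({'column1': 'max', 'column2': 'min'})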
I have a bunch of data frames. They all have the same columns but different numbers of rows. They look like this:
df_1
0
0 1
1 0
2 0
3 1
4 1
5 0
df_2
0
0 1
1 0
2 0
3 1
df_3
0
0 1
1 0
2 0
3 1
4 1
I have them all stored in a list.
Then I have a numpy array whose items map, in order, onto the rows of the individual dfs. The numpy array looks like this:
[3 1 1 2 4 0 6 7 2 1 3 2 5 5 5]
If I were to pd.concat my list of dataframes, then I could merge the np array onto the concatenated df. However, I want to preserve the individual df structure, so it should look like this:
0 1
0 1 3
1 0 1
2 0 1
3 1 2
4 1 4
5 0 0
0 1
0 1 6
1 0 7
2 0 2
3 1 1
0 1
0 1 3
1 0 2
2 0 5
3 1 5
4 1 5
Considering the given dataframes and array as:
import numpy as np
import pandas as pd

df1 = pd.DataFrame([1, 0, 0, 1, 1, 0])
df2 = pd.DataFrame([1, 0, 0, 1])
df3 = pd.DataFrame([1, 0, 0, 1, 1])
arr = np.array([3, 1, 1, 2, 4, 0, 6, 7, 2, 1, 3, 2, 5, 5, 5])
You can use numpy.split to split the array into sub-arrays whose lengths match the given dataframes. Then you can append those sub-arrays as new columns to their respective dataframes.
Use:
dfs = [df1, df2, df3]
def get_indices(dfs):
    """
    Returns the split indices inside the array.
    """
    indices = [0]
    for df in dfs:
        indices.append(len(df) + indices[-1])
    return indices[1:-1]

# split the given arr into multiple sections.
sections = np.split(arr, get_indices(dfs))
for df, s in zip(dfs, sections):
    df[1] = s  # append the section of the array to the dataframe
    print(df)
This results in:
# df1
0 1
0 1 3
1 0 1
2 0 1
3 1 2
4 1 4
5 0 0
#df2
0 1
0 1 6
1 0 7
2 0 2
3 1 1
# df3
0 1
0 1 3
1 0 2
2 0 5
3 1 5
4 1 5
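Since the split points are just the cumulative lengths of the dataframes, the helper can also be collapsed into a one-liner (a sketch under the same setup):
sections = np.split(arr, np.cumsum([len(d) for d in dfs])[:-1])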
Supposing I have the two DataFrames shown below:
dd = pd.DataFrame([1,0, 3, 0, 5])
0
0 1
1 0
2 3
3 0
4 5
and
df = pd.DataFrame([2,4])
0
0 2
1 4
How can I broadcast the values of df into dd with step = 2 so that I end up with:
0
0 1
1 2
2 3
3 4
4 5
Another solution:
dd = pd.DataFrame([1, 0, 3, 0, 5])
df = pd.DataFrame([2, 4])
dd.iloc[1::2] = df.values
dd
# Out:
0
0 1
1 2
2 3
3 4
4 5
dd.values[1::2] = df.values
dd now contains:
0
0 1
1 2
2 3
3 4
4 5
Note that the step=2 condition is used here: the array[1::2] slice syntax means start from the element at index 1, continue until the end, with a step of 2.
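The same pattern generalizes to any step. A quick sketch with hypothetical frames and step=3:
import pandas as pd

dd = pd.DataFrame(range(9))         # 9 rows: 0..8
df = pd.DataFrame([100, 200, 300])
dd.iloc[1::3] = df.values           # rows 1, 4 and 7 become 100, 200, 300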
Change df.index to a range matching the target positions, then use it to fill the second DataFrame into dd:
df.index = range(1, len(dd)+1, 2)[:len(df)]
print (df)
0
1 2
3 4
dd.loc[df.index] = df
print (dd)
0
0 1
1 2
2 3
3 4
4 5
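The [:len(df)] slice simply trims the odd positions to however many rows df actually has; with the frames above the new index works out to:
print(list(range(1, len(dd) + 1, 2)[:len(df)]))  # [1, 3]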
Given a dataframe whose index has repeated values, how can I get a new dataframe with a hierarchical index whose first level is the original index and whose second level is 0, 1, 2, ..., n?
Example:
>>> df
0 1
a 2 4
a 4 6
b 7 8
b 2 4
c 3 7
>>> df2 = df.some_operation()
>>> df2
0 1
a 0 2 4
1 4 6
b 0 7 8
1 2 4
c 0 3 7
You can use cumcount:
df.assign(level2=df.groupby(level=0).cumcount()).set_index('level2', append=True)
Out[366]:
0 1
level2
a 0 2 4
1 4 6
b 0 7 8
1 2 4
c 0 3 7
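If you'd rather not have the level2 name on the new index level, you can pass the cumcount Series straight to set_index (a sketch over the same df):
df2 = df.set_index(df.groupby(level=0).cumcount(), append=True)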
You can do it the fake way (totally not recommended, don't use this):
>>> df.index=[v if i%2 else '' for i,v in enumerate(df.index)]
>>> df.insert(0,'',([0,1]*3)[:-1])
>>> df
0 1
0 2 4
a 1 4 6
0 7 8
b 1 2 4
0 3 7
This changes the index labels and inserts a column whose name is '' (an empty string).
I have a matrix with 0s and 1s, and want to do a cumsum on each column that resets to 0 whenever a zero is observed. For example, if we have the following:
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
print(df)
a b
0 0 1
1 1 1
2 0 1
3 1 0
4 1 1
5 0 1
The result I desire is:
print(df)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
However, when I try df.cumsum() * df, I am able to correctly identify the 0 elements, but the counter does not reset:
print(df.cumsum() * df)
a b
0 0 1
1 1 2
2 0 3
3 2 0
4 3 4
5 0 5
You can use:
a = df != 0
df1 = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int)
print (df1)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
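To see why this works, here are the intermediates for column a (a sketch using the question's data):
c = a['a'].cumsum()           # 0 1 1 2 3 3 -> running count of nonzeros
r = c.where(~a['a']).ffill()  # 0 0 1 1 1 3 -> count frozen at the last zero
print((c - r.fillna(0)).astype(int).tolist())  # [0, 1, 0, 1, 2, 0]
The fillna(0) covers rows before the first zero appears, where the ffill still yields NaN.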
Try this:
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns=['a','b'])
df['groupId1'] = df.a.eq(0).cumsum()
df['groupId2'] = df.b.eq(0).cumsum()
New = pd.DataFrame()
New['a'] = df.groupby('groupId1').a.transform('cumsum')
New['b'] = df.groupby('groupId2').b.transform('cumsum')
New
Out[1184]:
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
You may also try the following naive but reliable approach.
For every column, create groups to count within: a new group starts whenever the value changes from one row to the next and lasts while the value stays constant, which is exactly what (x != x.shift()).cumsum() computes.
For example, the group ids computed for the question's df are:
a b
0 1 1
1 2 1
2 3 1
3 4 2
4 4 3
5 5 3
Calculate the cumulative sums within those groups, per column, using pd.DataFrame's apply and groupby methods, and you get the cumsum with zero reset in one line:
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns = ['a','b'])
cs = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumsum())
print(cs)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
A slightly hacky way would be to identify the indices of the zeros and set the corresponding values to the negative of those indices before doing the cumsum:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns=['a','b'])
z = np.where(df['b'] == 0)
df.loc[z[0], 'b'] = -z[0]  # .loc avoids chained-assignment issues
df['b'] = np.cumsum(df['b'])
df
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 1 1
5 0 2
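One caveat: this relies on the running sum just before each zero equaling that zero's index, which holds here because column b contains a single zero preceded only by ones. With several zeros in a column the later offsets no longer cancel, as this small check shows:
s = pd.Series([1, 0, 1, 0])
z = np.where(s == 0)[0]
s[z] = -z
print(s.cumsum().tolist())  # [1, 0, 1, -2] -- the second reset is wrong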