I have a dataframe like
df1 = pd.DataFrame({'name':['al', 'ben', 'cary'], 'bin':[1.0, 1.0, 3.0], 'score':[40, 75, 15]})
bin name score
0 1 al 40
1 1 ben 75
2 3 cary 15
and a dataframe like
df2 = pd.DataFrame({'bin':[1.0, 2.0, 3.0, 4.0, 5.0], 'x':[1, 1, 0, 0, 0],
'y':[0, 0, 1, 1, 0], 'z':[0, 0, 0, 1, 0]})
bin x y z
0 1 1 0 0
1 2 1 0 0
2 3 0 1 0
3 4 0 1 1
4 5 0 0 0
what I want to do is extend df1 with the columns 'x', 'y', and 'z', and fill with score only where the bin matches and the respective 'x', 'y', 'z' value is 1, not 0.
I've gotten as far as
df3 = pd.merge(df1, df2, how='left', on=['bin'])
bin name score x y z
0 1 al 40 1 0 0
1 1 ben 75 1 0 0
2 3 cary 15 0 1 0
but I don't see an elegant way to get the score values into the correct 'x', 'y', etc. columns (my real-life problem has over a hundred such columns, so doing df3['x'] = df3['score'] * df3['x'] column by column might be rather slow).
You can just get a list of the columns you want to multiply the scores by and then use the apply function:
cols = [each for each in df2.columns if each not in ('name', 'bin')]
df3 = pd.merge(df1, df2, how='left', on=['bin'])
df3[cols] = df3.apply(lambda x: x['score'] * x[cols], axis=1)
This may not be much faster than iterating, but is an idea.
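As a sketch of a vectorized alternative (using the same df1/df2 as above), `DataFrame.mul` with `axis=0` broadcasts the score column across all selected columns at once, avoiding the row-wise apply:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['al', 'ben', 'cary'],
                    'bin': [1.0, 1.0, 3.0],
                    'score': [40, 75, 15]})
df2 = pd.DataFrame({'bin': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'x': [1, 1, 0, 0, 0],
                    'y': [0, 0, 1, 1, 0],
                    'z': [0, 0, 0, 1, 0]})

df3 = pd.merge(df1, df2, how='left', on=['bin'])
cols = [c for c in df2.columns if c != 'bin']
# Broadcast score down each indicator column in one vectorized operation
df3[cols] = df3[cols].mul(df3['score'], axis=0)
```

This stays vectorized no matter how many indicator columns there are, so it should scale to the hundred-column case.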
Import numpy and define the columns covered in the operation:
import numpy as np
columns = ['x','y','z']
score_col = 'score'
Construct a numpy array of the score column, reshaped to match the number of columns in the operation:
score_matrix = np.repeat(df3[score_col].values, len(columns))
score_matrix = score_matrix.reshape(len(df3), len(columns))
Multiply by the columns and assign back to the dataframe:
df3[columns] = score_matrix * df3[columns]
I have data as you can see in the terminal. I need it to be converted to the Excel sheet format as you can see in the Excel sheet file by creating multi-levels in columns.
I researched this and tried many different things but could not achieve my goal. I then found "transpose", which gave me the shape I need, but unfortunately it reshaped from a column to a row, so I got the wrong data ordering.
Current result:
Desired result:
What can I try next?
You can use the pivot() function and reorder the multi-column levels.
Before that, index/group the data by repeated iterations/rounds:
import itertools
import pandas as pd

data = [
(2,0,0,1),
(10,2,5,3),
(2,0,0,0),
(10,1,1,1),
(2,0,0,0),
(10,1,2,1),
]
columns = ["player_number", "cel1", "cel2", "cel3"]
df = pd.DataFrame(data=data, columns=columns)
df_nbr_plr = df[["player_number"]].groupby("player_number").agg(cnt=("player_number","count"))
df["round"] = list(itertools.chain.from_iterable(itertools.repeat(x, df_nbr_plr.shape[0]) for x in range(df_nbr_plr.iloc[0,0])))
[Out]:
player_number cel1 cel2 cel3 round
0 2 0 0 1 0
1 10 2 5 3 0
2 2 0 0 0 1
3 10 1 1 1 1
4 2 0 0 0 2
5 10 1 2 1 2
Now, pivot and reorder the column levels:
df = df.pivot(index="round", columns="player_number").reorder_levels([1,0], axis=1).sort_index(axis=1)
[Out]:
player_number 2 10
cel1 cel2 cel3 cel1 cel2 cel3
round
0 0 0 1 2 5 3
1 0 0 0 1 1 1
2 0 0 0 1 2 1
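As an alternative sketch (same data as above), the round column can also be built with `groupby(...).cumcount()`, which numbers each player's repeated appearances and avoids the itertools bookkeeping:

```python
import pandas as pd

data = [(2, 0, 0, 1), (10, 2, 5, 3),
        (2, 0, 0, 0), (10, 1, 1, 1),
        (2, 0, 0, 0), (10, 1, 2, 1)]
df = pd.DataFrame(data, columns=["player_number", "cel1", "cel2", "cel3"])

# Each player's n-th appearance belongs to round n
df["round"] = df.groupby("player_number").cumcount()
out = (df.pivot(index="round", columns="player_number")
         .reorder_levels([1, 0], axis=1)
         .sort_index(axis=1))
```

This assumes every player appears exactly once per round, which is what the sample data shows.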
This can be done with unstack after setting player__number as part of the index. You have to reorder the MultiIndex columns and fill missing values/delete duplicate rows, though:
import pandas as pd
data = {"player__number": [2, 10 , 2, 10, 2, 10],
"cel1": [0, 2, 0, 1, 0, 1],
"cel2": [0, 5, 0, 1, 0, 2],
"cel3": [1, 3, 0, 1, 0, 1],
}
df = pd.DataFrame(data).set_index('player__number', append=True)
df = df.unstack('player__number').reorder_levels([1, 0], axis=1).sort_index(axis=1) # unstacking, reordering and sorting columns
df = df.ffill().iloc[1::2].reset_index(drop=True) # filling values and keeping only every two rows
df.to_excel('output.xlsx')
Output:
I have two dataframes df1 and df2. x,y values in df2 is a subset of x,y values in df1. For each x,y row in df2, I want to change the value of knn column in df1 to 0, where df2[x] = df1[x] and df2[y] = df1[y]. In the example below x,y values (1,1) and (1,2) are common therefore knn column in df1 will change to [0,0,0,0]. The last line in the code below is not working. I would appreciate any guidance.
import pandas as pd
df1_dict = {'x': ['1','1','1','1'],
'y': [1,2,3,4],
'knn': [1,1,0,0]
}
df2_dict = {'x': ['1','1'],
'y': [1,2]
}
df1 = pd.DataFrame(df1_dict, columns = ['x', 'y','knn'])
df2 = pd.DataFrame(df2_dict, columns = ['x', 'y'])
df1['knn']= np.where((df1['x']==df2['x']) and df1['y']==df2['y'], 0)
You can use merge here:
u = df1.merge(df2,on=['x','y'],how='left',indicator=True)
u = (u.assign(knn=np.where(u['_merge'].eq("both"),0,u['knn']))
.reindex(columns=df1.columns))
print(u)
x y knn
0 1 1 0
1 1 2 0
2 1 3 0
3 1 4 0
You can use MultiIndex.isin:
c = ['x', 'y']
df1.loc[df1.set_index(c).index.isin(df2.set_index(c).index), 'knn'] = 0
x y knn
0 1 1 0
1 1 2 0
2 1 3 0
3 1 4 0
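For completeness, a sketch combining the two ideas above: the original `np.where` line fails because `and` cannot combine Series element-wise (and the two frames have different lengths), but `np.where` works fine once the condition is a single membership mask per df1 row:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': ['1', '1', '1', '1'],
                    'y': [1, 2, 3, 4],
                    'knn': [1, 1, 0, 0]})
df2 = pd.DataFrame({'x': ['1', '1'], 'y': [1, 2]})

# One boolean per df1 row: is its (x, y) pair present in df2?
mask = df1.set_index(['x', 'y']).index.isin(df2.set_index(['x', 'y']).index)
df1['knn'] = np.where(mask, 0, df1['knn'])
```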
Hi, I have some data and I want to rename one of the columns and select the columns that start with the string 't'.
raw_data = {'patient': [1, 1, 1, 2, 2],
'obs': [1, 2, 3, 1, 2],
'treatment': [0, 1, 0, 1, 0],
'score': ['strong', 'weak', 'normal', 'weak', 'strong'],
'tr': [1,2,3,4,5],
'tk': [6,7,8,9,10],
'ak': [11,12,13,14,15]
}
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score','tr','tk','ak'])
df
patient obs treatment score tr tk ak
0 1 1 0 strong 1 6 11
1 1 2 1 weak 2 7 12
2 1 3 0 normal 3 8 13
3 2 1 1 weak 4 9 14
4 2 2 0 strong 5 10 15
So I tried by following python-pandas-renaming-column-name-startswith
df.rename(columns = {'treatment':'treat'})[['score','obs',df[df.columns[pd.Series(df.columns).str.startswith('t')]]]]
but getting this error
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
How can I select the columns that starts with t ?
Thx
Converting to a Series is not necessary, but if you want to add the result to another list of columns, convert the output to a list:
cols = df.columns[df.columns.str.startswith('t')].tolist()
df = df[['score','obs'] + cols].rename(columns = {'treatment':'treat'})
Another idea is to use two masks and chain them with | for bitwise OR.
Notice:
In your solution the column names are filtered from the original column names before renaming, so it is necessary to rename afterwards.
m1 = df.columns.str.startswith('t')
m2 = df.columns.isin(['score','obs'])
df = df.loc[:, m1 | m2].rename(columns = {'treatment':'treat'})
print (df)
obs treat score tr tk
0 1 0 strong 1 6
1 2 1 weak 2 7
2 3 0 normal 3 8
3 1 1 weak 4 9
4 2 0 strong 5 10
If you need to rename first, it is necessary to assign back so you can filter by the renamed column names:
df = df.rename(columns = {'treatment':'treat'})
df = df.loc[:, df.columns.str.startswith('t') | df.columns.isin(['score','obs'])]
# Select columns starting with "t"
df = df[df.columns[df.columns.str.startswith('t')]]
# Rename the column (assign back, otherwise the rename is lost)
df = df.rename(columns={'treatment': 'treat'})
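Another option, sketched with the same df, is `DataFrame.filter`, which selects columns by regex in a single call:

```python
import pandas as pd

raw_data = {'patient': [1, 1, 1, 2, 2],
            'obs': [1, 2, 3, 1, 2],
            'treatment': [0, 1, 0, 1, 0],
            'score': ['strong', 'weak', 'normal', 'weak', 'strong'],
            'tr': [1, 2, 3, 4, 5],
            'tk': [6, 7, 8, 9, 10],
            'ak': [11, 12, 13, 14, 15]}
df = pd.DataFrame(raw_data)

# Columns whose name starts with 't', found via a regex filter
t_cols = df.filter(regex='^t').columns.tolist()
out = df[['score', 'obs'] + t_cols].rename(columns={'treatment': 'treat'})
```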
I am trying to count consecutive zeros (e.g. 2 consecutive zeros or 3 consecutive zeros) in groups and combine the results in a new dataframe.
raw_data = {'groups': ['x', 'x', 'x', 'x', 'x', 'x', 'x','z','y', 'y', 'y','y', 'y', 'z'],
'runs': [0, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 0, 2]}
df = pd.DataFrame(raw_data, columns = ['groups', 'runs'])
For example, in the dataframe above, first I want to know how many runs of 2 consecutive zeros are in each group, and then how many runs of 3 consecutive zeros are in each group.
I want the results (preferably in a dataframe):
group 2_0s 3_0s
x 1 1
y 1 0
z 0 0
I am hoping to find a generic way, as I want to be able to do the same for consecutive 1s and 2s as well.
Thanks.
You can use:
#get original unique sorted values of groups
orig = np.sort(df.groups.unique())
#add new groups to distinguish runs of 0 within one group
df['g'] = (df.runs != df.runs.shift()).cumsum()
#filter only 0 values
df = df[df.runs == 0]
print (df)
groups runs g
0 x 0 1
1 x 0 1
2 x 0 1
5 x 0 3
6 x 0 3
11 y 0 6
12 y 0 6
#get size by groups and g
df = df.groupby(['groups', 'g']).size().reset_index(name='0')
print (df)
groups g 0
0 x 1 3
1 x 3 2
2 y 6 2
#get size by groups and 0, unstack
#reindex by original unique values, add suffix to column names
df1 = (df.groupby(['groups','0'])
         .size()
         .unstack(fill_value=0)
         .reindex(orig, fill_value=0)
         .add_suffix('_0s'))
print (df1)
0 2_0s 3_0s
groups
x 1 1
y 1 0
z 0 0
More generic solution:
df['g'] = (df.runs != df.runs.shift()).cumsum()
df = df.groupby(['groups', 'g', 'runs']).size().reset_index(name='0')
df1 = df.groupby(['groups','runs', '0']).size().unstack(level=[1,2]).fillna(0).astype(int)
print (df1)
runs 0 1 2
0 2 3 2 3 1
groups
x 1 1 1 0 0
y 1 0 0 1 0
z 0 0 0 0 2
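The steps above can also be wrapped into one generic helper (a sketch; `count_runs` is a hypothetical name) that counts runs of any value and any length per group:

```python
import pandas as pd

raw_data = {'groups': ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'z',
                       'y', 'y', 'y', 'y', 'y', 'z'],
            'runs': [0, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 0, 2]}
df = pd.DataFrame(raw_data)

def count_runs(df, value, length):
    """Count runs of `value` of exactly `length` rows, per group."""
    tmp = df.copy()
    # Start a new block whenever the value or the group changes
    tmp['block'] = ((tmp['runs'] != tmp['runs'].shift())
                    | (tmp['groups'] != tmp['groups'].shift())).cumsum()
    sizes = tmp[tmp['runs'] == value].groupby(['groups', 'block']).size()
    counts = sizes.eq(length).groupby(level='groups').sum()
    return counts.reindex(sorted(df['groups'].unique()), fill_value=0)
```

`count_runs(df, 0, 2)` gives the 2_0s column and `count_runs(df, 0, 3)` the 3_0s column; passing value=1 or 2 handles consecutive 1s and 2s the same way.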
I am using Python Pandas for the following. I have three dataframes, df1, df2 and df3. Each has the same dimensions, index and column labels. I would like to create a fourth dataframe that takes elements from df1 or df2 depending on the values in df3:
df1 = pd.DataFrame(np.random.randn(4, 2), index=list('0123'), columns=['A', 'B'])
df1
Out[67]:
A B
0 1.335314 1.888983
1 1.000579 -0.300271
2 -0.280658 0.448829
3 0.977791 0.804459
df2 = pd.DataFrame(np.random.randn(4, 2), index=list('0123'), columns=['A', 'B'])
df2
Out[68]:
A B
0 0.689721 0.871065
1 0.699274 -1.061822
2 0.634909 1.044284
3 0.166307 -0.699048
df3 = pd.DataFrame({'A': [1, 0, 0, 1], 'B': [1, 0, 1, 0]})
df3
Out[69]:
A B
0 1 1
1 0 0
2 0 1
3 1 0
The new dataframe, df4, has the same index and column labels and takes an element from df1 if the corresponding value in df3 is 1. It takes an element from df2 if the corresponding value in df3 is a 0.
I need a solution that uses generic references (e.g. ix or iloc) rather than actual column labels and index values because my dataset has fifty columns and four hundred rows.
As your DataFrames happen to be numeric, and the selector matrix happens to be of indicator variables, you can do the following:
>>> pd.DataFrame(
...     df1.values * df3.values + df2.values * (1 - df3.values),
...     index=df1.index,
...     columns=df1.columns)
I tested it and it works. Strangely enough, @Yakym Pirozhenko's answer, which I think is superior, does not work for me.
df4 = df1.where(df3.astype(bool), df2) should do it.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randint(10, size = (4,2)))
df2 = pd.DataFrame(np.random.randint(10, size = (4,2)))
df3 = pd.DataFrame(np.random.randint(2, size = (4,2)))
df4 = df1.where(df3.astype(bool), df2)
print(df1, '\n')
print(df2, '\n')
print(df3, '\n')
print(df4, '\n')
Output:
0 1
0 0 3
1 8 8
2 7 4
3 1 2
0 1
0 7 9
1 4 4
2 0 5
3 7 2
0 1
0 0 0
1 1 0
2 1 1
3 1 0
0 1
0 7 9
1 8 4
2 7 4
3 1 2
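A numpy-only sketch of the same selection (assuming an indicator df3 like the examples above): `np.where` picks from df1 where the mask is 1 and from df2 elsewhere, which avoids any arithmetic on the values:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(4, 2), index=list('0123'), columns=['A', 'B'])
df2 = pd.DataFrame(np.random.randn(4, 2), index=list('0123'), columns=['A', 'B'])
df3 = pd.DataFrame({'A': [1, 0, 0, 1], 'B': [1, 0, 1, 0]},
                   index=list('0123'))

# Elementwise choice: df1 where df3 == 1, df2 where df3 == 0
df4 = pd.DataFrame(np.where(df3.astype(bool), df1, df2),
                   index=df1.index, columns=df1.columns)
```

Unlike the multiply-and-add approach, this also works for non-numeric dataframes.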