I have a main df like so:
index A B C
5 1 5 8
6 2 4 1
7 8 3 4
8 3 9 5
and an auxiliary df2 that I want to add to the main df like so:
index A B
5 4 2
6 4 3
7 7 1
8 6 2
Columns A & B are the same name, however the main df contains many columns that the secondary df2 does not. I want to sum the columns that are common and leave the others as is.
Output:
index A B C
5 5 7 8
6 6 7 1
7 15 4 4
8 9 11 5
Have tried variations of df.join, pd.merge and groupby but having no luck at the moment.
Last Attempt:
df.groupby('index').sum().add(df2.groupby('index').sum())
But this does not keep common columns.
pd.merge I am getting suffix _x and _y
Use add only with same columns by intersection:
c = df.columns.intersection(df2.columns)
df[c] = df[c].add(df2[c], fill_value=0)
print (df)
A B C
index
5 5 7 8
6 6 7 1
7 15 4 4
8 9 11 5
If use only add, integers columns which not matched are converted to floats:
df = df.add(df2, fill_value=0)
print (df)
A B C
index
5 5 7 8.0
6 6 7 1.0
7 15 4 4.0
8 9 11 5.0
EDIT:
If possible strings common columns:
print (df)
A B C D
index
5 1 5 8 a
6 2 4 1 e
7 8 3 4 r
8 3 9 5 w
print (df2)
A B C D
index
5 1 5 8 a
6 2 4 1 e
7 8 3 4 r
8 3 9 5 w
Solution is similar, only filter first only numeric columns by select_dtypes:
c = df.select_dtypes(np.number).columns.intersection(df2.select_dtypes(np.number).columns)
df[c] = df[c].add(df2[c], fill_value=0)
print (df)
A B C D
index
5 5 7 8 a
6 6 7 1 e
7 15 4 4 r
8 9 11 5 w
Not the cleanest way but it might work.
df_new = pd.DataFrame()
df_new['A'] = df['A'] + df2['A']
df_new['B'] = df['B'] + df2['B']
df_new['C'] = df['C']
print(df_new)
A B C
0 5 7 8
1 6 7 1
2 15 4 4
3 9 11 5
Related
I have the following df:
df = pd.DataFrame({'a': [1,2,3,4,2], 'b': [3,4,1,0,4], 'c':[1,2,3,1,0], 'd':[3,2,4,1,4]})
I want to generate a combination of totals from these 4 columns, which equals 4 x 3 x 2 = 24 total combinations minus duplicates. I want the results in the same df.
I want something that looks like this (partial results shown):
A combo of a_b is the same as b_a and therefore I wouldn't want such a calculation since its a duplicate.
Is there a way to calculate all combinations and exclude duplicate totals?
import itertools as it
orig_cols = df.columns
for r in range(2, df.shape[1] + 1):
for cols in it.combinations(orig_cols, r):
df["_".join(cols)] = df.loc[:, cols].sum(axis=1)
Needs some looping, but not on the dataframe itself, but rather the combinations. We get 2, 3, ..., N-1'th combinations of the column names where N is number of columns. Then form the new _-joined column as the sum.
In [11]: df
Out[11]:
a b c d a_b a_c a_d b_c b_d c_d a_b_c a_b_d a_c_d b_c_d a_b_c_d
0 1 3 1 3 4 2 4 4 6 4 5 7 5 7 8
1 2 4 2 2 6 4 4 6 6 4 8 8 6 8 10
2 3 1 3 4 4 6 7 4 5 7 7 8 10 8 11
3 4 0 1 1 4 5 5 1 1 2 5 5 6 2 6
4 2 4 0 4 6 2 6 4 8 4 6 10 6 8 10
So I have a data frame that has two columns, State and Cost, and a separate list of new "what-if" costs
State Cost
A 2
B 9
C 8
D 4
New_Cost_List = [1, 5, 10]
I'd like to replicate all the rows in my data set for each value of New_Cost, adding a new column for each New_Cost for each state.
State Cost New_Cost
A 2 1
B 9 1
C 8 1
D 4 1
A 2 5
B 9 5
C 8 5
D 4 5
A 2 10
B 9 10
C 8 10
D 4 10
I thought a for loop might be appropriate to iterate through, replicating my dataset for the length of the list and adding the values of the list as a new column:
for v in New_Cost_List:
df_new = pd.DataFrame(np.repeat(df.values, len(New_Cost_List), axis=0))
df_new.columns = df.columns
df_new['New_Cost'] = v
The output of this gives me the correct replication of State and Cost but the New_Cost value is 10 for each row. Clearly I'm not connecting how to get it to run through the list for each replicated set, so any suggestions? Or is there a better way to approach this?
EDIT 1
Reducing the number of values in the New_Cost_List from 4 to 3 so there's a difference in row count and length of the list.
Here is a way using the keys paramater of pd.concat():
(pd.concat([df]*len(New_Cost_List),
keys = New_Cost_List,
names = ['New_Cost',None])
.reset_index(level=0))
Output:
New_Cost State Cost
0 1 A 2
1 1 B 9
2 1 C 8
3 1 D 4
0 5 A 2
1 5 B 9
2 5 C 8
3 5 D 4
0 10 A 2
1 10 B 9
2 10 C 8
3 10 D 4
If i understand your question correctly, this should solve your problem.
df['New Cost'] = new_cost_list
df = pd.concat([df]*len(new_cost_list), ignore_index=True)
Output:
State Cost New Cost
0 A 2 1
1 B 9 5
2 C 8 10
3 D 4 15
4 A 2 1
5 B 9 5
6 C 8 10
7 D 4 15
8 A 2 1
9 B 9 5
10 C 8 10
11 D 4 15
12 A 2 1
13 B 9 5
14 C 8 10
15 D 4 15
You can use index.repeat and numpy.tile:
df2 = (df
.loc[df.index.repeat(len(New_Cost_List))]
.assign(**{'New_Cost': np.repeat(New_Cost_List, len(df))})
)
or, simply, with a cross merge:
df2 = df.merge(pd.Series(New_Cost_List, name='New_Cost'), how='cross')
output:
State Cost New_Cost
0 A 2 1
0 A 2 5
0 A 2 10
1 B 9 1
1 B 9 5
1 B 9 10
2 C 8 1
2 C 8 5
2 C 8 10
3 D 4 1
3 D 4 5
3 D 4 10
For the provided order:
(df
.merge(pd.Series(New_Cost_List, name='New_Cost'), how='cross')
.sort_values(by='New_Cost', kind='stable')
.reset_index(drop=True)
)
output:
State Cost New_Cost
0 A 2 1
1 B 9 1
2 C 8 1
3 D 4 1
4 A 2 5
5 B 9 5
6 C 8 5
7 D 4 5
8 A 2 10
9 B 9 10
10 C 8 10
11 D 4 10
there are two dataframes df_one and df_two I want to create a new data frame by with selective column from each of the dataframes
df_one
e b c d
1 2 3 4
5 6 7 8
6 2 4 8
9 2 5 6
and
df_two
e f g h
1 8 7 6
5 6 6 4
6 6 2 4
9 5 3 2
I want to create a new dataframe new_df
e b g h d
1 6 7 6 4
5 2 6 4 8
6 2 2 4 8
9 2 3 2 6
enter image description here
result = pd.merge(df_one, df_two, on='e')
result=result.loc[:,["e","b","g","h","d"]]
Use:
pd.merge(df1[["e", "b", "d"]], df2[["e", "g", "h"]], on="e")
I want to groupby DataFrame and get the nlargest data of column 'C'.
while the return is series, not DataFrame.
dftest = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10],
'B':['A','B','A','B','A','B','A','B','B','B'],
'C':[0,0,1,1,2,2,3,3,4,4]})
dfn=dftest.groupby('B',group_keys=False)\
.apply(lambda grp:grp['C'].nlargest(int(grp['C'].count()*0.8))).sort_index()
the result get a series.
2 1
4 2
5 2
6 3
7 3
8 4
9 4
Name: C, dtype: int64
I hope the result is DataFrame, like
A B C
2 3 A 1
4 5 A 2
5 6 B 2
6 7 A 3
7 8 B 3
8 9 B 4
9 10 B 4
******update**************
sorry, the column 'A' in fact does not series integers, the dftest might be more like
dftest = pd.DataFrame({'A':['Feb','Flow','Air','Flow','Feb','Beta','Cat','Feb','Beta','Air'],
'B':['A','B','A','B','A','B','A','B','B','B'],
'C':[0,0,1,1,2,2,3,3,4,4]})
and the result should be
A B C
2 Air A 1
4 Feb A 2
5 Beta B 2
6 Cat A 3
7 Feb B 3
8 Beta B 4
9 Air B 4
It may be a bit clumsy, but it does what you asked:
dfn= dftest.groupby('B').apply(lambda
grp:grp['C'].nlargest(int(grp['C'].count()*0.8))).reset_index().rename(columns=
{'level_1':'A'})
dfn.A = dfn.A+1
dfn=dfn[['A','B','C']].sort_values(by='A')
Thanks to my friends, the follow code works for me.
dfn=dftest.groupby('B',group_keys=False)\
.apply(lambda grp:grp.nlargest(n=int(grp['C'].count()*0.8),columns='C').sort_index())
the dfn is
In [8]:dfn
Out[8]:
A B C
2 3 A 1
4 5 A 2
6 7 A 3
5 6 B 2
7 8 B 3
8 9 B 4
9 10 B 4
my previous code is deal with series, the later one is deal with DataFrame.
I did not figure out how to solve the following question!
consider the following data set:
df = pd.DataFrame(data=np.array([['a',1, 2, 3], ['a',4, 5, 6],
['b',7, 8, 9], ['b',10, 11 , 12]]),
columns=['id','A', 'B', 'C'])
id A B C
a 1 2 3
a 4 5 6
b 7 8 9
b 10 11 12
I need to group the data by id and in each group duplicate the first row and add it to the dataset like the following data set:
id A B C A B C
a 1 2 3 1 2 3
a 4 5 6 1 2 3
b 7 8 9 7 8 9
b 10 11 12 7 8 9
I really appreciate it for your help.
I did the following steps, however I could not expand it :
df1 = df.loc [0:0 , 'A' :'C']
df3 = pd.concat([df,df1],axis=1)
Use groupby + first, and then concatenate df with this result:
v = df.groupby('id').transform('first')
pd.concat([df, v], 1)
id A B C A B C
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9
cumcount + where+ffill
v=df.groupby('id').cumcount()==0
pd.concat([df,df.iloc[:,1:].where(v).ffill()],1)
Out[57]:
id A B C A B C
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9
One can also try drop_duplicates and merge.
df_unique = df.drop_duplicates("id")
df.merge(df_unique, on="id", how="left")
id A_x B_x C_x A_y B_y C_y
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9