I have the dataframes below (df1 and df2) in pandas:
df1 = pd.DataFrame([{'A': 20, 'B': 22, 'C': 25},
                    {'A': 25, 'B': 28, 'C': 30},
                    {'A': 30, 'B': 28, 'C': 31}])
df2 = pd.DataFrame([{'X': 'A', 'Y': 0.1},
                    {'X': 'B', 'Y': 0.1},
                    {'X': 'C', 'Y': 0.8}])
What I want to achieve is to multiply df1 by df2, matching each column header of df1 against df2's X values, and create df3. The expected result is:
df3 = pd.DataFrame([{'A': 2, 'B': 2.2, 'C': 20},
                    {'A': 2.5, 'B': 2.8, 'C': 24},
                    {'A': 3.0, 'B': 2.8, 'C': 24.8}])
I've tried df3 = df1.mul(df2, axis=1), but it is not working: it produces a lot of NaNs and two extra columns. Can anyone share some hints?
I changed df2 to a Series, s2 -- is this what you're looking for? (df1.mul(df2) aligns on both index and columns, and df2's columns are X and Y, not A/B/C, hence the NaNs and the two extra columns.)
df1 = pd.DataFrame([{'A': 20, 'B': 22, 'C': 25},
                    {'A': 25, 'B': 28, 'C': 30},
                    {'A': 30, 'B': 28, 'C': 31}])
s2 = pd.Series(data=[0.1, 0.1, 0.8],
               index=['A', 'B', 'C'])
df1.mul(s2)
The result is:
A B C
0 2.0 2.2 20.0
1 2.5 2.8 24.0
2 3.0 2.8 24.8
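If you'd rather not hardcode the weights, a small sketch (assuming df2 as posted in the question) that builds the same Series directly from df2:
s2 = df2.set_index('X')['Y']  # index A/B/C, values 0.1/0.1/0.8
df1.mul(s2, axis=1)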
Get the columns to align on the index, multiply, and unstack to get back your result
df1.stack().mul(df2.set_index("X").Y, level=-1).unstack()
A B C
0 2.0 2.2 20.0
1 2.5 2.8 24.0
2 3.0 2.8 24.8
Note: this also works for more rows (the 50 you mentioned in the comments).
I have two dataframes like below,
import numpy as np
import pandas as pd
df1 = pd.DataFrame({1: np.zeros(5), 2: np.zeros(5)}, index=['a','b','c','d','e'])
and
df2 = pd.DataFrame({'category': [1,1,2,2], 'value':[85,46, 39, 22]}, index=[0, 1, 3, 4])
The values from the second dataframe are required to be assigned in the first dataframe such that the index and column relationship is maintained. The second dataframe's index is iloc-based, and its category column actually contains column names of the first dataframe; value is the value to be assigned.
Following is my solution, which gives the expected output:
for _category in df2['category'].unique():
    rows = df2[df2['category'] == _category]
    df1.loc[df1.iloc[rows.index.tolist()].index, _category] = rows['value'].values
Is there a pythonic way of doing so without the for loop?
One option is to pivot and update:
df3 = df1.reset_index()                                    # move labels aside so the index is positional, like df2's
df3.update(df2.pivot(columns='category', values='value'))  # update aligns on index and columns
df3 = df3.set_index('index').rename_axis(None)             # restore the original labels
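For reference, here is what that pivot step produces (using the question's df2); the category values become columns, aligned on df2's original index:
df2.pivot(columns='category', values='value')
category     1     2
0         85.0   NaN
1         46.0   NaN
3          NaN  39.0
4          NaN  22.0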
Alternatively, reindex df2 (in two steps: numerically, then by label) and combine_first with df1:
df3 = (df2
       .pivot(columns='category', values='value')
       .reindex(range(max(df2.index) + 1))
       .set_axis(df1.index)
       .combine_first(df1)
      )
output:
1 2
a 85.0 0.0
b 46.0 0.0
c 0.0 0.0
d 0.0 39.0
e 0.0 22.0
Here's one way: replace the 0s in df1 with NaN, pivot df2, and fill the NaNs in df1 from df2 (note this assumes 0 never occurs as real data in df1):
out = (df1.replace(0, pd.NA).reset_index()
          .fillna(df2.pivot(columns='category', values='value'))
          .set_index('index').rename_axis(None).fillna(0))
Output:
1 2
a 85.0 0.0
b 46.0 0.0
c 0.0 0.0
d 0.0 39.0
e 0.0 22.0
Hello, I have the following dataframe:
df = pd.DataFrame(data={'grade_1': ['A', 'B', 'C'],
                        'grade_1_count': [19, 28, 32],
                        'grade_2': ['pass', 'fail', np.nan],
                        'grade_2_count': [39, 18, np.nan]})
whereby some grades are missing and need to be inserted into the grade_n columns according to the values in this dictionary:
grade_dict = {'grade_1': ['A', 'B', 'C', 'D', 'E', 'F'],
              'grade_2': ['pass', 'fail', 'not present', 'borderline']}
and the corresponding row value in the _count column should be filled with 0 (np.nan only where the grade itself is missing),
so the expected output is like this:
expected_df = pd.DataFrame(data={'grade_1': ['A', 'B', 'C', 'D', 'E', 'F'],
                                 'grade_1_count': [19, 28, 32, 0, 0, 0],
                                 'grade_2': ['pass', 'fail', 'not present', 'borderline', np.nan, np.nan],
                                 'grade_2_count': [39, 18, 0, 0, np.nan, np.nan]})
So far I have this rather inelegant code that creates a column including all the correct categories for the grades, but I cannot reinsert it into the dataframe or fill the count columns with zeros (the np.nans just reflect empty cells due to coercing columns with different row lengths). I hope that makes sense. Any advice would be great. Thanks.
x = []
for k, v in grade_dict.items():
    out = df[k].reindex(grade_dict[k], axis=0, fill_value=0)
    x = pd.concat([out], axis=1)
    x[k] = x.index
    x = x.reset_index(drop=True)
    df[k] = x.fillna(np.nan)
Here is a solution using two consecutive merges:
# set up combinations
from itertools import zip_longest
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)
# merge
(df2.merge(df.filter(like='grade_1'),
           on='grade_1', how='left')
    .merge(df.filter(like='grade_2'),
           on='grade_2', how='left')
    .sort_index(axis=1)
)
output:
grade_1 grade_1_count grade_2 grade_2_count
0 A 19.0 pass 39.0
1 B 28.0 fail 18.0
2 C 32.0 not present NaN
3 D NaN borderline NaN
4 E NaN None NaN
5 F NaN None NaN
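The merge leaves NaN in the count columns. To reproduce the asker's expected output exactly (zero counts for grades added from the dict, NaN only where the grade itself is missing), a small follow-up sketch, assuming the merged result above is assigned to out:
for k in grade_dict:
    has_grade = out[k].notna()  # rows where this grade column has a value
    out.loc[has_grade, f'{k}_count'] = out.loc[has_grade, f'{k}_count'].fillna(0)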
For an arbitrary number of grade columns, loop over the merges:
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)
for col in grade_dict:
    df2 = df2.merge(df.filter(like=col),
                    on=col, how='left')
df2
If you only need to merge on grade_1 without updating the non-NaN values of grade_2, you can cast grade_dict into a DataFrame and then use combine_first:
print(df.set_index("grade_1")
        .combine_first(pd.DataFrame(grade_dict.values(),
                                    index=grade_dict.keys()).T.set_index("grade_1"))
        .fillna({"grade_1_count": 0})
        .reset_index())
grade_1 grade_1_count grade_2 grade_2_count
0 A 19.0 pass 39.0
1 B 28.0 fail 18.0
2 C 32.0 not present NaN
3 D 0.0 borderline NaN
4 E 0.0 None NaN
5 F 0.0 None NaN
I have a pandas DataFrame as shown below. I want to identify the index values of the columns in df that match a given string (more specifically, a string that matches the column names after 'sim-' or 'act-').
# Sample df
import pandas as pd
df = pd.DataFrame({
    'sim-prod1': [1, 1.4],
    'sim-prod2': [2, 2.1],
    'act-prod1': [1.1, 1],
    'act-prod2': [2.5, 2]
})
# Get unique prod values from df.columns
prods = pd.Series(df.columns[1:]).str[4:].unique()
prods
array(['prod2', 'prod1'], dtype=object)
I now want to loop through prods and identify the columns where prod1 and prod2 occur, and then use those columns to create new dataframes. How can I do this? In R I could use the which function to do this easily. Example dataframes I want to obtain are below.
df_prod1
   sim-prod1  act-prod1
0        1.0        1.1
1        1.4        1.0
df_prod2
   sim-prod2  act-prod2
0        2.0        2.5
1        2.1        2.0
Try groupby with axis=1 (note the slice needs the last five characters, str[-5:], to capture the full product name):
for prod, d in df.groupby(df.columns.str[-5:], axis=1):
    print(f'this is {prod}')
    print(d)
    print('='*20)
Output:
this is prod1
sim-prod1 act-prod1
0 1.0 1.1
1 1.4 1.0
====================
this is prod2
sim-prod2 act-prod2
0 2.0 2.5
1 2.1 2.0
====================
Now, to have them as variables:
dfs = {prod: d for prod, d in df.groupby(df.columns.str[-5:], axis=1)}
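Note that groupby(..., axis=1) is deprecated in recent pandas versions (2.1+). A minimal sketch of an equivalent without it, selecting the columns for each suffix directly:
suffixes = df.columns.str[-5:]          # 'prod1', 'prod2', ...
dfs = {prod: df.loc[:, suffixes == prod] for prod in suffixes.unique()}
dfs['prod1']
   sim-prod1  act-prod1
0        1.0        1.1
1        1.4        1.0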
Try this, storing the parts of the dataframe as a dictionary:
df_dict = dict(tuple(df.groupby(df.columns.str[4:], axis=1)))
print(df_dict['prod1'])
print('\n')
print(df_dict['prod2'])
Output:
sim-prod1 act-prod1
0 1.0 1.1
1 1.4 1.0
sim-prod2 act-prod2
0 2.0 2.5
1 2.1 2.0
You can also do this without groupby() or a for loop:
df_prod2 = df[df.columns[df.columns.str.contains(prods[0])]]
df_prod1 = df[df.columns[df.columns.str.contains(prods[1])]]
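One caveat (my addition, not from the original answer): str.contains interprets its pattern as a regular expression by default, which can misfire if a product name ever contains regex metacharacters. str.endswith does a plain-string check and also generalizes to any number of products:
df_dict = {p: df.loc[:, df.columns.str.endswith(p)] for p in prods}
df_dict['prod1']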
I would like to fill my first dataframe with data from the second dataframe. Since I don't need any special condition, the combine_first function looks like the right choice for me.
Unfortunately, when I try to combine the two dataframes, the result is still the original dataframe.
My code:
import pandas as pd
df1 = pd.DataFrame({'Gen1': [5, None, 3, 2, 1],
                    'Gen2': [1, 2, None, 4, 5]})
df2 = pd.DataFrame({'Gen1': [None, 4, None, None, None],
                    'Gen2': [None, None, 3, None, None]})
df1.combine_first(df2)
Then, when I print(df1), I get df1 exactly as I initialized it above.
Where did I make a mistake?
This works fine if you assign the output back; the very similar method DataFrame.update works in place (it only overwrites where df2 has non-NA values, which is why df2's NaNs don't clobber df1):
df = df1.combine_first(df2)
print (df)
Gen1 Gen2
0 5.0 1.0
1 4.0 2.0
2 3.0 3.0
3 2.0 4.0
4 1.0 5.0
df1.update(df2)
print (df1)
Gen1 Gen2
0 5.0 1.0
1 4.0 2.0
2 3.0 3.0
3 2.0 4.0
4 1.0 5.0
combine_first returns a new dataframe containing the changes rather than updating the existing one, so you should assign the returned dataframe:
df1 = df1.combine_first(df2)
I have a pandas dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 5, 3],
                   'B': [4, 2, 6]})
df['avg'] = df.mean(axis=1)
df[df<df['avg']]
I would like to keep all the values in the dataframe that are below the average value in column df['avg']. When I perform the operation below, I am returned all NaNs:
df[df<df['avg']]
If I set up a for loop, I can get the boolean of what I want:
col_names = ['A', 'B']
for colname in col_names:
    df[colname] = df[colname] < df['avg']
What I am searching for would look like this:
df_desired = pd.DataFrame({
    'A': [1, np.nan, 3],
    'B': [np.nan, 2, np.nan],
    'avg': [2.5, 3.5, 4.5]
})
How do I do this? There has to be a pythonic way.
You can use .mask(..) [pandas-doc] here. We can use numpy broadcasting to generate an array of booleans marking the values that are higher than the given average. (Your df < df['avg'] failed because a DataFrame-vs-Series comparison aligns the Series' index with the frame's column labels, so nothing matches and everything compares False.)
>>> df.mask(df.values > df['avg'].values[:,None])
A B avg
0 1.0 NaN 2.5
1 NaN 2.0 3.5
2 3.0 NaN 4.5
I think this is somewhat more idiomatic, and clearer, than the accepted solution:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 5, 3],
                   'B': [4, 2, 6]})
print(df)
df['avg'] = df.mean(axis=1)
print(df)
df[df[['A', 'B']].ge(df['avg'], axis=0)] = np.nan
print(df)
Output:
A B
0 1 4
1 5 2
2 3 6
A B avg
0 1 4 2.5
1 5 2 3.5
2 3 6 4.5
A B avg
0 1.0 NaN 2.5
1 NaN 2.0 3.5
2 3.0 NaN 4.5
Speaking of the accepted solution, it is no longer recommended to use .values to convert a pandas DataFrame or Series to a NumPy array. Fortunately, we don't actually need any conversion at all here; the axis-aware comparison methods broadcast the Series down the rows for us:
df.mask(df.gt(df['avg'], axis=0))