I have a pandas DataFrame as shown below. I want to identify the index values of the columns in df that match a given string (more specifically, a string that matches the column names after 'sim-' or 'act-').
# Sample df
import pandas as pd
df = pd.DataFrame({
'sim-prod1': [1, 1.4],
'sim-prod2': [2, 2.1],
'act-prod1': [1.1, 1],
'act-prod2': [2.5, 2]
})
# Get unique prod values from df.columns
prods = pd.Series(df.columns[1:]).str[4:].unique()
prods
array(['prod2', 'prod1'], dtype=object)
I now want to loop through prods and identify the columns where prod1 and prod2 occur, and then use those columns to create new dataframes. How can I do this? In R I could use the which function to do this easily. Example dataframes I want to obtain are below.
df_prod1
sim-prod1 act-prod1
0 1.0 1.1
1 1.4 1.0
df_prod2
sim-prod2 act-prod2
0 2.0 2.5
1 2.1 2.0
Try groupby with axis=1, grouping on the column-name suffix (note that the axis keyword of groupby is deprecated in recent pandas versions):
for prod, d in df.groupby(df.columns.str[4:], axis=1):
    print(f'this is {prod}')
    print(d)
    print('='*20)
Output:
this is prod1
sim-prod1 act-prod1
0 1.0 1.1
1 1.4 1.0
====================
this is prod2
sim-prod2 act-prod2
0 2.0 2.5
1 2.1 2.0
====================
Now, to have them as variables:
dfs = {prod: d for prod, d in df.groupby(df.columns.str[4:], axis=1)}
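Since the axis=1 keyword of groupby is deprecated in recent pandas, here is a minimal sketch (assuming the same sample df) that builds the same dictionary with a boolean mask over the columns instead:

```python
import pandas as pd

df = pd.DataFrame({
    'sim-prod1': [1, 1.4],
    'sim-prod2': [2, 2.1],
    'act-prod1': [1.1, 1],
    'act-prod2': [2.5, 2],
})

# slice off the 'sim-'/'act-' prefix to get the product suffixes
suffixes = df.columns.str[4:]

# one sub-frame per product, selected with a boolean mask over the columns;
# no groupby(axis=1) needed, so this also works on pandas >= 2.0
dfs = {suffix: df.loc[:, suffixes == suffix] for suffix in suffixes.unique()}
```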
Try this, storing the parts of the dataframe as a dictionary:
df_dict = dict(tuple(df.groupby(df.columns.str[4:], axis=1)))
print(df_dict['prod1'])
print('\n')
print(df_dict['prod2'])
Output:
sim-prod1 act-prod1
0 1.0 1.1
1 1.4 1.0
sim-prod2 act-prod2
0 2.0 2.5
1 2.1 2.0
You can also do this without groupby() and a for loop:
df_prod1 = df[df.columns[df.columns.str.contains('prod1')]]
df_prod2 = df[df.columns[df.columns.str.contains('prod2')]]
(Using the literal suffixes avoids relying on the order of the prods array.)
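For a fixed suffix, df.filter can express the same selection even more compactly; a small sketch with the sample df, using filter(like=...) to keep every column whose name contains the substring:

```python
import pandas as pd

df = pd.DataFrame({
    'sim-prod1': [1, 1.4],
    'sim-prod2': [2, 2.1],
    'act-prod1': [1.1, 1],
    'act-prod2': [2.5, 2],
})

# filter(like=...) keeps every column whose name contains the substring
df_prod1 = df.filter(like='prod1')
df_prod2 = df.filter(like='prod2')
```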
I have two dataframes like below,
import numpy as np
import pandas as pd
df1 = pd.DataFrame({1: np.zeros(5), 2: np.zeros(5)}, index=['a','b','c','d','e'])
and
df2 = pd.DataFrame({'category': [1,1,2,2], 'value':[85,46, 39, 22]}, index=[0, 1, 3, 4])
The values from the second dataframe need to be assigned into the first dataframe so that the index and column relationship is maintained. The second dataframe's index is positional (iloc-based), and its 'category' column actually contains the column names of the first dataframe; 'value' holds the values to be assigned.
Following is my solution, which produces the expected output:
for _category in df2['category'].unique():
    df1.loc[df1.iloc[df2[df2['category'] == _category].index.tolist()].index, _category] = df2[df2['category'] == _category]['value'].values
Is there a pythonic way of doing so without the for loop?
One option is to pivot and update:
df3 = df1.reset_index()
df3.update(df2.pivot(columns='category', values='value'))
df3 = df3.set_index('index').rename_axis(None)
Alternatively, reindex df2 (in two steps: numerically, then by label) and combine_first with df1:
df3 = (df2
.pivot(columns='category', values='value')
.reindex(range(max(df2.index)+1))
.set_axis(df1.index)
.combine_first(df1)
)
output:
1 2
a 85.0 0.0
b 46.0 0.0
c 0.0 0.0
d 0.0 39.0
e 0.0 22.0
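The pivot-and-update idea can also be written as a runnable sketch (same sample frames), translating the positional index to df1's labels explicitly before updating:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({1: np.zeros(5), 2: np.zeros(5)}, index=['a', 'b', 'c', 'd', 'e'])
df2 = pd.DataFrame({'category': [1, 1, 2, 2], 'value': [85, 46, 39, 22]},
                   index=[0, 1, 3, 4])

# pivot df2 so each category becomes a column ...
wide = df2.pivot(columns='category', values='value')
# ... translate its positional (iloc-based) index to df1's labels ...
wide.index = df1.index[wide.index]
# ... and let update fill the matching cells of df1 in place
df1.update(wide)
```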
Here's one way: replace the 0s in df1 with NaN, pivot df2, and fill in the NaNs in df1 from it:
out = (df1.replace(0, pd.NA).reset_index()
.fillna(df2.pivot(columns='category', values='value'))
.set_index('index').rename_axis(None).fillna(0))
Output:
1 2
a 85.0 0.0
b 46.0 0.0
c 0.0 0.0
d 0.0 39.0
e 0.0 22.0
Let's say I have a dataframe as shown.
I now have a list like [6, 7, 6]. How do I fill these values into my 3 desired columns, i.e. ['One', 'Two', 'Four'], of the dataframe? Notice that I have not given column 'Three'. The final dataframe should have the new row appended, with NaN under 'Three'.
You can append a Series (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, where pd.concat is the replacement):
df = pd.DataFrame([[2, 4, 4, 8]],
                  columns=['One', 'Two', 'Three', 'Four'])
values = [6, 7, 6]
lst = ['One', 'Two', 'Four']
df = df.append(pd.Series(values, index=lst), ignore_index=True)
or a dict:
df = df.append(dict(zip(lst, values)), ignore_index=True)
output:
One Two Three Four
0 2.0 4.0 4.0 8.0
1 6.0 7.0 NaN 6.0
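On pandas 2.0 and later, where DataFrame.append no longer exists, the same row append can be sketched with pd.concat, which aligns on column names and leaves the missing 'Three' as NaN:

```python
import pandas as pd

df = pd.DataFrame([[2, 4, 4, 8]], columns=['One', 'Two', 'Three', 'Four'])

# wrap the new row in a one-row DataFrame; concat aligns on column names,
# so the absent 'Three' column becomes NaN in the appended row
new_row = pd.DataFrame([dict(zip(['One', 'Two', 'Four'], [6, 7, 6]))])
df = pd.concat([df, new_row], ignore_index=True)
```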
You could also assign the values directly:
columnstobefilled = ["One", "Two", "Four"]
elementsfill = [6, 7, 6]
for column, element in zip(columnstobefilled, elementsfill):
    df[column] = element
Note that this sets each whole column to the given value; it does not append a new row, so it only fits if you want to overwrite the columns.
Since you want the list values to be in specific places, you have to specify where each value should go. One way to do this is with a key-value pair object, a dictionary. Once you create it, you can use append to include it as a row in your dataframe (on pandas >= 2.0, where append has been removed, use pd.concat instead):
d = {'one': 6, 'Two': 7, 'Four': 6}
df.append(d, ignore_index=True)
one Two Three Four
0 2.0 4.0 4.0 8.0
1 6.0 7.0 NaN 6.0
Dataset:
df = pd.DataFrame({'one':2,'Two':4,'Three':4,'Four':8},
index=[0])
import pandas as pd
df = pd.DataFrame({'One': 2, 'Two': 4, 'Three': 4, 'Four': 8}, index=[0])
new_row = {'One': 6, 'Two': 7, 'Three': None, 'Four': 6}
df = df.append(new_row, ignore_index=True)  # append returns a new frame, so assign it back
print(df)
output:
One Two Three Four
0 2.0 4.0 4.0 8.0
1 6.0 7.0 NaN 6.0
I have the below dataframes (df1 and df2) in pandas. What I want to achieve is to multiply df1 by df2, matching on column header, to create df3. The expected results are:
df3 = pd.DataFrame([{'A':2,'B':2.2,'C':20},
{'A':2.5,'B':2.8,'C':24},
{'A':3.0,'B':2.8,'C':24.8}])
I've tried df3 = df1.mul(df2, axis=1) but it does not work: it produces a lot of NaN and 2 extra columns. Can anyone share some hints?
df1 = pd.DataFrame([{'A':20,'B':22,'C':25},
{'A':25,'B':28,'C':30},
{'A':30,'B':28,'C':31}])
df2 = pd.DataFrame([{'X':'A','Y':0.1},
{'X':'B','Y':0.1},
{'X':'C','Y':0.8}])
I changed df2 to s2 -- is this what you're looking for?
df1 = pd.DataFrame([{'A':20,'B':22,'C':25},
{'A':25,'B':28,'C':30},
{'A':30,'B':28,'C':31}])
s2 = pd.Series(data=[0.1, 0.1, 0.8],
index=['A', 'B', 'C'])
df1.mul(s2)
The result is:
A B C
0 2.0 2.2 20.0
1 2.5 2.8 24.0
2 3.0 2.8 24.8
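The conversion from df2 to that Series doesn't have to be done by hand; a sketch with the same frames, building it via set_index so the factors stay keyed by column name:

```python
import pandas as pd

df1 = pd.DataFrame([{'A': 20, 'B': 22, 'C': 25},
                    {'A': 25, 'B': 28, 'C': 30},
                    {'A': 30, 'B': 28, 'C': 31}])
df2 = pd.DataFrame([{'X': 'A', 'Y': 0.1},
                    {'X': 'B', 'Y': 0.1},
                    {'X': 'C', 'Y': 0.8}])

# turn df2 into a Series indexed by column name, then multiply column-wise;
# mul(..., axis=1) aligns the Series index with df1's columns
factors = df2.set_index('X')['Y']
df3 = df1.mul(factors, axis=1)
```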
Get the columns to align on the index, multiply, and unstack to get back your result:
df1.stack().mul(df2.set_index("X").Y, level=-1).unstack()
A B C
0 2.0 2.2 20.0
1 2.5 2.8 24.0
2 3.0 2.8 24.8
Note: this also works for more rows (e.g. the 50 you mentioned in the comments).
I would like to fill my first dataframe with data from the second dataframe. Since I don't need any special condition, the combine_first function looks like the right choice for me.
Unfortunately, when I try to combine the two dataframes, the result is still the original dataframe.
My code:
import pandas as pd
df1 = pd.DataFrame({'Gen1': [5, None, 3, 2, 1],
'Gen2': [1, 2, None, 4, 5]})
df2 = pd.DataFrame({'Gen1': [None, 4, None, None, None],
'Gen2': [None, None, 3, None, None]})
df1.combine_first(df2)
Then, when I print(df1), I get df1 exactly as I initialized it above.
Where did I make a mistake?
This works fine if you assign the output back; the very similar method DataFrame.update works in place:
df = df1.combine_first(df2)
print (df)
Gen1 Gen2
0 5.0 1.0
1 4.0 2.0
2 3.0 3.0
3 2.0 4.0
4 1.0 5.0
df1.update(df2)
print (df1)
Gen1 Gen2
0 5.0 1.0
1 4.0 2.0
2 3.0 3.0
3 2.0 4.0
4 1.0 5.0
combine_first returns a new dataframe containing the changes rather than updating the existing one, so you should assign the returned dataframe:
df1 = df1.combine_first(df2)
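As a runnable sketch of the fix with the question's own frames, capturing the return value of combine_first:

```python
import pandas as pd

df1 = pd.DataFrame({'Gen1': [5, None, 3, 2, 1],
                    'Gen2': [1, 2, None, 4, 5]})
df2 = pd.DataFrame({'Gen1': [None, 4, None, None, None],
                    'Gen2': [None, None, 3, None, None]})

# combine_first is not in place: it returns a new frame, so assign it back
df1 = df1.combine_first(df2)
```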
I have a pandas dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,5,3],
'B': [4,2,6]})
df['avg'] = df.mean(axis=1)
df[df<df['avg']]
I would like to keep all the values in the dataframe that are below the average value in column df['avg']. When I perform the operation below, I get back all NaNs.
df[df<df['avg']]
If I set up a for loop, I can get the boolean mask of what I want:
col_names = ['A', 'B']
for colname in col_names:
    df[colname] = df[colname] < df['avg']
What I am searching for would look like this:
df_desired = pd.DataFrame({
'A':[1,np.nan,3],
'B':[np.nan,2,np.nan],
'avg' :[2.5, 3.5, 4.5]
})
How do I do this? There has to be a pythonic way to do this.
You can use .mask(..) [pandas-doc] here. We can use NumPy's broadcasting to generate a boolean array marking the values that are higher than the given row average:
>>> df.mask(df.values > df['avg'].values[:,None])
A B avg
0 1.0 NaN 2.5
1 NaN 2.0 3.5
2 3.0 NaN 4.5
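The broadcasting can also be left to pandas itself with gt(..., axis=0), which compares each row against its own average by index alignment, with no NumPy conversion at all; a sketch with the sample data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 3], 'B': [4, 2, 6]})
df['avg'] = df.mean(axis=1)

# gt(df['avg'], axis=0) compares every column against the row average,
# aligned on the index; mask replaces the True cells with NaN.
# The avg column compares equal to itself, so it is left untouched.
out = df.mask(df.gt(df['avg'], axis=0))
```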
I think this is somewhat more idiomatic, and clearer, than the accepted solution:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 5, 3],
'B': [4, 2, 6]})
print(df)
df['avg'] = df.mean(axis=1)
print(df)
df[df[['A', 'B']].ge(df['avg'], axis=0)] = np.nan
print(df)
Output:
A B
0 1 4
1 5 2
2 3 6
A B avg
0 1 4 2.5
1 5 2 3.5
2 3 6 4.5
A B avg
0 1.0 NaN 2.5
1 NaN 2.0 3.5
2 3.0 NaN 4.5
Speaking of the accepted solution, it is no longer recommended to use .values to convert a pandas DataFrame or Series to a NumPy array; .to_numpy() is the preferred method. Also note that two-dimensional indexing of a Series, as in df['avg'][:, np.newaxis], no longer works in modern pandas, so:
df.mask(df > df['avg'].to_numpy()[:, np.newaxis])