I'm trying to update some columns of a dataframe where a condition is met (only some rows will meet the condition).
I'm using apply with loc. My function returns a pandas Series.
The problem is that the columns are updated with NaN.
Simplifying my problem, we can consider the following dataframe df_test:
col1 col2 col3 col4
0 A 1 1 2
1 B 2 1 2
2 A 3 1 2
3 B 4 1 2
I now want to update col3 and col4 where col1 == 'A'. For that I'll use the apply method:
df_test.loc[df_test['col1']=='A', ['col3', 'col4']] = df_test[df_test['col1']=='A'].apply(lambda row: pd.Series([10,20]), axis=1)
Doing that I get:
col1 col2 col3 col4
0 A 1 NaN NaN
1 B 2 1.0 2.0
2 A 3 NaN NaN
3 B 4 1.0 2.0
If, instead of pd.Series([10, 20]), I return np.array([10, 20]) or [10, 20], I get the following error:
ValueError: shape mismatch: value array of shape (2,2) could not be broadcast to indexing result of shape (2,)
What do I need to return to obtain
col1 col2 col3 col4
0 A 1 10 20
1 B 2 1 2
2 A 3 10 20
3 B 4 1 2
thanks!
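A possible explanation, offered as a sketch rather than a definitive answer: assignment through loc aligns the right-hand side on both row and column labels. pd.Series([10, 20]) produces columns labelled 0 and 1, which do not match col3 and col4, so the aligned values come out as NaN. Labelling the returned Series with the target column names, or stripping the labels with to_numpy(), should avoid this. The frame below is rebuilt from the question; the names mask, by_label and by_position are only illustrative.

```python
import pandas as pd

df_test = pd.DataFrame({
    'col1': ['A', 'B', 'A', 'B'],
    'col2': [1, 2, 3, 4],
    'col3': [1, 1, 1, 1],
    'col4': [2, 2, 2, 2],
})
mask = df_test['col1'] == 'A'

# Option 1: label the returned Series with the target column names,
# so the alignment done by loc finds matching columns.
by_label = df_test.copy()
by_label.loc[mask, ['col3', 'col4']] = by_label[mask].apply(
    lambda row: pd.Series([10, 20], index=['col3', 'col4']), axis=1
)

# Option 2: convert the result to a plain array, so the values are
# assigned by position and no label alignment takes place.
by_position = df_test.copy()
by_position.loc[mask, ['col3', 'col4']] = by_position[mask].apply(
    lambda row: pd.Series([10, 20]), axis=1
).to_numpy()

print(by_label)
```

Option 1 keeps the assignment safe if the column order ever changes; option 2 relies on the returned values being in the same order as the column list passed to loc.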
Related
I have a dataframe as such:
Col1 Col2 Col3.... Col64 Col1 Volume Col2 Volume....Col64 Volume.... Col1 Value Col2 Value...Col 64 Value
2 3 4 5 5 7 9 3 5
3 4 5 11 8 6 5 6 5
5 3 4 6 10 11 5 3 4
I want to multiply Col1 by Col1 Volume, divide by Col1 Value, and place the result in a new column called 'Col1 Result'.
Similarly, I want to multiply Col2 by Col2 Volume, divide by Col2 Value, and place the result in a new column called 'Col2 Result'.
I wish to do this for every row of those columns.
The output should look like this, and these columns should be appended to the existing dataframe.
Col1 Result Col2 Result
3.33 4.2
6 4.8
16.6 8.25
...
How can I perform this operation? It also has to be 1 to 1 multiplication, that is only the first row of Col1 should be multiplied with Col1 Volume and divided by first row of Col1 Value.
Doing it manually would take a lot of time.
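Use DataFrame.filter to select the columns ending in Volume and Value (the $ in the regex anchors the end of the string) and strip those suffixes so the names match the base columns. Then select the base columns from df using df1's columns, multiply by df1, divide by df2, add the ' Result' suffix with DataFrame.add_suffix, replace missing values with 0, and join the result to the original DataFrame: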
df1 = df.filter(regex='Volume$').rename(columns=lambda x: x.replace(' Volume',''))
df2 = df.filter(regex='Value$').rename(columns=lambda x: x.replace(' Value',''))
df = df.join(df[df1.columns].mul(df1).div(df2).add_suffix(' Result').fillna(0))
print (df)
Col1 Col2 Col3 Col64 Col1 Volume Col2 Volume Col64 Volume \
0 2 3 4 5 5 7 9
1 3 4 5 11 8 6 5
Col1 Value Col2 Value Col64 Value Col1 Result Col2 Result Col64 Result
0 3 5 7 3.333333 4.2 6.428571
1 6 5 7 4.000000 4.8 7.857143
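For anyone who wants to run the approach above end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the printed output; the names volume, value, result and out are illustrative, and the trailing fillna(0) is left out because this sample has no missing values.

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': [2, 3], 'Col2': [3, 4], 'Col3': [4, 5], 'Col64': [5, 11],
    'Col1 Volume': [5, 8], 'Col2 Volume': [7, 6], 'Col64 Volume': [9, 5],
    'Col1 Value': [3, 6], 'Col2 Value': [5, 5], 'Col64 Value': [7, 7],
})

# Select the "<name> Volume" / "<name> Value" columns and rename them to
# "<name>" so that the arithmetic below aligns on column labels.
volume = df.filter(regex=' Volume$').rename(columns=lambda c: c.replace(' Volume', ''))
value = df.filter(regex=' Value$').rename(columns=lambda c: c.replace(' Value', ''))

# Element-wise: base * volume / value, row by row and column by column.
result = df[volume.columns].mul(volume).div(value).add_suffix(' Result')
out = df.join(result)

print(out.filter(regex=' Result$'))
```

Col3 has no matching Volume or Value column in this sample, so it gets no Result column.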
I have two dataframes, first one is:
col1 col2 col3
1 14 2 6
2 12 3 3
3 9 4 2
Second one is:
col4 col5 col6
2 14 2 6
3 12 3 3
I want to concatenate them and get the index values from second one and row values from the first one.
The result will be like this:
col1 col2 col3
2 12 3 3
3 9 4 2
My solution was pd.concat([df2, df1], axis=1).drop(df2, axis=1) but I believe there is a more efficient way to do this.
You can pass the index of df2 to loc on df1:
df1.loc[df2.index]
Output:
col1 col2 col3
2 12 3 3
3 9 4 2
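One caveat worth sketching: loc with a list-like of labels raises a KeyError when some label is missing from df1. If that can happen in your data (an assumption, the question does not say), intersecting the two indexes first is one way to stay safe. The frames are rebuilt from the question; the names result and safe are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'col1': [14, 12, 9], 'col2': [2, 3, 4], 'col3': [6, 3, 2]},
    index=[1, 2, 3],
)
df2 = pd.DataFrame(
    {'col4': [14, 12], 'col5': [2, 3], 'col6': [6, 3]},
    index=[2, 3],
)

# Direct lookup: every label in df2.index must exist in df1.
result = df1.loc[df2.index]

# Tolerant variant: keep only the labels that both frames share.
safe = df1.loc[df1.index.intersection(df2.index)]

print(result)
```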
Hello, I have a df and I wonder how I can subset the rows where:
COL1 contains a string "ok"
COL2 > 4
COL3 < 4
Here is an example:
COL1 COL2 COL3
AB_ok_7 5 2
AB_ok_4 2 5
AB_uy_2 5 2
AB_ok_2 2 2
U_ok_7 12 3
It should display only:
COL1 COL2 COL3
AB_ok_7 5 2
U_ok_7 12 3
Like this:
In [2288]: df[df['COL1'].str.contains('ok') & df['COL2'].gt(4) & df['COL3'].lt(4)]
Out[2288]:
COL1 COL2 COL3
0 AB_ok_7 5 2
4 U_ok_7 12 3
You can use boolean indexing and chain all the conditions:
m = df['COL1'].str.contains('ok')
m1 = df['COL2'].gt(4)
m2 = df['COL3'].lt(4)
df[m & m1 & m2]
COL1 COL2 COL3
0 AB_ok_7 5 2
4 U_ok_7 12 3
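A self-contained sketch of the same mask, with two optional arguments to str.contains that may or may not suit your data: na=False treats missing strings as non-matching instead of propagating NaN into the mask, and regex=False matches 'ok' as a literal substring. The frame is rebuilt from the question; mask and subset are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'COL1': ['AB_ok_7', 'AB_ok_4', 'AB_uy_2', 'AB_ok_2', 'U_ok_7'],
    'COL2': [5, 2, 5, 2, 12],
    'COL3': [2, 5, 2, 2, 3],
})

# Each condition yields a boolean Series; & combines them element-wise.
# The parentheses around the comparisons are required because & binds
# more tightly than > and <.
mask = (
    df['COL1'].str.contains('ok', na=False, regex=False)
    & (df['COL2'] > 4)
    & (df['COL3'] < 4)
)
subset = df[mask]

print(subset)
```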
I currently have a dataframe which looks like this:
col1 col2 col3
1 2 3
2 3 NaN
3 4 NaN
2 NaN NaN
0 2 NaN
What I want to do is apply a condition to the column values and return the final result in a new column.
The condition is to pick, for each row, the value that comes first in this priority order, where 2 has the highest priority: [2,1,3,0,4]
I tried to define a function to build the final result but wasn't really getting anywhere. Any thoughts?
The desired outcome would look something like:
col1 col2 col3 col4
1 2 3 2
2 3 NaN 2
3 4 NaN 3
2 NaN NaN 2
0 2 NaN 2
where col4 is the new column created.
Thanks
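First, you may want to get rid of the NaNs (fillna returns a new DataFrame, so assign the result back):
df = df.fillna(5)
and then apply a function to every row to find your value:
def func(x, l=(2, 1, 3, 0, 4, 5)):
    # return the first value of the priority list that occurs in the row
    for j in l:
        if j in x:
            return j
df['new'] = df.apply(lambda x: func(list(x)), axis=1)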
Output:
col1 col2 col3 new
0 1 2 3 2
1 2 3 5 2
2 3 4 5 3
3 2 5 5 2
4 0 2 5 2
Maybe a little late, but:
import numpy as np
def f(x):
    for i in [2,1,3,0,4]:
        if i in x.tolist():
            return i
    return np.nan
df["col4"] = df.apply(f, axis=1)
and the Output:
col1 col2 col3 col4
0 1 2.0 3.0 2
1 2 3.0 NaN 2
2 3 4.0 NaN 3
3 2 NaN NaN 2
4 0 2.0 NaN 2
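Both answers above work row by row. As an alternative sketch (not taken from either answer), the same result can be built one priority value at a time: walk the priority list from lowest to highest and overwrite col4 wherever the row contains that value, so the highest priority is written last and wins. The frame is rebuilt from the question; priority, col4 and result are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 2, 0],
    'col2': [2, 3, 4, np.nan, 2],
    'col3': [3, np.nan, np.nan, np.nan, np.nan],
})
priority = [2, 1, 3, 0, 4]  # highest priority first

# Start with NaN everywhere; rows containing none of the values stay NaN.
col4 = pd.Series(np.nan, index=df.index)

# Lowest priority first, so later (higher-priority) writes overwrite
# earlier ones. df.eq(value) is False for NaN cells.
for value in reversed(priority):
    col4.loc[df.eq(value).any(axis=1)] = value

result = df.assign(col4=col4)
print(result)
```

The loop runs once per priority value rather than once per row, which may matter on large frames; col4 comes out as a float column because it starts from NaN.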
What's the syntax for combining mean and a min on a dataframe? I want to group by 2 columns, calculate the mean within a group for col3 and keep the min value of col4. Would something like
groupeddf = nongrouped.groupby(['col1', 'col2', 'col3'], as_index=False).mean().min('col4')
work? If not, what's the correct syntax? Thank you!
EDIT
Okay, so the question wasn't quite clear without an example. I'll update it now. Also changes in text above.
I have:
ungrouped
col1 col2 col3 col4
1 2 3 4
1 2 4 1
2 4 2 1
2 4 1 3
2 3 1 3
Wanted output is grouped by columns 1-2, mean for column 3 (and actually some more columns on the data, this is simplified) and the minimum of col4:
grouped
col1 col2 col3 col4
1 2 3.5 1
2 4 1.5 1
2 3 1 3
I think you first need the mean and then the min of column col4:
min_val = nongrouped.groupby(['col1', 'col2', 'col3'], as_index=False).mean()['col4'].min()
or min of Series:
min_val = nongrouped.groupby(['col1', 'col2', 'col3'])['col4'].mean().min()
Sample:
nongrouped = pd.DataFrame({'col1':[1,1,3],
'col2':[1,1,6],
'col3':[1,1,9],
'col4':[1,3,5]})
print (nongrouped)
col1 col2 col3 col4
0 1 1 1 1
1 1 1 1 3
2 3 6 9 5
print (nongrouped.groupby(['col1', 'col2', 'col3'])['col4'].mean())
1 1 1 2
3 6 9 5
Name: col4, dtype: int64
min_val = nongrouped.groupby(['col1', 'col2', 'col3'])['col4'].mean().min()
print (min_val)
2
EDIT:
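You need agg to aggregate each column differently (the chain is wrapped in parentheses so it can span several lines):
groupeddf = (nongrouped.groupby(['col1', 'col2'], sort=False)
                       .agg({'col3': 'mean', 'col4': 'min'})
                       .reset_index()
                       .reindex(columns=nongrouped.columns))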
print (groupeddf)
col1 col2 col3 col4
0 1 2 3.5 1
1 2 4 1.5 1
2 2 3 1.0 3
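The same aggregation can also be written with named aggregation, assuming a pandas version that supports it (0.25 or later). This is only an alternative spelling, sketched on the data from the question; with as_index=False the group keys stay as regular columns, so no reset_index is needed.

```python
import pandas as pd

nongrouped = pd.DataFrame({
    'col1': [1, 1, 2, 2, 2],
    'col2': [2, 2, 4, 4, 3],
    'col3': [3, 4, 2, 1, 1],
    'col4': [4, 1, 1, 3, 3],
})

# Each keyword names an output column and maps it to
# (input column, aggregation function).
groupeddf = nongrouped.groupby(['col1', 'col2'], sort=False, as_index=False).agg(
    col3=('col3', 'mean'),
    col4=('col4', 'min'),
)

print(groupeddf)
```

sort=False keeps the groups in order of first appearance, which matches the wanted output in the question.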