I have a dataframe as such:
Col1 Col2 Col3.... Col64 Col1 Volume Col2 Volume....Col64 Volume.... Col1 Value Col2 Value...Col 64 Value
2 3 4 5 5 7 9 3 5
3 4 5 11 8 6 5 6 5
5 3 4 6 10 11 5 3 4
I want to multiply Col1 by Col1 Volume, then divide by Col1 Value, and place the result in a new column called 'Col1 Result'.
Similarly, I want to multiply Col2 by Col2 Volume, then divide by Col2 Value, and place the result in a new column called 'Col2 Result'.
I want to do this for every row of those columns.
The output should look like this, and these columns should be appended to the existing dataframe:
Col1 Result Col2 Result
3.33 4.2
6 4.8
16.6 8.25
...
How can I perform this operation? The multiplication has to be one-to-one: the first row of Col1 should be multiplied only by the first row of Col1 Volume and divided by the first row of Col1 Value.
Doing it manually for all 64 columns would take a lot of time.
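Use DataFrame.filter to select all columns ending in Volume and in Value (the $ in the regex anchors the end of the string), and strip those suffixes so that both frames share the base column names. Then select the matching base columns from df, multiply by df1 and divide by df2, add the ' Result' suffix with DataFrame.add_suffix, replace missing values with 0, and join the result to the original DataFrame: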
df1 = df.filter(regex='Volume$').rename(columns=lambda x: x.replace(' Volume',''))
df2 = df.filter(regex='Value$').rename(columns=lambda x: x.replace(' Value',''))
df = df.join(df[df1.columns].mul(df1).div(df2).add_suffix(' Result').fillna(0))
print (df)
Col1 Col2 Col3 Col64 Col1 Volume Col2 Volume Col64 Volume \
0 2 3 4 5 5 7 9
1 3 4 5 11 8 6 5
Col1 Value Col2 Value Col64 Value Col1 Result Col2 Result Col64 Result
0 3 5 7 3.333333 4.2 6.428571
1 6 5 7 4.000000 4.8 7.857143
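For comparison, here is a minimal sketch of the same calculation written as an explicit loop over the base column names. The frame below is a cut-down stand-in built from the first two rows of the question, and the names `bases` and `base` are only illustrative. It relies on the fact that pandas arithmetic between columns of the same frame is element-wise, so row i of Col1 is only ever combined with row i of its partner columns.

```python
import pandas as pd

# Cut-down stand-in for the question's frame (first two rows, two column families).
df = pd.DataFrame({
    'Col1': [2, 3], 'Col2': [3, 4],
    'Col1 Volume': [5, 8], 'Col2 Volume': [7, 6],
    'Col1 Value': [3, 6], 'Col2 Value': [5, 5],
})

# A base name is any column that has both a " Volume" and a " Value" partner.
bases = [c for c in df.columns
         if f'{c} Volume' in df.columns and f'{c} Value' in df.columns]

# Element-wise arithmetic: each row is computed from values in that same row only.
for base in bases:
    df[f'{base} Result'] = df[base] * df[f'{base} Volume'] / df[f'{base} Value']

print(df[['Col1 Result', 'Col2 Result']])
```

The filter-based version above is likely the more compact choice for 64 column families; the loop is mainly easier to step through when checking a single column by hand.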
Related
I have two dataframes, first one is:
col1 col2 col3
1 14 2 6
2 12 3 3
3 9 4 2
Second one is:
col4 col5 col6
2 14 2 6
3 12 3 3
I want to concatenate them and get the index values from second one and row values from the first one.
The result will be like this:
col1 col2 col3
2 12 3 3
3 9 4 2
My solution was pd.concat([df2, df1], axis=1).drop(df2, axis=1), but I believe there is a more efficient way to do this.
You can select the rows of df1 by passing the index of df2 to loc:
df1.loc[df2.index]
Output:
col1 col2 col3
2 12 3 3
3 9 4 2
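A self-contained sketch of that selection, using the frames from the question. The `safe` variant is an addition that is not part of the original answer: it assumes df2 might contain index labels that df1 lacks, in which case plain loc raises a KeyError.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [14, 12, 9], 'col2': [2, 3, 4], 'col3': [6, 3, 2]},
                   index=[1, 2, 3])
df2 = pd.DataFrame({'col4': [14, 12], 'col5': [2, 3], 'col6': [6, 3]},
                   index=[2, 3])

# Label-based selection: keep the rows of df1 whose index labels appear in df2.
result = df1.loc[df2.index]

# Assumed extra case: intersect the indexes first so labels missing
# from df1 are skipped instead of raising a KeyError.
safe = df1.loc[df1.index.intersection(df2.index)]

print(result)
```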
I have a data frame like this,
col1 col2 col3
1 2 3
2 3 4
4 2 3
7 2 8
8 3 4
9 3 3
15 1 12
Now I want to group consecutive rows where the difference between two consecutive col1 values is less than 3, sum the other columns' values, and create another column (col4) holding the last col1 value of the group.
So the final data frame will look like,
col1 col2 col3 col4
1 7 10 4
7 8 15 9
Using a for loop to do this is tedious; I am looking for a pandas shortcut to do it efficiently.
You can do a named aggregation on groupby:
(df.groupby(df.col1.diff().ge(3).cumsum(), as_index=False)
.agg(col1=('col1','first'),
col2=('col2','sum'),
col3=('col3','sum'),
col4=('col1','last'))
)
Output:
col1 col2 col3 col4
0 1 7 10 4
1 7 8 15 9
2 15 1 12 15
Update: without named aggregation, you can do something like this:
groups = df.groupby(df.col1.diff().ge(3).cumsum())
new_df = groups.agg({'col1':'first', 'col2':'sum','col3':'sum'})
new_df['col4'] = groups['col1'].last()
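To see why diff().ge(3).cumsum() works as a grouping key, it can help to look at the intermediate Series one step at a time. This is only an illustrative breakdown on the question's data; the names `gap`, `starts` and `key` are made up for the sketch.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 4, 7, 8, 9, 15],
                   'col2': [2, 3, 2, 2, 3, 3, 1],
                   'col3': [3, 4, 3, 8, 4, 3, 12]})

gap = df['col1'].diff()                # NaN, 1, 2, 3, 1, 1, 6
starts = gap.ge(3)                     # True where a new group begins (NaN compares as False)
key = starts.cumsum().rename('group')  # running count of group starts: 0, 0, 0, 1, 1, 1, 2

print(key.tolist())
print(df.groupby(key)['col2'].sum().tolist())
```

Each True in `starts` bumps the running total by one, so all rows between two jumps of 3 or more share the same key.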
I have a dataframe like this:
df1
col1 col2 col3 col4
1 2 A S
3 4 A P
5 6 B R
7 8 B B
I have another data frame:
df2
col5 col6 col3
9 10 A
11 12 R
I want to join these two data frames: a row of df1 should be joined if its col3 or col4 value matches a col3 value of df2.
The final data frame will look like:
df3
col1 col2 col3 col5 col6
1 2 A 9 10
3 4 A 9 10
5 6 R 11 12
If a row's col3 value is present in df2, it should join on col3; otherwise it should join on col4, provided that value is present in col3 of df2.
How can I do this in the most efficient way using pandas/Python?
Use two merges with the default inner join. For the second one, first filter out the df2 rows already matched through col3 (the ones that produced df3), then concatenate both results:
df3 = df1.drop('col4', axis=1).merge(df2, on='col3')
df4 = (df1.drop('col3', axis=1).rename(columns={'col4':'col3'})
.merge(df2[~df2['col3'].isin(df1['col3'])], on='col3'))
df = pd.concat([df3, df4],ignore_index=True)
print (df)
col1 col2 col3 col5 col6
0 1 2 A 9 10
1 3 4 A 9 10
2 5 6 R 11 12
EDIT: To keep the unmatched rows as well, use left joins and then combine_first:
df3 = df1.drop('col4', axis=1).merge(df2, on='col3', how='left')
df4 = (df1.drop('col3', axis=1).rename(columns={'col4':'col3'})
.merge(df2, on='col3', how='left'))
df = df3.combine_first(df4)
print (df)
col1 col2 col3 col5 col6
0 1 2 A 9.0 10.0
1 3 4 A 9.0 10.0
2 5 6 B 11.0 12.0
3 7 8 B NaN NaN
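A possible alternative, sketched here under the assumption that the rule is "use col3 when df2 knows it, otherwise fall back to col4": build the join key per row with where and isin, then do a single inner merge. This is not the answer's method, only another way to express the same rule, and `key` is a name chosen for the sketch.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 3, 5, 7], 'col2': [2, 4, 6, 8],
                    'col3': ['A', 'A', 'B', 'B'], 'col4': ['S', 'P', 'R', 'B']})
df2 = pd.DataFrame({'col5': [9, 11], 'col6': [10, 12], 'col3': ['A', 'R']})

# Per-row join key: keep col3 where df2 contains it, otherwise take col4.
key = df1['col3'].where(df1['col3'].isin(df2['col3']), df1['col4'])

df3 = (df1.drop(columns=['col3', 'col4'])
          .assign(col3=key)
          .merge(df2, on='col3'))   # inner join drops rows whose key has no match

print(df3)
```

On this sample it yields the three rows from the question's expected output; passing how='left' to merge would keep the unmatched row with NaN instead.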
What's the syntax for combining mean and a min on a dataframe? I want to group by 2 columns, calculate the mean within a group for col3 and keep the min value of col4. Would something like
groupeddf = nongrouped.groupby(['col1', 'col2', 'col3'], as_index=False).mean().min('col4')
work? If not, what's the correct syntax? Thank you!
EDIT
Okay, so the question wasn't quite clear without an example. I'll update it now. Also changes in text above.
I have:
ungrouped
col1 col2 col3 col4
1 2 3 4
1 2 4 1
2 4 2 1
2 4 1 3
2 3 1 3
Wanted output is grouped by columns 1-2, mean for column 3 (and actually some more columns on the data, this is simplified) and the minimum of col4:
grouped
col1 col2 col3 col4
1 2 3.5 1
2 4 1.5 1
2 3 1 3
I think you need to take the mean first and then the min of column col4:
min_val = nongrouped.groupby(['col1', 'col2', 'col3'], as_index=False).mean()['col4'].min()
or min of Series:
min_val = nongrouped.groupby(['col1', 'col2', 'col3'])['col4'].mean().min()
Sample:
nongrouped = pd.DataFrame({'col1':[1,1,3],
'col2':[1,1,6],
'col3':[1,1,9],
'col4':[1,3,5]})
print (nongrouped)
col1 col2 col3 col4
0 1 1 1 1
1 1 1 1 3
2 3 6 9 5
print (nongrouped.groupby(['col1', 'col2', 'col3'])['col4'].mean())
1 1 1 2
3 6 9 5
Name: col4, dtype: int64
min_val = nongrouped.groupby(['col1', 'col2', 'col3'])['col4'].mean().min()
print (min_val)
2
EDIT:
You need aggregate:
groupeddf = (nongrouped.groupby(['col1', 'col2'], sort=False)
                       .agg({'col3':'mean','col4':'min'})
                       .reset_index()
                       .reindex(columns=nongrouped.columns))
print (groupeddf)
col1 col2 col3 col4
0 1 2 3.5 1
1 2 4 1.5 1
2 2 3 1.0 3
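The same result can also be written with named aggregation, assuming a pandas version that supports it (0.25 or later). The sketch below rebuilds the question's `ungrouped` frame so that it runs on its own; `grouped` is just the name used in the question for the wanted output.

```python
import pandas as pd

ungrouped = pd.DataFrame({'col1': [1, 1, 2, 2, 2],
                          'col2': [2, 2, 4, 4, 3],
                          'col3': [3, 4, 2, 1, 1],
                          'col4': [4, 1, 1, 3, 3]})

# One output column per keyword: name=(source column, aggregation function).
# sort=False keeps the groups in order of first appearance.
grouped = (ungrouped
           .groupby(['col1', 'col2'], as_index=False, sort=False)
           .agg(col3=('col3', 'mean'), col4=('col4', 'min')))

print(grouped)
```

With as_index=False the grouping columns stay as regular columns, so no reset_index or reindex step is needed here.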
Given two pandas dataframes:
df1 = pd.read_csv(file1, names=['col1','col2','col3'])
df2 = pd.read_csv(file2, names=['col1','col2','col3'])
I'd like to remove all the rows in df2 where the values of either col1 or col2 (or both) do not exist in df1.
Doing the following:
df2 = df2[(df2['col1'] in set(df1['col1'])) & (df2['col2'] in set(df1['col2']))]
yields:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I think you can try isin:
df2 = df2[(df2['col1'].isin(df1['col1'])) & (df2['col2'].isin(df1['col2']))]
df1 = pd.DataFrame({'col1':[1,2,3,3],
'col2':[4,5,6,2],
'col3':[7,8,9,5]})
print (df1)
col1 col2 col3
0 1 4 7
1 2 5 8
2 3 6 9
3 3 2 5
df2 = pd.DataFrame({'col1':[1,2,3,5],
'col2':[4,7,4,1],
'col3':[7,8,9,1]})
print (df2)
col1 col2 col3
0 1 4 7
1 2 7 8
2 3 4 9
3 5 1 1
df2 = df2[(df2['col1'].isin(df1['col1'])) & (df2['col2'].isin(df1['col2'].unique()))]
print (df2)
col1 col2 col3
0 1 4 7
2 3 4 9
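Another solution is merge, because an inner join (how='inner') is the default. Without an on argument it joins on all common columns (col1, col2 and col3 here), so it keeps only rows whose values match in every one of those columns within a single row of df1: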
print (pd.merge(df1, df2))
col1 col2 col3
0 1 4 7
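The two approaches answer slightly different questions, which the following sketch tries to make visible on the same sample data. The per-column isin test checks col1 and col2 independently, while a merge restricted to those two columns requires the (col1, col2) pair to occur together in one row of df1. The names `per_column`, `pairs` and `per_pair` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 3], 'col2': [4, 5, 6, 2], 'col3': [7, 8, 9, 5]})
df2 = pd.DataFrame({'col1': [1, 2, 3, 5], 'col2': [4, 7, 4, 1], 'col3': [7, 8, 9, 1]})

# Per-column test: each column is checked on its own against df1.
per_column = df2[df2['col1'].isin(df1['col1']) & df2['col2'].isin(df1['col2'])]

# Pair test: the (col1, col2) combination must exist in a single row of df1.
pairs = df1[['col1', 'col2']].drop_duplicates()
per_pair = df2.merge(pairs, on=['col1', 'col2'])

print(per_column)
print(per_pair)
```

Here the row (3, 4, 9) passes the per-column test, because 3 occurs in df1's col1 and 4 occurs in df1's col2, but it fails the pair test, because df1 has no row with col1 = 3 and col2 = 4.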