I have 2 dataframes, as follows.
df :
Name
1 A
2 B
3 C
4 C
5 D
6 D
7 D
8 D
and df_value :
Name Value
1 A 50
2 B 100
3 C 200
4 D 800
I want to merge both dataframes (into df), but with the new Value being the df_value Value divided by the number of occurrences of Name in df.
Output :
Name Value
1 A 50
2 B 100
3 C 100
4 C 100
5 D 200
6 D 200
7 D 200
8 D 200
A appears once, has a Value of 50 in df_value, so its value is 50. Same logic for B.
C appears 2 times, has a value of 200 in df_value, so its value is 200 / 2 = 100
D appears 4 times, has a value of 800 in df_value, so its value is 800 / 4 = 200
I'm pretty sure there's a really easy way to do that but I can't find it.
Thanks in advance.
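For reference, a minimal sketch reconstructing the example data (assuming the row numbers shown above are the index; note that merge produces a fresh 0-based RangeIndex, which is why the merge-based outputs further down start at 0):
import pandas as pd

# hypothetical reconstruction of the example frames
df = pd.DataFrame({'Name': ['A', 'B', 'C', 'C', 'D', 'D', 'D', 'D']},
                  index=range(1, 9))
df_value = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                         'Value': [50, 100, 200, 800]},
                        index=range(1, 5))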
Use Series.map on the Name column with a Series built from df_value, and divide by the mapped values of Series.value_counts:
df['Value'] = (df['Name'].map(df_value.set_index('Name')['Value'])
.div(df['Name'].map(df['Name'].value_counts())))
print (df)
Name Value
1 A 50.0
2 B 100.0
3 C 100.0
4 C 100.0
5 D 200.0
6 D 200.0
7 D 200.0
8 D 200.0
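Detail (a sketch of the two intermediate Series, assuming the data above: the values looked up from df_value and the per-row occurrence counts):
print (df['Name'].map(df_value.set_index('Name')['Value']))
1     50
2    100
3    200
4    200
5    800
6    800
7    800
8    800
Name: Name, dtype: int64

print (df['Name'].map(df['Name'].value_counts()))
1    1
2    1
3    2
4    2
5    4
6    4
7    4
8    4
Name: Name, dtype: int64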
Another solution (thank you @sammywemmy) is mapping by already divided values:
df.assign(Value=df.Name.map(df_value.set_index("Name").Value.div(df.Name.value_counts())))
A solution with merge is also possible; here is another alternative that gets the counts with GroupBy.transform:
df['Value'] = (df.merge(df_value, on='Name', how='left')['Value']
.div(df.groupby('Name')['Name'].transform('size')))
If it is important to keep the existing dataframes as they are and there is no restriction on using 2 lines of code:
df1 = df.merge(df_value, on='Name', how='left')
df1['Value'] = df1.groupby('Name')[['Value']].transform(lambda x: x/len(x))
Otherwise, a one-liner solution that modifies the existing df a bit:
df['Value'] = df.merge(df_value, on='Name', how='left').groupby('Name')[['Value']].transform(lambda x: x/len(x))
Both give the same output, with different variable names:
Name Value
0 A 50.0
1 B 100.0
2 C 100.0
3 C 100.0
4 D 200.0
5 D 200.0
6 D 200.0
7 D 200.0
I have two data frames with similar data, and I would like to subtract matching values. Example:
df1:
Letter FREQ Diff
0 A 20 NaN
1 B 12 NaN
2 C 5 NaN
3 D 4 NaN
df2:
Letter FREQ
0 A 19
1 B 11
3 D 2
If we can find the same letter in the column "Letter", I would like to create a new column with the subtraction of the two frequency columns.
Expected output :
df1:
Letter FREQ Diff
0 A 20 1
1 B 12 1
2 C 5 5
3 D 4 2
I have tried to begin like this, but obviously it doesn't work
for i in df1.Letter:
    for j in df2.Letter:
        if i == j:
            df1.Difference[j] == (df1.Frequency[i] - df2.Frequency[j])
        else:
            pass
Thank you for your help!
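For reference, a rough sketch reconstructing the two example frames shown above (the Diff column starts out as NaN):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Letter': ['A', 'B', 'C', 'D'],
                    'FREQ': [20, 12, 5, 4],
                    'Diff': np.nan})
df2 = pd.DataFrame({'Letter': ['A', 'B', 'D'],
                    'FREQ': [19, 11, 2]}, index=[0, 1, 3])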
Use df.merge with fillna:
In [1101]: res = df1.merge(df2, on='Letter', how='outer')
In [1108]: res['Diff'] = (res.FREQ_x - res.FREQ_y).fillna(res.FREQ_x)
In [1110]: res = res.drop(columns='FREQ_y').rename(columns={'FREQ_x': 'FREQ'})
In [1111]: res
Out[1111]:
  Letter  FREQ  Diff
0      A    20   1.0
1      B    12   1.0
2      C     5   5.0
3      D     4   2.0
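An equivalent sketch (not from the original answer) that keeps df1 intact uses Series.map, as in the earlier examples; letters missing from df2 subtract nothing thanks to fillna(0):
df1['Diff'] = df1['FREQ'] - df1['Letter'].map(df2.set_index('Letter')['FREQ']).fillna(0)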
I have a DataFrame that looks like this:
df = pd.DataFrame({'ID':['A','B','A','C','C'], 'value':[2,4,9,1,3.5]})
df
ID value
0 A 2.0
1 B 4.0
2 A 9.0
3 C 1.0
4 C 3.5
What I need to do is go through the ID column and, for each unique value, multiply the corresponding rows in the value column according to a reference that I have.
For example, if I have the following reference:
if A multiply by 10
if B multiply by 3
if C multiply by 2
Then the desired output would be:
df
ID value
0 A 2.0*10
1 B 4.0*3
2 A 9.0*10
3 C 1.0*2
4 C 3.5*2
Thanks in advance.
Use Series.map with a dictionary to build a Series of multipliers, then multiply the value column by it:
d = {'A':10, 'B':3,'C':2}
df['value'] = df['value'].mul(df['ID'].map(d))
print (df)
ID value
0 A 20.0
1 B 12.0
2 A 90.0
3 C 2.0
4 C 7.0
Detail:
print (df['ID'].map(d))
0 10
1 3
2 10
3 2
4 2
Name: ID, dtype: int64
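Note that Series.map returns NaN for any ID that is missing from the dictionary; if such rows should be left unchanged, one option (a sketch, not part of the original answer) is to default the multiplier to 1:
# keep rows whose ID has no entry in d unchanged (multiplier defaults to 1)
df['value'] = df['value'].mul(df['ID'].map(d).fillna(1))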
I am joining two tables, left_table and right_table, on non-unique keys, which results in row explosion. I then want to aggregate rows to match the number of rows in left_table. To do this I aggregate over the left_table columns.
Weirdly, when I save the table, the columns of left_table double. It seems like the columns of left_table become an index for the resulting dataframe...
Left table
k1 k2 s v c target
0 1 3 20 40 2 2
1 1 2 10 20 1 1
2 1 2 10 80 2 1
Right table
k11 k22 s2 v2
0 1 2 0 100
1 2 3 30 200
2 1 2 10 300
Left join
k1 k2 s v c target s2 v2
0 1 3 20 40 2 2 NaN NaN
1 1 2 10 20 1 1 0.0 100.0
2 1 2 10 20 1 1 10.0 300.0
3 1 2 10 80 2 1 0.0 100.0
4 1 2 10 80 2 1 10.0 300.0
Aggregation code
dic = {}
keys_to_agg_over = left_table_col_names
for col in numeric_cols:
    if col in all_cols:
        dic[col] = 'median'

left_join = left_join.groupby(keys_to_agg_over).aggregate(dic)
After aggregation (doubled number of left table cols)
k1 k2 s v c target s2 v2
k1 k2 s v c target
1 2 10 20 1 1 1 2 10 20 1 1 5.0 200.0
80 2 1 1 2 10 80 2 1 5.0 200.0
3 20 40 2 2 1 3 20 40 2 2 NaN NaN
Saved to csv file
k1,k2,s,v,c,target,k1,k2,s,v,c,target,s2,v2
1,2,10,20,1,1,1,2,10,20,1,1,5.0,200.0
1,2,10,80,2,1,1,2,10,80,2,1,5.0,200.0
1,3,20,40,2,2,1,3,20,40,2,2,,
I tried resetting the index with left_join.reset_index() but I get
ValueError: cannot insert target, already exists
How to fix the issue of column-doubling?
You have a couple of options:
Store the csv without the index: I guess you are using the to_csv method to store the result in a csv. By default it includes your index columns in the generated csv. You can do to_csv(index=False) to avoid storing them.
reset_index dropping it: you can use left_join.reset_index(drop=True) to discard the index columns and not add them to the dataframe. By default reset_index adds the current index columns to the dataframe, generating the ValueError you obtained.
It seems like you are using:
left_join = left_table.merge(right_table, left_on=["k1", "k2"], right_on=["k11", "k22"], how="left")
This will result in a dataframe with repeated rows, since indexes 1 and 2 of the left table can both be joined to indexes 0 and 2 of the right table. If that is the behaviour you expected and you just want to get rid of the duplicated rows, you can try using:
left_join = left_join.drop_duplicates()
before aggregating. This won't stop the rows from being duplicated; it simply removes the duplicates afterwards so they do not cause any trouble.
You can also pass the parameter as_index=False to the groupby function, like this:
left_join = left_join.groupby(keys_to_agg_over, as_index=False).aggregate(dic)
to stop getting the grouping columns as an index.
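Putting the pieces together, a rough sketch (variable and file names are assumptions based on the snippets above; adjust to your real data):
left_join = left_table.merge(right_table,
                             left_on=["k1", "k2"],
                             right_on=["k11", "k22"],
                             how="left")

left_join = left_join.groupby(keys_to_agg_over, as_index=False).aggregate(dic)

# write without the index so no extra columns appear in the csv
left_join.to_csv("result.csv", index=False)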
I'm trying to make a cumulative dataframe from various fragmented small dataframes.
For example, I have fragmented small dataframe A and B such as:
a c
100 1 2
200 5 6
a b d
100 2 3 8
200 9 1 9
A = pd.DataFrame(data = [[1,2],[5,6]], index=[100,200], columns=['a','c'])
B = pd.DataFrame(data = [[2,3,8],[9,1,9]], index=[100,200], columns=['a','b','d'])
and I want to add these into a cumulative dataframe c
a b c d
100 0 0 0 0
200 0 0 0 0
C = pd.DataFrame(data = [[0,0,0,0],[0,0,0,0]], index=[100,200], columns=['a','b','c','d'])
So what I want to do is, add A and B to C to make:
a b c d
100 3 3 2 8
200 14 1 6 9
what I did first was something like:
C[A.columns] += A
C[B.columns] += B
Which works fine, and brings the desired output.
However, in my real application, efficiency issues arise, since A, B, and C are quite large and there are many, many fragmented dataframes like A and B.
Therefore, I have looked for some alternative methods, and I found pandas.eval pretty powerful in big-sized matrix operations.
What I tried is:
import pandas as pd
C = pd.eval('C+A')
C = pd.eval('C+B')
However, in this case, columns not included in A or B become NaN...
out:
a b c d
100 1.0 NaN 2.0 NaN
200 5.0 NaN 6.0 NaN
out:
a b c d
100 3.0 NaN NaN NaN
200 14.0 NaN NaN NaN
Any suggestions to make the desired operation more efficient? Any suggestions would be appreciated (I don't necessarily have to use pd.eval).
Thank you in advance!
Sometimes using pd.eval is tricky.
Here you need to operate only on the columns of C that are being added, and the list of columns has to be passed to pd.eval through a variable:
a_cols = A.columns
b_cols = B.columns
C[a_cols] = pd.eval('C[a_cols]+A', engine='python')
C[b_cols] = pd.eval('C[b_cols]+B', engine='python')
out:
a b c d
100 3 3 2 8
200 14 1 6 9
For the engine we need to use 'python' instead of the default 'numexpr'.
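Detail (a sketch of the intermediate C after the first eval, assuming the frames defined above: only the columns of A are updated, while b and d are still 0):
print (C)
     a  b  c  d
100  1  0  2  0
200  5  0  6  0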
After encountering this code:
I was confused about the usage of both .apply and lambda. Firstly, does .apply apply the desired change to all elements in all the specified columns at once, or to each column one by one? Secondly, does the x in lambda x: iterate through every element of the specified columns, or through the columns separately? Thirdly, do x.min and x.max give the minimum and maximum of all the elements in the specified columns, or of each column separately? Any answer explaining the whole process would make me more than grateful.
Thanks.
I think it is best to avoid apply here - it loops under the hood - and instead work with a subset of the DataFrame selected by a list of columns:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
c = ['B','C','D']
So first select the minimum values of the selected columns, and similarly the maximum:
print (df[c].min())
B 4
C 2
D 0
dtype: int64
Then subtract and divide:
print ((df[c] - df[c].min()))
B C D
0 0 5 1
1 1 6 3
2 0 7 5
3 1 2 7
4 1 0 1
5 0 1 0
print (df[c].max() - df[c].min())
B 1
C 7
D 7
dtype: int64
df[c] = (df[c] - df[c].min()) / (df[c].max() - df[c].min())
print (df)
A B C D E F
0 a 0.0 0.714286 0.142857 5 a
1 b 1.0 0.857143 0.428571 3 a
2 c 0.0 1.000000 0.714286 6 a
3 d 1.0 0.285714 1.000000 9 b
4 e 1.0 0.000000 0.142857 2 b
5 f 0.0 0.142857 0.000000 4 b
EDIT:
For debugging apply, it is best to create a custom function:
def f(x):
    # x is one column of df[c] per call
    print (x)
    # scalar - the column minimum
    print (x.min())
    # new Series - the normalised column
    print ((x - x.min()) / (x.max() - x.min()))
    return (x - x.min()) / (x.max() - x.min())

df[c] = df[c].apply(f)
print (df)
Check whether the data are really being normalised, because x.min and x.max may simply take the min and max of a single value, in which case no normalisation would occur.
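A quick sanity check (a sketch) is to look at the column-wise extremes after the transformation; if the normalisation worked, every selected column should span exactly 0 to 1:
print (df[c].min())   # expected: 0.0 for B, C and D
print (df[c].max())   # expected: 1.0 for B, C and D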