I am joining two tables, left_table and right_table, on non-unique keys, which results in row explosion. I then want to aggregate the rows back down so the result matches the number of rows in left_table, so I aggregate over the left_table columns.
Weirdly, when I save the table the left_table columns are doubled. It seems like the columns of left_table become an index of the resulting dataframe...
Left table
k1 k2 s v c target
0 1 3 20 40 2 2
1 1 2 10 20 1 1
2 1 2 10 80 2 1
Right table
k11 k22 s2 v2
0 1 2 0 100
1 2 3 30 200
2 1 2 10 300
Left join
k1 k2 s v c target s2 v2
0 1 3 20 40 2 2 NaN NaN
1 1 2 10 20 1 1 0.0 100.0
2 1 2 10 20 1 1 10.0 300.0
3 1 2 10 80 2 1 0.0 100.0
4 1 2 10 80 2 1 10.0 300.0
Aggregation code
dic = {}
keys_to_agg_over = left_table_col_names
for col in numeric_cols:
    if col in all_cols:
        dic[col] = 'median'
left_join = left_join.groupby(keys_to_agg_over).aggregate(dic)
After aggregation (doubled number of left table cols)
k1 k2 s v c target s2 v2
k1 k2 s v c target
1 2 10 20 1 1 1 2 10 20 1 1 5.0 200.0
80 2 1 1 2 10 80 2 1 5.0 200.0
3 20 40 2 2 1 3 20 40 2 2 NaN NaN
Saved to csv file
k1,k2,s,v,c,target,k1,k2,s,v,c,target,s2,v2
1,2,10,20,1,1,1,2,10,20,1,1,5.0,200.0
1,2,10,80,2,1,1,2,10,80,2,1,5.0,200.0
1,3,20,40,2,2,1,3,20,40,2,2,,
I tried resetting the index with left_join.reset_index(), but I get
ValueError: cannot insert target, already exists
How do I fix this column-doubling issue?
You have a couple of options:
Store the csv without the index: I guess you are using the to_csv method to store the result in a csv. By default it includes your index columns in the generated csv; you can pass to_csv(index=False) to avoid storing them.
reset_index dropping it: you can use left_join.reset_index(drop=True) to discard the index columns instead of adding them back to the dataframe. By default reset_index inserts the current index columns into the dataframe, which generates the ValueError you are getting.
It seems like you are using:
left_join = left_table.merge(right_table, left_on=["k1", "k2"], right_on=["k11", "k22"], how="left")
This will result in a dataframe with repeated rows, since indexes 1 and 2 of the left table can both be joined to indexes 0 and 2 of the right table. If that is the behavior you expected and you just want to get rid of the duplicated rows, you can try using:
left_join = left_join.drop_duplicates()
before aggregating. This won't prevent the row duplication from happening; it simply removes the duplicates afterwards so they don't cause any trouble.
You can also pass the parameter as_index=False to the groupby function, like this:
left_join = left_join.groupby(keys_to_agg_over, as_index = False).aggregate(dic)
to stop getting the "grouping columns" as indexes.
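Putting the suggestions together, a minimal sketch of the fix (result.csv is just a placeholder file name, and the aggregation dict is built here without the grouping keys so they only appear once):
dic = {col: 'median' for col in numeric_cols
       if col in all_cols and col not in keys_to_agg_over}
left_join = left_join.groupby(keys_to_agg_over, as_index=False).aggregate(dic)
left_join.to_csv("result.csv", index=False)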
Given a DataFrame
df1 :
value mesh
0 10 2
1 12 3
2 5 2
obtain a new DataFrame df2 in which each row of df1 is repeated mesh times, with each repeated value equal to the corresponding value of df1 divided by its mesh:
df2 :
value/mesh
0 5
1 5
2 4
3 4
4 4
5 2.5
6 2.5
More general:
df1 :
value mesh_value other_value
0 10 2 0
1 12 3 1
2 5 2 2
obtain:
df2 :
value/mesh_value other_value
0 5 0
1 5 0
2 4 1
3 4 1
4 4 1
5 2.5 2
6 2.5 2
You can use map to add a column that tells you, for each row of df2, the index of the df1 row it came from (this assumes the value/mesh ratios in df1 are unique):
df2['new'] = df2['value/mesh'].map(dict(zip(df1.eval('value/mesh'),df1.index)))
Out[243]:
0 0
1 0
2 1
3 1
4 1
5 2
6 2
Name: value/mesh, dtype: int64
Try as follows:
Use Series.div for value / mesh_value, and apply Series.reindex using np.repeat with df.mesh_value as the input array for the repeats parameter.
Next, use pd.concat to combine the result with df.other_value along axis=1.
Finally, rename the column with result of value / mesh_value (its default name will be 0) using df.rename, and chain df.reset_index to reset to a standard index.
import numpy as np
import pandas as pd
df2 = pd.concat([df.value.div(df.mesh_value).reindex(
           np.repeat(df.index, df.mesh_value)), df.other_value], axis=1)\
        .rename(columns={0: 'value_mesh_value'}).reset_index(drop=True)
print(df2)
value_mesh_value other_value
0 5.0 0
1 5.0 0
2 4.0 1
3 4.0 1
4 4.0 1
5 2.5 2
6 2.5 2
Or slightly different:
Use df.assign to add a column with the result of df.value.div(df.mesh_value), and reindex / rename in the same way as above.
Use df.drop to get rid of the columns you don't want (value, mesh_value) and use df.iloc to change the column order (we want ['value_mesh_value','other_value'] instead of the other way around, hence [1,0]). And again, reset the index.
We put all of this between brackets and assign it to df2.
df2 = (df.assign(tmp=df.value.div(df.mesh_value))
         .reindex(np.repeat(df.index, df.mesh_value))
         .rename(columns={'tmp': 'value_mesh_value'})
         .drop(columns=['value', 'mesh_value'])
         .iloc[:, [1, 0]]
         .reset_index(drop=True))
# same result
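For reference, a shorter alternative sketch, not taken from the answers above, that relies on Index.repeat (it assumes df holds the more general df1 from the question):
df2 = (df.loc[df.index.repeat(df.mesh_value)]
         .assign(value_mesh_value=lambda d: d.value.div(d.mesh_value))
         [['value_mesh_value', 'other_value']]
         .reset_index(drop=True))
It produces the same df2 as above.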
I have two data frames with similar data, and I would like to subtract matching values. Example:
df1:
Letter FREQ Diff
0 A 20 NaN
1 B 12 NaN
2 C 5 NaN
3 D 4 NaN
df2:
Letter FREQ
0 A 19
1 B 11
3 D 2
If we can find the same letter in the column "Letter", I would like to create a new column with the subtraction of the two frequency columns.
Expected output :
df1:
Letter FREQ Diff
0 A 20 1
1 B 12 1
2 C 5 5
3 D 4 2
I have tried to begin like this, but obviously it doesn't work
for i in df1.Letter:
    for j in df2.Letter:
        if i == j:
            df1.Difference[j] == (df1.Frequency[i] - df2.Frequency[j])
        else:
            pass
Thank you for your help!
Use df.merge with fillna (with the sample data above, the shared frequency column gets the _x/_y suffixes after the merge):
In [1101]: res = df1.merge(df2, on='Letter', how='outer')
In [1108]: res['Diff'] = (res.FREQ_x - res.FREQ_y).fillna(res.FREQ_x)
In [1110]: res = res.drop(columns='FREQ_y').rename(columns={'FREQ_x': 'FREQ'})
In [1111]: res
Out[1111]:
Letter FREQ Diff
0 A 20 1.0
1 B 12 1.0
2 C 5 5.0
3 D 4 2.0
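For completeness, a minimal sketch of the same idea without a merge, using Series.map (assuming the column names from the question):
lookup = df2.set_index('Letter')['FREQ']
df1['Diff'] = (df1['FREQ'] - df1['Letter'].map(lookup)).fillna(df1['FREQ'])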
I have the following df
A B
0 1 10
1 2 20
2 NaN 5
3 3 1
4 NaN 2
5 NaN 3
6 1 10
7 2 50
8 NaN 80
9 3 5
It consists of repeating sequences from 1-3 separated by a variable number of NaNs. I want to group each of these 1-3 sequences and get the minimum value of column B within each sequence.
Desired Output something like:
B_min
0 1
6 5
Many thanks beforehand
draj
The idea is to first remove the rows with missing values in A using DataFrame.dropna, then group by a helper Series built by comparing A to 1 with Series.eq and taking Series.cumsum, take GroupBy.min of B, and finally clean the result up into a one-column DataFrame:
df = (df.dropna(subset=['A'])
        .groupby(df['A'].eq(1).cumsum())['B']
        .min()
        .reset_index(drop=True)
        .to_frame(name='B_min'))
print (df)
B_min
0 1
1 5
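To see what the helper grouper looks like, a quick check (assuming df holds the sample data above, with real NaNs in column A):
df['A'].eq(1).cumsum().tolist()
# -> [1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
# rows 0-5 form group 1 and rows 6-9 form group 2; the NaN rows are removed by dropna before grouping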
All you need is df.groupby() with min(). Is this what you are expecting?
df.groupby('A')['B'].min()
Output:
A
1 10
2 20
3 1
Nan 80
If you don't want the NaNs in your group you can drop them using df.dropna()
df.dropna().groupby('A')['B'].min()
I have two data frames. df1 holds a 'grouped inventory' of items, grouped by the numerical values A, B and C. For each item type there is a SUM column which should reflect the total price of all the items I have of that particular type. Initially I have set the SUM column to zero.
df2 is a list of the items I have, with A, B, C and the price of each item.
df1 (Initial Inventory):
A B C SUM
1 1 1 0
1 1 2 0
1 2 2 0
2 2 2 0
df2 (List of items):
A B C PRICE
2 2 2 30
1 1 2 100
1 1 2 110
1 1 2 105
So my code should convert df1 into:
df1 (expected output):
A B C SUM
1 1 1 0
1 1 2 315
1 2 2 0
2 2 2 30
Explanation: My list of items (df2) contains one item coded as 2,2,2 with a value of 30, and three items coded as 1,1,2 with values of 100 + 110 + 105 = 315. So I update the inventory table df1 to reflect that I have a total value of 30 for items coded 2,2,2 and a total value of 315 for items coded 1,1,2. I have 0 in value for items coded 1,1,1 and 1,2,2, since they aren't found in my items list.
What would be the most efficient way to do this?
I would rather not use loops since df1 is 720 rows and df2 is 10,000 rows.
You can try merging on columns "A", "B", and "C" with how="left". (The keys in df2_sum below are a subset of those in df1, so we choose a left join here to keep every row of df1.)
df2_sum = df2.groupby(["A", "B", "C"])["PRICE"].sum().reset_index()
df1.merge(df2_sum, on=["A","B","C"], how="left").fillna(0)
A B C SUM PRICE
0 1 1 1 0 0.0
1 1 1 2 0 315.0
2 1 2 2 0 0.0
3 2 2 2 0 30.0
You can then add PRICE to your SUM column.
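Put together, a minimal sketch of the full update (the names df2_sum and merged are just for illustration):
df2_sum = df2.groupby(["A", "B", "C"])["PRICE"].sum().reset_index()
merged = df1.merge(df2_sum, on=["A", "B", "C"], how="left").fillna(0)
df1["SUM"] = (merged["SUM"] + merged["PRICE"]).to_numpy()  # align by position, not by index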
I have a dataframe df1
id value
1 100
2 100
3 100
4 100
5 100
I have another dataframe df2
id value
2 50
5 30
I want to replace the values in df1, for the ids that appear in df2, with the values from df2.
final modified df1:
id value
1 100
2 50
3 100
4 100
5 30
I will be running this in a loop, i.e. df2 will change from time to time (df1 is created outside the loop).
What would be the best way to change the values?
Use combine_first, but first set id as the index in both DataFrames with set_index:
Note: the id column in df2 has to be unique.
df = df2.set_index('id').combine_first(df1.set_index('id')).reset_index()
print (df)
id value
0 1 100.0
1 2 50.0
2 3 100.0
3 4 100.0
4 5 30.0
A loc based solution -
i = df1.set_index('id')
j = df2.set_index('id')
i.loc[j.index, 'value'] = j['value']
df1 = i.reset_index()
df1
id value
0 1 100
1 2 50
2 3 100
3 4 100
4 5 30
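Another option, sketched here for reference rather than taken from the answers above, is DataFrame.update, which overwrites the matching ids in place (tmp is just an illustrative name; depending on your pandas version the value column may be upcast to float, as with combine_first):
tmp = df1.set_index('id')
tmp.update(df2.set_index('id'))
df1 = tmp.reset_index()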