df1 has one column (total) with two values, 5000 and 1000, with ids A and B respectively. df2 has one column (marks) with 10 values: the first five (100, 200, 300, 400, 500) have id A and the next five (10, 20, 30, 40, 50) have id B.
The expected output is:
id final_value
A 50
A 25
A 16.6
A 12.5
A 10
B 100
B 50
B 33.3
B 25
B 20
My code is:
new_df = df1['total']/df2['marks']
But the output I got is:
A 50
B 100
with NaN for the remaining rows.
pandas divides the two columns (Series) element by element, aligning them on their index rather than on id.
If you want to divide using 'id' as the link, you have to merge your dataframes first:
df1 = pd.DataFrame([[5000, 'A'], [1000, 'B']], columns=['test', 'id'])
df2 = pd.DataFrame([[100, 'A'], [200, 'A'], [300, 'A'], [400, 'A'], [500, 'A'], [10, 'B'], [20, 'B'], [30, 'B'], [40, 'B'], [50, 'B']],columns=['marks', 'id'])
df3 = df1.merge(df2, on='id')
df3['test']/df3['marks']
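As a self-contained sketch of the same idea, the snippet below first shows why the plain division leaves NaN values and then looks up each row's total by id before dividing. The column names id, total and marks follow the question; using Series.map instead of merge is a swapped-in variant for illustration, not the method from the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({"id": ["A", "B"], "total": [5000, 1000]})
df2 = pd.DataFrame({
    "id": ["A"] * 5 + ["B"] * 5,
    "marks": [100, 200, 300, 400, 500, 10, 20, 30, 40, 50],
})

# Plain division aligns on the default integer index (0..1 vs 0..9),
# so only the first two rows get a value and the rest are NaN.
naive = df1["total"] / df2["marks"]

# Look up each row's total by id, then divide row by row.
totals = df2["id"].map(df1.set_index("id")["total"])
df2["final_value"] = (totals / df2["marks"]).round(1)
print(df2[["id", "final_value"]])
```

This assumes every id in df2 also appears exactly once in df1; ids missing from df1 would come out as NaN.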
Setup:
df1 = pd.DataFrame({'total': [5000, 1000]}, index = ['A', 'B'])
df2 = pd.DataFrame({'marks': [100, 200, 300, 400, 500, 10, 20, 30, 40, 50]}, index = ['A','A','A','A','A','B','B','B','B','B'])
Interestingly enough this works:
df2['total'] = df1['total']
df2['final_value'] = df2['total'] / df2['marks']
And then you can just drop the helper columns and copy the answer to a new dataframe if you want it as you stated:
new_df = df2[['final_value']]
df2 = df2.drop(['total', 'final_value'], axis = 1)
Assuming your data looks like this:
df1 = pd.DataFrame(dict(id=['A', 'B'], total=[5000,1000]))
df2 = pd.DataFrame(dict(id=['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
vals=[100,200,300,400,500,10,20,30,40,50]))
You can get the new column you're interested in by first merging the two dataframes on the id column and then applying a lambda function that divides the total (from df1) by the corresponding value in df2. Specifically:
df2['final_result'] = df2.merge(df1, on='id').apply(lambda x: round(x.total/x.vals, 1), axis=1)
And if you only want the id and final_result columns, you can just select those:
df2[['id', 'final_result']]
Your data should now look like you expected:
id final_result
0 A 50.0
1 A 25.0
2 A 16.7
3 A 12.5
4 A 10.0
5 B 100.0
6 B 50.0
7 B 33.3
8 B 25.0
9 B 20.0
Note that in the lambda function I also applied some rounding to get just 1 decimal as you indicated.
Try:
>>> df1.set_index("id").rename(columns={"total": "marks"}).div(df2.set_index("id")).round(1).reset_index()
id marks
0 A 50.0
1 A 25.0
2 A 16.7
3 A 12.5
4 A 10.0
5 B 100.0
6 B 50.0
7 B 33.3
8 B 25.0
9 B 20.0
This leverages the fact that, for any arithmetic operation between two dataframes, pandas aligns both operands by index and by column labels (so the operation is applied index x on index x and column a on column a). That is why total is renamed to marks before dividing.
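To make the alignment behaviour concrete, here is a small sketch with two made-up frames (the names left and right and their labels are purely illustrative). Only cells whose row label and column label exist in both frames produce a value; everything else becomes NaN.

```python
import pandas as pd

left = pd.DataFrame({"a": [10, 20], "b": [1, 2]}, index=["x", "y"])
right = pd.DataFrame({"a": [2, 5], "c": [7, 8]}, index=["y", "z"])

# The result covers the union of both indexes and both column sets.
# Only row "y", column "a" is present in both frames: 20 / 2.
out = left / right
print(out)
```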
Related
I have df like this
Web R_Val B_Cost R_Total B_Total
A 20 2 20 1
B 30 3 10 2
C 40 1 30 1
I would like to multiply the columns starting with R_ together and the columns starting with B_ together. The real data has many more columns; this is just dummy data. What would be the best way to achieve this?
Web R_Val B_Cost R_Total B_Total R_New B_New
A 20 2 20 1
B 30 3 10 2
C 40 1 30 1
Check the answer I just posted on your other question:
How to multiply specific column from dataframe with one specific column in same dataframe?
dfr = pd.DataFrame({
'Brand' : ['A', 'B', 'C', 'D', 'E', 'F'],
'price' : [10, 20, 30, 40, 50, 10],
'S_Value' : [2,4,2,1,1,1],
'S_Factor' : [2,1,1,2,1,1]
})
pre_fixes = ['S_']
for prefix in pre_fixes:
    coltocal = [col for col in dfr.columns if col.startswith(prefix)]
    for col in coltocal:
        dfr.loc[:,col+'_new'] = dfr.price*dfr[col]
dfr
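For the question as asked here (multiplying all R_ columns together and all B_ columns together), the same prefix loop can be adapted. This is only a sketch on the dummy data from the question; the output names R_New and B_New are taken from the expected header, and using DataFrame.prod is my assumption about what "multiply together" means.

```python
import pandas as pd

df = pd.DataFrame({
    "Web": ["A", "B", "C"],
    "R_Val": [20, 30, 40],
    "B_Cost": [2, 3, 1],
    "R_Total": [20, 10, 30],
    "B_Total": [1, 2, 1],
})

for prefix in ["R_", "B_"]:
    # Collect the columns for this prefix before adding the new one.
    cols = [c for c in df.columns if c.startswith(prefix)]
    # Row-wise product across all matching columns.
    df[prefix + "New"] = df[cols].prod(axis=1)

print(df)
```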
I read this excellent guide to pivoting but I can't work out how to apply it to my case. I have tidy data like this:
>>> import pandas as pd
>>> df = pd.DataFrame({
... 'case': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', ],
... 'perf_var': ['num', 'time', 'num', 'time', 'num', 'time', 'num', 'time'],
... 'perf_value': [1, 10, 2, 20, 1, 30, 2, 40]
... }
... )
>>>
>>> df
case perf_var perf_value
0 a num 1
1 a time 10
2 a num 2
3 a time 20
4 b num 1
5 b time 30
6 b num 2
7 b time 40
What I want is:
To use "case" as the columns
To use the "num" values as the index
To use the "time" values as the value.
to give:
case a b
1.0 10 30
2.0 20 40
All the pivot examples I can see have the index and values in separate columns, but the above seems like a valid/common "tidy" data case to me (I think?). Is it possible to pivot from this?
You need a bit of preprocessing to get your final result:
import numpy as np
(df.assign(num=np.where(df.perf_var == "num",
df.perf_value,
np.nan),
time=np.where(df.perf_var == "time",
df.perf_value,
np.nan))
.assign(num=lambda x: x.num.ffill(),
time=lambda x: x.time.bfill())
.loc[:, ["case", "num", "time"]]
.drop_duplicates()
.pivot(index="num", columns="case", values="time"))
case a b
num
1.0 10.0 30.0
2.0 20.0 40.0
An alternative route to the same end point:
(
df.set_index(["case", "perf_var"], append=True)
.unstack()
.droplevel(0, 1)
.assign(num=lambda x: x.num.ffill(),
time=lambda x: x.time.bfill())
.drop_duplicates()
.droplevel(0)
.set_index("num", append=True)
.unstack(0)
.rename_axis(index=None)
)
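Another way to think about it, sketched below, is to number each observation within its (case, perf_var) group so that every num can be paired with its time, and then pivot twice. The helper column obs is a name I made up, and the approach assumes the n-th num row of a case belongs with the n-th time row of that case. Passing a list as index to pivot needs pandas 1.1 or newer.

```python
import pandas as pd

df = pd.DataFrame({
    "case": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "perf_var": ["num", "time", "num", "time", "num", "time", "num", "time"],
    "perf_value": [1, 10, 2, 20, 1, 30, 2, 40],
})

# Number the observations within each (case, perf_var) group: 0, 1, ...
df["obs"] = df.groupby(["case", "perf_var"]).cumcount()

# First pivot: one row per (case, obs) with separate num and time columns.
wide = df.pivot(index=["case", "obs"], columns="perf_var",
                values="perf_value").reset_index()

# Second pivot: num as index, case as columns, time as values.
result = wide.pivot(index="num", columns="case", values="time")
print(result)
```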
This question was very hard to word..
Here is some sample code for a reproducible example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame([['a', 1, 10, 1], ['a', 2, 20, 1], ['b', 1, 4, 1], ['c', 1, 2, 1], ['e', 2, 10, 1]])
df2 = pd.DataFrame([['a', 1, 15, 2], ['a', 2, 20, 2], ['c', 1, 2, 2]])
df3 = pd.DataFrame([['d', 1, 10, 3], ['e', 2, 20, 3], ['f', 1, 15, 3]])
df1.columns = ['name', 'id', 'price', 'part']
df2.columns = ['name', 'id', 'price', 'part']
df3.columns = ['name', 'id', 'price', 'part']
result = pd.DataFrame([['a', 1, 10, 15, 'missing'],
['a', 2, 20, 20, 'missing'],
['b', 1, 4, 'missing', 'missing'],
['c', 1, 2, 2, 'missing'],
['e', 2, 10, 'missing', 20],
['d', 1, 'missing', 'missing', 10],
['f', 1, 'missing', 'missing', 15]])
result.columns = ['name', 'id', 'pricepart1', 'pricepart2', 'pricepart3']
So there are three DataFrames:
df1
name id price part
0 a 1 10 1
1 a 2 20 1
2 b 1 4 1
3 c 1 2 1
4 e 2 10 1
df2
name id price part
0 a 1 15 2
1 a 2 20 2
2 c 1 2 2
df3
name id price part
0 d 1 10 3
1 e 2 20 3
2 f 1 15 3
The name and id together act like a composite key. A given pair may be present in all three DataFrames, in just two of them, or in just one. To record which DataFrame a name, id pair came from, a part column exists in df1, df2 and df3.
The result I'm looking for is given by the result DataFrame.
name id pricepart1 pricepart2 pricepart3
0 a 1 10 15 missing
1 a 2 20 20 missing
2 b 1 4 missing missing
3 c 1 2 2 missing
4 e 2 10 missing 20
5 d 1 missing missing 10
6 f 1 missing missing 15
Basically, I want EVERY name, id pair to be accounted for. Even if the SAME name, id comes in both df1 and df2, I want separate columns for price from each of the part even if the prices in both the parts/DataFrames are the same.
In the results DataFrame, take row1, a 1 10 15 missing
What this represents is, the name, id pair a 1 had a price of 10 in df1, 15 in df2, and missing in df3.
If the row value is missing for a specific pricepart that means, the name, id pair did not appear in that particular DataFrame!
I've used the part to represent the DataFrame! So you can assume that part is ALWAYS 1 in df1, ALWAYS 2 in df2 and ALWAYS 3 in df3.
So far.. I literally just did, pd.concat([df1, df2, df3])
Not sure if this approach is going to lead to a dead end..
Keep in mind that the original three DataFrames are 62245 rows × 4 columns EACH. And each DataFrame may or may not contain the name, id pair. If the name, id pair is present in EVEN 1 of the DataFrames, and not the others, I wanted that to be accounted for with a missing for the other DataFrames.
You can use pd.merge with how='outer':
# Change column names and remove 'part' column
df1 = df1.rename(columns={'price':'pricepart1'}).drop('part', axis=1)
df2 = df2.rename(columns={'price':'pricepart2'}).drop('part', axis=1)
df3 = df3.rename(columns={'price':'pricepart3'}).drop('part', axis=1)
# Merge dataframes
df = pd.merge(df1, df2, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df3, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
# Fill na values with 'missing'
df = df.fillna('missing')
Out[]:
name id pricepart1 pricepart2 pricepart3
0 a 1 10 15 missing
1 a 2 20 20 missing
2 b 1 4 missing missing
3 c 1 2 2 missing
4 e 2 10 missing 20
5 d 1 missing missing 10
6 f 1 missing missing 15
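The pd.concat attempt mentioned in the question does not have to be a dead end either. As a sketch, stacking the three frames and pivoting on part gives one price column per part. This assumes each (name, id, part) combination occurs at most once, otherwise pivot raises an error, and it needs pandas 1.1 or newer for a list-valued index. Missing pairs are left as NaN here; fillna('missing') can be applied afterwards as in the answer above.

```python
import pandas as pd

cols = ["name", "id", "price", "part"]
df1 = pd.DataFrame([["a", 1, 10, 1], ["a", 2, 20, 1], ["b", 1, 4, 1],
                    ["c", 1, 2, 1], ["e", 2, 10, 1]], columns=cols)
df2 = pd.DataFrame([["a", 1, 15, 2], ["a", 2, 20, 2], ["c", 1, 2, 2]],
                   columns=cols)
df3 = pd.DataFrame([["d", 1, 10, 3], ["e", 2, 20, 3], ["f", 1, 15, 3]],
                   columns=cols)

wide = (
    pd.concat([df1, df2, df3])
    # One row per (name, id), one column per part value.
    .pivot(index=["name", "id"], columns="part", values="price")
    # Column labels 1, 2, 3 become pricepart1, pricepart2, pricepart3.
    .add_prefix("pricepart")
    .rename_axis(columns=None)
    .reset_index()
)
print(wide)
```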
I have 2 dataframes, each with 2 columns (shown in the picture). I'm trying to define a function or perform an operation to scan df2 on df1 and store
df2["values"] in df1["values"] if df2["ID"] matches df1["ID"].
I want the result as shown in New_df1 (picture)
I have tried a for loop with function append() but it's really tricky to make it work...
You can do this via pandas.concat, sorting and dropping duplicates:
import pandas as pd, numpy as np
df1 = pd.DataFrame([[i, np.nan] for i in list('abcdefghik')],
columns=['ID', 'Values'])
df2 = pd.DataFrame([['a', 2], ['c', 5], ['e', 4], ['g', 7], ['h', 1]],
columns=['ID', 'Values'])
res = pd.concat([df1, df2], axis=0)\
.sort_values(['ID', 'Values'])\
.drop_duplicates('ID')
print(res)
# ID Values
# 0 a 2.0
# 1 b NaN
# 1 c 5.0
# 3 d NaN
# 2 e 4.0
# 5 f NaN
# 3 g 7.0
# 4 h 1.0
# 8 i NaN
# 9 k NaN
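If the original row order and index of df1 should be kept, a lookup with Series.map is another option. This is a sketch under the assumption that each ID appears only once in df2 (map raises an error on a non-unique lookup index); the name lookup is just illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"ID": list("abcdefghik"), "Values": np.nan})
df2 = pd.DataFrame({"ID": ["a", "c", "e", "g", "h"],
                    "Values": [2, 5, 4, 7, 1]})

# Series indexed by ID, used as a lookup table.
lookup = df2.set_index("ID")["Values"]

# Take the value from df2 where the ID matches, otherwise keep df1's value.
df1["Values"] = df1["ID"].map(lookup).fillna(df1["Values"])
print(df1)
```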
I am not sure if this is possible. I have two dataframes df1 and df2 which are presented like this:
df1 df2
id value id value
a 5 a 3
c 9 b 7
d 4 c 6
f 2 d 8
e 2
f 1
They will have many more entries in reality than presented here. I would like to create a third dataframe df3 based on the values in df1 and df2. Any values in df1 would take precedence over values in df2 when writing to df3 (if the same id is present in both df1 and df2) so in this example I would return:
df3
id value
a 5
b 7
c 9
d 4
e 2
f 2
I have tried using df2 as the base (df2 will have all of the id's present for the whole universe) and then overwriting the value for id's that are present in df1, but cannot find the merge syntax to do this.
You could use combine_first, provided that you first make the DataFrame index id (so that the values get aligned by id):
In [80]: df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
Out[80]:
id value
0 a 5.0
1 b 7.0
2 c 9.0
3 d 4.0
4 e 2.0
5 f 2.0
Since you mentioned merging, you might be interested in seeing that you could merge df1 and df2 on id, and then use fillna to replace the NaNs in df1's value column with values from df2's value column:
df1 = pd.DataFrame({'id': ['a', 'c', 'd', 'f'], 'value': [5, 9, 4, 2]})
df2 = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e', 'f'], 'value': [3, 7, 6, 8, 2, 1]})
result = pd.merge(df2, df1, on='id', how='left', suffixes=('_x', ''))
result['value'] = result['value'].fillna(result['value_x'])
result = result[['id', 'value']]
print(result)
This yields the same result, though the first method is simpler.
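A third possibility, sketched here as an alternative rather than a replacement for the two methods above, is to concatenate with df1 first and keep only the first occurrence of each id, so that df1's values win. This assumes ids are unique within each dataframe.

```python
import pandas as pd

df1 = pd.DataFrame({"id": ["a", "c", "d", "f"], "value": [5, 9, 4, 2]})
df2 = pd.DataFrame({"id": ["a", "b", "c", "d", "e", "f"],
                    "value": [3, 7, 6, 8, 2, 1]})

df3 = (
    pd.concat([df1, df2])                  # df1 rows come first
    .drop_duplicates("id", keep="first")   # so df1 takes precedence
    .sort_values("id")
    .reset_index(drop=True)
)
print(df3)
```

Unlike combine_first, this keeps the value column as integers because no NaN is introduced along the way.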