i have a table in pandas df
id product_1 count
1 100 10
2 200 20
3 100 30
4 400 40
5 500 50
6 200 60
7 100 70
also i have another table in dataframe df2
product score
100 5
200 10
300 15
400 20
500 25
600 30
700 35
i have to create a new column score in my first df, taking values of score from df2 with respect to product_1.
my final output should be. df =
id product_1 count score
1 100 10 5
2 200 20 10
3 100 30 5
4 400 40 20
5 500 50 25
6 200 60 10
7 100 70 5
Any ideas how to achieve it?
Use map:
df['score'] = df['product_1'].map(df2.set_index('product')['score'].to_dict())
print (df)
id product_1 count score
0 1 100 10 5
1 2 200 20 10
2 3 100 30 5
3 4 400 40 20
4 5 500 50 25
5 6 200 60 10
6 7 100 70 5
Or merge:
df = pd.merge(df,df2, left_on='product_1', right_on='product', how='left')
print (df)
id product_1 count product score
0 1 100 10 100 5
1 2 200 20 200 10
2 3 100 30 100 5
3 4 400 40 400 20
4 5 500 50 500 25
5 6 200 60 200 10
6 7 100 70 100 5
EDIT by comment:
df['score'] = df['product_1'].map(df2.set_index('product')['score'].to_dict())
df['final_score'] = (df['count'].mul(0.6).div(df.id)).add(df.score.mul(0.4))
print (df)
id product_1 count score final_score
0 1 100 10 5 8.0
1 2 200 20 10 10.0
2 3 100 30 5 8.0
3 4 400 40 20 14.0
4 5 500 50 25 16.0
5 6 200 60 10 10.0
6 7 100 70 5 8.0
Related
I’ve been trying to code python equivalent of excel sumif
Excel:
Sumif($A$1:$A$20,A1,$C$1:$C$20)
enter code here
Pandas df:
A C Term
1 10 1
1 20 2
1 10 3
1 10 4
2 30 5
2 30 6
2 30 7
3 20 8
3 10 9
3 10 10
3 10 11
3 10 12
Output df - I want output df with ‘fwdSum’ as follows
—————————
A C Term fwdSum
1 10 1 50
1 20 2 50
1 10 3 50
1 10 4 50
2 30 5 90
2 30 6 90
2 30 7 90
3 20 8 60
3 10 9 60
3 10 10 60
3 10 11 60
3 10 12 60
I tried creating another df with groupby and sum and then later merge
Please can anyone suggest the best Way to achieve this?
df['fwdSum'] = df.groupby('A')['C'].transform('sum')
print(df)
Prints:
A C Term fwdSum
0 1 10 1 50
1 1 20 2 50
2 1 10 3 50
3 1 10 4 50
4 2 30 5 90
5 2 30 6 90
6 2 30 7 90
7 3 20 8 60
8 3 10 9 60
9 3 10 10 60
10 3 10 11 60
11 3 10 12 60
It seemed I had a simple problem of pivoting a pandas Table, but unfortunately, the problem seems a bit complicated to me.
I am providing a tiny sample table and the output I am looking to give the example of the problem I am facing:
Say, I have a table like this:
df =
AF BF AT BT
1 4 100 70
2 7 102 66
3 11 200 90
4 13 300 178
5 18 403 200
So I need it into a wide/pivot format but the parameter name in each case will be set as the same. ( I am not looking to subset the string if possible)
My output table should like the following:
dfout =
PAR F T
A 1 100
B 4 70
A 2 102
B 7 66
A 3 200
B 11 90
A 4 300
B 13 178
A 5 403
B 18 200
I tried pivoting, but not able to achieve the desired output. Any help will be immensely appreciated. Thanks.
You can use pandas wide_to_long, but first you have to reorder the columns:
pd.wide_to_long(
df.rename(columns=lambda x: x[::-1]).reset_index(),
stubnames=["F", "T"],
i="index",
sep="",
j="PAR",
suffix=".",
).reset_index("PAR")
PAR F T
index
0 A 1 100
1 A 2 102
2 A 3 200
3 A 4 300
4 A 5 403
0 B 4 70
1 B 7 66
2 B 11 90
3 B 13 178
4 B 18 200
Alternatively, you could use the pivot_longer function from the pyjanitor, to reshape the data :
# pip install pyjanitor
import janitor
df.pivot_longer(names_to=("PAR", ".value"), names_pattern=r"(.)(.)")
PAR F T
0 A 1 100
1 B 4 70
2 A 2 102
3 B 7 66
4 A 3 200
5 B 11 90
6 A 4 300
7 B 13 178
8 A 5 403
9 B 18 200
Update: Using data from #jezrael:
df
C AF BF AT BT
0 10 1 4 100 70
1 20 2 7 102 66
2 30 3 11 200 90
3 40 4 13 300 178
4 50 5 18 403 200
pd.wide_to_long(
df.rename(columns=lambda x: x[::-1]),
stubnames=["F", "T"],
i="C",
sep="",
j="PAR",
suffix=".",
).reset_index()
C PAR F T
0 10 A 1 100
1 20 A 2 102
2 30 A 3 200
3 40 A 4 300
4 50 A 5 403
5 10 B 4 70
6 20 B 7 66
7 30 B 11 90
8 40 B 13 178
9 50 B 18 200
if you use the pivot_longer function:
df.pivot_longer(index="C", names_to=("PAR", ".value"), names_pattern=r"(.)(.)")
C PAR F T
0 10 A 1 100
1 10 B 4 70
2 20 A 2 102
3 20 B 7 66
4 30 A 3 200
5 30 B 11 90
6 40 A 4 300
7 40 B 13 178
8 50 A 5 403
9 50 B 18 200
pivot_longer is being worked on; in the next release of pyjanitor it should be much better. But pd.wide_to_long can solve your task pretty easily. The other answers can easily solve it as well.
Idea is create MultiIndex in columns by first and last letter and then use DataFrame.stack for reshape, last some data cleaning in MultiIndex in index:
df.columns= [df.columns.str[-1], df.columns.str[0]]
df = df.stack().reset_index(level=0, drop=True).rename_axis('PAR').reset_index()
print (df)
PAR F T
0 A 1 100
1 B 4 70
2 A 2 102
3 B 7 66
4 A 3 200
5 B 11 90
6 A 4 300
7 B 13 178
8 A 5 403
9 B 18 200
EDIT:
print (df)
C AF BF AT BT
0 10 1 4 100 70
1 20 2 7 102 66
2 30 3 11 200 90
3 40 4 13 300 178
4 50 5 18 403 200
df = df.set_index('C')
df.columns = pd.MultiIndex.from_arrays([df.columns.str[-1],
df.columns.str[0]], names=[None,'PAR'])
df = df.stack().reset_index()
print (df)
C PAR F T
0 10 A 1 100
1 10 B 4 70
2 20 A 2 102
3 20 B 7 66
4 30 A 3 200
5 30 B 11 90
6 40 A 4 300
7 40 B 13 178
8 50 A 5 403
9 50 B 18 200
Let's try:
(pd.wide_to_long(df.reset_index(),stubnames=['A','B'],
i='index',
j='PAR', sep='', suffix='[FT]')
.stack().unstack('PAR').reset_index(level=1)
)
Output:
PAR level_1 F T
index
0 A 1 100
0 B 4 70
1 A 2 102
1 B 7 66
2 A 3 200
2 B 11 90
3 A 4 300
3 B 13 178
4 A 5 403
4 B 18 200
I have a data frame as shown below. which is a sales data of two health care product starting from December 2016 to November 2018.
product profit bougt_date discount salary
A 50 2016-12-01 5 25
A 50 2017-01-03 4 20
B 200 2016-12-24 10 100
A 50 2017-01-18 3 0
B 200 2017-01-28 15 80
A 50 2017-01-18 6 15
B 200 2017-01-28 20 0
A 50 2017-04-18 6 0
B 200 2017-12-08 25 0
A 50 2017-11-18 6 20
B 200 2017-08-21 20 90
B 200 2017-12-28 30 110
A 50 2018-03-18 10 0
B 300 2018-06-08 45 100
B 300 2018-09-20 50 60
A 50 2018-11-18 8 45
B 300 2018-11-28 35 0
From the above I would like replace 0 salary with groupby median of the the column product.
Explanation:
A : 15, 20, 20, 25, 45
So the median = 20.
B : 60, 80, 90, 100, 100, 110
So the median = 95.
Expected Output
product profit bougt_date discount salary
A 50 2016-12-01 5 25
A 50 2017-01-03 4 20
B 200 2016-12-24 10 100
A 50 2017-01-18 3 20
B 200 2017-01-28 15 80
A 50 2017-01-18 6 15
B 200 2017-01-28 20 95
A 50 2017-04-18 6 20
B 200 2017-12-08 25 95
A 50 2017-11-18 6 20
B 200 2017-08-21 20 90
B 200 2017-12-28 30 110
A 50 2018-03-18 10 20
B 300 2018-06-08 45 100
B 300 2018-09-20 50 60
A 50 2018-11-18 8 45
B 300 2018-11-28 35 95
You can try this using masking 0 values using pd.Series.mask and use np.nanmedian here.
fill_vals = df.salary.mask(df.salary.eq(0)).groupby(df['product']).transform(np.nanmedian)
df.assign(salary = df.salary.mask(df.salary.eq(0), fill_vals))
product profit bougt_date discount salary
0 A 50 2016-12-01 5 25
1 A 50 2017-01-03 4 20
2 B 200 2016-12-24 10 100
3 A 50 2017-01-18 3 20
4 B 200 2017-01-28 15 80
5 A 50 2017-01-18 6 15
6 B 200 2017-01-28 20 95
7 A 50 2017-04-18 6 20
8 B 200 2017-12-08 25 95
9 A 50 2017-11-18 6 20
10 B 200 2017-08-21 20 90
11 B 200 2017-12-28 30 110
12 A 50 2018-03-18 10 20
13 B 300 2018-06-08 45 100
14 B 300 2018-09-20 50 60
15 A 50 2018-11-18 8 45
16 B 300 2018-11-28 35 95
OR
Using np.where
df['salary'] = (np.where(df['salary']==0,df['salary'].replace(0,np.nan).
groupby(df['product']).transform('median'),df['salary']))
first use .groupby and .transform the column to shown the grouped by median. Finally, locate salaries that are 0 with .loc and set them equal to the median salary.
#NOTE - the below line of code uses `median` instead of `np.nanmedian`. These will return different results...
#To anyone reading this, please know which one to use according to your situation...
#As you can see the outputs are different between Chester's answer and mine.
df.loc[df['salary'] == 0, 'salary'] = df.groupby('product')['salary'].transform('median')
df
output:
product profit bougt_date discount salary
0 A 50 2016-12-01 5 25.0
1 A 50 2017-01-03 4 20.0
2 B 200 2016-12-24 10 100.0
3 A 50 2017-01-18 3 17.5
4 B 200 2017-01-28 15 80.0
5 A 50 2017-01-18 6 15.0
6 B 200 2017-01-28 20 80.0
7 A 50 2017-04-18 6 17.5
8 B 200 2017-12-08 25 80.0
9 A 50 2017-11-18 6 20.0
10 B 200 2017-08-21 20 90.0
11 B 200 2017-12-28 30 110.0
12 A 50 2018-03-18 10 17.5
13 B 300 2018-06-08 45 100.0
14 B 300 2018-09-20 50 60.0
15 A 50 2018-11-18 8 45.0
16 B 300 2018-11-28 35 80.0
df
Name Run ID1 ID2
0 A 18 100 500
1 B 19 150 550
2 C 18 200 600
3 D 15 250 650
I then have a variable named max_runs = 20
What I want to do is get the data into this format below. Essentially copy each unique row max_runs - df['Run'] times
df_output
Name Run ID1 ID2
1 A 19 100 500
2 A 20 100 500
3 B 20 150 550
4 C 19 200 600
5 C 20 200 600
6 D 16 250 650
7 D 17 250 650
8 D 18 250 650
9 D 19 250 650
10 D 20 250 650
Thanks for any help and let me know if I need to explain further
You can use repeat to repeat the rows and assign to modify the new run:
(df.loc[df.index.repeat(20-df.Run)]
.assign(Run=lambda x: x.groupby(level=0).cumcount().add(x.Run+1))
.reset_index()
)
Output:
index Name Run ID1 ID2
0 0 A 19 100 500
1 0 A 20 100 500
2 1 B 20 150 550
3 2 C 19 200 600
4 2 C 20 200 600
5 3 D 16 250 650
6 3 D 17 250 650
7 3 D 18 250 650
8 3 D 19 250 650
9 3 D 20 250 650
I have the following dataframe in python:
months = [1,2,3,4,5,6,7,8,9,10,11,12]
data1 = [100,200,300,400,500,600,700,800,900,1000,1100,1200]
df = pd.DataFrame({
'month' : months,
'd1' : data1,
'd2' : 0,
});
and I want to calculate the column d2, in the following way:
month d1 d2
0 1 100 101.0
1 2 200 303.0
2 3 300 606.0
3 4 400 1010.0
4 5 500 1515.0
5 6 600 2121.0
6 7 700 2828.0
7 8 800 3636.0
8 9 900 4545.0
9 10 1000 5555.0
10 11 1100 6666.0
11 12 1200 7878.0
I am doing it in the following way:
df['d2'] = (df['d2'].shift(1) + df['d1']) + df['month']
but the result is not what was expected:
month d1 d2
0 1 100 NaN
1 2 200 202.0
2 3 300 303.0
3 4 400 404.0
4 5 500 505.0
5 6 600 606.0
6 7 700 707.0
7 8 800 808.0
8 9 900 909.0
9 10 1000 1010.0
10 11 1100 1111.0
11 12 1200 1212.0
I do not know if I am clear in my request, I thank who can help me.
IIUC, you're looking for cumsum:
df['d2'] = (df.d1+df.month).cumsum()
>>> df
month d1 d2
0 1 100 101
1 2 200 303
2 3 300 606
3 4 400 1010
4 5 500 1515
5 6 600 2121
6 7 700 2828
7 8 800 3636
8 9 900 4545
9 10 1000 5555
10 11 1100 6666
11 12 1200 7878
What you need is cumulative sum :)
df['d2'] = df.d1.cumsum()
print(df)
month d1 d2
0 1 100 100
1 2 200 300
2 3 300 600
3 4 400 1000
4 5 500 1500
5 6 600 2100
6 7 700 2800
7 8 800 3600
8 9 900 4500
9 10 1000 5500
10 11 1100 6600
11 12 1200 7800