Combining Rows Based on Column Value - python

I have a sample similar to the problem I am running into: company names and their revenue for 3 years, where the revenue comes from 3 different datasets. When I concatenate the data, it looks as follows:
company_name 2020 Revenue 2021 Revenue 2022 Revenue
0 company_1 10.0 NaN NaN
1 company_2 20.0 NaN NaN
2 company_3 30.0 NaN NaN
3 company_1 NaN 20.0 NaN
4 company_2 NaN 30.0 NaN
5 company_3 NaN 40.0 NaN
6 company_1 NaN NaN 50.0
7 company_2 NaN NaN 60.0
8 company_3 NaN NaN 70.0
9 company_4 NaN NaN 80.0
What I am trying to do is have the company name followed by the actual revenue columns; in a sense, drop the duplicate company_name rows and fold their data into a single row per company. My desired output:
company_name 2020 Revenue 2021 Revenue 2022 Revenue
0 company_1 10 20 50
1 company_2 20 30 60
2 company_3 30 40 70
3 company_4 0 0 80
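For reference, here is a minimal reconstruction of the concatenated frame above (values read off the printed output; the three original yearly frames are not shown in the post):
import pandas as pd

# three hypothetical yearly frames, concatenated into the shape shown above
df = pd.concat([
    pd.DataFrame({'company_name': ['company_1', 'company_2', 'company_3'],
                  '2020 Revenue': [10.0, 20.0, 30.0]}),
    pd.DataFrame({'company_name': ['company_1', 'company_2', 'company_3'],
                  '2021 Revenue': [20.0, 30.0, 40.0]}),
    pd.DataFrame({'company_name': ['company_1', 'company_2', 'company_3', 'company_4'],
                  '2022 Revenue': [50.0, 60.0, 70.0, 80.0]}),
], ignore_index=True)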

Use melt and pivot_table:
# melt to long form, drop the NaN fillers, then pivot back to one row per company
out = (df.melt('company_name').dropna()
         .pivot_table('value', 'company_name', 'variable', fill_value=0)
         .rename_axis(columns=None).reset_index())
print(out)
# Output
company_name 2020 Revenue 2021 Revenue 2022 Revenue
0 company_1 10 20 50
1 company_2 20 30 60
2 company_3 30 40 70
3 company_4 0 0 80

You can try:
df.set_index('company_name').stack().unstack().reset_index()
Or
df.groupby('company_name', as_index=False).first()
Output:
company_name 2020 Revenue 2021 Revenue 2022 Revenue
0 company_1 10.0 20.0 50.0
1 company_2 20.0 30.0 60.0
2 company_3 30.0 40.0 70.0
3 company_4 NaN NaN 80.0
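Note that first() takes the first non-null value per column within each group, so company_4 keeps NaN for 2020 and 2021; to match the zero-filled desired output you can chain fillna (a small addition to the answer above):
df.groupby('company_name', as_index=False).first().fillna(0)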

I would say concat might not be the join you should be using; try a merge instead: df_merge = pd.merge(df1, df2, how='outer', on='company_name') (an outer join, so companies that do not appear in every year, like company_4, are kept). Then you can do the same again with df_merge (your newly merged data) and the next dataframe. This should keep everything aligned and only add the columns the frames do not share. If the frames have more than the two columns you are looking at, you may need a little more cleaning to get only the results you want, but this should for the most part get you started with your data all in the correct place.
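A minimal sketch of that chained merge, assuming the three yearly frames are named df_2020, df_2021 and df_2022 (hypothetical names, not from the post):
import pandas as pd

# merge one pair at a time; 'outer' keeps companies missing from some years
df_merge = pd.merge(df_2020, df_2021, how='outer', on='company_name')
df_merge = pd.merge(df_merge, df_2022, how='outer', on='company_name')
df_merge = df_merge.fillna(0)  # match the zero-filled desired output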

Related

How to apply a function/impute on an interval in Pandas

I have a Pandas dataset with a monthly Date-time index and a column of outstanding orders (like below):
Date        orders
1991-01-01  NaN
1991-02-01  NaN
1991-03-01  24
1991-04-01  NaN
1991-05-01  NaN
1991-06-01  NaN
1991-07-01  NaN
1991-08-01  34
1991-09-01  NaN
1991-10-01  NaN
1991-11-01  22
1991-12-01  NaN
I want to linearly interpolate the values to fill the NaNs, but the interpolation has to be applied within non-rolling 6-month blocks. For example, one 6-month block would be all the rows between 1991-01-01 and 1991-06-01; within each block the imputation runs both forward and backward, anchored to 0 at the block boundaries, so leading and trailing NaNs descend toward a value of 0. For the same dataset above, here is how I would like the end result to look:
Date        orders
1991-01-01  8
1991-02-01  16
1991-03-01  24
1991-04-01  18
1991-05-01  12
1991-06-01  6
1991-07-01  17
1991-08-01  34
1991-09-01  30
1991-10-01  26
1991-11-01  22
1991-12-01  11
I am lost on how to do this in Pandas however. Any ideas?
The idea is to group per 6 months, prepend and append a 0 value to each group, interpolate, and then remove the added first and last 0 values per group:
df['Date'] = pd.to_datetime(df['Date'])
# pad each group with a 0 at both ends, interpolate linearly, then strip the padding
f = lambda x: pd.Series([0] + x.tolist() + [0]).interpolate().iloc[1:-1]
df['orders'] = (df.groupby(pd.Grouper(freq='6MS', key='Date'))['orders']
                  .transform(f))
print(df)
Date orders
0 1991-01-01 8.0
1 1991-02-01 16.0
2 1991-03-01 24.0
3 1991-04-01 18.0
4 1991-05-01 12.0
5 1991-06-01 6.0
6 1991-07-01 17.0
7 1991-08-01 34.0
8 1991-09-01 30.0
9 1991-10-01 26.0
10 1991-11-01 22.0
11 1991-12-01 11.0
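As a quick diagnostic (not part of the original answer), you can inspect which rows fall into each non-rolling 6-month bin the Grouper creates; run after the transform, it prints the interpolated values per block:
for start, grp in df.groupby(pd.Grouper(freq='6MS', key='Date')):
    print(start.date(), grp['orders'].tolist())
# 1991-01-01 [8.0, 16.0, 24.0, 18.0, 12.0, 6.0]
# 1991-07-01 [17.0, 34.0, 30.0, 26.0, 22.0, 11.0]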

Time Diff on vertical dataframe in Python

I have a dataframe, df, that looks like this:
Date Value
10/1/2019 5
10/2/2019 10
10/3/2019 15
10/4/2019 20
10/5/2019 25
10/6/2019 30
10/7/2019 35
I would like to calculate the delta over a 7-day period.
Desired output:
Date Delta
10/1/2019 30
This is what I am doing; a user helped me with a variation of the code below, but it does not work:
df.assign(Delta=df.iloc[0:, 1].sub(df.iloc[6:, 1]),
          Date=pd.Series(pd.date_range(pd.Timestamp('2019-10-01'),
                                       periods=7, freq='7d')))[['Delta', 'Date']]
Any suggestions are appreciated.
Let us try shift with a date offset:
s = df.set_index('Date')['Value']
# shifting the index back 6 days lines each value up with the row from 6 days earlier
df['New'] = s.shift(freq='-6D').reindex(s.index).values
df['DIFF'] = df['New'] - df['Value']
df
Out[39]:
Date Value New DIFF
0 2019-10-01 5 35.0 30.0
1 2019-10-02 10 NaN NaN
2 2019-10-03 15 NaN NaN
3 2019-10-04 20 NaN NaN
4 2019-10-05 25 NaN NaN
5 2019-10-06 30 NaN NaN
6 2019-10-07 35 NaN NaN
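If the rows are guaranteed to be consecutive daily observations, a purely positional sketch gives the same first-row delta (an assumption; the frequency-based shift above is safer when dates have gaps):
df['Delta'] = df['Value'].shift(-6) - df['Value']  # row 0: 35 - 5 = 30, the rest NaN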

Pandas: Merge on 2 columns

I'm working with a large dataset and have the following issue:
Let's say I'm measuring the input of a substance ("sub-input") into a medium ("id"). For each sub-input I have calculated the year in which it is going to reach the other side of the medium ("y-arrival"). Sometimes several sub-inputs arrive in the same year, and sometimes no substance arrives in a year.
Example:
import pandas as pd
import numpy as np

ids = [1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,
        1990,1991,1992,1993,1994,1995,
        2000,2001,2002,2003,2004,2005]
in1 = [20, 40, 10, 30, 50, 80,
       60, 10, 10, 40, np.nan, np.nan,
       np.nan, 120, 30, 70, 60, 90]
arr = [2002, 2004, 2004, 2004, 2005, np.nan,
       1991, 1992, np.nan, 1995, 1995, np.nan,
       2001, 2002, 2004, 2004, 2005, np.nan]
dictex3 = {"id": ids, "year": year, "sub-input": in1, "y-arrival": arr}
dfex3 = pd.DataFrame(dictex3)
I have then calculated the sum of "sub-input" for each "y-arrival" using the following code:
dfex3["input_sum_tf"] = dfex3.groupby(["id","y-arrival"])["sub-input"].transform(sum)
print(dfex3)
id year sub-input y-arrival input_sum_tf
0 1 2000 20.0 2002.0 20.0
1 1 2001 40.0 2004.0 80.0
2 1 2002 10.0 2004.0 80.0
3 1 2003 30.0 2004.0 80.0
4 1 2004 50.0 2005.0 50.0
5 1 2005 80.0 NaN NaN
6 2 1990 60.0 1991.0 60.0
7 2 1991 10.0 1992.0 10.0
8 2 1992 10.0 NaN NaN
9 2 1993 40.0 1995.0 40.0
10 2 1994 NaN 1995.0 40.0
11 2 1995 NaN NaN NaN
12 3 2000 NaN 2001.0 0.0
13 3 2001 120.0 2002.0 120.0
14 3 2002 30.0 2004.0 100.0
15 3 2003 70.0 2004.0 100.0
16 3 2004 60.0 2005.0 60.0
17 3 2005 90.0 NaN NaN
Now, for each "id" the sum of the inputs that reach the destination at a "y-arrival" has been calculated.
The goal is to reorder these values so that for each id and each year, the sum of the sub-inputs that will arrive in that year can be shown. Example:
id = 1, year = 2000 --> no y-arrival = 2000 --> = NaN
id = 1, year = 2001 --> no y-arrival = 2001 --> = NaN
id = 1, year = 2002 --> y-arrival = 2002 has an input_sum_tf = 20 --> = 20
id = 1, year = 2003 --> no y-arrival = 2003 --> = NaN
id = 1, year = 2004 --> y-arrival = 2004 has an input_sum_tf = 80 --> = 80
The "input_sum_tf" is the sum of the substances that arrive in a given year. The value "80" for year 2004 is the sum of the sub-input from the years 2001, 2002, 2003 because all of these arrive in year 2004 (y-arrival = 2004).
The result ("input_sum") should look like this:
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 50.0
6 NaN
7 60.0
8 10.0
9 NaN
10 NaN
11 40.0
12 NaN
13 NaN
14 120.0
15 NaN
16 100.0
17 60.0
My approach:
I tried solving this using the pandas merge function on two columns, but the result isn't quite right: so far my code only works for the first 5 rows.
dfex3['input_sum'] = dfex3.merge(dfex3,
                                 left_on=['id', 'y-arrival'],
                                 right_on=['id', 'year'],
                                 how='right')['input_sum_tf_x']
dfex3["input_sum"]
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 80.0
6 80.0
7 50.0
8 NaN
9 60.0
10 10.0
11 NaN
12 NaN
13 40.0
14 40.0
15 NaN
16 0.0
17 120.0
Any help would be much appreciated!
The issue is that your code merges on 'year' and 'y-arrival', so it's making multiple matches when you only want one match. E.g. row 4, where year=2004, will match 3 times where y-arrival=2004 (rows 1-3), hence the duplicates of 80 in output rows 4-6.
Use groupby to get the last row for each id/y-arrival combination (it also looks like you don't want matches where 'input_sum_tf' is zero):
df_last = dfex3.groupby(['id', 'y-arrival']).last().reset_index()
df_last = df_last[df_last['input_sum_tf'] != 0]
Then merge:
dfex3.merge(df_last,
            left_on=['id', 'year'],
            right_on=['id', 'y-arrival'],
            how='left')['input_sum_tf_y']
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 50.0
6 NaN
7 60.0
8 10.0
9 NaN
10 NaN
11 40.0
12 NaN
13 NaN
14 120.0
15 NaN
16 100.0
17 60.0
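An alternative sketch (not from the original answer) that avoids the second merge: sum arrivals per (id, y-arrival) with min_count=1 so all-NaN groups stay NaN instead of 0 (mirroring the zero filter above), then align with a plain dictionary lookup:
arrivals = (dfex3.groupby(['id', 'y-arrival'])['sub-input']
                 .sum(min_count=1)   # all-NaN groups -> NaN rather than 0
                 .dropna())
lookup = arrivals.to_dict()          # keys are (id, y-arrival) tuples
dfex3['input_sum'] = [lookup.get((i, y), np.nan)
                      for i, y in zip(dfex3['id'], dfex3['year'])]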

Get "Last Purchase Year" from Sales Data Pivot in Pandas

I have pivoted the Customer ID against their year of purchase, so that I know how many times each customer purchased in different years:
Customer ID 1996 1997 ... 2019 2020
100000000000001 7 7 ... NaN NaN
100000000000002 8 8 ... NaN NaN
100000000000003 7 4 ... NaN NaN
100000000000004 NaN NaN ... 21 24
100000000000005 17 11 ... 18 NaN
My desired result is to append columns with the latest year of purchase and, from that, the number of years since their last purchase:
Customer ID 1996 1997 ... 2019 2020 Last Recency
100000000000001 7 7 ... NaN NaN 1997 23
100000000000002 8 8 ... NaN NaN 1997 23
100000000000003 7 4 ... NaN NaN 1997 23
100000000000004 NaN NaN ... 21 24 2020 0
100000000000005 17 11 ... 18 NaN 2019 1
Here is what I tried:
df_pivot["Last"] = 2020
k = 2020
while math.isnan(df_pivot2[k]):
    df_pivot["Last"] = k - 1
    k = k - 1
df_pivot["Recency"] = 2020 - df_pivot["Last"]
However, what I got is "TypeError: cannot convert the series to <class 'float'>".
Could anyone help me to get the result I need?
Thanks a lot!
Dennis
You can get the last year of purchase using notna + cumsum and idxmax along axis=1, then subtract that year from the max year to compute Recency:
# year columns only (labels containing digits)
c = df.filter(regex=r'\d+').columns
# idxmax on the running count of non-NaN cells returns the last non-NaN column label
df['Last'] = df[c].notna().cumsum(1).idxmax(1)
df['Recency'] = c.max() - df['Last']
Customer ID 1996 1997 2019 2020 Last Recency
0 100000000000001 7.0 7.0 NaN NaN 1997 23
1 100000000000002 8.0 8.0 NaN NaN 1997 23
2 100000000000003 7.0 4.0 NaN NaN 1997 23
3 100000000000004 NaN NaN 21.0 24.0 2020 0
4 100000000000005 17.0 11.0 18.0 NaN 2019 1
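An equivalent sketch for the 'Last' column (my alternative, not from the answer) uses last_valid_index, which returns the label of the last non-NaN cell in each row:
df['Last'] = df[c].apply(lambda row: row.last_valid_index(), axis=1)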
One idea is to apply applymap(float) to your DataFrame. See the pandas documentation for DataFrame.applymap.

When I add a grouping result as a new column in a DataFrame, it's not working as expected

The groupby result:
empdf.groupby('deptno')['sal'].max()
deptno
10 5000.0
20 3000.0
30 2850.0
I joined this result to my DataFrame empdf, but the expected result is not coming through. Below are the query and the result.
empdf.assign(maxsal_dept = empdf.groupby('deptno')['sal'].max())
   empno  ename  job        mgr     hiredate             sal     comm   deptno  totalsal  rnk  dnsrnk  maxsal_dept
0  7839   KING   PRESIDENT  NaN     1981-11-17 00:00:00  5000.0  50.0   10      5050.0    1    1       NaN
1  7698   BLAKE  MANAGER    7839.0  1981-05-01 00:00:00  2850.0  285.0  30      3135.0    5    4       NaN
2  7782   CLARK  MANAGER    7839.0  1981-06-09 00:00:00  2450.0  24.5   10      2474.5    6    5       NaN
3  7566   JONES  MANAGER    7839.0  1981-04-02 00:00:00  2975.0  NaN    20      2975.0    4    3       NaN
4  7788   SCOTT  ANALYST    7566.0  1987-04-19 00:00:00  3000.0  NaN    20      3000.0    2    2       NaN
5  7902   FORD   ANALYST    7566.0  1981-12-03 00:00:00  3000.0  NaN    20      3000.0    3    2       NaN
6  7369   SMITH  CLERK      7902.0  1980-12-17 00:00:00   800.0  NaN    20       800.0   14   12       NaN
I want to add this grouped result to the DataFrame as a new column, but it's not giving the right result: the maxsal_dept column above is all NaN.
You have to use transform. The groupby().max() result is indexed by deptno (10, 20, 30), so assigning it directly aligns on the DataFrame's row index (0, 1, 2, ...) and produces NaN everywhere; transform broadcasts the group maximum back onto the original rows. Try the snippet below:
empdf['maxsal_dept'] = empdf.groupby('deptno')['sal'].transform('max')
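An equivalent sketch that makes the alignment explicit (my alternative, not from the answer): compute the per-department maximum once, then look it up via the deptno column with map:
maxsal = empdf.groupby('deptno')['sal'].max()       # Series indexed by deptno
empdf['maxsal_dept'] = empdf['deptno'].map(maxsal)  # broadcast back by deptno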
