I have a sample similar to the problem I am running into: company name and revenue for three years, where each year's revenue comes from a different dataset. When I concatenate the data, it looks as follows:
company_name 2020 Revenue 2021 Revenue 2022 Revenue
0 company_1 10.0 NaN NaN
1 company_2 20.0 NaN NaN
2 company_3 30.0 NaN NaN
3 company_1 NaN 20.0 NaN
4 company_2 NaN 30.0 NaN
5 company_3 NaN 40.0 NaN
6 company_1 NaN NaN 50.0
7 company_2 NaN NaN 60.0
8 company_3 NaN NaN 70.0
9 company_4 NaN NaN 80.0
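For reference, a minimal setup that reproduces this frame (calling the three yearly datasets df1, df2, and df3):
import pandas as pd

df1 = pd.DataFrame({'company_name': ['company_1', 'company_2', 'company_3'],
                    '2020 Revenue': [10.0, 20.0, 30.0]})
df2 = pd.DataFrame({'company_name': ['company_1', 'company_2', 'company_3'],
                    '2021 Revenue': [20.0, 30.0, 40.0]})
df3 = pd.DataFrame({'company_name': ['company_1', 'company_2', 'company_3', 'company_4'],
                    '2022 Revenue': [50.0, 60.0, 70.0, 80.0]})
df = pd.concat([df1, df2, df3], ignore_index=True)  # yields the frame above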
What I am trying to do is have the company name followed by the actual revenue columns; in a sense, drop the duplicate company_name rows and fold that data into a single row per company_name. My desired output:
company_name 2020 Revenue 2021 Revenue 2022 Revenue
0 company_1 10 20 50
1 company_2 20 30 60
2 company_3 30 40 70
3 company_4 0 0 80
Use melt and pivot_table: melt the three revenue columns into long form, drop the padding NaNs left by the concat, then pivot back to one row per company:
out = (df.melt('company_name')        # long form: one row per (company, revenue column)
         .dropna()                    # drop the padding NaNs from the concat
         .pivot_table(values='value', index='company_name',
                      columns='variable', fill_value=0)
         .rename_axis(columns=None)
         .reset_index())
print(out)
# Output
company_name 2020 Revenue 2021 Revenue 2022 Revenue
0 company_1 10 20 50
1 company_2 20 30 60
2 company_3 30 40 70
3 company_4 0 0 80
You can try stacking and unstacking; stack drops the NaNs, and unstack spreads the values back out into one row per company:
df.set_index('company_name').stack().unstack().reset_index()
Or take the first non-null value in each column per company:
df.groupby('company_name', as_index=False).first()
Output:
company_name 2020 Revenue 2021 Revenue 2022 Revenue
0 company_1 10.0 20.0 50.0
1 company_2 20.0 30.0 60.0
2 company_3 30.0 40.0 70.0
3 company_4 NaN NaN 80.0
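Note that first() leaves NaN where a company has no figure for a year (company_4 above); to get zeros as in the desired output, append fillna(0):
df.groupby('company_name', as_index=False).first().fillna(0)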
I would say concat might not be the operation you should be using; instead, try a merge: df_merge = pd.merge(df1, df2, how='outer', on='company_name'). Then you can do that again with df_merge (your newly merged data) and the next dataframe. This keeps the rows aligned with each other and only adds the columns the frames do not share. If the frames have more than the two columns you are looking at, you might need to do a little more cleaning of the data to get only the results you are looking for, but this should for the most part get you started with your data all in the correct place.
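A minimal sketch of that chained merge, assuming the three yearly frames are named df1, df2, and df3 (the question doesn't name them):
from functools import reduce
import pandas as pd

# Outer-merge the frames one after another on company_name so that a
# company missing from some years (company_4) still keeps its row.
merged = reduce(
    lambda left, right: pd.merge(left, right, on='company_name', how='outer'),
    [df1, df2, df3],
)
merged = merged.fillna(0)  # optional: zeros instead of NaN, as in the desired output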
I'm working with a large dataset and have the following issue:
Let's say I'm measuring the input of a substance ("sub-input") into a medium ("id"). For each sub-input I have calculated the year in which it will reach the other side of the medium ("y-arrival"). Sometimes several sub-inputs arrive in the same year, and sometimes no substance arrives in a given year.
Example:
import pandas as pd
import numpy as np
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2000,2001,2002,2003,2004,2005]
in1 = [20,40,10,30,50,80,
       60,10,10,40,np.nan,np.nan,
       np.nan,120,30,70,60,90]
arr = [2002,2004,2004,2004,2005,np.nan,
       1991,1992,np.nan,1995,1995,np.nan,
       2001,2002,2004,2004,2005,np.nan]
dictex3 = {"id": ids, "year": year, "sub-input": in1, "y-arrival": arr}
dfex3 = pd.DataFrame(dictex3)
I have then calculated the sum of "sub-input" for each "y-arrival" using the following code:
dfex3["input_sum_tf"] = dfex3.groupby(["id","y-arrival"])["sub-input"].transform(sum)
print(dfex3)
id year sub-input y-arrival input_sum_tf
0 1 2000 20.0 2002.0 20.0
1 1 2001 40.0 2004.0 80.0
2 1 2002 10.0 2004.0 80.0
3 1 2003 30.0 2004.0 80.0
4 1 2004 50.0 2005.0 50.0
5 1 2005 80.0 NaN NaN
6 2 1990 60.0 1991.0 60.0
7 2 1991 10.0 1992.0 10.0
8 2 1992 10.0 NaN NaN
9 2 1993 40.0 1995.0 40.0
10 2 1994 NaN 1995.0 40.0
11 2 1995 NaN NaN NaN
12 3 2000 NaN 2001.0 0.0
13 3 2001 120.0 2002.0 120.0
14 3 2002 30.0 2004.0 100.0
15 3 2003 70.0 2004.0 100.0
16 3 2004 60.0 2005.0 60.0
17 3 2005 90.0 NaN NaN
Now, for each "id" the sum of the inputs that reach the destination at a "y-arrival" has been calculated.
The goal is to reorder these values so that for each id and each year, the sum of the sub-inputs that will arrive in that year can be shown. Example:
id = 1, year = 2000 --> no y-arrival = 2000 --> = NaN
id = 1, year = 2001 --> no y-arrival = 2001 --> = NaN
id = 1, year = 2002 --> y-arrival = 2002 has an input_sum_tf = 20 --> = 20
id = 1, year = 2003 --> no y-arrival = 2003 --> = NaN
id = 1, year = 2004 --> y-arrival = 2004 has an input_sum_tf = 80 --> = 80
The "input_sum_tf" is the sum of the substances that arrive in a given year. The value "80" for year 2004 is the sum of the sub-input from the years 2001, 2002, 2003 because all of these arrive in year 2004 (y-arrival = 2004).
The result ("input_sum") should look like this:
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 50.0
6 NaN
7 60.0
8 10.0
9 NaN
10 NaN
11 40.0
12 NaN
13 NaN
14 120.0
15 NaN
16 100.0
17 60.0
My approach:
I tried solving this by using pandas' merge function on two columns, but the result isn't quite right. So far my code only works for the first 5 rows.
dfex3['input_sum'] = dfex3.merge(dfex3, left_on=['id','y-arrival'],
right_on=['id','year'],
how='right')['input_sum_tf_x']
dfex3["input_sum"]
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 80.0
6 80.0
7 50.0
8 NaN
9 60.0
10 10.0
11 NaN
12 NaN
13 40.0
14 40.0
15 NaN
16 0.0
17 120.0
Any help would be much appreciated!
The issue is that your code is trying to merge on 'year' and 'y-arrival', so it's making multiple matches when you only want one match. E.g. row 4, where year=2004, will match the three rows where y-arrival=2004 (rows 1-3), hence the duplicates of 80 in output rows 4-6.
Use groupby to get the last row for each id/y-arrival combo (it also looks like you don't want matches where 'input_sum_tf' is zero):
df_last = dfex3.groupby(['id', 'y-arrival']).last().reset_index()
df_last = df_last[df_last['input_sum_tf'] != 0]
Then merge:
dfex3.merge(df_last,
left_on=['id', 'year'],
right_on=['id', 'y-arrival'],
how='left')['input_sum_tf_y']
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 50.0
6 NaN
7 60.0
8 10.0
9 NaN
10 NaN
11 40.0
12 NaN
13 NaN
14 120.0
15 NaN
16 100.0
17 60.0
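If you prefer to write the result back as a column without keeping the merged frame around, a minimal alternative sketch is a plain dictionary lookup keyed by (id, y-arrival), reusing df_last from above:
# Map each (id, year) pair to the summed input arriving that year;
# pairs with no arrival fall through to None and become NaN.
lookup = df_last.set_index(['id', 'y-arrival'])['input_sum_tf'].to_dict()
dfex3['input_sum'] = [lookup.get(key) for key in zip(dfex3['id'], dfex3['year'])]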
I have a large dataset from Eurostat. I am trying to move all year columns 1960-2019 to a single column called "year". How can I do this?
A sample of the data:
unit sex age geo\time 2019 2018 2017 2016 2015 2014 ... 1969 1968 1967 1960
0 NR F TOTAL AD 37388.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN
1 NR F TOTAL AL 1432833.0 1431715.0 1423050.0 1417141.0 1424597.0 1430827.0 ... NaN NaN NaN NaN
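A minimal sketch using melt, assuming the frame above is loaded as df (the name is not given in the question):
# Keep the identifier columns and gather every year column into a single
# long-form 'year' column; note the escaped backslash in the literal
# Eurostat header 'geo\time'.
id_cols = ['unit', 'sex', 'age', 'geo\\time']
long_df = df.melt(id_vars=id_cols, var_name='year', value_name='value')
long_df['year'] = long_df['year'].astype(int)  # '2019' -> 2019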
I have the following dataframe:
id value year audit
1 21 2007 NaN
1 36 2008 2011
1 7 2009 NaN
2 44 2007 NaN
2 41 2008 NaN
2 15 2009 NaN
3 51 2007 NaN
3 15 2008 2011
3 51 2009 NaN
4 10 2007 NaN
4 12 2008 NaN
4 24 2009 2011
5 30 2007 2011
5 35 2008 NaN
5 122 2009 NaN
Basically, for each id, I want to create another variable audit2 where every cell is 2011 if at least one audit value for that id is 2011.
I tried putting an if-statement inside a loop, but I could not get any results.
I would like to get this new dataframe
id value year audit audit2
1 21 2007 NaN 2011
1 36 2008 2011 2011
1 7 2009 NaN 2011
2 44 2007 NaN NaN
2 41 2008 NaN NaN
2 15 2009 NaN NaN
3 51 2007 NaN 2011
3 15 2008 2011 2011
3 51 2009 NaN 2011
4 10 2007 NaN 2011
4 12 2008 NaN 2011
4 24 2009 2011 2011
5 30 2007 2011 2011
5 35 2008 NaN 2011
5 122 2009 NaN 2011
Could you help me please?
df.groupby('id')['audit'].transform(lambda s: s[s.first_valid_index()] if s.first_valid_index() is not None else np.nan)
Output:
0 2011.0
1 2011.0
2 2011.0
3 NaN
4 NaN
5 NaN
6 2011.0
7 2011.0
8 2011.0
9 2011.0
10 2011.0
11 2011.0
12 2011.0
13 2011.0
14 2011.0
Name: audit, dtype: float64
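A simpler equivalent sketch: groupby's built-in 'first' already returns the first non-null value per group, so the lambda isn't needed:
df['audit2'] = df.groupby('id')['audit'].transform('first')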
I have a dataframe like this:
import numpy as np
import pandas as pd

index = [0,1,2,3,4,5]
s = pd.Series([1,1,1,2,2,2], index=index)
t = pd.Series([2007,2008,2011,2006,2007,2009], index=index)
f = pd.Series([2,4,6,8,10,12], index=index)
pp = pd.DataFrame(np.c_[s,t,f], columns=["group","year","amount"])
pp
group year amount
0 1 2007 2
1 1 2008 4
2 1 2011 6
3 2 2006 8
4 2 2007 10
5 2 2009 12
I want to add rows for the missing years within each group. My desired dataframe looks like this:
group year amount
0 1.0 2007 2.0
1 1.0 2008 4.0
2 1.0 2009 NaN
3 1.0 2010 NaN
4 1.0 2011 6.0
5 2.0 2006 8.0
6 2.0 2007 10.0
7 2.0 2008 NaN
8 2.0 2009 12.0
Is there any way to do it for a large dataframe?
First change year to datetime:
df.year = pd.to_datetime(df.year, format='%Y')
Then set_index, groupby, and resample by year:
df.set_index('year').groupby('group').amount.resample('Y').mean().reset_index()
group year amount
0 1 2007-12-31 2.0
1 1 2008-12-31 4.0
2 1 2009-12-31 NaN
3 1 2010-12-31 NaN
4 1 2011-12-31 6.0
5 2 2006-12-31 8.0
6 2 2007-12-31 10.0
7 2 2008-12-31 NaN
8 2 2009-12-31 12.0
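If you want plain integer years back rather than year-end timestamps, convert after the resample (out is just my name for the result above):
out = df.set_index('year').groupby('group').amount.resample('Y').mean().reset_index()
out['year'] = out['year'].dt.year  # e.g. 2007-12-31 -> 2007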