Python: Merge on 2 columns - python

I'm working with a large dataset. The following is an example, calculated with a smaller dataset.
In this example I have pollution measurements for 3 rivers over different timespans. Each year, the amount of pollution in a river is measured at a measuring station downstream ("pollution"). It has already been calculated in which year the river water was polluted upstream ("year_of_upstream_pollution"). My goal is to create a new column ["result_of_upstream_pollution"] that contains the amount of pollution measured in the "year_of_upstream_pollution". For this, the data from the "pollution" column has to be reassigned.
import numpy as np
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2000,2001,2002,2003,2004,2005]
y1 = [2002,2002,2003,2005,2005,np.nan,1991,1992,1993,1994,np.nan,np.nan,2002,2002,2003,2004,2005,np.nan]
poll = [10,14,20,11,8,11,
20,22,20,25,18,21,
30,19,15,10,26,28]
dictr1 ={"river_id":ids,"year":year,"pollution": poll,"year_of_upstream_pollution":y1}
dfr1 = pd.DataFrame(dictr1)
print(dfr1)
river_id year pollution year_of_upstream_pollution
0 1 2000 10 2002.0
1 1 2001 14 2002.0
2 1 2002 20 2003.0
3 1 2003 11 2005.0
4 1 2004 8 2005.0
5 1 2005 11 NaN
6 2 1990 20 1991.0
7 2 1991 22 1992.0
8 2 1992 20 1993.0
9 2 1993 25 1994.0
10 2 1994 18 NaN
11 2 1995 21 NaN
12 3 2000 30 2002.0
13 3 2001 19 2002.0
14 3 2002 15 2003.0
15 3 2003 10 2004.0
16 3 2004 26 2005.0
17 3 2005 28 NaN
Example: river_id = 1, year = 2000, year_of_upstream_pollution = 2002
value of the pollution-column in year 2002 = 20
Therefore: result_of_upstream_pollution = 20
The resulting column should look like this:
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 NaN
6 22.0
7 20.0
8 25.0
9 18.0
10 NaN
11 NaN
12 15.0
13 15.0
14 10.0
15 26.0
16 28.0
17 NaN
My own approach:
### My approach
# Split dfr1 in two
dfr3 = pd.DataFrame(dfr1, columns = ["river_id","year","pollution"])
dfr4 = pd.DataFrame(dfr1, columns = ["river_id","year_of_upstream_pollution"])
# Merge the two dataframes on the "year" and "year_of_upstream_pollution"-column
arrayr= dfr4.merge(dfr3, left_on = "year_of_upstream_pollution", right_on = "year", how = "left").pollution.values
listr = arrayr.tolist()
dfr1["result_of_upstream_pollution"] = listr
print(dfr1)
len(listr) # = 28
This results in the following ValueError:
"Length of values does not match length of index"
My explanation for this is that the values in the "year" column of "dfr3" are not unique across rivers, so several rows are matched to each year, which explains why len(listr) = 28 instead of 18.
I haven't been able to find a way around this error yet. Please keep in mind that the real dataset is much larger than this one. Any help would be much appreciated!
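The row count can be checked directly. The sketch below rebuilds the example frame from the printed table and repeats the merge on the year alone, which matches rows across rivers. The `validate="many_to_one"` argument is one way to make pandas raise instead of silently duplicating rows. The names `left`, `right` and `duplicate_keys_detected` are made up for this illustration, and the year column is cast to float only so that both merge keys share a dtype.

```python
import numpy as np
import pandas as pd
from pandas.errors import MergeError

dfr1 = pd.DataFrame({
    "river_id": [1] * 6 + [2] * 6 + [3] * 6,
    "year": list(range(2000, 2006)) + list(range(1990, 1996)) + list(range(2000, 2006)),
    "pollution": [10, 14, 20, 11, 8, 11, 20, 22, 20, 25, 18, 21, 30, 19, 15, 10, 26, 28],
    "year_of_upstream_pollution": [2002, 2002, 2003, 2005, 2005, np.nan,
                                   1991, 1992, 1993, 1994, np.nan, np.nan,
                                   2002, 2002, 2003, 2004, 2005, np.nan],
})

left = dfr1[["river_id", "year_of_upstream_pollution"]]
# Cast the key to float so it has the same dtype as the NaN-containing column.
right = dfr1[["river_id", "year", "pollution"]].astype({"year": float})

# Merging on the year alone: years 2000-2005 exist for river 1 and river 3,
# so every such row on the left is matched twice.
merged = left.merge(right, left_on="year_of_upstream_pollution",
                    right_on="year", how="left")
print(len(dfr1), len(merged))

# validate="many_to_one" asks pandas to check that the right-hand keys are unique.
try:
    left.merge(right, left_on="year_of_upstream_pollution", right_on="year",
               how="left", validate="many_to_one")
    duplicate_keys_detected = False
except MergeError:
    duplicate_keys_detected = True
print(duplicate_keys_detected)
```

Adding "river_id" to the merge keys makes each right-hand key unique, which is why a merge on two columns keeps the original number of rows.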

As you said in the title, this is a merge on two columns:
dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
right_on=['river_id','year_of_upstream_pollution'],
how='right')['pollution_x']
print(dfr1)
Output:
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 NaN
6 22.0
7 20.0
8 25.0
9 18.0
10 NaN
11 NaN
12 15.0
13 15.0
14 10.0
15 26.0
16 28.0
17 NaN

I just realized that this solution doesn't seem to be working for me.
When I execute the code, this is what happens:
dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
right_on=['river_id','year_of_upstream_pollution'],
how='right')['pollution_x']
print(dfr1)
river_id year pollution year_of_upstream_pollution \
0 1 2000 10 2002.0
1 1 2001 14 2002.0
2 1 2002 20 2003.0
3 1 2003 11 2005.0
4 1 2004 8 2005.0
5 1 2005 11 NaN
6 2 1990 20 1991.0
7 2 1991 22 1992.0
8 2 1992 20 1993.0
9 2 1993 25 1994.0
10 2 1994 18 NaN
11 2 1995 21 NaN
12 3 2000 30 2002.0
13 3 2001 19 2002.0
14 3 2002 15 2003.0
15 3 2003 10 2004.0
16 3 2004 26 2005.0
17 3 2005 28 NaN
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 22.0
6 20.0
7 25.0
8 18.0
9 15.0
10 15.0
11 10.0
12 26.0
13 28.0
14 NaN
15 NaN
16 NaN
17 NaN
For some reason, this code doesn't seem to handle the NaN values correctly.
If there is a NaN value in the "year_of_upstream_pollution" column, there shouldn't be a value in "result_of_upstream_pollution".
Likewise, rows 14, 15 and 16 all have a "year_of_upstream_pollution" with matching data in the "pollution" column and should therefore have values in the result column.
On top of that, it seems that all rows after the first NaN (at index 5) are assigned the wrong values.
@Quang Hoang Thank you very much for trying to solve my problem! Could you maybe explain why my results differ from yours?
Does anyone know how I can get this code to work?
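A possible explanation, offered with some caution: the row order returned by a right merge has not been the same in every pandas version, so assigning the merged column back by position can shift values, with the unmatched rows ending up at the bottom. One way to avoid depending on that is a left merge against a small lookup table, because a left merge preserves the order of the left frame. This is a minimal sketch and not the answerer's code. The name `lookup` is made up for illustration, and it assumes each (river_id, year) pair occurs only once.

```python
import numpy as np
import pandas as pd

dfr1 = pd.DataFrame({
    "river_id": [1] * 6 + [2] * 6 + [3] * 6,
    "year": list(range(2000, 2006)) + list(range(1990, 1996)) + list(range(2000, 2006)),
    "pollution": [10, 14, 20, 11, 8, 11, 20, 22, 20, 25, 18, 21, 30, 19, 15, 10, 26, 28],
    "year_of_upstream_pollution": [2002, 2002, 2003, 2005, 2005, np.nan,
                                   1991, 1992, 1993, 1994, np.nan, np.nan,
                                   2002, 2002, 2003, 2004, 2005, np.nan],
})

# Lookup table: for every (river_id, year) the pollution measured in that year.
# The columns are renamed so they line up with the keys we want to look up.
lookup = (
    dfr1[["river_id", "year", "pollution"]]
    .rename(columns={"year": "year_of_upstream_pollution",
                     "pollution": "result_of_upstream_pollution"})
    .astype({"year_of_upstream_pollution": float})  # same dtype as the NaN column
)

# Left merge on both keys; validate guards against duplicated lookup keys.
result = dfr1.merge(lookup, on=["river_id", "year_of_upstream_pollution"],
                    how="left", validate="many_to_one")
print(result)
```

Rows whose "year_of_upstream_pollution" is NaN find no partner in `lookup` and therefore get NaN in the new column.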

Related

Python Program to Sum of each Row and each Column of a Matrix

I want to create a Python script that calculates the sum of each row and each column of a table (like a matrix).
content of "budget.txt" file:
Budget Jan Feb Mar Apr May Jun Sum
Milk 10 20 31 52 7 11
Eggs 1 5 1 16 4 58
Bread 22 36 17 8 21 16
Butter 4 5 8 11 36 2
Total
The script should read the budget.txt file and fill in the results in the "Sum" column and the "Total" row.
Here is my code:
import sys
budget_file = sys.argv[1]
df = open(budget_file).read()
print(df)
Output: I can read the file. Now my question is: how do I sum the values row-wise and column-wise?
Given your dataframe, I would approach the calculations in two steps:
Compute the Sum of each row of the data frame (see How To Sum DataFrame Rows)
Compute the Sum of each Column
Given the following dataframe:
Budget Jan Feb Mar Apr May Jun
0 Milk 10 20 31 52 7 11
1 Eggs 1 5 1 16 4 58
2 Bread 22 36 17 8 21 16
3 Butter 4 5 8 11 36 2
To Compute the Sum across the rows of a dataframe
#list columns to sum across
df['Sum'] = df[list(df.columns)[1:]].sum(axis=1)
This produces the following df:
Budget Jan Feb Mar Apr May Jun Sum
0 Milk 10.0 20.0 31.0 52.0 7.0 11.0 131.0
1 Eggs 1.0 5.0 1.0 16.0 4.0 58.0 85.0
2 Bread 22.0 36.0 17.0 8.0 21.0 16.0 120.0
3 Butter 4.0 5.0 8.0 11.0 36.0 2.0 66.0
To add a row with the sum of all columns:
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
df = pd.concat([df, df.sum().to_frame('Total').T])
This will yield a dataframe like:
Budget Jan Feb Mar Apr May Jun Sum
0 Milk 10 20 31 52 7 11 131.0
1 Eggs 1 5 1 16 4 58 85.0
2 Bread 22 36 17 8 21 16 120.0
3 Butter 4 5 8 11 36 2 66.0
Total MilkEggsBreadButter 37 66 57 87 68 87 402.0
If the string summation under budget is bothersome you can reset that specific cell to an empty string.
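Since the question starts from a plain text file, here is a minimal end-to-end sketch under a few assumptions: the file is whitespace-separated, the header ends with an empty "Sum" column, and the bare "Total" line carries no numbers. The text is embedded as a string (`BUDGET_TXT`, a made-up name) so the example runs on its own; with a real file you would read its contents instead. Summing only the numeric columns also avoids the concatenated string in the "Budget" cell.

```python
import pandas as pd

BUDGET_TXT = """\
Budget Jan Feb Mar Apr May Jun Sum
Milk 10 20 31 52 7 11
Eggs 1 5 1 16 4 58
Bread 22 36 17 8 21 16
Butter 4 5 8 11 36 2
Total
"""

# Split every non-empty line into whitespace-separated fields.
rows = [line.split() for line in BUDGET_TXT.splitlines() if line.strip()]
header, body = rows[0], rows[1:]
months = header[1:-1]  # drop the "Budget" label and the empty "Sum" column
items = [r for r in body if len(r) == len(months) + 1]  # skips the bare "Total" line

budget = pd.DataFrame(items, columns=["Budget"] + months)
budget = budget.astype({m: int for m in months})

# Row-wise sum over the month columns.
budget["Sum"] = budget[months].sum(axis=1)

# Column-wise sum over the numeric columns only, appended as a "Total" row.
total = {"Budget": "Total", **budget[months + ["Sum"]].sum().to_dict()}
budget = pd.concat([budget, pd.DataFrame([total])], ignore_index=True)
print(budget)
```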

Replace last value(s) of group with NaN

My goal is to replace the last value (or the last several values) of each id with NaN. My real dataset is quite large and has groups of different sizes.
Example:
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2010,2011,2012,2013,2014,2015]
percent = [120,70,37,40,50,110,140,100,90,5,52,80,60,40,70,60,50,110]
dictex ={"id":ids,"year":year,"percent [%]": percent}
dfex = pd.DataFrame(dictex)
print(dfex)
id year percent [%]
0 1 2000 120
1 1 2001 70
2 1 2002 37
3 1 2003 40
4 1 2004 50
5 1 2005 110
6 2 1990 140
7 2 1991 100
8 2 1992 90
9 2 1993 5
10 2 1994 52
11 2 1995 80
12 3 2010 60
13 3 2011 40
14 3 2012 70
15 3 2013 60
16 3 2014 50
17 3 2015 110
My goal is to replace the last 1 / or 2 / or 3 values of the "percent [%]" column for each id (group) with NaN.
The result should look like this: (here: replace the last 2 values of each id)
id year percent [%]
0 1 2000 120
1 1 2001 70
2 1 2002 37
3 1 2003 40
4 1 2004 NaN
5 1 2005 NaN
6 2 1990 140
7 2 1991 100
8 2 1992 90
9 2 1993 5
10 2 1994 NaN
11 2 1995 NaN
12 3 2010 60
13 3 2011 40
14 3 2012 70
15 3 2013 60
16 3 2014 NaN
17 3 2015 NaN
I know there should be a relatively easy solution for this, but I'm new to Python and simply haven't been able to figure out an elegant way.
Thanks for the help!
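Try using groupby, tail and index to find the index labels of the rows to modify, then use loc to change the values:
import numpy as np
nrows = 2  # number of trailing rows per id to replace
dfex['percent [%]'] = dfex['percent [%]'].astype(float)  # make room for NaN
idx = dfex.groupby('id').tail(nrows).index
dfex.loc[idx, 'percent [%]'] = np.nan
# output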
id year percent [%]
0 1 2000 120.0
1 1 2001 70.0
2 1 2002 37.0
3 1 2003 40.0
4 1 2004 NaN
5 1 2005 NaN
6 2 1990 140.0
7 2 1991 100.0
8 2 1992 90.0
9 2 1993 5.0
10 2 1994 NaN
11 2 1995 NaN
12 3 2010 60.0
13 3 2011 40.0
14 3 2012 70.0
15 3 2013 60.0
16 3 2014 NaN
17 3 2015 NaN
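A variant that does not go through index labels, sketched here as an alternative rather than a correction: `groupby(...).cumcount(ascending=False)` numbers the rows of each group from the end (0 for the last row), and `Series.mask` replaces the flagged rows with NaN. The names `n_last` and `rows_from_end` are made up for this example. Because it works by position within each group, it also behaves when the index contains duplicate labels.

```python
import pandas as pd

dfex = pd.DataFrame({
    "id": [1] * 6 + [2] * 6 + [3] * 6,
    "year": list(range(2000, 2006)) + list(range(1990, 1996)) + list(range(2010, 2016)),
    "percent [%]": [120, 70, 37, 40, 50, 110, 140, 100, 90, 5, 52, 80,
                    60, 40, 70, 60, 50, 110],
})

n_last = 2  # how many trailing rows per id to blank out

# Distance of each row from the end of its group: 0 = last row, 1 = second to last, ...
rows_from_end = dfex.groupby("id").cumcount(ascending=False)

# mask() puts NaN wherever the condition is True (the column becomes float).
dfex["percent [%]"] = dfex["percent [%]"].mask(rows_from_end < n_last)
print(dfex)
```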

How to order DataFrame in python with special structure of date?

I have a DataFrame in python, the cell value is the purchase quantity like:
code 1/18 2/18 3/18 4/18 5/18
1 NaN 15 15 16 14
2 NaN NaN 30 23 24
3 24 21 23 NaN 26
I want to order the code in terms of the date they were first purchased, the result would be:
code 1/18 2/18 3/18 4/18 5/18
3 24 21 23 NaN 26
1 NaN 15 15 16 14
2 NaN NaN 30 23 24
Please help!
I think you need to specify the columns to sort by, here all columns except the first:
print (df.columns[1:].tolist())
['1/18', '2/18', '3/18', '4/18', '5/18']
df = df.sort_values(by=df.columns[1:].tolist())
print (df)
code 1/18 2/18 3/18 4/18 5/18
2 3 24.0 21.0 23 NaN 26
0 1 NaN 15.0 15 16.0 14
1 2 NaN NaN 30 23.0 24
If the first column is the index:
print (df.columns.tolist())
['1/18', '2/18', '3/18', '4/18', '5/18']
df = df.sort_values(by=df.columns.tolist())
print (df)
1/18 2/18 3/18 4/18 5/18
code
3 24.0 21.0 23 NaN 26
1 NaN 15.0 15 16.0 14
2 NaN NaN 30 23.0 24
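If the intent is strictly "order by the first month with a purchase" and the quantities should not influence the order, a sketch along these lines makes that explicit. It assumes the date columns are already in chronological order from left to right; `first_pos` and `ordered` are names invented for this example. Rows without any purchase are pushed to the end.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "code": [1, 2, 3],
    "1/18": [np.nan, np.nan, 24],
    "2/18": [15, np.nan, 21],
    "3/18": [15, 30, 23],
    "4/18": [16, 23, np.nan],
    "5/18": [14, 24, 26],
})

date_cols = df.columns[1:]
has_purchase = df[date_cols].notna().to_numpy()

# Position of the first non-NaN month in each row (argmax finds the first True).
first_pos = has_purchase.argmax(axis=1)
# Rows with no purchase at all would report 0, so move them past the last column.
first_pos = np.where(has_purchase.any(axis=1), first_pos, len(date_cols))

# A stable sort keeps the original order among codes that start in the same month.
ordered = df.iloc[np.argsort(first_pos, kind="stable")]
print(ordered)
```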

Pandas - insert rows where data is missing

I have a dataset, here is an example:
import pandas as pd
df = pd.DataFrame({"Seconds_left":[5,10,15,25,30,35,5,10,15,30], "Team":["ATL","ATL","ATL","ATL","ATL","ATL","SAS","SAS","SAS","SAS"], "Fouls": [1,2,3,3,4,5,5,4,1,1]})
Fouls Seconds_left Team
0 1 5 ATL
1 2 10 ATL
2 3 15 ATL
3 3 25 ATL
4 4 30 ATL
5 5 35 ATL
6 5 5 SAS
7 4 10 SAS
8 1 15 SAS
9 1 30 SAS
Now I would like to insert rows where data in the Seconds_left column is missing:
Id Fouls Seconds_left Team
0 1 5 ATL
1 2 10 ATL
2 3 15 ATL
3 NaN 20 ATL
4 3 25 ATL
5 4 30 ATL
6 5 35 ATL
7 5 5 SAS
8 4 10 SAS
9 1 15 SAS
10 NaN 20 SAS
11 NaN 25 SAS
12 1 30 SAS
13 NaN 35 SAS
I already tried reindexing, but that does not work because Seconds_left contains duplicates.
Does anybody have an idea how to solve this?
Thanks!
Create a MultiIndex and reindex + reset_index:
idx = pd.MultiIndex.from_product([df['Team'].unique(),
np.arange(5, df['Seconds_left'].max()+1, 5)],
names=['Team', 'Seconds_left'])
df.set_index(['Team', 'Seconds_left']).reindex(idx).reset_index()
Out:
Team Seconds_left Fouls
0 ATL 5 1.0
1 ATL 10 2.0
2 ATL 15 3.0
3 ATL 20 NaN
4 ATL 25 3.0
5 ATL 30 4.0
6 ATL 35 5.0
7 SAS 5 5.0
8 SAS 10 4.0
9 SAS 15 1.0
10 SAS 20 NaN
11 SAS 25 NaN
12 SAS 30 1.0
13 SAS 35 NaN
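To connect this with the failed attempt from the question: reindexing on "Seconds_left" alone fails because the same value appears once per team, while the pair (Team, Seconds_left) is unique. The sketch below shows both cases side by side. The names `grid`, `plain_reindex_failed` and `filled` are made up for this example, and it assumes the grid runs in steps of 5 up to the largest observed value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Seconds_left": [5, 10, 15, 25, 30, 35, 5, 10, 15, 30],
    "Team": ["ATL"] * 6 + ["SAS"] * 4,
    "Fouls": [1, 2, 3, 3, 4, 5, 5, 4, 1, 1],
})
grid = np.arange(5, df["Seconds_left"].max() + 1, 5)

# A single-level index on Seconds_left has duplicate labels, so reindex raises.
try:
    df.set_index("Seconds_left").reindex(grid)
    plain_reindex_failed = False
except ValueError:
    plain_reindex_failed = True

# The (Team, Seconds_left) pairs are unique, so the MultiIndex version works.
idx = pd.MultiIndex.from_product([df["Team"].unique(), grid],
                                 names=["Team", "Seconds_left"])
filled = df.set_index(["Team", "Seconds_left"]).reindex(idx).reset_index()
print(plain_reindex_failed)
print(filled)
```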
An approach using groupby and merge:
df_left = pd.DataFrame({'Seconds_left':[5,10,15,20,25,30,35]})
df_out = df.groupby('Team', as_index=False).apply(lambda x: x.merge(df_left, how='right', on='Seconds_left'))
df_out['Team'] = df_out['Team'].ffill()  # fillna(method='ffill') is deprecated in recent pandas
df_out = df_out.reset_index(drop=True).sort_values(by=['Team','Seconds_left'])
print(df_out)
Output:
Fouls Seconds_left Team
0 1.0 5 ATL
1 2.0 10 ATL
2 3.0 15 ATL
6 NaN 20 ATL
3 3.0 25 ATL
4 4.0 30 ATL
5 5.0 35 ATL
7 5.0 5 SAS
8 4.0 10 SAS
9 1.0 15 SAS
11 NaN 20 SAS
12 NaN 25 SAS
10 1.0 30 SAS
13 NaN 35 SAS
import pandas as pd
import numpy as np
df = pd.DataFrame(columns = ['a', 'b'])
df.loc[len(df)] = [1, np.nan]

Division in pandas not working as it should

I have two dataframes each with one column. I'm pasting them exactly as they print below:
Top: (it has no column name because it is the result of Top = Df1.groupby('col1')['att1'].diff().dropna())
1 15.566667
3 5.066667
5 57.266667
7 -10.366667
9 18.966667
11 50.966667
13 -5.633333
15 -14.266667
17 18.933333
19 3.100000
21 35.966667
23 -17.566667
25 -8.066667
27 -6.366667
29 7.133333
31 -2.633333
33 3.333333
35 -23.800000
37 2.333333
39 -53.533333
41 -17.300000
dtype: float64
Bottom: which is the result of Bottom = np.sqrt(Df2.groupby('ID')['Col2'].sum()/n)
ID
12868123 1.029001
757E13D7 1.432014
79731492 2.912770
799EFB29 1.826576
7D44062A 1.736757
7D4C0E2F 1.943503
7DBA169D 0.650023
7E558E2B 1.256287
7E8B3815 1.491974
7EB80123 0.558717
7FFB607D 1.505221
8065A321 1.809937
80EFE91B 2.064825
811F1B1E 0.992645
82B67C94 0.980618
833C27AE 0.969195
83957B28 0.469914
8447B85D 1.477168
84877498 0.872973
8569499D 2.215307
8617B7D9 1.033294
Name: Col2, dtype: float64
I want to divide the values of those two columns by each other:
Top/Bottom
I get the following:
1 NaN
3 NaN
5 NaN
7 NaN
9 NaN
11 NaN
13 NaN
15 NaN
17 NaN
19 NaN
21 NaN
23 NaN
25 NaN
27 NaN
29 NaN
31 NaN
33 NaN
35 NaN
37 NaN
39 NaN
41 NaN
12868123 NaN
757E13D7 NaN
79731492 NaN
799EFB29 NaN
7D44062A NaN
7D4C0E2F NaN
7DBA169D NaN
7E558E2B NaN
7E8B3815 NaN
7EB80123 NaN
7FFB607D NaN
8065A321 NaN
80EFE91B NaN
811F1B1E NaN
82B67C94 NaN
833C27AE NaN
83957B28 NaN
8447B85D NaN
84877498 NaN
8569499D NaN
8617B7D9 NaN
dtype: float64
I tried resetting the index column, it didn't help. Not sure why it's not working.
The problem is the different index values: arithmetic operations align Series by index, so you need to convert one operand to a NumPy array with .values:
print (Top/Bottom.values)
1 15.127942
3 3.538141
5 19.660552
7 -5.675464
9 10.920737
11 26.224126
13 -8.666359
15 -11.356216
17 12.690123
19 5.548426
21 23.894609
23 -9.705679
25 -3.906707
27 -6.413841
29 7.274324
31 -2.717031
33 7.093496
35 -16.111911
37 2.672858
39 -24.165198
41 -16.742573
Name: col, dtype: float64
Solution with div:
print (Top.div(Bottom.values))
1 15.127942
3 3.538141
5 19.660552
7 -5.675464
9 10.920737
11 26.224126
13 -8.666359
15 -11.356216
17 12.690123
19 5.548426
21 23.894609
23 -9.705679
25 -3.906707
27 -6.413841
29 7.274324
31 -2.717031
33 7.093496
35 -16.111911
37 2.672858
39 -24.165198
41 -16.742573
dtype: float64
But if you assign the index of one Series to the other, you can use:
Top.index = Bottom.index
print (Top/Bottom)
ID
12868123 15.127942
757E13D7 3.538141
79731492 19.660552
799EFB29 -5.675464
7D44062A 10.920737
7D4C0E2F 26.224126
7DBA169D -8.666359
7E558E2B -11.356216
7E8B3815 12.690123
7EB80123 5.548426
7FFB607D 23.894609
8065A321 -9.705679
80EFE91B -3.906707
811F1B1E -6.413841
82B67C94 7.274324
833C27AE -2.717031
83957B28 7.093496
8447B85D -16.111911
84877498 2.672858
8569499D -24.165198
8617B7D9 -16.742573
dtype: float64
And if you get an error like:
ValueError: operands could not be broadcast together with shapes (20,) (21,)
the problem is that the two Series have different lengths.
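The alignment behaviour is easy to reproduce on a toy example. The two Series below are invented for illustration and share no index labels, so dividing them directly yields only NaN, while dividing by the underlying array works position by position.

```python
import pandas as pd

top = pd.Series([10.0, 20.0, 30.0], index=[1, 3, 5])
bottom = pd.Series([2.0, 4.0, 5.0], index=[100, 101, 102])

# Label-based alignment: no label exists in both Series, so every result is NaN
# and the result index is the union of both indexes.
aligned = top / bottom
print(aligned)

# Positional division: the array has no index, so values are paired by position.
positional = top / bottom.to_numpy()
print(positional)
```

The positional form keeps the index of `top` and requires both operands to have the same length.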
I arrived here because I was looking for how to divide a column by a subset of itself.
I found a solution which is not reported here.
Suppose you have a df like
d = {'mycol1':[0,0,1,1,2,2],'mycol2':[1,2,3,6,4,8]}
df = pd.DataFrame(data=d)
i.e.
mycol1 mycol2
0 0 1
1 0 2
2 1 3
3 1 6
4 2 4
5 2 8
And now you want to divide mycol2 by a subset composed of its first two values:
df['mycol2'].div(df[df['mycol1']==0.0]['mycol2'])
will result in
0 1.0
1 1.0
2 NaN
3 NaN
4 NaN
5 NaN
because of the index problem reported by jezreal.
The solution is to simply use concat to concatenate the subset to match the length of the original df.
Nrows = df[df['mycol1']==0.0]['mycol2'].shape[0]
Nrows_tot = df['mycol2'].shape[0]
times_longer = int(Nrows_tot/Nrows)
df['mycol3'] = df['mycol2'].div(pd.concat([df[df['mycol1']==0.0]['mycol2']]*times_longer,ignore_index=True))
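The same idea can be written with NumPy arrays, which sidesteps index alignment altogether. This is a sketch of an alternative, not the code above: `np.tile` repeats the subset values until they cover the whole column, and it assumes the column length is an exact multiple of the subset length. The names `subset`, `repeats` and `remainder` are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"mycol1": [0, 0, 1, 1, 2, 2], "mycol2": [1, 2, 3, 6, 4, 8]})

# Values of mycol2 in the rows where mycol1 == 0, as a plain array (no index).
subset = df.loc[df["mycol1"] == 0, "mycol2"].to_numpy()

repeats, remainder = divmod(len(df), len(subset))
if remainder:
    raise ValueError("column length is not a multiple of the subset length")

# Tile the subset to the full length and divide position by position.
df["mycol3"] = df["mycol2"].to_numpy() / np.tile(subset, repeats)
print(df)
```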
