I have two tables (as DataFrames) in Python. One is as follows:
Country Year totmigrants
Afghanistan 2000
Afghanistan 2001
Afghanistan 2002
Afghanistan 2003
Afghanistan 2004
Afghanistan 2005
Afghanistan 2006
Afghanistan 2007
Afghanistan 2008
Algeria 2000
Algeria 2001
Algeria 2002
...
Zimbabwe 2008
the other one is for each single year (9 separate DataFrames overall, 2000-2008):
Year=2000
---------------------------------------
Country totmigrants Gender Total
Afghanistan 73 M 70
Afghanistan F 3
Albania 11 M 5
Albania F 6
Algeria 52 M 44
...
Zimbabwe F 1
I want to join them together, with the first table kept in full (all of its rows preserved, as in an outer/left join).
I had this in mind, but it only works for merging on columns:
new=pd.merge(table1,table2,how='left',on=['Country', 'Year'])
What I want is for each yearly DataFrame's total number of migrants, F, and M to appear as new columns in the first table:
Country Year totmigrants F M
Afghanistan 2000 73 3 70
Afghanistan 2001 table3
Afghanistan 2002 table4
Afghanistan 2003 ...
Afghanistan 2004
Afghanistan 2005
Afghanistan 2006
Afghanistan 2007
Afghanistan 2008
Algeria 2000 52 8 44
Algeria 2001 table3 ...
Algeria 2002 table4 ...
...
Zimbabwe 2008 ... ...
Is there a specific method for this merging, or what function do I need to use?
Here's how to combine the data from the yearly DataFrames. Let's assume the yearly DataFrames have been stored in a dictionary:
df = {2000: ..., 2001: ..., ..., 2008: ...}
yearly = []
for N in df.keys():
    # Pivot: Country as index, Gender values become columns, cells hold Total
    tmp = df[N].pivot(index='Country', columns='Gender', values='Total').fillna(0).astype(int)
    tmp['Year'] = N                           # store the year
    tmp['totmigrants'] = tmp['M'] + tmp['F']
    yearly.append(tmp)
df = pd.concat(yearly)
print(df)
#Gender F M Year totmigrants
#Country
#Afghanistan 3 70 2000 73
#Albania 6 5 2000 11
#Algeria 0 44 2000 44
#Zimbabwe 1 0 2000 1
Now you can merge df with the first DataFrame using ['Country', 'Year'] as the keys (after df = df.reset_index(), so that Country is a regular column again).
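A minimal sketch of that final merge step, using made-up stand-ins for the two frames (column names assumed to match the tables above):

```python
import pandas as pd

# Hypothetical stand-ins for the two frames described above
table1 = pd.DataFrame({"Country": ["Afghanistan", "Afghanistan", "Algeria"],
                       "Year": [2000, 2001, 2000]})
df = pd.DataFrame({"Country": ["Afghanistan", "Algeria"],
                   "Year": [2000, 2000],
                   "F": [3, 8], "M": [70, 44],
                   "totmigrants": [73, 52]})

# Left join keeps every (Country, Year) row of the first table;
# years with no yearly data end up as NaN in the new columns
new = pd.merge(table1, df, how="left", on=["Country", "Year"])
```

Rows with no match for that year (e.g. Afghanistan 2001 here) stay in the result with NaN in F, M, and totmigrants.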
I am not sure you need the first table. I did the following; I hope it helps.
import numpy as np
import pandas as pd

data2000 = np.array([['', 'Country', 'totmigrants', 'Gender', 'Total'],
                     ['1', 'Afghanistan', 73, 'M', 70],
                     ['2', 'Afghanistan', None, 'F', 3],
                     ['3', 'Albania', 11, 'M', 5],
                     ['4', 'Albania', None, 'F', 6]])
data2001 = np.array([['', 'Country', 'totmigrants', 'Gender', 'Total'],
                     ['1', 'Afghanistan', 75, 'M', 60],
                     ['2', 'Afghanistan', None, 'F', 15],
                     ['3', 'Albania', 15, 'M', 11],
                     ['4', 'Albania', None, 'F', 4]])
# and so on
datas = {'2000': data2000, '2001': data2001}
reg_dfs = []
for year, data in datas.items():
    df = pd.DataFrame(data=data[1:, 1:],
                      index=data[1:, 0],
                      columns=data[0, 1:])
    # Self-merge on Country to pair each M row with its F row
    new = pd.merge(df, df, how='inner', on=['Country'])
    reg_df = new.query('Gender_x == "M" & Gender_y == "F"')[['Country', 'Total_x', 'Total_y', 'totmigrants_x']]
    reg_df.columns = ['Country', 'M', 'F', 'Total']
    reg_df['Year'] = year
    reg_dfs.append(reg_df)
print(pd.concat(reg_dfs).sort_values(['Country']))
# Country M F Total Year
#1 Afghanistan 70 3 73 2000
#1 Afghanistan 60 15 75 2001
#5 Albania 5 6 11 2000
#5 Albania 11 4 15 2001
I need to find the variability from the long-term mean for monthly data from 1991 to 2021. I have data that looks like this, with shape (204, 3):
dfavgs =
plant_name month power_kwh
0 ARIZONA I 1 10655.989885
1 ARIZONA I 2 9789.542672
2 ARIZONA I 3 7889.403154
3 ARIZONA I 4 7965.595843
4 ARIZONA I 5 9299.316756
.. ... ... ...
199 SANTANA II 8 16753.999870
200 SANTANA II 9 17767.383616
201 SANTANA II 10 17430.005363
202 SANTANA II 11 16628.784139
203 SANTANA II 12 15167.085560
My large monthly-by-year df looks like this, with shape (6137, 4):
dfmonthlys:
plant_name year month power_kwh
0 ARIZONA I 1991 1 9256.304704
1 ARIZONA I 1991 2 8851.689732
2 ARIZONA I 1991 3 7649.949328
3 ARIZONA I 1991 4 6728.544028
4 ARIZONA I 1991 5 8601.165457
... ... ... ...
6132 SANTANA II 2020 9 16481.202361
6133 SANTANA II 2020 10 15644.358737
6134 SANTANA II 2020 11 14368.804306
6135 SANTANA II 2020 12 15473.958468
6136 SANTANA II 2021 1 13161.219086
My new df "dfvar" should look like this, showing the monthly deviation from the long-term mean by year (I don't think the values below are correct):
plant_name year month Var
0 ARIZONA I 1991 1 -0.250259
1 ARIZONA I 1991 2 -0.283032
2 ARIZONA I 1991 3 -0.380370
3 ARIZONA I 1991 4 -0.455002
4 ARIZONA I 1991 5 -0.303324
I could do this easily in MATLAB, but I'm not sure how to do it using pandas, which I need to learn. Thank you very much. I've tried the line below, which gives me a Series, but there are unexpected NaNs in the last rows:
t = dfmonthlys['power_kwh']/dfavgs.loc[:,'power_kwh'] - 1
the output from above looks like this:
t
Out[159]:
0 -0.131352
1 -0.095802
2 -0.030351
3 -0.155299
4 -0.075076
6132 NaN
6133 NaN
6134 NaN
6135 NaN
6136 NaN
Name: power_kwh, Length: 6137, dtype: float64
This is example code for how you could do it: merge dfavgs into the monthly data by month and plant name, then assign the calculation to a new column.
import numpy as np
import pandas as pd

dfavgs = {'plant_name': np.append(np.repeat("ARIZONA I", 12), np.repeat("SANTANA II", 12)),
          'month': np.tile(range(1, 13), 2),
          'mnth_power_kwh': np.concatenate(([10655, 9789, 7889, 7965, 9299],
                                            range(8000, 1500, -1000),
                                            range(12000, 500, -1000)))}
dfavgs = pd.DataFrame(dfavgs)

dfmonthlys = {'plant_name': np.append(np.repeat("ARIZONA I", 24), np.repeat("SANTANA II", 24)),
              'year': np.tile(np.repeat([1991, 1992], 12), 2),
              'month': np.tile(np.tile(range(1, 13), 2), 2),
              'power_kwh': np.concatenate(([9256, 8851, 7649, 6728, 8601],
                                           range(7000, 500, -1000),
                                           range(13000, 1500, -1000),
                                           range(25000, 1500, -1000)))}
dfmonthlys = pd.DataFrame(dfmonthlys)

merg = pd.merge(dfmonthlys, dfavgs, how="left", on=["month", "plant_name"])\
         .assign(diff=lambda x: x["power_kwh"] / x["mnth_power_kwh"] - 1)
print(merg)
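If the long-term means haven't been precomputed, the same deviation can be derived from dfmonthlys alone using groupby/transform; a sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical monthly data: one plant, two years, two months
dfmonthlys = pd.DataFrame({
    "plant_name": ["ARIZONA I"] * 4,
    "year": [1991, 1991, 1992, 1992],
    "month": [1, 2, 1, 2],
    "power_kwh": [9000.0, 8000.0, 11000.0, 10000.0],
})

# Long-term mean per (plant, month), broadcast back onto every row
ltm = dfmonthlys.groupby(["plant_name", "month"])["power_kwh"].transform("mean")
dfmonthlys["Var"] = dfmonthlys["power_kwh"] / ltm - 1
```

transform keeps the original row alignment, so there is no index-mismatch problem like the one that produced the trailing NaNs in the question.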
I have data recorded in the format below.
Input
name year value
Afghanistan 1800 68
Albania 1800 23
Algeria 1800 54
Afghanistan 1801 59
Albania 1801 38
Algeria 1801 72
---
Afghanistan 2040 142
Albania 2040 165
Algeria 2040 120
I would like to resample all of my data, which is recorded for the years 1800 to 2040, to monthly frequency, exactly in the format shown below.
Expected output
name year value
Afghanistan Jan 1800 5.6667
Afghanistan Feb 1800 11.3333
Afghanistan Mar 1800 17.0000
Afghanistan Apr 1800 22.6667
Afghanistan May 1800 28.3333
Afghanistan Jun 1800 34.0000
Afghanistan Jul 1800 39.6667
Afghanistan Aug 1800 45.3333
Afghanistan Sep 1800 51.0000
Afghanistan Oct 1800 56.6667
Afghanistan Nov 1800 62.3333
Afghanistan Dec 1800 68.0000
Albania Jan 1800 1.9167
Albania Feb 1800 3.8333
Albania Mar 1800 5.7500
Albania Apr 1800 7.6667
Albania May 1800 9.5833
Albania Jun 1800 11.5000
Albania Jul 1800 13.4167
Albania Aug 1800 15.3333
Albania Sep 1800 17.2500
Albania Oct 1800 19.1667
Albania Nov 1800 21.0833
Albania Dec 1800 23.0000
Algeria Jan 1800 4.5000
Algeria Feb 1800 9.0000
Algeria Mar 1800 13.5000
Algeria Apr 1800 18.0000
Algeria May 1800 22.5000
Algeria Jun 1800 27.0000
Algeria Jul 1800 31.5000
Algeria Aug 1800 36.0000
Algeria Sep 1800 40.5000
Algeria Oct 1800 45.0000
Algeria Nov 1800 49.5000
Algeria Dec 1800 54.000
I would like my data to look as above for all of the years, i.e. from 1800 to 2040.
The value column is interpolated.
NB: my model expects months as abbreviations, as shown above.
My closest attempt is below, but it did not produce the expected result.
data['year'] = pd.to_datetime(data.year, format='%Y')
data.head(3)
name year value
Afghanistan 1800-01-01 00:00:00 68
Albania 1800-01-01 00:00:00 23
Algeria 1800-01-01 00:00:00 54
resampled = (data.groupby(['name']).apply(lambda x: x.set_index('year').resample('M').interpolate()))
resampled.head(3)
name year name value
Afghanistan 1800-01-31 00:00:00 NaN NaN
1800-02-28 00:00:00 NaN NaN
1800-03-31 00:00:00 NaN NaN
Your thoughts will save me here.
Apart from the imputeTS package for inter- as well as extrapolation, I only use base R in this solution.
res <- do.call(rbind, by(dat, dat$name, function(x) {
  ## expand years to year-months
  ex <- do.call(rbind, lapply(1:nrow(x), function(i) {
    yr <- x$year[i]
    data.frame(name=x$name[1],
               year=seq.Date(as.Date(ISOdate(yr, 1, 1)),
                             as.Date(ISOdate(yr, 12, 31)), "month"),
               value=x$value[i])
  }))
  ## set values to NA except on the 01-01s
  ex[!grepl("01-01", ex$year), "value"] <- NA
  transform(ex,
            ## impute values linearly
            value=imputeTS::na_interpolation(ex$value),
            ## format dates for the desired output
            year=strftime(ex$year, format="%b-%Y"))
}))
Result
res[c(1:3, 13:15, 133:135, 145:147, 265:268, 277:279), ] ## sample rows
# name year value
# A.1 A Jan-1800 71.00000
# A.2 A Feb-1800 73.08333
# A.3 A Mar-1800 75.16667
# A.13 A Jan-1801 96.00000
# A.14 A Feb-1801 93.75000
# A.15 A Mar-1801 91.50000
# B.1 B Jan-1800 87.00000
# B.2 B Feb-1800 83.08333
# B.3 B Mar-1800 79.16667
# B.13 B Jan-1801 40.00000
# B.14 B Feb-1801 40.50000
# B.15 B Mar-1801 41.00000
# C.1 C Jan-1800 47.00000
# C.2 C Feb-1800 49.00000
# C.3 C Mar-1800 51.00000
# C.4 C Apr-1800 53.00000
# C.13 C Jan-1801 71.00000
# C.14 C Feb-1801 72.83333
# C.15 C Mar-1801 74.66667
Data
set.seed(42)
dat <- transform(expand.grid(name=LETTERS[1:3],
year=1800:1810),
value=sample(23:120, 33, replace=TRUE))
Here's a tidyverse approach that also requires the zoo package for the interpolation part.
library(dplyr)
library(tidyr)
library(zoo)
df <- data.frame(country = rep(c("Afghanistan", "Algeria"), each = 3),
                 year = rep(seq(1800, 1802), times = 2),
                 value = rep(seq(3), times = 2),
                 stringsAsFactors = FALSE)
df2 <- df %>%
  # make a grid of all country/year/month possibilities within the years in df
  tidyr::expand(year, month = seq(12)) %>%
  # join that to the original data frame to add back the values
  left_join(., df) %>%
  # put the result in chronological order
  arrange(country, year, month) %>%
  # group by country so the interpolation stays within those sets
  group_by(country) %>%
  # make a version of value that is NA except for Dec, then use na.approx to
  # replace the NAs with linearly interpolated values
  mutate(value_i = ifelse(month == 12, value, NA),
         value_i = zoo::na.approx(value_i, na.rm = FALSE))
Note that the resulting column, value_i, is NA until the first valid observation, in December of the first observed year. So here's what the tail of df2 looks like.
> tail(df2)
# A tibble: 6 x 5
# Groups: country [1]
year month country value value_i
<int> <int> <chr> <int> <dbl>
1 1802 7 Algeria 3 2.58
2 1802 8 Algeria 3 2.67
3 1802 9 Algeria 3 2.75
4 1802 10 Algeria 3 2.83
5 1802 11 Algeria 3 2.92
6 1802 12 Algeria 3 3
If you want to replace those leading NAs, you'd have to do linear extrapolation, which you can do with na.spline from zoo instead. And if you'd rather have the observed values in January instead of December and get trailing instead of leading NAs, just change the relevant bit of the second-to-last line to month == 1.
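For completeness, the same idea ports to pandas; a rough sketch, assuming each yearly value is anchored at December (so, as with the tidyverse version, months before the first December simply don't appear):

```python
import pandas as pd

# Made-up miniature of the question's data: one country, two years
data = pd.DataFrame({"name": ["Afghanistan"] * 2,
                     "year": [1800, 1801],
                     "value": [68.0, 59.0]})

# Anchor each yearly value at December 1st, then upsample to month starts
data["date"] = pd.to_datetime(data["year"].astype(str) + "-12-01")
monthly = (data.set_index("date")
               .groupby("name")["value"]
               .apply(lambda s: s.resample("MS").interpolate())
               .reset_index())
# Format as the month abbreviations the model expects
monthly["month"] = monthly["date"].dt.strftime("%b %Y")
```

The grouped apply keeps the interpolation within each country, so values from one country never bleed into another.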
I have this DataFrame.
df
name timestamp year
0 A 2004 1995
1 D 2008 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
5 C 2007 2003
6 D 2005 2001
7 E 2009 2005
8 A 2018 2009
9 L 2016 2018
Based on the first two entries in df['timestamp'], I am fetching all the values from df['year'] that fall between those two entries, which in this case is 2004-2008.
y1 = df['timestamp'].iloc[0]
y2 = df['timestamp'].iloc[1]
movies = df[df['year'].between(y1, y2,inclusive=True )]
movies
name timestamp year
1 D 2008 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
7 E 2009 2005
This is working fine for me. But when the first entry is the greater value and the second the lower (e.g. 2008-2004), the result is empty.
df
name timestamp year
0 A 2008 1995
1 D 2004 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
5 C 2007 2003
6 D 2005 2001
7 E 2009 2005
8 A 2018 2009
9 L 2016 2018
In this case I fetch nothing.
Expected outcome:
Whichever of the two values is the greater, I should get the in-between values every time.
You could use Series.head and Series.agg:
y1, y2 = df['timestamp'].head(2).agg(['min', 'max'])
movies = df[df['year'].between(y1, y2,inclusive=True )]
[out]
name timestamp year
1 D 2004 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
7 E 2009 2005
You can fix that by changing just two lines of code:
y1 = min(df['timestamp'].iloc[0], df['timestamp'].iloc[1])
y2 = max(df['timestamp'].iloc[0], df['timestamp'].iloc[1])
This way y1 is always less than or equal to y2.
However, as @ALollz pointed out, you can save both computation and coding time by using
y1,y2 = np.sort(df['timestamp'].head(2))
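Put together as a self-contained sketch (with a made-up frame):

```python
import numpy as np
import pandas as pd

# Toy data: first two timestamps deliberately in descending order
df = pd.DataFrame({"timestamp": [2008, 2004, 2005],
                   "year": [2006, 2003, 2009]})

# Sorting the first two timestamps makes the bounds order-independent
y1, y2 = np.sort(df["timestamp"].head(2))
movies = df[df["year"].between(y1, y2)]  # inclusive on both ends by default
```

Only the row with year 2006 falls inside [2004, 2008] here, whichever order the two timestamps arrive in.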
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I would appreciate your help.
I have two pandas DataFrames to merge.
The first DataFrame is
D = { Year, Age, Location, column1, column2... }
2013, 20 , america, ..., ...
2013, 35, usa, ..., ...
2011, 32, asia, ..., ...
2008, 45, japan, ..., ...
shape is 38654rows x 14 columns
second dataframe is
D = { Year, Location, column1, column2... }
2008, usa, ..., ...
2008, usa, ..., ...
2009, asia, ..., ...
2009, asia, ..., ...
2010, japna, ..., ...
shape is 96rows x 7 columns
I want to merge or join the two DataFrames.
How can I do it?
Thanks.
IIUC, you need merge with the parameter how='left' for a left join on the columns Year and Location:
print (df1)
Year Age Location column1 column2
0 2013 20 america 7 5
1 2008 35 usa 8 1
2 2011 32 asia 9 3
3 2008 45 japan 7 1
print (df2)
Year Location column1 column2
0 2008 usa 8 9
1 2008 usa 7 2
2 2009 asia 8 2
3 2009 asia 0 1
4 2010 japna 9 3
df = pd.merge(df1,df2, on=['Year','Location'], how='left')
print (df)
Year Age Location column1_x column2_x column1_y column2_y
0 2013 20 america 7 5 NaN NaN
1 2008 35 usa 8 1 8.0 9.0
2 2008 35 usa 8 1 7.0 2.0
3 2011 32 asia 9 3 NaN NaN
4 2008 45 japan 7 1 NaN NaN
You can also check the documentation.
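If it helps to see which rows actually found a match, merge also takes an indicator flag; a small sketch with toy frames:

```python
import pandas as pd

# Toy frames: only the (2008, usa) row has a match in df2
df1 = pd.DataFrame({"Year": [2013, 2008], "Location": ["america", "usa"], "Age": [20, 35]})
df2 = pd.DataFrame({"Year": [2008, 2010], "Location": ["usa", "japan"], "column1": [8, 9]})

# _merge tells you whether each row matched ('both') or not ('left_only')
df = pd.merge(df1, df2, on=["Year", "Location"], how="left", indicator=True)
```

This makes it easy to spot rows that came through with all-NaN right-hand columns, like america and japan in the question's output.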
I am grouping the following rows.
df = df.groupby(['id','year']).sum().sort_index(ascending=[True, False])
print(df)
amount
id year
1 2009 120
2008 240
2007 240
2006 240
2005 240
2 2014 100
2013 50
2012 50
2011 100
2010 50
2006 100
... ...
Is there a way to add the years that have no values, with the amount equal to zero, down to a specific year (in this case 2005), as shown below?
Expected Output:
amount
id year
1 2015 0
2014 0
2013 0
2012 0
2011 0
2010 0
2009 120
2008 240
2007 240
2006 240
2005 240
2 2015 0
2014 100
2013 50
2012 50
2011 100
2010 50
2009 0
2008 0
2007 0
2006 100
2005 0
... ...
Starting with your first DataFrame, this will add every year that occurs for some id to all of the ids:
df = df.unstack().fillna(0).stack()
e.g.
In [16]: df
Out[16]:
amt
id year
1 2001 1
2002 2
2003 3
2 2002 4
2003 5
2004 6
In [17]: df = df.unstack().fillna(0).stack()
In [18]: df
Out[18]:
amt
id year
1 2001 1
2002 2
2003 3
2004 0
2 2001 0
2002 4
2003 5
2004 6
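Note that unstack can only add years that already appear somewhere in the data; to pad out a fixed range (e.g. up to a year no id has), you can reindex the columns first. A sketch with a made-up frame:

```python
import pandas as pd

# Toy version of the grouped frame: MultiIndex (id, year) with an amt column
df = pd.DataFrame(
    {"amt": [1, 2, 3, 4, 5, 6]},
    index=pd.MultiIndex.from_tuples(
        [(1, 2001), (1, 2002), (1, 2003), (2, 2002), (2, 2003), (2, 2004)],
        names=["id", "year"]),
)

wide = df["amt"].unstack("year")                # years become columns
wide = wide.reindex(columns=range(2001, 2006))  # add years absent from the data
out = wide.fillna(0).astype(int).stack().to_frame("amt")
```

After the reindex, 2005 exists as an all-NaN column, so fillna(0) and stack give every id a zero row for it.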