Resampling data monthly in R or Python

I have data recorded in the format below.
Input
name year value
Afghanistan 1800 68
Albania 1800 23
Algeria 1800 54
Afghanistan 1801 59
Albania 1801 38
Algeria 1801 72
---
Afghanistan 2040 142
Albania 2040 165
Algeria 2040 120
I would like to resample all of my data, which is recorded for the years 1800 to 2040, to a monthly frequency, in exactly the format shown below.
Expected output
name year value
Afghanistan Jan 1800 5.6667
Afghanistan Feb 1800 11.3333
Afghanistan Mar 1800 17.0000
Afghanistan Apr 1800 22.6667
Afghanistan May 1800 28.3333
Afghanistan Jun 1800 34.0000
Afghanistan Jul 1800 39.6667
Afghanistan Aug 1800 45.3333
Afghanistan Sep 1800 51.0000
Afghanistan Oct 1800 56.6667
Afghanistan Nov 1800 62.3333
Afghanistan Dec 1800 68.0000
Albania Jan 1800 1.9167
Albania Feb 1800 3.8333
Albania Mar 1800 5.7500
Albania Apr 1800 7.6667
Albania May 1800 9.5833
Albania Jun 1800 11.5000
Albania Jul 1800 13.4167
Albania Aug 1800 15.3333
Albania Sep 1800 17.2500
Albania Oct 1800 19.1667
Albania Nov 1800 21.0833
Albania Dec 1800 23.0000
Algeria Jan 1800 4.5000
Algeria Feb 1800 9.0000
Algeria Mar 1800 13.5000
Algeria Apr 1800 18.0000
Algeria May 1800 22.5000
Algeria Jun 1800 27.0000
Algeria Jul 1800 31.5000
Algeria Aug 1800 36.0000
Algeria Sep 1800 40.5000
Algeria Oct 1800 45.0000
Algeria Nov 1800 49.5000
Algeria Dec 1800 54.0000
I would like my data to look like the above for all of the years, i.e. from 1800 to 2040.
The value column is interpolated.
NB: my model accepts months as abbreviations, as shown above.
My closest attempt is below, but it did not produce the expected result.
data['year'] = pd.to_datetime(data.year, format='%Y')
data.head(3)
name year value
Afghanistan 1800-01-01 00:00:00 68
Albania 1800-01-01 00:00:00 23
Algeria 1800-01-01 00:00:00 54
resampled = (data.groupby(['name']).apply(lambda x: x.set_index('year').resample('M').interpolate()))
resampled.head(3)
name year name value
Afghanistan 1800-01-31 00:00:00 NaN NaN
1800-02-28 00:00:00 NaN NaN
1800-03-31 00:00:00 NaN NaN
Your thoughts will save me here.
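For reference, a minimal pandas sketch of one way to repair the attempt above (with made-up miniature data; `MS` keeps month-start stamps so the January value survives, and the default linear interpolation runs between the yearly anchors, treating the monthly points as equally spaced):

```python
import pandas as pd

# hypothetical miniature of the data in the question
data = pd.DataFrame({
    'name':  ['Afghanistan', 'Albania', 'Afghanistan', 'Albania'],
    'year':  [1800, 1800, 1801, 1801],
    'value': [68, 23, 59, 38],
})
data['year'] = pd.to_datetime(data['year'], format='%Y')

resampled = (data.set_index('year')
                 .groupby('name')['value']
                 # asfreq inserts the missing months as NaN,
                 # interpolate then fills them linearly
                 .apply(lambda s: s.resample('MS').asfreq().interpolate())
                 .reset_index())
resampled['year'] = resampled['year'].dt.strftime('%b %Y')
```

Note this interpolates between consecutive yearly values rather than up from zero, so the first observed year comes out flat between its two anchors instead of ramping from value/12 as in the expected output above.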

Apart from the imputeTS package for interpolation (as well as extrapolation), this solution uses only base R.
res <- do.call(rbind, by(dat, dat$name, function(x) {
  ## expand years to year-months
  ex <- do.call(rbind, lapply(1:nrow(x), function(i) {
    yr <- x$year[i]
    data.frame(name=x$name[1],
               year=seq.Date(as.Date(ISOdate(yr, 1, 1)),
                             as.Date(ISOdate(yr, 12, 31)), "month"),
               value=x$value[i])
  }))
  ## set values to NA except on 01-01s
  ex[!grepl("01-01", ex$year), "value"] <- NA
  transform(ex,
            ## impute values linearly
            value=imputeTS::na_interpolation(ex$value),
            ## format dates for desired output
            year=strftime(ex$year, format="%b-%Y"))
}))
Result
res[c(1:3, 13:15, 133:135, 145:147, 265:268, 277:279), ] ## sample rows
# name year value
# A.1 A Jan-1800 71.00000
# A.2 A Feb-1800 73.08333
# A.3 A Mar-1800 75.16667
# A.13 A Jan-1801 96.00000
# A.14 A Feb-1801 93.75000
# A.15 A Mar-1801 91.50000
# B.1 B Jan-1800 87.00000
# B.2 B Feb-1800 83.08333
# B.3 B Mar-1800 79.16667
# B.13 B Jan-1801 40.00000
# B.14 B Feb-1801 40.50000
# B.15 B Mar-1801 41.00000
# C.1 C Jan-1800 47.00000
# C.2 C Feb-1800 49.00000
# C.3 C Mar-1800 51.00000
# C.4 C Apr-1800 53.00000
# C.13 C Jan-1801 71.00000
# C.14 C Feb-1801 72.83333
# C.15 C Mar-1801 74.66667
Data
set.seed(42)
dat <- transform(expand.grid(name=LETTERS[1:3],
                             year=1800:1810),
                 value=sample(23:120, 33, replace=TRUE))

Here's a tidyverse approach that also requires the zoo package for the interpolation part.
library(dplyr)
library(tidyr)
library(zoo)

df <- data.frame(country = rep(c("Afghanistan", "Algeria"), each = 3),
                 year = rep(seq(1800, 1802), times = 2),
                 value = rep(seq(3), times = 2),
                 stringsAsFactors = FALSE)

df2 <- df %>%
  # make a grid of all year/month possibilities within the years in df
  tidyr::expand(year, month = seq(12)) %>%
  # join that to the original data frame to add back countries and values
  left_join(., df) %>%
  # put the result in chronological order
  arrange(country, year, month) %>%
  # group by country so the interpolation stays within those sets
  group_by(country) %>%
  # make a version of value that is NA except for Dec, then use na.approx to
  # replace the NAs with linearly interpolated values
  mutate(value_i = ifelse(month == 12, value, NA),
         value_i = zoo::na.approx(value_i, na.rm = FALSE))
Note that the resulting column, value_i, is NA until the first valid observation, in December of the first observed year. So here's what the tail of df2 looks like.
> tail(df2)
# A tibble: 6 x 5
# Groups: country [1]
year month country value value_i
<int> <int> <chr> <int> <dbl>
1 1802 7 Algeria 3 2.58
2 1802 8 Algeria 3 2.67
3 1802 9 Algeria 3 2.75
4 1802 10 Algeria 3 2.83
5 1802 11 Algeria 3 2.92
6 1802 12 Algeria 3 3
If you want to replace those leading NAs, you'd have to do linear extrapolation, which you can do with na.spline from zoo instead. And if you'd rather have the observed values in January instead of December and get trailing instead of leading NAs, just change the relevant bit of the second-to-last line to month == 1.


Manipulating Pandas data frame to get into long format

So I have a df that looks like this:
Year  code   Country  Quan1jan  Quan2jan  Quan1feb  Quan2feb
2020  08123  Japan    500       26        400       28
2020  08123  Taiwan   450       245       4500      87
And I would like for it to look like this:
Year  month  code   Country  Quan1  Quan2
2020  jan    08123  Japan    500    26
2020  feb    08123  Japan    400    28
2020  jan    08123  Taiwan   450    245
2020  feb    08123  Taiwan   4500   87
It doesn’t matter if the data follows this same order, but I need it to be in this format.
I've tried to play around with melt and unstack, with no luck. Any help is very much appreciated.
Use wide_to_long:
pd.wide_to_long(
    df,
    ['Quan1', 'Quan2'],
    i=['Year', 'code', 'Country'],
    j='month',
    suffix=r'\w+'
).reset_index()
# Year code Country month Quan1 Quan2
#0 2020 8123 Japan jan 500 26
#1 2020 8123 Japan feb 400 28
#2 2020 8123 Taiwan jan 450 245
#3 2020 8123 Taiwan feb 4500 87
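If you prefer to stay with melt, which the question mentions trying, the same reshape can be sketched by splitting the combined column names by hand (this assumes the `Quan1`/`Quan2` prefix is always exactly five characters, as in the sample data):

```python
import pandas as pd

# frame matching the question's sample data
df = pd.DataFrame({
    'Year': [2020, 2020],
    'code': ['08123', '08123'],
    'Country': ['Japan', 'Taiwan'],
    'Quan1jan': [500, 450],
    'Quan2jan': [26, 245],
    'Quan1feb': [400, 4500],
    'Quan2feb': [28, 87],
})

# melt to long form, then split 'Quan1jan' into measure and month
long = df.melt(id_vars=['Year', 'code', 'Country'])
long['month'] = long['variable'].str[5:]   # 'jan' / 'feb'
long['quan'] = long['variable'].str[:5]    # 'Quan1' / 'Quan2'

# pivot the two measures back out into their own columns
out = (long.pivot_table(index=['Year', 'code', 'Country', 'month'],
                        columns='quan', values='value')
           .reset_index())
```

wide_to_long is the more direct tool here; the melt route is mainly useful when the column names need custom parsing that the suffix regex can't express.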

Transform dataframe values with multilevel indices to single column

I would like to ask your advice.
How can I transform the first dataframe into the second one, below?
Continent, Country, and Location are the names of the column index levels.
Polution_Level would be added as the column name for the values in the first dataframe.
Continent Asia Asia Africa Europe
Country Japan China Mozambique Portugal
Location Tokyo Shanghai Maputo Lisbon
Date
01 Jan 20 250 435 45 137
02 Jan 20 252 457 43 144
03 Jan 20 253 463 42 138
Continent Country Location Date Polution_Level
Asia Japan Tokyo 01 Jan 20 250
Asia Japan Tokyo 02 Jan 20 252
Asia Japan Tokyo 03 Jan 20 253
...
Europe Portugal Lisbon 03 Jan 20 138
Thank you.
The following should do what you want.
Modules
import io
import pandas as pd
Create data
df = pd.read_csv(io.StringIO("""
Continent Asia Asia Africa Europe
Country Japan China Mozambique Portugal
Location Tokyo Shanghai Maputo Lisbon
Date
01 Jan 20 250 435 45 137
02 Jan 20 252 457 43 144
03 Jan 20 253 463 42 138
"""), sep="\s\s+", engine="python", header=[0,1,2], index_col=[0])
Verify multiindex
df.columns
MultiIndex([( 'Asia', 'Japan', 'Tokyo'),
( 'Asia', 'China', 'Shanghai'),
('Africa', 'Mozambique', 'Maputo'),
('Europe', 'Portugal', 'Lisbon')],
names=['Continent', 'Country', 'Location'])
Transpose table and stack values
ndf = df.T.stack().reset_index()
ndf = ndf.rename({0: 'Polution_Level'}, axis=1)  # rename returns a copy, so assign it back
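As a self-contained check, the same transpose-and-stack step with the MultiIndex built directly (values copied from the question's sample table):

```python
import pandas as pd

# reconstruct the wide frame with three column index levels
cols = pd.MultiIndex.from_tuples(
    [('Asia', 'Japan', 'Tokyo'), ('Asia', 'China', 'Shanghai'),
     ('Africa', 'Mozambique', 'Maputo'), ('Europe', 'Portugal', 'Lisbon')],
    names=['Continent', 'Country', 'Location'])
df = pd.DataFrame([[250, 435, 45, 137],
                   [252, 457, 43, 144],
                   [253, 463, 42, 138]],
                  index=pd.Index(['01 Jan 20', '02 Jan 20', '03 Jan 20'],
                                 name='Date'),
                  columns=cols)

# transpose so the column levels become row levels, stack Date in,
# then flatten everything into ordinary columns
ndf = (df.T.stack().reset_index()
         .rename({0: 'Polution_Level'}, axis=1))
```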

Filling in missing data Python

I have a lot of missing data in between years and months of my dataframe that looks like:
Year Month State Value
1969 12 NJ 5500
1969 12 NY 6418
1970 8 IL 10093
1970 12 WI 6430
1970 7 NY 6140
1971 10 IL 10093
1971 6 MN 6850
1971 3 SC 7686
1972 12 FL 8772
2016 1 NJ 9000
For each state, I need to fill in all the missing data from the year the values began until 2018. The only data that exists is mostly between 1969 and 1990, so I just need to fill in the blanks.
The desired output (for NJ but needed for all states) would be:
Year Month State Value
1969 12 NJ 5500
1970 1 NJ 5500
1970 2 NJ 5500
1970 3 NJ 5500
1970 4 NJ 5500
1970 5 NJ 5500
1970 6 NJ 5500
.
.
1970 12 NJ 5500
.
.
2010 1 NJ 5500
2010 2 NJ 5500
2010 3 NJ 5500
.
.
2018 1 NJ 9000
I've tried turning the months into categorical values ranging from 1 to 12, regrouping and resetting the index, and then using ffill to propagate the values into the newly created rows:
df['Month'] = pd.Categorical(df['Month'], categories=range(1, 13))
df = df.groupby(['State', 'Year', 'Month']).first().reset_index()
df['Value'] = df.groupby('Region')['Value'].ffill()
But this method gives me NaN values like:
State Year Month Value
NJ 1969 12 5500.0
NJ 1970 1 nan
NJ 1970 2 nan
NJ 1970 3 nan
.
.
NJ 2016 1 9000.0
I can't understand why this method has worked before as I've tested it on other data with actual results.
Sorry to all those who took time to correct this. It was a simple matter of accidentally grouping by the wrong column.
I had previously created a 'Region' column, built from groups of the State values, and grouped by that rather than by the States themselves.
So to clarify:
df['Value'] = df.groupby('Region')['Value'].ffill()
Needs to be changed into:
df['Value'] = df.groupby('State')['Value'].ffill()
This method works correctly.
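Putting it together, a minimal sketch of the corrected pipeline (two made-up NJ rows; `observed=False` is spelled out so the categorical months are expanded on newer pandas versions too):

```python
import pandas as pd

# two hypothetical NJ observations
df = pd.DataFrame({'Year': [1969, 1970], 'Month': [12, 6],
                   'State': ['NJ', 'NJ'], 'Value': [5500, 9000]})

# a categorical Month makes groupby emit all 12 months per State/Year
df['Month'] = pd.Categorical(df['Month'], categories=range(1, 13))
full = (df.groupby(['State', 'Year', 'Month'], observed=False)
          .first().reset_index())

# forward-fill within each State, not within the accidental 'Region'
full['Value'] = full.groupby('State')['Value'].ffill()
```

Months before the first observation stay NaN, since there is nothing earlier to fill forward from.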

Pandas/Python - Generating a chart [duplicate]

This question already has an answer here:
matplotlib bar chart with dates
(1 answer)
Closed 4 years ago.
So I want to generate a chart from a CSV data file, and I've been following a guide, but I can't seem to manipulate my code to get what I want.
So here is what I have so far:
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd
import sys
import matplotlib

df = pd.read_csv("TB_burden_countries_2018-03-06.csv")
df = df.set_index(['country'])
df2 = df.loc["Zimbabwe", "e_mort_num"]
df = pd.DataFrame(data=df2, columns=["e_mort_num"])
df.columns = ["Mortality"]
print(df2)
This code was just so I can choose a specific country (Zimbabwe) and look at its mortality number (e_mort_num). What could I write to generate a chart? I've been using this tutorial: http://pbpython.com/simple-graphing-pandas.html, but I'm having trouble manipulating variable names, as I'm not too sure what I should be doing. If you require more information, please say so. Thank you for your help!
CSV bit of interest:
Country Year Mortality
Zimbabwe 2000 20000
Zimbabwe 2001 18000
Zimbabwe 2002 17000
Zimbabwe 2003 19000
Zimbabwe 2004 19000
Zimbabwe 2005 22000
Zimbabwe 2006 24000
Zimbabwe 2007 24000
Zimbabwe 2008 23000
Zimbabwe 2009 17000
Zimbabwe 2010 13000
Zimbabwe 2011 14000
Zimbabwe 2012 14000
Zimbabwe 2013 11000
Zimbabwe 2014 11000
Zimbabwe 2015 9000
Zimbabwe 2016 5600
Assuming your dataframe looks like this:
>>> df
Country Year Mortality
0 Zimbabwe 2000 20000
1 Zimbabwe 2001 18000
2 Zimbabwe 2002 17000
3 Zimbabwe 2003 19000
4 Zimbabwe 2004 19000
5 Zimbabwe 2005 22000
6 Zimbabwe 2006 24000
7 Zimbabwe 2007 24000
8 Zimbabwe 2008 23000
9 Zimbabwe 2009 17000
10 Zimbabwe 2010 13000
11 Zimbabwe 2011 14000
12 Zimbabwe 2012 14000
13 Zimbabwe 2013 11000
14 Zimbabwe 2014 11000
15 Zimbabwe 2015 9000
16 Zimbabwe 2016 5600
You can obtain a barplot by using the following code:
# Plot mortality per year:
plt.bar(df['Year'], df['Mortality'])
# Set plot title
plt.title('Zimbabwe')
# Set the "xticks", for barplots, this is the labels on your x axis
plt.xticks(df['Year'], rotation=90)
# Set the name of the x axis
plt.xlabel('Year')
# Set the name of the y axis
plt.ylabel('Mortality')
# tight_layout makes it nicer for reading and saving
plt.tight_layout()
# Show your plot
plt.show()
Which gives you a bar chart of Zimbabwe's mortality by year.
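For completeness, pandas' own plotting wrapper gets to the same bar chart in fewer calls (a sketch using the first few rows of the table above; the Agg backend is selected only so this runs headless):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

# first few rows of the table above
df = pd.DataFrame({'Country': ['Zimbabwe'] * 3,
                   'Year': [2000, 2001, 2002],
                   'Mortality': [20000, 18000, 17000]})

# DataFrame.plot.bar wraps the same matplotlib calls in one line
ax = df.plot.bar(x='Year', y='Mortality', legend=False, title='Zimbabwe')
ax.set_ylabel('Mortality')
plt.tight_layout()
```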

join missing rows from another table columns in Python

I have two tables (as DataFrames) in Python. One is as follows:
Country Year totmigrants
Afghanistan 2000
Afghanistan 2001
Afghanistan 2002
Afghanistan 2003
Afghanistan 2004
Afghanistan 2005
Afghanistan 2006
Afghanistan 2007
Afghanistan 2008
Algeria 2000
Algeria 2001
Algeria 2002
...
Zimbabwe 2008
the other is one per year (9 separate DataFrames overall, for 2000-2008):
Year=2000
---------------------------------------
Country totmigrants Gender Total
Afghanistan 73 M 70
Afghanistan F 3
Albania 11 M 5
Albania F 6
Algeria 52 M 44
...
Zimbabwe F 1
I want to join them together, with the first table as the left side of an outer join.
I had this in my mind but this only works for merging by columns:
new=pd.merge(table1,table2,how='left',on=['Country', 'Year'])
What I want to see is, from each year's data frame, the total number of migrants plus F and M counts appearing as new columns in the first table:
Country Year totmigrants F M
Afghanistan 2000 73 3 70
Afghanistan 2001 table3
Afghanistan 2002 table4
Afghanistan 2003 ...
Afghanistan 2004
Afghanistan 2005
Afghanistan 2006
Afghanistan 2007
Afghanistan 2008
Algeria 2000 52 8 44
Algeria 2001 table3 ...
Algeria 2002 table4 ...
...
Zimbabwe 2008 ... ...
Is there a specific method for this merging, or what function do I need to use?
Here's how to combine the data from the yearly dataframes. Let's assume that the yearly dataframes have been stored in a dictionary:
df = {2000: ..., 2001: ..., ..., 2008: ...}
yearly = []
for N in df.keys():
    tmp = df[N].pivot(index='Country', columns='Gender', values='Total').fillna(0).astype(int)
    tmp['Year'] = N  # store the year
    tmp['totmigrants'] = tmp['M'] + tmp['F']
    yearly.append(tmp)
df = pd.concat(yearly)
print(df)
#Gender F M Year totmigrants
#Country
#Afghanistan 3 70 2000 73
#Albania 6 5 2000 11
#Algeria 0 44 2000 44
#Zimbabwe 1 0 2000 1
Now you can merge df with the first dataframe using ['Country','Year'] as the keys.
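A runnable miniature of the whole pipeline, including that final merge back onto the first table (sample values taken from the question; only the year 2000 is filled in here):

```python
import pandas as pd

# one yearly frame in the question's shape
df2000 = pd.DataFrame({'Country': ['Afghanistan', 'Afghanistan',
                                   'Albania', 'Albania'],
                       'Gender': ['M', 'F', 'M', 'F'],
                       'Total': [70, 3, 5, 6]})

# pivot Gender out into F/M columns, one frame per year
yearly = []
for year, frame in {2000: df2000}.items():
    tmp = frame.pivot(index='Country', columns='Gender', values='Total')
    tmp = tmp.fillna(0).astype(int)
    tmp['Year'] = year
    tmp['totmigrants'] = tmp['M'] + tmp['F']
    yearly.append(tmp)
combined = pd.concat(yearly).reset_index()

# left-join onto the skeleton table of all Country/Year rows
table1 = pd.DataFrame({'Country': ['Afghanistan', 'Albania'],
                       'Year': [2000, 2000]})
result = pd.merge(table1, combined, how='left', on=['Country', 'Year'])
```

Years missing from the dictionary simply come out as NaN after the left join, matching the blank cells in the desired output.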
I am not sure you need the first table. I did the following, I hope it helps.
import numpy as np
import pandas as pd

data2000 = np.array([['', 'Country', 'totmigrants', 'Gender', 'Total'],
                     ['1', 'Afghanistan', 73, 'M', 70],
                     ['2', 'Afghanistan', None, 'F', 3],
                     ['3', 'Albania', 11, 'M', 5],
                     ['4', 'Albania', None, 'F', 6]])
data2001 = np.array([['', 'Country', 'totmigrants', 'Gender', 'Total'],
                     ['1', 'Afghanistan', 75, 'M', 60],
                     ['2', 'Afghanistan', None, 'F', 15],
                     ['3', 'Albania', 15, 'M', 11],
                     ['4', 'Albania', None, 'F', 4]])
# and so on
datas = {'2000': data2000, '2001': data2001}
reg_dfs = []
for year, data in datas.items():
    df = pd.DataFrame(data=data[1:, 1:],
                      index=data[1:, 0],
                      columns=data[0, 1:])
    # self-merge on Country to pair the M row with the F row
    new = pd.merge(df, df, how='inner', on=['Country'])
    reg_df = new.query('Gender_x == "M" & Gender_y == "F"')[['Country', 'Total_x', 'Total_y', 'totmigrants_x']]
    reg_df.columns = ['Country', 'M', 'F', 'Total']
    reg_df['Year'] = year
    reg_dfs.append(reg_df)
print(pd.concat(reg_dfs).sort_values(['Country']))
# Country M F Total Year
#1 Afghanistan 70 3 73 2000
#1 Afghanistan 60 15 75 2001
#5 Albania 5 6 11 2000
#5 Albania 11 4 15 2001
