Manipulating a Pandas data frame to get it into long format - python

So I have a df that looks like this:
Year  code   Country  Quan1jan  Quan2jan  Quan1feb  Quan2feb
2020  08123  Japan    500       26        400       28
2020  08123  Taiwan   450       245       4500      87
And I would like for it to look like this:
Year  month  code   Country  Quan1  Quan2
2020  jan    08123  Japan    500    26
2020  feb    08123  Japan    400    28
2020  jan    08123  Taiwan   450    245
2020  feb    08123  Taiwan   4500   87
It doesn't matter if the rows end up in this exact order, but I need the data in this format.
I've tried to play around with melt and unstack with no luck. Any help is very much appreciated.

Use wide_to_long:
pd.wide_to_long(
    df,
    ['Quan1', 'Quan2'],
    i=['Year', 'code', 'Country'],
    j='month',
    suffix=r'\w+'
).reset_index()
# Year code Country month Quan1 Quan2
#0 2020 8123 Japan jan 500 26
#1 2020 8123 Japan feb 400 28
#2 2020 8123 Taiwan jan 450 245
#3 2020 8123 Taiwan feb 4500 87
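Since the question mentions trying melt, here is a rough equivalent sketch along that route (not from the original answer; it assumes pandas 1.1+ so that pivot accepts a list of index columns):
import pandas as pd

# Hedged sketch: melt everything, split 'Quan1jan' into ('Quan1', 'jan'), then pivot back.
long = df.melt(id_vars=['Year', 'code', 'Country'], var_name='col', value_name='value')
long[['measure', 'month']] = long['col'].str.extract(r'(Quan\d)(\w+)')
out = (long.pivot(index=['Year', 'code', 'Country', 'month'],
                  columns='measure', values='value')
           .reset_index()
           .rename_axis(columns=None))
wide_to_long is shorter, but the melt version makes the column-name split explicit.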

Related

Pandas: Groupby sum values

I am using this data frame, which comes from Excel:
I'd like to show the total sales per year.
Year Sales
2021 7
2018 6
2018 787
2018 935
2018 1 059
2018 5
2018 72
2018 2
2018 3
2019 218
2019 256
2020 2
2018 4
2021 8
2019 14
2020 3
2018 3
2018 1
2020 34
I'm using this:
df.groupby(['Year'])['Sales'].agg('sum')
And the result:
2018.0 67879351 05957223431
2019.0 21825614
2020.0 2334
2021.0 78
Do you know why I don't get the sum of the values?
Thanks
The 'Sales' column is of dtype object, so convert it to numeric:
df['Sales'] = pd.to_numeric(df['Sales'].replace(r"\s+", '', regex=True), errors='coerce')
# or: df['Sales'].replace(r"\s+", '', regex=True).astype(float)
Now calculate the sum():
out=df.groupby(['Year'])['Sales'].sum()
output of out:
Year
2018 2877
2019 488
2020 39
2021 15
Name: Sales, dtype: int64
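As a quick sanity check (a hedged sketch, not part of the original answer), you can reproduce the problem on a tiny frame and confirm that an object-dtype column makes sum() concatenate strings instead of adding numbers:
import pandas as pd

# Hypothetical minimal reproduction of the problem:
df = pd.DataFrame({'Year': [2018, 2018, 2019], 'Sales': ['787', '1 059', '218']})
print(df['Sales'].dtype)                  # object -> the values are strings
print(df.groupby('Year')['Sales'].sum())  # 2018 -> '7871 059' (string concatenation)

df['Sales'] = pd.to_numeric(df['Sales'].str.replace(r'\s+', '', regex=True))
print(df.groupby('Year')['Sales'].sum())  # 2018 -> 1846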

How to join or merge two dataframes based on an inequality condition?

I want to merge or join two DataFrames based on their dates: join each Completed date with every earlier (or equal) Start date. I have the following dataframes:
df1:
Complted_date
2015
2017
2020
df2:
Start_date
2001
2010
2012
2015
2016
2017
2018
2019
2020
2021
And desired output is:
Complted_date Start_date
2015 2001
2015 2010
2015 2012
2015 2015
2017 2001
2017 2010
2017 2012
2017 2015
2017 2016
2017 2017
2020 2001
2020 2010
2020 2012
2020 2015
2020 2016
2020 2017
2020 2018
2020 2019
2020 2020
I've tried but I'm not getting the output I want.
Thank you for your help!!
Check out merge, which gives you the expected output:
(df1.assign(key=1)
    .merge(df2.assign(key=1), on='key')
    .query('Complted_date >= Start_date')
    .drop('key', axis=1)
)
Output:
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
10 2017 2001
11 2017 2010
12 2017 2012
13 2017 2015
14 2017 2016
15 2017 2017
20 2020 2001
21 2020 2010
22 2020 2012
23 2020 2015
24 2020 2016
25 2020 2017
26 2020 2018
27 2020 2019
28 2020 2020
However, you might want to check out merge_asof:
pd.merge_asof(df2, df1,
              right_on='Complted_date',
              left_on='Start_date',
              direction='forward')
Output:
Start_date Complted_date
0 2001 2015.0
1 2010 2015.0
2 2012 2015.0
3 2015 2015.0
4 2016 2017.0
5 2017 2017.0
6 2018 2020.0
7 2019 2020.0
8 2020 2020.0
9 2021 NaN
You can do a cross join and pick the records where Complted_date >= Start_date:
Use df.merge with df.query:
In [101]: df1['tmp'] = 1
In [102]: df2['tmp'] = 1
In [107]: res = df1.merge(df2, how='outer').query("Complted_date >= Start_date").drop('tmp', axis=1)
In [108]: res
Out[108]:
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
10 2017 2001
11 2017 2010
12 2017 2012
13 2017 2015
14 2017 2016
15 2017 2017
20 2020 2001
21 2020 2010
22 2020 2012
23 2020 2015
24 2020 2016
25 2020 2017
26 2020 2018
27 2020 2019
28 2020 2020
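On pandas 1.2 or newer (an assumption about the installed version), the same cartesian product can be written without the temporary key by using how='cross'; a hedged sketch:
import pandas as pd

# Hedged sketch, assuming pandas >= 1.2 for how='cross':
res = (pd.merge(df1, df2, how='cross')
         .query('Complted_date >= Start_date'))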
Here is another way, using pd.Series() and explode():
df1['Start_date'] = pd.Series([df2['Start_date'].tolist()])
df1['Start_date'] = df1['Start_date'].ffill()
df1.explode('Start_date').loc[lambda x: x['Complted_date'].ge(x['Start_date'])].reset_index(drop=True)
You could use conditional_join from pyjanitor to get the rows where Complted_date >= Start_date:
# pip install pyjanitor
import pandas as pd
import janitor
df1.conditional_join(df2, ('Complted_date', 'Start_date', '>='))
Out[1163]:
left right
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
4 2017 2001
5 2017 2010
6 2017 2012
7 2017 2015
8 2017 2016
9 2017 2017
10 2020 2001
11 2020 2010
12 2020 2012
13 2020 2015
14 2020 2016
15 2020 2017
16 2020 2018
17 2020 2019
18 2020 2020
Under the hood it is just a binary search (searchsorted); the aim is to avoid a cartesian join and, hopefully, reduce memory usage.
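To illustrate the idea (a hedged sketch of the concept, not pyjanitor's actual implementation): sort the right-hand values once, then use searchsorted to find, for each Complted_date, how many Start_date values are less than or equal to it.
import numpy as np
import pandas as pd

# Hypothetical illustration of the binary-search idea:
start = np.sort(df2['Start_date'].to_numpy())
pos = np.searchsorted(start, df1['Complted_date'].to_numpy(), side='right')

pairs = pd.DataFrame(
    [(c, s) for c, p in zip(df1['Complted_date'], pos) for s in start[:p]],
    columns=['Complted_date', 'Start_date'],
)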

Transform dataframe values with multilevel indices to single column

I would like to ask for your advice.
How can I transform the first dataframe into the second one, below?
Continent, Country and Location are the names of the column index levels.
Polution_Level would become the column name for the values held in the first dataframe.
Continent Asia Asia Africa Europe
Country Japan China Mozambique Portugal
Location Tokyo Shanghai Maputo Lisbon
Date
01 Jan 20 250 435 45 137
02 Jan 20 252 457 43 144
03 Jan 20 253 463 42 138
Continent Country Location Date Polution_Level
Asia Japan Tokyo 01 Jan 20 250
Asia Japan Tokyo 02 Jan 20 252
Asia Japan Tokyo 03 Jan 20 253
...
Europe Portugal Lisbon 03 Jan 20 138
Thank you.
The following should do what you want.
Modules
import io
import pandas as pd
Create data
df = pd.read_csv(io.StringIO("""
Continent Asia Asia Africa Europe
Country Japan China Mozambique Portugal
Location Tokyo Shanghai Maputo Lisbon
Date
01 Jan 20 250 435 45 137
02 Jan 20 252 457 43 144
03 Jan 20 253 463 42 138
"""), sep="\s\s+", engine="python", header=[0,1,2], index_col=[0])
Verify multiindex
df.columns
MultiIndex([(  'Asia',      'Japan',    'Tokyo'),
            (  'Asia',      'China', 'Shanghai'),
            ('Africa', 'Mozambique',   'Maputo'),
            ('Europe',   'Portugal',   'Lisbon')],
           names=['Continent', 'Country', 'Location'])
Transpose table and stack values
ndf = df.T.stack().reset_index()
ndf = ndf.rename({0: 'Polution_Level'}, axis=1)
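An alternative (a hedged sketch, not part of the original answer) is to stack all three column levels directly instead of transposing, then name and reorder the columns:
# Hedged alternative: stack every column level, then name and reorder the columns.
ndf = (df.stack(level=[0, 1, 2])
         .rename('Polution_Level')
         .reset_index())
ndf = ndf[['Continent', 'Country', 'Location', 'Date', 'Polution_Level']]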

How to filter within a subgroup (Pandas)

Here is my problem:
Below you will find a Pandas DataFrame. I would like to group by Date and then filter within the subgroups, but I am having a lot of difficulty doing it (I have spent 3 hours on this and haven't found any solution).
This is what I am looking for:
I first have to group everything by date, then sort the scores from highest to lowest within each subgroup, and then select the two best scores, but they have to come from different countries.
(For example, if the two best scores are from the same country, we take the highest score from a country different from the first.)
This is the DataFrame:
Date Name Score Country
2012 Paul 65 France
2012 Silvia 81 Italy
2012 David 80 UK
2012 Alphonse 46 France
2012 Giovanni 82 Italy
2012 Britney 53 UK
2013 Paul 32 France
2013 Silvia 59 Italy
2013 David 92 UK
2013 Alphonse 68 France
2013 Giovanni 23 Italy
2013 Britney 78 UK
2014 Paul 46 France
2014 Silvia 87 Italy
2014 David 89 UK
2014 Alphonse 76 France
2014 Giovanni 53 Italy
2014 Britney 90 UK
The result I am looking for is something like this:
Date Name Score Country
2012 Giovanni 82 Italy
2012 David 80 UK
2013 David 92 UK
2013 Alphonse 68 France
2014 Britney 90 UK
2014 Silvia 87 Italy
Here is the code that I started with:
df = pd.DataFrame(
    {'Date': ["2012","2012","2012","2012","2012","2012","2013","2013","2013","2013","2013","2013","2014","2014","2014","2014","2014","2014"],
     'Name': ["Paul","Silvia","David","Alphonse","Giovanni","Britney","Paul","Silvia","David","Alphonse","Giovanni","Britney","Paul","Silvia","David","Alphonse","Giovanni","Britney"],
     'Score': [65, 81, 80, 46, 82, 53, 32, 59, 92, 68, 23, 78, 46, 87, 89, 76, 53, 90],
     'Country': ["France","Italy","UK","France","Italy","UK","France","Italy","UK","France","Italy","UK","France","Italy","UK","France","Italy","UK"]})
df = df.set_index('Name').groupby('Date')[["Score", "Country"]].apply(lambda _df: _df.sort_values(["Score"], ascending=False))
And this is what I have:
But as you can see, in 2012 for example the two best scores are from the same country (Italy), so what I still have to do is:
1. Select the max per country for each year.
2. Select only the two best scores (and the countries have to be different).
I would be really thankful for any help, because I really don't know how to do it.
If somebody has ideas on this, please share them :)
PS: please don't hesitate to tell me if this wasn't clear enough.
Use DataFrame.sort_values first on the two columns, then remove duplicates on Date and Country with DataFrame.drop_duplicates, and finally select the top rows per group with GroupBy.head:
df1 = (df.sort_values(['Date', 'Score'], ascending=[True, False])
         .drop_duplicates(['Date', 'Country'])
         .groupby('Date')
         .head(2))
print(df1)
Date Name Score Country
4 2012 Giovanni 82 Italy
2 2012 David 80 UK
8 2013 David 92 UK
9 2013 Alphonse 68 France
17 2014 Britney 90 UK
13 2014 Silvia 87 Italy
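Since the question asks specifically about filtering within each subgroup, here is a rough equivalent sketch (not from the answer above) that makes the per-group logic explicit with groupby().apply():
# Hedged sketch: apply the 'best two scores from different countries' rule per Date group.
def top2_distinct_countries(g):
    g = g.sort_values('Score', ascending=False)    # best scores first within the group
    return g.drop_duplicates('Country').head(2)    # one row per country, keep the top two

df1 = (df.groupby('Date', group_keys=False)
         .apply(top2_distinct_countries))
The drop_duplicates answer above is usually faster because it avoids the per-group Python calls.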

Resampling data monthly in R or Python

I have data recorded in the format below.
Input
name year value
Afghanistan 1800 68
Albania 1800 23
Algeria 1800 54
Afghanistan 1801 59
Albania 1801 38
Algeria 1801 72
---
Afghanistan 2040 142
Albania 2040 165
Algeria 2040 120
I would like to resample all of my data, which is recorded yearly from 1800 to 2040, to a monthly frequency, and produce exactly the format shown below.
Expected output
name year value
Afghanistan Jan 1800 5.6667
Afghanistan Feb 1800 11.3333
Afghanistan Mar 1800 17.0000
Afghanistan Apr 1800 22.6667
Afghanistan May 1800 28.3333
Afghanistan Jun 1800 34.0000
Afghanistan Jul 1800 39.6667
Afghanistan Aug 1800 45.3333
Afghanistan Sep 1800 51.0000
Afghanistan Oct 1800 56.6667
Afghanistan Nov 1800 62.3333
Afghanistan Dec 1800 68.0000
Albania Jan 1800 1.9167
Albania Feb 1800 3.8333
Albania Mar 1800 5.7500
Albania Apr 1800 7.6667
Albania May 1800 9.5833
Albania Jun 1800 11.5000
Albania Jul 1800 13.4167
Albania Aug 1800 15.3333
Albania Sep 1800 17.2500
Albania Oct 1800 19.1667
Albania Nov 1800 21.0833
Albania Dec 1800 23.0000
Algeria Jan 1800 4.5000
Algeria Feb 1800 9.0000
Algeria Mar 1800 13.5000
Algeria Apr 1800 18.0000
Algeria May 1800 22.5000
Algeria Jun 1800 27.0000
Algeria Jul 1800 31.5000
Algeria Aug 1800 36.0000
Algeria Sep 1800 40.5000
Algeria Oct 1800 45.0000
Algeria Nov 1800 49.5000
Algeria Dec 1800 54.000
I would like my data to look as above for all of the years, i.e. from 1800 to 2040.
The value column is interpolated.
NB: My model will accept months as abbreviations, like above.
My closest attempt is below, but it did not produce the expected result.
data['year'] = pd.to_datetime(data.year, format='%Y')
data.head(3)
name year value
Afghanistan 1800-01-01 00:00:00 68
Albania 1800-01-01 00:00:00 23
Algeria 1800-01-01 00:00:00 54
resampled = (data.groupby(['name']).apply(lambda x: x.set_index('year').resample('M').interpolate()))
resampled.head(3)
name year name value
Afghanistan 1800-01-31 00:00:00 NaN NaN
1800-02-28 00:00:00 NaN NaN
1800-03-31 00:00:00 NaN NaN
Your thoughts will save me here.
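Before the R answers below, here is a hedged pandas-only sketch of the same idea (assuming the frame is called data with columns name, year and value): place each yearly value at January 1st of its year, upsample to month starts, and interpolate within each name. Like the R answers, it interpolates between the yearly points rather than reproducing the exact divide-by-12 pattern in the expected output above.
import pandas as pd

# Hedged sketch, assuming `data` has columns name / year / value:
data['year'] = pd.to_datetime(data['year'], format='%Y')
resampled = (
    data.set_index('year')
        .groupby('name')['value']
        .apply(lambda s: s.resample('MS').mean().interpolate())  # month starts, fill gaps linearly
        .reset_index()
)
resampled['year'] = resampled['year'].dt.strftime('%b %Y')       # e.g. 'Jan 1800'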
Apart from the imputeTS package for inter- as well as extrapolation, I only use base R in this solution.
res <- do.call(rbind, by(dat, dat$name, function(x) {
  ## expanding years to year-months
  ex <- do.call(rbind, lapply(1:nrow(x), function(i) {
    yr <- x$year[i]
    data.frame(name=x$name[1],
               year=seq.Date(as.Date(ISOdate(yr, 1, 1)),
                             as.Date(ISOdate(yr, 12, 31)), "month"),
               value=x$value[i])
  }))
  ## set values to NA except 01-01s
  ex[!grepl("01-01", ex$year), "value"] <- NA
  transform(ex,
            ## impute values linearly
            value=imputeTS::na_interpolation(ex$value),
            ## format dates for desired output
            year=strftime(ex$year, format="%b-%Y"))
}))
Result
res[c(1:3, 13:15, 133:135, 145:147, 265:268, 277:279), ] ## sample rows
# name year value
# A.1 A Jan-1800 71.00000
# A.2 A Feb-1800 73.08333
# A.3 A Mar-1800 75.16667
# A.13 A Jan-1801 96.00000
# A.14 A Feb-1801 93.75000
# A.15 A Mar-1801 91.50000
# B.1 B Jan-1800 87.00000
# B.2 B Feb-1800 83.08333
# B.3 B Mar-1800 79.16667
# B.13 B Jan-1801 40.00000
# B.14 B Feb-1801 40.50000
# B.15 B Mar-1801 41.00000
# C.1 C Jan-1800 47.00000
# C.2 C Feb-1800 49.00000
# C.3 C Mar-1800 51.00000
# C.4 C Apr-1800 53.00000
# C.13 C Jan-1801 71.00000
# C.14 C Feb-1801 72.83333
# C.15 C Mar-1801 74.66667
Data
set.seed(42)
dat <- transform(expand.grid(name=LETTERS[1:3],
                             year=1800:1810),
                 value=sample(23:120, 33, replace=TRUE))
Here's a tidyverse approach that also requires the zoo package for the interpolation part.
library(dplyr)
library(tidyr)
library(zoo)
df <- data.frame(country = rep(c("Afghanistan", "Algeria"), each = 3),
                 year = rep(seq(1800, 1802), times = 2),
                 value = rep(seq(3), times = 2),
                 stringsAsFactors = FALSE)
df2 <- df %>%
  # make a grid of all country/year/month possibilities within the years in df
  tidyr::expand(year, month = seq(12)) %>%
  # join that to the original data frame to add back the values
  left_join(., df) %>%
  # put the result in chronological order
  arrange(country, year, month) %>%
  # group by country so the interpolation stays within those sets
  group_by(country) %>%
  # make a version of value that is NA except for Dec, then use na.approx to replace
  # the NAs with linearly interpolated values
  mutate(value_i = ifelse(month == 12, value, NA),
         value_i = zoo::na.approx(value_i, na.rm = FALSE))
Note that the resulting column, value_i, is NA until the first valid observation, in December of the first observed year. So here's what the tail of df2 looks like.
> tail(df2)
# A tibble: 6 x 5
# Groups: country [1]
year month country value value_i
<int> <int> <chr> <int> <dbl>
1 1802 7 Algeria 3 2.58
2 1802 8 Algeria 3 2.67
3 1802 9 Algeria 3 2.75
4 1802 10 Algeria 3 2.83
5 1802 11 Algeria 3 2.92
6 1802 12 Algeria 3 3
If you want to replace those leading NAs, you'd have to do linear extrapolation, which you can do with na.spline from zoo instead. And if you'd rather have the observed values in January instead of December and get trailing instead of leading NAs, just change the relevant bit of the second-to-last line to month == 1.
