Split int64 Pandas column in two - python

I've been given a dataset that has dates as an integer using the format 52019 for May 2019. I've put it into a Pandas DataFrame, and I need to extract that date format into a month column and year column, but I can't figure out how to do that for an int64 datatype or how to handle it for the two digit months. So I want to take something like
ID Date
1 22019
2 32019
3 52019
5 102019
and make it become
ID Month Year
1 2 2019
2 3 2019
3 5 2019
5 10 2019
What should I do?

divmod
df['Month'], df['Year'] = np.divmod(df.Date, 10000)
df
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019
Without mutating original dataframe using assign
df.assign(**dict(zip(['Month', 'Year'], np.divmod(df.Date, 10000))))
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019

Using // and %
df['Month'], df['Year'] = df.Date//10000,df.Date%10000
df
Out[528]:
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019

Use:
s=pd.to_datetime(df.pop('Date'),format='%m%Y') #convert to datetime and pop deletes the col
df['Month'],df['Year']=s.dt.month,s.dt.year #extract month and year
print(df)
ID Month Year
0 1 2 2019
1 2 3 2019
2 3 5 2019
3 5 10 2019

str.extract can handle the tricky part of figuring out whether the Month has 1 or 2 digits.
(df['Date'].astype(str)
.str.extract(r'^(?P<Month>\d{1,2})(?P<Year>\d{4})$')
.astype(int))
Month Year
0 2 2019
1 3 2019
2 5 2019
3 10 2019
You may also use string slicing if it's guaranteed your numbers have only 5 or 6 digits (if not, use str.extract above):
u = df['Date'].astype(str)
df['Month'], df['Year'] = u.str[:-4], u.str[-4:]
df
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019

Related

Filter individuals that don't have data for the whole period

I am using Python 3.9 on Pycharm. I have the following dataframe:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
6 C 2020 5
7 C 2021 4
I want to keep individuals that have available data for the whole period. In other words, I would like to filter the rows such that I only keep id that have data for the three years (2019, 2020, 2021). This means excluding all observations of id C and keep all observations of id A and B:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Is it feasible in Python?
As you want to include only the ids for which all three year exist, you can group the dataframe by id then filter based on set equalities for the years you want versus the years available for particular id:
>>> years = {2019, 2020, 2021}
>>> df.groupby('id').filter(lambda x: set(x['year'].values)==years)
# df is your dataframe
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
First, make a set of all the years existing in the column year then use a boolean mask to filter your dataframe. For that, you need pandas.DataFrame.groupby and pandas.DataFrame.transform to count the occurences of each id in each group of year.
from io import StringIO
import pandas as pd
s = """id year gdp
A 2019 3
A 2020 0
A 2021 5
B 2019 4
B 2020 2
B 2021 1
C 2020 5
C 2021 4
"""
df = pd.read_csv(StringIO(s), sep='\t')
mask = df.groupby('id')['year'].transform('count').eq(len(set(df['id'])))
out = df[mask]
>>> print(out)
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Here is a way using pivot and dropna to automatically find ids with missing values:
keep = df.pivot('id', 'year', 'gdp').dropna().index
# ['A', 'B']
out = df[df['id'].isin(keep)]
output:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1

pivot dataframe using columns and values

I have data frame like
Year Month Date X Y
2015 5 1 0.21120733 0.17662421
2015 5 2 0.36878636 0.14629167
2015 5 3 0.27969632 0.37910569
2016 5 1 -1.2968733 8.29E-02
2016 5 2 -1.1575716 -0.20657887
2016 5 3 -1.0049003 -0.39670503
2017 5 1 -1.5630698 1.1710221
2017 5 2 -1.70889 0.93349206
2017 5 3 -1.8548334 0.86701781
2018 5 1 -7.94E-02 0.3962194
2018 5 2 -2.91E-02 0.39321879
I want to make it like
2015 2016 2017 2018
0.21120733 -1.2968733 -1.5630698 -7.94E-02
0.36878636 -1.1575716 -1.70889 -2.91E-02
0.27969632 -1.0049003 -1.8548334 NA
I tried using df.pivot(columns='Year',values='X') but the answer is not as expected
Try passing index in pivot():
out=df.pivot(columns='Year',values='X',index='Date')
#If needed use:
out=out.rename_axis(index=None,columns=None)
OR
Try via agg() and dropna():
out=df.pivot(columns='Year',values='X').agg(sorted,key=pd.isnull).dropna(how='all')
#If needed use:
out.columns.names=[None]
output of out:
2015 2016 2017 2018
0 0.211207 -1.296873 -1.563070 -0.0794
1 0.368786 -1.157572 -1.708890 -0.0291
2 0.279696 -1.004900 -1.854833 NaN

Grabbing data from previous year in a Pandas DataFrame

I've got this df:
d={'year':[2019,2018,2017],'B':[10,5,17]}
df=pd.DataFrame(data=d)
print(df):
year B
0 2019 10
1 2018 5
2 2017 17
I want to create a column "B_previous_year" that grabs B data from the previous year, in a way it looks like this:
year B B_previous_year
0 2019 10 5
1 2018 5 17
2 2017 17 NaN
I'm trying this:
df['B_previous_year']=df.B.loc[df.year == (df.year - 1)]
However my B_previous_year is getting full of NaN
year B B_previous_year
0 2019 10 NaN
1 2018 5 NaN
2 2017 17 NaN
How could I do that?
In case if you want to keep in Integer format:
df = df.convert_dtypes()
df['New'] = df.B.shift(-1)
df
Output:
year B New
0 2019 10 5
1 2018 5 17
2 2017 17 <NA>
You might want to sort the dataframe by year first, then verify that the difference from one row to the next is, indeed, one year:
df = df.sort_values(by='year')
df['B_previous_year'] = df[df.year.diff() == 1]['B']
year B B_previous_year
2 2017 17 NaN
1 2018 5 5.0
0 2019 10 10.0

Groupby with counts + aggregate row

I have the following code which produces a data frame showing me a per month and per year average sold price. I would like to add to this a total row per year and a total row per pid (person). Sample code and data:
import pandas as pd
import StringIO
s = StringIO.StringIO("""pid,year,month,price
1,2017,4,2000
1,2017,4,2900
1,2018,4,2000
1,2018,4,2300
1,2018,5,2000
1,2018,5,1990
1,2018,6,2200
1,2018,6,2400
1,2018,6,2250
1,2018,7,2150
""")
df = pd.read_csv(s)
maths = {'price': 'mean'}
gb = df.groupby(['pid','year','month'])
counts = gb.size().to_frame(name='n')
out = counts.join(gb.agg(maths)).reset_index()
print(out)
Which yields:
pid year month n price
0 1 2017 4 2 2450.000000
1 1 2018 4 2 2150.000000
2 1 2018 5 2 1995.000000
3 1 2018 6 3 2283.333333
4 1 2018 7 1 2150.000000
I would the additional per year rows to look like:
pid year month n price
0 1 2017 all 2 2450.000000
0 1 2018 all 8 2161.000000
And then the per pid rollup to look like:
pid year month n price
0 1 all all 10 2218.000000
I'm having trouble cleanly grouping/aggregating those last two frames where I essentially want an all split for each year and month value, and then have each data frame here combined into one which I can write to CSV, or a database table.
Using pd.concat
df1=df.groupby(['pid','year','month']).price.agg(['size','mean']).reset_index()
df2=df.groupby(['pid','year']).price.agg(['size','mean']).assign(month='all').reset_index()
df3=df.groupby(['pid']).price.agg(['size','mean']).assign(**{'month':'all','year':'all'}).reset_index()
pd.concat([df1,df2,df3])
Out[484]:
mean month pid size year
0 2450.000000 4 1 2 2017
1 2150.000000 4 1 2 2018
2 1995.000000 5 1 2 2018
3 2283.333333 6 1 3 2018
4 2150.000000 7 1 1 2018
0 2450.000000 all 1 2 2017
1 2161.250000 all 1 8 2018
0 2219.000000 all 1 10 all

Get first index from multiindex grouping by level

I have a multiindex pandas dataframe:
SHOPPING_COUNT
CLIENT YEAR MONTH
1000063 2013 12 9
2014 1 9
2 7
3 9
2015 4 6
5 5
6 9
1001327 2014 5 1
6 1
2015 2 7
3 1
4 3
1001399 2013 8 1
And I would to know the first index of each client, ordering by level 0.
I mean, I would want to get:
1000063 2013 12
1001327 2014 5
1001399 2013 8
Let df be your dataframe, you can do something like:
df = df.groupby(level=0).apply(lambda x: x.iloc[0:1])
df.index = df.index.droplevel(0)
actually this should be more easy to do maybe, but I think that this method works.
It's not very programmatic, but if you look at the result of:
client = 1000063
df.loc[client].index
Then the following would work:
year = df.loc[client].index.levels[0][df.loc[client].index.labels[0][0]]
month = df.loc[client].index.levels[1][df.loc[client].index.labels[1][0]]

Categories

Resources