Python function definition on two list - python

Year Month Year_month
2009 2 2009/2
2009 3 2009/3
2007 4 2007/3
2006 10 2006/10
Year_month
200902
200903
200704
200610
I would like to combine the Year and Month columns into the Year_month format shown above (i.e. replacing the original column). How can I do it? The following approach does not seem to work in Python. Thanks.
def f(x, y)
    return x*100+y
for i in range(0,filename.shape[0]):
    filename['Year_month'][i] = f(filename['year'][i] ,filename['month'][i])

I think you can use zfill:
df['Year_month'] = df.Year.astype(str) + df.Month.astype(str).str.zfill(2)
print(df)
Year Month Year_month
0 2009 2 200902
1 2009 3 200903
2 2007 4 200704
3 2006 10 200610
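If an integer result is acceptable, the arithmetic from the question (year * 100 + month) can also be applied to whole columns at once, without a loop. This is only a minimal sketch; the small DataFrame below is rebuilt from the question's sample data, and the helper name year_month_values is just for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question (assumption: both columns are integers).
df = pd.DataFrame({"Year": [2009, 2009, 2007, 2006], "Month": [2, 3, 4, 10]})

# Vectorized arithmetic: 2009 * 100 + 2 gives 200902, so no zero-padding is needed.
df["Year_month"] = df["Year"] * 100 + df["Month"]

year_month_values = df["Year_month"].tolist()
print(df)
```

Unlike the zfill approach, this yields an integer column rather than strings, which may or may not be what you want.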

df = pd.read_clipboard()
Year Month Year_month
0 2009 2 2009/2
1 2009 3 2009/3
2 2007 4 2007/3
3 2006 10 2006/10
df['Year_month'] = df.apply(lambda row: str(row.Year)+str(row.Month).zfill(2), axis=1)
Year Month Year_month
0 2009 2 200902
1 2009 3 200903
2 2007 4 200704
3 2006 10 200610

Related

Filter individuals that don't have data for the whole period

I am using Python 3.9 on Pycharm. I have the following dataframe:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
6 C 2020 5
7 C 2021 4
I want to keep only the individuals that have data for the whole period. In other words, I would like to filter the rows so that I keep only the ids that have data for all three years (2019, 2020, 2021). This means excluding all observations of id C and keeping all observations of ids A and B:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Is it feasible in Python?
As you want to include only the ids for which all three years exist, you can group the dataframe by id and then filter on set equality between the years you want and the years available for each id:
>>> years = {2019, 2020, 2021}
>>> df.groupby('id').filter(lambda x: set(x['year'].values)==years)
# df is your dataframe
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
First, make a set of all the years existing in the column year, then use a boolean mask to filter your dataframe. For that, you need pandas.DataFrame.groupby and transform to count the rows of each id and compare that count with the number of distinct years.
from io import StringIO
import pandas as pd
s = """id year gdp
A 2019 3
A 2020 0
A 2021 5
B 2019 4
B 2020 2
B 2021 1
C 2020 5
C 2021 4
"""
df = pd.read_csv(StringIO(s), sep=r'\s+')
mask = df.groupby('id')['year'].transform('count').eq(len(set(df['year'])))
out = df[mask]
>>> print(out)
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
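A variant of the counting mask, sketched under the assumption that an id might have duplicate rows for the same year: counting distinct years per id with nunique, and comparing against the number of distinct years in the whole frame, avoids over-counting. The names n_years and years_per_id are illustrative only.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "year": [2019, 2020, 2021, 2019, 2020, 2021, 2020, 2021],
    "gdp": [3, 0, 5, 4, 2, 1, 5, 4],
})

# Number of distinct years present anywhere in the data.
n_years = df["year"].nunique()

# For every row, the number of distinct years its id covers.
years_per_id = df.groupby("id")["year"].transform("nunique")

# Keep only rows whose id covers every year.
out = df[years_per_id.eq(n_years)]
print(out)
```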
Here is a way using pivot and dropna to automatically find ids with missing values:
keep = df.pivot(index='id', columns='year', values='gdp').dropna().index
# ['A', 'B']
out = df[df['id'].isin(keep)]
output:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1

Why do I get index value inside column value when I do pandas groupby?

I have a data frame as follows:
import pandas as pd
df = pd.DataFrame()
df['year'] = ['2015','2015','2016','2016','2017','2017']
df['months'] = [2,4,4,2,3,5]
df['perc'] = ['25','35','55','75','34','38']
Which results in dataframe:
year months perc
0 2015 2 25
1 2015 4 35
2 2016 4 55
3 2016 2 75
4 2017 3 34
5 2017 5 38
When I do a pandas groupby on the year column, the resulting dataframe shows the index of the original DF alongside the year in the year column.
Command used:
result = pd.DataFrame(df.groupby('year',as_index=False).apply(lambda x: x.nlargest(1, ['months'])))
Where the year column has index of original DF (the 1, 2, 5) with the year value:
year months perc
0 1 2015 4 35
1 2 2016 4 55
2 5 2017 5 38
print(result['year']) gives:
0 1 2015
1 2 2016
2 5 2017
Name: year, dtype: object
Why do I get the index of the original dataframe inside the year column, and how do I remove it?
I am not sure what your expected output is, but use group_keys=False as a parameter to keep only the original index:
(df.groupby('year',as_index=False, group_keys=False)
.apply(lambda x: x.nlargest(1, ['months']))
)
output:
year months perc
1 2015 4 35
2 2016 4 55
5 2017 5 38
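If the goal is simply the row with the largest months value per year, a possible alternative (a sketch, not necessarily what the asker needs) is to look up those rows by label with idxmax, which avoids apply altogether and keeps the original index. The name top_rows is illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "year": ["2015", "2015", "2016", "2016", "2017", "2017"],
    "months": [2, 4, 4, 2, 3, 5],
    "perc": ["25", "35", "55", "75", "34", "38"],
})

# idxmax returns, per year, the original index label of the largest 'months'.
top_rows = df.loc[df.groupby("year")["months"].idxmax()]
print(top_rows)
```

Note that idxmax keeps only the first row on ties, whereas nlargest accepts a keep argument for controlling that.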

Loop through timeseries and fill missing data - Python

I have a DF such as the one below:
ID Year Value
1 2007 1
1 2008 1
1 2009 1
1 2011 1
1 2013 1
1 2014 1
1 2015 1
2 2008 1
2 2010 1
2 2011 1
2 2012 1
2 2013 1
2 2014 1
3 2009 1
3 2010 1
3 2011 1
3 2012 1
3 2013 1
3 2014 1
3 2015 1
As you can see, ID '1' is missing values for 2010 and 2012; ID '2' is missing values for 2007, 2009, and 2015; and ID '3' is missing 2007 and 2008. I would like to fill these gaps with the value '1'. What I would like to achieve is below:
ID Year Value
1 2007 1
1 2008 1
1 2009 1
1 2010 1
1 2011 1
1 2012 1
1 2013 1
1 2014 1
1 2015 1
2 2007 1
2 2008 1
2 2009 1
2 2010 1
2 2011 1
2 2012 1
2 2013 1
2 2014 1
2 2015 1
3 2007 1
3 2008 1
3 2009 1
3 2010 1
3 2011 1
3 2012 1
3 2013 1
3 2014 1
3 2015 1
I have created the below so far; however, that only fills for one ID, and I was struggling to find a way to loop through each ID, adding a 'value' for each year that is missing:
idx = pd.date_range('2007', '2020', freq ='Y')
DF.index = pd.DatetimeIndex(DF.index)
DF_s = DF.reindex(idx, fill_value=0)
Any ideas would be helpful, please.
I'm not sure I got what you want to achieve, but if you want to fill NaNs in the "Value" column between 2007 and 2015 (suggesting that there are more years where you don't want to fill the column), you could do something like this:
import math
import pandas as pd
df1 = pd.DataFrame({'ID': [1,1,1,2,2,2],
                    'Year': [2007,2010,2020,2007,2010,2015],
                    'Value': [1,None,None,None,1,None]})
# Write a function with your logic
def func(x, y):
    return 0 if math.isnan(y) and 2007<=x<=2015 else y
# Apply it to the df and update the column
df1['Value'] = df1.apply(lambda x: func(x.Year, x.Value), axis=1)
#    ID  Year  Value
# 0   1  2007    1.0
# 1   1  2010    0.0
# 2   1  2020    NaN
# 3   2  2007    0.0
# 4   2  2010    1.0
# 5   2  2015    0.0
Answering my own question :). I needed to apply a lambda function after doing the groupby on 'Org' that adds a NaN for each year that is missing. The reset_index effectively ungroups it back into the original list.
f = lambda x: x.reindex(pd.date_range(pd.to_datetime('2007'), pd.to_datetime('2020'), name='date', freq='Y'))
DF_fixed = DF.set_index('Year').groupby(['Org']).apply(f).drop(['Org'], axis=1)
DF_fixed.reset_index()
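Another way to fill the gaps for every ID at once, sketched here on a shortened version of the question's data, is to build the full ID x Year grid with pd.MultiIndex.from_product and reindex against it. The column names ID, Year and Value come from the question; the 2007 to 2015 range and the fill value of 1 are taken from the desired output, and the names full_index and filled are illustrative.

```python
import pandas as pd

# Shortened sample in the shape of the question's data (assumption: integer years).
df = pd.DataFrame({
    "ID":    [1, 1, 1, 2, 2, 3],
    "Year":  [2007, 2008, 2011, 2008, 2010, 2009],
    "Value": [1, 1, 1, 1, 1, 1],
})

# Every (ID, Year) combination that should exist in the result.
full_index = pd.MultiIndex.from_product(
    [df["ID"].unique(), range(2007, 2016)], names=["ID", "Year"]
)

# Reindex onto the full grid; combinations missing from df get Value = 1.
filled = (
    df.set_index(["ID", "Year"])
      .reindex(full_index, fill_value=1)
      .reset_index()
)
print(filled)
```

This assumes each (ID, Year) pair occurs at most once in the input, since reindex does not accept duplicate index entries.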

Changing an existing column conditional on two other column

I have a data set:
ID Fv_year HP_b_year HP_e_year
1 2010 0 2012
2 0 2009 2011
3 2000 0 2008
4 2001 0 0
I want to generate:
ID Fv_year HP_b_year HP_e_year
1 2010 2010 2012
2 0 2009 2011
3 2000 2000 2008
4 2001 0 0
In words: when Fv_year > 0, HP_b_year = 0 and HP_e_year > 0, I want to set HP_b_year = Fv_year; otherwise HP_b_year should stay as it was. I have used the following code:
def myfunc(x,y,z):
    if x == 0 and y>0 and z>0:
        return y
    else:
        return x
df['HP_b_year'] = df.apply(lambda x: myfunc(x.HP_b_year, x.Fv_year, x.HP_e_year), axis=1)
But it's not working.
You can use loc with conditions
df.loc[(df['HP_e_year']>0) & (df['Fv_year'].ne(0)), ['HP_b_year']] = df['Fv_year'][(df['HP_e_year']>0) & (df['Fv_year'].ne(0))]
ID Fv_year HP_b_year HP_e_year
0 1 2010 2010 2012
1 2 0 2009 2011
2 3 2000 2000 2008
3 4 2001 0 0
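The condition in the question also requires HP_b_year to be 0, which the one-liner above does not check (it happens to give the same result on this sample). A minimal sketch that spells out all three conditions in a named mask might look like the following; the DataFrame is rebuilt from the question and the name mask is illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Fv_year": [2010, 0, 2000, 2001],
    "HP_b_year": [0, 2009, 0, 0],
    "HP_e_year": [2012, 2011, 2008, 0],
})

# All three conditions from the question, combined element-wise.
mask = (df["Fv_year"] > 0) & (df["HP_b_year"] == 0) & (df["HP_e_year"] > 0)

# Overwrite HP_b_year only where the mask holds; other rows are untouched.
df.loc[mask, "HP_b_year"] = df.loc[mask, "Fv_year"]
print(df)
```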

Split int64 Pandas column in two

I've been given a dataset that has dates as an integer using the format 52019 for May 2019. I've put it into a Pandas DataFrame, and I need to extract that date format into a month column and year column, but I can't figure out how to do that for an int64 datatype or how to handle it for the two digit months. So I want to take something like
ID Date
1 22019
2 32019
3 52019
5 102019
and make it become
ID Month Year
1 2 2019
2 3 2019
3 5 2019
5 10 2019
What should I do?
divmod
df['Month'], df['Year'] = np.divmod(df.Date, 10000)
df
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019
Without mutating original dataframe using assign
df.assign(**dict(zip(['Month', 'Year'], np.divmod(df.Date, 10000))))
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019
Using // and %
df['Month'], df['Year'] = df.Date//10000,df.Date%10000
df
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019
Use:
s=pd.to_datetime(df.pop('Date'),format='%m%Y') #convert to datetime and pop deletes the col
df['Month'],df['Year']=s.dt.month,s.dt.year #extract month and year
print(df)
ID Month Year
0 1 2 2019
1 2 3 2019
2 3 5 2019
3 5 10 2019
str.extract can handle the tricky part of figuring out whether the Month has 1 or 2 digits.
(df['Date'].astype(str)
.str.extract(r'^(?P<Month>\d{1,2})(?P<Year>\d{4})$')
.astype(int))
Month Year
0 2 2019
1 3 2019
2 5 2019
3 10 2019
You may also use string slicing if it's guaranteed your numbers have only 5 or 6 digits (if not, use str.extract above):
u = df['Date'].astype(str)
df['Month'], df['Year'] = u.str[:-4], u.str[-4:]
df
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019
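One thing to keep in mind with the slicing approach is that it produces string columns. If integer Month and Year columns are wanted, a cast can be added, as in this small sketch (the DataFrame is rebuilt from the question's sample):

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({"ID": [1, 2, 3, 5], "Date": [22019, 32019, 52019, 102019]})

u = df["Date"].astype(str)

# The last four characters are the year; whatever precedes them is the month.
df["Month"] = u.str[:-4].astype(int)
df["Year"] = u.str[-4:].astype(int)
print(df)
```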
