reshape a pandas DataFrame by transposing two columns and repeating another - python

The title might be a bit confusing, so here is what I want to do.
I would like to convert this dataframe:
pd.DataFrame({'name':['A','B','C'],'date1':[1999,2000,2001],'date2':[2011,2012,2013]})
date1 date2 name
0 1999 2011 A
1 2000 2012 B
2 2001 2013 C
Into the following:
dates name
0 1999 A
1 2011 A
2 2000 B
3 2012 B
4 2001 C
5 2013 C
I've been trying to do pivot tables and transposing, but with no luck.

You can use melt, then remove the variable column with drop, and finally sort with sort_values:
print (pd.melt(df, id_vars='name', value_name='dates')
.drop('variable', axis=1)
.sort_values('name')[['dates','name']])
dates name
0 1999 A
3 2011 A
1 2000 B
4 2012 B
2 2001 C
5 2013 C
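The output above keeps the row labels produced by melt (0, 3, 1, ...). If a clean 0..5 index is wanted, and date1 should always come before date2 within each name, one possible variant is to sort on both name and the melted variable column and then reset the index. This is only a sketch: the sample frame is rebuilt so the snippet runs on its own, and the name out is just illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'name': ['A', 'B', 'C'],
                   'date1': [1999, 2000, 2001],
                   'date2': [2011, 2012, 2013]})

out = (pd.melt(df, id_vars='name', value_name='dates')
         .sort_values(['name', 'variable'])   # date1 before date2 within each name
         .drop('variable', axis=1)
         .reset_index(drop=True)[['dates', 'name']])
print(out)
```

Sorting on two columns makes the within-name order explicit instead of relying on how ties happen to be ordered.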
Another solution with unstack and sort_index:
print (df.set_index('name')
.unstack()
.reset_index(drop=True, level=0)
.sort_index()
.reset_index(name='dates')[['dates','name']])
dates name
0 1999 A
1 2011 A
2 2000 B
3 2012 B
4 2001 C
5 2013 C
Solution with lreshape and sort_values:
print (pd.lreshape(df, {'dates':['date1', 'date2']}).sort_values('name')[['dates','name']])
dates name
0 1999 A
3 2011 A
1 2000 B
4 2012 B
2 2001 C
5 2013 C
Numpy solution with numpy.repeat and flattening by numpy.ravel:
df2 = pd.DataFrame({
"name": np.repeat(df.name, 2),
"dates": df[['date1','date2']].values.ravel()})
print (df2)
dates name
0 1999 A
0 2011 A
1 2000 B
1 2012 B
2 2001 C
2 2013 C
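The frame above keeps a repeated index (0, 0, 1, 1, ...) because np.repeat is applied to a Series. If a fresh index is preferred, a possible variant (a sketch, assuming plain arrays are fine for your use case) converts to NumPy arrays first, so the new DataFrame gets a default 0..5 index:

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'name': ['A', 'B', 'C'],
                   'date1': [1999, 2000, 2001],
                   'date2': [2011, 2012, 2013]})

df2 = pd.DataFrame({
    # ravel() flattens row by row: date1, date2 of A, then of B, ...
    'dates': df[['date1', 'date2']].to_numpy().ravel(),
    # repeat each name once per date column
    'name': np.repeat(df['name'].to_numpy(), 2),
})
print(df2)
```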
EDIT:
lreshape is currently undocumented, and it is possible that it will be removed in the future (together with pd.wide_to_long).
A possible solution is merging all 3 functions into one, maybe melt, but that is not implemented yet. Maybe it will be in some new version of pandas; then I will update my answer.

Related

Filter individuals that don't have data for the whole period

I am using Python 3.9 in PyCharm. I have the following dataframe:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
6 C 2020 5
7 C 2021 4
I want to keep individuals that have available data for the whole period. In other words, I would like to filter the rows so that I only keep ids that have data for all three years (2019, 2020, 2021). This means excluding all observations of id C and keeping all observations of ids A and B:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Is it feasible in Python?
As you want to include only the ids for which all three years exist, you can group the dataframe by id and then filter on set equality between the years you want and the years available for each id:
>>> years = {2019, 2020, 2021}
>>> df.groupby('id').filter(lambda x: set(x['year'].values)==years)
# df is your dataframe
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
First, build the set of all the years present in the year column, then use a boolean mask to filter your dataframe. For that, you need groupby and transform to count the rows of each id and compare that count with the number of distinct years.
from io import StringIO
import pandas as pd
s = """id year gdp
A 2019 3
A 2020 0
A 2021 5
B 2019 4
B 2020 2
B 2021 1
C 2020 5
C 2021 4
"""
df = pd.read_csv(StringIO(s), sep=r'\s+')  # split on any whitespace
# compare each id's row count with the number of distinct years
mask = df.groupby('id')['year'].transform('count').eq(len(set(df['year'])))
out = df[mask]
>>> print(out)
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Here is a way using pivot and dropna to automatically find ids with missing values:
keep = df.pivot(index='id', columns='year', values='gdp').dropna().index
# ['A', 'B']
out = df[df['id'].isin(keep)]
output:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1

Loop through timeseries and fill missing data - Python

I have a DF such as the one below:
ID Year Value
1 2007 1
1 2008 1
1 2009 1
1 2011 1
1 2013 1
1 2014 1
1 2015 1
2 2008 1
2 2010 1
2 2011 1
2 2012 1
2 2013 1
2 2014 1
3 2009 1
3 2010 1
3 2011 1
3 2012 1
3 2013 1
3 2014 1
3 2015 1
As you can see, for ID '1' I am missing values for 2010 and 2012; for ID '2' I am missing values for 2007, 2009 and 2015; and for ID '3' I am missing 2007 and 2008. So, I would like to fill these gaps with the value '1'. What I would like to achieve is below:
ID Year Value
1 2007 1
1 2008 1
1 2009 1
1 2010 1
1 2011 1
1 2012 1
1 2013 1
1 2014 1
1 2015 1
2 2007 1
2 2008 1
2 2009 1
2 2010 1
2 2011 1
2 2012 1
2 2013 1
2 2014 1
2 2015 1
3 2007 1
3 2008 1
3 2009 1
3 2010 1
3 2011 1
3 2012 1
3 2013 1
3 2014 1
3 2015 1
I have created the code below so far; however, it only fills the gaps for one ID, and I was struggling to find a way to loop through each ID, adding a 'Value' for each year that is missing:
idx = pd.date_range('2007', '2020', freq ='Y')
DF.index = pd.DatetimeIndex(DF.index)
DF_s = DF.reindex(idx, fill_value=0)
Any ideas would be helpful, please.
I'm not sure I got what you want to achieve, but if you want to fill NaNs in the "Value" column between 2007 and 2015 (suggesting that there are more years where you don't want to fill the column), you could do something like this:
import math
import pandas as pd

df1 = pd.DataFrame({'ID': [1,1,1,2,2,2],
                    'Year': [2007,2010,2020,2007,2010,2015],
                    'Value': [1,None,None,None,1,None]})
# Write a function with your logic
def func(x, y):
    # fill a missing value with 0 only when the year is in 2007..2015
    return 0 if math.isnan(y) and 2007<=x<=2015 else y
# Apply it to the df and update the column
df1['Value'] = df1.apply(lambda x: func(x.Year, x.Value), axis=1)
# ID Year Value
# 0 1 2007 1.0
# 1 1 2010 0.0
# 2 1 2020 NaN
# 3 2 2007 0.0
# 4 2 2010 1.0
# 5 2 2015 0.0
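The same fill can probably be done without a row-wise apply. As a sketch (assuming the same toy frame as above), a boolean mask combined with .loc should give the same result and tends to scale better on large frames:

```python
import pandas as pd

nan = float('nan')
df1 = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                    'Year': [2007, 2010, 2020, 2007, 2010, 2015],
                    'Value': [1, nan, nan, nan, 1, nan]})

# Rows where Value is missing AND the year lies in 2007..2015 (inclusive)
mask = df1['Value'].isna() & df1['Year'].between(2007, 2015)
df1.loc[mask, 'Value'] = 0
print(df1)
```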
Answering my own question :). I needed to apply a lambda function after the groupby on 'Org' that adds a NaN row for each year that is missing. The reset_index effectively ungroups the result back into a flat table.
f = lambda x: x.reindex(pd.date_range(pd.to_datetime('2007'), pd.to_datetime('2020'), name='date', freq='Y'))
DF_fixed = DF.set_index('Year').groupby(['Org']).apply(f).drop(['Org'], axis=1)
DF_fixed.reset_index()
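For the integer ID/Year/Value table shown in the question, another way to get the desired output is to reindex against every (ID, Year) combination at once. This is a sketch, not the code from the answer above: it assumes Year is a plain integer column, that the target range is 2007 to 2015, and that gaps are filled with 1 as requested. The names years_by_id, full and filled are illustrative.

```python
import pandas as pd

# Rebuild the question's table: the years each ID actually has
years_by_id = {
    1: [2007, 2008, 2009, 2011, 2013, 2014, 2015],
    2: [2008, 2010, 2011, 2012, 2013, 2014],
    3: [2009, 2010, 2011, 2012, 2013, 2014, 2015],
}
df = pd.DataFrame(
    [(i, y, 1) for i, ys in years_by_id.items() for y in ys],
    columns=['ID', 'Year', 'Value'],
)

# Every ID paired with every year in the target range
full = pd.MultiIndex.from_product(
    [df['ID'].unique(), range(2007, 2016)], names=['ID', 'Year']
)

# Missing (ID, Year) pairs become new rows filled with 1
filled = (df.set_index(['ID', 'Year'])
            .reindex(full, fill_value=1)
            .reset_index())
print(filled)
```

This avoids an explicit loop over IDs, since the full index already contains one block of years per ID.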

Summing up rows in a DataFrame while maintaining a similar DataFrame structure

I have the following DataFrame:
Stint Year ID Data1 Data2 Team
1 2010 A 10 1 SFN
1 2011 A 10 1 SFN
1 2013 A 10 1 SFN
2 2013 A 10 1 ATL
1 1922 B 10 1 ARI
1 1923 B 10 1 ARI
1 1924 B 10 1 ARI
I'm trying to return a new DataFrame which sums up values in the Data1 and Data2 columns for identical years. I would like the DataFrame above to ultimately look like this:
Year ID Data1 Data2
2010 A 10 1
2011 A 10 1
2013 A 20 2
1922 B 10 1
1923 B 10 1
1924 B 10 1
I've messed around with some groupby functions, but I'm having trouble getting the proper DataFrame structure.
Thanks!
groupby with as_index=False
This keeps the grouping columns as regular columns instead of moving them into a new index:
df.groupby(['Year', 'ID'], as_index=False)[['Data1', 'Data2']].sum()
Year ID Data1 Data2
0 1922 B 10 1
1 1923 B 10 1
2 1924 B 10 1
3 2010 A 10 1
4 2011 A 10 1
5 2013 A 20 2
groupby with sort=False
Also, if you would like to keep the rows in the original Year order, i.e. [2010, 2011, 2013, 1922, 1923, 1924], you can pass sort=False.
The same code with sorting disabled is:
df.groupby(['Year', 'ID'], as_index= False, sort= False)[['Data1', 'Data2']].sum()
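To try this end to end, here is a small runnable sketch that rebuilds the question's frame and applies the sort=False variant. The name out is illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Stint': [1, 1, 1, 2, 1, 1, 1],
    'Year': [2010, 2011, 2013, 2013, 1922, 1923, 1924],
    'ID': list('AAAABBB'),
    'Data1': [10] * 7,
    'Data2': [1] * 7,
    'Team': ['SFN', 'SFN', 'SFN', 'ATL', 'ARI', 'ARI', 'ARI'],
})

# Sum Data1/Data2 per (Year, ID); groups stay in order of first appearance
out = df.groupby(['Year', 'ID'], as_index=False, sort=False)[['Data1', 'Data2']].sum()
print(out)
```

The two 2013 rows for ID A collapse into one row with Data1 = 20 and Data2 = 2, and the Stint and Team columns are left out because they were not selected.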

Loop Optimization in python

I have a dataframe df like this
Product Yr Value
A 2014 1
A 2015 3
A 2016 2
B 2015 2
B 2016 1
I want to compute the cumulative maximum per product, i.e.
Product Yr Value
A 2014 1
A 2015 3
A 2016 3
B 2015 2
B 2016 2
My actual data has about 50000 products
I am writing code like:
df2=pd.DataFrame()
for i in (df['Product'].unique()):
    data3=df[df['Product']==i]
    data3.sort_values(by=['Yr'])
    data3['Value']=data3['Value'].cummax()
    df2=df2.append(data3)
#df2 is my result
This code is taking a lot of time(~3 days) for about 50000 products and 10 years. Is there some way to speed it up?
You can use groupby.cummax instead:
df['Value'] = df.sort_values('Yr').groupby('Product').Value.cummax()
df
#Product Yr Value
#0 A 2014 1
#1 A 2015 3
#2 A 2016 3
#3 B 2015 2
#4 B 2016 2
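The one-liner works even when the rows are not already in year order, because the assignment aligns on the index. A small sketch with deliberately shuffled rows (same data as the question, with illustrative ordering) shows this:

```python
import pandas as pd

# Same rows as the question, but shuffled
df = pd.DataFrame({'Product': ['A', 'B', 'A', 'B', 'A'],
                   'Yr': [2016, 2016, 2014, 2015, 2015],
                   'Value': [2, 1, 1, 2, 3]})

# Sort by year, take the running max within each product,
# then let pandas align the result back onto the original index
df['Value'] = df.sort_values('Yr').groupby('Product')['Value'].cummax()

result = df.sort_values(['Product', 'Yr']).reset_index(drop=True)
print(result)
```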

Python function definition on two list

Year Month Year_month
2009 2 2009/2
2009 3 2009/3
2007 4 2007/3
2006 10 2006/10
Year_month
200902
200903
200704
200610
I would like to combine the Year and Month columns into the Year_month format shown above (i.e. replace the original column). How could I do it? The following approach does not seem to work in Python. Thanks.
def f(x, y)
    return x*100+y
for i in range(0,filename.shape[0]):
    filename['Year_month'][i] = f(filename['year'][i] ,filename['month'][i])
I think you can use zfill:
df['Year_month'] = df.Year.astype(str) + df.Month.astype(str).str.zfill(2)
print(df)
Year Month Year_month
0 2009 2 200902
1 2009 3 200903
2 2007 4 200704
3 2006 10 200610
df = pd.read_clipboard()
Year Month Year_month
0 2009 2 2009/2
1 2009 3 2009/3
2 2007 4 2007/3
3 2006 10 2006/10
df['Year_month'] = df.apply(lambda row: str(row.Year)+str(row.Month).zfill(2), axis=1)
Year Month Year_month
0 2009 2 200902
1 2009 3 200903
2 2007 4 200704
3 2006 10 200610
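If an integer result is acceptable instead of a string, the x*100+y idea from the question can also be applied to whole columns at once, with no loop. A minimal sketch, assuming Year and Month are integer columns:

```python
import pandas as pd

# Sample data from the question (Year_month is recomputed below)
df = pd.DataFrame({'Year': [2009, 2009, 2007, 2006],
                   'Month': [2, 3, 4, 10]})

# Column-wise arithmetic: 2009 * 100 + 2 -> 200902
df['Year_month'] = df['Year'] * 100 + df['Month']
print(df)
```

Note that this gives numbers rather than strings, so use the zfill approach if you need text such as '200902'.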
