Pandas - manipulate dataframe to create multi-level columns - python

Here is the dataframe:
A B val val2 loc
1 march 3 2 NY
1 april 5 1 NY
1 may 12 4 NY
2 march 4 1 NJ
2 april 7 5 NJ
2 may 12 1 NJ
3 march 1 8 CA
3 april 54 6 CA
3 may 2 9 CA
I'd like to transform this into:
march march april april may may
val1 val2 val1 val2 val1 val2
A B
1 NY 3 5 12 2 1 4
2 NJ 4 7 12 1 5 5
3 CA 1 54 2 8 6 9
I'm looking into pivot tables and stacking/unstacking, but I'm truly stuck. I'm not sure where to start.

With pd.pivot_table, and some swapping of levels:
new_df = (pd.pivot_table(df, ['val', 'val2'], ['A', 'loc'], ['B'])
          .sort_index(axis=1, level=1)
          .swaplevel(0, axis=1))
>>> new_df
B april march may
val val2 val val2 val val2
A loc
1 NY 5 1 3 2 12 4
2 NJ 7 5 4 1 12 1
3 CA 54 6 1 8 2 9
If the ordering of your columns is important (i.e., you need march, april, may), you can pass an ordered categorical:
new_df = (pd.pivot_table(df, ['val', 'val2'], ['A', 'loc'],
                         [pd.Categorical(df.B, categories=['march', 'april', 'may'],
                                         ordered=True)])
          .dropna(how='all')
          .sort_index(axis=1, level=1)
          .swaplevel(0, axis=1))
>>> new_df
B march april may
val val2 val val2 val val2
A loc
1 NY 3.0 2.0 5.0 1.0 12.0 4.0
2 NJ 4.0 1.0 7.0 5.0 12.0 1.0
3 CA 1.0 8.0 54.0 6.0 2.0 9.0
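If you prefer explicit keyword arguments, here is a minimal sketch of the same pivot (assuming the df from the question); the month order is restored by selecting the top-level column groups in order instead of using a Categorical:
pivoted = pd.pivot_table(df, values=['val', 'val2'], index=['A', 'loc'], columns='B')
# swap so the months become the top column level, then pick the groups in the desired order
new_df = pivoted.swaplevel(0, 1, axis=1)[['march', 'april', 'may']]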

Related

Panel Data - dealing with missing year when creating lead and lag variables

I work with panel data. Typically my panel data is not balanced, i.e., there are some missing years. The general look of the panel data is as follows:
df = pd.DataFrame({'name': ['a']*4 + ['b']*3 + ['c']*4,
                   'year': [2001, 2002, 2004, 2005] + [2000, 2002, 2003] + [2001, 2002, 2003, 2005],
                   'val1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
                   'val2': [2, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]})
name year val1 val2
0 a 2001 1 2
1 a 2002 2 5
2 a 2004 3 7
3 a 2005 4 11
4 b 2000 5 13
5 b 2002 6 17
6 b 2003 7 19
7 c 2001 8 23
8 c 2002 9 29
9 c 2003 10 31
10 c 2005 11 37
Now I want to create lead and lag variables grouped by name. Using:
df['val1_lag'] = df.groupby('name')['val1'].shift(1)
df['val1_lead'] = df.groupby('name')['val1'].shift(-1)
This simply shifts up/down by one row, which is not what I want. I want to shift relative to year. My expected output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
My current workaround is to fill in the missing years by:
df.set_index(['name', 'year'], inplace=True)
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]], names=['name', 'year'])
df = df.reindex(mux).reset_index()
Then I use a normal shift. However, my data is quite large, and this approach often triples the data size, which is not very efficient here.
I am looking for a better approach for this scenario.
One solution is to create check columns that flag whether the adjacent year (for the lag and the lead) is actually consecutive. Set the check column to 1.0 or NaN, then multiply it by your normal groupby shift:
df['yearlag'] = (df['year'] == 1 + df.groupby('name')['year'].shift(1))*1.0
df.loc[df['yearlag']==0.0, 'yearlag'] = None
df['yearlead'] = (df['year'] == -1 + df.groupby('name')['year'].shift(-1))*1.0
df.loc[df['yearlead']==0.0, 'yearlead'] = None
To create the lag/lead variables:
%timeit df['val1_lag'] = df.groupby('name')['val1'].shift(1)*df['yearlag']
You can compare it with the merge method from the other answer, which is much more efficient:
%timeit df['val1_lag'] = df[['name', 'year']].merge(df.eval('year=year+1'), how='left')['val1']
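Putting it together without the %timeit magic, a minimal sketch of the final assignments for both directions (using the check columns defined above):
df['val1_lag'] = df.groupby('name')['val1'].shift(1) * df['yearlag']     # NaN when the previous year is missing
df['val1_lead'] = df.groupby('name')['val1'].shift(-1) * df['yearlead']  # NaN when the next year is missing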
Don't use shift; instead, merge on the year ± 1:
df['val1_lag'] = df[['name', 'year']].merge(df.eval('year = year+1'), how='left')['val1']
df['val1_lead'] = df[['name', 'year']].merge(df.eval('year = year-1'), how='left')['val1']
Output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
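The merge idea can be generalized with a small helper (a hypothetical sketch, not part of the original answer) that fetches a value column shifted by an arbitrary number of years within each name; it assumes a default RangeIndex, as in the question, so the result aligns positionally:
def shifted_by_year(df, col, offset):
    # rows keyed by (name, year + offset) supply the value for (name, year),
    # so offset=1 gives the lag (year - 1) and offset=-1 gives the lead (year + 1)
    donor = df[['name', 'year', col]].assign(year=df['year'] + offset)
    return df[['name', 'year']].merge(donor, on=['name', 'year'], how='left')[col]

df['val1_lag'] = shifted_by_year(df, 'val1', 1)
df['val1_lead'] = shifted_by_year(df, 'val1', -1)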

Count the number of column values (number of unique values of column) that have at least one non null response

This is what my dataframe looks like:
Year  State  Var1  Var2
2018  1      1     3
2018  1      2     NaN
2018  1      NaN   1
2018  2      NaN   1
2018  2      NaN   2
2018  3      3     NaN
2019  1      1     NaN
2019  1      3     NaN
2019  1      2     NaN
2019  1      NaN   NaN
2019  2      NaN   NaN
2019  2      3     NaN
2020  1      1     NaN
2020  2      NaN   1
2020  2      NaN   3
2020  3      3     NaN
2020  4      NaN   NaN
2020  4      1     NaN
Desired Output
Year 2018 2019 2020
Var1 Num of States w/ non-null 2 2 3
Var2 Num of States w/ non-null 2 0 1
I want to count, for each variable, the number of unique values of State that have at least one non-null response.
IIUC you are looking for:
out = pd.concat([
    df.dropna(subset='Var1').pivot_table(columns='Year',
                                         values='State',
                                         aggfunc='nunique'),
    df.dropna(subset='Var2').pivot_table(columns='Year',
                                         values='State',
                                         aggfunc='nunique')
]).fillna(0).astype(int)
out.index = ['Var1 Num of States w/ non-null', 'Var2 Num of States w/ non-null']
print(out)
Year                            2018  2019  2020
Var1 Num of States w/ non-null     2     2     3
Var2 Num of States w/ non-null     2     0     1
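The same idea scales beyond Var1/Var2 if you loop over the value columns; a minimal sketch (assuming the df from the question):
var_cols = ['Var1', 'Var2']
out = pd.concat(
    [df.dropna(subset=[c]).pivot_table(columns='Year', values='State', aggfunc='nunique')
     for c in var_cols]
).fillna(0).astype(int)
out.index = [f'{c} Num of States w/ non-null' for c in var_cols]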

Using values in a pandas dataframe as column names for another

This is my first dataframe.
Date 0 1 2 3
2003-01-31 CA KY ID CO
2003-02-28 CA KY HI CO
2003-03-31 CA KY CO HI
This is my second dataframe.
Date CA KY ID CO HI
2003-01-31 5 3 4 5 1
2003-02-28 2 7 8 4 5
2003-03-31 6 3 9 3 5
How do I get this dataframe to print as output?
Date 0 1 2 3
2003-01-31 5 3 4 5
2003-02-28 2 7 5 4
2003-03-31 6 3 3 5
I am wondering if there is a way to use the whole dataframe as an index to another instead of having to loop through all the dates/columns.
You can use df.lookup with df.apply here.
# If `Date` is not index.
# df1.set_index('Date')
# 0 1 2 3
# Date
# 2003-01-31 CA KY ID CO
# 2003-02-28 CA KY HI CO
# 2003-03-31 CA KY CO HI
# df2.set_index('Date')
# CA KY ID CO HI
# Date
# 2003-01-31 5 3 4 5 1
# 2003-02-28 2 7 8 4 5
# 2003-03-31 6 3 9 3 5
def f(x):
    return df2.lookup(x.index, x)

df1.apply(f)
# equivalently: df1.apply(lambda x: df2.lookup(x.index, x))
0 1 2 3
Date
2003-01-31 5 3 4 5
2003-02-28 2 7 5 4
2003-03-31 6 3 3 5
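Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A minimal sketch of an equivalent using positional indexing (the helper name lookup_values is hypothetical; it assumes df1 and df2 are both indexed by Date, with unique index and columns):
def lookup_values(values_df, keys):
    rows = values_df.index.get_indexer(keys.index)   # row position for each Date
    cols = values_df.columns.get_indexer(keys)       # column position for each state code
    return values_df.to_numpy()[rows, cols]

result = df1.apply(lambda col: lookup_values(df2, col))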
This will print the values of dataframe df with the column names of dataframe df1.
print(df.rename(columns={i: j for i, j in zip(df.columns.tolist(), df1.columns.tolist())}))
If you want to make the changes permanent, add the parameter inplace=True.

Update dataframe with hierarchical index

I have a data series that looks like this
Component Date Sev Counts
PS 2009 3 4
4 1
2010 1 2
3 2
4 1
2011 2 3
3 5
4 1
2012 1 1
2 5
3 7
2013 2 4
3 9
2014 1 2
2 3
3 4
2015 1 2
2 100
3 31
4 31
2016 1 44
2 27
3 45
Name: Alarm Name, dtype: int64
And I have a vector that gives a certain quantity per year:
Number
Date
2009-12-31 8.0
2010-12-31 3.0
2011-12-31 13.0
2012-12-31 2.0
2013-12-31 3.0
2014-12-31 4.0
2015-12-31 6.0
2016-12-31 71.0
I want to divide the Counts in the series by my vector (Counts/Number). I also want to obtain my original dataframe with the updated numbers.
This is my code
count = 0
for i in df3.index.year:
    df2.ix['PS'].ix[i].apply(lambda x: x / float(df3.iloc[count]))
    count = count + 1
But my dataframe df2 has not changed. Any hints? Thanks.
I think you need to divide by the column Number with div, but first convert the index of df to years:
df.index = df.index.year
s = s.div(df.Number, level=1)
print (s)
Component Date Sev Counts
PS 2009 3 0.500000
4 0.125000
2010 1 0.666667
3 0.666667
4 0.333333
2011 2 0.230769
3 0.384615
4 0.076923
2012 1 0.500000
2 2.500000
3 3.500000
2013 2 1.333333
3 3.000000
2014 1 0.500000
2 0.750000
3 1.000000
2015 1 0.333333
2 16.666667
3 5.166667
4 5.166667
2016 1 0.619718
2 0.380282
3 0.633803
dtype: float64
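The same division can also be written without mutating df, referring to the MultiIndex level by name; a minimal sketch (assuming df still carries its original year-end DatetimeIndex with a Number column, s is the Counts series, and the level is named Date):
number = df['Number'].copy()
number.index = number.index.year        # 2009, 2010, ... matches the values of the Date level
result = s.div(number, level='Date')    # divide Counts by Number, broadcast across the Date level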

Get dataframe columns from a list using isin

I have a dataframe df1, and I have a list which contains names of several columns of df1.
df1:
User_id month day Age year CVI ZIP sex wgt
0 1 7 16 1977 2 NA M NaN
1 2 7 16 1977 3 NA M NaN
2 3 7 16 1977 2 DM F NaN
3 4 7 16 1977 7 DM M NaN
4 5 7 16 1977 3 DM M NaN
... ... ... ... ... ... ... ... ...
35544 35545 12 31 2002 15 AH NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN
35546 35547 12 31 2002 10 RM F 14
35547 35548 12 31 2002 7 DO M 51
35548 35549 12 31 2002 5 NaN NaN NaN
list = [u"User_id", u"day", u"ZIP", u"sex"]
I want to make a new dataframe df2 which will contain only those columns which are in the list, and a dataframe df3 which will contain the columns which are not in the list.
Here I found that I need to do:
df2=df1[df1[df1.columns[1]].isin(list)]
But as a result I get:
Empty DataFrame
Columns: []
Index: []
[0 rows x 9 columns]
What am I doing wrong, and how can I get the needed result? Why "9 columns" if it is supposed to be 4?
Solution with Index.difference:
L = [u"User_id", u"day", u"ZIP", u"sex"]
df2 = df1[L]
df3 = df1[df1.columns.difference(df2.columns)]
print (df2)
User_id day ZIP sex
0 0 7 NaN M
1 1 7 NaN M
2 2 7 DM F
3 3 7 DM M
4 4 7 DM M
print (df3)
Age CVI month wgt year
0 16 2 1 NaN 1977
1 16 3 2 NaN 1977
2 16 2 3 NaN 1977
3 16 7 4 NaN 1977
4 16 3 5 NaN 1977
Or:
df2 = df1[L]
df3 = df1[df1.columns.difference(pd.Index(L))]
print (df2)
User_id day ZIP sex
0 0 7 NaN M
1 1 7 NaN M
2 2 7 DM F
3 3 7 DM M
4 4 7 DM M
print (df3)
Age CVI month wgt year
0 16 2 1 NaN 1977
1 16 3 2 NaN 1977
2 16 2 3 NaN 1977
3 16 7 4 NaN 1977
4 16 3 5 NaN 1977
Never name a list "list":
my_list= [u"User_id", u"day", u"ZIP", u"sex"]
df2 = df1[df1.keys()[df1.keys().isin(my_list)]]
or
df2 = df1[df1.columns[df1.columns.isin(my_list)]]
You can try:
df2 = df1[list] # it does a projection on the columns contained in the list
df3 = df1[[col for col in df1.columns if col not in list]]
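A minimal sketch of the same split done with a boolean mask over the columns, which keeps the original column order on both sides:
my_list = [u"User_id", u"day", u"ZIP", u"sex"]
in_list = df1.columns.isin(my_list)
df2 = df1.loc[:, in_list]
df3 = df1.loc[:, ~in_list]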
