Retain NaN values after concatenating - python

I have a DataFrame to which I want to apply a function. How can I retain the NaN values even after concatenating two columns? I want to avoid np.where since the real function has more elif conditions.
df fruit year price vol significance
0 apple 2010 1 5 NaN
1 apple 2011 2 4 NaN
2 apple 2012 3 3 NaN
3 NaN 2013 3 3 NaN
4 NaN NaN NaN 3 NaN
5 apple 2015 3 3 important
df = df.fillna('')

def func(row):
    if pd.notna(row['year']):
        return row['fruit'] + row['significance'] + str(row['price']) + '_test'
    else:
        return np.nan

df['final'] = df.apply(func, axis=1)
Expected Output
df fruit year price vol significance final
0 apple 2010 1 5 NaN apple1_test
1 apple 2011 2 4 NaN apple2_test
2 apple 2012 3 3 NaN apple3_test
3 NaN 2013 3 3 NaN 3_test
4 NaN 2014 NaN 3 NaN NaN
5 apple 2015 3 3 important appleimportant3_test

df = df.fillna('')

def func(row):
    # after fillna(''), missing fields are empty strings, so a fully missing row yields a == ''
    a = f"{row['fruit']}{row['significance']}{row['price']}"
    if a:
        return a + '_test'
    return np.nan

df['final'] = df.apply(func, axis=1)

First remove df = df.fillna(''), then use your solution with an added elif that tests for missing values in both columns:
def func(row):
    if pd.notna(row['fruit']) and pd.notna(row['significance']):
        return row['fruit'] + '_' + row['significance']
    elif pd.isna(row['fruit']) and pd.isna(row['significance']):
        return 'apple'
    else:
        return row['fruit']

df['final'] = df.apply(func, axis=1)
print(df)
df fruit year price vol significance final
0 apple 2010 1 5 NaN apple
1 apple 2011 2 4 NaN apple
2 apple 2012 3 3 NaN apple
3 apple 2013 3 3 NaN apple
4 NaN 2014 3 3 NaN apple
5 apple 2015 3 3 important apple_important
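The reason the original attempt never produced NaN is that df.fillna('') turns missing values into empty strings, and pandas does not treat an empty string as missing. A quick check (a minimal sketch, not part of the original answers):

import numpy as np
import pandas as pd

print(pd.notna(np.nan))  # False: a real missing value
print(pd.notna(''))      # True: after fillna(''), notna/isna can no longer detect the gap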


Panel Data - dealing with missing year when creating lead and lag variables

I work with panel data. Typically my panel data is not balanced, i.e., there are some missing years. The general structure of the panel data is as follows:
df = pd.DataFrame({'name': ['a']*4+['b']*3+['c']*4,
'year':[2001,2002,2004,2005]+[2000,2002,2003]+[2001,2002,2003,2005],
'val1':[1,2,3,4,5,6,7,8,9,10,11],
'val2':[2,5,7,11,13,17,19,23,29,31,37]})
name year val1 val2
0 a 2001 1 2
1 a 2002 2 5
2 a 2004 3 7
3 a 2005 4 11
4 b 2000 5 13
5 b 2002 6 17
6 b 2003 7 19
7 c 2001 8 23
8 c 2002 9 29
9 c 2003 10 31
10 c 2005 11 37
Now I want to create lead and lag variables grouped by name. Using:
df['val1_lag'] = df.groupby('name')['val1'].shift(1)
df['val1_lead'] = df.groupby('name')['val1'].shift(-1)
This simply shifts up/down by one row, which is not what I want. I want to shift relative to year. My expected output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
My current workaround is to fill in the missing years with:
df.set_index(['name', 'year'], inplace=True)
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]], names=['name', 'year'])
df = df.reindex(mux).reset_index()
Then I use the normal shift. However, my data is quite large, and this approach often triples the data size, which is not very efficient.
I am looking for a better approach for this scenario.
The solution is to create check columns that flag whether the year is consecutive with its lag/lead. Set the check column to 1.0 or NaN, then multiply it by your normal groupby shift:
df['yearlag'] = (df['year'] == 1 + df.groupby('name')['year'].shift(1))*1.0
df.loc[df['yearlag']==0.0, 'yearlag'] = None
df['yearlead'] = (df['year'] == -1 + df.groupby('name')['year'].shift(-1))*1.0
df.loc[df['yearlead']==0.0, 'yearlead'] = None
To create the lag/lead variables:
%timeit df['val1_lag'] = df.groupby('name')['val1'].shift(1)*df['yearlag']
%timeit df['val1_lead'] = df.groupby('name')['val1'].shift(-1)*df['yearlead']
You can compare this with the merge-based method shown in the next answer, which is much more efficient:
%timeit df['val1_lag'] = df[['name', 'year']].merge(df.eval('year=year+1'), how='left')['val1']
Don't use shift; instead, merge on the year ± 1:
df['val1_lag'] = df[['name', 'year']].merge(df.eval('year = year+1'), how='left')['val1']
df['val1_lead'] = df[['name', 'year']].merge(df.eval('year = year-1'), how='left')['val1']
Output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
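To see why the merge works: adding 1 to year in the right-hand frame makes each row present itself as the value for the following year, so a left merge on ['name', 'year'] only finds a match when the previous year actually exists for that name. A self-contained sketch using the sample data from the question:

import pandas as pd

df = pd.DataFrame({'name': ['a']*4 + ['b']*3 + ['c']*4,
                   'year': [2001, 2002, 2004, 2005, 2000, 2002, 2003, 2001, 2002, 2003, 2005],
                   'val1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
                   'val2': [2, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]})

# lag: match each (name, year) against the same name's rows shifted forward one year
df['val1_lag'] = df[['name', 'year']].merge(df.eval('year = year + 1'), how='left')['val1']
# lead: match against rows shifted back one year
df['val1_lead'] = df[['name', 'year']].merge(df.eval('year = year - 1'), how='left')['val1']
print(df)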

Count the number of unique values of a column that have at least one non-null response

This is what my dataframe looks like:
Year  State  Var1  Var2
2018      1     1     3
2018      1     2   NaN
2018      1   NaN     1
2018      2   NaN     1
2018      2   NaN     2
2018      3     3   NaN
2019      1     1   NaN
2019      1     3   NaN
2019      1     2   NaN
2019      1   NaN   NaN
2019      2   NaN   NaN
2019      2     3   NaN
2020      1     1   NaN
2020      2   NaN     1
2020      2   NaN     3
2020      3     3   NaN
2020      4   NaN   NaN
2020      4     1   NaN
Desired Output
Year 2018 2019 2020
Var1 Num of States w/ non-null 2 2 3
Var2 Num of States w/ non-null 2 0 1
I want to count, for each year and each variable, the number of unique State values that have at least one non-null response.
IIUC you are looking for:
out = pd.concat([
    df.dropna(subset=['Var1']).pivot_table(columns='Year',
                                           values='State',
                                           aggfunc='nunique'),
    df.dropna(subset=['Var2']).pivot_table(columns='Year',
                                           values='State',
                                           aggfunc='nunique')
]).fillna(0).astype(int)
out.index = ['Var1 Num of States w/non-null', 'Var2 Num of states w/non-null']
print(out)
Year 2018 2019 2020
Var1 Num of States w/non-null 2 2 3
Var2 Num of states w/non-null 2 0 1
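An equivalent route (a sketch, not from the original answer) is to ask, for each Year/State pair, whether each variable has any non-null response, and then count the flagged states per year. The sample frame below is reconstructed from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Year':  [2018]*6 + [2019]*6 + [2020]*6,
    'State': [1, 1, 1, 2, 2, 3, 1, 1, 1, 1, 2, 2, 1, 2, 2, 3, 4, 4],
    'Var1':  [1, 2, np.nan, np.nan, np.nan, 3, 1, 3, 2, np.nan, np.nan, 3,
              1, np.nan, np.nan, 3, np.nan, 1],
    'Var2':  [3, np.nan, 1, 1, 2, np.nan] + [np.nan]*6
             + [np.nan, 1, 3, np.nan, np.nan, np.nan],
})

# per (Year, State): does each variable have any non-null response? then count per year
out = (df.set_index(['Year', 'State'])[['Var1', 'Var2']]
         .notna()
         .groupby(level=['Year', 'State']).any()
         .groupby(level='Year').sum()
         .T)
print(out)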

Spread single value in group across all other NaN values in group

In this example, we attempt to propagate the single non-NaN value within each group and column to all other NaNs in the same group and column.
import pandas as pd
df = pd.DataFrame({'id':[1,1,2,2,3,4,5], 'Year':[2000,2000, 2001, 2001, 2000, 2000, 2000], 'Values': [1, 3, 2, 3, 4, 5,6]})
df['pct'] = df.groupby(['id', 'Year'])['Values'].apply(lambda x: x/x.shift() - 1)
print(df)
id Year Values pct
0 1 2000 1 NaN
1 1 2000 3 2.0
2 2 2001 2 NaN
3 2 2001 3 0.5
4 3 2000 4 NaN
5 4 2000 5 NaN
6 5 2000 6 NaN
I have tried to use .ffill() to fill the NaNs within each group that contains a value. For example, the goal is for the NaN at index 0 to become 2.0 and the NaN at index 2 to become 0.5.
df['pct'] = df.groupby(['id', 'Year'])['pct'].ffill()
print(df)
id Year Values pct
0 1 2000 1 NaN
1 1 2000 3 2.0
2 2 2001 2 NaN
3 2 2001 3 0.5
4 3 2000 4 NaN
5 4 2000 5 NaN
6 5 2000 6 NaN
It should be bfill, since the known pct value sits after the NaN within each group:
df['pct'] = df.groupby(['id', 'Year'])['pct'].bfill()
df
Out[109]:
id Year Values pct
0 1 2000 1 2.0
1 1 2000 3 2.0
2 2 2001 2 0.5
3 2 2001 3 0.5
4 3 2000 4 NaN
5 4 2000 5 NaN
6 5 2000 6 NaN
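A minimal alternative sketch (assuming the same frame as above): pct_change computes the same ratio as the shift-based lambda, and the group-wise bfill then spreads the single computed value back over the leading NaN in each group:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 4, 5],
                   'Year': [2000, 2000, 2001, 2001, 2000, 2000, 2000],
                   'Values': [1, 3, 2, 3, 4, 5, 6]})

# same result as x / x.shift() - 1 within each (id, Year) group
df['pct'] = df.groupby(['id', 'Year'])['Values'].pct_change()
# propagate the one known value backwards onto the group's leading NaN
df['pct'] = df.groupby(['id', 'Year'])['pct'].bfill()
print(df)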

Convert a single row into a different dataframe in pandas python

I am working on a dataframe of shape 146 rows x 48 columns. The columns are
['Region','Rank 2015','Score 2015','Economy 2015','Family 2015','Health 2015','Freedom 2015','Generosity 2015','Trust 2015','Rank 2016','Score 2016','Economy 2016','Family 2016','Health 2016','Freedom 2016','Generosity 2016','Trust 2016','Rank 2017','Score 2017','Economy 2017','Family 2017','Health 2017','Freedom 2017','Generosity 2017','Trust 2017','Rank 2018','Score 2018','Economy 2018','Family 2018','Health 2018','Freedom 2018','Generosity 2018','Trust 2018','Rank 2019','Score 2019','Economy 2019','Family 2019','Health 2019','Freedom 2019','Generosity 2019','Trust 2019','Score Mean','Economy Mean','Family Mean','Health Mean','Freedom Mean','Generosity Mean','Trust Mean']
I want to access a particular row and convert it to the following dataframe:
Year Rank Score Family Health Freedom Generosity Trust
0 2015 NaN NaN NaN NaN NaN NaN NaN
1 2016 NaN NaN NaN NaN NaN NaN NaN
2 2017 NaN NaN NaN NaN NaN NaN NaN
3 2018 NaN NaN NaN NaN NaN NaN NaN
4 2019 NaN NaN NaN NaN NaN NaN NaN
Any help is welcome. Thank you in advance.
An alternate way:
cols=['Region','Rank 2015','Score 2015','Economy 2015','Family 2015','Health 2015','Freedom 2015','Generosity 2015', 'Trust 2015','Rank 2016','Score 2016','Economy 2016','Family 2016','Health 2016','Freedom 2016','Generosity 2016','Trust 2016', 'Rank 2017','Score 2017','Economy 2017','Family 2017','Health 2017','Freedom 2017','Generosity 2017','Trust 2017','Rank 2018','Score 2018','Economy 2018','Family 2018','Health 2018','Freedom 2018','Generosity 2018','Trust 2018','Rank 2019','Score 2019','Economy 2019','Family 2019','Health 2019','Freedom 2019','Generosity 2019','Trust 2019','Score Mean','Economy Mean','Family Mean','Health Mean','Freedom Mean','Generosity Mean','Trust Mean']
# source dataframe
df1 = pd.DataFrame(columns=cols)
df1.loc[0] = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
#target dataframe
df2 = pd.DataFrame(columns=['Year','Rank','Score','Family','Health','Freedom','Generosity','Trust','Economy'])
df2['Year']=['2015','2016','2017','2018','2019','Mean']
df2.set_index('Year', inplace=True)
idx = 0  # source row to copy
for col in df1.columns[1:]:
    c, r = col.split(" ")
    df2.at[r, c] = df1.at[idx, col]
print (df2)
Rank Score Family Health Freedom Generosity Trust Economy
Year
2015 1 1 1 1 1 1 1 1
2016 1 1 1 1 1 1 1 1
2017 1 1 1 1 1 1 1 1
2018 1 1 1 1 1 1 1 1
2019 1 1 1 1 1 1 1 1
Mean NaN 1 1 1 1 1 1 1
Here's a solution utilizing list comprehension:
The input:
cols = ['Region','Rank 2015','Score 2015','Economy 2015','Family 2015','Health 2015','Freedom 2015','Generosity 2015','Trust 2015','Rank 2016','Score 2016','Economy 2016','Family 2016','Health 2016','Freedom 2016','Generosity 2016','Trust 2016','Rank 2017','Score 2017','Economy 2017','Family 2017','Health 2017','Freedom 2017','Generosity 2017','Trust 2017','Rank 2018','Score 2018','Economy 2018','Family 2018','Health 2018','Freedom 2018','Generosity 2018','Trust 2018','Rank 2019','Score 2019','Economy 2019','Family 2019','Health 2019','Freedom 2019','Generosity 2019','Trust 2019','Score Mean','Economy Mean','Family Mean','Health Mean','Freedom Mean','Generosity Mean','Trust Mean']
df = pd.DataFrame(np.random.randint(1,10,(3,48)))
df.columns = cols
print(df.iloc[:, :4])
Region Rank 2015 Score 2015 Economy 2015
0 7 9 9 9
1 8 7 2 3
2 3 3 4 5
And the new dataframe would be:
target_cols = ['Rank', 'Score', 'Family', 'Health', 'Freedom', 'Generosity', 'Trust']
years = ['2015', '2016', '2017', '2018', '2019']
newdf = pd.DataFrame([df.loc[1, [x + ' ' + year for x in target_cols]].values for year in years])
newdf.columns = target_cols
newdf['year'] = years
print(newdf)
Rank Score Family Health Freedom Generosity Trust year
0 7 2 6 9 3 4 9 2015
1 2 8 1 1 7 6 1 2016
2 7 4 2 5 1 7 4 2017
3 9 7 1 4 7 5 2 2018
4 5 4 4 9 1 6 2 2019
Assuming that the target years span 2015 through 2019 and that the target columns are known, I would proceed as follows:
(1) define the target columns and years
target_columns = ['Rank', 'Score', 'Family', 'Health', 'Freedom', 'Generosity', 'Trust']
target_years = ['2015', '2016', '2017', '2018', '2019']
(2) retrieve the particular row; I assume your starting dataframe is initial_dataframe
particular_row = initial_dataframe.iloc[0]
(3) retrieve and reshape the information from the particular_row
reshaped_row = {'Year': target_years}
reshaped_row.update({
    column_name: [particular_row[column_name + ' ' + year_name] for year_name in target_years]
    for column_name in target_columns
})
(4) assign the reshaped row to the output_dataframe
output_dataframe = pd.DataFrame(reshaped_row)
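Putting the four steps together into one runnable sketch (the single-row initial_dataframe below is a stand-in built from the column list, filled with dummy values):

import pandas as pd

years = ['2015', '2016', '2017', '2018', '2019']
per_year = ['Rank', 'Score', 'Economy', 'Family', 'Health', 'Freedom', 'Generosity', 'Trust']
cols = (['Region']
        + [f'{c} {y}' for y in years for c in per_year]
        + [f'{c} Mean' for c in per_year if c != 'Rank'])

# stand-in for the real 146 x 48 frame
initial_dataframe = pd.DataFrame([list(range(len(cols)))], columns=cols)

target_columns = ['Rank', 'Score', 'Family', 'Health', 'Freedom', 'Generosity', 'Trust']
target_years = years

particular_row = initial_dataframe.iloc[0]
reshaped_row = {'Year': target_years}
reshaped_row.update({
    column_name: [particular_row[f'{column_name} {year_name}'] for year_name in target_years]
    for column_name in target_columns
})
output_dataframe = pd.DataFrame(reshaped_row)
print(output_dataframe)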
Have you tried using a 2D array? I would find that to be the easiest. Otherwise, you could also use a dictionary. https://www.w3schools.com/python/python_dictionaries.asp
I didn't fully understand your question, but I can give you a hint on how to reshape the data.
df = pd.DataFrame(li)  # li is the list of column names shown above
df = df[0].str.split(r"(\d{4})", expand=True)
df = df[df[2] == ""]
col_name = df[0].unique()
df_new = df.pivot(index=1, columns=0, values=2)
df_new.drop(df_new.index[0], inplace=True)
df_new:
Economy Family Freedom Generosity Health Rank Score Trust
1
2016
2017
2018
2019
You can write your own logic. It needs a fair amount of manipulation; a simple idea is to build the required dict first and then construct the DataFrame from it.
In [61]: dicts = {}
In [62]: for t in text[1:]:  # text is the list of column names shown above
    ...:     n, y = t.split(" ")
    ...:     if n not in dicts:
    ...:         dicts[n] = []
    ...:     if y != "Mean":
    ...:         if n == 'Rank':
    ...:             dicts[n].append(y)
    ...:         else:
    ...:             dicts[n].append(np.nan)
    ...:
In [63]: df = pd.DataFrame(dicts)
In [64]: df['Year'] = df['Rank']
In [65]: df['Rank'] = df['Family']
In [66]: df
Out[66]:
Rank Score Economy Family Health Freedom Generosity Trust Year
0 NaN NaN NaN NaN NaN NaN NaN NaN 2015
1 NaN NaN NaN NaN NaN NaN NaN NaN 2016
2 NaN NaN NaN NaN NaN NaN NaN NaN 2017
3 NaN NaN NaN NaN NaN NaN NaN NaN 2018
4 NaN NaN NaN NaN NaN NaN NaN NaN 2019

merging dataframes on the same index

I can't find the answer to this in here.
I have two dataframes:
index, name, color, day
0 NaN NaN NaN
1 b red thu
2 NaN NaN NaN
3 d green mon

index, name, color, week
0 c blue 1
1 NaN NaN NaN
2 t yellow 4
3 NaN NaN NaN
And I'd like the result to be one dataframe:
index, name, color, day, week
0 c blue NaN 1
1 b red thu NaN
2 t yellow NaN 4
3 d green mon NaN
Is there a way to merge the dataframes on their indexes, while adding new columns?
You can use DataFrame.combine_first:
df = df1.combine_first(df2)
print (df)
color day name week
0 blue NaN c 1.0
1 red thu b NaN
2 yellow NaN t 4.0
3 green mon d NaN
For a custom column order, build the column names with numpy.concatenate and pd.unique, then reindex (reindex_axis has been removed in newer pandas, so use reindex instead):
cols = pd.unique(np.concatenate([df1.columns, df2.columns]))
df = df1.combine_first(df2).reindex(columns=cols)
print (df)
name color day week
0 c blue NaN 1.0
1 b red thu NaN
2 t yellow NaN 4.0
3 d green mon NaN
EDIT:
Rename the columns so they align:
df = df1.combine_first(df2.rename(columns={'week':'day'}))
print (df)
name color day
0 c blue 1
1 b red thu
2 t yellow 4
3 d green mon
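A self-contained version of the combine_first approach (the two frames below are reconstructed from the question):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': [np.nan, 'b', np.nan, 'd'],
                    'color': [np.nan, 'red', np.nan, 'green'],
                    'day': [np.nan, 'thu', np.nan, 'mon']})
df2 = pd.DataFrame({'name': ['c', np.nan, 't', np.nan],
                    'color': ['blue', np.nan, 'yellow', np.nan],
                    'week': [1, np.nan, 4, np.nan]})

# combine_first fills df1's missing cells from df2 and keeps the union of columns
cols = pd.unique(np.concatenate([df1.columns, df2.columns]))
out = df1.combine_first(df2).reindex(columns=cols)
print(out)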
