Drop items with all NaN values from a pandas multi-indexed dataframe - python

I'm having some trouble wrangling a dataframe that looks something like this:
           value
year name
2015 bob    10.0
     cat     NaN
2016 bob     NaN
     cat     NaN
I want to drop those items where all the values for the same name are NaN. In this case the result should be this:
           value
year name
2015 bob    10.0
2016 bob     NaN
All the cat values were NaN so cat is gone. Since bob had one non-NaN value, it gets to stay.
Note that both the 2016 values were NaN in the input, but 2016 is still around in the output - because this rule only applies to the name column. Ideally I'd like to be able to provide which column this applies to as a parameter.
Is this even possible? How should I do this? I'm okay with reindexing/transposing/whatever if that's needed to get the job done (only if it's necessary though!).

You can use groupby with filter:
df.groupby(level='name').filter(lambda x: x.value.notnull().any())

           value
year name
2015 bob    10.0
2016 bob     NaN
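For reference, a minimal sketch that reproduces the example frame (index and column names taken from the question), so the snippets here can be run as-is:
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([[2015, 2016], ['bob', 'cat']],
                                 names=['year', 'name'])
df = pd.DataFrame({'value': [10.0, np.nan, np.nan, np.nan]}, index=idx)
filter drops every group for which the callable returns False, so any name whose values are all NaN disappears, while the year level is left untouched.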

In [208]: df.reset_index().sort_values('name').drop_duplicates(['value']).set_index(['year','name'])
Out[208]:
           value
year name
2015 bob    10.0
2016 bob     NaN
Note that this happens to work here because drop_duplicates keeps only the first NaN row; it would also drop repeated non-NaN values, so the groupby approach above is more robust.

You can use unstack, isnull, all, and stack:
df.unstack().loc[:, ~df.unstack().isnull().all()].stack(-1, dropna=False)
Or use notnull and any:
df.unstack().loc[:, df.unstack().notnull().any()].stack(-1, dropna=False)
Output:
           value
year name
2015 bob    10.0
2016 bob     NaN
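The question also asks to make the grouping column a parameter, and the groupby approach generalizes naturally. A small sketch (the helper name and signature are mine, not from the answers):
def drop_all_nan_groups(df, level, col='value'):
    # keep only the groups (on the given index level) that have at
    # least one non-NaN entry in the chosen column
    return df.groupby(level=level).filter(lambda g: g[col].notnull().any())

drop_all_nan_groups(df, level='name')  # same result as above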

Related

Replace values in specific rows from one DataFrame to another when certain columns have the same values

Unlike the other questions, I don't want to create a new column with the new values; I want to use the same column, just changing the old values to the new ones where they exist.
For a new column I would have:
import pandas as pd

df1 = pd.DataFrame(data={'Name': ['Carl', 'Steave', 'Julius', 'Marcus'],
                         'Work': ['Home', 'Street', 'Car', 'Airplane'],
                         'Year': ['2022', '2021', '2020', '2019'],
                         'Days': ['', 5, '', '']})
df2 = pd.DataFrame(data={'Name': ['Carl', 'Julius'],
                         'Work': ['Home', 'Car'],
                         'Days': [1, 2]})
df_merge = pd.merge(df1, df2, how='left', on=['Name', 'Work'], suffixes=('', '_'))
print(df_merge)

     Name      Work  Year Days  Days_
0    Carl      Home  2022         1.0
1  Steave    Street  2021    5    NaN
2  Julius       Car  2020         2.0
3  Marcus  Airplane  2019         NaN
But what I really want is exactly like this:
     Name      Work  Year Days
0    Carl      Home  2022    1
1  Steave    Street  2021    5
2  Julius       Car  2020    2
3  Marcus  Airplane  2019
How can I make such a union?
You can use combine_first, setting the empty strings to NaNs beforehand (the indexing at the end is to rearrange the columns to match the desired output):
df1.loc[df1["Days"] == "", "Days"] = float("NaN")
df1.combine_first(df1[["Name", "Work"]].merge(df2, "left"))[df1.columns.values]
This outputs:
     Name      Work  Year Days
0    Carl      Home  2022  1.0
1  Steave    Street  2021    5
2  Julius       Car  2020  2.0
3  Marcus  Airplane  2019  NaN
You can use the update method of Series:
df1.Days.update(pd.merge(df1, df2, how='left', on=['Name','Work']).Days_y)
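For reference, a minimal end-to-end sketch of the update approach (df1 and df2 as defined in the question). Series.update aligns on the index and skips NaNs, so this relies on the merge result keeping df1's default 0..n-1 index; the updated cells come back as floats because of the NaNs the left merge introduces:
merged = pd.merge(df1, df2, how='left', on=['Name', 'Work'])
df1.Days.update(merged.Days_y)  # Days_y is df2's Days after the suffix clash
print(df1)
#      Name      Work  Year Days
# 0    Carl      Home  2022  1.0
# 1  Steave    Street  2021    5
# 2  Julius       Car  2020  2.0
# 3  Marcus  Airplane  2019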

How can I use groupby to merge rows in Pandas?

I have a dataframe that looks like this:
ID  Name  Major1   Major2    Major3
12  Dave  English  NaN       NaN
12  Dave  NaN      Biology   NaN
12  Dave  NaN      NaN       History
13  Nate  Spanish  NaN       NaN
13  Nate  NaN      Business  NaN
I need to merge rows resulting in this:
ID  Name  Major1   Major2    Major3
12  Dave  English  Biology   History
13  Nate  Spanish  Business  NaN
I know this is possible with groupby but I haven't been able to get it to work correctly. Can anyone help?
If you are intent on using groupby, you could do something like this:
dataframe = dataframe.melt(['ID', 'Name']).dropna()
dataframe = dataframe.groupby(['ID', 'Name', 'variable'])['value'].sum().unstack('variable')
(After the dropna each group holds a single string, so sum simply returns it; 'first' would work just as well.) You may have to mess with the column names a bit, but this is what comes to me as a possible solution using groupby.
Use melt and pivot:
>>> df.melt(['ID', 'Name']).dropna() \
       .pivot(index=['ID', 'Name'], columns='variable', values='value') \
       .reset_index().rename_axis(columns=None)
   ID  Name   Major1    Major2   Major3
0  12  Dave  English   Biology  History
1  13  Nate  Spanish  Business      NaN
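For reference, a sketch of the input frame so the snippets above can be run as-is (the NaNs are assumed to be real missing values, not strings):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [12, 12, 12, 13, 13],
                   'Name': ['Dave', 'Dave', 'Dave', 'Nate', 'Nate'],
                   'Major1': ['English', np.nan, np.nan, 'Spanish', np.nan],
                   'Major2': [np.nan, 'Biology', np.nan, np.nan, 'Business'],
                   'Major3': [np.nan, np.nan, 'History', np.nan, np.nan]})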

How to spread the data in pandas?

I'm working on an equivalent of R's spread in pandas. My dataframe looks like below:
Name  age  Language  year  Period
Nik    18   English  2018  Beginner
John   19    French  2019  Intermediate
Kane   33   Russian  2017  Advanced
xi     44      Thai  2015  Beginner
and I am looking for output like this:
Name  age  Language  Beginner  Intermediate  Advanced
Nik    18   English      2018
John   19    French                    2019
Kane   33   Russian                            2017
xi     44      Thai      2015
My code:
pd.pivot(x1, values='year', columns=['Period'])
I'm getting only the columns Beginner, Intermediate, Advanced, not the entire dataframe. While reshaping I tried using an index, but it complains that there can be no duplicates in the index. So I created a new index column, but I'm still not getting the entire dataframe.
If I understood correctly you could do something like this:
import numpy as np
import pandas as pd

# create dummy columns (one 0/1 column per Period value)
res = pd.get_dummies(df['Period']).astype(np.int64)
# overwrite each row's single 1 with that row's year
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# concat and drop the consumed columns
output = pd.concat((df.drop(['year', 'Period'], axis=1), res), axis=1)
print(output)
Output
   Name  age Language  Advanced  Beginner  Intermediate
0   Nik   18  English         0      2018             0
1  John   19   French         0         0          2019
2  Kane   33  Russian      2017         0             0
3    xi   44     Thai         0      2015             0
If you want to match the exact same output, convert the column to categorical first, and specify the order:
# encode as categorical, in the desired column order
df['Period'] = pd.Categorical(df['Period'], ['Beginner', 'Intermediate', 'Advanced'], ordered=True)
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# concat and drop columns
output = pd.concat((df.drop(['year', 'Period'], axis=1), res), axis=1)
print(output)
Output
   Name  age Language  Beginner  Intermediate  Advanced
0   Nik   18  English      2018             0         0
1  John   19   French         0          2019         0
2  Kane   33  Russian         0             0      2017
3    xi   44     Thai      2015             0         0
Finally, if you want to replace the 0s with missing values, add a third step:
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# replace the 0 fillers with NaN
res = res.replace(0, np.nan)
Output (with missing values)
   Name  age Language  Beginner  Intermediate  Advanced
0   Nik   18  English    2018.0           NaN       NaN
1  John   19   French       NaN        2019.0       NaN
2  Kane   33  Russian       NaN           NaN    2017.0
3    xi   44     Thai    2015.0           NaN       NaN
One way to get the equivalent of R's spread function is pd.pivot_table. If you don't mind about the index, you can use reset_index() on the newly created df:
new_df = (pd.pivot_table(df, index=['Name', 'age', 'Language'], columns='Period', values='year', aggfunc='sum')).reset_index()
which will get you:
Period  Name  age Language  Advanced  Beginner  Intermediate
0       John   19   French       NaN       NaN        2019.0
1       Kane   33  Russian    2017.0       NaN           NaN
2        Nik   18  English       NaN    2018.0           NaN
3         xi   44     Thai       NaN    2015.0           NaN
EDIT
If you have many columns in your dataframe and you want to include them in the reshaped dataset:
Grab the columns consumed by the pivot table (i.e. Period and year) in one list
Grab all the other columns of your dataframe in a second list (using not in)
Use index_cols as the index in the pd.pivot_table() command
non_index_cols = ['Period', 'year']  # the two columns consumed by the pivot table
index_cols = [i for i in df.columns if i not in non_index_cols]  # all the rest
new_df = (pd.pivot_table(df, index=index_cols, columns='Period', values='year', aggfunc='sum')).reset_index()
The new_df will include all the columns of your initial dataframe.
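To get even closer to the layout in the question, you can additionally drop the leftover columns-axis name and blank out the missing values (purely cosmetic, and it turns the year columns into object dtype):
new_df = new_df.rename_axis(columns=None).fillna('')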

How to add a word to the end of each string in a specific column (pandas dataframe)

I want to add "NSW" to the end of each town name in a pandas dataframe. The dataframe currently looks like this:
0 Parkes NaN
1 Forbes NaN
2 Yanco NaN
3 Orange NaN
4 Narara NaN
5 Wyong NaN
I need every town to also have the word NSW added to it.
Try with:
df['Name'] = df['Name'] + ' NSW'
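A quick demonstration (a minimal sketch; the column name Name is assumed from the answer, and any NaN entries stay NaN because string concatenation propagates missing values):
import pandas as pd

df = pd.DataFrame({'Name': ['Parkes', 'Forbes', 'Yanco', 'Orange', 'Narara', 'Wyong']})
df['Name'] = df['Name'] + ' NSW'
print(df.head(3))
#          Name
# 0  Parkes NSW
# 1  Forbes NSW
# 2   Yanco NSW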

Chained conditional count in Pandas

I have a dataframe that looks at how a form has been filled out. Here's an example:
ID  Name    Postcode  Street        Employer  Salary
1   John    NaN       Craven Road   NaN       NaN
2   Sue     TD2       NAN           NaN       15000
3   Jimmy   MW6       Blake Street  Bank      40000
4   Laura   QE2       Mill Lane     NaN       20000
5   Sam     NW2       Duke Avenue   Farms     35000
6   Jordan  SE6       NaN           NaN       NaN
7   NaN     CB2       NaN           Startup   NaN
I want to return a count of successively filled out columns, on the condition that all previous columns have been filled. The final output should look something like:
Name  Postcode  Street  Employer  Salary
   6         5       3         2       2
Is there a good Pandas way of doing this? I suppose there could be a way of applying a mask, so that if any previous boolean is zero the current column is also zero, and then counting that, but I'm not sure if that is the best way.
Thanks!
I think you can use notnull and cummin:
In [99]: df.notnull().cummin(axis=1).sum(axis=0)
Out[99]:
Name        6
Postcode    5
Street      3
Employer    2
Salary      2
dtype: int64
Although note that I had to replace your NAN (Sue's street) with a float NaN before I did that, and I assumed that ID was your index.
The cumulative minimum is one way to implement "applying a mask so that if any previous boolean is given as zero the current column is also zero", as you predicted would work.
Maybe cumprod. BTW, you have the literal string 'NAN' in your df, so it counts as notnull here:
df.notnull().cumprod(axis=1).sum()
Out[59]:
ID          7
Name        6
Postcode    5
Street      4
Employer    2
Salary      2
dtype: int64
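The two answers disagree on Street (3 vs 4) only because of that literal 'NAN' string, which notnull treats as a real value. Normalizing it first makes them agree; a small sketch:
import numpy as np

df = df.replace('NAN', np.nan)  # turn the literal string into a real missing value
df.notnull().cumprod(axis=1).sum()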
