Split Pandas DataFrame on Blank rows - python

I have a large dataframe that I need to split on empty rows.
Here's a simplified example of the DataFrame:
A B C
0 1 0 International
1 1 1 International
2 NaN 2 International
3 1 3 International
4 1 4 International
5 8 0 North American
6 8 1 North American
7 8 2 North American
8 8 3 North American
9 NaN NaN NaN
10 1 0 Internal
11 1 1 Internal
12 6 0 East
13 6 1 East
14 6 2 East
...
As you can see, row 9 is blank. What I need to do is take rows 0 through 8 and put them in a separate dataframe, and likewise rows 10 through the next blank row, so that I end up with several dataframes. Note that when looking for blank rows, the whole row needs to be blank.
Here is the code I'm using to find blanks:
def find_breaks(df):
    df_breaks = df[(df.loc[:, ['A', 'B', 'C']].isnull()).any(axis=1)]
    print(df_breaks.index)
This code works when I test it on the simplified DF, but of course my real DataFrame has many more columns than ['A','B','C'].
How can I find the next blank row (or as I am doing above, all the blank rows at once) without having to specify my column names?
Thanks

IIUC, use DataFrame.isnull + np.split:
import numpy as np

df_list = np.split(df, df[df.isnull().all(axis=1)].index)
for part in df_list:
    print(part, '\n')
A B C
0 1.0 0.0 International
1 1.0 1.0 International
2 NaN 2.0 International
3 1.0 3.0 International
4 1.0 4.0 International
5 8.0 0.0 North American
6 8.0 1.0 North American
7 8.0 2.0 North American
8 8.0 3.0 North American
A B C
9 NaN NaN NaN
10 1.0 0.0 Internal
11 1.0 1.0 Internal
12 6.0 0.0 East
13 6.0 1.0 East
14 6.0 2.0 East
First, obtain the indices where the entire row is null, and then use that to split your dataframe into chunks. np.split handles dataframes quite well.
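If the all-NaN separator rows themselves shouldn't end up in any chunk, a variant sketch (using a small made-up frame) splits positionally and drops them; positions rather than labels are used so it also works with a non-default index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, np.nan, 2, 2],
                   'B': [0, 1, np.nan, 0, 1]})

# positional locations of rows that are entirely null
blank_pos = np.flatnonzero(df.isnull().all(axis=1).to_numpy())

# slice between the blank rows, dropping the all-NaN separators themselves
bounds = [0, *blank_pos, len(df)]
chunks = [df.iloc[s:e].dropna(how='all')
          for s, e in zip(bounds, bounds[1:])]
```

With no blank rows at all, this degrades gracefully to a single chunk containing the whole frame.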

Related

How to replace all non-NaN entries of a dataframe with a Series?

I know that you can use pandas.DataFrame.fillna to replace all null values with a Series, but is there an easy way to replace all non-null values with a Series?
Alternatively, I have seen df.loc[~df.isnull()] = value for replacing all non-null values with a single value, but again, is there a way to pass in a Series?
Edit: For example, say I have a DataFrame df with a column called code which contains some null values and some non-null values. Then say I have a Series called new_code which contains only non-null values. I want to replace all of the non-null values in df['code'] with the values of new_code (where the number of non-null values is equal to the length of new_code).
You can do it as below. Since you have not provided a df, I am using my own (input and output shown below); f is the Series that gets assigned.
a = df.loc[~df['Age'].isnull()]        # the non-null rows
b = df.loc[~df['Age'].isnull()].index  # their index labels
f = pd.Series([i for i in range(1, 12)], index=b)
df.loc[~df['Age'].isnull(), ['Age']] = f
Input
Country Age
0 USA NaN
1 EU 15.0
2 China 35.0
3 USA 45.0
4 EU NaN
5 China NaN
6 USA 28.0
7 EU 26.0
8 China 78.0
9 USA 65.0
10 EU 53.0
11 China 66.0
12 USA 32.0
13 EU NaN
14 China 14.0
f
1 1
2 2
3 3
6 4
7 5
8 6
9 7
10 8
11 9
12 10
14 11
Output
Country Age
0 USA NaN
1 EU 1.0
2 China 2.0
3 USA 3.0
4 EU NaN
5 China NaN
6 USA 4.0
7 EU 5.0
8 China 6.0
9 USA 7.0
10 EU 8.0
11 China 9.0
12 USA 10.0
13 EU NaN
14 China 11.0
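A more general sketch of the same idea, with a made-up df and new_code: build a mask of the non-null entries and assign the Series' values positionally, so no index alignment takes place (this assumes len(new_code) equals the number of non-null entries):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'code': [np.nan, 'a', 'b', np.nan, 'c']})
new_code = pd.Series(['x', 'y', 'z'])  # one value per non-null entry

mask = df['code'].notnull()
# .to_numpy() strips new_code's index, so values land positionally
df.loc[mask, 'code'] = new_code.to_numpy()
```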

Setting subset of a pandas DataFrame by a DataFrame

I feel like this question has been asked a millions times before, but I just can't seem to get it to work or find a SO-post answering my question.
So I am selecting a subset of a pandas DataFrame and want to change these values individually.
I am subselecting my DataFrame like this:
df.loc[df[key].isnull(), [keys]]
which works perfectly. If I try and set all values to the same value such as
df.loc[df[key].isnull(), [keys]] = 5
it works as well. But if I try and set it to a DataFrame it does not, however no error is produced either.
So for example I have a DataFrame:
data = [['Alex',10,0,0,2],['Bob',12,0,0,1],['Clarke',13,0,0,4],['Dennis',64,2],['Jennifer',56,1],['Tom',95,5],['Ellen',42,2],['Heather',31,3]]
df1 = pd.DataFrame(data,columns=['Name','Age','Amount_of_cars','cars_per_year','some_other_value'])
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.0 2.0
1 Bob 12 0 0.0 1.0
2 Clarke 13 0 0.0 4.0
3 Dennis 64 2 NaN NaN
4 Jennifer 56 1 NaN NaN
5 Tom 95 5 NaN NaN
6 Ellen 42 2 NaN NaN
7 Heather 31 3 NaN NaN
and a second DataFrame:
data = [[2/64,5],[1/56,1],[5/95,7],[2/42,5],[3/31,7]]
df2 = pd.DataFrame(data,columns=['cars_per_year','some_other_value'])
cars_per_year some_other_value
0 0.031250 5
1 0.017857 1
2 0.052632 7
3 0.047619 5
4 0.096774 7
and I would like to replace those nans with the second DataFrame
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2
Unfortunately this does not work as the index does not match. So how do I ignore the index, when setting values?
Any help would be appreciated. Sorry if this has been posted before.
This is possible only if the number of missing values equals the number of rows in df2; in that case, assign an array to prevent index alignment:
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
print (df1)
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.000000 2.0
1 Bob 12 0 0.000000 1.0
2 Clarke 13 0 0.000000 4.0
3 Dennis 64 2 0.031250 5.0
4 Jennifer 56 1 0.017857 1.0
5 Tom 95 5 0.052632 7.0
6 Ellen 42 2 0.047619 5.0
7 Heather 31 3 0.096774 7.0
If not, you get an error like:
# 4 rows assigned to 5 rows
data = [[2/64,5],[1/56,1],[5/95,7],[2/42,5]]
df2 = pd.DataFrame(data,columns=['cars_per_year','some_other_value'])
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
ValueError: shape mismatch: value array of shape (4,) could not be broadcast to indexing result of shape (5,)
Another idea is to set the index of df2 to the index of the filtered rows in df1:
df2 = df2.set_index(df1.index[df1['cars_per_year'].isnull()])
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2
print (df1)
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.000000 2.0
1 Bob 12 0 0.000000 1.0
2 Clarke 13 0 0.000000 4.0
3 Dennis 64 2 0.031250 5.0
4 Jennifer 56 1 0.017857 1.0
5 Tom 95 5 0.052632 7.0
6 Ellen 42 2 0.047619 5.0
7 Heather 31 3 0.096774 7.0
Just add .values (or .to_numpy() if using pandas 0.24+):
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.000000 2.0
1 Bob 12 0 0.000000 1.0
2 Clarke 13 0 0.000000 4.0
3 Dennis 64 2 0.031250 5.0
4 Jennifer 56 1 0.017857 1.0
5 Tom 95 5 0.052632 7.0
6 Ellen 42 2 0.047619 5.0
7 Heather 31 3 0.096774 7.0

How to merge only a specific data frame column in pandas?

I've been trying to use the pd.merge function properly but I either receive an error or get the table formatted in a way I don't like. I looked through the documentation but I can't find a way to only merge a specific column. For instance lets say I'm working with these two dataframes.
df_1 = county_name accidents pedestrians
ADAMS 1 2
ALLEGHENY 1 3
ARMSTRONG 3 4
BEDFORD 1 1
df_2 = county_name population
ADAMS 102336
ALLEGHENY 1223048
ARMSTRONG 65642
BEDFORD 166140
BERKS 48480
BLAIR 417854
BRADFORD 123457
BUCKS 60853
CAMBRIA 628341
The outcome I'm looking for is something like this, where the county names are added to the 'county_name' column but not duplicated, and the 'population' column is left off.
df_outcome = county_name accidents pedestrians
ADAMS 1 2
ALLEGHENY 1 3
ARMSTRONG 3 4
BEDFORD 1 1
BERKS NaN NaN
BLAIR NaN NaN
BRADFORD NaN NaN
BUCKS NaN NaN
CAMBRIA NaN NaN
Lastly, I plan to use df_outcome.fillna(0) to replace all the Nan values with zero.
Filter the county_name column and use merge with a left join:
df = df_2[['county_name']].merge(df_1, how='left')
print (df)
county_name accidents pedestrians
0 ADAMS 1.0 2.0
1 ALLEGHENY 1.0 3.0
2 ARMSTRONG 3.0 4.0
3 BEDFORD 1.0 1.0
4 BERKS NaN NaN
5 BLAIR NaN NaN
6 BRADFORD NaN NaN
7 BUCKS NaN NaN
8 CAMBRIA NaN NaN
Try:
df = pd.merge(df_1, df_2[['county_name']], how='right')
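Putting either answer together with the fillna(0) step mentioned in the question, a runnable sketch with trimmed made-up frames:

```python
import pandas as pd

df_1 = pd.DataFrame({'county_name': ['ADAMS', 'ALLEGHENY'],
                     'accidents': [1, 1],
                     'pedestrians': [2, 3]})
df_2 = pd.DataFrame({'county_name': ['ADAMS', 'ALLEGHENY', 'BERKS'],
                     'population': [102336, 1223048, 48480]})

# keep only the key column from df_2 so 'population' is left off,
# then left-join onto it and zero-fill the missing counts
df_outcome = (df_2[['county_name']]
              .merge(df_1, on='county_name', how='left')
              .fillna(0))
```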

Why does merging two DataFrames by a common column yield an empty result?

I am processing two DataFrame objects with data from a survey and I cannot merge them correctly. The structures look like this:
In [93]: numeric_answers
Out[93]:
ANSWER_COUNT RESPONSE
1 50 1
2 21 2
4 3 4
In [94]: readable_values
Out[94]:
MEANING
RESPONSE
1 male
2 female
3 transgender
5 non-binary, genderqueer, or gender non-conforming
6 a different identity (please specify)
4 prefer not to disclose
-9 Not answered
My objective is to:
merge them using the RESPONSE column
resulting in a DataFrame with the columns ['RESPONSE', 'MEANING', 'ANSWER_COUNT']
with absent values set to N/A (though 0 would work too)
An example of desired output:
RESPONSE MEANING ANSWER_COUNT
1 male 50
2 female 21
3 transgender NaN
5 non-binary, genderqueer, or gender non-conforming NaN
6 a different identity (please specify) NaN
4 prefer not to disclose 3
-9 Not answered NaN
Having read the documentation for merge I concluded that what I need is pd.merge(readable_values, numeric_answers), but this operation produces an empty result:
Empty DataFrame
Columns: [RESPONSE, MEANING, ANSWER_COUNT]
Index: []
Having experimented with various arguments I got a somewhat promising result with merge(readable_values, numeric_answers, on='RESPONSE', how='outer'):
(Pdb) pd.merge(readable_values, numeric_answers, on='RESPONSE', how='outer')
RESPONSE MEANING ANSWER_COUNT
0 1.0 male NaN
1 2.0 female NaN
2 3.0 transgender NaN
3 5.0 non-binary, genderqueer, or gender non-conforming NaN
4 6.0 a different identity (please specify) NaN
5 4.0 prefer not to disclose NaN
6 -9.0 Not answered NaN
7 1.0 NaN 50.0
8 2.0 NaN 21.0
9 4.0 NaN 3.0
However, it merges by appending values, whereas I need it to intersect the entries using the RESPONSE column. What is the idiomatic way to achieve this with Pandas?
readable_values has RESPONSE as the index, rather than as a column.
You can do the merge as:
In [11]: numeric_answers.merge(readable_values, left_on='RESPONSE', right_index=True, how='outer')
Out[11]:
ANSWER_COUNT RESPONSE MEANING
1 50.0 1 male
2 21.0 2 female
4 3.0 4 prefer not to disclose
4 NaN 3 transgender
4 NaN 5 non-binary, genderqueer, or gender non-conforming
4 NaN 6 a different identity (please specify)
4 NaN -9 Not answered
An alternative is to reset_index on readable_values first:
In [12]: numeric_answers.merge(readable_values.reset_index(), on='RESPONSE', how='outer')
Out[12]:
ANSWER_COUNT RESPONSE MEANING
0 50.0 1 male
1 21.0 2 female
2 3.0 4 prefer not to disclose
3 NaN 3 transgender
4 NaN 5 non-binary, genderqueer, or gender non-conforming
5 NaN 6 a different identity (please specify)
6 NaN -9 Not answered
Note the distinction which you can see in how they're rendered:
In [21]: readable_values
Out[21]:
MEANING
RESPONSE
1 male
2 female
3 transgender
5 non-binary, genderqueer, or gender non-conforming
6 a different identity (please specify)
4 prefer not to disclose
-9 Not answered
In [22]: readable_values.reset_index() # RESPONSE is now a column
Out[22]:
RESPONSE MEANING
0 1 male
1 2 female
2 3 transgender
3 5 non-binary, genderqueer, or gender non-conforming
4 6 a different identity (please specify)
5 4 prefer not to disclose
6 -9 Not answered
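A self-contained sketch of the reset_index route, recreating trimmed versions of the two frames (values abbreviated from the question):

```python
import pandas as pd

numeric_answers = pd.DataFrame({'ANSWER_COUNT': [50, 21, 3],
                                'RESPONSE': [1, 2, 4]})
readable_values = pd.DataFrame(
    {'RESPONSE': [1, 2, 3, 4],
     'MEANING': ['male', 'female', 'transgender', 'prefer not to disclose']}
).set_index('RESPONSE')  # RESPONSE is the index, as in the question

# reset_index turns RESPONSE back into a column so both frames share the key
merged = numeric_answers.merge(readable_values.reset_index(),
                               on='RESPONSE', how='outer')
```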

How to process excel file headers using pandas/python

I am trying to read https://www.whatdotheyknow.com/request/193811/response/480664/attach/3/GCSE%20IGCSE%20results%20v3.xlsx using pandas.
Having saved it, my script is:
import sys
import pandas as pd

inputfile = sys.argv[1]
xl = pd.ExcelFile(inputfile)
# print(xl.sheet_names)
df = xl.parse(xl.sheet_names[0])
print(df.head())
However this does not seem to process the headers properly as it gives
GCSE and IGCSE1 results2,3 in selected subjects4 of pupils at the end of key stage 4 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 Year: 2010/11 (Final) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Coverage: England NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1. Includes International GCSE, Cambridge Inte... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2. Includes attempts and achievements by these... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
All of this should be treated as comments.
If you load the spreadsheet into libreoffice, for example, you can see that the column headings are correctly parsed and appear in row 15 with drop down menus to let you select the items you want.
How can you get pandas to automatically detect where the column headers are just as libreoffice does?
pandas is (are?) processing the file correctly, and exactly the way you asked it (them?) to. You didn't specify a header value, which means that it defaults to picking up the column names from the 0th row. The first few rows of cells aren't comments in some fundamental way, they're just not cells you're interested in.
Simply tell parse you want to skip some rows:
>>> xl = pd.ExcelFile("GCSE IGCSE results v3.xlsx")
>>> df = xl.parse(xl.sheet_names[0], skiprows=14)
>>> df.columns
Index([u'Local Authority Number', u'Local Authority Name', u'Local Authority Establishment Number', u'Unique Reference Number', u'School Name', u'Town', u'Number of pupils at the end of key stage 4', u'Number of pupils attempting a GCSE or an IGCSE', u'Number of students achieving 8 or more GCSE or IGCSE passes at A*-G', u'Number of students achieving 8 or more GCSE or IGCSE passes at A*-A', u'Number of students achieving 5 A*-A grades or more at GCSE or IGCSE'], dtype='object')
>>> df.head()
Local Authority Number Local Authority Name \
0 201 City of london
1 201 City of london
2 202 Camden
3 202 Camden
4 202 Camden
Local Authority Establishment Number Unique Reference Number \
0 2016005 100001
1 2016007 100003
2 2024104 100049
3 2024166 100050
4 2024196 100051
School Name Town \
0 City of London School for Girls London
1 City of London School London
2 Haverstock School London
3 Parliament Hill School London
4 Regent High School London
Number of pupils at the end of key stage 4 \
0 105
1 140
2 200
3 172
4 174
Number of pupils attempting a GCSE or an IGCSE \
0 104
1 140
2 194
3 169
4 171
Number of students achieving 8 or more GCSE or IGCSE passes at A*-G \
0 100
1 108
2 SUPP
3 22
4 0
Number of students achieving 8 or more GCSE or IGCSE passes at A*-A \
0 87
1 75
2 0
3 7
4 0
Number of students achieving 5 A*-A grades or more at GCSE or IGCSE
0 100
1 123
2 0
3 34
4 SUPP
[5 rows x 11 columns]
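pandas has no built-in header auto-detection like libreoffice's, but a crude heuristic sketch can locate a likely header row to feed to skiprows; the 0.8 threshold and the stand-in frame below are assumptions, not part of the original answer:

```python
import numpy as np
import pandas as pd

def detect_header_row(raw, min_frac=0.8):
    """Return the first row whose fraction of non-null cells is >= min_frac."""
    frac = raw.notna().mean(axis=1)
    hits = frac[frac >= min_frac]
    return int(hits.index[0]) if len(hits) else 0

# stand-in for raw = pd.read_excel(inputfile, header=None):
raw = pd.DataFrame([['Title of the sheet', np.nan, np.nan],
                    [np.nan, np.nan, np.nan],
                    ['col_a', 'col_b', 'col_c'],
                    [1, 2, 3]])
header_row = detect_header_row(raw)
# then e.g. df = pd.read_excel(inputfile, skiprows=header_row)
```

This is only a heuristic: a sheet whose preamble rows are densely populated would need a different rule.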
