I am trying to read https://www.whatdotheyknow.com/request/193811/response/480664/attach/3/GCSE%20IGCSE%20results%20v3.xlsx using pandas.
Having saved it, my script is:
import sys
import pandas as pd
inputfile = sys.argv[1]
xl = pd.ExcelFile(inputfile)
# print(xl.sheet_names)
df = xl.parse(xl.sheet_names[0])
print(df.head())
However, this does not seem to process the headers properly, as it gives:
GCSE and IGCSE1 results2,3 in selected subjects4 of pupils at the end of key stage 4 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 Year: 2010/11 (Final) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Coverage: England NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1. Includes International GCSE, Cambridge Inte... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2. Includes attempts and achievements by these... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
All of this should be treated as comments.
If you load the spreadsheet into LibreOffice, for example, you can see that the column headings are parsed correctly: they appear in row 15, with drop-down menus that let you select the items you want.
How can you get pandas to automatically detect where the column headers are, just as LibreOffice does?
pandas is processing the file correctly, and exactly the way you asked it to. You didn't specify a header value, which means it defaults to picking up the column names from the 0th row. The first few rows of cells aren't comments in any fundamental way; they're just not cells you're interested in.
Simply tell parse you want to skip some rows:
>>> xl = pd.ExcelFile("GCSE IGCSE results v3.xlsx")
>>> df = xl.parse(xl.sheet_names[0], skiprows=14)
>>> df.columns
Index(['Local Authority Number', 'Local Authority Name', 'Local Authority Establishment Number', 'Unique Reference Number', 'School Name', 'Town', 'Number of pupils at the end of key stage 4', 'Number of pupils attempting a GCSE or an IGCSE', 'Number of students achieving 8 or more GCSE or IGCSE passes at A*-G', 'Number of students achieving 8 or more GCSE or IGCSE passes at A*-A', 'Number of students achieving 5 A*-A grades or more at GCSE or IGCSE'], dtype='object')
>>> df.head()
Local Authority Number Local Authority Name \
0 201 City of london
1 201 City of london
2 202 Camden
3 202 Camden
4 202 Camden
Local Authority Establishment Number Unique Reference Number \
0 2016005 100001
1 2016007 100003
2 2024104 100049
3 2024166 100050
4 2024196 100051
School Name Town \
0 City of London School for Girls London
1 City of London School London
2 Haverstock School London
3 Parliament Hill School London
4 Regent High School London
Number of pupils at the end of key stage 4 \
0 105
1 140
2 200
3 172
4 174
Number of pupils attempting a GCSE or an IGCSE \
0 104
1 140
2 194
3 169
4 171
Number of students achieving 8 or more GCSE or IGCSE passes at A*-G \
0 100
1 108
2 SUPP
3 22
4 0
Number of students achieving 8 or more GCSE or IGCSE passes at A*-A \
0 87
1 75
2 0
3 7
4 0
Number of students achieving 5 A*-A grades or more at GCSE or IGCSE
0 100
1 123
2 0
3 34
4 SUPP
[5 rows x 11 columns]
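Equivalently, you can point pandas straight at the header row rather than skipping the rows above it; a minimal sketch, assuming the headers sit in the 15th row (index 14) as in this file:
import pandas as pd

# header=14 makes the 15th row the column names; everything above it is discarded
df = pd.read_excel("GCSE IGCSE results v3.xlsx", header=14)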
I am struggling with the following issue.
My DF is:
df = pd.DataFrame(
    [
        ['7890-1', '12345N', 'John', 'Intermediate'],
        ['7890-4', '30909N', 'Greg', 'Intermediate'],
        ['3300-1', '88117N', 'Mark', 'Advanced'],
        ['2502-2', '90288N', 'Olivia', 'Elementary'],
        ['7890-2', '22345N', 'Joe', 'Intermediate'],
        ['7890-3', '72245N', 'Ana', 'Elementary']
    ],
    columns=['Id', 'Code', 'Person', 'Level'])
print(df)
I would like to get a result like this:
     Id  Code 1 Person 1       Level 1  Code 2 Person 2       Level 2  Code 3 Person 3     Level 3  Code 4 Person 4       Level 4
0  7890  12345N     John  Intermediate  22345N      Joe  Intermediate  72245N      Ana  Elementary  30909N     Greg  Intermediate
1  3300  88117N     Mark      Advanced     NaN      NaN           NaN     NaN      NaN         NaN     NaN      NaN           NaN
2  2502     NaN      NaN           NaN  90288N   Olivia    Elementary     NaN      NaN         NaN     NaN      NaN           NaN
I'd start with the same approach as @Andrej Kesely, but then sort by index after unstacking and map ' '.join over the column names.
df[["Id", "No"]] = df["Id"].str.split("-", expand=True)
df_wide = df.set_index(["Id", "No"]).unstack(level=1).sort_index(axis=1,level=1)
df_wide.columns = df_wide.columns.map(' '.join)
Output
Code 1 Level 1 Person 1 Code 2 Level 2 Person 2 Code 3 \
Id
2502 NaN NaN NaN 90288N Elementary Olivia NaN
3300 88117N Advanced Mark NaN NaN NaN NaN
7890 12345N Intermediate John 22345N Intermediate Joe 72245N
Level 3 Person 3 Code 4 Level 4 Person 4
Id
2502 NaN NaN NaN NaN NaN
3300 NaN NaN NaN NaN NaN
7890 Elementary Ana 30909N Intermediate Greg
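If you want the frame to match the target exactly, with Id as an ordinary column and a default 0, 1, 2 index, add a reset_index at the end; a small follow-up sketch reusing df_wide from above:
# Turn the Id index level back into an ordinary column
df_wide = df_wide.reset_index()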
Try:
df[["Id", "Id2"]] = df["Id"].str.split("-", expand=True)
x = df.set_index(["Id", "Id2"]).unstack(level=1)
x.columns = [f"{a} {b}" for a, b in x.columns]
print(
x[sorted(x.columns, key=lambda k: int(k.split()[-1]))]
.reset_index()
.to_markdown()
)
Prints:
|    |   Id | Code 1   | Person 1   | Level 1      | Code 2   | Person 2   | Level 2      | Code 3   | Person 3   | Level 3    | Code 4   | Person 4   | Level 4      |
|---:|-----:|:---------|:-----------|:-------------|:---------|:-----------|:-------------|:---------|:-----------|:-----------|:---------|:-----------|:-------------|
|  0 | 2502 | nan      | nan        | nan          | 90288N   | Olivia     | Elementary   | nan      | nan        | nan        | nan      | nan        | nan          |
|  1 | 3300 | 88117N   | Mark       | Advanced     | nan      | nan        | nan          | nan      | nan        | nan        | nan      | nan        | nan          |
|  2 | 7890 | 12345N   | John       | Intermediate | 22345N   | Joe        | Intermediate | 72245N   | Ana        | Elementary | 30909N   | Greg       | Intermediate |
import pymysql
import pandas as pd
import numpy

conn = pymysql.connect(host="localhost", port=3306, db="school",
                       user="root", password="#mit123")
print("Connection established successfully")
cursor = conn.cursor()
sql = "SELECT * FROM records"
cursor.execute(sql)
result = cursor.fetchall()
data = result
df = pd.DataFrame(data)
df1 = df.T
print(df)
print(df1)
df2 = pd.DataFrame(df1, index=["id", "name", "rollno.", "city"])
print(df2)
The following is the output. What could be causing the problem? Can't I transpose a DataFrame into another DataFrame?
Connection established successfully
0 1 2 3 4
0 1 amit 1 92 jorhat
1 2 subham 2 93 jorhat
2 3 ram 3 89 surat
3 4 anil 4 91 delhi
4 5 abdul 5 81 bhopal
5 6 joseph 6 90 sikkim
6 7 Ben 7 94 indore
7 8 tom 8 99 goa
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7 8
1 amit subham ram anil abdul joseph Ben tom
2 1 2 3 4 5 6 7 8
3 92 93 89 91 81 90 94 99
4 jorhat jorhat surat delhi bhopal sikkim indore goa
0 1 2 3 4 5 6 7
id NaN NaN NaN NaN NaN NaN NaN NaN
name NaN NaN NaN NaN NaN NaN NaN NaN
rollno. NaN NaN NaN NaN NaN NaN NaN NaN
city NaN NaN NaN NaN NaN NaN NaN NaN
Process finished with exit code 0
This is my SQL table:
Also, when I use an index in the DataFrame, it raises a shape error:
Shape of passed values is (5, 8), indices imply (4, 8)
I could reproduce the NaN issue using my own database, so I think the reason is that the DataFrame has no column names.
So you can do the following:
import pymysql
import pandas as pd

conn = pymysql.connect(host="localhost",
                       port=3306,
                       db="school",
                       user="root",
                       password="#mit123")
print("Connection established successfully")

sql = "SELECT * FROM records"
# read_sql keeps the table's column names, so the transpose gets a labelled index
df = pd.read_sql(sql=sql, con=conn)
df1 = df.T
print(df)
print(df1)
df2 = pd.DataFrame(df1, index=["id", "name", "roll_number", "city"])
print(df2)
This solves the NaN error.
The shape error may be due to the fact that you are not passing the "percentage" column to the index, but I was unable to reproduce this error.
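For the shape error itself, a small sketch of my guess: the transposed frame has five rows (id, name, rollno, the marks/percentage column, city), so the index you pass must supply five labels, one per row. The label names here are hypothetical:
# Five labels for five transposed rows avoids
# "Shape of passed values is (5, 8), indices imply (4, 8)"
df2 = pd.DataFrame(df1.values,
                   index=["id", "name", "roll_number", "percentage", "city"])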
I am using tabula-py to extract a table from a pdf document like this:
rows = tabula.read_pdf('bank_statement.pdf', pandas_options={"header":[0, 1, 2, 3, 4, 5]}, pages='all', stream=True, lattice=True)
rows
This gives an output like so:
[ 0
0 Customer Statement\rxxxxxxx\rP...
1 Print Date: April 12, 2020Address: 41 BAALE ST...
2 Period: January 1, 2020 April 12, 2020Openin...,
0
0 Customer Statement\xxxxxxxx\rP...
1 Print Date: April 12, 2020Address: 41 gg ST...,
0 1 2 3 4 5 \
0 03Jan2020 0 03Jan2020 NaN 50,000.00 52,064.00
1 10Jan2020 0 10Jan2020 25,000.00 NaN 27,064.00
2 10Jan2020 0 10Jan2020 25.00 NaN 27,039.00
3 10Jan2020 0 10Jan2020 1.25 NaN 27,037.75
4 20Jan2020 999921... 20Jan2020 10,000.00 NaN 17,037.75
5 23Jan2020 999984... 23Jan2020 4,050.00 NaN 12,987.75
6 23Jan2020 0 23Jan2020 1,000.00 NaN 11,987.75
7 24Jan2020 0 24Jan2020 2,000.00 NaN 9,987.75
8 24Jan2020 0 24Jan2020 NaN 30,000.00 39,987.75
6
0 TRANSFER BETWEEN\rCUSTOMERS Via GG from\r...
1 NS Instant Payment Outward\r000013200110121...
2 COMMISSION\r0000132001101218050000326...\rNIP ...
3 VALUE ADDED TAX VAT ON NIP\rTRANSFER FOR 00001
4 CASH WITHDRAWAL FROM\rOTHER ATM 210674 4420...
5 POS/WEB PURCHASE\rTRANSACTION 845061\r80405...
6 Airtime Purchase MBANKING\r101CT0000000001551...
7 Airtime Purchase MBANKING\r101CT0000000001552...
8 TRANSFER BETWEEN\rCUSTOMERS\r00001520012412113... ,
What I want from this pdf starts from index 2. So I run
rows[2]
And I get a dataframe that looks like this:
Now I want the tables from index 2 through to the last one. I did
rows[2:]
But I am getting a list, not the expected DataFrame:
[ 0 1 2 3 4 5 \
0 03Jan2020 0 03Jan2020 NaN 50,000.00 52,064.00
1 10Jan2020 0 10Jan2020 25,000.00 NaN 27,064.00
2 10Jan2020 0 10Jan2020 25.00 NaN 27,039.00
3 10Jan2020 0 10Jan2020 1.25 NaN 27,037.75
4 20Jan2020 999921... 20Jan2020 10,000.00 NaN 17,037.75
5 23Jan2020 999984... 23Jan2020 4,050.00 NaN 12,987.75
6 23Jan2020 0 23Jan2020 1,000.00 NaN 11,987.75
7 24Jan2020 0 24Jan2020 2,000.00 NaN 9,987.75
8 24Jan2020 0 24Jan2020 NaN 30,000.00 39,987.75
6
0 TRANSFER BETWEEN\rCUSTOMERS Via gg from\r...
1 bi Instant Payment Outward\r000013200110121...
2 COMMISSION\r0000132001101218050000326...\rNIP ...
3 VALUE ADDED TAX VAT ON NIP\rTRANSFER FOR 00001
4 CASH WITHDRAWAL FROM\rOTHER ATM 210674 4420...
5 POS/WEB PURCHASE\rTRANSACTION 845061\r80405...
Please, how do I solve this? I need a single DataFrame for indexes from 2 onwards.
You are getting this behaviour because rows is a list and slicing a list produces another list. When you access an element at a specific index, you get the object at that index; in this case, a DataFrame object.
The pandas library ships with a concat function that can combine multiple DataFrame objects into one, which I believe is what you want to do, such that you have:
import pandas as pd
df_combo = pd.concat([rows[2], rows[3], rows[4], rows[5] ...])
Even better:
df_combo = pd.concat(rows[2:])
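If you also want a fresh row index for the combined frame (the per-table indexes repeat otherwise), concat can renumber; a small sketch:
# ignore_index=True replaces the repeating per-table labels with a fresh 0..n-1 index
df_combo = pd.concat(rows[2:], ignore_index=True)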
Take a look at https://medium.com/analytics-vidhya/how-to-extract-multiple-tables-from-a-pdf-through-python-and-tabula-py-6f642a9ee673
The best way to go about what you're trying to achieve is to read the table, return the response as JSON, and loop through the JSON objects to build your lists.
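A minimal sketch of that JSON route, assuming the same bank_statement.pdf as above; the output_format='json' option and the cell structure below follow tabula-py's JSON output as I understand it:
import tabula

# Each entry describes one extracted table; its "data" field holds rows,
# and each cell is a dict with a "text" key.
tables = tabula.read_pdf('bank_statement.pdf', pages='all', lattice=True,
                         output_format='json')
for table in tables:
    for row in table['data']:
        print([cell['text'] for cell in row])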
I am working with the MovieLens dataset. Basically, there are two files: a .csv file which contains movies and another .csv file which contains ratings given by n users to specific movies.
I did the following in order to get the average rating for each movie in the DF.
ratings_data.groupby('movieId').rating.mean()
However, with that code I get 9724 movies, whereas the main DataFrame has 9742 movies.
I think there are movies that are not rated at all. Since I want to add the ratings to the main movies dataset, how can I put NaN in the fields that have no ratings?
Use Series.reindex with the unique movieId values from the other column; to keep the same order, add Series.sort_values:
movies_data = pd.read_csv('ml-latest-small/movies.csv')
ratings_data = pd.read_csv('ml-latest-small/ratings.csv')
mov = movies_data['movieId'].sort_values().drop_duplicates()
df = ratings_data.groupby('movieId').rating.mean().reindex(mov).reset_index()
print(df)
movieId rating
0 1 3.920930
1 2 3.431818
2 3 3.259615
3 4 2.357143
4 5 3.071429
... ...
9737 193581 4.000000
9738 193583 3.500000
9739 193585 3.500000
9740 193587 3.500000
9741 193609 4.000000
[9742 rows x 2 columns]
df1 = df[df['rating'].isna()]
print(df1)
movieId rating
816 1076 NaN
2211 2939 NaN
2499 3338 NaN
2587 3456 NaN
3118 4194 NaN
4037 5721 NaN
4506 6668 NaN
4598 6849 NaN
4704 7020 NaN
5020 7792 NaN
5293 8765 NaN
5421 25855 NaN
5452 26085 NaN
5749 30892 NaN
5824 32160 NaN
5837 32371 NaN
5957 34482 NaN
7565 85565 NaN
EDIT:
If you need the ratings as a new column in the movies_data DataFrame, use DataFrame.merge with a left join:
movies_data = pd.read_csv('ml-latest-small/movies.csv')
ratings_data = pd.read_csv('ml-latest-small/ratings.csv')
df = ratings_data.groupby('movieId', as_index=False).rating.mean()
print(df)
movieId rating
0 1 3.920930
1 2 3.431818
2 3 3.259615
3 4 2.357143
4 5 3.071429
... ...
9719 193581 4.000000
9720 193583 3.500000
9721 193585 3.500000
9722 193587 3.500000
9723 193609 4.000000
[9724 rows x 2 columns]
df = movies_data.merge(df, on='movieId', how='left')
print(df)
movieId title \
0 1 Toy Story (1995)
1 2 Jumanji (1995)
2 3 Grumpier Old Men (1995)
3 4 Waiting to Exhale (1995)
4 5 Father of the Bride Part II (1995)
... ...
9737 193581 Black Butler: Book of the Atlantic (2017)
9738 193583 No Game No Life: Zero (2017)
9739 193585 Flint (2017)
9740 193587 Bungo Stray Dogs: Dead Apple (2018)
9741 193609 Andrew Dice Clay: Dice Rules (1991)
genres rating
0 Adventure|Animation|Children|Comedy|Fantasy 3.920930
1 Adventure|Children|Fantasy 3.431818
2 Comedy|Romance 3.259615
3 Comedy|Drama|Romance 2.357143
4 Comedy 3.071429
... ...
9737 Action|Animation|Comedy|Fantasy 4.000000
9738 Animation|Comedy|Fantasy 3.500000
9739 Drama 3.500000
9740 Action|Animation 3.500000
9741 Comedy 4.000000
[9742 rows x 4 columns]
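An alternative to the merge step, sketched with Series.map and reusing the movies_data and ratings_data frames from above; movies with no ratings come out as NaN automatically:
# Look up each movie's mean rating; movieIds absent from the means map to NaN
means = ratings_data.groupby('movieId')['rating'].mean()
movies_data['rating'] = movies_data['movieId'].map(means)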
I have a pandas dataframe with a column named 'City, State, Country'. I want to separate this column into three new columns, 'City', 'State' and 'Country'.
0 HUN
1 ESP
2 GBR
3 ESP
4 FRA
5 ID, USA
6 GA, USA
7 Hoboken, NJ, USA
8 NJ, USA
9 AUS
Splitting the column into three columns is trivial enough:
location_df = df['City, State, Country'].apply(lambda x: pd.Series(x.split(',')))
However, this creates left-aligned data:
0 1 2
0 HUN NaN NaN
1 ESP NaN NaN
2 GBR NaN NaN
3 ESP NaN NaN
4 FRA NaN NaN
5 ID USA NaN
6 GA USA NaN
7 Hoboken NJ USA
8 NJ USA NaN
9 AUS NaN NaN
How would one go about creating the new columns with the data right-aligned? Would I need to iterate through every row, count the number of commas and handle the contents individually?
I'd do something like the following:
foo = lambda x: pd.Series([i for i in reversed(x.split(','))])
rev = df['City, State, Country'].apply(foo)
print(rev)
0 1 2
0 HUN NaN NaN
1 ESP NaN NaN
2 GBR NaN NaN
3 ESP NaN NaN
4 FRA NaN NaN
5 USA ID NaN
6 USA GA NaN
7 USA NJ Hoboken
8 USA NJ NaN
9 AUS NaN NaN
I think that gets you what you want but if you also want to pretty things up and get a City, State, Country column order, you could add the following:
rev.rename(columns={0: 'Country', 1: 'State', 2: 'City'}, inplace=True)
rev = rev[['City', 'State', 'Country']]
print(rev)
City State Country
0 NaN NaN HUN
1 NaN NaN ESP
2 NaN NaN GBR
3 NaN NaN ESP
4 NaN NaN FRA
5 NaN ID USA
6 NaN GA USA
7 Hoboken NJ USA
8 NaN NJ USA
9 NaN NaN AUS
Assuming the column is named target:
df[["City", "State", "Country"]] = df["target"].str.split(pat=",", expand=True)
Note that this splits left-aligned, so a row like "HUN" lands in City rather than Country; reverse the parts first (as in the accepted approach) if you need right-alignment.
Since you are dealing with strings, I would suggest this amendment to your current code, i.e.:
location_df = df['City, State, Country'].apply(lambda x: pd.Series(str(x).split(',')))
I got mine to work by testing it on one of the columns, but give this one a try.
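For a more recent pandas idiom, the same reversed-split idea can be written with the str accessor; a minimal sketch, assuming the 'City, State, Country' column name from the question:
import pandas as pd

# Reverse each row's comma-separated parts so Country always lands first,
# then expand the lists into separate columns.
rev = (
    df['City, State, Country']
    .str.split(',')
    .map(lambda parts: [p.strip() for p in reversed(parts)])
    .apply(pd.Series)
)
rev.columns = ['Country', 'State', 'City'][: rev.shape[1]]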