Pandas Dataframe: split column into multiple columns, right-align inconsistent cell entries - python

I have a pandas dataframe with a column named 'City, State, Country'. I want to separate this column into three new columns: 'City', 'State' and 'Country'.
0 HUN
1 ESP
2 GBR
3 ESP
4 FRA
5 ID, USA
6 GA, USA
7 Hoboken, NJ, USA
8 NJ, USA
9 AUS
Splitting the column into three columns is trivial enough:
location_df = df['City, State, Country'].apply(lambda x: pd.Series(x.split(',')))
However, this creates left-aligned data:
0 1 2
0 HUN NaN NaN
1 ESP NaN NaN
2 GBR NaN NaN
3 ESP NaN NaN
4 FRA NaN NaN
5 ID USA NaN
6 GA USA NaN
7 Hoboken NJ USA
8 NJ USA NaN
9 AUS NaN NaN
How would one go about creating the new columns with the data right-aligned? Would I need to iterate through every row, count the number of commas and handle the contents individually?

I'd do something like the following:
foo = lambda x: pd.Series([i for i in reversed(x.split(','))])
rev = df['City, State, Country'].apply(foo)
print(rev)
0 1 2
0 HUN NaN NaN
1 ESP NaN NaN
2 GBR NaN NaN
3 ESP NaN NaN
4 FRA NaN NaN
5 USA ID NaN
6 USA GA NaN
7 USA NJ Hoboken
8 USA NJ NaN
9 AUS NaN NaN
I think that gets you what you want but if you also want to pretty things up and get a City, State, Country column order, you could add the following:
rev.rename(columns={0:'Country',1:'State',2:'City'},inplace=True)
rev = rev[['City','State','Country']]
print(rev)
City State Country
0 NaN NaN HUN
1 NaN NaN ESP
2 NaN NaN GBR
3 NaN NaN ESP
4 NaN NaN FRA
5 NaN ID USA
6 NaN GA USA
7 Hoboken NJ USA
8 NaN NJ USA
9 NaN NaN AUS
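Put together as a runnable sketch (rebuilding the question's sample column as a hypothetical frame, and stripping the stray spaces that `split(',')` leaves behind):

```python
import pandas as pd

# Sample data from the question, rebuilt as a hypothetical frame
df = pd.DataFrame({'City, State, Country': [
    'HUN', 'ESP', 'GBR', 'ESP', 'FRA',
    'ID, USA', 'GA, USA', 'Hoboken, NJ, USA', 'NJ, USA', 'AUS']})

# Reverse each split so the last field (country) always lands in column 0,
# then rename and reorder into City / State / Country
rev = df['City, State, Country'].apply(
    lambda x: pd.Series(list(reversed([p.strip() for p in x.split(',')]))))
rev = rev.rename(columns={0: 'Country', 1: 'State', 2: 'City'})
rev = rev[['City', 'State', 'Country']]
print(rev)
```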

Assume the column is named target:
df[["City", "State", "Country"]] = df["target"].str.split(pat=",", expand=True)
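One caveat worth noting: with `expand=True` the pieces stay left-aligned, so this one-liner only produces the right columns when every row actually has all three parts. A small illustration with a hypothetical `target` column:

```python
import pandas as pd

# Hypothetical frame where every row has all three parts
df = pd.DataFrame({'target': ['Hoboken, NJ, USA', 'Bilbao, BI, ESP']})
df[['City', 'State', 'Country']] = df['target'].str.split(pat=',', expand=True)
# A bare 'HUN' row would have landed in City, not Country
```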

Since you are dealing with strings, I would suggest a small amendment to your current code, i.e.
location_df = df['City, State, Country'].apply(lambda x: pd.Series(str(x).split(',')))
Note the single brackets: with double brackets you select a one-column DataFrame, so the lambda would receive whole columns instead of individual cells. I got mine to work by testing one of the columns, but give this one a try.

Related

Transform DataFrame in Pandas

I am struggling with the following issue.
My DF is:
df = pd.DataFrame(
    [
        ['7890-1', '12345N', 'John', 'Intermediate'],
        ['7890-4', '30909N', 'Greg', 'Intermediate'],
        ['3300-1', '88117N', 'Mark', 'Advanced'],
        ['2502-2', '90288N', 'Olivia', 'Elementary'],
        ['7890-2', '22345N', 'Joe', 'Intermediate'],
        ['7890-3', '72245N', 'Ana', 'Elementary']
    ],
    columns=['Id', 'Code', 'Person', 'Level'])
print(df)
I would like to get such a result:
|    |   Id | Code 1 | Person 1 | Level 1      | Code 2 | Person 2 | Level 2      | Code 3 | Person 3 | Level 3    | Code 4 | Person 4 | Level 4      |
|---:|-----:|:-------|:---------|:-------------|:-------|:---------|:-------------|:-------|:---------|:-----------|:-------|:---------|:-------------|
|  0 | 7890 | 12345N | John     | Intermediate | 22345N | Joe      | Intermediate | 72245N | Ana      | Elementary | 30909N | Greg     | Intermediate |
|  1 | 3300 | 88117N | Mark     | Advanced     | NaN    | NaN      | NaN          | NaN    | NaN      | NaN        | NaN    | NaN      | NaN          |
|  2 | 2502 | NaN    | NaN      | NaN          | 90288N | Olivia   | Elementary   | NaN    | NaN      | NaN        | NaN    | NaN      | NaN          |
I'd start with the same approach as @Andrej Kesely, but then sort by index after unstacking and map over the column names with ' '.join.
df[["Id", "No"]] = df["Id"].str.split("-", expand=True)
df_wide = df.set_index(["Id", "No"]).unstack(level=1).sort_index(axis=1,level=1)
df_wide.columns = df_wide.columns.map(' '.join)
Output
Code 1 Level 1 Person 1 Code 2 Level 2 Person 2 Code 3 \
Id
2502 NaN NaN NaN 90288N Elementary Olivia NaN
3300 88117N Advanced Mark NaN NaN NaN NaN
7890 12345N Intermediate John 22345N Intermediate Joe 72245N
Level 3 Person 3 Code 4 Level 4 Person 4
Id
2502 NaN NaN NaN NaN NaN
3300 NaN NaN NaN NaN NaN
7890 Elementary Ana 30909N Intermediate Greg
Try:
df[["Id", "Id2"]] = df["Id"].str.split("-", expand=True)
x = df.set_index(["Id", "Id2"]).unstack(level=1)
x.columns = [f"{a} {b}" for a, b in x.columns]
print(
    x[sorted(x.columns, key=lambda k: int(k.split()[-1]))]
    .reset_index()
    .to_markdown()
)
Prints:
|    |   Id | Code 1 | Person 1 | Level 1      | Code 2 | Person 2 | Level 2      | Code 3 | Person 3 | Level 3    | Code 4 | Person 4 | Level 4      |
|---:|-----:|:-------|:---------|:-------------|:-------|:---------|:-------------|:-------|:---------|:-----------|:-------|:---------|:-------------|
|  0 | 2502 | nan    | nan      | nan          | 90288N | Olivia   | Elementary   | nan    | nan      | nan        | nan    | nan      | nan          |
|  1 | 3300 | 88117N | Mark     | Advanced     | nan    | nan      | nan          | nan    | nan      | nan        | nan    | nan      | nan          |
|  2 | 7890 | 12345N | John     | Intermediate | 22345N | Joe      | Intermediate | 72245N | Ana      | Elementary | 30909N | Greg     | Intermediate |
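Both answers boil down to the same reshape: split the suffix off Id, pivot it into the columns, then flatten the resulting MultiIndex. A self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['7890-1', '12345N', 'John', 'Intermediate'],
        ['7890-4', '30909N', 'Greg', 'Intermediate'],
        ['3300-1', '88117N', 'Mark', 'Advanced'],
        ['2502-2', '90288N', 'Olivia', 'Elementary'],
        ['7890-2', '22345N', 'Joe', 'Intermediate'],
        ['7890-3', '72245N', 'Ana', 'Elementary'],
    ],
    columns=['Id', 'Code', 'Person', 'Level'])

# Split '7890-1' into Id='7890', No='1'; the suffix becomes the column group
df[['Id', 'No']] = df['Id'].str.split('-', expand=True)
wide = df.set_index(['Id', 'No']).unstack(level=1).sort_index(axis=1, level=1)
# Flatten the (field, No) MultiIndex into 'Code 1', 'Person 1', ...
wide.columns = [f'{a} {b}' for a, b in wide.columns]
print(wide)
```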

How to join or merge in python

We have two Dataframes.
We want all the columns from Dataframe1, but just one column (target_name) from Dataframe2.
However, when we do this, it gives us duplicate values.
Dataframe1 values:
user_id subject_id x y w h g
0 858580 23224814 58.133331 57.466675 181.000000 42.000000 1
1 858580 23224814 293.133331 176.466675 80.000000 34.000000 2
2 313344 28539152 834.049316 37.493195 63.005920 36.444595 1
3 313344 28539152 104.003235 45.072937 242.956024 26.754082 2
4 313344 28539152 635.436829 80.038574 108.716065 35.240089 3
5 313344 28539152 351.910156 80.162117 201.371887 32.738373 4
6 861687 28539165 125.313393 39.836521 231.202873 43.087811 1
7 861687 28539165 623.450500 44.040207 151.332825 34.680435 2
8 1254304 28539165 128.893204 45.765110 225.686691 35.547726 1
Dataframe2 Values:
Unnamed: 0 user_id subject_id good x y w h T0 T1 T2 T3 T4 T5 T6 T7 T8 target_name target_name_length target_name3
0 0 858580 23224814 1 58.133331 57.466675 181.000000 42.000000 NaN 1801 No, there are still more names to be marked Male 1881 John Abbott NaN NaN NaN John Abbott 11 John Abbott
1 1 858580 23224814 1 293.133331 176.466675 80.000000 34.000000 NaN NaN Yes, I've marked all the names Female NaN NaN Edith Joynt Edith Abbot NaN Edith Joynt 11 Edith Joynt
2 2 340348 30629031 1 152.968750 26.000000 224.000000 41.000000 NaN 1852 No, there are still more names to be marked Male 1924 William Sparrow NaN NaN NaN William Sparrow 15 William Sparrow
3 3 340348 30629031 1 497.968750 325.000000 87.000000 29.000000 NaN NaN Yes, I've marked all the names Female NaN NaN Minnie NaN NaN Minnie 6 Minnie
4 4 340348 28613182 1 103.968750 31.000000 162.000000 38.000000 NaN 1819 No, there are still more names to be marked Male 1876 Albert [unclear]Gles[/unclear] NaN NaN NaN Albert Gles 30 Albert Gles
5 5 340348 28613182 1 107.968750 76.000000 72.000000 25.000000 NaN 1819 Yes, I've marked all the names Female 1884 NaN Eliza [unclear]Gles[/unclear] NaN NaN Eliza Gles 29 Eliza Gles
6 6 340348 30628864 1 172.968750 29.000000 192.000000 41.000000 NaN 1840 No, there are still more names to be marked Male 1918 John Slaltery NaN NaN NaN John Slaltery 13 John Slaltery
7 7 340348 30628864 1 115.968750 214.000000 149.000000 31.000000 NaN NaN No, there are still more names to be marked Male NaN [unclear]P.[/unclear] Slaltery NaN NaN NaN P. Slaltery 30 unclear]P. Slaltery
8 8 340348 30628864 1 537.968750 218.000000 64.000000 26.000000 NaN NaN Yes, I've marked all the names Female 1901 NaN Elizabeth Slaltery NaN NaN Elizabeth Slaltery 18 Elizabeth Slaltery
Here is the code we are trying to use:
If you want to blindly add the target_name column to dataframe1, then:
dataframe1['target_name'] = dataframe2['target_name']
Just make sure that both dataframes have the same number of rows and are sorted by a common column, e.g. user_id, which is found in both dataframes. Note that plain column assignment aligns on the index rather than on row order, so the indexes need to match as well.
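A merge keyed on columns is usually safer than positional assignment. Assuming the coordinate columns x and y (together with user_id and subject_id) identify a row uniquely, a sketch with hypothetical miniature frames:

```python
import pandas as pd

# Hypothetical miniature versions of the two frames from the question
dataframe1 = pd.DataFrame({
    'user_id': [858580, 858580],
    'subject_id': [23224814, 23224814],
    'x': [58.13, 293.13],
    'y': [57.47, 176.47]})
dataframe2 = pd.DataFrame({
    'user_id': [858580, 858580],
    'subject_id': [23224814, 23224814],
    'x': [58.13, 293.13],
    'y': [57.47, 176.47],
    'target_name': ['John Abbott', 'Edith Joynt']})

# Left join keeps every dataframe1 row exactly once and pulls in target_name
merged = dataframe1.merge(
    dataframe2[['user_id', 'subject_id', 'x', 'y', 'target_name']],
    on=['user_id', 'subject_id', 'x', 'y'], how='left')
print(merged)
```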

Extract column value based on another column, reading multiple files

I would like to extract values based on the labels Name, Grade, School and Class.
For example, to find Name and Grade, I would go through column 0 to find the label, then pick up the value from the next few columns; the value is scattered (to be extracted) across those columns. The same goes for School and Class.
Refer to this: extract column value based on another column pandas dataframe
I have multiple files:
0 1 2 3 4 5 6 7 8
0 nan nan nan Student Registration nan nan
1 Name: nan nan John nan nan nan nan nan
2 Grade: nan 6 nan nan nan nan nan nan
3 nan nan nan School: C College nan Class: 1A
0 1 2 3 4 5 6 7 8 9
0 nan nan nan Student Registration nan nan nan
1 nan nan nan nan nan nan nan nan nan nan
2 Name: Mary nan nan nan nan nan nan nan nan
3 Grade: 7 nan nan nan nan nan nan nan nan
4 nan nan nan School: nan D College Class: nan 5A
This is my code: (Error)
for file in files:
    df = pd.read_csv(file, header=0)
    df['Name'] = df.loc[df[0].isin('Name')[1,2,3]
    df['Grade'] = df.loc[df[0].isin('Grade')[1,2,3]
    df['School'] = df.loc[df[3].isin('School')[4,5]
    df['Class'] = df.loc[df[7].isin('Class')[8,9]
    d.append(df)
df = pd.concat(d, ignore_index=True)
This is the outcome I want: (Melt Function)
Name Grade School Class ... .... ... ...
John 6 C College 1A
John 6 C College 1A
John 6 C College 1A
John 6 C College 1A
Mary 7 D College 5A
Mary 7 D College 5A
Mary 7 D College 5A
Mary 7 D College 5A
I think here is possible use:
for file in files:
    df = pd.read_csv(file, header=0)
    # skip the first row and reshape - remove NaNs, convert to one-column df
    df = df.iloc[1:].stack().reset_index(drop=True).to_frame('data')
    # labels are the entries ending with ':'
    m = df['data'].str.endswith(':', na=False)
    # shift each label's value into a new column
    df['val'] = df['data'].shift(-1)
    # keep only the labels and transpose into one record per file
    df = df[m].set_index('data').T.rename_axis(None, axis=1)
    d.append(df)
df = pd.concat(d, ignore_index=True)
EDIT:
You can use:
for file in files:
    # if the input is Excel, use read_excel instead of read_csv
    df = pd.read_excel(file, header=None)
    df['Name'] = df.loc[df[0].eq('Name:'), [1, 2, 3]].dropna(axis=1).squeeze()
    df['Grade'] = df.loc[df[0].eq('Grade:'), [1, 2, 3]].dropna(axis=1).squeeze()
    df['School'] = df.loc[df[3].eq('School:'), [4, 5]].dropna(axis=1).squeeze()
    df['Class'] = df.loc[df[6].eq('Class:'), [7, 8]].dropna(axis=1).squeeze()
    print(df)
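The first (stack-based) answer, run against the question's first file rebuilt as a hypothetical raw frame:

```python
import numpy as np
import pandas as pd

# First file from the question, as a hypothetical headerless frame
raw = pd.DataFrame([
    [np.nan, np.nan, np.nan, 'Student Registration', np.nan, np.nan, np.nan, np.nan],
    ['Name:', np.nan, np.nan, 'John', np.nan, np.nan, np.nan, np.nan],
    ['Grade:', np.nan, 6, np.nan, np.nan, np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan, 'School:', 'C College', np.nan, 'Class:', '1A'],
])

# Drop the title row, flatten to one NaN-free column, pair each label with
# the value that follows it, then transpose into a single record
s = raw.iloc[1:].stack().reset_index(drop=True).to_frame('data')
m = s['data'].str.endswith(':', na=False)
s['val'] = s['data'].shift(-1)
record = s[m].set_index('data').T.rename_axis(None, axis=1)
print(record)
```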

How to merge only a specific data frame column in pandas?

I've been trying to use the pd.merge function properly but I either receive an error or get the table formatted in a way I don't like. I looked through the documentation but I can't find a way to only merge a specific column. For instance lets say I'm working with these two dataframes.
df_1 = county_name accidents pedestrians
ADAMS 1 2
ALLEGHENY 1 3
ARMSTRONG 3 4
BEDFORD 1 1
df_2 = county_name population
ADAMS 102336
ALLEGHENY 1223048
ARMSTRONG 65642
BEDFORD 166140
BERKS 48480
BLAIR 417854
BRADFORD 123457
BUCKS 60853
CAMBRIA 628341
The outcome im looking for is something like this. Where the county names are added to the 'county_name' column but not duplicated and the 'population' column is left off.
df_outcome = county_name accidents pedestrians
ADAMS 1 2
ALLEGHENY 1 3
ARMSTRONG 3 4
BEDFORD 1 1
BERKS NaN NaN
BLAIR NaN NaN
BRADFORD NaN NaN
BUCKS NaN NaN
CAMBRIA NaN NaN
Lastly, I plan to use df_outcome.fillna(0) to replace all the NaN values with zero.
Filter column county_name and use merge with left join:
df = df_2[['county_name']].merge(df_1, how='left')
print (df)
county_name accidents pedestrians
0 ADAMS 1.0 2.0
1 ALLEGHENY 1.0 3.0
2 ARMSTRONG 3.0 4.0
3 BEDFORD 1.0 1.0
4 BERKS NaN NaN
5 BLAIR NaN NaN
6 BRADFORD NaN NaN
7 BUCKS NaN NaN
8 CAMBRIA NaN NaN
Try:
df = pd.merge(df_2[['county_name']], df_1, how='left')
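Both answers reduce to a left join of df_1 onto df_2's county list. Runnable with abbreviated versions of the question's frames:

```python
import pandas as pd

df_1 = pd.DataFrame({
    'county_name': ['ADAMS', 'ALLEGHENY', 'ARMSTRONG', 'BEDFORD'],
    'accidents': [1, 1, 3, 1],
    'pedestrians': [2, 3, 4, 1]})
df_2 = pd.DataFrame({
    'county_name': ['ADAMS', 'ALLEGHENY', 'ARMSTRONG', 'BEDFORD', 'BERKS', 'BLAIR'],
    'population': [102336, 1223048, 65642, 166140, 48480, 417854]})

# Keep only the county list from df_2, left-join df_1 onto it,
# then zero-fill the counties df_1 doesn't cover
df_outcome = df_2[['county_name']].merge(df_1, how='left').fillna(0)
print(df_outcome)
```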

How to process excel file headers using pandas/python

I am trying to read https://www.whatdotheyknow.com/request/193811/response/480664/attach/3/GCSE%20IGCSE%20results%20v3.xlsx using pandas.
Having saved it, my script is
import sys
import pandas as pd

inputfile = sys.argv[1]
xl = pd.ExcelFile(inputfile)
# print(xl.sheet_names)
df = xl.parse(xl.sheet_names[0])
print(df.head())
However, this does not seem to process the headers properly, as it gives
GCSE and IGCSE1 results2,3 in selected subjects4 of pupils at the end of key stage 4 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 Year: 2010/11 (Final) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Coverage: England NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1. Includes International GCSE, Cambridge Inte... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2. Includes attempts and achievements by these... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
All of this should be treated as comments.
If you load the spreadsheet into libreoffice, for example, you can see that the column headings are correctly parsed and appear in row 15 with drop down menus to let you select the items you want.
How can you get pandas to automatically detect where the column headers are just as libreoffice does?
pandas is processing the file correctly, and exactly the way you asked it to. You didn't specify a header value, which means it defaults to picking up the column names from the 0th row. The first few rows of cells aren't comments in any fundamental way; they're just cells you're not interested in.
Simply tell parse you want to skip some rows:
>>> xl = pd.ExcelFile("GCSE IGCSE results v3.xlsx")
>>> df = xl.parse(xl.sheet_names[0], skiprows=14)
>>> df.columns
Index([u'Local Authority Number', u'Local Authority Name', u'Local Authority Establishment Number', u'Unique Reference Number', u'School Name', u'Town', u'Number of pupils at the end of key stage 4', u'Number of pupils attempting a GCSE or an IGCSE', u'Number of students achieving 8 or more GCSE or IGCSE passes at A*-G', u'Number of students achieving 8 or more GCSE or IGCSE passes at A*-A', u'Number of students achieving 5 A*-A grades or more at GCSE or IGCSE'], dtype='object')
>>> df.head()
Local Authority Number Local Authority Name \
0 201 City of london
1 201 City of london
2 202 Camden
3 202 Camden
4 202 Camden
Local Authority Establishment Number Unique Reference Number \
0 2016005 100001
1 2016007 100003
2 2024104 100049
3 2024166 100050
4 2024196 100051
School Name Town \
0 City of London School for Girls London
1 City of London School London
2 Haverstock School London
3 Parliament Hill School London
4 Regent High School London
Number of pupils at the end of key stage 4 \
0 105
1 140
2 200
3 172
4 174
Number of pupils attempting a GCSE or an IGCSE \
0 104
1 140
2 194
3 169
4 171
Number of students achieving 8 or more GCSE or IGCSE passes at A*-G \
0 100
1 108
2 SUPP
3 22
4 0
Number of students achieving 8 or more GCSE or IGCSE passes at A*-A \
0 87
1 75
2 0
3 7
4 0
Number of students achieving 5 A*-A grades or more at GCSE or IGCSE
0 100
1 123
2 0
3 34
4 SUPP
[5 rows x 11 columns]
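The same skiprows idea can be demonstrated with an in-memory CSV (read_excel takes the identical parameter; the preamble lines and row count here are illustrative stand-ins for the spreadsheet's 14 comment rows):

```python
import io

import pandas as pd

# Two preamble lines standing in for the spreadsheet's comment rows
text = """Title line
Coverage: England
School Name,Town
City of London School,London
Haverstock School,London
"""
# skiprows=2 tells the parser the real header starts on the third line
df = pd.read_csv(io.StringIO(text), skiprows=2)
print(df.columns.tolist())
```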
