I am a beginner working with a clinical data set using Pandas in Jupyter Notebook.
A column of my data contains census tract codes and I am trying to merge my data with a large transportation data file that also has a column with census tract codes.
I initially only wanted 2 of the other columns from that transportation file so, after I downloaded the file, I removed all of the other columns except the 2 that I wanted to add to my file and the census tract column.
This is the code I used:
df_my_data = pd.read_excel("my_data.xlsx")
df_transportation_data = pd.read_excel("transportation_data.xlsx")
df_merged_file = pd.merge(df_my_data, df_transportation_data)
df_merged_file.to_excel('my_merged_file.xlsx', index = False)
This worked but then I wanted to add the other columns from the transportation file so I used my initial file (prior to adding the 2 transportation columns) and tried to merge the entire transportation file. This resulted in a new DataFrame with all of the desired columns but only 4 rows.
I thought maybe the transportation file is too big so I tried merging individual columns (other than the 2 I was initially able to merge) and this again results in all of the correct columns but only 4 rows merging.
Any help would be much appreciated.
Edits:
Sorry for not being more clear.
Here is the code for the 2 initial columns I merged:
import pandas as pd
df_my_data = pd.read_excel('my_data.xlsx')
df_two_columns = pd.read_excel('two_columns_from_transportation_file.xlsx')
df_two_columns_merged = pd.merge(df_my_data, df_two_columns, on=['census_tract'])
df_two_columns_merged.to_excel('two_columns_merged.xlsx', index = False)
The outputs were:
df_my_data.head()
census_tract id e t
0 6037408401 1 1 1092
1 6037700200 2 1 1517
2 6065042740 3 1 2796
3 6037231210 4 1 1
4 6059076201 5 1 41
df_two_columns.head()
census_tract households_with_no_vehicle vehicles_per_household
0 6001400100 2.16 2.08
1 6001400200 6.90 1.50
2 6001400300 17.33 1.38
3 6001400400 8.97 1.41
4 6001400500 11.59 1.39
df_two_columns_merged.head()
census_tract id e t households_with_no_vehicle vehicles_per_household
0 6037408401 1 1 1092 4.52 2.43
1 6037700200 2 1 1517 9.88 1.26
2 6065042740 3 1 2796 2.71 1.49
3 6037231210 4 1 1 25.75 1.35
4 6059076201 5 1 41 1.63 2.22
df_my_data has 657 rows and df_two_columns_merged came out with 657 rows.
The code for when I tried to merge the entire transport file:
import pandas as pd
df_my_data = pd.read_excel('my_data.xlsx')
df_transportation_data = pd.read_excel('transportation_data.xlsx')
df_merged_file = pd.merge(df_my_data, df_transportation_data, on=['census_tract'])
df_merged_file.to_excel('my_merged_file.xlsx', index = False)
The output:
df_transportation_data.head()
census_tract Bike Carpooled Drove Alone Households No Vehicle Public Transportation Walk Vehicles per Household
0 6001400100 0.00 12.60 65.95 2.16 20.69 0.76 2.08
1 6001400200 5.68 3.66 45.79 6.90 39.01 5.22 1.50
2 6001400300 7.55 6.61 46.77 17.33 31.19 6.39 1.38
3 6001400400 8.85 11.29 43.91 8.97 27.67 4.33 1.41
4 6001400500 8.45 7.45 46.94 11.59 29.56 4.49 1.39
df_merged_file.head()
census_tract id e t Bike Carpooled Drove Alone Households No Vehicle Public Transportation Walk Vehicles per Household
0 6041119100 18 0 2755 1.71 3.02 82.12 4.78 8.96 3.32 2.10
1 6061023100 74 1 1201 0.00 9.85 86.01 0.50 2.43 1.16 2.22
2 6041110100 80 1 9 0.30 4.40 72.89 6.47 13.15 7.89 1.82
3 6029004902 123 0 1873 0.00 18.38 78.69 4.12 0.00 0.00 2.40
The df_merged_file only has 4 total rows.
So my question is: why is it that I am able to merge those initial 2 columns from the transportation file and keep all of the rows from my file but when I try to merge the entire transportation file I only get 4 rows of output?
I recommend specifying merge type and merge column(s).
When you use pd.merge(), the default merge type is inner merge, and on the same named columns using:
df_merged_file = pd.merge(df_my_data, df_transportation_data, how='left', left_on=[COLUMN], right_on=[COLUMN])
It is possible that one of the columns you removed from the "transportation_data.xlsx" file previously is the same name as a column in your "my_data.xlsx", causing unmatched rows to be removed due to an inner merge.
A 'left' merge would allow the two columns you need from "transportation_data.xlsx" to attach to values in your "my_data.xlsx", but only where there is a match. This means your merged DataFrame will have the same number of rows as your "my_data.xlsx" has currently.
Well, I think there was something wrong with the initial download of the transportation file. I downloaded it again and this time I was able to get a complete merge. Sorry for being an idiot. Thank you all for your help.
I want to convert the html to csv using pandas functions
This is a part of what I read in the dataframe df
0 1
0 sequence 2
1 trainNo K805
2 trainNumber K805
3 departStation 鹰潭
4 departStationPy yingtan
5 arriveStation 南昌
6 arriveStationPy nanchang
7 departDate 2020-05-24
8 departTime 03:55
9 arriveDate 2020-05-24
10 arriveTime 05:44
11 isStartStation False
12 isEndStation False
13 runTime 1小时49分钟
14 preSaleTime NaN
15 takeDays 0
16 isBookable True
17 seatList seatNamepriceorderPriceinventoryisBookablebutt...
18 curSeatIndex 0
seatName price orderPrice inventory isBookable buttonDisplayName buttonType
0 硬座 23.5 23.5 99 True NaN 0
1 硬卧 69.5 69.5 99 True NaN 0
2 软卧 104.5 104.5 4 True NaN 0
0 1
0 departDate 2020-05-23
1 departStationList NaN
2 endStationList NaN
3 departStationFilterMap NaN
4 endStationFilterMap NaN
5 departCityName 上海
6 arriveCityName 南昌
7 gtMinPrice NaN
My code is like this
for i, df in enumerate(pd.read_html(html,encoding='utf-8')):
df.to_csv(".\other.csv",index=True,encoding='utf-8-sig')
To preserve the characters in csv, I need to use utf-8-sig encoding. But I don't know how to write the format symbol %
,0,1
0,departDate,2020-05-23
1,departStationList,
2,endStationList,
3,departStationFilterMap,
4,endStationFilterMap,
5,departCityName,上海
6,arriveCityName,南昌
7,gtMinPrice,
This is what I got in csv file, only the last part is preserved.
The dataframe is correct, while the csv need correction. Can you show me how to make the correct output?
you're saving each dataframe to the same file, so each is getting overwritten until the last one.
note the addition of the f-string to change the save file name e.g. f".\other_{i}.csv"
each dataframe is a different shape, so they won't all fit together properly
To CSV
for i, df in enumerate(pd.read_html(html,encoding='utf-8')):
df.to_csv(f".\other_{i}.csv", index=True, encoding='utf-8-sig')
To Excel
with pd.ExcelWriter('output.xlsx', mode='w') as writer:
for i, df in enumerate(pd.read_html(html,encoding='utf-8')):
df.to_excel(writer, sheet_name=f'Sheet{i}', encoding='utf-8-sig')
I am importing a csv file into python using pandas but the data frame is only in one column. I copied and pasted data from the comma-separated format from The Player Standing Field table at this link (second one) into an excel file and saved it as a csv (originally as ms-dos, then both as normal and utf-8 per recommendation by AllthingsGo42). But it only returned a single column data frame.
Examples of what I tried:
dataset=pd.read('MLB2016PlayerStats2.csv')
dataset=pd.read('MLB2016PlayerStats2.csv', delimiter=',')
dataset=pd.read_csv('MLB2016PlayerStats2.csv',encoding='ISO-8859-9',
delimiter=',')
The each line of code above all returned:
Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary
1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2...
2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1...
3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,...
4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,1...
5,Cristhian Adames\adamecr01,24,COL,NL,69,43,3...
Also tried:
dataset=pd.read_csv('MLB2016PlayerStats2.csv',encoding='ISO-8859-9',
delimiter=',',quoting=3)
Which returned:
"Rk Name Age Tm Lg G GS CG Inn Ch
\
0 "1 Fernando Abad\abadfe01 30 TOT AL 57 0 0 46.2 4
1 "2 Jose Abreu\abreujo02 29 CHW AL 152 152 150 1355.2 1337
2 "3 A.J. Achter\achteaj01 27 LAA AL 27 0 0 37.2 6
3 "4 Dustin Ackley\ackledu01 28 NYY AL 23 16 10 140.1 97
4 "5 Cristhian Adames\adamecr01 24 COL NL 69 43 38 415.0 212
E DP Fld% Rtot Rtot/yr Rdrs Rdrs/yr RF/9 RF/G \
0 ... 0 1 1.000 NaN NaN NaN NaN 0.77 0.07
1 ... 10 131 0.993 -2.0 -2.0 -5.0 -4.0 8.81 8.73
2 ... 0 0 1.000 NaN NaN 0.0 0.0 1.43 0.22
3 ... 0 8 1.000 1.0 9.0 3.0 27.0 6.22 4.22
4 ... 6 24 0.972 -4.0 -12.0 1.0 3.0 4.47 2.99
Pos Summary"
0 P"
1 1B"
2 P"
3 1B-OF-2B"
4 SS-2B-3B"
Below is what the data looks like in notepad++
"Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary"
"1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2,4,0,4,0,1,1.000,,,,,0.77,0.07,P"
"2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1355.2,1337,1243,84,10,131,.993,-2,-2,-5,-4,8.81,8.73,1B"
"3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,6,2,4,0,0,1.000,,,0,0,1.43,0.22,P"
"4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,140.1,97,89,8,0,8,1.000,1,9,3,27,6.22,4.22,1B-OF-2B"
"5,Cristhian Adames\adamecr01,24,COL,NL,69,43,38,415.0,212,68,138,6,24,.972,-4,-12,1,3,4.47,2.99,SS-2B-3B"
"6,Austin Adams\adamsau01,29,CLE,AL,19,0,0,18.1,1,0,0,1,0,.000,,,0,0,0.00,0.00,P"
Sorry for the confusion with my question before. I hope this edit will clear things up. Thank you to those that answered thus far.
Running it quickly myself, I was able to get what I am understanding is the desired output.
My only thought is that there i s no need to call out a delimiter for a csv, because a csv is a comma separated variable file, but that should not matter. I am thinking that there is something incorrect with your actual data file and I would go and make sure it is saved correctly. I would echo previous comments and make sure that the csv is a UTF-8, and not an MS-DOS or Macintosh (both options when saving in excel)
Best of luck!
There is no need to call for a delimiter for a csv. You only have to change the separator from ";" to ",". For this you can open your csv file with notepad and change them with the replace tool.