I have written a script which opens multiple tabs one by one and takes data from each. I am able to get the data from the page, but when writing it to a CSV file I get output like the following:
Bedrooms Bathrooms Super area Floor Status
3 See Dimensions 3 See Dimensions 2100 7 (Out of 23 Floors) 3 See Dimensions
Bedrooms Bathrooms Super area Floor Status
3 See Dimensions 3 See Dimensions 2100 7 (Out of 23 Floors) 3 See Dimensions
Bedrooms Bathrooms Super area Floor Status
1 1 520 4 (Out of 40 Floors) 1
Bedrooms Bathrooms Super area Floor Status
3 See Dimensions 3 See Dimensions 2100 7 (Out of 23 Floors) 3 See Dimensions
Bedrooms Bathrooms Super area Floor Status
1 1 520 4 (Out of 40 Floors) 1
In the Status column I am getting the wrong value.
I have tried:
# Go through them and click on each.
for unique_link in my_needed_links:
    unique_link.click()
    time.sleep(2)
    driver.switch_to_window(driver.window_handles[1])

    def get_elements_by_xpath(driver, xpath):
        return [entry.text for entry in driver.find_elements_by_xpath(xpath)]

    search_entries = [
        ("Bedrooms", "//div[@class='seeBedRoomDimen']"),
        ("Bathrooms", "//div[@class='p_value']"),
        ("Super area", "//span[@id='coveredAreaDisplay']"),
        ("Floor", "//div[@class='p_value truncated']"),
        ("Lift", "//div[@class='p_value']")]

    with open('textfile.csv', 'a+') as f_output:
        csv_output = csv.writer(f_output)
        # Write header
        csv_output.writerow([name for name, xpath in search_entries])
        entries = []
        for name, xpath in search_entries:
            entries.append(get_elements_by_xpath(driver, xpath))
        csv_output.writerows(zip(*entries))

    get_elements_by_xpath(driver, xpath)
Edit
Entries, as a list:
[['3 See Dimensions'], ['3 See Dimensions', '4', '3', '1', '2100 sqft', '1400 sqft', '33%', 'Avenue 54', 'Under Construction', "Dec, '20", 'New Property', '₹ 7.90 Cr ₹ 39,50,000 Approx. Registration Charges ₹ 15 Per sq. Unit Monthly\nSee Other Charges', "Santacruz West, Mumbai., Santacruz West, Mumbai - Western Suburbs, Maharashtra What's Nearby", "Next To St Teresa's Convent School & Sacred Heart School on SV Road.", 'East', 'P51800007149 (The project has been registered via MahaRERA registration number: P51800007149 and is available on the website https://maharera.mahaonline.gov.in under registered projects.)', 'Garden/Park, Pool, Main Road', 'Marble, Marbonite, Wooden', '1 Covered', '24 Hours Available', 'No/Rare Powercut', '6', '6', 'Unfurnished', 'Municipal Corporation of Greater Mumbai', 'Freehold', 'Brokers please do not contact', ''], ['2100'], ['7 (Out of 23 Floors)'], ['3 See Dimensions', '4', '3', '1', '2100 sqft', '1400 sqft', '33%', 'Avenue 54 1 Discussion on forum', 'Under Construction', "Dec, '20", 'New Property', '₹ 7.90 Cr ₹ 39,50,000 Approx. Registration Charges ₹ 15 Per sq. Unit Monthly\nSee Other Charges', "Santacruz West, Mumbai., Santacruz West, Mumbai - Western Suburbs, Maharashtra What's Nearby", "Next To St Teresa's Convent School & Sacred Heart School on SV Road.", 'East', 'P51800007149 (The project has been registered via MahaRERA registration number: P51800007149 and is available on the website https://maharera.mahaonline.gov.in under registered projects.)', 'Garden/Park, Pool, Main Road', 'Marble, Marbonite, Wooden', '1 Covered', '24 Hours Available', 'No/Rare Powercut', '6', '6', 'Unfurnished', 'Municipal Corporation of Greater Mumbai', 'Freehold', 'Brokers please do not contact', '']]
[['3 See Dimensions'], ['3 See Dimensions', '4', '3', '1', '2100 sqft', '1400 sqft', '33%', 'Avenue 54 1 Discussion on forum', 'Under Construction', "Dec, '20", 'New Property', '₹ 7.90 Cr ₹ 39,50,000 Approx. Registration Charges ₹ 15 Per sq. Unit Monthly\nSee Other Charges', "Santacruz West, Mumbai., Santacruz West, Mumbai - Western Suburbs, Maharashtra What's Nearby", "Next To St Teresa's Convent School & Sacred Heart School on SV Road.", 'East', 'P51800007149 (The project has been registered via MahaRERA registration number: P51800007149 and is available on the website https://maharera.mahaonline.gov.in under registered projects.)', 'Garden/Park, Pool, Main Road', 'Marble, Marbonite, Wooden', '1 Covered', '24 Hours Available', 'No/Rare Powercut', '6', '6', 'Unfurnished', 'Municipal Corporation of Greater Mumbai', 'Freehold', 'Brokers please do not contact', ''], ['2100'], ['7 (Out of 23 Floors)'], ['3 See Dimensions', '4', '3', '1', '2100 sqft', '1400 sqft', '33%', 'Avenue 54 1 Discussion on forum', 'Under Construction', "Dec, '20", 'New Property', '₹ 7.90 Cr ₹ 39,50,000 Approx. Registration Charges ₹ 15 Per sq. Unit Monthly\nSee Other Charges', "Santacruz West, Mumbai., Santacruz West, Mumbai - Western Suburbs, Maharashtra What's Nearby", "Next To St Teresa's Convent School & Sacred Heart School on SV Road.", 'East', 'P51800007149 (The project has been registered via MahaRERA registration number: P51800007149 and is available on the website https://maharera.mahaonline.gov.in under registered projects.)', 'Garden/Park, Pool, Main Road', 'Marble, Marbonite, Wooden', '1 Covered', '24 Hours Available', 'No/Rare Powercut', '6', '6', 'Unfurnished', 'Municipal Corporation of Greater Mumbai', 'Freehold', 'Brokers please do not contact', '']]
website link: https://www.magicbricks.com/propertyDetails/1-BHK-520-Sq-ft-Multistorey-Apartment-FOR-Sale-Kandivali-West-in-Mumbai&id=4d423333373433343431
Edit 1
my_needed_links = []
list_links = driver.find_elements_by_tag_name("a")
for i in range(0, 2):
    # Get unique links.
    for link in list_links:
        if "https://www.magicbricks.com/propertyDetails/" in link.get_attribute("href"):
            if link not in my_needed_links:
                my_needed_links.append(link)
    # Go through them and click on each.
    for unique_link in my_needed_links:
        unique_link.click()
        time.sleep(2)
        driver.switch_to_window(driver.window_handles[1])

        def get_elements_by_xpath(driver, xpath):
            return [entry.text for entry in driver.find_elements_by_xpath(xpath)]

        search_entries = [
            ("Bedrooms", "//div[@class='seeBedRoomDimen']"),
            ("Bathrooms", "//div[@class='p_value']"),
            ("Super area", "//span[@id='coveredAreaDisplay']"),
            ("Floor", "//div[@class='p_value truncated']"),
            ("Lift", "//div[@class='p_value']")]

        #with open('textfile.csv', 'a+') as f_output:
        entries = []
        for name, xpath in search_entries:
            entries.append(get_elements_by_xpath(driver, xpath))

        data = [entry for entry in entries if len(entry) == 28]
        df = pd.DataFrame(data)
        print(df)
        df.to_csv('nameoffile.csv', mode='a', index=False, encoding='utf-8')

        get_elements_by_xpath(driver, xpath)
        time.sleep(2)
        driver.close()
        # Switch back to the main tab/window.
        driver.switch_to_window(driver.window_handles[0])
Thank you in advance. Please suggest something.
The XPath for Bathrooms and the XPath for Lift are the same, therefore you get the same results in both columns. Try to find another way to identify and distinguish between them. You can probably use an index, though if there is another way it is usually preferred.
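For illustration, here is a minimal sketch of the index idea using the standard library's ElementTree on hypothetical markup (the real page structure may differ; in Selenium the equivalent would be a positional XPath such as `(//div[@class='p_value'])[1]`):

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: two different fields share the class 'p_value',
# which is why one XPath returns both Bathrooms and Lift.
snippet = """<root>
  <div class="p_value">3</div>
  <div class="p_value">Yes</div>
</root>"""

root = ET.fromstring(snippet)
values = [el.text for el in root.findall(".//div[@class='p_value']")]

# Distinguish the two fields by position instead of by class alone.
bathrooms = values[0]  # first p_value on the page
lift = values[1]       # second p_value on the page
print(bathrooms, lift)  # 3 Yes
```

Positional selection is brittle if the page layout changes, which is why anchoring on a nearby label element is usually preferred when one exists.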
I could not load the page due to my location, but from your entries you could do:
# Your selenium imports
import pandas as pd

def get_elements_by_xpath(driver, xpath):
    return [entry.text for entry in driver.find_elements_by_xpath(xpath)]

for unique_link in my_needed_links:
    unique_link.click()
    time.sleep(2)
    driver.switch_to_window(driver.window_handles[1])

    search_entries = [
        ("Bedrooms", "//div[@class='seeBedRoomDimen']"),
        ("Bathrooms", "//div[@class='p_value']"),
        ("Super area", "//span[@id='coveredAreaDisplay']"),
        ("Floor", "//div[@class='p_value truncated']"),
        ("Lift", "//div[@class='p_value']")]

    entries = []
    for name, xpath in search_entries:
        entries.append(get_elements_by_xpath(driver, xpath))

    data = [entry for entry in entries if len(entry) > 5]
    df = pd.DataFrame(data)
    df.drop_duplicates(inplace=True)
    df.to_csv('nameoffile.csv', sep=';', index=False, encoding='utf-8', mode='a')
Related
I'm trying to extract the elements of a list to turn it into a CSV file. I have a long list containing string elements. Example:
print(list_x[0])
['NAME\tGill Pizza Ltd v Labour Inspector', 'JUDGE(S)\tWilliam Young, Ellen France & Williams JJ', 'COURT\tSupreme Court', 'FILE NUMBER\tSC 67-2021', 'JUDG DATE\t12 August 2021', 'CITATION\t[2021] NZSC 97', 'FULL TEXT\t PDF judgment', 'PARTIES\tGill Pizza Ltd (First Applicant), Sandeep Singh (Second Applicant), Jatinder Singh (Third Applicant/Fourth Applicant), Mandeep Singh (Fourth Applicant/Third Applicant), A Labour Inspector (Ministry of Business, Innovation and Employment) (Respondent), Malotia Ltd (First Applicant)', 'STATUTES\tEmployment Relations Act 2000 s6(5), s228(1)', 'CASES CITED\tA Labour Inspector (Ministry of Business, Innovation and Employment) v Gill Pizza Ltd [2021] NZCA 192', 'REPRESENTATION\tGG Ballara, SP Radcliffe, JC Catran, HTN Fong', 'PAGES\t2 p', 'LOCATION\tNew Zealand Law Society Library', 'DATE ADDED\tAugust 19, 2021']
Is it possible to do something like:
name_list = []
file_number_list = []
subject_list = []
held_list = []
pages_list = []
date_list = []

for i in range(len(list_x)):
    if list_x[i].startswith('NAME'):
        name_list.append(list_x[i])
    elif list_x[i].startswith('FILE NUMBER'):
        file_number_list.append(list_x[i])
    elif list_x[i].startswith('SUBJECT'):
        subject_list.append(list_x[i])
    elif list_x[i].startswith('HELD'):
        held_list.append(list_x[i])
    elif list_x[i].startswith('PAGES'):
        pages_list.append(list_x[i])
    elif list_x[i].startswith('DATE ADDED'):
        date_list.append(list_x[i])
Any help is appreciated. Cheers.
You can also try something like this:
my_dict_list = []
for item in list_x:
    my_dict_list.append(dict(zip([i.split('\t', 1)[0] for i in item],
                                 [i.split('\t', 1)[1] for i in item])))
Results:
[{'NAME': 'Gill Pizza Ltd v Labour Inspector',
'JUDGE(S)': 'William Young, Ellen France & Williams JJ',
'COURT': 'Supreme Court',
'FILE NUMBER': 'SC 67-2021',
'JUDG DATE': '12 August 2021',
'CITATION': '[2021] NZSC 97',
'FULL TEXT': ' PDF judgment',
'PARTIES': 'Gill Pizza Ltd (First Applicant), Sandeep Singh (Second Applicant), Jatinder Singh (Third Applicant/Fourth Applicant), Mandeep Singh (Fourth Applicant/Third Applicant), A Labour Inspector (Ministry of Business, Innovation and Employment) (Respondent), Malotia Ltd (First Applicant)',
'STATUTES': 'Employment Relations Act 2000 s6(5), s228(1)',
'CASES CITED': 'A Labour Inspector (Ministry of Business, Innovation and Employment) v Gill Pizza Ltd [2021] NZCA 192',
'REPRESENTATION': 'GG Ballara, SP Radcliffe, JC Catran, HTN Fong',
'PAGES': '2 p',
'LOCATION': 'New Zealand Law Society Library',
'DATE ADDED': 'August 19, 2021'},
{....
....
...}]
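Since the end goal is a CSV file, the resulting list of dicts can then be handed straight to csv.DictWriter. A minimal sketch with two shortened, hypothetical records (a real run would use the full dicts above and a file instead of the in-memory buffer):

```python
import csv
import io

# Shortened, hypothetical records shaped like the dicts built above.
my_dict_list = [
    {'NAME': 'Gill Pizza Ltd v Labour Inspector', 'COURT': 'Supreme Court'},
    {'NAME': 'Another Case', 'COURT': 'High Court'},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=my_dict_list[0].keys())
writer.writeheader()           # NAME,COURT
writer.writerows(my_dict_list) # one CSV row per dict
print(buf.getvalue())
```

DictWriter matches each dict key to a column, so missing keys in some records can be handled with its `restval` parameter rather than manual bookkeeping.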
I am new to scraping with Selenium in Python. I could retrieve some of the data, but I want it in table form, as it is displayed on the web page.
Here is what i have so far:
url = 'https://definitivehc.maps.arcgis.com/home/item.html?id=1044bb19da8d4dbfb6a96eb1b4ebf629&view=list&showFilters=false#data'

browser = webdriver.Chrome(r"C:\task\chromedriver")
browser.get(url)
time.sleep(25)

rows_in_table = browser.find_elements_by_xpath('//table[@class="dgrid-row-table"]//tr[th or td]')
for element in rows_in_table:
    print(element.text.replace('\n', ''))
result snippet:
Hospital NameHospital TypeCityState AbrvZip CodeCounty NameState Name
Phoenix VA Health Care System (AKA Carl T Hayden VA Medical Center)VA HospitalPhoenixAZ85012MaricopaArizona040130401362620000.001
Southern Arizona VA Health Care SystemVA HospitalTucsonAZ85723PimaArizona04019040192952952202.002
VA Central California Health Care SystemVA HospitalFresnoCA93703FresnoCalifornia060190601954542202.003
VA Connecticut Healthcare System - West Haven Campus (AKA West Haven VA Medical Center)VA HospitalWest HavenCT6516New HavenConnecticut09009090092162161102.004
I will really appreciate a help form an expert on this. Thanks.
This is an updated version of what Andrej answered; this code downloads the table and, instead of printing it, saves it as an Excel document.
import json
import requests
import pandas as pd
from pandas.io.json import json_normalize

config_url = 'https://definitivehc.maps.arcgis.com/sharing/rest/portals/self?culture=en-us&f=json'
page_url = 'https://services7.arcgis.com/{_id}/arcgis/rest/services/Definitive_Healthcare_USA_Hospital_Beds/FeatureServer/0/query?f=json&where=1%3D1&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=*&orderByFields=OBJECTID%20ASC&resultOffset={offset}&resultRecordCount=50&cacheHint=true&quantizationParameters=%7B%22mode%22%3A%22edit%22%7D'

_id = requests.get(config_url).json()['id']

required = []
offset = 0
while True:
    data = requests.get(page_url.format(_id=_id, offset=offset)).json()
    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))
    for i, f in enumerate(data['features'], offset+1):
        required.append(f['attributes'])
    if i % 50:
        break
    offset += 50

df = json_normalize(required)
with pd.ExcelWriter('dataFunction.xlsx') as writer:
    df.to_excel(writer)
I tried this and uploaded the Excel sheet HERE (LINK TO EXCEL SHEET)!
The data is loaded dynamically using JavaScript. You can use the requests module to simulate those requests:
import json
import requests

config_url = 'https://definitivehc.maps.arcgis.com/sharing/rest/portals/self?culture=en-us&f=json'
page_url = 'https://services7.arcgis.com/{_id}/arcgis/rest/services/Definitive_Healthcare_USA_Hospital_Beds/FeatureServer/0/query?f=json&where=1%3D1&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=*&orderByFields=OBJECTID%20ASC&resultOffset={offset}&resultRecordCount=50&cacheHint=true&quantizationParameters=%7B%22mode%22%3A%22edit%22%7D'

_id = requests.get(config_url).json()['id']

offset = 0
while True:
    data = requests.get(page_url.format(_id=_id, offset=offset)).json()
    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))
    for i, f in enumerate(data['features'], offset+1):
        print(i, f['attributes'])
        print('-' * 160)
    if i % 50:
        break
    offset += 50
Prints all 6624 records:
...
6614 {'OBJECTID': 6614, 'HOSPITAL_NAME': 'Walter E Washington Convention Center Field Hospital (Temporarily Open due to COVID-19)', 'HOSPITAL_TYPE': 'Short Term Acute Care Hospital', 'HQ_ADDRESS': '801 Mount Vernon Pl Nw', 'HQ_ADDRESS1': None, 'HQ_CITY': 'Washington', 'HQ_STATE': 'DC', 'HQ_ZIP_CODE': '20001', 'COUNTY_NAME': 'District of Columbia', 'STATE_NAME': 'District of Columbia', 'STATE_FIPS': '11', 'CNTY_FIPS': '001', 'FIPS': '11001', 'NUM_LICENSED_BEDS': None, 'NUM_STAFFED_BEDS': None, 'NUM_ICU_BEDS': 0, 'ADULT_ICU_BEDS': 0, 'PEDI_ICU_BEDS': None, 'BED_UTILIZATION': None, 'Potential_Increase_In_Bed_Capac': 0, 'AVG_VENTILATOR_USAGE': None}
----------------------------------------------------------------------------------------------------------------------------------------------------------------
6615 {'OBJECTID': 6615, 'HOSPITAL_NAME': 'Joint Base Cape Cod Field Hospital (Temporarily Open due to COVID-19)', 'HOSPITAL_TYPE': 'Short Term Acute Care Hospital', 'HQ_ADDRESS': 'Connery Ave', 'HQ_ADDRESS1': None, 'HQ_CITY': 'Buzzards Bay', 'HQ_STATE': 'MA', 'HQ_ZIP_CODE': '2542', 'COUNTY_NAME': 'Barnstable', 'STATE_NAME': 'Massachusetts', 'STATE_FIPS': '25', 'CNTY_FIPS': '001', 'FIPS': '25001', 'NUM_LICENSED_BEDS': None, 'NUM_STAFFED_BEDS': None, 'NUM_ICU_BEDS': 0, 'ADULT_ICU_BEDS': 0, 'PEDI_ICU_BEDS': None, 'BED_UTILIZATION': None, 'Potential_Increase_In_Bed_Capac': 0, 'AVG_VENTILATOR_USAGE': None}
----------------------------------------------------------------------------------------------------------------------------------------------------------------
6616 {'OBJECTID': 6616, 'HOSPITAL_NAME': 'UMass Lowell Recreation Center Field Hospital (Temporarily Open due to COVID-19)', 'HOSPITAL_TYPE': 'Short Term Acute Care Hospital', 'HQ_ADDRESS': '322 Aiken St', 'HQ_ADDRESS1': None, 'HQ_CITY': 'Lowell', 'HQ_STATE': 'MA', 'HQ_ZIP_CODE': '1854', 'COUNTY_NAME': 'Middlesex', 'STATE_NAME': 'Massachusetts', 'STATE_FIPS': '25', 'CNTY_FIPS': '017', 'FIPS': '25017', 'NUM_LICENSED_BEDS': None, 'NUM_STAFFED_BEDS': None, 'NUM_ICU_BEDS': 0, 'ADULT_ICU_BEDS': 0, 'PEDI_ICU_BEDS': None, 'BED_UTILIZATION': None, 'Potential_Increase_In_Bed_Capac': 0, 'AVG_VENTILATOR_USAGE': None}
----------------------------------------------------------------------------------------------------------------------------------------------------------------
6617 {'OBJECTID': 6617, 'HOSPITAL_NAME': 'Miami Beach Convention Center Field Hospital (Temporarily Open due to COVID-19)', 'HOSPITAL_TYPE': 'Short Term Acute Care Hospital', 'HQ_ADDRESS': '1901 Convention Center Dr', 'HQ_ADDRESS1': None, 'HQ_CITY': 'Miami Beach', 'HQ_STATE': 'FL', 'HQ_ZIP_CODE': '33139', 'COUNTY_NAME': 'Miami-Dade', 'STATE_NAME': 'Florida', 'STATE_FIPS': '12', 'CNTY_FIPS': '086', 'FIPS': '12086', 'NUM_LICENSED_BEDS': None, 'NUM_STAFFED_BEDS': None, 'NUM_ICU_BEDS': 0, 'ADULT_ICU_BEDS': 0, 'PEDI_ICU_BEDS': None, 'BED_UTILIZATION': None, 'Potential_Increase_In_Bed_Capac': 0, 'AVG_VENTILATOR_USAGE': None}
...
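The `if i % 50: break` test stops the loop on the first page that returns fewer than 50 records, because `i` is then not a multiple of 50. A quick simulation against a fake paginated source (hypothetical numbers, no network) shows the behaviour:

```python
def fetch(offset, total=120, page=50):
    # Pretend API: return up to `page` record ids starting at `offset`.
    return list(range(offset, min(offset + page, total)))

offset, collected = 0, []
while True:
    for i, f in enumerate(fetch(offset), offset + 1):
        collected.append(f)
    if i % 50:          # a short page means we just read the last one
        break
    offset += 50

print(len(collected))  # 120
```

One caveat: if the total were an exact multiple of 50, the final fetch would return an empty page, `i` would keep its previous (multiple-of-50) value, and the loop would not terminate; checking `len(page) < 50` on the returned page itself is more robust.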
My dataframe looks like this:
df_dict = {'Title': titles, 'Relase': release, 'Audience Rating': audience_rating,
           'Runtime': runtime, 'Genre': genre,
           'Votes': votes, 'Director': directors,
           'Actors': actors}
df = pd.DataFrame(df_dict)
This is the output:
Title Relase Audience Rating Runtime Genre Votes Director Actors
0 Sandhippoma (1998) Not Rated 136 min "
Drama, Romance " 56 Indrakumar ['Ambika', 'Sarath Babu',
'Prakash Raj', 'Radhika Sarathkumar']
1 Joker (2016) Not Rated 130 min "
Comedy, Drama " 2,712 Raju Murugan ['Raju Saravanan', 'Guru
Somasundaram', 'Ramya Pandiyan', 'Gayathri']
When doing df.to_csv, only the 'Genre' column comes out empty; all the other columns are written into the CSV correctly:
df.to_csv('movies.csv', index=False)
The list genre looks like this:
['\nDrama, Romance                ',
 '\nComedy, Drama                ',
 '\nAction, Biography, Drama                ']
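Each genre entry starts with '\n' and ends with a long run of spaces; an embedded newline forces the CSV field to be quoted and span lines, which may be why the column looks broken or empty when the file is opened. Stripping the values before building the DataFrame is a minimal fix, sketched on the sample values above:

```python
# Sample values shaped like the genre list above.
genre = ['\nDrama, Romance                ',
         '\nComedy, Drama                ',
         '\nAction, Biography, Drama                ']

# Remove the leading newline and trailing spaces from every entry.
cleaned = [g.strip() for g in genre]
print(cleaned)  # ['Drama, Romance', 'Comedy, Drama', 'Action, Biography, Drama']
```

If the DataFrame is already built, `df['Genre'] = df['Genre'].str.strip()` does the same thing column-wide.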
I'm making this web scraper for a project, but it only returns one of the values I'm looking for instead of also processing the other 18 elements in listings. It returns all the information for one house, but I want the information on the other 18 houses also stored in the variables. Thanks very much.
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq

my_url = "https://www.daft.ie/ireland/property-for-sale/"

# open connection and grab webpage
uClient = uReq(my_url)
# store html in a variable
page_html = uClient.read()
# close web connection
uClient.close()

# parse html
soup = BeautifulSoup(page_html, "html.parser")
print(soup)

# grabs listings house information
listings = soup.findAll("div", {"class": "FeaturedCardPropertyInformation__detailsContainer"})
for container in listings:
    # extracting price
    price = container.div.div.strong.text
    # location
    name_container = container.div.find("a", {"class": "PropertyInformationCommonStyles__addressCopy--link"}).text
    # house type
    house = container.div.find("div", {"class": "QuickPropertyDetails__propertyType"}).text
    # number of bathrooms
    bath_num = container.div.find("div", {"class": "QuickPropertyDetails__iconCopy--WithBorder"}).text
    # number of bedrooms
    bed_num = container.div.find("div", {"class": "QuickPropertyDetails__iconCopy"}).text
You can simply create a blank list before the for loop and append the variables in every iteration, storing all the data in a single list. Your code will look as follows:
data = []
for container in listings:
    # extracting price
    price = container.div.div.strong.text
    # location
    name_container = container.div.find("a", {"class": "PropertyInformationCommonStyles__addressCopy--link"}).text
    # house type
    house = container.div.find("div", {"class": "QuickPropertyDetails__propertyType"}).text
    # number of bathrooms
    bath_num = container.div.find("div", {"class": "QuickPropertyDetails__iconCopy--WithBorder"}).text
    # number of bedrooms
    bed_num = container.div.find("div", {"class": "QuickPropertyDetails__iconCopy"}).text
    data.append((price, name_container, house, bath_num, bed_num))

print(data)
Your final output will look as follows:
[('€1,350,000', 'The Penthouse at Hanover Quay, 27 Hanover Dock, Grand Canal Dock, Dublin 2', 'Apartment for sale', '2', '3'), ('€450,000', '9 Na Ceithre Gaoithe Ring, Dungarvan, Co. Waterford', ' Detached House', '4', '5'), ('€390,000', 'Cave, Caherlistrane, Co. Galway', ' Detached House', '4', '5'), ('€720,000', '18 Hazelbrook Road, Terenure, Terenure, Dublin 6', ' Detached House', '3', '4'), ('€210,000', 'Carraig Abhainn, Ballisodare, Co. Sligo', 'Bungalow for sale', '1', '3'), ('€495,000', 'Campbell Court, Cairns Hill, Sligo, Co. Sligo', ' Detached House', '4', '4'), ('€125,000', '33 Leim An Bhradain, Gort Road, Ennis, Co. Clare', 'Apartment for sale', '2', '2'), ('€395,000', '1 Windermere Court, Bishopstown, Bishopstown, Co. Cork', ' End of Terrace House', '3', '4'), ('€349,000', '59 Dun Eoin, Ballinrea Road, Carrigaline, Co. Cork', ' Detached House', '3', '4'), ('€515,000', '2 Elm Walk, Classes Lake, Ovens, Co. Cork', ' Detached House', '5', '4'), ('€490,000', '9 Munster st., Phibsborough, Dublin 7', ' Terraced House', '2', '4'), ('€249,950', '47 Westfields, Clare Road, Ennis, Co. Clare', ' Detached House', '3', '4'), ('€435,000', '3 Castlelough Avenue, Loreto Road, Killarney, Co. Kerry', ' Detached House', '3', '4'), ('€620,000', 'Beaufort House, Knockacleva, Philipstown, Dunleer, Dunleer, Co. Louth', ' Detached House', '3', '5'), ('€550,000', "Flat 5, Saint Ann's Apartments, Donnybrook, Dublin 4", 'Apartment for sale', '2', '2'), ('€675,000', '3 Church Hill, Innishannon, Co. Cork', ' Detached House', '3', '5'), ('€495,000', 'River Lodge, The Rower, Inistioge, Co. Kilkenny', ' Detached House', '4', '4'), ('€325,000', 'Coolgarrane House, Coolgarrane, Thurles, Co. Tipperary', ' Detached House', '1', '4'), ('€399,950', 'No 14 Coopers Grange Old Quarter, Ballincollig, Co. Cork', ' Semi-Detached House', '3', '4')]
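If the collected tuples should end up in a CSV file rather than just printed, csv.writer accepts the list of tuples directly. A sketch with one shortened, hypothetical row (writing to an in-memory buffer here, but a real file handle works the same way):

```python
import csv
import io

# One shortened, hypothetical row shaped like the tuples above.
data = [('€450,000', 'Dungarvan, Co. Waterford', 'Detached House', '4', '5')]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['price', 'address', 'type', 'baths', 'beds'])  # header
writer.writerows(data)                                          # one row per tuple
print(buf.getvalue())
```

Note that fields containing commas (the price and the address) are quoted automatically by the writer, so they round-trip cleanly through csv.reader.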
I have tried a lot of suggestions but I am unable to remove carriage returns. I am new to Python and trying to clean a CSV file.
import csv

filepath_i = r'C:\Source Files\Data Source\Flat File Source\PatientRecords.csv'
filepath_o = r'C:\Source Files\Data Source\Flat File Source\PatientRecords2.csv'

rows = []
with open(filepath_i, 'rU', newline='') as csv_file:
    #filtered = (line.replace('\r\n', '') for line in csv_file)
    filtered = (line.replace('\r', '') for line in csv_file)
    csv_reader = csv.reader(csv_file, delimiter=',')
    i = 0
    for row in csv_reader:
        print(row)
        i = i + 1
        if i == 10:
            break

#with open(filepath_o, 'w', newline='') as writeFile:
#    writer = csv.writer(writeFile, lineterminator='\r')
#    for row in csv_reader:
#        rows.append(row.strip())
#    writer.writerows(rows)
Input
DRG Definition,Provider Id,Provider Name,Provider Street Address,Provider City,Provider State,Provider Zip Code,Hospital Referral Region Description,Hospital Category,Hospital Type, Total Discharges ,Covered Charges , Total Payments ,Medicare Payments
039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,AL - Dothan,Specialty Centers,Government Funded,91,"$32,963.07 ","$5,777.24 ","$4,763.73 "
039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,10005,MARSHALL MEDICAL CENTER SOUTH,"2505 U S HIGHWAY
431 NORTH",BOAZ,AL,35957,AL - Birmingham,Specialty Centers,Private Institution,14,"$15,131.85 ","$5,787.57 ","$4,976.71 "
039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,10006,ELIZA COFFEE MEMORIAL HOSPITAL,205 MARENGO STREET,FLORENCE,AL,35631,AL - Birmingham,Rehabilitation Centers,Private Institution,24,"$37,560.37 ","$5,434.95 ","$4,453.79 "
Output (4th column 'Provider Street Address')
['DRG Definition', 'Provider Id', 'Provider Name', 'Provider Street Address', 'Provider City', 'Provider State', 'Provider Zip Code', 'Hospital Referral Region Description', 'Hospital Category', 'Hospital Type', ' Total Discharges ', 'Covered Charges ', ' Total Payments ', 'Medicare Payments']
['039 - EXTRACRANIAL PROCEDURES W/O CC/MCC', '10001', 'SOUTHEAST ALABAMA MEDICAL CENTER', '1108 ROSS CLARK CIRCLE', 'DOTHAN', 'AL', '36301', 'AL - Dothan', 'Specialty Centers', 'Government Funded', '91', '$32,963.07 ', '$5,777.24 ', '$4,763.73 ']
['039 - EXTRACRANIAL PROCEDURES W/O CC/MCC', '10005', 'MARSHALL MEDICAL CENTER SOUTH', '2505 U S HIGHWAY \n431 NORTH', 'BOAZ', 'AL', '35957', 'AL - Birmingham', 'Specialty Centers', 'Private Institution', '14', '$15,131.85 ', '$5,787.57 ', '$4,976.71 ']
I ran this on my side and it works:
with open(filepath_i, 'rU', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        row[3] = row[3].replace("\n", "").replace("\r", "")
        print(row)
Output:
['DRG Definition', 'Provider Id', 'Provider Name', 'Provider Street Address', 'Provider City', 'Provider State', 'Provider Zip Code', 'Hospital Referral Region Description', 'Hospital Category', 'Hospital Type', ' Total Discharges ', 'Covered Charges ', ' Total Payments ', 'Medicare Payments']
['039 - EXTRACRANIAL PROCEDURES W/O CC/MCC', '10001', 'SOUTHEAST ALABAMA MEDICAL CENTER', '1108 ROSS CLARK CIRCLE', 'DOTHAN', 'AL', '36301', 'AL - Dothan', 'Specialty Centers', 'Government Funded', '91', '$32,963.07 ', '$5,777.24 ', '$4,763.73 ']
['039 - EXTRACRANIAL PROCEDURES W/O CC/MCC', '10005', 'MARSHALL MEDICAL CENTER SOUTH', '2505 U S HIGHWAY 431 NORTH', 'BOAZ', 'AL', '35957', 'AL - Birmingham', 'Specialty Centers', 'Private Institution', '14', '$15,131.85 ', '$5,787.57 ', '$4,976.71 ']
['039 - EXTRACRANIAL PROCEDURES W/O CC/MCC', '10006', 'ELIZA COFFEE MEMORIAL HOSPITAL', '205 MARENGO STREET', 'FLORENCE', 'AL', '35631', 'AL - Birmingham', 'Rehabilitation Centers', 'Private Institution', '24', '$37,560.37 ', '$5,434.95 ', '$4,453.79 ']
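To write the cleaned rows to the second file instead of printing them, the same loop can feed a csv.writer. A sketch using in-memory buffers and one hypothetical record whose fourth column contains an embedded newline, like the address in the input above:

```python
import csv
import io

# One record whose 4th column spans two lines inside quotes.
raw = 'a,b,c,"2505 U S HIGHWAY\n431 NORTH",e\r\n'

reader = csv.reader(io.StringIO(raw))
out = io.StringIO()
writer = csv.writer(out)
for row in reader:
    # Replace the embedded newline with a space so the field stays one line.
    row[3] = row[3].replace('\n', ' ').replace('\r', '')
    writer.writerow(row)

print(out.getvalue())  # a,b,c,2505 U S HIGHWAY 431 NORTH,e
```

With real files, replace the buffers with `open(filepath_i, newline='')` and `open(filepath_o, 'w', newline='')`; the `newline=''` arguments let the csv module manage line endings itself.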