This is code in which I try to get data from a website using requests and save it in a dictionary called table, but when I iterate through those values and append them to a list I run into the problem shown below. Any help is appreciated.
import requests
from bs4 import BeautifulSoup
list1 = []
table = {}
r = requests.get("https://www.century21.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/?k=1")
content = r.content
soup = BeautifulSoup(content,'html.parser')
all = soup.find_all('div',{"class":"property-card-primary-info"})
for item in all:
    print(item.find('a', {"class": "listing-price"}).text.replace('\n', '').replace(' ', ''))
    table['address'] = item.find('div', {"class": "property-address"}).text.replace('\n', '').replace(' ', '')
    table['city'] = item.find('div', {"class": "property-city"}).text.replace('\n', '').replace(' ', '')
    table['beds'] = item.find('div', {"class": "property-beds"}).text.replace('\n', '').replace(' ', '')
    table['baths'] = item.find('div', {"class": "property-baths"}).text.replace('\n', '').replace(' ', '')
    try:
        table['half-baths'] = item.find("div", {"class": "property-half-baths"}).text.replace('\n', '').replace(' ', '')
    except:
        table['half-baths'] = None
    try:
        table['property sq.ft.'] = item.find("div", {"class": "property-sqft"}).text.replace(' ', '').replace("\n", '')
    except:
        table['property sq.ft.'] = None
    list1.append(table)
list1
OUTPUT
$325,000
$249,000
$390,000
$274,900
$208,000
$169,000
$127,500
$990,999
I'm getting unique values when I print the price values, but when I append to the list, all the values are replicated. Any help would mean a lot.
Question: how do I get rid of this replication and get the corresponding values for each listing?
[{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '}]
The problem is that you create a single dictionary before the loop and keep mutating it, so the list ends up holding eight references to the same object. Create a new dict at the top of every iteration:
for item in all:
    table = {}  # important: a NEW dict for every listing
    print(item.find('a', {"class": "listing-price"}).text.replace('\n', '').replace(' ', ''))
    table['address'] = item.find('div', {"class": "property-address"}).text.replace('\n', '').replace(' ', '')
    table['city'] = item.find('div', {"class": "property-city"}).text.replace('\n', '').replace(' ', '')
    table['beds'] = item.find('div', {"class": "property-beds"}).text.replace('\n', '').replace(' ', '')
    table['baths'] = item.find('div', {"class": "property-baths"}).text.replace('\n', '').replace(' ', '')
    try:
        table['half-baths'] = item.find("div", {"class": "property-half-baths"}).text.replace('\n', '').replace(' ', '')
    except AttributeError:
        table['half-baths'] = None
    try:
        table['property sq.ft.'] = item.find("div", {"class": "property-sqft"}).text.replace(' ', '').replace("\n", '')
    except AttributeError:
        table['property sq.ft.'] = None
    list1.append(table)
print(list1)  # print the list outside the loop; note dicts are unhashable, so set() cannot be used to deduplicate here
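To see why the original version repeats the last listing, here is a minimal sketch of the same aliasing bug, independent of any scraping:

```python
# Appending the same dict object: every list entry aliases one dict,
# so mutating it later changes all "entries" at once.
shared = {}
aliased = []
for price in [325000, 249000, 390000]:
    shared["price"] = price
    aliased.append(shared)      # same object appended three times

# All three entries show the last value written.
print(aliased)  # [{'price': 390000}, {'price': 390000}, {'price': 390000}]

# Creating a fresh dict per iteration keeps each entry independent.
fixed = []
for price in [325000, 249000, 390000]:
    fixed.append({"price": price})
print(fixed)    # [{'price': 325000}, {'price': 249000}, {'price': 390000}]
```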
['AAR CORP. FQ1 2011 EARNINGS CALL | SEP 16, 2010 ',
'Copyright © 2019 S&P Global Market Intelligence, a division of S&P Global Inc. All Rights reserved. ',
'spglobal.com/marketintelligence 3 ',
' ',
' ',
'Call Participants ',
'EXECUTIVES ',
' ',
'David P. Storch ',
'Chairman of the Board ',
' ',
'Richard J. Poulton ',
'Former Chief Financial Officer, Vice ',
'President and Treasurer ',
' ',
'Timothy J. Romenesko ',
'Former Vice Chairman ',
'Tom Udovich ',
'ANALYSTS ',
'Arnold Ursaner ',
'CJS Securities, Inc. ',
' ',
'Eric Charles Hugel ',
'Stephens Inc., Research Division ',
' ',
'J. B. Groh ',
'D.A. Davidson & Co., Research ',
'Division ',
' ',
'Jonathan Paul Braatz ',
'Kansas City Capital Associates ',
'Joseph DeNardi ',
'Kenneth George Herbert ',
'Wedbush Securities Inc., Research ',
'Division ',
'Thomas Lewis ',
'Tyler Hojo ',
'Sidoti & Company, LLC ']
The above is the format in which the text data is available.
This is how I would like my data to look (the desired layout was attached as an image).
Can anyone please guide me on how to use Python for this purpose?
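Since the desired layout was only shown as an image, here is a hedged sketch of one plausible approach: strip the noise lines, split the list at the EXECUTIVES/ANALYSTS headings, and group each name with the affiliation lines that follow it. The record shape in `participants` is an assumption, not a confirmed target format, and the blank-line-separates-records heuristic will misgroup names that appear back to back without a separator:

```python
# Trimmed sample of the raw transcript lines (assumed structure).
raw = ['AAR CORP. FQ1 2011 EARNINGS CALL | SEP 16, 2010 ',
       'Call Participants ', 'EXECUTIVES ', ' ',
       'David P. Storch ', 'Chairman of the Board ', ' ',
       'Richard J. Poulton ', 'Former Chief Financial Officer, Vice ',
       'President and Treasurer ', ' ',
       'ANALYSTS ',
       'Arnold Ursaner ', 'CJS Securities, Inc. ', ' ']

participants = {'EXECUTIVES': [], 'ANALYSTS': []}
section = None
current = None
for line in (l.strip() for l in raw):
    if not line:                       # blank separator ends the current record
        current = None
        continue
    if line in participants:           # section heading switches the bucket
        section, current = line, None
        continue
    if section is None:                # skip header lines before the first section
        continue
    if current is None:                # first non-blank line of a record is the name
        current = {'name': line, 'affiliation': ''}
        participants[section].append(current)
    else:                              # following lines join into one affiliation string
        current['affiliation'] = (current['affiliation'] + ' ' + line).strip()

print(participants)
```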
I have a data frame with a column of countries that reviewers are from. I want to replace all of the countries that are NOT in my nationalities list with "other".
I created the following code, but it will not run. I get this error:
ValueError: ('Lengths must match to compare', (51575,), (9,))
nationalities = ['United Kingdom', 'United States of America', 'Australia', 'Ireland', 'United Arab Emirates', 'Saudi Arabia', 'Netherlands', 'Germany', 'Canada' ]
sample_hotel_df['Reviewer_Nationality'] = sample_hotel_df['Reviewer_Nationality'].replace(np.where(sample_hotel_df['Reviewer_Nationality'] != nationalities), 'Other')
Sample Input:
sample_hotel_df['Reviewer_Nationality'] = np.array([' Latvia ', ' Israel ', ' Lebanon ', ' Azerbaijan ',
' Kazakhstan ', ' Iraq ', ' Thailand ', ' Denmark ', ' Bulgaria ',
' Luxembourg ', ' Jordan ', ' Kenya ', ' Iceland ', ' Estonia ',
' Serbia ', ' Malta ', ' Cyprus ', ' Greece ', ' South Africa ',
' Croatia ', ' Oman ', ' Bahrain ', ' Finland ', ' Singapore ',
' Malaysia ', ' Portugal ', ' Yemen ', ' Bangladesh ', ' Sudan ',
' Libya ', ' Palestinian Territory ', ' Lithuania ',
' Philippines ', ' Hong Kong ', ' ', ' Dominican Republic ',
' Armenia ', ' Slovakia ', ' Tunisia ', ' Chile ', ' Mauritius ',
' Nepal ', ' Peru ', ' Ghana ', ' Montenegro ', ' Jersey ',
' Morocco ', ' Andorra ', ' Sri Lanka ', ' Argentina ',
' Puerto Rico ', ' Honduras ', ' Indonesia ', ' Abkhazia Georgia ',
' Ukraine ', ' Mongolia ', ' Taiwan ', ' Georgia ',
' Bosnia and Herzegovina ', ' Montserrat ', ' Uruguay ', ' Syria ',
' Jamaica ', ' Angola ', ' Gibraltar ', ' Zambia '])
Output:
sample_hotel_df['Reviewer_Nationality'] = np.array(['United Kingdom',
'United States of America',
'Australia', 'Ireland',
'United Arab Emirates',
'Saudi Arabia',
'Netherlands', 'Germany',
'Canada', 'Other'
])
I can run a for loop but it's computationally heavy. Any suggestions?
Thanks!
You do not need str.replace for this.
sample_hotel_df.loc[~sample_hotel_df['Reviewer_Nationality'].isin(nationalities), 'Reviewer_Nationality'] = "other"
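A quick demonstration of that one-liner on a toy frame (column values invented for illustration):

```python
import pandas as pd

nationalities = ['United Kingdom', 'Australia']
df = pd.DataFrame({'Reviewer_Nationality':
                   ['Latvia', 'Australia', 'Israel', 'United Kingdom']})

# Boolean mask: True where the country is NOT in the keep-list.
mask = ~df['Reviewer_Nationality'].isin(nationalities)
df.loc[mask, 'Reviewer_Nationality'] = 'other'

print(df['Reviewer_Nationality'].tolist())
# ['other', 'Australia', 'other', 'United Kingdom']
```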
Let's say this is your CSV file (data.csv):
Reviewer_Nationality
Latvia
Israel
United States of America
Lebanon
United Kingdom
Australia
You can read it by using pandas:
>>> import pandas as pd
>>> rev_nat = pd.read_csv('data.csv')['Reviewer_Nationality'].to_list()
Then you can filter the nationalities in this way:
>>> nat = ['United Kingdom', 'United States of America', 'Australia', 'Ireland']
>>> result = list(set(n if n in nat else 'Other' for n in rev_nat))
The final result (set order is arbitrary) is
['Other', 'United States of America', 'United Kingdom', 'Australia']
The simplest is probably:
ser = sample_hotel_df['Reviewer_Nationality']
sample_hotel_df['Reviewer_Nationality'] = ser.where(ser.isin(nationalities), 'Other')
Anyway, use ser.isin(lst) and ~ser.isin(lst) in your filters instead of == and !=; that is why you got the error.
== and != are for comparing against a single element.
Edit: yes, it works :) But:
Your sample Series has no country that is in your nationalities list, so everything becomes 'Other'.
All your countries have leading and trailing spaces, so you should clean them with .str.strip().
So this should work, even with your data:
ser = sample_hotel_df['Reviewer_Nationality'].str.strip()
sample_hotel_df['Reviewer_Nationality'] = ser.where(ser.isin(nationalities), 'Other')
The web page I am attempting to extract data from.
Picture of the data I am trying to extract. I want to extract the Test Code, CPT Code(s), Preferred Specimen(s), Minimum Volume, Transport Container, and Transport Temperature.
When I print the soup page, it does not contain the data I need. Therefore, I cannot extract it. Here is how I print the soup page:
soup_page = soup(html_page, "html.parser")
result = soup_page
print(result)
But when I inspect the elements of interest from the web page, I can see the HTML contains the data of interest. Here is some of the HTML:
<h4>Test Code</h4><p>36127</p><span class="LegacyOrder" style="word-wrap:break-word;visibility:hidden"></span><input id="primaryTestCode" value="36127" type="hidden"><input id="searchStringValue" value="36127" type="hidden"><span class="LisTranslatableVerbiage" style="word-wrap:break-word;visibility:hidden"></span>
For the website to return the data, you also need to include cookie information, which is used to specify the laboratory that you request (in your case, SEA). This can easily be added as a requests parameter as follows:
from bs4 import BeautifulSoup
from operator import itemgetter
import requests
url = 'https://www.questdiagnostics.com/testcenter/TestDetail.action?tabName=OrderingInfo&ntc=36127&searchString=36127'
cookies = {
    "TC11SelectedLabCode": "SEA",
    "TC11SelectedLabName": "WA - Seattle 1737 Airport Way S (866)697-8378 (SEA)",
}
r = requests.get(url, cookies=cookies)
soup = BeautifulSoup(r.content, "html.parser")
data = [el.get_text(strip=True) for el in itemgetter(6, 8, 14, 16, 20, 22)(soup.find_all(['h4', 'p']))]
print(data)
This would give you:
['36127', '84443', '1 mL serum', '0.7 mL', 'Serum Separator Tube (SST®)', 'Room temperature']
You might need to improve the extraction of the information; it assumes the elements returned for each search are consistent. Instead, you could search for the required field headings and use the next element, for example:
from bs4 import BeautifulSoup
from operator import itemgetter
import requests
req_fields = ["Test Code", "CPT Code(s)", "Preferred Specimen(s)", "Minimum Volume", "Transport Container", "Transport Temperature"]
url = 'https://www.questdiagnostics.com/testcenter/TestDetail.action?tabName=OrderingInfo&ntc=36127&searchString=36127'
cookies = {
    "TC11SelectedLabCode": "SEA",
    "TC11SelectedLabName": "WA - Seattle 1737 Airport Way S (866)697-8378 (SEA)",
}
r = requests.get(url, cookies=cookies)
soup = BeautifulSoup(r.content, "html.parser")
i_fields = (el.get_text(strip=True) for el in soup.find_all(['h4', 'p']))
data = {field : next(i_fields) for field in i_fields if field in req_fields}
print(data)
Giving a dictionary containing:
{'Test Code': '36127', 'CPT Code(s)': '84443', 'Preferred Specimen(s)': '1 mL serum', 'Minimum Volume': '0.7 mL', 'Transport Container': 'Serum Separator Tube (SST®)', 'Transport Temperature': 'Room temperature'}
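The dict comprehension above relies on a single shared iterator: the `for` clause and the `next(i_fields)` call consume a heading and its following element from the same stream. A toy version of the pattern, with a made-up heading/value list standing in for the `find_all(['h4', 'p'])` results:

```python
req_fields = {"Test Code", "Minimum Volume"}

# Flat alternating heading/value stream, as find_all(['h4', 'p']) would yield.
stream = iter(["Test Code", "36127",
               "Methodology", "Immunoassay",
               "Minimum Volume", "0.7 mL"])

# When `field` is a wanted heading, next(stream) grabs its value.
# An unwanted heading fails the filter (its value is then consumed
# as the next `field` and fails the filter too), so both are skipped.
data = {field: next(stream) for field in stream if field in req_fields}
print(data)  # {'Test Code': '36127', 'Minimum Volume': '0.7 mL'}
```

One caveat of this pattern: if a *value* string happened to equal a wanted heading, it would be misread as a heading, so it works best when headings and values never collide.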
It appears that in order to find the desired page, you must first select the test region from a dropdown menu, and press the button "Go". To do so in Python, you will have to use selenium:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import re, time
def get_page_data(_source):
    headers = ['Test Code', 'CPT Code(s)', 'Preferred Specimen(s)', 'Minimum Volume', 'Transport Container', 'Transport Temperature']
    d1 = list(filter(None, [i.text for i in _source.find('div', {'id': 'labDetail'}).find_all(re.compile('h4|p'))]))
    return {d1[i]: d1[i+1] for i in range(len(d1)-1) if d1[i].rstrip() in headers}

d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://www.questdiagnostics.com/testcenter/TestDetail.action?tabName=OrderingInfo&ntc=36127&searchString=36127')
_d = soup(d.page_source, 'html.parser')
_options = [i.text for i in _d.find_all('option', {'value': re.compile('[A-Z]+')})]
_lab_regions = {}
for _op in _options:
    d.find_element_by_xpath(f"//select[@id='labs']/option[text()='{_op}']").click()
    try:
        d.find_element_by_xpath("//button[@class='confirm go']").click()
    except:
        d.find_element_by_xpath("//button[@class='confirm update']").click()
    _lab_regions[_op] = get_page_data(soup(d.page_source, 'html.parser'))
    time.sleep(2)
print(_lab_regions)
Output:
{'AL - Solstas Birmingham 2732 7th Ave South (866)281-9838 (SKB)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'AZ - Tempe 1255 W Washington St (800)766-6721 (QSO)': {}, 'CA - Quest Diagnostics Infectious Disease, Inc. 33608 Ortega Hwy (800) 445-4032 (FDX)': {}, 'CA - Sacramento 3714 Northgate Blvd (866)697-8378 (MET)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'CA - San Jose 967 Mabury Rd (866)697-8378 (MET)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'CA - San Juan Capistrano 33608 Ortega Hwy (800) 642-4657 (SJC)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Temperature ': 'Room temperature'}, 'CA - Valencia 27027 Tourney Road (800) 421-7110 (SLI)': {}, 'CA - West Hills 8401 Fallbrook Ave (866)697-8378 (MET)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'CO - Midwest 695 S Broadway (866) 697-8378 (STL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'CT - Wallingford 3 Sterling Dr (866)697-8378 (NEL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 
'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'FL - Miramar 10200 Commerce Pkwy (866)697-8378 (TMP)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'FL - Tampa 4225 E Fowler Ave (866)697-8378 (TMP)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'GA - Tucker 1777 Montreal Cir (866)697-8378 (SKB)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'IL - Wood Dale 1355 Mittel Blvd (866)697-8378 (WDL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'IN - Indianapolis 2560 North Shadeland Avenue (317)803-1010 (MJV)': {'Test Code': '36127', 'CPT Code(s) ': '84443'}, 'KS - Lenexa 10101 Renner Blvd (866)697-8378 (STL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'MA - Marlborough 200 Forest Street (866) 697-8378 (NEL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'MD - East Region - Baltimore, 1901 Sulphur Spring Rd (866) 697-8378) (PHP)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred 
Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'MD - Baltimore 1901 Sulphur Spring Rd (866)697-8378 (QBA)': {'Test Code': '36127X', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL (0.7 mL minimum) serumPlasma is no longer acceptable', 'Minimum Volume ': '0.7 mL'}, 'MI - Troy 1947 Technology Drive (866)697-8378 (WDL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'MO - Maryland Heights 11636 Administration Dr (866)697-8378 (STL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'NC - Greensboro 4380 Federal Drive (866)697-8378 (SKB)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'NJ - Teterboro 1 Malcolm Ave (866)697-8378 (QTE)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'NM - Albuquerque 5601 Office Blvd ABQ (866) 697-8378 (DAL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'NV - Las Vegas 4230 Burnham Ave (866)697-8378 (MET)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube 
(SST®)', 'Transport Temperature ': 'Room temperature'}, 'NY - Melville 50 Republic Rd Suite 200 - (516) 677-3800 (QTE)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'OH - Cincinnati 6700 Steger Dr (866)697-8378 (WDL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'OH - Dayton 2308 Sandridge Dr. (937) 297 - 8305 (DAP)': {'Test Code': '36127'}, 'OK - Oklahoma City 225 NE 97th Street (800)891-2917 (DLO)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': 'PATIENT PREPARATION:SPECIMEN COLLECTION AFTER FLUORESCEIN DYE ANGIOGRAPHY SHOULDBE DELAYED FOR AT LEAST 3 DAYS. FOR PATIENTS ONHEMODIALYSIS, SPECIMEN COLLECTION SHOULD BE DELAYED FOR 2WEEKS. ACCORDING TO THE ASSAY MANUFACTURER SIEMENS:"SAMPLES CONTAINING FLUORESCEIN CAN PRODUCE FALSELYDEPRESSED VALUES WHEN TESTED WITH THE ADVIA CENTAUR TSH3ULTRA ASSAY."1 ML SERUMINSTRUCTIONS:THIS ASSAY SHOULD ONLY BE ORDERED ON PATIENTS 1 YEAR OF AGEOR OLDER. 
ORDERS ON PATIENTS YOUNGER THAN 1 YEAR WILL HAVEA TSH ONLY PERFORMED.', 'Minimum Volume ': '0.7 ML', 'Transport Container ': 'SERUM SEPARATOR TUBE (SST)', 'Transport Temperature ': 'ROOM TEMPERATURE'}, 'OR - Portland 6600 SW Hampton Street (800)222-7941 (SEA)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'PA - Erie 1526 Peach St (814)461-2400 (QER)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': 'Preferred Specimen Volume: 1.0 mlSpecimen Type SERUMSpecimen State Room temperaturePatient preparation: Specimen collection after fluoresceindye angiography should be delayed for at least 3 days. Forpatients on hemodialysis, specimen collection should bedelayed for 2 weeks. According to the assay manufacturerSiemens: Samples containing fluorescein can produce falselydepressed values when tested with the ADVIA Centaur TSH3Ultra Assay.STABILITYSerum:Room temperature: 7 daysRefrigerated: 7 daysFrozen: 28 days', 'Minimum Volume ': '0.7 ml', 'Transport Container ': 'Serum Separator'}, 'PA - Horsham 900 Business Center Dr (866)697-8378 (PHP)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'PA - Pittsburgh 875 Greentree Rd (866)697-8378 (QPT)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'TN - Knoxville, 501 19th St, Trustee Towers – Ste 300 & 301 (866)MY-QUEST (SKB)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 
'Transport Temperature ': 'Room temperature'}, 'TN - Nashville 525 Mainstream Dr (866)697-8378 (SKB)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'TX - Houston 5850 Rogerdale Road (866)697-8378 (DAL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'TX - Irving 4770 Regent Blvd (866)697-8378 (DAL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'VA - Chantilly 14225 Newbrook Dr (703)802-6900 (AMD)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Temperature ': 'Room temperature'}, 'WA - Seattle 1737 Airport Way S (866)697-8378 (SEA)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}}
Specifically, for the "WA - Seattle 1737 Airport Way S (866)697-8378 (SEA)" laboratory:
print(_lab_regions["WA - Seattle 1737 Airport Way S (866)697-8378 (SEA)"])
Output:
{'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}
I have the following list of strings:
data = ['1 General Electric (GE) 24581660 $18.19 0.04 0.22 ',
'2 Qudian ADR (QD) 24227349 12.22 -3.93 -24.33 ',
'3 Square Cl A (SQ) 16233308 48.86 0.05 0.10 ',
'4 Teva Pharmaceutical Industries ADR (TEVA) 15830425 13.70 0.22 1.63 ',
'5 Vale ADR (VALE) 14768221 10.98 0.21 1.95 ',
'6 Bank of America (BAC) 13938799 26.59 -0.07 -0.26 ',
'7 Entercom Communications Cl A (ETM) 13087209 12.00 0.10 0.84 ',
'8 Chesapeake Energy (CHK) 12948648 3.92 -0.05 -1.26 ',
"9 Macy's (M) 12684478 21.07 0.44 2.13 "]
Where the format of every string is: count, stock name, volume, some more int values...
I need to split these strings into a list where each element is one of the items in the string format above, and this is how I attempted to do that:
for i in range(1, len(data)-1):
    split = data[i].split()
    temp = "{} {} {}".format(split[1], split[2], split[3])
    del split[2:4]
    split[1] = temp
    print(split)
However, I believe this is inefficient and it doesn't work when the name is more or less than two words. How would I handle this? Would I have to adjust how I generate the list of strings (data) in the first place?
EDIT:
final_data = [
    re.split('(?<=\))\s+|(?<=[\d\$-])\s(?=[\d\$-])|(?<=\d)\s(?=[a-zA-Z])', i)
    for i in data[1]]
final_data = [i[:-1]+[i[-1][:-1]] for i in final_data]
print(final_data)
Output:
~/workspace $ python extract.py 2017-11-27-04-26-51-ss.xhtml
[[''],
[''],
[''],
...,
[''],
[''],
['']]
You can use re.split:
import re
data = ['1 General Electric (GE) 24581660 $18.19 0.04 0.22 ', '2 Qudian ADR (QD) 24227349 12.22 -3.93 -24.33 ', '3 Square Cl A (SQ) 16233308 48.86 0.05 0.10 ', '4 Teva Pharmaceutical Industries ADR (TEVA) 15830425 13.70 0.22 1.63 ', '5 Vale ADR (VALE) 14768221 10.98 0.21 1.95 ', '6 Bank of America (BAC) 13938799 26.59 -0.07 -0.26 ', '7 Entercom Communications Cl A (ETM) 13087209 12.00 0.10 0.84 ', '8 Chesapeake Energy (CHK) 12948648 3.92 -0.05 -1.26 ', "9 Macy's (M) 12684478 21.07 0.44 2.13 "]
final_data = [re.split('(?<=[a-zA-Z])\s+(?=\()|(?<=\))\s+|(?<=[\d\$-])\s+(?=[\d\$-])|(?<=\d)\s+(?=[a-zA-Z])', i) for i in data]
Output:
[['1', 'General Electric', '(GE)', '24581660', '$18.19', '0.04', '0.22 '], ['2', 'Qudian ADR', '(QD)', '24227349', '12.22', '-3.93', '-24.33 '], ['3', 'Square Cl A', '(SQ)', '16233308', '48.86', '0.05', '0.10 '], ['4', 'Teva Pharmaceutical Industries ADR', '(TEVA)', '15830425', '13.70', '0.22', '1.63 '], ['5', 'Vale ADR', '(VALE)', '14768221', '10.98', '0.21', '1.95 '], ['6', 'Bank of America', '(BAC)', '13938799', '26.59', '-0.07', '-0.26 '], ['7', 'Entercom Communications Cl A', '(ETM)', '13087209', '12.00', '0.10', '0.84 '], ['8', 'Chesapeake Energy', '(CHK)', '12948648', '3.92', '-0.05', '-1.26 '], ['9', "Macy's", '(M)', '12684478', '21.07', '0.44', '2.13 ']]
With the parenthesis removed:
final_data = [[b[1:-1] if b.startswith('(') and b.endswith(')') else b for b in i] for i in final_data]
Output:
[['1', 'General Electric', 'GE', '24581660', '$18.19', '0.04', '0.22 '], ['2', 'Qudian ADR', 'QD', '24227349', '12.22', '-3.93', '-24.33 '], ['3', 'Square Cl A', 'SQ', '16233308', '48.86', '0.05', '0.10 '], ['4', 'Teva Pharmaceutical Industries ADR', 'TEVA', '15830425', '13.70', '0.22', '1.63 '], ['5', 'Vale ADR', 'VALE', '14768221', '10.98', '0.21', '1.95 '], ['6', 'Bank of America', 'BAC', '13938799', '26.59', '-0.07', '-0.26 '], ['7', 'Entercom Communications Cl A', 'ETM', '13087209', '12.00', '0.10', '0.84 '], ['8', 'Chesapeake Energy', 'CHK', '12948648', '3.92', '-0.05', '-1.26 '], ['9', "Macy's", 'M', '12684478', '21.07', '0.44', '2.13 ']]
You can split the strings on a character.
All of the strings in your original data list have two sections: the stock name, then the numeric values. If you split on the closing parenthesis, you break each string into the stock name and a string containing the numbers; the numbers are separated by a consistent single space, so you can then split that part on the space character.
https://docs.python.org/3/library/stdtypes.html#str.split
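A sketch of that approach, using str.partition on the closing parenthesis instead of a regex (tested only against lines shaped like the sample data):

```python
data = ['1 General Electric (GE) 24581660 $18.19 0.04 0.22 ',
        "9 Macy's (M) 12684478 21.07 0.44 2.13 "]

rows = []
for line in data:
    head, _, numbers = line.partition(') ')   # split once, right after the ticker
    count, _, rest = head.partition(' ')      # peel off the leading rank number
    name, _, ticker = rest.rpartition(' (')   # separate name from "(TICKER"
    rows.append([count, name, ticker] + numbers.split())

print(rows)
# [['1', 'General Electric', 'GE', '24581660', '$18.19', '0.04', '0.22'],
#  ['9', "Macy's", 'M', '12684478', '21.07', '0.44', '2.13']]
```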
I am new to Python and looking into scraping HTML using the beautifulsoup library.
I need to fetch the date field value as day and date, and the precip field value along with its measuring unit.
Python code
dates = []
Precip = []
for row in right_table.findAll("tr"):
    cells = row.findAll('td')
    th_cells = row.findAll('th')  # to store second column data
    if len(cells) == 5:
        Precip.append(cells[1].find(text=True))
        dates.append(th_cells[0].find(text=True))
print(dates)
print(Precip)
Code Output
['Wed ', 'Thu ', 'Fri ', 'Sat ', 'Sun ', 'Mon ', 'Tue ', 'Wed ', 'Thu ', 'Fri ', 'Sat ', 'Sun ', 'Mon ', 'Tue ', 'Wed ', 'Thu ', 'Fri ', 'Sat ', 'Sun ', 'Mon ', 'Tue ', 'Wed ', 'Thu ', 'Fri ', 'Sat ', 'Sun ', 'Mon ', 'Tue ', 'Wed ', 'Thu ']
['0 ', '0 ', '0 ', '1 ', '3 ', '3 ', '13 ', '0 ', '0 ', '0 ', '0 ', '0 ', '\xa0', '1 ', '3 ', '0 ', '1 ', '4 ', '2 ', '9 ', '2 ', '0 ', '1 ', '0 ', '0 ', '0 ', '0 ', '0 ', '1 ', '2 ']
Required Output
['Wed 11/1','Thur 11/2'.......]
['0mm','0mm'....]
Below is the HTML which i am trying to parse
HTML
<class 'list'>: ['\n', <thead>
<tr>
<th>Date</th>
<th>Hi/Lo</th>
<th>Precip</th>
<th>Snow</th>
<th>Forecast</th>
<th>Avg. HI / LO</th>
</tr>
</thead>, '\n', <tbody>
<tr class="pre">
<th scope="row">Wed <time>11/1</time></th>
<td>25°/20°</td>
<td>0 <span class="small">mm</span></td>
<td>0 <span class="small">CM</span></td>
<td> </td>
<td>28°/18°</td>
</tr>
<tr class="pre">
<th scope="row">Thu <time>11/2</time></th>
<td>28°/19°</td>
<td>0 <span class="small">mm</span></td>
<td>0 <span class="small">CM</span></td>
<td> </td>
<td>27°/18°</td>
</tr>
I'd use .text instead of .find(text=True). What's currently happening is that you're not fetching the content of the subtags, like <time>.
from bs4 import BeautifulSoup
import requests
html = requests.get("https://www.accuweather.com/en/in/bengaluru/204108/month/204108?view=table").text
soup = BeautifulSoup(html, 'html.parser')
right_table = soup.find("tbody")
dates = []
Precip = []
for row in right_table.findAll("tr"):
    cells = row.findAll('td')
    th_cells = row.findAll('th')  # to store second column data
    if len(cells) == 5:
        Precip.append(cells[1].text)
        dates.append(th_cells[0].text)
print(dates)
print(Precip)
This gets the correct output:
['Wed 11/1', 'Thu 11/2', 'Fri 11/3', 'Sat 11/4', 'Sun 11/5', 'Mon 11/6', 'Tue 11/7', 'Wed 11/8', 'Thu 11/9', 'Fri 11/10', 'Sat 11/11', 'Sun 11/12', 'Mon 11/13', 'Tue 11/14', 'Wed 11/15', 'Thu 11/16', 'Fri 11/17', 'Sat 11/18', 'Sun 11/19', 'Mon 11/20', 'Tue 11/21', 'Wed 11/22', 'Thu 11/23', 'Fri 11/24', 'Sat 11/25', 'Sun 11/26', 'Mon 11/27', 'Tue 11/28', 'Wed 11/29', 'Thu 11/30']
['0 mm', '0 mm', '0 mm', '1 mm', '3 mm', '3 mm', '13 mm', '0 mm', '0 mm', '0 mm', '0 mm', '0 mm', '\xa0', '1 mm', '3 mm', '0 mm', '1 mm', '4 mm', '2 mm', '9 mm', '2 mm', '0 mm', '1 mm', '0 mm', '0 mm', '0 mm', '0 mm', '0 mm', '1 mm', '2 mm']
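The required output glues value and unit together ('0mm'), and the blank cell in the table comes through as a non-breaking space ('\xa0'). A small, hedged clean-up step on top of the answer's list could look like this (the None placeholder for missing cells is my choice, not the asker's stated requirement):

```python
precip_raw = ['0 mm', '1 mm', '\xa0', '13 mm']

# str.split() with no argument splits on any whitespace, including
# the non-breaking space, so ''.join(p.split()) collapses '0 mm' to
# '0mm' and turns the placeholder cell into '', which we map to None.
precip = [''.join(p.split()) or None for p in precip_raw]
print(precip)  # ['0mm', '1mm', None, '13mm']
```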