Is there a way to separate strings in HTML? - python

I'm trying to get the address of some companies from WSJ.com. However, I couldn't figure out a reliable way to separate the city and the state/province from the HTML page.
Here's my code and output:
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent  # assumption: `ua` below is a fake_useragent UserAgent
ua = UserAgent()
code = "TURN"
url = "https://www.wsj.com/market-data/quotes/{}".format(code)
headers = {'User-Agent':str(ua.random)}
page = requests.get(url, headers = headers)
page.encoding = page.apparent_encoding
pageText = page.text
soup = BeautifulSoup(pageText, 'html.parser')
address = soup.find('div', {"class" : "WSJTheme--contact--bDuH_KYx"}).contents[0]
print(address.contents[2])
Output: <span class="">Montclair New Jersey 07042</span>
I want to get a result like [Montclair, New Jersey]. However, I can't simply split the string on spaces, since there are inputs like "San Diego California 92130" or "Beijing Beijing 100022" which require different rules to separate them.
They are separate strings in the original HTML; I'm not sure if this helps.
<span class="">
"Montclair"
"New Jersey"
"07042"
</span>

I would suggest grabbing the zip code and then using a library like: https://pypi.org/project/zipcodes/
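If adding a dependency is not an option, one possible alternative is to peel the postal code off the end and then match the longest trailing run of words against a set of known state or province names. The sketch below assumes the input always has the form "City Region PostalCode". `KNOWN_REGIONS` and `split_address` are hypothetical names, and the set is deliberately tiny, so it would need to be filled with the regions you actually expect.

```python
import re

# Hypothetical, deliberately incomplete lookup set; extend it as needed.
KNOWN_REGIONS = {"New Jersey", "California", "Beijing", "New York"}


def split_address(text):
    """Split 'City Region PostalCode' into (city, region, postal_code).

    Returns None when no known region name is found.
    """
    match = re.match(r"^(.*\S)\s+(\S+)$", text.strip())
    if not match:
        return None
    rest, postal = match.groups()
    words = rest.split()
    # Try the longest candidate region first, always leaving
    # at least one word for the city.
    for size in range(len(words) - 1, 0, -1):
        region = " ".join(words[-size:])
        if region in KNOWN_REGIONS:
            return " ".join(words[:-size]), region, postal
    return None


for sample in ("Montclair New Jersey 07042",
               "San Diego California 92130",
               "Beijing Beijing 100022"):
    print(split_address(sample))
```

The lookup set is the weak point: any region missing from it makes the function return None, which is at least easy to detect and log.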

If the HTML really looks the way you showed it (with literal quote characters in the text), you can simply split at the quotes.
a = address.contents[2].text
b = a.split('"', 4)
city = b[1]
state = b[3]
print(f"{city}, {state}")
output: Montclair, New Jersey
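The quotes shown in the question are most likely how the browser's developer tools display separate text nodes rather than characters in the page, in which case splitting on `"` would find nothing to split. Assuming the three values really are separate text nodes, BeautifulSoup's `stripped_strings` should hand them back one by one. The sample markup below, with comment nodes between the values, is only a guess at what the page serves.

```python
from bs4 import BeautifulSoup

# Assumed markup: comment nodes keep the three values in separate text nodes.
html = '<span class="">Montclair<!-- --> <!-- -->New Jersey<!-- --> <!-- -->07042</span>'

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span")

# stripped_strings yields each non-empty text node with surrounding
# whitespace removed; comment nodes are not included.
parts = list(span.stripped_strings)
city, state = parts[0], parts[1]
print([city, state])
```

With the code from the question, the equivalent call would be `list(address.contents[2].stripped_strings)`.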

Related

Any easy way to extract details from an HTML webpage?

I am trying to extract the following address from the 10-Q on this webpage and need help getting it to work: https://www.sec.gov/ix?doc=/Archives/edgar/data/1318605/000095017022012936/tsla-20220630.htm
1 Tesla Road
Austin, Texas
URL = f'https://www.sec.gov/ix?doc=/Archives/edgar/data/{cik}/{accessionNumber}/{primaryDocument}'
response = requests.get(URL, headers = headers)
soup = BeautifulSoup(response.content, "html.parser")
soup.find_all('dei:EntityAddressAddressLine1')
Where:
cik = 0001318605
accessionNumber = 000095017022012936
primaryDocument = tsla-20220630.htm
Unfortunately, because I am running this on Databricks, using Selenium isn't an immediate solution I can take. However, it does look like this method works!
r = requests.get(f'https://www.sec.gov/Archives/edgar/data/{cik}/{accessionNumber.replace("-", "")}/{accessionNumber}.txt', headers=headers)
raw_10k = r.text
city = raw_10k.split('Entity Address, City or Town</a></td>\n<td class="text">')[1].split('<span></span>')[0]
print(city)
As you have already realized, the data is added from the https://www.sec.gov/Archives.... site, and you would need something like selenium to get it from the https://www.sec.gov/ix?doc=/Archives.... site.
[The URL I used was https://www.sec.gov/Archives/edgar/data/1318605/000095017022012936/tsla-20220630.htm and I just copied the cookies and headers from my own browser to pass into the request. I tried to open the link in your answer, but I got a NoSuchKey error...]
If you've managed to fetch the HTML containing the 10-Q form, I think the simplest way to extract the address is with CSS selectors:
[s.text for s in soup.select('td *[name^="dei:EntityAddress"]')]
will return ['1 Tesla Road', 'Austin', 'Texas', '78725'] and so, with
print(', '.join([
    s.get_text(strip=True) for s in
    soup.select('p>span *[name^="dei:EntityAddress"]')
    if 'ZipCode' not in s.get('name')  # excludes zipcode
]))
1 Tesla Road, Austin, Texas will be printed.
You can also use
addrsCell = soup.find(attrs={'name':'dei:EntityAddressAddressLine1'})
if addrsCell and addrsCell.find_parent('td'):  # is not None
    print(' '.join([
        s.text for s in addrsCell.find_parent('td').select('p')]))
to get 1 Tesla Road Austin, Texas, which is exactly as you formatted it in your question.
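To try the attribute-prefix selector without hitting sec.gov, here is a small self-contained sketch. The markup is a simplified, made-up stand-in for the inline XBRL in the filing (the real document is far larger and may nest things differently); only the `dei:EntityAddress...` attribute values follow the names used above.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical stand-in for the address cell of the filing.
html = """
<td>
  <p><span><ix:nonnumeric name="dei:EntityAddressAddressLine1">1 Tesla Road</ix:nonnumeric></span></p>
  <p><span><ix:nonnumeric name="dei:EntityAddressCityOrTown">Austin</ix:nonnumeric>,
  <ix:nonnumeric name="dei:EntityAddressStateOrProvince">Texas</ix:nonnumeric>
  <ix:nonnumeric name="dei:EntityAddressPostalZipCode">78725</ix:nonnumeric></span></p>
</td>
"""

PREFIX = "dei:EntityAddress"
soup = BeautifulSoup(html, "html.parser")

# [name^="..."] matches every element whose name attribute starts with PREFIX.
address = {
    tag["name"][len(PREFIX):]: tag.get_text(strip=True)
    for tag in soup.select('[name^="dei:EntityAddress"]')
}
print(address)
```

Keying the result by the suffix of the attribute makes it easy to pick individual parts, for example `address["CityOrTown"]`, instead of relying on their order in the page.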

BeautifulSoup: Extracting text from nested tags

Long time lurker, first time poster. I spent some time looking over related questions but I still couldn't seem to figure this out. I think it's easy enough but please forgive me, I'm still a bit of a BeautifulSoup/python n00b.
I have a text file of URLs I parsed from a previous webscraping exercise that I'd like to search through and extract the text contents of a list item (<li>) based on a given keyword. I want to save a csv file of the URL as one column and the corresponding contents from the list item in the second column. In this case, it's albums that I'd like to create a table of by who mastered the album, produced the album, etc.
Given a snippet of html:
from https://www.discogs.com/release/7896531-The-Rolling-Stones-Some-Girls
...
<li>
<span class="entity_1XpR8">Recorded By</span>
" – "
EMI Studios, Paris
</li>
<li>
<span class="entity_1XpR8">Mastered At</span>
" – "
Sterling Sound
</li>
etc etc etc
...
My code so far is something like:
import requests
import pandas as pd
from bs4 import BeautifulSoup
results = []
kw = "Mastered At"
with open("urls.txt") as file:
    for line in file:
        url = line.rstrip()
        source = requests.get(url).text
        soup = BeautifulSoup(source, "html.parser")
        x = soup.find_all('span', string='Mastered At')
        results.append((url, x))
print(results)
df = pd.DataFrame(results)
df.to_csv('mylist1.csv')
With some modifications based on the comments below, I'm still having issues.
As you can see, I'm trying to do this within a for loop for each link in a list.
The URL list is a simple text file with one URL per line. Since I'm scraping only one website, the sources, class names, etc. should be the same, but the dish will change from page to page.
ex URL list:
https://www.discogs.com/release/7896531-The-Rolling-Stones-Some-Girls
https://www.discogs.com/release/3872976-Pink-Floyd-The-Wall
... etc etc etc
updated code snippet:
import requests
import pandas as pd
from bs4 import BeautifulSoup
results = []
with open("urls.txt") as file:
    for line in file:
        url = line.rstrip()
        print(url)
        source = requests.get(url).text
        soup = BeautifulSoup(source, "html.parser")
        for x in [x for x in soup.select('li') if x.select_one('span.spClass').text.strip() == 'Mastered At']:
            results.append((x.select_one('a.linkClass').get('href'), x.select_one('a.linkClass').text.strip(),
                            x.select_one('span.spClass').text.strip()))
df = pd.DataFrame(results, columns=['Url', 'Mastered At', 'Studio'])
print(df)
df.to_csv('studios.csv')
I'm hoping the output in this case is Col 1: (url from txt file); Col 2: "Mastered At – Sterling Sound" (or just "Sterling Sound"), for each page in the list, because these items vary from page to page. I will change the keyword to extract different list items accordingly. In the end I'd like one big spreadsheet with the full list of URLs and the corresponding items side by side, something like below:
example:
album url | Sterling Sound
album url | Abbey Road
album url | Abbey Road
album url | Sterling Sound
album url | Real World Studios
album url | EMI Studios, Paris
album url | Sterling Sound
etc etc etc
Thanks for your help!
Cheers.
The Beautiful Soup library is best suited for this task.
You can use the following code to extract data:
import requests, lxml
from bs4 import BeautifulSoup
# urls.html would be better
with open("urls.txt") as file:
    src = file.read()
soup = BeautifulSoup(src, 'lxml')
for first, second in zip(soup.select("li span"), soup.select("li a")):
    print(first)
    print(second)
To find the desired elements, you can use the bs4 select() method. It accepts a CSS selector and returns a list of all matching HTML elements.
In this case, I use the built-in zip() function, which lets you iterate over two sequences at once in a single loop.
Then you can use the data for your tasks.
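One thing worth knowing about zip() before relying on it for scraped data: it stops at the shorter input. A minimal sketch with made-up lists standing in for the two select() results:

```python
# Hypothetical stand-ins for the results of the two select() calls.
labels = ["Recorded By", "Mastered At"]
values = ["EMI Studios, Paris", "Sterling Sound", "unpaired extra value"]

# zip() pairs items by position and stops at the shorter list,
# so the third value is silently dropped.
pairs = list(zip(labels, values))
print(pairs)
```

If a list item on the page has a span but no link (or the other way round), every pair after it will be misaligned, so pairing by position is only safe when both selectors are guaranteed to match the same number of elements.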
BeautifulSoup can use different parsers for html. If you have issues with lxml you can try others, like html.parser. You can try the following code, it will create a dataframe from your data, which can then be further saved to csv or other formats:
from bs4 import BeautifulSoup
import pandas as pd
html = '''
<li>
<span class = "spClass">Breakfast</span> " – "
<a class="linkClass" href="/examplepage/Pancakes">Pancakes</a>
</li>
<li>
<span class = "spClass">Lunch</span> " – "
<a class="linkClass" href="/examplepage/Sandwiches">Sandwiches</a>
</li>
<li>
<span class = "spClass">Dinner</span> " – "
<a class="linkClass" href="/examplepage/Stew">Stew</a>
</li>
'''
soup = BeautifulSoup(html, 'html.parser')
df_list = []
for x in soup.select('li'):
    df_list.append((x.select_one('a.linkClass').get('href'), x.select_one('a.linkClass').text.strip(), x.select_one('span.spClass').text.strip()))
df = pd.DataFrame(df_list, columns=['Url', 'Food', 'Type'])
print(df) ## you can save the dataframe as csv like so: df.to_csv('foods.csv')
Result:
                       Url        Food       Type
0    /examplepage/Pancakes    Pancakes  Breakfast
1  /examplepage/Sandwiches  Sandwiches      Lunch
2        /examplepage/Stew        Stew     Dinner
EDIT: If you only want to extract specific li tags, as per your comment, you can do:
soup = BeautifulSoup(html, 'html.parser')
df_list = []
for x in [x for x in soup.select('li') if x.select_one('span.spClass').text.strip() == 'Dinner']:
    df_list.append((x.select_one('a.linkClass').get('href'), x.select_one('a.linkClass').text.strip(), x.select_one('span.spClass').text.strip()))
df = pd.DataFrame(df_list, columns=['Url', 'Food', 'Type'])
And this will return:
                 Url  Food    Type
0  /examplepage/Stew  Stew  Dinner
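For the markup in the question, where the value may be plain text after the span rather than a link, one option is to find the span by its label and read the text of the enclosing list item. This is only a sketch: the inline HTML mirrors the snippet from the question, and `credit_for` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Inline sample modelled on the snippet in the question.
html = """
<ul>
<li><span class="entity_1XpR8">Recorded By</span> – EMI Studios, Paris</li>
<li><span class="entity_1XpR8">Mastered At</span> – Sterling Sound</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")


def credit_for(soup, keyword):
    """Return the text that follows the <span> labelled `keyword`, or None."""
    label = soup.find("span", string=keyword)
    if label is None:
        return None
    # Text of the whole <li>, with the label and the dash separator removed.
    full_text = label.parent.get_text(" ", strip=True)
    return full_text[len(keyword):].lstrip(" -\u2013\u2014").strip()


print(credit_for(soup, "Mastered At"))
print(credit_for(soup, "Recorded By"))
```

Inside the URL loop, each row would then be `(url, credit_for(soup, kw))`, which keeps the URL from the text file in the first column and yields None for pages that lack the keyword.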

find_all on span tag in Beautiful Soup yields AttributeError: ResultSet object has no attribute 'get_text'

Warning: this is only my second attempt at Python code so I may be making errors that will cause distress to a professional:
I'd like to get a list of cities using 'addressLocality' from the set of results in soup_r:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.tjhughes.co.uk/map'
page = requests.get(URL, verify=False)
soup_r = BeautifulSoup(page.text, 'html.parser')
This is the type of result I'd like, with just the name of the city (in this case, Bradford):
single_span = soup_r.find('span',itemprop = 'addressLocality').get_text()
I'd like to be able to return the full list of results in the same format as single_span (i.e. by isolating the city name), but the following code gives me the error "AttributeError: ResultSet object has no attribute 'get_text'":
spans_fail = soup_r.find_all('span',itemprop = 'addressLocality').get_text()
The nearest I can get is by dropping the get_text():
spans = soup_r.find_all('span',itemprop = 'addressLocality')
...thus returning the results in one bundle:
[<span itemprop="addressLocality">Bradford</span>, <span itemprop="addressLocality">Birkenhead</span>, <span itemprop="addressLocality">Bootle</span>, <span itemprop="addressLocality">Bury</span>,
...
<span itemprop="addressLocality">Sheffield</span>, <span itemprop="addressLocality">St Helens</span>, <span itemprop="addressLocality">Widnes</span>]
Assuming this is the best I can do, I still get tied in knots when I try to re-arrange the results.
For instance this just returns Bradford 52 times which baffles me because there are only 26 cities in the original list so I don't know how I'm doubling up, let alone how to access the other 25:
cities = []
for check in soup:
    check = soup.find('span',itemprop = 'addressLocality').text
    cities.append(check)
I was looking for an elegantly simple solution, and I appreciate that I might need a workaround, but I can't see how else to approach this and so any input is appreciated.
You can use a list comprehension to obtain your list of cities.
For example:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.tjhughes.co.uk/map'
page = requests.get(URL, verify=False)
soup_r = BeautifulSoup(page.text, 'html.parser')
cities = [span.get_text() for span in soup_r.select('span[itemprop="addressLocality"]')]
print(cities)
Prints:
['Bradford', 'Birkenhead', 'Bootle', 'Bury', 'Chelmsford', 'Chesterfield', 'Glasgow', 'Cumbernauld', 'London', 'Coventry', 'Dundee', 'Durham', 'East Kilbride', 'Glasgow', 'Harlow', 'Hartlepool', 'Liverpool', 'Maidstone', 'Middlesbrough', 'Newcastle upon Tyne', 'Nuneaton', 'Oldham', 'Preston', 'Sheffield', 'St Helens', 'Widnes']
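The underlying difference is that find() returns a single element, while find_all() returns a list-like ResultSet, so get_text() has to be called on each item rather than on the collection. A small offline sketch, using inline HTML in place of the live page:

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the store-locator page.
html = """
<span itemprop="addressLocality">Bradford</span>
<span itemprop="addressLocality">Birkenhead</span>
<span itemprop="addressLocality">Bootle</span>
"""
soup = BeautifulSoup(html, "html.parser")

# find() gives back one Tag: the first match only.
first = soup.find("span", itemprop="addressLocality")
print(first.get_text())

# find_all() gives back a list-like ResultSet of Tags, so call
# get_text() on each element, not on the ResultSet itself.
spans = soup.find_all("span", itemprop="addressLocality")
cities = [span.get_text(strip=True) for span in spans]
print(cities)
```

This also explains the repeated "Bradford" in the loop from the question: the loop body calls find() every time, and find() always returns the first match regardless of the loop variable.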
When you get down to a list of single elements sometimes you have to do string chopping.
spans = soup_r.find_all('span',itemprop = 'addressLocality')
# [<span itemprop="addressLocality">Bradford</span>, <span
cities = []
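for span in spans:
    markup = str(span)  # Tag objects are not strings, so convert before slicing
    left_angle = markup.find('>') + 1  # index just past the opening tag
    sec_rangle = markup.find('<', 1)  # start of the closing tag
    city = markup[left_angle:sec_rangle]
    print(city)
    cities.append(city)
print(cities)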

BeautifulSoup Python .text method doesn't return proper text

I'm trying to scrape soccer results from a website. I get the results with the html and when I try to remove them with .text I get strange output. I use the parent method to get the parent HTML element for the whole score.
The scraper script:
response = requests.get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
results = html_soup.findAll('strong',text="East Wall Rovers")
chosen_team_results=[]
for result in results:
    chosen_team_results.append(result.parent.text)
print(chosen_team_results)
HTML:
<p class="zeta"><strong>
Killester Donnycarney FC</strong>
1
<strong>Cherry Orchard</strong>
2
</p>
<p class="zeta"><strong>
Ballymun United</strong>
2
<strong>Bluebell United</strong>
1
</p>
OUTPUT:
'\r\n\t\t\tValeview Shankill\r\n\t\t\t1\r\n\t\t\tEast Wall Rovers\r\n\t\t\t5\r\n\t\t\t\t\t\t', '\r\n\t\t\tMarks Celtic FC\r\n\t\t\t0\r\n\t\t\tEast Wall Rovers\r\n\t\t\t5\r\n\t\t\t\t\t\t', '\r\n\t\t\tBlessington FC\r\n\t\t\t0\r\n\t\t\tEast Wall Rovers\r\n\t\t\t5\r\n\t\t\t\t\t\t', '\r\n\t\t\tParkvale FC\r\n\t\t\t2\r\n\t\t\tEast Wall Rovers\r\n\t\t\t1\r\n\t\t\t\t\t\t', '\r\n\t\t\tBoyne Rovers\r\n\t\t\t1\r\n\t\t\tEast Wall Rovers\r\n\t\t\t1\r\n\t\t\t\t\t\t'
I expect the results to be in plain text just the teams and the points.
To get rid of the blank space, I recommend you do something like this:
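for result in results:
    # split() with no arguments breaks on any run of whitespace;
    # joining with a single space keeps the words apart
    chosen_team_results.append(' '.join(result.parent.text.split()))
print(chosen_team_results)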
You can add a .strip() method to the string/text so it displays only the text without the \r\n\t\t\t (linebreaks etc.)
for example:
songs = soup.find_all(name="h3", id="title-of-a-story")
songs_text = [song.getText().strip() for song in songs ]
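The two suggestions behave differently on the strings from the question, which a quick check on one of them shows. The variable names here are illustrative only.

```python
# One of the raw strings from the question's output.
raw = "\r\n\t\t\tValeview Shankill\r\n\t\t\t1\r\n\t\t\tEast Wall Rovers\r\n\t\t\t5\r\n\t\t\t\t\t\t"

# strip() only trims the ends; the line breaks and tabs
# between the teams and scores are still there.
stripped = raw.strip()

# split() + join collapses every run of whitespace into one space.
collapsed = " ".join(raw.split())

# Splitting by line keeps team names and scores as separate fields.
fields = [line.strip() for line in raw.splitlines() if line.strip()]

print(repr(stripped))
print(collapsed)
print(fields)
```

Keeping the fields separate is probably the most useful form if the scores are going to be compared or stored, since multi-word team names stay intact.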

Why can't I use a string when concatenating a URL for BeautifulSoup?

question updated, see below
I'm trying to scrape city and state given zip code. Here's code that works:
r = requests.get("http://www.city-data.com/zips/11021.html")
data = r.text
soup = BeautifulSoup(data)
main_body = soup.find(id="main_body").findAll('a')[5].string
print(main_body)
I get the following, correct string:
Great Neck Plaza, NY
The following code does not work (it prints the wrong string):
zipCode = str(10023)
url = "http://www.city-data.com/zips/" + zipCode + ".html"
print(url)
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
main_body = soup.find(id="main_body").findAll('a')[5].string
print(main_body)
Here's the wrong string:
Recent home sales, real estate maps, and home value estimator for zip code 10023
Why can't I use a string for the zip code? What else can I do, as I'm trying to write a function to look up city and state?
UPDATE
Per some suggestions, I'm now searching for the text immediately prior to the tag I want. Here is the text I'm searching for, followed by the info I actually want:
<b>City:</b>
New York, NY
Here is the code I'm now trying:
zipCode = str(11021)
url = "http://www.city-data.com/zips/" + zipCode + ".html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
main_body = soup.findAll(text="City:")
print(main_body)
All I get, however, are empty brackets. How do I search for the City: text and then get the string for the next tag?
Your code is working, but the premise of your solution is not correct. Your code (findAll('a')[5]) assumes that the data you're after is going to be in the same place for every zip code page. However, if you look at the pages for zips 11021 and 10023, you'll see that they don't have the same number of hyperlinks. You need to find another way to locate the data than simply grabbing index 5 of the array of hyperlinks on the page.
here's code that worked for me:
zipCode = str("07928")
url = "http://www.city-data.com/zips/" + zipCode + ".html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
if soup.findAll(text="City:") ==[]:
    # some pages use the plural label instead
    cityNeeded = soup.findAll(text="Cities:")
    for t in cityNeeded:
        print(t.find_next('a').string)
else:
    cityNeeded = soup.findAll(text="City:")
    for t in cityNeeded:
        print(t.find_next('a').string)
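The same label-based lookup can be wrapped in a function and tried offline. In this sketch the inline HTML is a made-up fragment shaped like the snippet in the question, the href values are placeholders, and `city_for` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up fragment; the href values are placeholders.
html = """
<div id="main_body">
  <a href="/sales">Recent home sales</a>
  <b>City:</b> <a href="/city/placeholder.html">New York, NY</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def city_for(soup):
    """Return the text of the first link after a 'City:' or 'Cities:' label."""
    for label_text in ("City:", "Cities:"):
        label = soup.find(string=label_text)
        if label is not None:
            link = label.find_next("a")
            return link.get_text(strip=True) if link else None
    return None


print(city_for(soup))
```

Because the lookup is anchored on the label rather than on the position of the link, it does not depend on how many other hyperlinks a given zip code page contains.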
