I'm trying to scrape two values from a webpage using BeautifulSoup. When printing only one value, the content looks good. However, when printing two values to the same line, HTML code is displayed around one of the values.
Here is my code:
from bs4 import BeautifulSoup
import urllib.request as urllib2

list_open = open("source.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")

i = 0
for url in line_in_list:
    soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html.parser')
    sku = soup.find_all(attrs={'class': "identifier"})
    description = soup.find_all(attrs={'class': "description"})
    for text in description:
        print((sku), text.getText())
    i += 1
And the output looks like this:
[<span class="identifier">112404</span>] A natural for...etc
[<span class="identifier">110027</span>] After what...etc
[<span class="identifier">03BA5730</span>] Argentina is know...etc
[<span class="identifier">090030</span>] To be carried...etc
The output should preferably be without the [<span class="identifier">...] wrapper around the numbers.
I guess the problem is in the last for-loop, but I have no idea how to correct it. All help is appreciated. Thanks! -Espen
It looks like you need to zip() identifiers and descriptions and call getText() for every tag found in the loop:
identifiers = soup.find_all(attrs={'class': "identifier"})
descriptions = soup.find_all(attrs={'class': "description"})
for identifier, description in zip(identifiers, descriptions):
    print(identifier.getText(), description.getText())
find_all() returns a ResultSet, which is more or less a fancy list. Printing a ResultSet will include the surrounding left and right square brackets that typically denote a list, and the items (tags) will be displayed within.
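To illustrate the difference, here is a minimal, self-contained sketch (the HTML snippet is made up for illustration):
from bs4 import BeautifulSoup

html = '<span class="identifier">112404</span>'
soup = BeautifulSoup(html, 'html.parser')

result_set = soup.find_all(attrs={'class': 'identifier'})
print(result_set)                # [<span class="identifier">112404</span>]
print(result_set[0].get_text())  # 112404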
Your sample output suggests that the HTML contains one SKU and one description per URL. If that is correct, your code could just pick off the first item in each ResultSet like this:
sku = soup.find_all(attrs={'class': "identifier"})
description = soup.find_all(attrs={'class': "description"})
print(sku[0].get_text(), description[0].get_text())
Or, you could just find the first of each using find():
sku = soup.find(attrs={'class': "identifier"})
description = soup.find(attrs={'class': "description"})
print(sku.get_text(), description.get_text())
However, your code suggests that there can be multiple descriptions for each SKU, because you are iterating over the description result set. Perhaps there can be multiple SKUs and descriptions per page (in which case see @alecxe's answer)? It's difficult to tell.
If you could update your question by adding live URLs or sample HTML we could offer better advice.
I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/roster.era?CID=102353 and I am able to do it for the names beginning with ANANDASABAPATHY which are contained inside a "p" tag:
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("p")
and then playing with the length of the element:
for bullet in column:
    if len(bullet.find_all("br")) == 4:
        person = {}
        person["NAME"] = bullet.contents[0].strip()
        person["PROFESSION"] = bullet.contents[2].strip()
        person["DEPARTMENT"] = bullet.contents[4].strip()
        person["INSTITUTION"] = bullet.contents[6].strip()
        person["LOCATION"] = bullet.contents[8].strip()
However, I have 2 issues.
I am unable to scrape the information for the chairperson (GUDJONSSON) which is not contained inside a "p" tag. I was trying something like:
soup.find("b").findNext('br').findNext('br').findNext('br').contents[0].strip()
but it is not working
I am unable to differentiate between the last 2 persons (WONDRAK and GERSCH) because they are both contained inside the same "p" tag.
Any help would be extremely useful! Thanks in advance!
This is a case where it may be easier to handle processing the data more as plain text than as HTML, after initially extracting the element you're looking for. The reason is that the HTML is not very well formatted for parsing / it doesn't follow a very uniform pattern. The html5lib package generally handles poorly formatted html better than html.parser, but it didn't help significantly in this case.
import re
from typing import Collection, Iterator

from bs4 import BeautifulSoup


def iter_lines(soup: BeautifulSoup, ignore: Collection[str] = ()) -> Iterator[str]:
    for sibling in soup.find('b').next_siblings:
        for block in sibling.stripped_strings:
            block_str = ' '.join(filter(None, (line.strip() for line in block.split('\n'))))
            if block_str and block_str not in ignore:
                yield block_str


def group_people(soup: BeautifulSoup, ignore: Collection[str] = ()) -> list[list[str]]:
    zip_code_pattern = re.compile(r', \d+$')
    people = []
    person = []
    for line in iter_lines(soup, ignore):
        person.append(line)
        if zip_code_pattern.search(line):
            people.append(person)
            person = []
    return people


def normalize_person(raw_person: list[str]) -> dict[str, str | None]:
    return {
        'NAME': raw_person[0],
        'PROFESSION': raw_person[1] if len(raw_person) > 4 else None,
        'DEPARTMENT': next((line for line in raw_person if 'DEPARTMENT' in line), None),
        'INSTITUTION': raw_person[-2],
        'LOCATION': raw_person[-1],
    }


raw_people = group_people(soup, ignore={'SCIENTIFIC REVIEW OFFICER'})
normalized = [normalize_person(person) for person in raw_people]
This works with both BeautifulSoup(content, 'html.parser') and BeautifulSoup(content, 'html5lib').
The iter_lines function finds the first <b> tag like you did before, and then yields a single string for each line that is displayed in a browser.
The group_people function groups the lines into separate people, using the zip code at the end to indicate that that person's entry is complete. It may be possible to combine this function with iter_lines and skip the regex, but this was slightly easier. Better formatted html would be more conducive to that approach.
The ignore parameter was used to skip the SCIENTIFIC REVIEW OFFICER header above the last person on that page.
Lastly, the normalize_person function attempts to interpret what each line for a given person means. The name, institution, and location appear to be fairly consistent, but I took some liberties with profession and department to use None when it seemed like a value did not exist. Those decisions were only made based on the particular page you linked to - you may need to adjust those for other pages. It uses negative indexes for the institution and location because the number of lines that existed for each person's data was variable.
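For completeness, here is a minimal sketch of how the soup passed to group_people might be built. It assumes the page can be fetched directly with requests; the question used Selenium's driver.page_source, which would work just as well:
import requests
from bs4 import BeautifulSoup

url = 'https://public.era.nih.gov/pubroster/roster.era?CID=102353'
content = requests.get(url).text
soup = BeautifulSoup(content, 'html.parser')

raw_people = group_people(soup, ignore={'SCIENTIFIC REVIEW OFFICER'})
normalized = [normalize_person(person) for person in raw_people]
print(normalized)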
final edit: so here's the solution -
list_c = [[x, y] for x, y in zip(titleList, linkList)]
Original post: I used bs4 to scrape a recipe website where the title of each recipe is not saved within the link tag. So I've extracted the titles of the recipes from one part of the code and extracted the links from the other part, and I've got these two lists (recipes, links), but I'm not sure of the best way to pair each title to its corresponding link.
(The end goal is to have the titles be hyperlinked in an HTML file that I will put on my eventual recipe aggregator website).
I was considering saving them to a dictionary as key value pairs, or something else(?), so that I can call them into the HTML doc later on.
Suggestions?
EDIT:
Here's the code; it works fine:
soup = BeautifulSoup(htmlText, 'lxml')
links = soup.find_all('article')
linkList = []
titleList = []
for link in links[0:12]:
    hyperL = link.find('header', class_='entry-header').a['href']
    linkList.append(hyperL)

for title in links:
    x = title.get('aria-label')
    titleList.append(x)
linkList prints out something like
['www.recipe.com/ham', 'www.recipe.com/curry', 'www.recipe.com/etc']
and
titleList is ['Ham', 'Curry', 'etc']
I want to print a list from these 2 like this:
[['Ham', 'www.recipe.com/ham'],['Curry', 'www.recipe.com/curry']]
The final goal for my website is to have the following for each pair:
<a href='www.recipe.com/ham'>Ham</a>
If you only anticipate looking up titles, and then using the result links, dictionaries are great for that.
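As a rough sketch of that approach, reusing titleList and linkList from the question (the hyperlink format shown is just one possible way to build the anchors for the HTML file):
# Pair each title with its corresponding link
recipes = dict(zip(titleList, linkList))
print(recipes['Ham'])  # www.recipe.com/ham

# Build the anchor tags for the HTML file
anchors = ['<a href="{}">{}</a>'.format(link, title) for title, link in recipes.items()]
print(anchors[0])  # <a href="www.recipe.com/ham">Ham</a>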
As mentioned above, I am trying to remove HTML from the printed output to get just the text plus my dividing | and -. The output includes span tags and other markup that I would like to remove. As this is part of a loop in the program, I cannot search for the individual text on each page, because it changes. The page architecture stays the same, which is why indexing the items in the list works consistently. I'm wondering what the easiest way to clean the output would be. Here is the code section:
infoLink = driver.find_element_by_xpath("//a[contains(@href, '?tmpl=component&detail=true&parcel=')]").click()
driver.switch_to.window(driver.window_handles[1])
aInfo = driver.current_url
data = requests.get(aInfo)
src = data.text
soup = BeautifulSoup(src, "html.parser")
parsed = soup.find_all("td")
for item in parsed:
    Original = parsed[21]
    Owner = parsed[13]
    Address = parsed[17]
    print(*Original, "|", *Owner, "-", *Address)
Example output is:
<span class="detail-text">123 Main St</span> | <span class="detail-text">Banner,Bruce</span> - <span class="detail-text">1313 Mockingbird Lane<br>Santa Monica, CA 90405</br></span>
Thank you!
To get the text between the tags, just use get_text(), but you should make sure there is always text between the tags to avoid errors:
for item in parsed:
    Original = parsed[21].get_text(strip=True)
    Owner = parsed[13].get_text(strip=True)
    Address = parsed[17].get_text(strip=True)
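With those variables now plain strings, the print from the question no longer needs the * unpacking; inside the loop it could simply be, for example:
    print(Original, "|", Owner, "-", Address)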
I wrote an algorithm recently that does something like this. It won't work if your target text has a < or a > in it, though.
def remove_html_tags(string):
    data = string.replace(string[string.find("<"):string.find(">") + 1], '').strip()
    if ">" in data or "<" in data:
        return remove_html_tags(data)
    else:
        return str(data)
It recursively removes the text between < and >, inclusive.
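For example, applied to one of the spans from the output above (assuming the function is given a plain string rather than a Tag object):
s = '<span class="detail-text">123 Main St</span>'
print(remove_html_tags(s))  # 123 Main St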
Let me know if this works!
I'm fairly new to web scraping in Python, and after reading most of the tutorials on the topic online I decided to give it a shot. I finally got one site working, but the output is not formatted properly.
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time
page = requests.get("https://leeweebrothers.com/our-food/lunch-boxes/#")
soup = BeautifulSoup(page.text, "html.parser")
for div in soup.find_all('h2'):  # prints the name of the food
    print(div.text)

for a in soup.find_all('span', {'class': 'amount'}):  # prints the price of the food
    print(a.text)
Output
I want the name of the food to be printed side by side with the corresponding price of the food, joined by a "-" ... Would appreciate any help given, thanks!
Edit: After @Reblochon Masque's comments below, I've run into another problem. As you can see, there is a $0.00 value which comes from the built-in shopping cart on the website. How would I exclude this as an outlier and continue moving down the loop, while ensuring that the other prices "move up" to correspond to the correct food?
Best practice is to use the zip() function in the for loop, but it can also be done this way. This is just to show that it can be done by indexing the two lists.
names = soup.find_all('h2')
rest = soup.find_all('span', {'class' : 'amount'})
for index in range(len(names)):
    print('{} - {}'.format(names[index].text, rest[index].text))
You could maybe zip the two results:
names = soup.find_all('h2')
rest = soup.find_all('span', {'class' : 'amount'})
for div, a in zip(names, rest):
    print('{} - {}'.format(div.text, a.text))
    # print(f"{div.text} - {a.text}")  # for Python >= 3.6
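Neither snippet deals with the $0.00 cart amount mentioned in the edit. One possible workaround, purely an assumption based on the description and not verified against the live page, is to filter out that entry before pairing (note this drops every $0.00 amount, not just the cart's):
names = soup.find_all('h2')
rest = [a for a in soup.find_all('span', {'class': 'amount'}) if a.text.strip() != '$0.00']
for div, a in zip(names, rest):
    print('{} - {}'.format(div.text, a.text))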
I am new to Beautiful Soup and I am trying to extract the information that appears on a page. This info is contained in the div class="_50f3", and depending on the user it can contain multiple pieces of info (studies, studied, works, worked, lives, etc.). So far, I have managed through the following code to parse the div classes, but I don't know how to extract the information I want from that.
table = soup.findAll('div', {'class': '_50f3'})
[<div class="_50f3">Lives in <a class="profileLink" data-hovercard="/ajax/hovercard/page.php?id=114148045261892" href="/Fort-Worth-Texas/114148045261892?ref=br_rs">Fort Worth, Texas</a></div>,
<div class="_50f3">From <a class="profileLink" data-hovercard="/ajax/hovercard/page.php?id=111762725508574" href="/Dallas-Texas/111762725508574?ref=br_rs">Dallas, Texas</a></div>]
For example, in the above I would like to store "Lives in" : "Fort Worth, Texas" and "From": "Dallas, Texas". But in the most general case I would like to store whatever info there is in there.
Any help greatly appreciated!
In the general case, get_text() is all you need - it constructs a single text string for the element by recursively going through the child nodes:
table = soup.find_all('div', {'class': '_50f3'})
print([item.get_text(strip=True) for item in table])
But, you can also extract the labels and values separately:
d = {}
for item in table:
    label = item.find(text=True)
    value = label.next_sibling
    d[label.strip()] = value.get_text()
print(d)
Prints:
{'From': 'Dallas, Texas', 'Lives in': 'Fort Worth, Texas'}
for i in range(len(table)):
    print(table[i].text)
This should work.