Beautifulsoup parsing html line breaks - python

I'm using BeautifulSoup to parse some HTML from a text file. The text is written to a dictionary like so:
websites = ["1"]
html_dict = {}

for website_id in websites:
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:
        get_raw_html = out.read().splitlines()
        html_dict.update({website_id: get_raw_html})
I then parse the HTML from html_dict to find text inside <p> tags:
scraped = {}

for website_id in html_dict.keys():
    scraped[website_id] = []
    raw_html = html_dict[website_id]
    for i in raw_html:
        soup = BeautifulSoup(i, 'html.parser')
        scrape_selected_tags = soup.find_all('p')
This is what the HTML in html_dict looks like:
<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>
The problem is, BeautifulSoup seems to be stopping at the line break and ignoring the second line. So when I print out scrape_selected_tags the output is...
<p>Hey, this should be scraped</p>
when I would expect the whole text.
How can I avoid this? I've tried splitting the lines in html_dict and it doesn't seem to work. Thanks in advance.

By calling splitlines when you read your HTML documents, you break the tags across a list of strings.
Instead, you should read the whole HTML file into a single string.
websites = ["1"]
html_dict = {}

for website_id in websites:
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:
        get_raw_html = out.read()
        html_dict.update({website_id: get_raw_html})
Then remove the inner for loop, so you won't iterate over that string.
scraped = {}

for website_id in html_dict.keys():
    scraped[website_id] = []
    raw_html = html_dict[website_id]
    soup = BeautifulSoup(raw_html, 'html.parser')
    scrape_selected_tags = soup.find_all('p')
BeautifulSoup can handle newlines inside tags; let me give you an example:
html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('p'))
[<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>]
But if you split one tag across multiple BeautifulSoup objects:
html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''
for line in html.splitlines():
    soup = BeautifulSoup(line, 'html.parser')
    print(soup.find_all('p'))
[<p>Hey, this should be scraped</p>]
[]
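Once the whole file is parsed as one string, the text inside each tag (newline included) can be pulled out with get_text() — a small self-contained sketch:

```python
from bs4 import BeautifulSoup

html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''

soup = BeautifulSoup(html, 'html.parser')
# get_text() returns the tag's text, preserving the internal newline
texts = [p.get_text() for p in soup.find_all('p')]
print(texts)
# ['Hey, this should be scraped\nbut this part gets ignored for some reason.']
```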

Related

Parsing HTML data with BeautifulSoup - cannot extract the 'href' out in one string

I'm trying to parse the HTML to get the 'href' link.
My code splits the href into separate characters, but I'm hoping to get the complete string.
Here is my code:
data = requests.get("https://www.chewy.com/b/food_c332_p2",
                    auth=('user', 'pass'),
                    headers={'User-Agent': user_agent})

with open("dogfoodpage/dg2.html", "w+") as f:
    f.write(data.text)

with open("dogfoodpage/dg2.html") as f:
    page = f.read()

soup = BeautifulSoup(page, "html.parser")
test = soup.find('a', class_="kib-product-title")
productlink = []

for items in test:
    for link in items.get("href"):
        productlink.append(link)
Here is my output:
Here is the html structure for test:
productlink = []

for items in test:
    for link in items.find_all('a', class_="kib-product-title"):
        productlink.append(link.get('href'))
This should work. The find method returns a single tag, and looping over its href string gives you the URL divided into characters. The find_all method returns all matching tags, and we can iterate over it to collect the links.
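To see why iterating over a single href gives characters, here is a minimal illustration (the href value is made up):

```python
# A hypothetical href string, as returned by tag.get("href")
href = "/dog-food/p123"

# Iterating over a string yields one character at a time,
# which is what happened in the original inner loop
chars = [c for c in href]
print(chars[:4])  # ['/', 'd', 'o', 'g']
```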

After scraping I can not write the text to a text file

I am trying to scrape the prices from a website and it's working but... I can't write the result to a text file.
this is my python code.
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.futbin.com/stc/cheapest"
r = requests.get(url)
soup = bs(r.content, "html.parser")
price = soup.find("div", {"class":"d-flex row col-md-9 px-0"})
name = "example"
f = open(name + '.txt', "a")
f.write(price.text)
This is not working, but if I print the result instead of writing it to a text file, it works. I have searched for a long time but don't understand it. I think it must be a string to write to a text file, but I don't know how to change the output to a string.
You're getting the error due to a Unicode character.
Try adding the encoding='utf-8' argument when opening the file.
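Applied to your original code, the minimal fix is just the open call — a sketch, where the parsed div below stands in for the element the real code scrapes with requests:

```python
from bs4 import BeautifulSoup

# Stand-in for the element the real code scrapes from the live page
soup = BeautifulSoup("<div>Messi 1,200 \u20ac</div>", "html.parser")
price = soup.find("div")

name = "example"
# An explicit encoding lets write() handle characters (like the euro
# sign) that the platform's default file encoding may not represent
with open(name + ".txt", "a", encoding="utf-8") as f:
    f.write(price.text)
```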
Also your code gives a bit messy output. Try this instead:
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.futbin.com/stc/cheapest"
r = requests.get(url)
soup = bs(r.content, "html.parser")

rows = soup.find("div", {"class": "d-flex row col-md-9 px-0"})
prices = rows.findAll("span", {"class": "price-holder-row"})
names = rows.findAll("div", {"class": "name-holder"})

price_list = []
name_list = []

for price in prices:
    price_list.append(price.text.strip("\n "))

for name in names:
    name_list.append(name.text.split()[0])

name = "example"
with open(f"{name}.txt", mode='w', encoding='utf-8') as f:
    for name, price in zip(name_list, price_list):
        f.write(f"{name}:{price}\n")

Trying to remove tags with Python (BeautifulSoup)

This is my code:
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/"
result = requests.get(url)
soup = BeautifulSoup(result.text, "html.parser")
find_by_class = soup.find('div', attrs={"class": "class_name"}).find_all('p')
I want to print the data without the html tags, but I can't use get_text() after the find_all('p').
You could use a for loop, like so:
for i in soup.find('div', attrs={"class": "class_name"}).find_all('p'):
    print(i.get_text())
or, if you want to save that information, put it into a list:
things = []

for i in soup.find('div', attrs={"class": "class_name"}).find_all('p'):
    things.append(i.get_text())
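The same loop can also be written as a list comprehension — a sketch on made-up HTML, where "class_name" is a placeholder for your real class:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page
html = '<div class="class_name"><p>one</p><p>two</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# One expression instead of the append loop
things = [p.get_text()
          for p in soup.find('div', attrs={"class": "class_name"}).find_all('p')]
print(things)  # ['one', 'two']
```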

Best way to loop this situation?

I have a list of divs, and I'm trying to get certain info in each of them. The div classes are the same so I'm not sure how I would go about this.
I have tried for loops but have been getting various errors.
Code to get list of divs:
import requests
from bs4 import BeautifulSoup
import re

url = 'https://sneakernews.com/release-dates/'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, "lxml")

soup1 = soup.find("div", {'class': 'popular-releases-block'})
soup1 = str(soup1.find("div", {'class': 'row'}))
soup1 = soup1.split('</div>')
print(soup1)
Code I want to loop for each item in the soup1 list:
linkinfo = soup1.find('a')['href']
date = str(soup1.find('span'))
name = soup1.find('a')
non_decimal = re.compile(r'[^\d.]+')
date = non_decimal.sub('', date)
name = str(name)
name = re.sub('</a>', '', name)
link, name = name.split('>')
link = re.sub('<a href="', '', link)
link = re.sub('"', '', link)
name = name.split(' ')
name = str(name[-1])
date = str(date)
link = str(link)
print(link)
print(name)
print(date)
Based on the URL you posted above, I imagine you are interested in something like this:
import requests
from bs4 import BeautifulSoup

url = requests.get('https://sneakernews.com/release-dates/').text
soup = BeautifulSoup(url, 'html.parser')
tags = soup.find_all('div', {'class': 'col lg-2 sm-3 popular-releases-box'})

for tag in tags:
    link = tag.find('a').get('href')
    print(link)
    print(tag.text)
    # Anything else you want to do
If you are using the BeautifulSoup library, then you do not need regex to try to parse through HTML tags. Instead, use the handy methods that accompany BeautifulSoup. If you would like to apply a regex to the text output from the tags you locate via BeautifulSoup to accomplish a more specific task, then that would be reasonable.
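For instance, the link, name, and date in the original code can each be read straight from the tag tree instead of being reassembled with re — a sketch on made-up markup shaped like one release box (the values are not real):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking one release box on the page
html = '''<div class="col lg-2 sm-3 popular-releases-box">
<span>12.25</span>
<a href="https://example.com/shoe">Air Example</a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
box = soup.find('div', {'class': 'popular-releases-box'})

link = box.find('a')['href']        # attribute access, no regex needed
name = box.find('a').get_text()     # the anchor's text
date = box.find('span').get_text()  # the release date text
print(link, name, date)
```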
My understanding is that you want to loop your code for each item within a list.
An example of this:
my_list = ["John", "Fred", "Tom"]

for name in my_list:
    print(name)
This will loop over each name in my_list and print out each item (referred to here as name in the list). You could do something similar with your code:
for item in soup1:
    pass  # perform some action

BeautifulSoup child-tags and removing duplicates

I'm attempting to clean some html by parsing it through BeautifulSoup using Python 2.
BeautifulSoup parses the raw_html which is associated with a website_id in the html_dict. It also removes any attribute associated with the html tags (<a>, <b>, and <p>).
html_dict = {"l0000": ["<a href='some url'>test</a>", "lol", "<a><b>test</b></a>"], "l0001": ["<p>this is html</p>", "<p>this is html</p>"]}
clean_html = {}

for website_id, raw_html in html_dict.items():
    for i in raw_html:
        soup = BeautifulSoup(i, 'html.parser')
        scrape_selected_tags = soup.find_all(["a", "b", "p"])
        # Remove attributes from html tags
        for i in scrape_selected_tags:
            i.attrs = {}
        print website_id, scrape_selected_tags
This outputs:
l0001 [<p>this is html</p>]
l0001 [<p>this is html</p>]
l0000 [<a>test</a>]
l0000 []
l0000 [<a><b>test</b></a>, <b>test</b>]
I have two questions:
1) The last output has outputted "test" twice. I assume this is because it is surrounded by both the <a> and <b> tags? How does one deal with child-tags to output <a><b>test</b></a> only?
2) Given a unique website_id, how would one remove duplicates, so that there's only one occurrence of <p>this is html</p> for l0001? I know that scrape_selected_tags has a type of bs4.element.ResultSet and I'm not sure how to handle this and insert the new output in the same format as html_dict but in clean_html.
Thanks
1) Set the recursive argument to False. This selects only direct descendants and does not go deeper into the soup. The drawback of this method is that child tags keep their attributes, so you'll need one more loop to clean them.
2) Use a set (or you could use list comprehensions) to select only unique tags.
from bs4 import BeautifulSoup

html_dict = {
    "l0000": ["<a href='some url'>test</a>", "lol", "<a class='1'><b class='2'>test</b></a>"],
    "l0001": ["<p>this is html</p>", "<p>this is html</p>"]
}
clean_html = {}

for website_id, raw_html in html_dict.items():
    clean_html[website_id] = []
    for i in raw_html:
        soup = BeautifulSoup(i, 'html.parser')
        scrape_selected_tags = soup.find_all(["a", "b", "p"], recursive=False)
        for i in scrape_selected_tags:
            i.attrs = {}
        for i in [c for p in scrape_selected_tags for c in p.find_all()]:
            i.attrs = {}
        clean_tags = list(set(scrape_selected_tags + clean_html[website_id]))
        clean_html[website_id] = clean_tags

print(clean_html)
{'l0001': [<p>this is html</p>], 'l0000': [<a><b>test</b></a>, <a>test</a>]}
