Python HTML parsing issue

Given an HTML page, I would like to extract only a list of pairs like (id1, value1), (id2, value2), ...; the file looks like this:
<div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="value1">value1</div></div>
<div class="col m7 s12 col_id"><div class="content wrap">id1</div></div>
Every value is followed by its id inside a "content wrap" div.
I was thinking of something like:
match = re.compile('title="(.+?)".+?wrap"(.+?)"').findall(source)
This is an example:
<li class="collection-item Ids ">
<div class="row">
<div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="filename1">filename1</div></div>
<div class="col m7 s12 col_id"><div class="content wrap">6000bc3211af43d7</div></div>
<div></div>
<div class="col m2 s12 col_time">
<div class="content">
<a href="http://test.com/test.php" target="_blank" class="secondary-content pull-right">
<span class="font-small grey-text" title="filex">test</span>
<i class="fa fa-external-link" aria-hidden="true" title="filey"></i>
</a>
</div>
</div>
</div>

Can you show the example for id1 and value1?
I have an idea :D
\w{1,}\d{1,}<
And then taking the matches from 1 to len(match)-1
That can't be right.

You can try Beautiful Soup; it should have everything you need for parsing HTML.
For example, you could use:
from urllib.request import urlopen
from bs4 import BeautifulSoup
# open the HTML from the website or from a file; check the docs
soup = BeautifulSoup(urlopen(yoururl), "lxml")
# find_all returns a list of tags, so call get_text() on each one
result = [tag.get_text() for tag in soup.find_all(class_="content wrap")]
Here, result is a list containing the text content of every element whose class attribute is "content wrap".

Building on TheWildHealer's answer, you can use the following:
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get("http://websitehere.com").text, "lxml")
results = []
for row in soup.find_all(class_="row"):
    titleText = row.find(class_="col_title").get_text()
    idText = row.find(class_="col_id").get_text()
    results.append((idText, titleText))
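As a self-contained check of the row-by-row approach, the sketch below runs against a trimmed copy of the sample markup from the question instead of a live URL. It assumes every row holds one col_title and one col_id element and reads the value from the title attribute; the names html and pairs are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup from the question (col_time block left out).
html = """
<li class="collection-item Ids ">
<div class="row">
<div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="filename1">filename1</div></div>
<div class="col m7 s12 col_id"><div class="content wrap">6000bc3211af43d7</div></div>
</div>
</li>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for row in soup.select("div.row"):
    # Scope both lookups to the current row so ids and values stay paired.
    title_div = row.select_one(".col_title .content")
    id_div = row.select_one(".col_id .content.wrap")
    if title_div is None or id_div is None:
        continue  # skip rows that lack either piece
    pairs.append((id_div.get_text(strip=True), title_div["title"]))

print(pairs)
```

Scoping the lookups to each row avoids the mismatch that can happen when two independent find_all calls return lists of different lengths.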

Related

BeautifulSoup Returns empty list which leads to an IndexError in my Python code

I am trying to do web scraping using BeautifulSoup. The code I have written is below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(type(questions[0]))
When I run the code, I get the error message below:
print(type(questions[0]))
IndexError: list index out of range
Then I tried to print the list like below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(questions)
And then I get an empty list: []
What am I doing wrong?
Thanks for your answers.
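.question-summary is the wrong locator: it is a class selector, but question-summary is only the prefix of each element's id (for example question-summary-71715531), so nothing matches. Selecting elements whose id starts with that prefix works: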
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select('[id^="question-summary"]')
print(questions)
Output:
1" data-post-type-id="1" id="question-summary-71715531">
<div class="s-post-summary--stats js-post-summary-stats">
<div class="s-post-summary--stats-item s-post-summary--stats-item__emphasized" title="Score of 0">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">votes</span>
</div>
<div class="s-post-summary--stats-item" title="0 answers">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">answers</span>
</div>
<div class="s-post-summary--stats-item" title="5 views">
<span class="s-post-summary--stats-item-number">5</span>
<span class="s-post-summary--stats-item-unit">views</span>
</div>
</div>
<div class="s-post-summary--content">
<h3 class="s-post-summary--content-title">
<a class="s-link" href="/questions/71715531/is-it-possible-to-draw-a-logistic-regression-graph-with-multiple-x-variable">Is it possible to draw a
logistic regression graph with multiple x variable?</a>
</h3>
<div class="s-post-summary--content-excerpt">
Currently, this is my X and V value. May I know is it possible to draw a logistic regression curve with X that has multiple column? Or I am required to draw multiple graphs to do so?
X = df1.drop(['...
</div>
<div class="s-post-summary--meta">
<div class="s-post-summary--meta-tags tags js-tags t-python-3ûx t-machine-learning">
<a class="post-tag flex--item mt0 js-tagname-python-3ûx" href="/questions/tagged/python-3.x" rel="tag" title="show questions tagged 'python-3.x'">python-3.x</a> <a class="post-tag flex--item mt0 js-tagname-machine-learning" href="/questions/tagged/machine-learning" rel="tag" title="show questions tagged 'machine-learning'">machine-learning</a>
</div>
<div class="s-user-card s-user-card__minimal">
<a class="s-avatar s-avatar__16 s-user-card--avatar" href="/users/14128881/christopher-chua"> <div class="gravatar-wrapper-16" data-user-id="14128881">
<img ,="" alt="user avatar" class="s-avatar--image" height="16" src="https://lh6.googleusercontent.com/-Sn3B_E5hiJc/AAAAAAAAAAI/AAAAAAAAAAA/AMZuucl1oyfdhJiXhrx73JLYqzKAK9icag/photo.jpg?sz=32" width="16"/>
</div>
</a>
<div class="s-user-card--info">
<div class="s-user-card--link d-flex gs4">
<a class="flex--item" href="/users/14128881/christopher-chua">Christopher Chua</a>
</div>
<ul class="s-user-card--awards">
<li class="s-user-card--rep"><span class="todo-no-class-here" dir="ltr" title="reputation score ">7</span></li>
</ul>
</div>
<time class="s-user-card--time">asked <span class="relativetime" title="2022-04-02 07:03:06Z">13 mins ago</span></time>
.. so on
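The difference between the two selectors can be checked offline. The sketch below uses a small hand-written snippet that borrows only the id and data-post-type-id attributes visible in the output above; the rest of the markup is an assumption, and the real page may change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for one entry of the question list.
html = """
<div id="questions">
  <div data-post-type-id="1" id="question-summary-71715531">
    <h3 class="s-post-summary--content-title">
      <a class="s-link" href="/questions/71715531/example">Example title</a>
    </h3>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Class selector: matches nothing, because no element has that class.
by_class = soup.select(".question-summary")

# Attribute selector: matches elements whose id starts with the prefix.
by_id_prefix = soup.select('[id^="question-summary"]')

# Guard against an empty result before indexing to avoid an IndexError.
first = by_id_prefix[0] if by_id_prefix else None

print(len(by_class), len(by_id_prefix))
```

Checking for an empty list before indexing keeps the script from failing when the site's markup changes again.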

Parsing HTML with BeautifulSoup class == AND title CONTAINS

I am trying to parse the following HTML:
<div class="product-details">
<h4 class="title" >Blue - Standard</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
,
<div class="product-details">
<h4 class="title" >Blue - Wide</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576" >SOLD</a></div>
</div>
I am trying to get the 'id' where the title contains 'Blue' AND the item is not sold.
I have tried:
soup.find_all("a",href=re.compile("Blue"),class_="")
links = soup.find_all("a", href=re.compile("Blue", "Add To Cart"))
ids = [tag["id"] for tag in soup.find_all("a", href=re.compile("Blue"))]
But it is not returning the info I'm looking for.
I would like it to return:
AddToCartSimple-3593
I think your HTML is corrupted. You can do the entire filtering with CSS selectors using :has, :not, and :contains (:-soup-contains in the latest soupsieve), along with attribute selectors. In an attribute selector, ^= means the attribute value starts with the string after the =. The ~ is the general sibling combinator and > is the child combinator. Read together, the selector below looks for a sibling with class tocart, then a child of it whose id starts with AddToCartSimple- and whose text does not contain SOLD. This is less specific than testing for text not equal to "SOLD", because it also excludes partial matches; which one is appropriate depends on the variation observed in the actual data.
from bs4 import BeautifulSoup as bs
html ='''
<div class="product-details">
<h4 class="title">Blue - Standard</h4> <a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart"> <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a>
</div>
<div class="product-details">
<h4 class="title">Blue - Wide</h4> <a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart"> <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576">SOLD</a>
</div>
'''
soup = bs(html, 'html.parser')
print(soup.select_one('.title:has([title^="Blue -"]) ~ .tocart > [id^=AddToCartSimple-]:not(:contains("SOLD"))')['id'])
Of course, you should check that there was a match before accessing ['id'], because select_one returns None when nothing matches. You could also collect all matches as follows:
[i['id'] for i in soup.select('.title:has([title^="Blue -"]) ~ .tocart > [id^=AddToCartSimple-]:not(:contains("SOLD"))')]
To get the data where the "title" contains "Blue" and the item is not "SOLD":
Use the CSS selector .product-details > h4 a[title*='Blue'], which selects every a whose title contains Blue inside an h4 that is a direct child of an element with class product-details.
Find the next div using the find_next() method, and check that its text is not "SOLD".
Print that div's id.
from bs4 import BeautifulSoup
html = """<div class="product-details">
<h4 class="title" >Blue - Standard</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
<div class="product-details">
<h4 class="title" >Blue - Wide</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576" >SOLD</a></div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select(".product-details > h4 a[title*='Blue']"):
    if tag.find_next("div").text != "SOLD":
        print(tag.find_next("div")["id"])
Output:
AddToCartSimple-3593
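For comparison, here is a minimal sketch that does not depend on a title attribute or on the malformed div. It assumes the markup is well formed (the tocart div is closed before its link) and that the product name is the text of the h4; the function name available_ids is made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumed well-formed version of the markup from the question.
html = """
<div class="product-details">
  <h4 class="title">Blue - Standard</h4>
  <a class="learn-more" href="https://productwebpage.com">Learn More</a>
  <div class="tocart"><a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
<div class="product-details">
  <h4 class="title">Blue - Wide</h4>
  <a class="learn-more" href="https://productwebpage.com">Learn More</a>
  <div class="tocart"><a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576">SOLD</a></div>
</div>
"""


def available_ids(markup, keyword):
    """Return cart-link ids for products whose title contains keyword and are not sold."""
    soup = BeautifulSoup(markup, "html.parser")
    ids = []
    for product in soup.find_all("div", class_="product-details"):
        title = product.find("h4", class_="title")
        link = product.select_one(".tocart > a[id]")
        if title is None or link is None:
            continue  # skip incomplete product blocks
        if keyword in title.get_text() and link.get_text(strip=True) != "SOLD":
            ids.append(link["id"])
    return ids


ids = available_ids(html, "Blue")
print(ids)
```

Looping over each product block keeps the title and the cart link paired, and the None checks cover blocks where either element is missing.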

Find text between specific id beautifulsoup

I have HTML like the following example:
<a class="anchor" id="category-1"></a>
<h2 class="text-muted">First Category</h2>
<div class="row">
<a class="anchor-entry" id="cat1-first-id"></a>
<div class="col-lg-10">
<h3>First H3 Title</h3>
</div>
<a class="anchor-entry" id="cat1-second-id"></a>
<div class="col-lg-10">
<h3>Second H3 Title</h3>
</div>
<div class="row">
<a class="anchor-entry" id="cat1-third-id"></a>
<div class="col-lg-10">
<h3>Third H3 Title</h3>
</div>
</div>
</div>
<a class="anchor" id="category-2"></a>
<h2 class="text-muted">Second Category</h2>
<div class="row">
<a class="anchor-entry" id="cat2-first-id"></a>
<div class="col-lg-10">
<h3>First H3 Title</h3>
</div>
<div class="row">
<a class="anchor-entry" id="cat2-second-id"></a>
<div class="col-lg-10">
<h3>Second H3 Title</h3>
</div>
</div>
</div>
<a class="anchor" id="category-3"></a>
<h2 class="text-muted">Third Category</h2>
<div class="row">
<a class="anchor-entry" id="cat3-first-id"></a>
<div class="col-lg-10">
<h3>Cat-3 First H3 Title</h3>
</div>
<div class="row">
<a class="anchor-entry" id="cat3-second-id"></a>
<div class="col-lg-10">
<h3>Cat-3 Second H3 Title</h3>
</div>
<div class="row">
<a class="anchor-entry" id="cat3-third-id"></a>
<div class="col-lg-10">
<h3>Cat-3 Third H3 Title</h3>
</div>
</div>
</div>
</div>
So there are some blocks that are not wrapped in their own div but sit between a tags with specific ids.
I have the list of every id I need (category-1, category-2) and I would like to collect, in a Python object (dict, DataFrame, whatever), all the h3 text for each category:
d = {
    'category-1': ['Cat-1 First H3 Title', 'Cat-1 Second H3 Title', 'Cat-1 Third H3 Title'],
    'category-2': ['Cat-2 First H3 Title', 'Cat-2 Second H3 Title']
}
The problem is that I couldn't find any method to get the information in between:
import requests
from bs4 import BeautifulSoup
url = 'myUrl'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
category_list = ['category-1', 'category-2']
for i in category_list:
    # list like: [<a class="anchor" id="category-1"></a>]
    catid = soup.find_all(id=i)
    # long list like: [<a class="anchor-entry" id="cat1-first-id"></a>, ...]
    cata = soup.find_all('a', {'class': 'anchor-entry'})
But catid and cata aren't linked and I stopped here.
Your code will only select a tags with class anchor-entry.
category_list = ['category-1', 'category-2', 'category-3']
category_tags = soup.find_all("a", {"class": "anchor"})
d = {}
for i in category_list:
    # start at the first tag after this category's anchor
    tag = soup.find("a", {"id": i}).find_next()
    # walk forward until the next category anchor (or the end of the document)
    while tag not in category_tags:
        tag = tag.find_next()
        if tag is None:
            break
        if tag.name == "h3":
            if d.get(i):
                d[i].append(tag.text)
            else:
                d[i] = [tag.text]
My approach is to traverse the html tree, get h3 headers and store them in d until another category-id is found.
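A variant of the same traversal, written as a sketch with find_all_next() and run against a shortened copy of the sample markup. The function name h3_by_category is hypothetical, and the stop condition assumes category anchors are exactly the a tags carrying the class anchor.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup from the question.
html = """
<a class="anchor" id="category-1"></a>
<h2 class="text-muted">First Category</h2>
<div class="row">
  <a class="anchor-entry" id="cat1-first-id"></a>
  <div class="col-lg-10"><h3>First H3 Title</h3></div>
  <a class="anchor-entry" id="cat1-second-id"></a>
  <div class="col-lg-10"><h3>Second H3 Title</h3></div>
</div>
<a class="anchor" id="category-2"></a>
<h2 class="text-muted">Second Category</h2>
<div class="row">
  <a class="anchor-entry" id="cat2-first-id"></a>
  <div class="col-lg-10"><h3>First H3 Title</h3></div>
</div>
"""


def h3_by_category(markup, category_ids):
    """Map each category id to the h3 texts found before the next category anchor."""
    soup = BeautifulSoup(markup, "html.parser")
    result = {}
    for cat_id in category_ids:
        anchor = soup.find("a", id=cat_id)
        if anchor is None:
            continue  # unknown id: leave it out of the result
        titles = []
        for tag in anchor.find_all_next():
            # Stop at the next category anchor; "anchor-entry" is a different class.
            if tag.name == "a" and "anchor" in tag.get("class", []):
                break
            if tag.name == "h3":
                titles.append(tag.get_text(strip=True))
        result[cat_id] = titles
    return result


result = h3_by_category(html, ["category-1", "category-2"])
print(result)
```

Comparing class names instead of whole tags avoids relying on tag equality, and a category without any h3 still gets an empty list.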

Python v3, BeautifulSoup - multiple div tags with same name

soup = BeautifulSoup(html, "html.parser") # BeautifulSoup(markup, "lxml")
items = soup.find_all("div","_3u1 _gli _uvb", recursive=True)
for item in items:
    abouts = item.find_all("div", {"class": "_glo"}, recursive=True)[0].text
    print(abouts)
HTML page:
<div class="_glo">
<div>
<div class="_ajw">
<div class="_52eh">
"text
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
</div>
</div>
Good afternoon, I am trying to scrape a webpage using BeautifulSoup in Python. I need each of the "text" strings in a separate variable. When I print abouts I get "text text text", but I want the strings to be separated.
Kind regards
Try this:
items = soup.find_all('div', attrs={'class': '_ajw'})
texts = {}  # named texts to avoid shadowing the built-in dict
for i, item in enumerate(items, start=1):
    texts['text' + str(i)] = item.find('div', attrs={'class': '_52eh'}).text
print(texts)
This will give you something like this:
{'text1': text, 'text2': text, 'text3': text}
I'd use soup.select to apply a class selector to the html. It is a fast method to get a list of the appropriate elements by class
from bs4 import BeautifulSoup as bs
html = '''
<div class="_glo">
<div>
<div class="_ajw">
<div class="_52eh">
"text
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
</div>
</div>
'''
soup = bs(html, 'lxml')
items = [item.text.strip() for item in soup.select('._52eh')]
print(items)

Save Python output to excel using one row

I am trying to save my Python console output to an excel file. Unfortunately, it only saves the first line, but truncates the rest.
html = '''<div class="dynamicBottom">
<div class="dynamicLeft">
<div class="content_block details_block scroll_tabs" data-tab="TABS_DETAILS">
<div class="header_with_improve wrap">
<div class="improve_listing_btn ui_button primary small">improve this entry</div>
<h3 class="tabs_header">Details</h3> </div>
<div class="details_tab">
<div class="table_section">
<div class="row">
<div class="ratingSummary wrap">
<div class="histogramCommon bubbleHistogram wrap">
<div class="colTitle">
Rating
</div>
<ul class="barChart">
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Location</span>
</div>
<div class="wrap row part ">
<span class="rate sprite-rating_s rating_s"> <img class="sprite-rating_s_fill rating_s_fill s45" src="https://static.tacdn.com/img2/x.gif" alt="45 out of fifty points">
</span>
</div>
</div>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Service</span>
</div>
<div class="wrap row part ">
<span class="rate sprite-rating_s rating_s"> <img class="sprite-rating_s_fill rating_s_fill s45" src="https://static.tacdn.com/img2/x.gif" alt="45 out of fifty points">
</span>
</div>
</div>
</li>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for div in soup.find_all('div', class_="ratingRow wrap"):
    text = div.text.strip()
    alt = div.find('img').get('alt')
    print(text, alt)
This gives out:
Location 45 out of fifty points
Service 45 out of fifty points
I tried saving this to excel adding the last two lines:
for div in soup.find_all('div', class_="ratingRow wrap"):
    text = div.text.strip()
    alt = div.find('img').get('alt')
    print(text, alt)
    worksheet.write_string(row, col+15, text)
    worksheet.write_string(row, col+16, alt)
But this only saves "Location; 45 out of fifty points". I would like to save the info given in the output to excel using one row but different columns. Is there any way I can do that?
Thank you very much for your help!
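One possible cause, assuming row and col never change inside the loop, is that every iteration writes to the same two cells. The sketch below advances the column after each pair so that everything lands in one row. The helper write_pairs_in_one_row and the RecordingSheet stand-in are hypothetical; the stand-in only mimics a write_string(row, col, text) call like the one in the question, so the example runs without an Excel library.

```python
def write_pairs_in_one_row(worksheet, pairs, row=0, first_col=0):
    """Write each (label, alt) pair into consecutive columns of a single row."""
    col = first_col
    for label, alt in pairs:
        worksheet.write_string(row, col, label)
        worksheet.write_string(row, col + 1, alt)
        col += 2  # move right so the next pair does not reuse these cells
    return col  # first unused column


class RecordingSheet:
    """Hypothetical stand-in that records writes instead of creating a file."""

    def __init__(self):
        self.cells = {}

    def write_string(self, row, col, text):
        self.cells[(row, col)] = text


# The two pairs printed by the loop in the question.
pairs = [
    ("Location", "45 out of fifty points"),
    ("Service", "45 out of fifty points"),
]

sheet = RecordingSheet()
next_col = write_pairs_in_one_row(sheet, pairs, row=0, first_col=15)
print(sheet.cells)
```

With a real worksheet object that offers the same write_string call, the pairs collected in the loop can be passed to the helper in place of the stand-in.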
