Did something break with BeautifulSoup element extraction?

Classic case of code that used to work, nothing changed, and now it doesn't work anymore. I'm trying to extract a list of unique appid values from a page that I'm saving locally as roguelike.html.
The code I have looks like this, and it worked as of a couple of months ago when I last ran it, but now the end result is a one-element list containing just None. Any ideas as to what's going wrong here?
from bs4 import BeautifulSoup
text_file = open("roguelike.html", "rb")
steamdb_text = text_file.read()
text_file.close()
soup = BeautifulSoup(steamdb_text, "html.parser")
trs = [tr for tr in soup.find_all('tr')]
apps = []
for app in soup.find_all('tr'):
    apps.append(app.get('data-appid'))
appset = list(set(apps))
Is there a simpler way to get the unique appids from the page source? The individual elements I'm trying to cycle over and grab look like:
<tr class="app" data-appid="98821" data-cache="1533726913">
where I want all the unique data-appid values. I'm scratching my head trying to figure out whether the formatting of the page changed (it doesn't seem like it) or whether some kind of version upgrade in Spyder, Python, or BeautifulSoup broke something that used to work.
Any ideas?

I tried this code and it worked well for me. You should make sure that the HTML file you have is the right file; perhaps you've hit a captcha test and that is what ended up in the saved HTML.
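For what it's worth, here is a minimal sketch along the same lines (assuming the saved roguelike.html still contains <tr> rows carrying a data-appid attribute, as in the snippet above) that pulls the unique ids with a CSS selector and makes it obvious when the file contains something else, such as a captcha page:

from bs4 import BeautifulSoup

with open("roguelike.html", "rb") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# select only the rows that actually carry the attribute, so no None values creep in
appids = {tr["data-appid"] for tr in soup.select("tr[data-appid]")}

if not appids:
    print("No data-appid rows found - the saved page may be a captcha or error page")
else:
    print(sorted(appids, key=int))

If the set comes back empty, the saved HTML almost certainly isn't the table you expect.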

BeautifulSoup: How to pass a variable into soup.find({variable})

I am using Beautiful Soup to search an XML file provided by the SEC (this is public data). Beautiful Soup works very well for referencing tags, but I cannot seem to pass a variable to its find function. Static content is fine. I think there is a gap in my Python understanding that I can't seem to close. (I code a few days a year; it's not my main role.)
File:
https://reports.adviserinfo.sec.gov/reports/CompilationReports/IA_FIRM_SEC_Feed_02_08_2023.xml.gz
I download, unzip and then create the soup from the file using lxml.
with open(Firm_Download_name,'r') as f:
    soup = BeautifulSoup(f, 'lxml')
Next is where I am running into trouble: I have a list of Firm CRD numbers (these are public numbers identifying the firm) that I am looking for in the XML file, and I then pull out various data points from the child tags.
If I write it statically such as:
soup.find(firmcrdnb="5639055").parent
This works perfectly, but I want to loop through a list of CRD numbers and pull out a different block each time. I can not figure out how to pass a variable to the soup.find function.
I feel like this should be simple. I appreciate any help you can provide.
Here is my current attempt:
searchstring = 'firmcrdnb="'+Firm_CRD+'"'
select_firm = soup.find(searchstring).parent
I have tried other similar setups and reviewed other Stack Exchange answers such as Is it possible to pass a variable to (Beautifulsoup) soup.find()? but I'm just not quite getting it.
Here is an example of the XML.
<?xml version="1.0" encoding="iso-8859-1"?>
<IAPDFirmSECReport GenOn="2017-09-30">
<Firms>
<Firm>
<Info SECRgnCD="MIRO" FirmCrdNb="9999" SECNb="999-99999" BusNm="XXXX INC." LegalNm="XXX INC" UmbrRgstn="N"/>
<MainAddr Strt1="9999 XXXX" Strt2="XXXX" City="XXX" State="FL" Cntry="XXX" PostlCd="999999" PhNb="999-999-9999" FaxNb="999-999-9999"/>
<MailingAddr Strt1="9999 XXXX" Strt2="XXXX" City="XXX" State="FL" Cntry="XXX" PostlCd="999999" />
<Rgstn FirmType="Registered" St="APPROVED" Dt="9999-01-01"/>
<NoticeFiled>
Thanks
PS: if anyone has ideas on how to improve the speed of the search on this large file, I'd appreciate that too. I get messages such as "pydevd warning: Computing repr of soup (BeautifulSoup) was slow (took 43.83s)". I did install and import chardet per the BeautifulSoup documentation, but that hasn't seemed to help.
I'm not sure where I got turned around, but my static answer did not in fact work.
The tag is "info" and the attribute is "firmcrdnb".
The answer that works was:
select_firm = soup.find("info", {"firmcrdnb" : Firm_CRD}).parent
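A rough sketch of how that call fits into a loop over several CRD numbers (the numbers below are placeholders; note that the lxml parser lower-cases tag and attribute names, which is why "info" and "firmcrdnb" work):

firm_crds = ["5639055", "9999"]  # placeholder CRD numbers
for firm_crd in firm_crds:
    info = soup.find("info", {"firmcrdnb": firm_crd})
    if info is None:
        continue  # this CRD is not in the file
    firm = info.parent  # the enclosing <Firm> block
    print(firm_crd, info.get("busnm"))  # BusNm attribute, lower-cased by the parser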
Welcome to Stack Overflow.
Try using:
select_firm = soup.find(attrs={'firmcrdnb': str(Firm_CRD)}).parent
Maybe I'm missing something. If it works statically, have you tried something such as:
list_of_crds = ["11111","22222","33333"]
for crd in list_of_crds:
    result = soup.find(firmcrdnb=crd).parent
    ...

How to open and scrape multiple links with Selenium

I am new to scraping with Python and have encountered a weird issue.
I am attempting to scrape OCR'd newspaper articles from a list of URLs using Selenium -- the proxy settings on the data source make this easier than other options.
However, I receive tracebacks for the text data every time I run my code. Here is the code that I am using:
article_links = []
for link in driver.find_elements_by_xpath('/html/body/div[1]/main/section[1]/ul[2]/li[*]/div[2]/div[1]/h3/a'):
    links = link.get_attribute("href")
    article_links.append(links)

articles = []
for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    driver.find_element_by_css_selector("#js-doc-explorer-show-additional-views").click()
    time.sleep(1)
    for article_text in driver.find_elements_by_css_selector("#ocr-container > div.fulltext-ocr.js-page-ocr"):
        articles.append(article_text)
I come closest to solving the issue by using .click(), which opens a hidden panel for my data. However, with this code, the only row that gets filled is the last one in the dataset. Without the .click(), all rows come back with nothing. Changing the sleep settings also does not help.
The XPath for the text data is:
/html/body/div[2]/main/section/div[2]/div[2]/section[2]/div/div[4]/text()
Alternatively, is there a way to get each link's source code and parse it with BeautifulSoup after the fact?
UPDATE: There has to be something wrong with the loops -- I can get either the first or last values, but nothing in between.
In a more recent version of Selenium, the method find_elements_by_xpath() is deprecated. Is that the issue you are facing? If it is, add from selenium.webdriver.common.by import By and change the call to find_elements(By.XPATH, ...). Similarly, find_elements_by_css_selector() is replaced with find_elements(By.CSS_SELECTOR, ...).
You don't specify if this is even the issue, but if it is, I hope this helps :-)
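For the code in the question, the Selenium 4 equivalents would look roughly like this (the locators are the ones from the question):

from selenium.webdriver.common.by import By

# old: driver.find_elements_by_xpath(...)
links = driver.find_elements(By.XPATH, '/html/body/div[1]/main/section[1]/ul[2]/li[*]/div[2]/div[1]/h3/a')

# old: driver.find_element_by_css_selector(...)
button = driver.find_element(By.CSS_SELECTOR, "#js-doc-explorer-show-additional-views")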
The solution is found by selecting the relevant (unique) class and specifying that it must contain text.
news = []
for article in article_links:
    driver2.get(article)
    driver2.find_element(By.CSS_SELECTOR, "#js-doc-explorer-show-additional-views").click()
    article_text = driver2.find_element(By.XPATH, '//div[@class="fulltext-ocr js-page-ocr"][contains(text()," ")]')
    news.append([article_text.text])

BeautifulSoup find_all Method not Generalising

I'm a bit stuck on a problem with BeautifulSoup. This piece of code is a snippet from a function I'm trying to debug. The scraper worked fine and suddenly stopped. The strange thing is that the class I'm searching for, "ipsColumn ipsColumn_fluid", is present in the "post_soup" document produced in the second step of the loop.
As part of my debugging, I wanted to see what was produced, which is the reason for the text file. However, it is empty. I have no idea why.
Any ideas?
post_pages = ['https://coffeeforums.co.uk/topic/4843-a-little-thank-you/', 'https://coffeeforums.co.uk/topic/58690-for-sale-area-rules-changes-important/']
for topic_url in post_pages:
    post_page = urlopen(topic_url)
    post_soup = BeautifulSoup(post_page, 'lxml')
    messy_posts = post_soup.find_all('div', class_='ipsColumn ipsColumn_fluid')
    with open('messy_posts.txt', 'w') as f:
        f.write(str(messy_posts))
Edit: you can swap in this variable to see how it should work. These websites are built on the same platform, so the scrape should be the same (I would think):
post_pages = ['https://forum.cardealermagazine.co.uk/topic/8603-customer-comms-and-the-virus/', 'https://forum.cardealermagazine.co.uk/topic/10096-volvo-issue-heads-up/']
class_ takes a list for multiple classes, not a space-separated string, when you want an OR match. You could change it from
class_='ipsColumn ipsColumn_fluid'
to this and it should work:
class_=['ipsColumn', 'ipsColumn_fluid']
Alternatively, if you are going for an AND (where you want a div with both classes), I advise you to use select, as such:
post_soup.select('div.ipsColumn.ipsColumn_fluid')
This returns the divs that include both classes.
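A small self-contained sketch of the difference (the HTML snippet here is made up purely to illustrate):

from bs4 import BeautifulSoup

html = '''
<div class="ipsColumn">one class only</div>
<div class="ipsColumn ipsColumn_fluid">both classes</div>
'''
soup = BeautifulSoup(html, 'lxml')

# OR: a list matches divs that have either class -> 2 results
print(len(soup.find_all('div', class_=['ipsColumn', 'ipsColumn_fluid'])))

# AND: the CSS selector matches only divs that have both classes -> 1 result
print(len(soup.select('div.ipsColumn.ipsColumn_fluid')))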

Scraping specific 'dd' tags with BeautifulSoup and Python

I'm learning BeautifulSoup and I came across a problem: scraping dd tags in HTML. I want to get the parameters shown in the red zone of the screenshot (the dd values), but I do not know how to access them. I have tried this:
kvadratura = float(nek_html.find('span', class_='d-inline-block mt-auto').text.split(' ')[0])
jedinica_mere = nek_html.find('span', class_='d-inline-block mt-auto').text.split(' ')[1].strip()
...
But the problem is that different pages sometimes have different parameters, or a different order of parameters, so I can't access them by index. Check out these links:
https://www.nekretnine.rs/stambeni-objekti/stanovi/centar-zmaj-jovina-salonac-id1003/NkmUEzjEFo0/
https://www.nekretnine.rs/stambeni-objekti/stanovi/prodajemo-stan-milica-od-macve-mirijevo-46m2-nov/NkNruPymNHy/
How can I be sure that I will always scrape the parameter that I want?
Each parameter goes into a list afterwards, so if some parameter does not exist, it should add '' to the list.
In such cases, this is something you might want to do instead of using an index, as the latter may lead you to the wrong dd. With the following approach, all you need to do is replace the text within :contains('') to get the corresponding dd, as in Transakcija, Vrsta stana and so on.
import requests
from bs4 import BeautifulSoup
url = "https://www.nekretnine.rs/stambeni-objekti/stanovi/zemun-krajiska-41m-bela-fasadna-cila-odlican/NkiRX4sq4Cy/"
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
Kategorija = soup.select_one(".base-inf .dl-horozontal:has(:contains('Kategorija:')) > dd")
Kategorija = Kategorija.get_text(strip=True) if Kategorija else ""
print(Kategorija)
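Extending that to the requirement that missing parameters fall back to '', here is a sketch that loops over the labels of interest (the label list below is just an example):

labels = ["Kategorija:", "Transakcija:", "Vrsta stana:"]  # example labels
values = []
for label in labels:
    dd = soup.select_one(f".base-inf .dl-horozontal:has(:contains('{label}')) > dd")
    values.append(dd.get_text(strip=True) if dd else "")
print(values)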

Losing data when scraping with Python?

UPDATE(4/10/2018):
So I found that my problem was that the information wasn't available in the source code which means I have to use Selenium.
UPDATE:
I played around with this problem a bit more. Instead of running soup, I just took pageH, decoded it into a string, and made a text file out of it, and I found that '{{ optionTitle }}' and '{{priceFormat (showPrice, session.currency)}}' came from a template section stated separately in the HTML file. Which I THINK means that I was just looking in the wrong place. I am still unsure, but that's what I think.
So now I have a new question. After looking at the text file, I am realizing that the necessary information is not even in pageH. At the place where it should give me the information I am looking for, it says instead:
<bread-crumbs :location="location" :product-name="product.productName"></bread-crumbs>
<product-info ref="productInfo" :basic="product" :location="location" :prod-info="prodInfo"></product-info>
What does this mean?/Is there a way to get through this to get to the information?
ORIGINAL QUESTION:
I am trying to collect the names/prices for products off of a website. I am unsure whether the data is being lost because of the HTML parser or because of BeautifulSoup, but what is happening is that once I do get to the position I want to be in, what is returned instead of the specific name/price is '{{ optionTitle }}' or '{{priceFormat (showPrice, session.currency)}}'. After I get the URL using pageH = urllib.request.urlopen(), the code that gives this result is:
pageS = soup(pageH, "html.parser")
pageB = pageS.body
names = pageB.findAll("h4")
optionTitle = names[3].get_text()
optionPrice = names[5].get_text()
Because this didn't work, I tried going about it a different way and looked for more specific tags, but the section of the code that mattered just does not show. It completely disappears. Is there something I can do to get the specific names/prices or is this a security measure that I cannot work through?
The {{ }} syntax looks like a client-side JavaScript template (Vue or Angular), so those values are filled in by the browser after the initial HTML is loaded. Try Requests-HTML to do the rendering (by using render()) and get the content afterward. An example is shown below:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://python-requests.org/')
r.html.render()
r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>'
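If you would rather keep BeautifulSoup for the parsing itself, one option (a sketch, assuming the page fills in those templates client-side) is to render first and then hand the rendered markup to BeautifulSoup:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
r = session.get('https://example.com/product-page')  # placeholder URL
r.html.render()  # executes the page's JavaScript in a headless browser

soup = BeautifulSoup(r.html.html, 'html.parser')  # r.html.html holds the rendered markup
names = soup.find_all('h4')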
