I'm a bit stuck on a problem with BeautifulSoup. This piece of code is a snippet from a function I'm trying to debug. The scraper worked fine and then suddenly stopped. The strange thing is that the class I'm searching for, "ipsColumn ipsColumn_fluid", is present in the post_soup produced in the second step of the loop.
As part of my debugging, I wanted to see what was being produced, which is the reason for the text file. However, it is empty. I have no idea why.
Any ideas?
from urllib.request import urlopen
from bs4 import BeautifulSoup

post_pages = ['https://coffeeforums.co.uk/topic/4843-a-little-thank-you/', 'https://coffeeforums.co.uk/topic/58690-for-sale-area-rules-changes-important/']

for topic_url in post_pages:
    post_page = urlopen(topic_url)
    post_soup = BeautifulSoup(post_page, 'lxml')
    messy_posts = post_soup.find_all('div', class_='ipsColumn ipsColumn_fluid')
    with open('messy_posts.txt', 'w') as f:  # note: 'w' truncates the file on every iteration
        f.write(str(messy_posts))
edit: you can swap in this variable to see how it should work. These websites are built on the same platform so the scrape should be the same (I would think):
post_pages = ['https://forum.cardealermagazine.co.uk/topic/8603-customer-comms-and-the-virus/', 'https://forum.cardealermagazine.co.uk/topic/10096-volvo-issue-heads-up/']
The class_ argument takes a list for matching multiple classes (an OR operation), not a space-separated string. You could change it from
class_='ipsColumn ipsColumn_fluid'
to this, and it should work:
class_=['ipsColumn', 'ipsColumn_fluid']
Alternatively, if you are going for an AND (where you want a div with both classes), I advise you to use select, like so:
post_soup.select('div.ipsColumn.ipsColumn_fluid')
This returns the divs that include both classes.
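To make the difference concrete, here is a minimal sketch; the markup is made up, and extraClass stands in for whatever additional classes the real divs carry:
from bs4 import BeautifulSoup

html = '''
<div class="ipsColumn ipsColumn_fluid extraClass">post body</div>
<div class="ipsColumn">sidebar</div>
'''
soup = BeautifulSoup(html, 'lxml')

# A space-separated string matches only the exact, complete class attribute,
# so it finds nothing here because of the extra class.
print(len(soup.find_all('div', class_='ipsColumn ipsColumn_fluid')))        # 0

# A list is an OR: any div carrying at least one of the classes matches.
print(len(soup.find_all('div', class_=['ipsColumn', 'ipsColumn_fluid'])))   # 2

# A CSS selector is an AND: only divs carrying both classes match.
print(len(soup.select('div.ipsColumn.ipsColumn_fluid')))                    # 1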
I am using Beautiful Soup to search an XML file provided by the SEC (this is public data). Beautiful Soup works very well for referencing tags, but I cannot seem to pass a variable to its find function. Static content is fine. I think there is a gap in my Python understanding that I can't seem to close. (I code a few days a year; it's not my main role.)
File:
https://reports.adviserinfo.sec.gov/reports/CompilationReports/IA_FIRM_SEC_Feed_02_08_2023.xml.gz
I download, unzip and then create the soup from the file using lxml.
with open(Firm_Download_name, 'r') as f:
    soup = BeautifulSoup(f, 'lxml')
Next is where I am running into trouble: I have a list of Firm CRD numbers (these are public numbers identifying the firms) that I am looking for in the XML file, pulling out various data points from the child tags of each match.
If I write it statically such as:
soup.find(firmcrdnb="5639055").parent
This works perfectly, but I want to loop through a list of CRD numbers and pull out a different block each time. I cannot figure out how to pass a variable to the soup.find function.
I feel like this should be simple. I appreciate any help you can provide.
Here is my current attempt:
searchstring = 'firmcrdnb="'+Firm_CRD+'"'
select_firm = soup.find(searchstring).parent
I have tried other similar setups and reviewed other Stack Exchange posts, such as "Is it possible to pass a variable to (Beautifulsoup) soup.find()?", but I'm just not quite getting it.
Here is an example of the XML.
<?xml version="1.0" encoding="iso-8859-1"?>
<IAPDFirmSECReport GenOn="2017-09-30">
  <Firms>
    <Firm>
      <Info SECRgnCD="MIRO" FirmCrdNb="9999" SECNb="999-99999" BusNm="XXXX INC." LegalNm="XXX INC" UmbrRgstn="N"/>
      <MainAddr Strt1="9999 XXXX" Strt2="XXXX" City="XXX" State="FL" Cntry="XXX" PostlCd="999999" PhNb="999-999-9999" FaxNb="999-999-9999"/>
      <MailingAddr Strt1="9999 XXXX" Strt2="XXXX" City="XXX" State="FL" Cntry="XXX" PostlCd="999999" />
      <Rgstn FirmType="Registered" St="APPROVED" Dt="9999-01-01"/>
      <NoticeFiled>
Thanks
PS: If anyone has ideas on how to improve the speed of the search on this large file, I'd appreciate that too. I get messages such as "pydevd warning: Computing repr of soup (BeautifulSoup) was slow (took 43.83s)". I did install and import chardet per the BeautifulSoup documentation, but that hasn't seemed to help.
I'm not sure where I got turned around, but my static answer did, in fact, not work.
The tag is "info" and the attribute is "firmcrdnb".
The answer that works was:
select_firm = soup.find("info", {"firmcrdnb" : Firm_CRD}).parent
Welcome to Stack Overflow!
Try using:
select_firm = soup.find(attrs={'firmcrdnb': str(Firm_CRD)}).parent
Maybe I'm missing something. If it works statically, have you tried something such as:
list_of_crds = ["11111","22222","33333"]
for crd in list_of_crds:
result = soup.find(firmcrdnb=crd).parent
...
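As for the speed question in the PS: a SoupStrainer tells BeautifulSoup to keep only the parts of the document you care about, which can shrink the tree (and the search time) dramatically. A sketch, assuming the data you need sits inside the Firm elements shown in the sample XML:
from bs4 import BeautifulSoup, SoupStrainer

# Parse only the <Firm> subtrees ('firm' because lxml lowercases tag names);
# everything outside them is discarded during parsing.
only_firms = SoupStrainer('firm')
with open(Firm_Download_name, 'r') as f:
    soup = BeautifulSoup(f, 'lxml', parse_only=only_firms)
The pydevd warning, for what it's worth, just means the debugger tried to compute a preview of the whole soup object; it's a symptom of the tree's size rather than a separate problem.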
I am new to scraping with Python and have encountered a weird issue.
I am attempting to scrape OCR'd newspaper articles from a list of URLs using Selenium; the proxy settings on the data source make this easier than the alternatives.
However, I receive tracebacks for the text data every time I run my code. Here is the code that I am using:
article_links = []
for link in driver.find_elements_by_xpath('/html/body/div[1]/main/section[1]/ul[2]/li[*]/div[2]/div[1]/h3/a'):
    links = link.get_attribute("href")
    article_links.append(links)

articles = []
for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    driver.find_element_by_css_selector("#js-doc-explorer-show-additional-views").click()
    time.sleep(1)
    for article_text in driver.find_elements_by_css_selector("#ocr-container > div.fulltext-ocr.js-page-ocr"):
        articles.append(article_text)
I come closest to solving the issue by using .click(), which opens a hidden panel containing my data. However, with this code, the only data that gets filled in is the last row of the dataset. Without the .click(), all rows come back empty. Changing the sleep settings does not help either.
The XPath for the text data is:
/html/body/div[2]/main/section/div[2]/div[2]/section[2]/div/div[4]/text()
Alternatively, is there a way to get each link's source code and parse it with BeautifulSoup after the fact?
UPDATE: There has to be something wrong with the loops -- I can get either the first or last values, but nothing in between.
In more recent versions of Selenium, the method find_elements_by_xpath() is deprecated. Is that the issue you are facing? If it is, import By (from selenium.webdriver.common.by import By) and change the call to find_elements(By.XPATH, ...). Similarly, find_elements_by_css_selector() is replaced with find_elements(By.CSS_SELECTOR, ...).
You don't specify whether this is even the issue, but if it is, I hope this helps :-)
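For reference, here is a sketch of the question's lookups rewritten against the Selenium 4 locator API (selectors copied verbatim from the question):
from selenium.webdriver.common.by import By

links = driver.find_elements(By.XPATH, '/html/body/div[1]/main/section[1]/ul[2]/li[*]/div[2]/div[1]/h3/a')
driver.find_element(By.CSS_SELECTOR, "#js-doc-explorer-show-additional-views").click()
texts = driver.find_elements(By.CSS_SELECTOR, "#ocr-container > div.fulltext-ocr.js-page-ocr")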
The solution is found by calling the relevant (unique) class and specifying that it must contain text.
news = []
for article in article_links:
    driver2.get(article)
    driver2.find_element(By.CSS_SELECTOR, "#js-doc-explorer-show-additional-views").click()
    article_text = driver2.find_element(By.XPATH, '//div[@class="fulltext-ocr js-page-ocr"][contains(text()," ")]')
    news.append([article_text.text])
A classic case of "the code used to work, I changed nothing, and now it doesn't" here. I'm trying to extract a list of unique appid values from this page, which I'm saving locally as roguelike.html.
The code I have looks like this, and it worked as of a couple of months ago when I last ran it, but now the end result is a list of length 1 containing just None. Any ideas as to what's going wrong here?
from bs4 import BeautifulSoup
text_file = open("roguelike.html", "rb")
steamdb_text = text_file.read()
text_file.close()
soup = BeautifulSoup(steamdb_text, "html.parser")
trs = [tr for tr in soup.find_all('tr')]
apps = []
for app in soup.find_all('tr'):
    apps.append(app.get('data-appid'))
appset = list(set(apps))
Is there a simpler way to get the unique appids from the page source? The individual elements I'm trying to cycle over and grab look like:
<tr class="app" data-appid="98821" data-cache="1533726913">
where I want all the unique data-appid values. I'm scratching my head trying to figure out whether the formatting of the page changed (it doesn't seem like it), or whether some kind of version upgrade in Spyder, Python, or BeautifulSoup broke something that used to work.
Any ideas?
I tried this code and it worked well for me. You should make sure that the HTML file you have is the right file. Perhaps you've hit a captcha test in the saved HTML.
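As for the "simpler way" part of the question, an attribute selector plus a set comprehension collapses the whole loop, and it also skips rows that have no data-appid (which is where None entries come from). A sketch:
from bs4 import BeautifulSoup

with open("roguelike.html", "rb") as f:
    soup = BeautifulSoup(f, "html.parser")

# Select only <tr> elements that actually carry a data-appid attribute;
# the set comprehension removes duplicates.
appids = {tr["data-appid"] for tr in soup.select("tr[data-appid]")}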
UPDATE (4/10/2018):
So I found that my problem was that the information wasn't available in the source code, which means I have to use Selenium.
UPDATE:
I played around with this problem a bit more. Instead of running soup, I took pageH, decoded it into a string, and wrote it out to a text file, and I found that '{{ optionTitle }}' and '{{priceFormat (showPrice, session.currency)}}' come from the template section stated separately in the HTML file. Which I THINK means that I was just looking in the wrong place. I am still unsure, but that's what I think.
So now I have a new question. After having looked at the text file, I am realizing that the necessary information is not even in pageH. At the place where it should give me the information I am looking for, it says instead:
<bread-crumbs :location="location" :product-name="product.productName"></bread-crumbs>
<product-info ref="productInfo" :basic="product" :location="location" :prod-info="prodInfo"></product-info>
What does this mean? Is there a way to get past this to the information?
ORIGINAL QUESTION:
I am trying to collect the names/prices of products from a website. I am unsure whether the data is being lost because of the HTML parser or because of BeautifulSoup, but what is happening is that once I do get to the position I want to be in, what is returned instead of the specific name/price is '{{ optionTitle }}' or '{{priceFormat (showPrice, session.currency)}}'. After I get the URL using pageH = urllib.request.urlopen(), the code that gives this result is:
pageS = soup(pageH, "html.parser")
pageB = pageS.body
names = pageB.findAll("h4")
optionTitle = names[3].get_text()
optionPrice = names[5].get_text()
Because this didn't work, I tried going about it a different way and looked for more specific tags, but the section of the code that matters just does not show up. It completely disappears. Is there something I can do to get the specific names/prices, or is this a security measure that I cannot work around?
The {{ }} syntax looks like Angular. Try Requests-HTML to do the rendering (by using render()) and get the content afterward. An example is shown below:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://python-requests.org/')
r.html.render()
r.html.search('Python 2 will retire in only {months} months!')['months']
# '<time>25</time>'
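Alternatively, since the update above concluded that the data only appears once the browser has rendered it, a plain Selenium + BeautifulSoup combination works too. A sketch with a placeholder URL:
from selenium import webdriver
from bs4 import BeautifulSoup as soup

driver = webdriver.Chrome()
driver.get("https://example.com/product-page")  # placeholder URL
# An explicit wait may be needed here so the JavaScript finishes rendering.
pageS = soup(driver.page_source, "html.parser")
names = pageS.body.findAll("h4")  # same lookup as in the question
driver.quit()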
First of all: I'm creating an XML document with Python BeautifulSoup.
Currently, what I'm trying to create is very similar to this example:
<options>
  <opt name='string'>ContentString</opt>
  <opt name='string'>ContentString</opt>
  <opt name='string'>ContentString</opt>
</options>
Notice that each opt tag should carry only one attribute, called name.
As the options can be far more numerous, and differently named as well, I decided to create a little Python function to help me create such a result.
array = ['FirstName', 'SecondName', 'ThirdName']
# This list is the guideline for the function: it says how many options will
# be in the result and what the option tags will be called.

def create_options(array):
    soup.append(soup.new_tag('options'))
    if len(array) > 0:  # Small check so you can see if the given array is empty for any reason. Optional.
        for i in range(len(array)):
            soup.options.append(soup.new_tag('opt'))
            # With BeautifulSoup methods, we create opt tags inside the options
            # tag, exactly as many as there are items in the parsed array.
        counter = 0
        # There's the option to use Python's range() method instead, but for
        # testing purposes the current approach is sufficient.
        for tag in soup.options.find_all():
            soup.options.find('opt')['name'] = str(array[counter])
            # Notice that in this part the name is assigned only to the first
            # opt element. We'll discuss this next.
            counter += 1
        print len(array), ' options were created.'
    else:
        print 'No options were created.'
You'll notice that in the function, the name assignment is handled by a for loop which, unfortunately, assigns all the different names to the first opt in the options element.
BeautifulSoup has .next_sibling and .previous_sibling, which can help me in this task.
As their names suggest, with them I can access the next or previous sibling of an element. So, with this example:
soup.options.find('opt').next_sibling['name'] = str(array[counter])
we can access the second child of the options element. So, if we add .next_sibling to each soup.options.find('opt'), we can move from the first element to the next.
The problem is that by finding the option element in options with:
soup.options.find('opt')
we access the first opt each time. But my function needs to reach the next option for each further item in the list. That means the more items there are in the list, the more .next_sibling calls must be chained onto the first option.
As a result, with the logic I constructed, accessing the relevant option for the 4th (or a later) item in the list, to assign its appropriate name, would look like this:
soup.options.find('opt').next_sibling.next_sibling.next_sibling.next_sibling['name'] = str(array[counter])
And now we are ready for my questions:
1st: As I didn't find any other method using Python BeautifulSoup, I'm not sure my approach is the only way. Is there any other method?
2nd: How could I achieve the result with this approach, given that, as my experiments show me, I can't put a variable inside the method chain (so that I could multiply the methods)?
# Like this
thirdoption = .next_sibling.next_sibling.next_sibling
# It's not actually valid, but it's just an example.
soup.options.find('opt').next_sibling.next_sibling.next_sibling['name'] = str(array[counter])
3rd: Maybe I read the BeautifulSoup documentation badly and just didn't find the method which could help me in this task?
I managed to achieve the result, ignoring BeautifulSoup methods.
Python has ElementTree methods, which were sufficient to work with.
So, let me show the example code and explain what it does. The comments provide the explanation more precisely.
"""
Before this code, there goes soup xml document generation. Except part, I mentioned in topic, we just create empty options tags in document, thus, creating almost done document.
Right after that, with this little script, we will use basic python provided element tree methods.
"""
import xml.etree.ElementTree as ET
ET_tree = ET.parse("exported_file.xml")
# Here we import exactly the same file, we opened with soup. Exporting can be done in different file, if you wish.
ET_root = ET_tree.getroot()
for position, opt in enumerate(item.find('options')):
# Position is pretty important, as this will remove 'counter' thing in for loop, I was using in code in first example. Position will be used for getting out exact items from array, which works like template for our option tag names.
opt.set('name', str(array[position]))
opt.text = 'text'
# Same way, with position, we can get data from relevant array, provided, that they are inherited or connected in same way.
tree = ET.ElementTree(ET_root).write('exported_file.xml',encoding="UTF-8",xml_declaration=True)
# This part was something, I researched before quite lot. This code will help save xml document with utf-8 encoding, which is very important.
This approach is pretty inefficient, as I could have used ElementTree for everything to achieve the same result.
BeautifulSoup, though, prepares the document with nice output, which is very neat, whereas ElementTree creates files with a software-friendly look only.
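For completeness, the whole thing can also stay in BeautifulSoup with no sibling traversal at all, by setting each opt's name at creation time. A minimal sketch:
from bs4 import BeautifulSoup

array = ['FirstName', 'SecondName', 'ThirdName']

soup = BeautifulSoup('', 'xml')  # empty XML document (requires lxml)
options = soup.new_tag('options')
soup.append(options)
for name in array:
    opt = soup.new_tag('opt')
    opt['name'] = name            # attribute set at creation time,
    opt.string = 'ContentString'  # so no next_sibling chains are needed
    options.append(opt)

print(soup.prettify())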