I am using Beautiful Soup to search an XML file provided by the SEC (this is public data). Beautiful Soup works very well for referencing tags, but I cannot seem to pass a variable to its find function. Static content is fine. I think there is a gap in my Python understanding that I can't quite pin down. (I code a few days a year; it's not my main role.)
File:
https://reports.adviserinfo.sec.gov/reports/CompilationReports/IA_FIRM_SEC_Feed_02_08_2023.xml.gz
I download and unzip the file, then create the soup from it using lxml:
from bs4 import BeautifulSoup

with open(Firm_Download_name, 'r') as f:
    soup = BeautifulSoup(f, 'lxml')
Next is where I am running into trouble. I have a list of firm CRD numbers (these are public numbers identifying the firm) that I am looking for in the XML file, so that I can pull various data points out of the child tags.
If I write it statically such as:
soup.find(firmcrdnb="5639055").parent
This works perfectly, but I want to loop through a list of CRD numbers and pull out a different block each time. I cannot figure out how to pass a variable to the soup.find function.
I feel like this should be simple. I appreciate any help you can provide.
Here is my current attempt:
searchstring = 'firmcrdnb="'+Firm_CRD+'"'
select_firm = soup.find(searchstring).parent
I have tried other similar setups and reviewed other Stack Overflow questions such as Is it possible to pass a variable to (Beautifulsoup) soup.find()?, but I'm just not quite getting it.
Here is an example of the XML.
<?xml version="1.0" encoding="iso-8859-1"?>
<IAPDFirmSECReport GenOn="2017-09-30">
  <Firms>
    <Firm>
      <Info SECRgnCD="MIRO" FirmCrdNb="9999" SECNb="999-99999" BusNm="XXXX INC." LegalNm="XXX INC" UmbrRgstn="N"/>
      <MainAddr Strt1="9999 XXXX" Strt2="XXXX" City="XXX" State="FL" Cntry="XXX" PostlCd="999999" PhNb="999-999-9999" FaxNb="999-999-9999"/>
      <MailingAddr Strt1="9999 XXXX" Strt2="XXXX" City="XXX" State="FL" Cntry="XXX" PostlCd="999999" />
      <Rgstn FirmType="Registered" St="APPROVED" Dt="9999-01-01"/>
      <NoticeFiled>
Thanks
PS: if anyone has ideas on how to improve the speed of the search on this large file, I'd appreciate that too. I get messages such as "pydevd warning: Computing repr of soup (BeautifulSoup) was slow (took 43.83s)". I did install and import chardet per the BeautifulSoup documentation, but that hasn't seemed to help.
I'm not sure where I got turned around, but my static answer did not in fact work.
The tag is "info" and the attribute is "firmcrdnb".
The answer that works was:
select_firm = soup.find("info", {"firmcrdnb" : Firm_CRD}).parent
Welcome to Stack Overflow.
Try using:
select_firm = soup.find(attrs={'firmcrdnb': str(Firm_CRD)}).parent
Maybe I'm missing something. If it works statically, have you tried something such as:
list_of_crds = ["11111","22222","33333"]
for crd in list_of_crds:
    result = soup.find(firmcrdnb=crd).parent
    ...
Related
Classic case of "the code used to work, I changed nothing, and now it doesn't work any more" here. I'm trying to extract a list of unique appid values from this page, which I'm saving locally as roguelike.html.
The code I have looks like this, and it used to work as of a couple of months ago when I last ran it, but now the end result is a list of length 1 with just a NoneType in it. Any ideas as to what's going wrong here?
from bs4 import BeautifulSoup
text_file = open("roguelike.html", "rb")
steamdb_text = text_file.read()
text_file.close()
soup = BeautifulSoup(steamdb_text, "html.parser")
trs = [tr for tr in soup.find_all('tr')]
apps = []
for app in soup.find_all('tr'):
    apps.append(app.get('data-appid'))
appset = list(set(apps))
Is there a simpler way to get the unique appids from the page source? The individual elements I'm trying to cycle over and grab look like:
<tr class="app" data-appid="98821" data-cache="1533726913">
where I want all the unique data-appid values. I'm scratching my head trying to figure out whether the formatting of the page changed (it doesn't seem like it) or whether some version upgrade in Spyder, Python, or BeautifulSoup broke something that used to work.
Any ideas?
I tried this code and it worked well for me. You should make sure that the HTML file you have is the right file; perhaps you've hit a captcha test in the saved page.
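To the "simpler way" part of the question: assuming the <tr class="app" data-appid="..."> markup shown above, a set comprehension over a CSS selector collects the unique ids in one pass and skips any rows that lack the attribute (so no None ends up in the result):

from bs4 import BeautifulSoup

with open("roguelike.html", "rb") as text_file:
    soup = BeautifulSoup(text_file.read(), "html.parser")

# select() only returns rows that actually carry data-appid
appset = {tr["data-appid"] for tr in soup.select("tr[data-appid]")}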
I'm a bit stuck on a problem with BeautifulSoup. This piece of code is a snippet from a function I'm trying to debug. The scraper worked fine and then suddenly stopped. The strange thing is that the class I'm searching for, "ipsColumn ipsColumn_fluid", is present in the post_soup that is produced in the second step of the loop.
As part of my debugging, I wanted to see what was produced, which is the reason for the text file. However, it is empty. I have no idea why.
Any ideas?
from urllib.request import urlopen
from bs4 import BeautifulSoup

post_pages = ['https://coffeeforums.co.uk/topic/4843-a-little-thank-you/', 'https://coffeeforums.co.uk/topic/58690-for-sale-area-rules-changes-important/']

for topic_url in post_pages:
    post_page = urlopen(topic_url)
    post_soup = BeautifulSoup(post_page, 'lxml')
    messy_posts = post_soup.find_all('div', class_='ipsColumn ipsColumn_fluid')

    with open('messy_posts.txt', 'w') as f:
        f.write(str(messy_posts))
Edit: you can swap in this variable to see how it should work. These websites are built on the same platform, so the scrape should be the same (I would think):
post_pages = ['https://forum.cardealermagazine.co.uk/topic/8603-customer-comms-and-the-virus/', 'https://forum.cardealermagazine.co.uk/topic/10096-volvo-issue-heads-up/']
class_ takes a list for multiple classes, not a space-separated string, when you want an OR operation. You could change it from
class_='ipsColumn ipsColumn_fluid'
to this:
class_=['ipsColumn', 'ipsColumn_fluid']
and it should work.
Alternatively, if you are going for an AND (where you want a div with both classes), I advise you to use select, like so:
post_soup.select('div.ipsColumn.ipsColumn_fluid')
This would return the divs that include both classes.
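A small side-by-side sketch of the difference, using the post_soup from the question:

# OR: divs that carry either class (list form of class_)
either = post_soup.find_all('div', class_=['ipsColumn', 'ipsColumn_fluid'])

# AND: divs that carry both classes (CSS selector form)
both = post_soup.select('div.ipsColumn.ipsColumn_fluid')

print(len(either), len(both))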
UPDATE (4/10/2018):
So I found that my problem was that the information wasn't available in the source code, which means I have to use Selenium.
UPDATE:
I played around with this problem a bit more. What I did was, instead of running soup, I just took pageH, decoded it into a string, and made a text file out of it, and I found that the '{{ optionTitle }}' and '{{priceFormat (showPrice, session.currency)}}' came from the template section stated separately in the HTML file. Which I THINK means that I was just looking in the wrong place. I am still unsure, but that's what I think.
So now I have a new question. After having looked at the text file, I am now realizing that the necessary information is not even in pageH. At the place where it should give me the information I am looking for, it says this instead:
<bread-crumbs :location="location" :product-name="product.productName"></bread-crumbs>
<product-info ref="productInfo" :basic="product" :location="location" :prod-info="prodInfo"></product-info>
What does this mean? Is there a way to get through this to get to the information?
ORIGINAL QUESTION:
I am trying to collect the names/prices for products off of a website. I am unsure whether the data is being lost because of the HTML parser or because of BeautifulSoup, but what is happening is that once I do get to the position I want to be in, what is returned instead of the specific name/price is '{{ optionTitle }}' or '{{priceFormat (showPrice, session.currency)}}'. After I get the URL using pageH = urllib.request.urlopen(), the code that gives this result is:
pageS = soup(pageH, "html.parser")
pageB = pageS.body
names = pageB.findAll("h4")
optionTitle = names[3].get_text()
optionPrice = names[5].get_text()
Because this didn't work, I tried going about it a different way and looked for more specific tags, but the section of the code that mattered just does not show. It completely disappears. Is there something I can do to get the specific names/prices or is this a security measure that I cannot work through?
The {{ }} syntax looks like a client-side JavaScript template (Angular or Vue), so those values only get filled in after the page's scripts run. Try Requests-HTML to do the rendering (by using render()) and get the content afterwards. Example shown below:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://python-requests.org/')
r.html.render()
r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>'
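If the Requests-HTML renderer isn't an option, the Selenium route mentioned in the asker's update looks roughly like this; a sketch only, with a placeholder URL, a crude sleep instead of a proper wait, and the same h4 lookup as the original code:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()                    # needs Chrome installed
driver.get("https://example.com/product")      # placeholder for the real product URL
time.sleep(3)                                  # crude wait for the client-side templates to render
html = driver.page_source                      # HTML after {{ ... }} has been filled in
driver.quit()

pageS = BeautifulSoup(html, "html.parser")
names = pageS.body.find_all("h4")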
I am starting to program in Python and have been reading a couple of posts where they say that I should use an HTML parser to get a URL from text rather than re.
I have the source code which I got from page.read() with the urllib and urlopen.
Now, my problem is that the parser is removing the URL part from the text.
Also, if I have read correctly, with var = page.read(), var is stored as a string?
How can I tell it to give me the text between two "tags"? The URL is always between flv= and ;, so it doesn't start with href, which is what the parsers look for, and it doesn't contain http:// either.
I have read many posts, but it seems they all look for href in the code.
Do I have it all completely wrong?
Thank you!
You could consider implementing your own search/grab. In pseudocode, it would look a little like this:
find location of 'flv=' in HTML = location_start
find location of ';' in HTML = location_end
grab everything in between: HTML[location_start : location_end]
You should be able to implement this in Python.
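For example, a direct translation of that pseudocode, where html is the string returned by page.read():

start = html.find('flv=') + len('flv=')   # index just past 'flv='
end = html.find(';', start)               # first ';' after that point
url = html[start:end]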
Good luck!
I'm trying to get this table http://www.datamystic.com/timezone/time_zones.html into array format so I can do whatever I want with it, preferably in PHP, Python, or JavaScript.
This is the kind of problem that comes up a lot, so rather than looking for help with this specific problem, I'm looking for ideas on how to solve all similar problems.
BeautifulSoup is the first thing that comes to mind.
Another possibility is copying/pasting it in TextMate and then running regular expressions.
What do you suggest?
This is the script that I ended up writing, but as I said, I'm looking for a more general solution.
from BeautifulSoup import BeautifulSoup
import urllib2
url = 'http://www.datamystic.com/timezone/time_zones.html';
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
tables = soup.findAll("table")
table = tables[1]
rows = table.findAll("tr")
for row in rows:
    tds = row.findAll('td')
    if(len(tds)==4):
        countrycode = tds[1].string
        timezone = tds[2].string
        if(type(countrycode) is not type(None) and type(timezone) is not type(None)):
            print "\'%s\' => \'%s\'," % (countrycode.strip(), timezone.strip())
Comments and suggestions for improvement to my Python code are welcome, too ;)
For your general problem: try lxml.html from the lxml package (think of it as the stdlib's xml.etree on steroids: the same XML API, but with HTML support, XPath, XSLT, etc.).
A quick example for your specific case:
from lxml import html
tree = html.parse('http://www.datamystic.com/timezone/time_zones.html')
table = tree.findall('//table')[1]
data = [
    [td.text_content().strip() for td in row.findall('td')]
    for row in table.findall('tr')
]
This will give you a nested list: each sub-list corresponds to a row in the table and contains the data from the cells. The sneakily inserted advertisement rows are not filtered out yet, but it should get you on your way. (and by the way: lxml is fast!)
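If you do want to drop the advertisement rows (and the header rows), filtering on the cell count, much like the len(tds)==4 check in the original script, works on the nested list too, assuming the ad rows don't use the same four-cell layout:

data = [row for row in data if len(row) == 4]   # keep only the real four-column rows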
BUT, more specifically for your particular use case: there are better ways to get at timezone database information than scraping that particular web page (aside: note that the page actually mentions that you are not allowed to copy its contents). There are even existing libraries that already use this information; see, for example, python-dateutil.
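For instance, with python-dateutil the timezone data already ships with the library, so something along these lines replaces the scrape entirely (a minimal sketch):

from datetime import datetime
from dateutil import tz

zone = tz.gettz('Europe/Amsterdam')   # look a zone up by name, no scraping needed
print(datetime.now(tz=zone))          # current time in that zone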
Avoid regular expressions for parsing HTML; they're simply not appropriate for it. You want a DOM parser like BeautifulSoup for sure.
A few other alternatives:
SimpleHTMLDom (PHP)
Hpricot and Nokogiri (Ruby)
Web::Scraper (Perl/CPAN)
All of these are reasonably tolerant of poorly formed HTML.
I suggest loading the document with an XML parser like DOMDocument::loadHTMLFile, which is bundled with PHP, and then using XPath to grep the data you need.
This is not the fastest way, but the most readable (in my opinion) in the end. You could use a regex, which would probably be a little faster, but that would be bad style (hard to debug, hard to read).
EDIT: Actually this is hard, because the page you mentioned is not valid HTML (see validator.w3.org). In particular, tags with no opening/closing counterpart get heavily in the way.
It looks, though, like xmlstarlet (http://xmlstar.sourceforge.net/, a great tool) is able to repair the problem (run xmlstarlet fo -R). xmlstarlet can also do XPath and XSLT scripts, which can help you extract your data with a simple shell script.
While we were building SerpAPI, we tested many platforms/parsers.
Here is the benchmark result for Python.
For more, here is a full article on Medium:
https://medium.com/@vikoky/fastest-html-parser-available-now-f677a68b81dd
A regex is more efficient than a DOM parser.
Look at this comparison:
http://www.rockto.com/launcher/28852/mochien.com/Blog/Read/A300111001736/Regex-VS-DOM-untuk-Rockto-Team
You can find many more comparisons by searching the web.