I have written a pretty simple script to get the first result for any term on urbandictionary.com. I started by writing a simple test to see how their pages are formatted:
import urllib

def parseudtest(searchurl):
    url = 'http://www.urbandictionary.com/define.php?term=%s' % searchurl
    url_info = urllib.urlopen(url)
    for lines in url_info:
        print lines
For a test, I searched for 'cats', and used that as the variable searchurl. The output I receive is of course a gigantic page, but here is the part I care about:
<meta content='He set us up the bomb. Also took all our base.' name='Description' />
<meta content='He set us up the bomb. Also took all our base.' property='og:description' />
<meta content='cats' property='og:title' />
<meta content="http://static3.urbandictionary.com/rel-1e0b481/images/og_image.png" property="og:image" />
<meta content='Urban Dictionary' property='og:site_name' />
As you can see, the first time the element "meta content" appears on the site, it is the first definition for the search term. So I wrote this code to retrieve it:
import urllib
from xml.dom import minidom

def parseud(searchurl):
    url = 'http://www.urbandictionary.com/define.php?term=%s' % searchurl
    url_info = urllib.urlopen(url)
    if url_info:
        xmldoc = minidom.parse(url_info)
        if xmldoc:
            definition = xmldoc.getElementsByTagName('meta content')[0].firstChild.data
            print definition
For some reason the parsing doesn't work and invariably hits an error. It is especially confusing since the site appears to use basically the same format as other sites I have successfully retrieved specific data from. If anyone could help me figure out what I am messing up here, it would be greatly appreciated.
As you don't give the traceback for the errors that occur, it's hard to be specific, but I assume that although the site claims to be XHTML, it's not actually valid XML. You'd be better off using Beautiful Soup, as it is designed for parsing HTML and will correctly handle broken markup.
I have never used the minidom parser, but I think the problem is that you call:
xmldoc.getElementsByTagName('meta content')
while the tag name is just meta; content is its first attribute (as shown pretty well by the highlighting of your HTML code).
Try to replace that bit with:
xmldoc.getElementsByTagName('meta')
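For what it's worth, here is a minimal sketch of how the corrected call might fit into your function. Note that <meta> elements are empty, so the definition lives in the content attribute rather than in a child text node; firstChild.data would fail even with the right tag name. This still assumes the page parses as valid XML, which, as the other answer points out, it may well not:

import urllib
from xml.dom import minidom

def parseud(searchurl):
    url = 'http://www.urbandictionary.com/define.php?term=%s' % searchurl
    xmldoc = minidom.parse(urllib.urlopen(url))
    # read the 'content' attribute of the first <meta> element,
    # since <meta> has no child nodes to pull text from
    first_meta = xmldoc.getElementsByTagName('meta')[0]
    print first_meta.getAttribute('content')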
I am using Beautiful Soup to search an XML file provided by the SEC (this is public data). Beautiful Soup works very well for referencing tags, but I cannot seem to pass a variable to its find function; static content is fine. I think there is a gap in my Python understanding that I can't seem to figure out. (I code a few days a year, not my main role.)
File:
https://reports.adviserinfo.sec.gov/reports/CompilationReports/IA_FIRM_SEC_Feed_02_08_2023.xml.gz
I download, unzip, and then create the soup from the file using lxml:

from bs4 import BeautifulSoup

with open(Firm_Download_name, 'r') as f:
    soup = BeautifulSoup(f, 'lxml')
Next is where I am running into trouble. I have a list of Firm CRD numbers (these are public numbers identifying the firms) that I am looking for in the XML file, so I can then pull out various data points from the child tags.
If I write it statically such as:
soup.find(firmcrdnb="5639055").parent
This works perfectly, but I want to loop through a list of CRD numbers and pull out a different block each time. I cannot figure out how to pass a variable to the soup.find function.
I feel like this should be simple. I appreciate any help you can provide.
Here is my current attempt:
searchstring = 'firmcrdnb="'+Firm_CRD+'"'
select_firm = soup.find(searchstring).parent
I have tried other similar setups and reviewed other Stack Exchange questions, such as Is it possible to pass a variable to (Beautifulsoup) soup.find()?, but I'm just not quite getting it.
Here is an example of the XML:
<?xml version="1.0" encoding="iso-8859-1"?>
<IAPDFirmSECReport GenOn="2017-09-30">
<Firms>
<Firm>
<Info SECRgnCD="MIRO" FirmCrdNb="9999" SECNb="999-99999" BusNm="XXXX INC." LegalNm="XXX INC" UmbrRgstn="N"/>
<MainAddr Strt1="9999 XXXX" Strt2="XXXX" City="XXX" State="FL" Cntry="XXX" PostlCd="999999" PhNb="999-999-9999" FaxNb="999-999-9999"/>
<MailingAddr Strt1="9999 XXXX" Strt2="XXXX" City="XXX" State="FL" Cntry="XXX" PostlCd="999999" />
<Rgstn FirmType="Registered" St="APPROVED" Dt="9999-01-01"/>
<NoticeFiled>
Thanks
PS: If anyone has ideas on how to improve the speed of the search on this large file, I'd appreciate that too. I get messages such as "pydevd warning: Computing repr of soup (BeautifulSoup) was slow (took 43.83s)". I did install and import chardet per the BeautifulSoup documentation, but that hasn't seemed to help.
I'm not sure where I got turned around, but my static answer did in fact not work.
The tag is "info" and the attribute is "firmcrdnb".
The answer that works was:
select_firm = soup.find("info", {"firmcrdnb" : Firm_CRD}).parent
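For completeness, a minimal sketch of how this fits into the loop over CRD numbers (the CRD values and the busnm read below are just illustrative, taken from the sample XML; the lxml parser lowercases tag and attribute names, which is why Info/FirmCrdNb are queried as info/firmcrdnb):

firm_crds = ["5639055", "9999"]  # hypothetical list of CRD numbers
for crd in firm_crds:
    info = soup.find("info", {"firmcrdnb": crd})
    if info is None:
        continue  # no firm with that CRD in the file
    firm = info.parent  # the enclosing <Firm> block
    print(info["busnm"])  # e.g. the business name attribute from <Info>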
Welcome to Stack Overflow. Try:
select_firm = soup.find(attrs={'firmcrdnb': str(Firm_CRD)}).parent
Maybe I'm missing something. If it works statically, have you tried something such as:
list_of_crds = ["11111", "22222", "33333"]
for crd in list_of_crds:
    result = soup.find(firmcrdnb=crd).parent
    ...
I'm trying to write a BeautifulSoup object to a file. Note that I append something to the soup object: a div containing HTML/JavaScript from Plotly's to_html() function, which gives me a chart in HTML form. I narrowed down the problem to the following code:
from bs4 import BeautifulSoup
file_writer = open("path/to/file", "w")
html_outline = """<html>
<head></head>
<body>
<p>Hello World!</p>
<div></div>
</body>
</html>"""
soup = BeautifulSoup(html_outline, "html.parser")
soup.div.append({plotly HTML/JavaScript})
file_writer.write(soup)
file_writer.close()
Inside the write function, I've tried various ways of converting the soup object to a string, like str(soup), soup.prettify(), and more that I'm forgetting. Those do successfully write to the file, but the angled brackets ("<>") from the Plotly HTML I insert become HTML entities (I believe that's what they're called), so a
<div>
becomes:
&lt;div&gt;
inside the file I write to. I will note here that only the angled brackets for the HTML I appended into the soup object turn into HTML entities, the html, head, and body tags are all proper angled brackets.
My question is, how can I convert the soup object directly into a string that has proper angled brackets and no HTML entities?
I guess I can maybe write a function that parses the file for those HTML entities and replaces them with proper angled brackets, but I'm hoping there's a better solution before I do that. I tried searching this problem up multiple times but nothing came up for it.
I asked this question previously, but it was marked as a duplicate; the linked duplicate didn't help because it was about adding empty tags. Here I'm appending a whole existing div with JavaScript and other content to my soup object.
Thanks in advance!
I found out that I was able to use bs4's .prettify() function, but I had to change the formatter to None. So my line of code that writes the HTML to the file becomes:
file_writer.write(soup.prettify(formatter=None))
This isn't best practice because, according to bs4's docs, it may generate invalid HTML/XML. I know the docs say prettify() should convert HTML entities to Unicode characters by default, so I'm not sure why that didn't work for me. While I'm no longer in urgent need of a solution, I posted this because I thought someone might find it useful in the future. Hopefully someone can give a better solution, though!
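One alternative worth sketching, as an assumption rather than a tested answer: the escaping happens because append() treats a plain string as a text node, and bs4 entity-encodes text nodes on output. Parsing the Plotly fragment into its own soup first and appending that keeps the markup as markup; plotly_html below is a hypothetical variable holding the to_html() output:

from bs4 import BeautifulSoup

fragment = BeautifulSoup(plotly_html, "html.parser")  # parse, don't append raw text
soup.div.append(fragment)

with open("path/to/file", "w") as f:
    f.write(str(soup))  # brackets survive: the div is now real markup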
Python + programming noob here, so you may have to bear with me. I have a number of XML files (RSS archives) and I want to extract news article URLs from them. I'm using Python 2.7.3 on Windows... and here's an example of the code I'm looking at:
<feed xmlns:media="http://search.yahoo.com/mrss/" xmlns:gr="http://www.google.com/schemas/reader/atom/" xmlns:idx="urn:atom-extension:indexing" xmlns="http://www.w3.org/2005/Atom" idx:index="no" gr:dir="ltr">
<!--
Content-type: Preventing XSRF in IE.
-->
<generator uri="http://www.google.com/reader">Google Reader</generator>
<id>
tag:google.com,2005:reader/feed/http://feeds.smh.com.au/rssheadlines/national.xml
</id>
<title>The Sydney Morning Herald National Headlines</title>
<subtitle type="html">
The top National headlines from The Sydney Morning Herald. For all the news, visit http://www.smh.com.au.
</subtitle>
<gr:continuation>CJPL-LnHybcC</gr:continuation>
<link rel="self" href="http://www.google.com/reader/atom/feed/http://feeds.smh.com.au/rssheadlines/national.xml?n=1000&c=%5BC%5D"/>
<link rel="alternate" href="http://www.smh.com.au/national" type="text/html"/>
<updated>2013-06-16T07:55:56Z</updated>
<entry gr:is-read-state-locked="true" gr:crawl-timestamp-msec="1371369356359">
<id gr:original-id="http://news.smh.com.au/breaking-news-sport/daley-opts-for-dugan-for-origin-two-20130616-2oc5k.html">tag:google.com,2005:reader/item/dabe358abc6c18c5</id>
<category term="user/03956512242887934409/state/com.google/read" scheme="http://www.google.com/reader/" label="read"/>
<title type="html">Daley opts for Dugan for Origin two</title>
<published>2013-06-16T07:12:11Z</published>
<updated>2013-06-16T07:12:11Z</updated>
<link rel="alternate" href="http://rss.feedsportal.com/c/34697/f/644122/s/2d5973e2/l/0Lnews0Bsmh0N0Bau0Cbreaking0Enews0Esport0Cdaley0Eopts0Efor0Edugan0Efor0Eorigin0Etwo0E20A130A6160E2oc5k0Bhtml/story01.htm" type="text/html"/>
Specifically, I want to extract the "original-id" link:
<id gr:original-id="http://news.smh.com.au/breaking-news-sport/daley-opts-for-dugan-for-origin-two-20130616-2oc5k.html">tag:google.com,2005:reader/item/dabe358abc6c18c5</id>
I originally tried using BeautifulSoup for this but ran into problems, and from the research I did it looks like ElementTree is the way to go. First off, with ET I tried:
import xml.etree.ElementTree as ET

tree = ET.parse('thefile.xml')
root = tree.getroot()
#first_original_id = root[8][0]
parents_of_interest = root[8::]
for elem in parents_of_interest:
    print elem.items()[0][1]
So far as I can work out, parents_of_interest does grab the data I want (as a list of dictionaries), but the for loop only returns a bunch of true statements, and after reading the documentation and SO it seems like this is the wrong approach.
I think this has the answer I'm looking for, but even though it's a good explanation, I can't seem to apply it to my own situation. From that answer I tried:
print tree.find('//{http://www.w3.org/2005/Atom}entry}id').text
But got the error:
__main__:1: FutureWarning: This search is broken in 1.3 and earlier, and will be fixed in a future version. If you rely
on the current behaviour, change it to './/{http://www.w3.org/2005/Atom}entry}id'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'text'
Any help on this would be appreciated... and sorry if that's a verbose question... but I thought I'd detail everything... just in case.
Your XPath expression matches the first id, not the one you're looking for, and original-id is an attribute of that element, so you should write something like this:
idelem = tree.find('./{http://www.w3.org/2005/Atom}entry/{http://www.w3.org/2005/Atom}id')
if idelem is not None:
    print idelem.get('{http://www.google.com/schemas/reader/atom/}original-id')
That will find only the first matching id, if you want them all, use findall and iterate over the results.
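If you want all of them, a minimal sketch of the findall variant (same namespaces as above; the variable names are just illustrative):

ATOM = '{http://www.w3.org/2005/Atom}'
GR = '{http://www.google.com/schemas/reader/atom/}'

for entry in tree.findall('./' + ATOM + 'entry'):
    idelem = entry.find(ATOM + 'id')
    if idelem is not None:
        print idelem.get(GR + 'original-id')  # the article URL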
I'm trying to write a Python script that modifies the contents of the <script> tag in files I'm parsing. I'm using lxml.html (as opposed to BeautifulSoup, etc.) for this due to its speed. The contents of the script tag are surrounded by comment tags (<!-- and -->):
<script>
<!--
...
-->
</script>
The problem is that when I try something like scriptNode.text = '<!-- ...', lxml converts the angle brackets to their HTML representations (&lt; and &gt;) when I write the HTML back to the file. I tried escaping them in the string ('\< ...'), but that doesn't seem to help.
Looking at most modern websites, it seems those comment tags are not needed. I can remove them, but many of the scripts also use some HTML within them, and if that gets converted to its HTML-entity representation as well, that's a problem.
I'm surprised that lxml is modifying this data at all; last I heard, HTML parsers are designed to avoid modifying or interpreting data within <script> tags.
Is there a setting/command I can use to prevent this from happening?
Thanks
Put them in a CDATA section.
An alternative solution I just found that seems to work as well is using tostring() instead of write():

import lxml.html

main = open('file.html', 'w')
main.write(lxml.html.tostring(htmlTree))
main.close()

instead of

htmlTree.write('file.html', pretty_print=False)
Figured I'd post it here as well, even though I decided to go with CDATA since it seems to be a cleaner solution that will prevent problems in the future with other parsing scripts as well. (The difference appears to be the serializer: write() defaults to XML output, which escapes text content, while lxml.html.tostring() uses the HTML serializer, which leaves <script> contents alone.)
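If that explanation is right, keeping write() but selecting the HTML serializer should behave the same way; a hedged one-liner, assuming htmlTree is the parsed lxml tree from above:

# method='html' selects lxml's HTML serializer, which leaves
# the contents of <script> tags unescaped
htmlTree.write('file.html', method='html', pretty_print=False)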
I am a total Python newb and am trying to parse an XML document that is returned from Google as the result of a POST request.
The document returned looks like the one outlined in this doc
http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#Archives
where it says 'The response contains information about the archive.'
The only part I am interested in is the Id attribute right near the beginning. There will only ever be 1 entry, and 1 id attribute. How can I extract it to be used later? I've been fighting with this for a while and I feel like I've tried everything from minidom to ElementTree. No matter what I do, my search comes back blank, loops don't iterate, or methods are missing. Any assistance is much appreciated. Thank you.
I would highly recommend the Python package BeautifulSoup. It is awesome. Here is a simple example using their example data (assuming you've installed BeautifulSoup already):
from BeautifulSoup import BeautifulSoup
data = """<?xml version='1.0' encoding='utf-8'?>
<entry xmlns='http://www.w3.org/2005/Atom'
xmlns:docs='http://schemas.google.com/docs/2007'
xmlns:gd='http://schemas.google.com/g/2005'>
<id>
https://docs.google.com/feeds/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA</id>
<published>2010-11-18T18:34:06.981Z</published>
<updated>2010-11-18T18:34:07.763Z</updated>
<app:edited xmlns:app='http://www.w3.org/2007/app'>
2010-11-18T18:34:07.763Z</app:edited>
<category scheme='http://schemas.google.com/g/2005#kind'
term='http://schemas.google.com/docs/2007#archive'
label='archive' />
<title>Document Archive - someuser@somedomain.com</title>
<link rel='self' type='application/atom+xml'
href='https://docs.google.com/feeds/default/private/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA' />
<link rel='edit' type='application/atom+xml'
href='https://docs.google.com/feeds/default/private/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA' />
<author>
<name>someuser</name>
<email>someuser@somedomain.com</email>
</author>
<docs:archiveNotify>someuser@somedomain.com</docs:archiveNotify>
<docs:archiveStatus>flattening</docs:archiveStatus>
<docs:archiveResourceId>
0Adj-hQNOVsTFSNDEkdk2221OTJfMWpxOGI5OWZu</docs:archiveResourceId>
<docs:archiveResourceId>
0Adj-hQNOVsTFZGZodGs2O72NFMllMQDN3a2Rq</docs:archiveResourceId>
<docs:archiveConversion source='application/vnd.google-apps.document'
target='text/plain' />
</entry>"""
soup = BeautifulSoup(data, fromEncoding='utf8')
print soup('id')[0].text
There is also expat, which is built into Python, but it is worth learning BeautifulSoup, because it will respond way better to real-world XML (and HTML).
Assuming the variable response contains a string representation of the returned HTML document, let me tell you the WRONG way to solve your problem:
id = response.split("</id>")[0].split("<id>")[1]
The right way to do it is with xml.sax or xml.dom or expat, but personally, I wouldn't bother unless I wanted robust error handling of the exceptional cases where response contains something unexpected.
EDIT: I forgot about BeautifulSoup; it is indeed as awesome as Travis describes.
If you'd like to use minidom, you can do the following (replace gd.xml with your xml input):
from xml.dom import minidom
dom = minidom.parse("gd.xml")
id = dom.getElementsByTagName("id")[0].childNodes[0].nodeValue
print id
Also, I assume you meant id element, and not id attribute.
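Since the question mentions ElementTree searches coming back blank, a likely culprit is the Atom namespace: find('id') matches nothing because the element's qualified name is {http://www.w3.org/2005/Atom}id. A minimal sketch, assuming the sample above is saved as gd.xml:

import xml.etree.ElementTree as ET

tree = ET.parse('gd.xml')
# the root is <entry>, so its <id> child is one find() away,
# but the name must be qualified with the Atom namespace
id_elem = tree.find('{http://www.w3.org/2005/Atom}id')
print id_elem.text.strip()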