I'm parsing large HTML files with BeautifulSoup that range between 3 and 10 MB. Unfortunately, 99% of the data is content that I need to parse. The file practically contains a small header, a few JS scripts and then between 1,000 and 10,000 items. Each item consists of the following table rows:
<tr class="New" id="content_id">
<td class="item" align="center">
</td><td align="center">
<a onclick="somecode"><img src="/images/sample.gif" alt="alttext" class="image"></a>
</td><td style="astyle">[content1]</td><td>[content2]</td><td>[content3]</td><td>[content4]</td><td>[content5]</td><td style="bstyle">[content6]</td><td>[content7]</td><td>[content8]</td><td>[content9]</td><td>[content10]</td><td>[content11]</td><td></td><td>[content12]</td><td>[content13]</td><td>
[content14]
</td><td>
[content15]
</td><td>[content16]</td><td>
<a title="" href="somejs">[content16]</a>
</td><td>
<a title="" href="somejs">[content17]</a>
</td>
</tr>
Note that every [content] placeholder is relevant data that I need to parse.
I have tried a variety of common optimizations, such as a) using different parsers, b) using SoupStrainer, and c) defining the encoding.
b) and c) have practically no effect when I log the time it takes. The different parsers, however, have a significant impact. When I run the script below on a 1.5k list of items (a comparatively small list), I get the following parsing times (I am running the experiment on a 2012 MacBook Air):
#1653 items parsed in 15.5 seconds with lxml
#xml takes 27 sec
#html5lib takes 69 sec
#html.parser takes 24 sec
import datetime
from bs4 import BeautifulSoup, SoupStrainer

current = datetime.datetime.utcnow()
strainer = SoupStrainer('table', attrs={'id': 'contenttable'})
soup = BeautifulSoup(html, 'lxml', parse_only=strainer, from_encoding="UTF-8")
print datetime.datetime.utcnow() - current
Question: Besides what I have used so far, are there any tweaks I can use to dramatically shorten the parsing time?
So far I can only think of increasing CPU power.
Assuming you're reading the entire file into memory first, there isn't much else you can do. If the HTML is broken in quite a few places, the parsers have to perform more work to try to guess the correct structure.
When it comes to parsing XML/HTML in Python, my experience has been that lxml is the fastest and most memory-efficient (compared to something like xml.minidom or BeautifulSoup).
However, I have parsed simple XML files larger than 10 MB in less than 15 seconds, which leads me to believe that you may have really nasty/heavily nested HTML that is choking the parser. Either that or my hardware is just crazy awesome (i7 2700k and an SSD).
lxml looks to be the best solution in Python.
We benchmarked all the parsers/platforms when building serpapi.com:
https://medium.com/@vikoky/fastest-html-parser-available-now-f677a68b81dd
Have you tried using lxml iterparse and removing nodes on every iteration? There is an excellent article that talks about how to parse huge files; see the solution at the end.
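For reference, a minimal sketch of that iterparse-and-clear pattern applied to table rows like the ones above (the file name, the tag='tr' filter and the use of html=True are assumptions, not taken from the question):

from lxml import etree

def iter_rows(path):
    # html=True lets lxml recover from non-XML markup; only react to closed <tr> elements
    context = etree.iterparse(path, events=('end',), tag='tr', html=True)
    for _, elem in context:
        yield [''.join(td.itertext()).strip() for td in elem.iter('td')]
        # free the processed element and its already-seen siblings to keep memory flat
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

for row in iter_rows('items.html'):
    pass  # each row holds [content1] ... [content17]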
Related
I just received a task from a basic programming course at uni.
I am a complete newbie regarding computers. I am a freshman and have no prior programming experience.
The task requires writing Python source code that prints an HTML table as output.
No use of modules is allowed.
We covered some basic Python things like if, for loops, while, print, etc.,
but didn't learn anything about creating HTML in Python.
I've been searching on the internet for hours and hours, but all solutions seem so advanced and they all involve use of third-party modules, which in my case is not allowed.
Professor knows that we are all complete newbies, so there's got to be a way to do this without much professional knowledge.
Can anyone please tell me the basics of making an HTML table in Python?
Like do I just type in things like
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
in Python? I basically have no idea where to start.
** The code should be written in a way that when it is executed in a bash shell ($ python file_name.py), it prints out an HTML table.
P.S. I'm using VS Code as an editor for Python.
As far as I can imagine, the only way to keep this simple without going bananas is to write an
HTML file. So basically it's writing rows into an HTML file using .write() in Python. If this is what you want, here is an example of how to do that, plain and simple:
from random import randint

with open('table.html', 'w') as html:
    html.write('<table style="border:2px solid black;">\n')
    for row in range(5):
        html.write('<tr>\n')
        for col in range(3):
            html.write(f'<td>{randint(10, 150)}</td>\n')
        html.write('</tr>\n')
    html.write('</table>')
The "\n" is used to make your HTML document look readable.
random.randint is just a basic module used here to generate placeholder data.
For the table to look nice you can add a border to each td:
<td style="border:2px solid black;" >(random data)</td>
It will look like a solid HTML/Excel-style table.
You won't learn anything by copy-pasting working examples that you don't understand and you can't expect to understand anything, when you search for solutions to complex problems without knowing the basics.
Given the level of experience you have according to your question, you should instead search for a Python tutorial to get a grip on the language. Read about the syntax, Python's object model and the type hierarchy. With that knowledge you will be able to understand the documentation, and with that, you should be able to solve your problem without searching for pre-made solutions.
As we haven't covered any HTML-related topics in class yet, the task doesn't require a very complex solution. It looks like it can be done just by using for loops, lists, if, print, etc., all of which I've already learned and am quite confident with. Many thanks to everyone who cared to give help and answers.
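For reference, here is a minimal sketch along those lines that uses only print, a list and a for loop, so that running python file_name.py prints the table to standard output (the cell values are made-up placeholders):

# placeholder data; swap in whatever the assignment actually requires
rows = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]

print("<table>")
for row in rows:
    print("  <tr>")
    for cell in row:
        print("    <td>" + cell + "</td>")
    print("  </tr>")
print("</table>")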
I started programming in Python a while ago and I am currently working on a really big dataset. It is one XML file of about 80 GB, so I can't just parse it with, e.g., xml.etree.ElementTree, because it simply won't fit into my RAM. (File: ftp://ftp.ebi.ac.uk/pub/databases/interpro/46.0/, see match_complete.xml.gz)
What I did until now: I iterparsed it, always clearing the current element and the root as soon as I found what I was looking for, which is pretty efficient (it requires less than 10 MB of RAM).
What I am trying to do now is parallelize my parsing, since I have 10 cores and 20 threads at my disposal. In order to do so, I am planning on splitting this one big XML file into 20 smaller ones, so I can just start a search in each of the small files in parallel (this might be a second question, in another thread).
Since I am not only trying to do this for one dataset whose size I can easily look up (see release_notes.txt at the link above), but want this to be a more general script for further use, I am looking for the most efficient way of finding out how many elements with a certain tag are present in this huge XML file, so I can always split the files according to the number of threads I have available.
The data structure looks like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE interpromatch SYSTEM "match_complete.dtd">
<interpromatch>
<release>
<dbinfo Here is stuff I am totally not interested in>
<dbinfo Here is stuff I am totally not interested in>
</release><protein id="A0A000" name="A0A000_9ACTN" length="394" crc64="F1DD0C1042811B48">
<match id=Some info about the proteins in my case>
<ipr Some info I am acutally looking for, when I am parsing the file ESSENTIAL />
<Don't need this either />
</match>
<match id=Some info about the proteins in my case>
<ipr Some info I am acutally looking for, when I am parsing the file ESSENTIAL />
<Don't need this either />
</match>
<match id=Some info about the proteins in my case>
<ipr Some info I am acutally looking for, when I am parsing the file ESSENTIAL />
<Don't need this either />
</match>
</protein>
.
.
. (around 50000000 more entries in the whole db)
<protein>
<match id=Some info about the proteins in my case>
<ipr Some info I am acutally looking for, when I am parsing the file ESSENTIAL />
<Don't need this either />
</match>
</protein>
</interpromatch>
Let's say I am looking for the tag "protein" and my database contains 10000 entries of that kind. I want to be able to look this number up as fast as possible (iterating, I think, is simply not feasible), so I can find out how many of those entries there are and divide that number by the number of threads. In this example I want to get e.g. len(tree.findall("protein")), so I know how many of those entries I have to put into each of the smaller files. That would be 10000 (proteins) / 20 (threads) per file in this case.
I am mainly working with Python, but I would consider anything that just tells me, as fast as possible, how many "protein" entries there are in my database.
Just for completeness' sake, what I am trying to do later is:
Start a script/subprocess for each of the smaller files and query it for a certain attribute in the "ipr" section. There I am looking for a certain identifier and if this one is present, extract data from the parent "protein" node. Combine those results, and work with those.
I hope you get what I mean and can help me. Thanks in advance!
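For reference, a minimal sketch of the iterparse-and-clear pattern described above, used here only to count the protein elements while keeping memory flat (the file path, the helper name count_tag and the absence of XML namespaces are assumptions):

import xml.etree.ElementTree as ET

def count_tag(path, tag):
    # stream over 'end' events, counting matches and freeing each element afterwards
    count = 0
    for _, elem in ET.iterparse(path, events=('end',)):
        if elem.tag == tag:
            count += 1
        elem.clear()
    return count

print(count_tag('match_complete.xml', 'protein'))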
I'm parsing a website with the requests module and I'm trying to get specific URLs inside tags (a whole list of them, since the tags are used more than once) without using BeautifulSoup. Here's part of the code I'm trying to parse:
<td class="notranslate" style="height:25px;">
<a class="post-list-subject" href="/Forum/ShowPost.aspx?PostID=80631954">
<div class="thread-link-outer-wrapper">
<div class="thread-link-container notranslate">
Forum Rule: Don't Spam in Any Way
</div>
I'm trying to get this URL from inside the tag:
/Forum/ShowPost.aspx?PostID=80631954
The thing is, because I'm parsing a forum site, there are multiple uses of those divider tags. I'd like to retrieve a list of post URLs using str.split, with code similar to this:
htmltext.split('<a class="post-list-subject" href="')[1].split('"><div class="thread-link-outer-wrapper">')[0]
There is nothing in the HTML code to indicate a post number on the page, just links.
In my opinion there are better ways to do this. Even if you don't want to use BeautifulSoup, I would lean towards regular expressions. However, the task can definitely be accomplished with the approach you want. Here's one way, using a list comprehension:
results = [chunk.split('">')[0] for chunk in htmltext.split('<a class="post-list-subject" href="')[1:]]
I tried to model it as closely off of your base code as possible, but I did simplify one of the split arguments to avoid whitespace issues.
In case regular expressions are fair game, here's how you could do it:
import re
target = '<a class="post-list-subject" href="(.*)">'
results = re.findall(target, htmltext)
Consider using Beautiful Soup. It will make your life a lot easier. Pay attention to the choice of parser so that you can get the balance of speed and leniency that is appropriate for your task.
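For illustration, a minimal sketch of that approach, pulling the href value of every post link with BeautifulSoup (the class name comes from the snippet above; the parser choice is just an example):

from bs4 import BeautifulSoup

soup = BeautifulSoup(htmltext, 'html.parser')
# collect the href of every post-list link on the page
urls = [a['href'] for a in soup.find_all('a', class_='post-list-subject')]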
It seems really dicey to try to pre-optimize without establishing that your bottleneck is going to be HTML parsing. If you're worried about performance, why not use lxml? Module imports are hardly ever the bottleneck, and it sounds like you're shooting yourself in the foot here.
That said, this will technically do what you want, but it seriously is not more performant in the long run than using an HTML parser like lxml. Explicitly avoiding an HTML parser will also probably drastically increase your development time as you figure out obscure string-manipulation snippets rather than just using the nice tree structure that you get for free with a parser.
strcleaner = lambda x : x.replace('\n', '').replace(' ', '').replace('\t', '')
S = strcleaner(htmltext)
S.split(strcleaner('<a class="post-list-subject" href="'))[1].split(strcleaner('"><div class="thread-link-outer-wrapper">'))[0]
The problem with the code you posted is that whitespace and newlines are characters too.
I know the question title isn't amazing, but I can't think of a better way to word it. I have a bit of HTML that I need to search:
<tr bgcolor="#e2d8d4">
<td>1</td>
<td>12:00AM</td>
<td>Show Name<a name="ID#"></a></td>
<td>Winter 12</td>
<td>Channel</td>
<td>Production Company</td>
<td nowrap>1d 11h 9m (air time)</td>
<td align="center">11</td>
<td>
AniDB</td>
<td>Home</td>
</tr>
The page consists of several dozen of these HTML blocks. I need to be able to, given just the Show Name, pick out the air time of a given show, as well as the bgcolor. (Full page here: http://www.mahou.org/Showtime/Planner/.) I am assuming the best bet would be a regexp, but I am not confident in that assumption. I would prefer not to use third-party modules (BeautifulSoup). I apologize in advance if the question is vague.
Thank you for doing your research - it's good that you are aware of BeautifulSoup. This would really be the best way to go about solving your problem.
That aside... here is a generic strategy you can choose to implement using regexes (if your sanity is questionable) or using BeautifulSoup (if you're sane).
It looks like the data you want is always in a table that starts off like:
<table summary="Showtime series for Sunday in a Planner format." border="0" bgcolor="#bfa89b" cellpadding="0" cellspacing="0" width="100%">
You can isolate this by looking for the summary="Showtime series for (Monday|Tuesday|....|Sunday)" attribute of the table, which is unique in the page.
Once you have isolated that table, the format of the rows within it is well defined. I would take one <tr> at a time and assume that the second <td> always contains the airing time and the third <td> always contains the show's name.
Regexes can be good for extracting very simple things from HTML, such as "the src paths of all img tags", but once you start talking about nested tags like "find the second <td> tag of each <tr> tag of the table with attribute summary="...", it becomes much harder to do. This is because regular expressions are not designed to work with nested structures.
See the canonical answer to 'regexps and HTML' questions, and Tom Christiansen's explanation of what it takes to use regexps on arbitrary HTML. tchrist proves that you can use regexps to parse any HTML you want - if you're sufficiently determined - but that a proper parsing library like BeautifulSoup is faster, easier, and will give better results.
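If BeautifulSoup does end up being an option, a minimal sketch of the strategy above might look like this (the summary text and column positions follow the description; the variable html holding the page source is an assumption):

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# the summary attribute is unique per planner table
table = soup.find('table', summary=re.compile(r'^Showtime series for '))

shows = {}
for tr in table.find_all('tr'):
    cells = tr.find_all('td')
    if len(cells) < 3:
        continue  # skip header or malformed rows
    name = cells[2].get_text(strip=True)
    shows[name] = {'air_time': cells[1].get_text(strip=True),
                   'bgcolor': tr.get('bgcolor')}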
This was supposed to be a comment, but it turned out too long.
BeautifulSoup's documentation is pretty good, as it contains quite a few examples; just be aware that there are two versions and not each of them plays nicely with every version of Python, although you probably won't have problems there (see this: "Beautiful Soup 4 works on both Python 2 (2.7+) and Python 3.").
Furthermore, HTML parsers like BeautifulSoup or lxml clean your HTML before processing it (to make it valid and so you can traverse its tree properly), so they may move certain elements regarded as invalid. Usually, you can disable that feature but then it's not certain you're going to get the results you want.
There are other approaches to solving the task you're asking about. However, they're much more involved to implement, so they're probably not desirable under the conditions you described. But just to let you know, the whole field of information extraction (IE) deals with that kind of issue. Here (PDF) is a more or less recent survey about it, focused mainly on IE from HTML (semi-structured, as they call it) webpages.
Using Python.
So basically I have an XML-like tag syntax, but the tags don't have attributes. So <a> but not <a value='t'>. They close regularly with </a>.
Here is my question. I have something that looks like this:
<al>
1. test
2. test2
test with new line
3. test3
<al>
1. test 4
<al>
2. test 5
3. test 6
4. test 7
</al>
</al>
4. test 8
</al>
And I want to transform it into:
<al>
<li>test</li>
<li> test2</li>
<li> test with new line</li>
<li> test3
<al>
<li> test 4 </li>
<al>
<li> test 5</li>
<li> test 6</li>
<li> test 7</li>
</al>
</li>
</al>
</li>
<li> test 8</li>
</al>
I'm not really looking for a complete solution but rather a push in the right direction. I am just wondering how the folks here would approach the problem. Solely regex? Write a full custom parser for the attribute-less tag syntax? Hack up existing XML parsers? etc.
Thanks in advance
I'd recommend starting with the following:
from xml.dom.minidom import parse, parseString
xml = parse(...)
l = xml.getElementsByTagName('al')
then traverse all elements in l, examining their text subnodes (as well as <al> nodes recursively).
You may start playing with this right away in the Python console.
It is easy to remove text nodes, then split text chunks with chunk.split('\n') and add <li> nodes back, as you need.
After modifying all the <al> nodes you may just call xml.toxml() to get the resulting xml as text.
Note that the element objects you get from this are linked back to the original xml document object, so do not delete the xml object in the process.
I personally consider this way more straightforward and easier to debug than wrangling with multiline regexps.
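A minimal sketch of the steps described above; it only wraps each non-empty text line of an <al> in <li> and does not yet re-nest items that span an inner <al> block (the input file name is an assumption):

import re
from xml.dom.minidom import parseString

source = open('list.txt').read()   # the <al> markup; file name assumed
doc = parseString(source)
for al in doc.getElementsByTagName('al'):
    for node in list(al.childNodes):
        if node.nodeType != node.TEXT_NODE:
            continue
        # turn each non-empty text line into its own <li> element
        for line in node.data.split('\n'):
            text = re.sub(r'^\s*\d+\.\s*', '', line).strip()
            if not text:
                continue
            li = doc.createElement('li')
            li.appendChild(doc.createTextNode(text))
            al.insertBefore(li, node)
        al.removeChild(node)
print(doc.toxml())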
The way you've described your syntax, it is "XML without attributes". If that's so, it's still XML, so you can use XML tools such as XSLT and XQuery.
If you allow things that aren't allowed in XML, on the other hand, my approach would be to write a parser that handles your non-XML format and delivers XML-compatible SAX events. Then you'll be able to use any XML technology just by plugging in your parser in place of the regular XML parser.
It would depend on what you want to do with it exactly; if it is a one-off script, the following suffices:
cat in.txt | perl -pe 'if(!/<\/?al>/){s#^(\s*)([0-9]+\.)?(.*)$#$1<li>$3</li>#}'
And it works. But I wouldn't say it's very robust ;) But if it's for a one-off it's fine.
I am just wondering how the folks here would approach the problem.
I would go for using a parser.
My reasoning is that the operation you are trying to perform isn't merely a syntactic or lexical substitution. It's much more of a grammar transformation, which implies understanding the structure of your document.
In your example, you are not simply enclosing each line between <li> and </li>; you are also enclosing recursively some blocks of document that spans over several lines, if these represent an "item".
Maybe you could put together a regex capable of capturing the interpretative logic and the recursive nature of the problem, but doing that would be like digging a trench with a teaspoon: you could do it, but using a spade (a parser) is a much more logical choice.
An additional reason to use a parser is the "real world". Regexes are true "grammar nazis": a glitch in your markup and they won't work. On the other hand, all parser libraries are "flexible" (they treat uniformly different spellings like <a></a> and <a/>, or HTML's <br> and XHTML's <br/>) and some, like BeautifulSoup, are even "forgiving", meaning that they will try to guess (with a surprisingly high level of accuracy) what the document's author wanted to write, even if the document itself fails validation.
Also, a parser-based solution is much more maintainable than a regex-based one. A small change in your document structure might require radical changes to your regex (which by nature tends to become obscure to its very own author after 72 hours or so).
Finally, because you are using Python and therefore readability counts, a parser-based solution could potentially result in much more pythonic code than a very complex/long/obscure regex.
HTH!