I'm using python to build an application which functions in a similar way to an RSS aggregator. I'm using the feedparser library to do this. However, I'm struggling to get the program to correctly detect if there is new content.
I'm mainly concerned with news-related feeds. Besides seeing if a new item has been added to the feed, I also want to be able to detect if a previous article has been updated. Does anybody know how I can use feedparser to do this, bearing in mind that the only compulsory item elements are either the title or the description? I'm willing to assume that the link element will always be present as well.
Feedparser's "id" attribute associated with each item seems to simply be the link to the article so this may help with detecting new articles on the feed, but not with detecting updates to previous articles since the "id" for those will not have changed.
I've looked on previous threads on stackoverflow and some people have suggested hashing the content or hashing title+url but I'm not really sure what that means or how one would go about it (if indeed it is the right approach).
Hashing in this context means to calculate a shorter value to represent each combination of url and title. This approach works when you use a hash function that ensures the odds of a collision (two different items generate the same value) are low.
Traditionally, MD5 has been a good function for this (but be careful not to use it for cryptographic operations - it's deprecated for that purpose).
So for example.
>>> import hashlib
>>> url = "http://www.example.com/article/001"
>>> title = "The Article's Title"
>>> id = hashlib.md5(url + title).hexdigest()
>>> print id
785cbba05a2929a9f76a06d834140439
>>>
This will provide an id that will change if the URL or title changes - indicating that it is a new article.
You can download and add the content of the article to the hash if you also want to detect edits to the article content.
Note, if you do intend to pull entire pages down, you may want to learn about HTTP conditional GET with Python in order to save bandwidth and be a little friendlier to the sites you are hitting.
Related
I am relatively new to web scraping/crawlers and was wondering about 2 issues in the event where a parsed DOM element is not found in the fetched webpage anymore:
1- Is there a clever way to detect if the page has changed? I have read that it's possible to store and compare hashes but I am not sure how effective it is.
2- In case a parsed element is not found in the fetched webpage anymore, if we assume that we know that the same DOM element still exists somewhere in the DOM Tree in a different location, is there a way to somehow traverse the DOM Tree efficiently without having to go over all of its nodes?
I am trying to find out how experienced developers deal with those two issues and would appreciate insights/hints/strategies on how to manage them.
Thank you in advance.
I didn't see this in your tag list so I thought I'd mention this before anything else: a tool called BeautifulSoup, designed specifically for web-scraping.
Web scraping is a messy process. Unless there's some long-standing regularity or direct relationship with the web site, you can't really rely on anything remaining static in the web page - certainly not when you scale to millions of web pages.
With that in mind:
There's no one-fit-all solution. Some ideas:
Use RSS, if available.
Split your scraping into crude categories where some categories have either implied or explicit timestamps (eg: news sites) you can use to trigger an update on your end.
You already mentioned this but hashing works quite well and is relatively cheap in terms of storage. Another idea here is to not hash the entire page but rather only dynamic or elements of interest.
Fetch HEAD, if available.
Download and store previous and current version of the files, then use a utility like diff.
Use a 3rd party service to detect a change and trigger a "refresh" on your end.
Obviously each of the above has its pros and cons in terms of processing, storage, and memory requirements.
As of version 4.x of BeautifulSoup you can use different HTML parsers, namely, lxml, which should allow you to use XPath. This will definitely be more efficient than traversing the entire tree manually in a loop.
Alternatively (and likely even more efficient) is using CSS selectors. The latter is more flexible because it doesn't depend on the content being in the same place; of course this assumes the content you're interested in retains the CSS attributes.
Hope this helps!
I'm able to successfully list tags from a repository using github3 using:
repo.iter_refs(subspace='tags')
That results in a generator of github3.git.Reference objects. Is there a way for me use a similar mechanism to get github3.git.Tag objects, instead? Right now I'm forced to convert each Reference object into my own version of Tag.
So the only way to get a github3.git.Tag object back is if you're trying to retrieve a specific annotated tag (which is a tag created in a very specific manner).
If this is what you're trying to do then your code would look something like
tags = [repo.tag(r.object.sha) for r in repo.iter_refs(subspace='refs')]
You can get a lightweight tag (which is what most tags on GitHub actually are) by either your current method or by doing repo.iter_tags(). Either will work. The latter will return github3.repos.RepoTag, not a github3.git.Tag though because the API returns much different information for each.
I am loading and scrolling on dynamically loading pages. An example is the Facebook "wall", which only loads the next items once you have scrolled to somewhere near the bottom.
I scroll until the page is veeeery long, then I copy the source code, save it as a text file and go on to parsing it.
I would like to extract certain parts of the webpage. I have been using the lxml module in python, but with limited success. On there website they only show examples with pretty short Xpaths.
Below is an example of the function and a path that gets me the user names included on the page.
usersID = elTree.xpath('//a[#class="account-group js-account-group js-action-profile js-user-profile-link js-nav"]')
this works fairly well, however I am getting some errors (another post of mine), such as:
TypeError: 'NoneType' object has no attribute 'getitem'
I have also been looking at the Xpaths that Firebug provides. These are of course much longer and very specific. Here is an example for a reoccuring element on the page:
/html/body/div[2]/div[2]/div/div[2]/div[2]/div/div[2]/div/div/div/div/div[2]/ol[1]/li[26]/ol/li/div/div[2]/p
The part towards the end li[26] shows it is the 26th item in a list of the same element, which are found at the same level of the HTML tree.
I would like to know how I might use such firebug-Xpaths with the lxml library, or of anybody knows of a better way to use Xpaths in general?
Using example HTML code and tools like this for test purposes, the Xpaths from Firebug don't work at all. Is that path just ridiculous in people's experience?
Is is very specific to the source code? Are there any other tools like Firebug that produce more reliable output for use with lxml?
FireBug actually generates really poor xpaths. They are long and fragile because they're incredibly non specific beyond hierarchy.
Pages today are incredibly dynamic.
The best way to work with xpath on dynamic pages is to locate common elements as the hook and perform xpath ops from those as your path root.
What I mean here by common elements is stable structural elements that are highly likely or guaranteed to be present. Pick the one closest to your target in terms of containment hierarchy. Shorter paths are faster and clearer.
From there you need to create paths that locate some specific unique attribute or attribute value on the target element.
Sometimes that's not possible so another strategy is to target the closest uniquely identifiable container element then get all elements similar to yours under that and iterate them looking for your goal.
Highly dynamic pages require sophisticated and dynamic approaches.
Facebook changes a lot and will require script maintenance frequently.
I found two things which, together, worked very well for me.
The first thing:
The lxml package allows the usage of some functions in combination with the Xpath. I used the starts-with function, as follows:
tweetID = elTree.xpath("//div[starts-with(#class, 'the-non-varying-part-of-a-very-long-class-name')]")
When exploring the HTML code (tree) using tools such as Firebug/Firepath, everything is shown nice and neatly - for example:
*
*
When I used the highlighted path, i.e. tweet original-tweet js-original-tweet js-stream-tweet js-actionable-tweet js-profile-popup-actionable has-cards has-native-media with-media-forward media-forward cards-forward - to search my elTree within the code above, nothing was found.
Having a look at the actual HTML file I was trying to parse, I saw it was really spread over many lines - like this:
this explains why the lxml package was not finding it according to my search.
The second thing:
I know is not generally recommended as a workaround, but the Python approach that it is "easier to ask for forgiveness than permission" applied in my case - The next things I did was to use the python try / except on a TypeError that I kept getting at seemingly arbitrary lines of my code
This may well be specific to my code, but after checking the output on many cases, it seems as though it worked well for me.
I am using python-docx to convert a Word docx to a custom HTML equivalent. The document that I need to convert has images and tables, but I haven't been able to figure out how to access the images and the tables within a given run. Here is what I am thinking...
for para in doc.paragraphs:
for run in para.runs:
# How to tell if this run has images or tables?
...but I don't see anything on the Run that has info on the InlineShape or Table. Do I have to fall back to the XML directly or is there a better, cleaner way to iterate over everything in the document?
Thanks!
There are actually two problems to solve for what you're trying to do. The first is iterating over all the block-level elements in the document, in document order. The second is iterating over all the inline elements within each block element, in the order they appear.
python-docx doesn't yet have the features you would need to do this directly. However, for the first problem there is some example code here that will likely work for you:
https://github.com/python-openxml/python-docx/issues/40
There is no exact counterpart I know of to deal with inline items, but I expect you could get pretty far with paragraph.runs. All inline content will be within a paragraph. If you got most of the way there and were just hung up on getting pictures or something you could go down the the lxml level and decode some of the XML to get what you needed. If you get that far along and are still keen, if you post a feature request on the GitHub issues list for something like "feature: Paragraph.iter_inline_items()" I can probably provide you with some similar code to get you what you need.
This requirement comes up from time to time so we'll definitely want to add it at some point.
Note that block-level items (paragraphs and tables primarily) can appear recursively, and a general solution will need to account for that. In particular, a paragraph can (and in fact at least one always must) appear in a table cell. A table can also appear in a table cell. So theoretically it can get pretty deep. A recursive function/method is the right approach for getting to all of those.
Assuming doc is of type Document, then what you want to do is have 3 separate iterations:
One for the paragraphs, as you have in your code
One for the tables, via doc.tables
One for the shapes, via doc.inline_shapes
The reason your code wasn't working was that paragraphs don't have references to the tables and or shapes within the document, as that is stored within the Document object.
Here is the documentation for more info: python-docx
I just read through the documentation on python-docx.
They mention several times that added content is created at the end of the document, but I didn't notice any way to alter this functionality.
Does anyone know how to add a new page to a pre-existing document, but make it page 1?
Thanks!
The short answer is the library doesn't support that just yet, although those features are high on the backlog so will be among the next to be implemented.
To get it done in the meantime you'll need to go down to the XML level with a "workaround" function. If you want to add this use case on this issue on GitHub I'll put together some workaround code you can use.
https://github.com/python-openxml/python-docx/issues/27