I was using the xgoogle Python library for one of my projects. It was working fine until recently, but now I am not getting the result set that I used to get. If anyone who has used this library, written by Peter Krumins, has faced a similar situation, can you please suggest a workaround?
The presence of BeautifulSoup.py hints that this library uses web scraping to get its results.
A common problem with this is that it will easily break when the design/layout of the page being scraped changes. And the problem you see seems to coincide with the new search results layout that Google introduced just recently.
Another problem is that scraping is often against the terms of service of the site being scraped. And according to point 5.3 of the Google Terms of Service, it actually is:
You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) [...]
A better idea would be to use the Custom Search API.
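For reference, a minimal sketch of querying the Custom Search JSON API with the requests library (the API key and search engine ID below are placeholders you obtain from the Google developer console):

import requests

API_KEY = "YOUR_API_KEY"          # placeholder from the Google developer console
SEARCH_ENGINE_ID = "YOUR_CX_ID"   # placeholder: the custom search engine's cx value

def google_search(query, num=10):
    # Query the Custom Search JSON API and return the list of result items
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": query, "num": num},
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

for item in google_search("python web scraping"):
    print(item["title"], item["link"])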
Peter Krumins's xgoogle looks to be extremely useful, both to me and, I imagine, to many others.
https://github.com/pkrumins/xgoogle
For me, the current version, 1.3, is not working.
I tried a new install from GitHub, ran the examples and nothing is returned.
Adding a debugger to the source code and tracing the data captured in a query to the point where it disappears, I found that the problem occurs in search.py, in the subroutine "_extract_results", at the parser command
results = soup.findAll('li', {'class': 'g'})
The soup object has material in it, but the "findAll" call fails to return anything.
It looks like it is searching for list items and, if there are none, it returns nothing.
I am unsure what HTML you would try to match to get a result.
If anyone knows how to make this work, I am very interested.
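For what it's worth, here is a rough debugging sketch for inspecting what Google actually returned (it uses bs4 for illustration, even though xgoogle bundles the older BeautifulSoup.py, and the alternative selectors are only guesses, not the real current class names):

from bs4 import BeautifulSoup

def inspect_results_page(raw_html):
    # Dump the page Google returned and count matches for a few candidate selectors
    soup = BeautifulSoup(raw_html, "html.parser")
    with open("google_response.html", "w", encoding="utf-8") as f:
        f.write(soup.prettify())  # save the markup so you can study the current layout
    for selector in ("li.g", "div.g", "div#search div"):  # guesses only
        print(selector, "->", len(soup.select(selector)))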
A little more googling suggests that xgoogle is no longer supported or working.
Part of the trouble is that Google changes the layout of its results pages every so often and so any scraping software that assumes some standard layout is in time doomed to failure.
There are, however, other search engines that are installed locally and thus provide a results layout that is less likely to change with upgrades, and that will not change at all if you don't upgrade.
I am currently investigating YaCy. It is easy to install and can be pointed at specific sites if you want.
I have a Node project that's bundled and added to GitHub as releases. At the moment, it checks my GitHub repository for a new release via the API and lets the user download it. The user must then stop the Node server, unzip release.zip into the folder, and overwrite everything to update the project.
What I'm trying to do is write a Python script that I can execute from Node by spawning a new process. It will kill the Node server using PM2, then check the GitHub API, grab the download URL, download it, unzip the contents into the current folder, delete the zip, and start the Node server again.
What I'm struggling with, though, is checking the GitHub API and downloading the latest release file. Can anyone point me in the right direction? I've read that wget shouldn't be used from Python, and that I should use urlopen instead.
If you are asking for ways to get data from a web server, the two main libraries are:
Requests
Urllib
Personally, I prefer requests. They both have good documentation.
With requests, getting JSON data is as simple as:
r = requests.get("https://example.com/data.json")  # the URL needs an explicit scheme such as https://
data = r.json()
You can add headers and other information easily, and it supports both HTTP and HTTPS.
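As a rough sketch of the GitHub part of your workflow (assuming a public repository; the owner and repository names are placeholders, and the release is assumed to have at least one uploaded asset), you could query the "latest release" endpoint and download its first asset:

import requests

OWNER = "your-user"     # placeholder
REPO = "your-project"   # placeholder

def download_latest_release():
    # Fetch release metadata from the GitHub API
    api_url = f"https://api.github.com/repos/{OWNER}/{REPO}/releases/latest"
    release = requests.get(api_url).json()

    # Assumes the release has at least one uploaded asset (your release.zip)
    asset = release["assets"][0]
    with requests.get(asset["browser_download_url"], stream=True) as r:
        r.raise_for_status()
        with open(asset["name"], "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return asset["name"], release["tag_name"]

From there, the zipfile and subprocess modules from the standard library cover the unzip and server-restart steps.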
You need to map out your workflow and dataflow better. You can do it in words or pictures. If you can express your problem clearly and completely, step by step, in list form, then translate it to pseudocode. Python is great because you can go almost immediately from a good written description, to pseudocode, to a working implementation. Then at least you have something that works, and you can optimize performance, or simplify functionality or usability, from there. This is the process of translating a problem into a solution.
When asking questions on SO, you need to show your current thinking and what you've already tried, preferably with the code that doesn't yet work, or doesn't work the way you need it to. People can vote you down and give you negative reputation points if you ask a question with just a vague description, if it is an obvious cry for help with homework (yours is not that), or if it is a vague musing without even an attempt at a solution, because such questions do not contribute back to the community in any way.
Do you have any code or detailed pseudocode steps for checking the GitHub API for the "latest release" of the file(s) you are trying to update?
When attempting to use the Python library newspaper3k on an archived page URL from archive.org, it fails to fetch any articles. However, when using it on the same live page URL, it works fine. Please see below:
>>> import newspaper
>>> len(newspaper.build('https://bbc.co.uk/news').articles)
111
>>> len(newspaper.build('https://web.archive.org/web/http://www.bbc.co.uk/news').articles)
0
Even using the special id_ hack that returns the original, unmodified page doesn't work:
>>> len(newspaper.build('https://web.archive.org/web/20171219030622id_/http://www.bbc.co.uk/news').articles)
0
Any help would be much appreciated, thanks!
I find no indication that this library is meant to work with archive.org, or that it works with archive.org.
Neither of the two lists of sources [1][2] mentions archive.org or web.archive.org.
I downloaded the entire repository to search the source code, and it also contains no mention of either Internet Archive domain.
From what I can tell based on this file, the articles attribute is based on RSS/Atom feeds. I don't think the Internet Archive archives those, and even if it does, since they would link back to the live version of the site, some changes in the library itself would be needed to get them to work with the Internet Archive.
You've already opened an issue, where you specify that it doesn't work at all (even on single articles; this is probably an issue elsewhere, such as in the node-scoring algorithm the library uses to decide which nodes contain the article), so if you don't want to dive into the library's source code and fix it yourself, all you can do is wait.
I am a social scientist and a complete newbie/noob when it comes to coding. I have searched through the other questions/tutorials but am unable to get the gist of how to crawl a news website while targeting the comments section specifically. Ideally, I'd like to tell Python to crawl a number of pages and return all the comments as a .txt file. I've tried
from bs4 import BeautifulSoup
import urllib2
url="http://www.xxxxxx.com"
and that's as far as I can go before I get an error message saying bs4 is not a module. I'd appreciate any kind of help on this, and please, if you decide to respond, DUMB IT DOWN for me!
I can run wget in the terminal and get all kinds of text from websites, which is awesome IF I could actually figure out how to save the individual output HTML files into one big .txt file. I will take a response to either question.
Try Scrapy. It is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
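For a concrete starting point, here is a minimal Scrapy spider sketch (the URL and the CSS selector are placeholders you would replace after inspecting the target site's markup):

import scrapy

class CommentSpider(scrapy.Spider):
    name = "comments"
    # placeholder: replace with the news pages you want to crawl
    start_urls = ["http://www.example.com/article-1"]

    def parse(self, response):
        # placeholder selector: replace 'div.comment' with whatever
        # element actually wraps each comment on the site
        for text in response.css("div.comment ::text").getall():
            yield {"comment": text.strip()}

Saved as comments_spider.py, it can be run with "scrapy runspider comments_spider.py -o comments.csv", and the output can then be post-processed into a single .txt file.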
You will most likely encounter this as you go, but in some cases, if the site uses a third-party service for comments, like Disqus, you will find that you are not able to pull the comments down in this manner. Just a heads-up.
I've gone down this route before and have had to tailor the script to a particular site's layout/design/etc.
I've found libcurl to be extremely handy, if you don't mind doing the post-processing using Python's string handler functions.
If you don't need to implement it purely in Python, you can make use of wget's recursive mirroring option to handle the content pull, then write your Python code to parse the downloaded files, for example along the lines of the sketch below.
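A rough sketch of that two-step approach (the wget flags in the comment and the 'comment' CSS class are assumptions you would adjust for the real site):

# Step 1 (shell, not Python): mirror a couple of levels of the site, HTML only, e.g.
#   wget --recursive --level=2 --accept html --directory-prefix=site_dump http://www.example.com/
# Step 2: walk the downloaded files and append anything that looks like a comment
# to one big text file.
import os
from bs4 import BeautifulSoup

with open("all_comments.txt", "w", encoding="utf-8") as out:
    for root, _dirs, files in os.walk("site_dump"):
        for name in files:
            if not name.endswith(".html"):
                continue
            with open(os.path.join(root, name), encoding="utf-8", errors="ignore") as f:
                soup = BeautifulSoup(f, "html.parser")
            # 'comment' is a placeholder class name; change it to match the site
            for node in soup.find_all(class_="comment"):
                out.write(node.get_text(" ", strip=True) + "\n")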
I'll add my two cents here as well.
The first things to check are that you installed Beautiful Soup and that it lives somewhere it can be found. There are all kinds of things that can go wrong here.
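A quick sanity check, assuming the package was installed as beautifulsoup4 (the pip command in the comment is the usual route, but your setup may differ):

# If you see "bs4 is not a module" / an ImportError, the package is usually missing
# for the interpreter you are running. The usual fix is:
#   python -m pip install beautifulsoup4
import bs4

print(bs4.__version__)  # confirms the import works
print(bs4.__file__)     # shows where the module was actually found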
My experience is similar to yours: I work at a web startup, and we have a bunch of users who register, but give us no information about their job (which is actually important for us). So my idea was to scrape the homepage and the "About us" page from the domain in their email address, and try to put a learning algorithm around the data that I captured to predict their job. The results for each domain are stored as a text file.
Unfortunately (for you...sorry), the code I ended up with was a bit complicated. The problem is that you'll end up getting a lot of garbage when you do the scraping, and you'll have to filter it out. You'll also end up with encoding issues, and (assuming you want to do some learning here) you'll have to get rid of low-value words. The total code is about 1000 lines, and I'll post some important pieces that may help you out here, if you're interested.
I did some googling and didn't find anything complete for my problem, but it is so generic that there has to be something.
I need a feed-parsing tool for my Django app (I want to fetch an Atom feed from somewhere and store its contents). I just found some feedparser.py references, but the actual site is long gone.
Could you provide some pointers?
feedparser is still pretty much the canonical solution for this in Python. It's very far from gone: see the documentation here and the actual project page here.
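A minimal sketch of fetching an Atom feed and reading its entries with feedparser (the feed URL is a placeholder, and the storage step is left to your Django models):

import feedparser

FEED_URL = "https://example.com/feed.atom"  # placeholder

d = feedparser.parse(FEED_URL)
print(d.feed.get("title", ""))

for entry in d.entries:
    # entry.title, entry.link and entry.get("summary") are the usual fields;
    # save them to your Django model here instead of printing
    print(entry.title, entry.link)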
If I add a feed URL to Google Reader or to a desktop feed aggregator, I receive nice results. The URL is:
http://estaticos03.marca.com/rss/futbol_1adivision.xml
But when I fetch the same URL from a script (a Python script, using the feedparser library), I am getting slightly different content for the same results (the title of each entry, for example, is different and all in uppercase).
I believe something is done on the server side to try to discourage people like me from parsing the content for their own projects (the feed is from a popular football newspaper), but I am not sure about it. I tried passing some user agents (like the Google Reader one) but still no luck, so maybe they check the IP as well? I am really confused.
Any idea why this is happening to me?
Thanks!
AFAIK Google Reader does some "magic" in the content to beautify it. They strip some tags and styles to avoid breaking their interface.
Can you provide more details on the differences?
Did you change the user agent of your script? Try to mimic Firefox and see what happens.
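If you want to try that, feedparser lets you set the user agent directly when parsing (the Firefox string below is just an example value):

import feedparser

url = "http://estaticos03.marca.com/rss/futbol_1adivision.xml"

# 'agent' overrides feedparser's default user agent for this request
d = feedparser.parse(url, agent="Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0")

for entry in d.entries[:5]:
    print(entry.title)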
All right folks, I found it. I analyzed the source XML received (as @TryPyPy did). I had been trusting the feedparser library too much. The latest official version (4.1) has a bug that makes it mistakenly use the title tag from the media namespace instead of the original one:
http://code.google.com/p/feedparser/issues/detail?id=76
So I reinstalled from trunk and now everything is OK. Thanks for helping anyway!