I am trying to write a program to retrieve all of the links for questions that have active bounties in a specific tag. I have not yet implemented the specific-tag feature, because I am stuck just trying to get all of the links.
from re import findall
from urllib.request import urlopen

def fetch_source(url):
    return str(urlopen(url).read())

site = 'http://stackoverflow.com/?tab=featured'

def fetch_links(source):
    source = fetch_source(source)
    return findall("\/questions\/[0-9]*\/(?:[A-z]|\-)+", source)

print(fetch_links(site))
This will fetch many of the links, but it misses a lot of them because my regex only allows [A-z]|\- in the title. I'm not sure how to fix this, though, because some questions have quotation marks in their titles, and if I allow those, I won't know where the question link ends.
I'm sorry for being new to Python, but I am just trying to figure stuff out.
Using regex would also become completely infeasible once I try to get questions by a specific tag.
You are correct that your regex is missing a lot of titles, but using findall really isn't appropriate in this situation. Beautiful Soup is a much better tool for retrieving links, and I recommend you look into it.
In this instance, however, the Stack Exchange API has you covered.
For similar questions, just search (or Google) the API documentation until you find the feature you're looking for; in your case, featured questions.
Enter the parameters you want, and the API will generate a link:
https://api.stackexchange.com/2.2/questions/featured?order=desc&sort=votes&tagged=python&site=stackoverflow
Example for retrieving all featured Python questions
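For instance, here is a short sketch that calls that URL with the requests library and prints each question's link; the items/link fields are what the API documents for question objects, so adjust the parameters to taste:

import requests

# Featured (bountied) questions tagged "python", straight from the Stack Exchange API.
url = "https://api.stackexchange.com/2.2/questions/featured"
params = {
    "order": "desc",
    "sort": "votes",
    "tagged": "python",
    "site": "stackoverflow",
}

response = requests.get(url, params=params)
response.raise_for_status()

for question in response.json().get("items", []):
    print(question["link"])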
I wanted to do this before for some websites but didn't know where to start. This time, however, I am adamant. I am talking about scripts where we crawl a website and extract the data we require.

My target is this: I have to appear for job interviews in December. There is a site (http://www.geeksforgeeks.org/) which contains a large number of questions from previous interviews (like http://www.geeksforgeeks.org/amazon-interview-set-42-on-campus/ and http://www.geeksforgeeks.org/adobe-interview-set-6-campus-mts-1/). Every title has the word "set" and a number in it. It is quite cumbersome to keep track of what I have done and what I haven't, so I want to extract the questions from each of these pages and put them in a PDF with the title.

How can I do this using curl, regex and Scrapy? I am intermediate in C/C++/Java but have only beginner proficiency in Python. Any help is much appreciated. Also, point me to any such scripts you know of. I want to do this on my own; I just need a starting point and some guidance. Thanks.
If you want just a starting point, try Scrapy, a screen-scraping library for Python. For making requests, I would recommend the requests library; it's by far the simplest option (with no loss of power).
Also, don't try to parse HTML or XML with a regex. Just don't. Use one of the fine libraries available (BeautifulSoup or lxml, or lxml with a BeautifulSoup backend, are the most popular, but there are others).
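To give you an idea, here is a minimal sketch of the requests + BeautifulSoup approach for pulling question links off a page. The "question-hyperlink" class is an assumption based on Stack Overflow's historical markup, so check it against the current page before relying on it:

import requests
from bs4 import BeautifulSoup

source = requests.get("http://stackoverflow.com/?tab=featured").text
soup = BeautifulSoup(source, "html.parser")

# Collect the href of every anchor that points at a question.
# The class name below may change; inspect the page to confirm it.
links = [a["href"] for a in soup.find_all("a", class_="question-hyperlink")]
print(links)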
I am a social scientist and a complete newbie/noob when it comes to coding. I have searched through the other questions/tutorials but am unable to get the gist of how to crawl a news website, targeting the comments section specifically. Ideally, I'd like to tell Python to crawl a number of pages and return all the comments as a .txt file. I've tried
from bs4 import BeautifulSoup
import urllib2
url="http://www.xxxxxx.com"
and that's as far as I can go before I get an error message saying bs4 is not a module. I'd appreciate any kind of help on this, and please, if you decide to respond, DUMB IT DOWN for me!
I can run wget in the terminal and get all kinds of text from websites, which would be awesome IF I could actually figure out how to save the individual output HTML files into one big .txt file. I will take a response to either question.
Try Scrapy. It is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
You will most likely encounter this as you go, but in some cases, if the site is using a third-party service for comments, like Disqus, you will find that you cannot pull the comments down in this manner. Just a heads-up.
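To make that concrete, a bare-bones Scrapy spider might look like the sketch below. The start URL and the CSS selector are placeholders; you would swap in the real article pages and whatever class the site actually wraps its comments in:

import scrapy

class CommentSpider(scrapy.Spider):
    name = "comments"
    # Placeholder URLs: replace with the news pages you want to crawl.
    start_urls = ["http://www.example.com/article-1"]

    def parse(self, response):
        # Placeholder selector: inspect the page and use the real class
        # that wraps each comment.
        for comment in response.css("div.comment ::text").getall():
            yield {"comment": comment.strip()}

You can run it with something like scrapy runspider comment_spider.py -o comments.csv and copy the text out afterwards, or write the .txt file yourself inside parse.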
I've gone down this route before and have had to tailor the script to a particular site's layout/design/etc.
I've found libcurl to be extremely handy, if you don't mind doing the post-processing with Python's string-handling functions.
If you don't need to implement it purely in Python, you can use wget's recursive mirroring option to handle the content pull, then write your Python code to parse the downloaded files.
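For example, after something like wget --recursive --no-parent --adjust-extension http://www.example.com/ has mirrored the site (the glob below assumes the pages were saved with an .html extension, which --adjust-extension takes care of), a short script can walk the saved pages and dump the comment text into one file. The "comment" class is a placeholder for whatever the real site uses:

import glob
from bs4 import BeautifulSoup

# Walk every .html file wget saved under the mirrored directory and
# append the text of each comment block to a single .txt file.
with open("all_comments.txt", "w", encoding="utf-8") as out:
    for path in glob.glob("www.example.com/**/*.html", recursive=True):
        with open(path, encoding="utf-8", errors="ignore") as page:
            soup = BeautifulSoup(page.read(), "html.parser")
        # Placeholder selector: replace "comment" with the site's real class.
        for block in soup.find_all("div", class_="comment"):
            out.write(block.get_text(strip=True) + "\n")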
I'll add my two cents here as well.
The first things to check are that you installed Beautiful Soup and that it lives somewhere it can be found. There are all kinds of things that can go wrong here.
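A quick, minimal way to check both things, run with the same interpreter you use for your script:

# Can this interpreter import bs4, and where does it live?
try:
    import bs4
    print("bs4 found at:", bs4.__file__)
except ImportError:
    print("bs4 is not installed for this interpreter; try: pip install beautifulsoup4")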
My experience is similar to yours: I work at a web startup, and we have a bunch of users who register, but give us no information about their job (which is actually important for us). So my idea was to scrape the homepage and the "About us" page from the domain in their email address, and try to put a learning algorithm around the data that I captured to predict their job. The results for each domain are stored as a text file.
Unfortunately (for you...sorry), the code I ended up with was a bit complicated. The problem is that you'll end up getting a lot of garbage when you do the scraping, and you'll have to filter it out. You'll also end up with encoding issues, and (assuming you want to do some learning here) you'll have to get rid of low-value words. The total code is about 1000 lines, and I'll post some important pieces that may help you out here, if you're interested.
I have an IRC bot I'm working on, and one of the features I would like it to have is to take any link a person posts and use BeautifulSoup to parse that page. Now, I have the bot working, getting the messages people post, etc. But how would I pull a link out of an IRC message? Say someone says this:
Person: Check out http://www.site.com, it's cool!
How would I take the link out and assign it to a variable for later use, without pulling the other parts of the message?
I think it has something to do with regexes, but I'm not sure.
You will indeed need to use regular expressions.
There's a decent article at Daring Fireball with a regular expression for matching URLs and a rough description of what it's doing.
You can look at how Django does it here.
Finally, Python's regular expression documentation may also be useful.
You are on exactly the right track to finish this. You gave yourself the answer in the last sentence of your question: use a regular expression with a capture group to get the URL, and from there you can fetch and parse the page the user posted in the IRC channel.
This site may be of some use for you: http://www.regular-expressions.info/
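Here is a minimal sketch of pulling the first URL out of a message with a capture group. The pattern is deliberately simple (stop at whitespace or a comma), not the more robust one from the Daring Fireball article:

import re

message = "Person: Check out http://www.site.com, it's cool!"

# Deliberately simple pattern: "http(s)://" followed by characters that are
# neither whitespace nor a comma. Real-world URLs need a more robust regex.
match = re.search(r"(https?://[^\s,]+)", message)
if match:
    url = match.group(1)
    print(url)  # http://www.site.com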
I am working on a data-mining project for which I need to analyse the progress of discussion in a forum thread. I am interested in extracting information like the time of each post, stats of the post's author (number of posts, joining date, etc.), the text of the post, and so on.
However, while using standard scraping tools (like Scrapy in Python) I need to write regular expressions for detecting these fields in the page's HTML source. As these tags vary with the type of forum, it is becoming a major problem to maintain the regular expressions for every forum. Is there a standard bank of such regular expressions available, so that they can be used based on the type of forum?
Or is there any other technique to extract these fields from a forum's page?
I wrote some configuration files for some major forums. Hope you can decipher them and infer how to parse them.
For VBulletin:
enclosed_section=tag:table,attributes:id;threadslist
thread=tag:a,attributes:id;REthread_title_
list_next_page=type:next_page,attributes:anchor_text;>
post=tag:div,attributes:id;REpost_message_
thread_next_page=type:next_page,attributes:anchor_text;>
enclosed_section is the div that contains links to all the threads
thread is where you'll find the link to each thread
list_next_page is the link to the next page with list of threads
post is the div with the post text.
thread_next_page is the link to the next page of the thread
For Invision:
enclosed_section=tag:table,attributes:id;forum_table
thread=tag:a,attributes:class;topic_title
list_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href
post=tag:div,attributes:class;post entry-content |
thread_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href
post_count_section=tag:td,attributes:class;stats
post_count=tag:li,attributes:,reg_exp:(\d+) Repl
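For illustration only, here is a rough sketch of how such key=value lines could be loaded into a dictionary. It assumes a plain text file (the name vbulletin.cfg is made up) and is not the actual loader I use:

def load_forum_config(path):
    """Read key=value lines into a dict, splitting each value on commas."""
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or "=" not in line:
                continue
            key, value = line.split("=", 1)
            config[key] = [part for part in value.split(",") if part]
    return config

# config = load_forum_config("vbulletin.cfg")
# config["thread"] -> ['tag:a', 'attributes:id;REthread_title_']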
You'll still have to create several approaches per forum. But as Henley suggests, there are also a lot of forums that share their structure.
As for parsing the dates of forum threads, dateparser was born out of this specific requirement and could be of great help.
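For example (a minimal sketch; dateparser.parse is the main entry point):

import dateparser

# dateparser copes with the relative and varied timestamps forums tend to use.
print(dateparser.parse("2 hours ago"))
print(dateparser.parse("January 12, 2015 10:30 PM"))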
Can anyone point me towards a ready-made RSS screen scraper, preferably in Python, in order to get full-text RSS feeds?
There's a good list of them here, which mentions Feed Parser, which you use like this:
import feedparser

python_wiki_rss_url = "http://www.python.org/cgi-bin/moinmoin/" \
                      "RecentChanges?action=rss_rc"
feed = feedparser.parse(python_wiki_rss_url)
You can then do things like:
for item in feed["items"]:
    print(item["title"])
feedparser.org is great
Sorry, but one doesn't exist in Python, though they do in PHP. You are more than welcome to use and improve the one I made, named scraped. Though it does not handle all sites, it is a recipe-based system that currently only handles the NYT, WSJ and the Economist. I am working on an all-inclusive algorithm, but it's a major undertaking. It requires a ton of analysis of the different types of HTML and XML. Even the three sites mentioned above have vastly different algorithms for scraping them, the WSJ being the most complex by far. They clutter their HTML with so much useless junk, mainly just to stop you.
Here is the program I was talking about; it requires lxml, but everything is explained in the readme. It reads the config files, parses partial RSS feeds, takes the links and then scrapes those links, producing in the end an RSS 2.0 XML file, which I mainly convert into an ebook for my Kindle. I utilize lxml, BeautifulSoup and feedparser.
http://tinyurl.com/yh3s9pa
You can also look at the calibre project, which uses a recipe-based approach similar to mine.