I'm trying to scrap a large database for a project of mine, however I find that after I scrap a relatively big amount of data, I stop receiving some of the xml information I'm interested in. I'm not sure if it's because the server is limiting my access or because it starts scraping too fast.
I put a "sleep" line between the scraping loops to overcome this, but as I try to reach more data it doesn't work anymore.
I guess this is a known problem in web scraping but I'm very new to this field so any suggestion will be very helpful.
Note: I tried 'request' with some free proxies but that didn't work either (still some data missing). I also checked the original website and it does have the data I seek.
Edit: It looks like most of that data I'm missing comes from specific attributes that don't load as fast as all other data. So I think I'm looking for a way to tell if this xml I'm looking for has loaded already.
I'm using lxml and requests.
Thanks.
Related
I'm working on a python webscraper that pulls data from a car advertising site. I got the scraping part all done with beatifoulsoup but I've ran into many difficulties trying to store and modify it. I would really appreciate some advice on this part since I'm a lacking knowledge on this part.
So here is what I want to do:
Scrape the data each hour (done).
Store scraped data as a dictionary in a .JSON file (done).
Everytime the ad_link not found in the scraped_data.json set it to dict['Status'] = 'Inactive' (done).
If a cars price changes , print notification + add old price to dictionary. On this part I came across many challenges with the .JSON way.
I've kept using 2 .json files and comparing them to each other (scraped_data_temp , permanent_data.json) but I think this is by far not the best method.
What would you guys suggest? How should I do this? .
What would be the best way to approach manipulating this kind of data ? (Databases maybe? - got no experince with them but I'm eager to learn) and what would be a good way to represent this kind of data, pygal?
Thank you very much.
If you have larger data, I would definitely recommend using some kind of DB. If you don't have the need to use DB server, you can use sqlite. I have used it in the past to save bigger data locally. You can use sqlalchemy in python to interact with DB-s.
As for displaying data, I tend to use matplotlib. It's extremely flexible, has extensive documentation and examples, so you can adjust data to you linking, graphs, charts, etc.
I'm assuming that you are using python3.
I would like to download the data in this table:
http://portal.ujn.gov.rs/Izvestaji/IzvestajiVelike.aspx
I know how to use selenium to go through the pages and the CSS selectors are helpful enough that it shouldn't be too difficult to get all the data...
However, I am curious if anyone knows some way of getting to a json or whatever intermediary object is used to make the html? As in, whatever the raw data format file that gets exported by the server is? Is this possible with aspnet frameworks?
I have found such solutions in the past, but with much simpler web pages and web pages with get requests...
Thank you!
Taking a look at the website (I have no experience with Russian at all but not like it maters much.) It looks to me like it is pulling the information from a database via PHP (In my book the "old" way of doing it) not a JSON file. Which means that your basically stuck doing it the normal web scraping route like you said OR to find a SQL injection (which I am in NO WAY SUGGESTING as it is illegal?) to be able to bypass the limitations of there crappy search page.
I'm learning a practice called 'web scraping' using python. From what I can tell so far the idea is to send out a request to load the site data from a server, store the DOM html in a variable, and then basically data mine the s*** out of the resulting string until you are able to quickly access exactly and only the information you need.
Well I'm ready to start fiddling with statements that might help me do the actual data mining, but first I need to see and understand all of the html in my string. After I've got the hang of it I won't care what the html looks like, but right now I need to be able to reference it to properly analyze my output. so far I've tried google, python.net, youtube, various blogs and etc. But they all look like alianeese.
I'm just looking for the typical stuff you know?
<html><head><meta><script src=""><style src=""><title></title></head><body><div class=""><img src=""></div><div><h1>my page</h1><li></li><li></li><li></li><li></li><li></li><li></li><p>click here</p></div></body></html>
You get what I'm saying? Just a website... that uses like... html... to render some simple structured data.
P.S. This is kind of neat. I went to give this post some tags and I discovered 'simple-html-dom'. So I googled it. Apparently it's some kind of language that lets you parse html from online sources in exactly the way I am trying to. I may check that out later, but I still want to figure out how to do this with python.
EDIT Actually something like this would work fine but it's just so big. I would prefer something smaller to work with.
While it would probably be nice to build your own web pages to use, you can also try looking for pages "optimized for lynx". Lynx is a text-only browser with which "simple" pages naturally work best.
Most of the links you'll find will be dead already, but I found this list for instance, which still has many alive and equally simple pages: http://www.put.com/dead.html (please ignore the content itself... there is no particular reason I chose this example other than that it probably works nicely for your purposes!)
I'm not new to programming—but am (very) new to web-scraping. I'd like to get data from this website in this manner:
Get the team-data from the given URL and store it in some text file.
"Click" the links of each of the team members and store that data in some other text file.
Click various other specific links and store data in its own separate text file.
Again, I'm quite new to this. I have tried opening the specified website with urllib2 (in hopes of being able to parse it with BeautifulSoup), but opening it resulted in a timeout.
Ultimately, I'd like to do something like specify a team's URL to a script, and have said script update associated text files of the team, its players, and various other things in different links.
Considering what I want to do, would it be better to learn how to create a web-crawler, or directly do things via urllib2? I'm under the impression that a spider is faster, but will basically click on links at random unless told to do otherwise (I do not know whether or not this impression is accurate).
I am a social scientist and a complete newbie/noob when it comes to coding. I have searched through the other questions/tutorials but am unable to get the gist of how to crawl a news website targeting the comments section specifically. Ideally, I'd like to tell python to crawl a number of pages and return all the comments as a .txt file. I've tried
from bs4 import BeautifulSoup
import urllib2
url="http://www.xxxxxx.com"
and that's as far as I can go before I get an error message saying bs4 is not a module. I'd appreciate any kind of help on this, and please, if you decide to respond, DUMB IT DOWN for me!
I can run wget on terminal and get all kinds of text from websites which is awesome IF I could actually figure out how to save the individual output html files into one big .txt file. I will take a response to either question.
Try Scrapy. It is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
You will most likely encounter this as you go, but in some cases, if the site is employing 3rd party services for comments, like Disqus, you will find that you will not be able to pull the comments down in this manner. Just a heads up.
I've gone down this route before and have had to tailor the script to a particular site's layout/design/etc.
I've found libcurl to be extremely handy, if you don't mind doing the post-processing using Python's string handler functions.
If you don't need to implement it purely in Python, you can make use of wget's recursive mirroring option to handle the content pull, then write your python code to parse the downloaded files.
I'll add my two cents here as well.
The first things to check are that you installed beautiful soup, and that it lives somewhere that it can be found. There's all kinds of things that can go wrong here.
My experience is similar to yours: I work at a web startup, and we have a bunch of users who register, but give us no information about their job (which is actually important for us). So my idea was to scrape the homepage and the "About us" page from the domain in their email address, and try to put a learning algorithm around the data that I captured to predict their job. The results for each domain are stored as a text file.
Unfortunately (for you...sorry), the code I ended up with was a bit complicated. The problem is that you'll end up getting a lot of garbage when you do the scraping, and you'll have to filter it out. You'll also end up with encoding issues, and (assuming you want to do some learning here) you'll have to get rid of low-value words. The total code is about 1000 lines, and I'll post some important pieces that may help you out here, if you're interested.