I would like to download the data in this table:
http://portal.ujn.gov.rs/Izvestaji/IzvestajiVelike.aspx
I know how to use Selenium to go through the pages, and the CSS selectors are helpful enough that it shouldn't be too difficult to get all the data...
However, I am curious whether anyone knows a way to get at the JSON, or whatever intermediate object is used to build the HTML. In other words, the raw data the server sends before the page is rendered. Is this possible with ASP.NET frameworks?
I have found such solutions in the past, but with much simpler web pages and web pages with get requests...
Thank you!
Taking a look at the website (I can't read the language, but that doesn't matter much), it looks to me like the pages are built on the server from a database rather than from a JSON file. The .aspx URL suggests ASP.NET Web Forms, which in my book is the "old" way of doing it. That means you're basically stuck with the normal web scraping route, as you said, OR with finding a SQL injection (which I am in NO WAY SUGGESTING, as it is illegal) to bypass the limitations of their crappy search page.
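Because the URL ends in .aspx, the paging links are probably Web Forms postbacks: the browser re-submits the page's hidden fields (such as __VIEWSTATE and __EVENTVALIDATION) together with an __EVENTTARGET naming the control that was clicked. If that holds for this site, the pages can be requested without a browser. Below is a minimal sketch of building such a form body with the standard library. The control name and the sample HTML are made up for illustration; the real ones have to be read from the page source.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_postback(page_html, event_target, event_argument=""):
    """Return the form data a Web Forms __doPostBack call would submit."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    form = dict(parser.fields)
    form["__EVENTTARGET"] = event_target      # control that "was clicked"
    form["__EVENTARGUMENT"] = event_argument  # e.g. a page number argument
    return form


# Made-up page fragment, only to show the shape of the input.
SAMPLE_PAGE = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="search" value="ignored" />
</form>
"""

# "ctl00$grid" and "Page$2" are hypothetical; copy the real ones from the
# javascript:__doPostBack(...) links in the page source.
form_data = build_postback(SAMPLE_PAGE, "ctl00$grid", "Page$2")
```

The resulting dictionary can be sent as the body of a POST to the same URL (for example with requests.post(url, data=form_data)), and the hidden fields need to be re-read from each response before requesting the next page.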
First post - be gentle!
I am starting to learn Python and would like to get information from a table in a web page (https://en.wikipedia.org/wiki/European_Union#Demographics) into a pandas DataFrame.
I am using Google Colab, and from researching a bit I understand the process has something to do with 'web scraping', i.e. turning HTML into CSV.
Any thoughts welcome please. Worth noting I am constrained by not being able to download additional software due to the secure nature of my work.
Thanks.
You need a library to help you parse the HTML - a well known library for that in Python would be BeautifulSoup.
There are also some available tools online that do this kind of thing for you, and you can take some inspiration from them, even if you can't use them directly: https://wikitable2csv.ggor.de/
As you can see, the website above uses the CSS selector "table.wikitable" to identify the tables.
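Since installing extra software is not an option for you, here is a rough sketch of the same "table.wikitable" idea using only Python's standard library instead of BeautifulSoup. The sample HTML and its values are invented for illustration; in practice you would feed in the downloaded page.

```python
import csv
import io
from html.parser import HTMLParser


class WikiTableParser(HTMLParser):
    """Collect the rows of every <table> whose class list contains 'wikitable'."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._depth = 0    # > 0 while inside a wikitable (counts nested tables)
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table":
            if self._depth or "wikitable" in classes:
                self._depth += 1
        elif self._depth and tag == "tr":
            self._row = []
        elif self._depth and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self._depth:
            self._depth -= 1
        elif tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse whitespace inside the cell before storing it.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def tables_to_csv(html):
    """Return the wikitable rows found in html as CSV text."""
    parser = WikiTableParser()
    parser.feed(html)
    out = io.StringIO()
    csv.writer(out).writerows(parser.rows)
    return out.getvalue()


# Invented sample; the second table has no 'wikitable' class and is skipped.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Country</th><th>Population</th></tr>
<tr><td>Examplia</td><td>123</td></tr>
</table>
<table><tr><td>ignored</td></tr></table>
"""

csv_text = tables_to_csv(SAMPLE_HTML)
```

If pandas is available, as it normally is in Colab, pandas.read_html(url) is a much shorter route: it returns a list of DataFrames, one per table it finds on the page.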
You can use Scrapy, a Python-based scraping framework, to get and parse the data as required. In Scrapy, you create spiders which crawl a set of URLs that you have initialized. Furthermore, you can parse the HTML data using something like Beautiful Soup to get your table from the response. The Scrapy documentation itself is pretty useful and should get you set up quickly! Scrapy also lets you export the parsed data as CSV, which should help you with the export part.
All the best!
I'm trying to scrape a large database for a project of mine; however, I find that after I scrape a relatively large amount of data, I stop receiving some of the XML information I'm interested in. I'm not sure if it's because the server is limiting my access or because I'm scraping too fast.
I put a "sleep" line between the scraping loops to overcome this, but as I try to reach more data it doesn't work anymore.
I guess this is a known problem in web scraping but I'm very new to this field so any suggestion will be very helpful.
Note: I tried 'requests' with some free proxies but that didn't work either (some data was still missing). I also checked the original website and it does have the data I seek.
Edit: It looks like most of the data I'm missing comes from specific attributes that don't load as fast as all the other data. So I think I'm looking for a way to tell whether the XML I'm looking for has loaded already.
I'm using lxml and requests.
Thanks.
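One way to tell whether the XML has "loaded" is to check each response for the fields you need and to retry with a growing delay when they are missing. The following is only a sketch under that assumption: fetch stands for whatever function downloads the document (for example a wrapper around requests.get), and the tag names are placeholders. If the missing values are filled in by JavaScript in the browser, no amount of retrying with requests will make them appear.

```python
import time
import xml.etree.ElementTree as ET


def fetch_until_complete(fetch, required_tags, max_attempts=5,
                         base_delay=1.0, sleep=time.sleep):
    """Call fetch() until every required tag has non-empty text.

    fetch must return the XML document as a string. The wait doubles after
    each incomplete response (exponential backoff).
    """
    missing = list(required_tags)
    for attempt in range(max_attempts):
        root = ET.fromstring(fetch())
        missing = [
            tag for tag in required_tags
            if not (root.findtext(f".//{tag}") or "").strip()
        ]
        if not missing:
            return root
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        f"still missing after {max_attempts} attempts: {missing}"
    )
```

Passing sleep as a parameter keeps the function easy to try out without actually waiting; in real use the default time.sleep applies.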
I have an HTML webpage with a search textbox. I want to allow the user to search within a dataset, which is represented by a bunch of files on my server. I wrote a Python script which can perform that search.
Unfortunately, I'm not familiar with how to connect the HTML page and the Python script.
The task is to hook the Python script up to the HTML page so that:
Python code will be run on the server side
Python code can somehow take the values from the HTML page as input
Python code can somehow put the search results to the HTML webpage as output
Question 1: How can I do this?
Question 2: How should the Python code be stored on the website?
Question 3: How should it take HTML values as input?
Question 4: How can it output the results to the webpage? Do I need to install/use any additional frameworks?
Thanks!
There are too many things to get wrong if you try to implement that by yourself with only what the standard library provides.
I would recommend using a web framework, like flask or django. I linked to the quickstart sections of the comprehensive documentation of both. Basically, you write code and URL specifications that are mapped to the code, e.g. an HTTP GET on /search is mapped to a method returning the HTML page.
You can then use a form submit button to GET /search?query=<param>, with <param> being the user's input. Based on that input, you search the dataset and return a new HTML page with the results.
Both frameworks have template languages that help you put the search results into HTML.
For testing purposes, web frameworks usually come with a simple web server you can use. For production purposes, there are better solutions like uwsgi and gunicorn.
Also, you should consider putting the data into a database, since parsing files for each query can be quite inefficient.
I'm sure you will have more questions on the way, but that's what stackoverflow is for, and if you can ask more specific questions, it is easier to provide more focused answers.
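To make the request flow above concrete, here is a bare-bones sketch of "GET /search?query=... is mapped to code that returns an HTML page". It uses only the standard library's WSGI conventions, so it is not Flask or Django code; those frameworks wrap the same idea in routing decorators and templates. The dataset and function names are placeholders.

```python
from html import escape
from urllib.parse import parse_qs

# Placeholder for the real dataset, which lives in files on the server.
DATASET = ["red apple", "green apple", "banana"]


def search_dataset(query):
    """Return every entry containing the query, ignoring case."""
    needle = query.lower()
    return [item for item in DATASET if needle in item.lower()]


def render_page(query, results):
    """Build the HTML page: a GET form plus the (escaped) results."""
    items = "".join(f"<li>{escape(result)}</li>" for result in results)
    return (
        "<form action='/search' method='get'>"
        f"<input name='query' value='{escape(query)}'>"
        "<button>Search</button></form>"
        f"<ul>{items}</ul>"
    )


def app(environ, start_response):
    """WSGI entry point: maps GET /search?query=... to the search code."""
    if environ.get("PATH_INFO") != "/search":
        start_response("404 Not Found",
                       [("Content-Type", "text/plain; charset=utf-8")])
        return [b"Not found"]
    params = parse_qs(environ.get("QUERY_STRING", ""))
    query = params.get("query", [""])[0]
    results = search_dataset(query) if query else []
    body = render_page(query, results).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [body]
```

For a quick local try-out, wsgiref.simple_server.make_server("localhost", 8000, app).serve_forever() serves this app; gunicorn or uwsgi can run the same app object in production. Escaping the user's input before putting it back into the page is one of the "things to get wrong" that framework template languages handle for you.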
I would look at the cgi library in python.
You should check out Django; it's a very flexible and easy-to-use Python web framework.
I am a social scientist and a complete newbie/noob when it comes to coding. I have searched through the other questions/tutorials but am unable to get the gist of how to crawl a news website targeting the comments section specifically. Ideally, I'd like to tell python to crawl a number of pages and return all the comments as a .txt file. I've tried
from bs4 import BeautifulSoup
import urllib2
url="http://www.xxxxxx.com"
and that's as far as I can go before I get an error message saying bs4 is not a module. I'd appreciate any kind of help on this, and please, if you decide to respond, DUMB IT DOWN for me!
I can run wget on terminal and get all kinds of text from websites which is awesome IF I could actually figure out how to save the individual output html files into one big .txt file. I will take a response to either question.
Try Scrapy. It is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
You will most likely encounter this as you go, but in some cases, if the site is employing 3rd party services for comments, like Disqus, you will find that you will not be able to pull the comments down in this manner. Just a heads up.
I've gone down this route before and have had to tailor the script to a particular site's layout/design/etc.
I've found libcurl to be extremely handy, if you don't mind doing the post-processing using Python's string handler functions.
If you don't need to implement it purely in Python, you can make use of wget's recursive mirroring option to handle the content pull, then write your python code to parse the downloaded files.
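For the second half of the question, here is a small sketch of parsing the files wget has already downloaded and merging their text into one big .txt file. It assumes the saved pages end in .html and sit somewhere under one folder; the function names are only illustrative. It extracts all visible text, so picking out just the comments would still need site-specific filtering.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    extractor = TextExtractor()
    extractor.feed(html)
    return "\n".join(extractor.parts)


def merge_html_files(folder, out_path):
    """Write the text of every .html file under folder into one text file.

    Returns the number of files that were merged.
    """
    files = sorted(Path(folder).rglob("*.html"))
    with Path(out_path).open("w", encoding="utf-8") as out:
        for path in files:
            out.write(f"=== {path} ===\n")
            html = path.read_text(encoding="utf-8", errors="replace")
            out.write(html_to_text(html))
            out.write("\n\n")
    return len(files)
```

Calling merge_html_files("downloaded_site", "all_pages.txt") (both names are examples) would then produce a single file with a header line before each page's text.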
I'll add my two cents here as well.
The first things to check are that you installed beautiful soup, and that it lives somewhere that it can be found. There's all kinds of things that can go wrong here.
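A quick way to run that check is to ask Python itself whether it can find the module, and which interpreter is doing the looking. This is a small sketch; describe_module is just an illustrative helper name.

```python
import importlib.util
import sys


def describe_module(name):
    """Report whether the running interpreter can import the named module."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name} is NOT importable by {sys.executable}"
    return f"{name} found at {spec.origin}"


print(describe_module("bs4"))
```

If it reports that bs4 is not importable, installing it with the same interpreter (python -m pip install beautifulsoup4) usually resolves the "bs4 is not a module" error. A common cause is having the package installed for a different Python than the one running the script.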
My experience is similar to yours: I work at a web startup, and we have a bunch of users who register, but give us no information about their job (which is actually important for us). So my idea was to scrape the homepage and the "About us" page from the domain in their email address, and try to put a learning algorithm around the data that I captured to predict their job. The results for each domain are stored as a text file.
Unfortunately (for you...sorry), the code I ended up with was a bit complicated. The problem is that you'll end up getting a lot of garbage when you do the scraping, and you'll have to filter it out. You'll also end up with encoding issues, and (assuming you want to do some learning here) you'll have to get rid of low-value words. The total code is about 1000 lines, and I'll post some important pieces that may help you out here, if you're interested.
The more research I do, the more grim the outlook becomes.
I am trying to Flat Save, or Static Save a webpage with Python. This means merging all the styles to inline properties, and changing all links to absolute URLs.
I've tried nearly every free conversion website, api, and even libraries on github. None are that impressive. The best python implementation I could find for flattening styles is https://github.com/davecranwell/inline-styler. I adapted that slightly for Flask, but the generated file isn't that great. Here's how it looks:
Obviously, it should look better. Here's what it should look like:
https://dzwonsemrish7.cloudfront.net/items/3U302I3Y1H0J1h1Z0t1V/Screen%20Shot%202012-12-19%20at%205.51.44%20PM.png?v=2d0e3d26
It seems like a never-ending struggle dealing with malformed HTML, unrecognized CSS properties, Unicode errors, etc. So does anyone have a suggestion for a better way to do this? I understand I can go to File -> Save in my local browser, but when I am trying to do this en masse and extract a particular XPath, that's not really viable.
It looks like Evernote's web clipper uses iFrames, but that seems more complicated than I think it should be. But at least the clippings look decent on Evernote.
After walking away for a while, I managed to install a ruby library that flattens the CSS much much better than anything else I've used. It's the library behind the very slow web interface here http://premailer.dialect.ca/
Thank goodness they released the source on Github, it's the best hands down.
https://github.com/alexdunae/premailer
It flattens styles, creates absolute urls, works with a URL or string, and can even create plain text email templates. Very impressed with this library.
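For anyone who needs to stay in Python, the absolute-URL part on its own is easy to sketch with the standard library. This is not how premailer does it; it is a rough regex-based illustration that only touches quoted href and src attributes and will miss unquoted attributes, srcset, and URLs inside CSS.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src='...' (quoted values only).
_LINK_ATTR = re.compile(
    r'''(\b(?:href|src)\s*=\s*)(["'])(.*?)\2''',
    re.IGNORECASE | re.DOTALL,
)


def absolutize_links(html, base_url):
    """Rewrite quoted href/src values in html as absolute URLs."""

    def replace(match):
        prefix, quote, value = match.group(1), match.group(2), match.group(3)
        return f"{prefix}{quote}{urljoin(base_url, value.strip())}{quote}"

    return _LINK_ATTR.sub(replace, html)


# example.com is a placeholder base URL.
page = '<a href="/about">About</a><img src=\'img/a.png\'>'
flattened = absolutize_links(page, "http://example.com/blog/post.html")
```

URLs that are already absolute, including mailto: links, pass through urljoin unchanged, so running the function twice is harmless.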
Update Nov 2013
I ended up writing my own bookmarklet that works purely client-side. It is compatible with WebKit and Firefox only. It recurses through each node and adds inline styles, then sends the flattened HTML to the clippy.in API to save to the user's dashboard.
Client Side Bookmarklet
It sounds like inline styles might be a deal-breaker for you, but if not, I suggest taking another look at Evernote Web Clipper. The desktop app has an Export HTML feature for web clips. The output is a bit messy as you'd expect with inline styles, but I've found the markup to be a reliable representation of the saved page.
Regarding inline vs. external styles, for something like this I don't see any way around inline if you're doing a lot of pages from different sites where class names would have conflicting style rules.
You mentioned that Web Clipper uses iFrames, but I haven't found this to be the case for the HTML output. You'd likely have to embed the static page as an iFrame if you're re-publishing on another site (legally I assume), but otherwise that shouldn't be an issue.
Some automation would certainly help so you could go straight from the browser to the HTML output, and perhaps for relocating the saved images to a single repo with updated src links in the HTML. If you end up working on something like this, I'd be grateful to try it out myself.