How to web scrape database of users w/o API? - python

For fun, I've been learning basic web scraping since last night, using Python's urllib, urllib2, cookielib's CookieJar, and BeautifulSoup. It only took a bit, and I've figured out how to get all the information I need from each user's profile (OKCupid, to be exact). However, I've only figured out how to do so for one profile at a time, and have no idea how to go through the site's public database of users when the site offers no API.
Is there any easy way to do so? Thanks.

Related

Data-mining Facebook Profile and Returning Data In Terminal

I am fairly new to Python and I have a project coming up, for which I've decided to write some code that, given a Facebook user's URL, returns all the data their profile has to offer. Any help would be greatly appreciated, and if you have code that does something similar I would love to see it.
I am looking for this to be done in Python.
I would recommend using a web scraping library or framework with Python. There are plenty of them; Beautiful Soup and Scrapy are great options. However, most web applications have measures in place to prevent you from scraping data from their platforms. I would recommend you do more research.

What information do I need when scraping a website that requires logging in?

I want to access my business's database on some site and scrape it using Python (I'm using Requests and BS4; I can go further if needed). But I couldn't.
Can someone provide some info and simple resources on how to scrape such sites?
I'm not talking about providing usernames and passwords. The site requires much more than this.
How do I know what information my script must provide aside from the username and password (e.g., how do I know that I must provide, say, an auth token)?
How do I deal with the site when there are no HTTP URLs, only hrefs of the form javascript:__doPostBack?
And in this regard, how do I get from the login page to the page I want (the one behind the aforementioned javascript:__doPostBack)?
Are the libraries I'm using enough, or do you recommend using (and, in my case, learning) something else?
Your help is greatly appreciated.
You didn't mention what you use for scraping, but since it sounds like a lot of the interaction on this site is driven by client-side code, I'd suggest using a real browser to do the scraping, and interacting with the site through client-side actions (such as typing into elements or clicking buttons) rather than low-level HTTP requests. This way, you don't need to worry about what form data to send or how to get the URLs of links yourself.
One recommended method of doing this would be to use BeautifulSoup with Selenium / WebDriver. There are multiple resources on how to do this, for example: How can I parse a website using Selenium and Beautifulsoup in python?
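For anyone who does want to stay with Requests, here is a minimal sketch of what a javascript:__doPostBack link actually submits. It assumes the site is built on ASP.NET Web Forms, which is where such links normally come from: the link's two arguments go into the __EVENTTARGET and __EVENTARGUMENT form fields, and the page's hidden inputs (for example __VIEWSTATE or an auth token) are posted back alongside them. The page fragment and control name below are hypothetical, and only the standard library is used for parsing.

```python
import re
from html.parser import HTMLParser

# Matches hrefs such as: javascript:__doPostBack('ctl00$Main$lnkReport','')
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)',\s*'([^']*)'\)")


class HiddenFieldParser(HTMLParser):
    """Collects the name/value pairs of every <input type="hidden">."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""


def postback_form_data(page_html, href):
    """Build the form data a browser would POST for a __doPostBack link."""
    match = POSTBACK_RE.search(href)
    if match is None:
        raise ValueError("not a __doPostBack link: %r" % href)
    parser = HiddenFieldParser()
    parser.feed(page_html)
    data = dict(parser.fields)  # hidden fields such as __VIEWSTATE, tokens
    data["__EVENTTARGET"] = match.group(1)
    data["__EVENTARGUMENT"] = match.group(2)
    return data


# Hypothetical page fragment, for illustration only.
SAMPLE_PAGE = """
<form method="post" action="Reports.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <a href="javascript:__doPostBack('ctl00$Main$lnkReport','')">Report</a>
</form>
"""

form_data = postback_form_data(
    SAMPLE_PAGE, "javascript:__doPostBack('ctl00$Main$lnkReport','')"
)
```

The resulting dictionary can then be sent with something like session.post(page_url, data=form_data) on the same requests.Session that performed the login, so the cookies carry over. This also hints at how to find out what a login needs beyond the username and password: the hidden inputs on the login form are usually the extra values the server expects back.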

Can I scrape this kind of website?

I am learning Python right now and I want to level up my knowledge of it, particularly scraping. I am currently using Scrapy and getting into using it along with Splash. I wanted to scrape a more challenging website, an airline website: "https://www.airasia.com/en/home.page?cid=1". One of my web developer friends told me that it would be impossible to scrape this type of website, since no regular JSON or XML files are returned for the data to be scraped. He said the data can only be accessed using an API (he said something about a RESTful API). I don't quite believe him. So as not to waste my time, I would be happy if someone could confirm this. If someone says it can be scraped, I would be even happier to get tips on how to scrape it, ideally with proof.
Many thanks.
Almost ANY website can be scraped but some websites are trickier than others.
Instead of Scrapy, I would recommend Selenium, a browser automation tool that I find better suited to this kind of site and that has Python bindings as well.
Long story short: you start a web browser through a driver, navigate to the page of your choice, and simulate user interactions such as clicking, entering data in forms, and submitting them. You can also run JavaScript functions.
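As a rough sketch of that workflow (not a tested scraper for any particular site), the following starts Firefox through Selenium, fills in a form, runs a line of JavaScript, and returns the rendered HTML, which is then parsed with the standard library. The CSS selector "input[name='q']" is a hypothetical placeholder, and Selenium plus a matching browser driver are assumed to be installed.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def render_page(url, search_text):
    """Drive a real browser: load the page, submit a form, return the HTML."""
    # Imported here so extract_links() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()  # webdriver.Chrome() works the same way
    try:
        driver.get(url)
        # Hypothetical selector: replace with the real form field.
        box = driver.find_element(By.CSS_SELECTOR, "input[name='q']")
        box.send_keys(search_text)
        box.submit()
        # Run JavaScript in the page, here scrolling to the bottom.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        return driver.page_source
    finally:
        driver.quit()
```

A call such as extract_links(render_page(url, "Kuala Lumpur")) would then give the links present after the scripts have run. On a real site an explicit wait (Selenium's WebDriverWait) is usually needed between submitting the form and reading page_source, since dynamic content arrives asynchronously.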
You might also want to do some research on legal constraints to ensure your operation is not unlawful. For instance, refer to case law of Ryanair Ltd v PR Aviation BV (Case C-30/14 CJEU).
You have two options. Use their API, if they have one, to make HTTP requests and obtain data from their servers.
Or use a Python scraping or web-testing framework, e.g. Scrapy or Selenium, to scrape their website directly from a Python program.
Scrapy will be harder than Selenium on this website because a lot of the content is dynamic and will require custom code to trigger. Selenium should be easier to use.
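For the first option, a minimal sketch of querying a JSON endpoint with the standard library might look like this. The endpoint URL and the parameter names are made up for illustration; whether the site exposes anything like them is an assumption that would have to be checked, for example in the browser's network tab.

```python
import json
import urllib.parse
import urllib.request


def build_url(base, params):
    """Append URL-encoded query parameters to a base endpoint."""
    return base + "?" + urllib.parse.urlencode(params)


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body of the response."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and parameter names, for illustration only.
url = build_url(
    "https://example.com/api/flights",
    {"origin": "KUL", "destination": "SIN"},
)
```

Calling fetch_json(url) would return the decoded response as ordinary Python dictionaries and lists, with no HTML parsing involved.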

Facebook web crawler

I am attempting to build a web crawler to sign into Facebook and check the online status of some family members for a project I'm building for my parents. Upon searching, I found that this is possible through FQL queries on friend online presence, but it seems that this will be removed around April of this year. So I thought that maybe I could just write a basic crawler myself in Python that gets the HTML info for online friends in my chat, but when I print out the HTML after attempting to log in, I get a very large amount of jumbled HTML and JavaScript that mentions "BigPipe." I see that BigPipe breaks pages into pagelets, but I'm a little confused about what to make of this information.
So my questions are: does anyone know of another way to get online statuses other than FQL queries? Has anyone else attempted to crawl Facebook? Has anyone attempted to crawl any site with this BigPipe response?
Thank you in advance,
Jake
You may be able to write a Firefox extension. You will not be able to scrape FB without JavaScript, which pretty much rules out most traditional scraping methods.
Using PyQt4.QtWebKit will help to deal with JavaScript.
Here is some basic usage of it: webkit-pyqt-rendering-web-pages
Documentation: PyQt4-qtwebkit.html
I just finished a school project which required user data from Facebook group members. I used a web crawling tool, Octoparse, for data extraction; it's a non-programming application and can be used to crawl different types of data on Facebook. You can go to this tutorial: Facebook Scraping Case Study | Scraping Facebook Groups

Python get data from secured website

I'd like to know if there is a way to get information from my banking website with Python. I'd like to retrieve my card history and display it, and possibly save it into a text document each month.
I have found the URLs to log in and get the information from the website, which work from a browser, but I have been using urllib2 to "open" the webpages from Python and I have a feeling it's not working because of some cookie or session issue.
I can get any information I want from a website that does not require a login with urllib2, and then save the actual HTML and go through it later, but I can't on my bank's website.
Any help would be appreciated.
This is a part of web scraping:
Web scraping is a standard task that can serve various needs.
Scraping data from a secure website means using HTTPS.
Handling HTTPS is not a problem with mechanize and BeautifulSoup,
although urllib2 with a cookielib CookieJar (through HTTPCookieProcessor) also works fine.
If managing the cookies is the problem, then I would recommend mechanize.
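As a minimal sketch of the cookie handling mentioned above, here is the Python 3 equivalent of the urllib2 approach (urllib2 and cookielib became urllib.request and http.cookiejar). The form field names "username" and "password" are hypothetical; a real login form may use different names and extra hidden fields.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_opener():
    """Return an opener that stores and resends cookies, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login(opener, login_url, username, password):
    """POST the login form; any session cookie ends up in the opener's jar."""
    # Hypothetical field names: check the real form's <input name="..."> values.
    body = urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode("utf-8")
    with opener.open(login_url, data=body, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


opener, jar = make_opener()
```

Every later request made with the same opener.open(...) sends back the cookies collected at login, which is the piece a plain urlopen call is missing.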
Considering the case of your bank's site:
I would recommend not playing with your account.
If you must, then it's not as easy as any normal secure or non-secure site.
These sites are designed to withstand such scripts.
Problems that you would face with this:
Bank sites will most likely have a CAPTCHA, which is almost impossible to bypass with a script unless you employ a lot of rocket science and effort.
Another problem that you will most likely face is JavaScript. Standard scripting solutions focus on managing cookies, HTML parsing, etc. To handle JavaScript-driven links, you will have to process the JS in your Python script, which again needs a lot of effort.
Then there is AJAX, which also comes from JavaScript and fetches data from the server after the page loads.
So this task will require a lot of effort.
Also, if you try doing this, you risk having access to your account blocked, since banking sites are quick to block account access after 3-4 unsuccessful attempts at login, CAPTCHA, etc.
So, think before you do.
