I am trying to use BeautifulSoup (or another web scraping library) to automate web forms. For example, the Facebook login page also contains a registration form, so let's say I want to fill out that form through automation. I would need to find the relevant HTML tags (such as the inputs for first name, last name, etc.), take all of that input, and push a request to Facebook to create the account. How would this be done?
I am a beginner at scraping too and faced these problems myself. For basic scraping operations, Beautiful Soup works well. While learning more about scraping I came across the Scrapy framework, which covers much more of the functionality you describe and is recommended by many professional web scrapers. Try out Scrapy here.
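As a starting point with BeautifulSoup itself, the form-discovery half of the job can be sketched as below. The HTML string is a stand-in for a page fetched with Requests, and the field names are made up; real registration forms also carry hidden anti-bot tokens, and automating account creation on Facebook violates its terms of service, so treat this purely as the general pattern:

```python
from bs4 import BeautifulSoup

# Stand-in for html = requests.get(signup_url).text
html = """
<form action="/reg/submit" method="post">
  <input name="firstname"> <input name="lastname">
  <input type="hidden" name="token" value="abc123">
</form>
"""

soup = BeautifulSoup(html, "html.parser")
form = soup.find("form")

# Collect every input's name, keeping hidden values the server expects back
payload = {i["name"]: i.get("value", "") for i in form.find_all("input")}
payload.update({"firstname": "Jane", "lastname": "Doe"})
print(form["action"], payload)
```

A call like `requests.post(base_url + form["action"], data=payload)` would then submit it; that network call is omitted here.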
Related
I want to access my business's database on a site and scrape it using Python (I'm using Requests and BS4, and can go further if needed), but I couldn't.
Can someone provide info and simple resources on how to scrape such sites?
I'm not talking about providing usernames and passwords; the site requires much more than that.
How do I know what information my script must provide aside from the username and password (e.g., how do I know that I must supply, say, an auth token)?
How do I deal with the site when there are no plain HTTP URLs, only hrefs in the form of javascript:__doPostBack?
And in that regard, how do I move from the login page to the page I want (the one behind the aforementioned javascript:__doPostBack)?
Are the libraries I'm using enough, or do you recommend using (and, in my case, learning) something else?
Your help is greatly appreciated and thanked.
You didn't mention what you use for scraping, but since it sounds like much of the interaction on this site relies on client-side code, I'd suggest driving a real browser and interacting with the site not through low-level HTTP requests but through client-side actions (such as typing into elements or clicking buttons). That way you don't need to work out what form data to send or how to derive the link URLs yourself.
One recommended way to do this is to use BeautifulSoup together with Selenium / WebDriver. There are multiple resources on how to do this, for example: How can I parse a website using Selenium and Beautifulsoup in python?
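The division of labour is: Selenium renders the page (running its JavaScript), then `driver.page_source` is handed to BeautifulSoup for querying. A minimal sketch, with the Selenium lines commented out so the parsing step runs standalone; the URL, element names, and the presence of a browser driver are assumptions:

```python
from bs4 import BeautifulSoup

# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get("https://example.com/friends")
# html = driver.page_source  # rendered HTML, after client-side scripts have run
html = "<div id='status'><span class='online'>Alice</span></div>"  # stand-in

soup = BeautifulSoup(html, "html.parser")
online = [s.get_text() for s in soup.select("#status .online")]
print(online)
```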
I am learning Python right now and want to level up my knowledge, particularly in scraping. I am currently using Scrapy and starting to use it along with Splash. I wanted to scrape a more challenging website, the airline site https://www.airasia.com/en/home.page?cid=1. A web developer friend told me it would be impossible to scrape this type of website, since no regular JSON or XML files are returned with the data to be scraped; he said the data can only be accessed through an API (he said something about a RESTful API). I don't quite believe him. So as not to waste my time, I'd be happy if someone could confirm this, and happier still if someone says it can be scraped and gives me tips on how, ideally with proof.
Many thanks.
Almost any website can be scraped, but some websites are trickier than others.
Instead of Scrapy, I would recommend a better-suited alternative, Selenium, which has a Python library as well.
Long story short: you start a web browser in the form of a driver, navigate to the page of your choice, and simulate user interactions such as clicking, entering data into forms, and submitting them. You will also be able to run JavaScript functions.
You might also want to do some research on the legal constraints to ensure your operation is not unlawful. For instance, refer to the case law of Ryanair Ltd v PR Aviation BV (Case C-30/14, CJEU).
You have two options: use their API, if they expose one, to make HTTP requests and obtain data and information from their servers,
or use a Python scraping / web-testing framework, e.g. Scrapy or Selenium, to scrape their website directly from a Python program.
Scrapy will be harder than Selenium on this website, because a lot of the content is dynamic and will require custom code to trigger. Selenium should be easy to use.
I am attempting to build a web crawler that signs into Facebook and checks the online status of some family members, for a project I'm building for my parents. Searching around, I found that this is attainable through FQL queries on friend online presence, but it seems that feature will be removed around April of this year. So I thought I might write a basic crawler myself in Python to pull the HTML for online friends in my chat, but when I print out the HTML after attempting to log in, I get a very large amount of jumbled HTML and JavaScript that mentions "BigPipe." I see that BigPipe breaks pages into pagelets, but I'm a little confused about what to make of this information.
So my questions are: does anyone know of another way to get online statuses besides FQL queries? Has anyone else attempted to crawl Facebook, or crawled any site that serves this kind of BigPipe response?
Thank you in advance,
Jake
You may be able to write a Firefox extension. You will not be able to scrape FB without JavaScript, which pretty much rules out most traditional scraping methods.
Using PyQt4.QtWebKit will help you deal with JavaScript.
Here is some basic usage of it: webkit-pyqt-rendering-web-pages
Documentation: PyQt4-qtwebkit.html
I just finished a school project that required user data from Facebook group members. I used a web crawling tool, Octoparse, for the data extraction; it's a non-programming application that can crawl different types of data on Facebook. You can follow this tutorial: Facebook Scraping Case Study | Scraping Facebook Groups
I need to develop a web app that extracts book prices from different e-commerce sites such as Amazon and Homeshop18: the user enters a book name in the interface, and the app displays all the information.
My questions are:
1) How do I pass that query to the Amazon site's search box, so that I get only the pages relevant to the query instead of crawling the whole site?
2) What can be used to develop this application, BeautifulSoup or Scrapy? APIs are not available for every e-commerce site.
I am new to Python, so any help will be highly appreciated.
I personally use BeautifulSoup to parse web pages, but beware: it's a bit slow if you have to parse pages massively. I know that lxml is faster, but a bit less coder-friendly. To guess the right parameters (for either an HTTP GET or POST) for getting the result page you want, proceed like this:
Switch on the Firebug plugin for Firefox, or the integrated inspector for Chrome.
Go to the web page you're interested in and perform the search.
Look in Firebug / the inspector at the parameters of the HTTP request Firefox or Chrome sent to the website.
Reproduce the request in your Python script, for example using urllib.
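The last step can be sketched with the standard library alone; the endpoint and parameter names below are hypothetical placeholders for whatever the inspector actually showed:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Parameters as copied from Firebug / the Chrome inspector (hypothetical names)
params = {"field-keywords": "python web scraping"}
url = "https://www.example.com/s?" + urlencode(params)

# Some sites refuse requests that lack a browser-like User-Agent
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req).read() would fetch the result page; omitted here
```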
There is another way to guess the right HTTP GET or POST parameters: use a network analyzer such as Wireshark. It is a more detailed approach, but it feels like finding a needle in a haystack once you have used the tools built into Firefox and Chrome.
I am building a web application as a college project (using Python) where I need to read content from websites. It could be any website on the internet.
At first I thought of using screen scrapers like BeautifulSoup and lxml to read content (data written by the authors), but I am unable to search the content with a single piece of logic, because each website is built to different standards.
So I thought of using RSS/Atom (with Universal Feed Parser), but I could only get the content summary, and I want all the content, not just the summary.
So, is there one piece of logic by which we can read any website's content using libraries like BeautifulSoup or lxml?
Or should I use the APIs provided by the websites?
My job becomes easy if it's a Blogger blog, since I can use the Google Data API, but the trouble is: do I need to write code against a different API for every site, for the same job?
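When a publisher does ship the full text, it usually sits in a content:encoded element alongside the summary description; many feeds genuinely carry only summaries, in which case no parser can recover more. A standard-library sketch, where the feed string stands in for a downloaded one:

```python
import xml.etree.ElementTree as ET

NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

rss = """<rss xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel><item>
  <title>Post</title>
  <description>Just a summary.</description>
  <content:encoded>Full article body here.</content:encoded>
</item></channel></rss>"""

item = ET.fromstring(rss).find("channel/item")
full = item.find("content:encoded", NS)
# Fall back to the summary when no full-text element is present
body = full.text.strip() if full is not None else item.findtext("description")
print(body)
```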
What is the best solution?
Using the website's public API, when one exists, is by far the best solution. That is exactly why the API exists: it is the way the website administrators say "use our content". Scraping may work one day and break the next, and it does not imply the website administrators' consent to have their content reused.
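For comparison, consuming a public API usually means fetching JSON and reading fields by name rather than reverse-engineering markup. A sketch with a stand-in response; the endpoint and field names are hypothetical:

```python
import json

# from urllib.request import urlopen
# raw = urlopen("https://api.example.com/posts/1").read()  # real fetch, omitted
raw = '{"title": "Hello", "content": "Full post body", "author": "jane"}'

post = json.loads(raw)
print(post["title"], "by", post["author"])
```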
You could look into content-extraction libraries; I've used Full-Text RSS (PHP) and Boilerpipe (Java). Both have a web service available, so you can easily test whether they meet your requirements. You can also download and run them yourself and further modify their behavior on individual sites.