Fetching data from a website using Python

I would like to know whether I can use Python to fetch data from a website by giving it some specific inputs.
I know I should write some code, but I'm starting from scratch and a bit confused, so I hope you'll understand.
Explanation:
This is our university website:
http://exam.cusat.ac.in/
I want to click, through the program, on the first link given on the website, which is:
Download/View Result for B.Tech V Semester November 2016 - Regular Examination
The next page has an option to enter the registration number; since I know the registration numbers, I can assign them to a variable.
Here I want to get the results of multiple students; that is the main aim of the program.
e.g. the results for registration numbers 12153600 to 12153660 should be retrieved one by one from the website.
Finally, if I can get the results, can I convert them to PDF? If possible, can I combine all of those results into a single PDF file, with each result on a different page?

These are my observations about the site:
The site uses a form to display a student's result.
The form URL is http://exam.cusat.ac.in/erp5/cusat/CUSAT-RESULT/Result_Declaration/display_sup_result
The form method is POST.
The data passed to the URL are regno, deg_name, semester, month, year and result_type.
So you need to send a POST request to that URL with the above parameters. You can do that with plain Python and the requests library.
import requests  # to make the POST requests
import pdfkit    # to save the results as PDF (needs wkhtmltopdf installed)

url = "http://exam.cusat.ac.in/erp5/cusat/CUSAT-RESULT/Result_Declaration/display_sup_result"  # form url
payload = {"deg_name": "B.Tech", "semester": "5", "month": "November",
           "year": "2016", "result_type": "Regular"}
options = {'quiet': ''}
html_files = []  # keep the saved file names in case you want to merge them later

for i in range(12153600, 12153660 + 1):
    payload["regno"] = str(i)
    response = requests.post(url, data=payload)
    # save the result for this roll number as result_<rollno>.html
    with open("result_" + str(i) + ".html", "w") as f:
        f.write(response.text)
    html_files.append("result_" + str(i) + ".html")
    # render the same HTML to <rollno>.pdf (12153600.pdf - 12153660.pdf)
    pdfkit.from_string(response.text, str(i) + ".pdf", options=options)
This creates 61 separate PDF files, one per roll number.
To save the responses as PDF files you can use pdfkit; refer to its documentation for installation and a tutorial. There are a number of other Python packages for writing PDFs, but I prefer pdfkit because it accepts a list of input files, so you can collect all the saved results and build a single PDF from them.
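If you would rather have one combined PDF instead of 61 separate files, here is a minimal sketch building on the loop above (html_files is the list of saved HTML paths collected there; pdfkit's from_file accepts a list of input files):
import pdfkit

# html_files is the list of result_<rollno>.html paths saved in the loop above
pdfkit.from_file(html_files, "all_results.pdf", options={'quiet': ''})
wkhtmltopdf renders the inputs one after another, so each result ends up in the combined file.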

You should check out the Selenium python library.
You'll be able to achieve what you want with that library. Specifically, you would use Selenium's get function to get your website, selenium's click function to click the first link, and so on.
A lot of researchers use it to simulate click events on websites such as Facebook and gather the resulting data.
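A minimal sketch of that flow (the partial link text and the regno field name are assumptions taken from the question; verify them with your browser's inspector):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # or webdriver.Firefox()
driver.get("http://exam.cusat.ac.in/")

# click the result link (the link text here is an assumption)
driver.find_element(By.PARTIAL_LINK_TEXT, "B.Tech V Semester November 2016").click()

# fill in the registration number ("regno" is an assumed field name) and submit the form
regno_box = driver.find_element(By.NAME, "regno")
regno_box.send_keys("12153600")
regno_box.submit()

print(driver.page_source)  # HTML of the result page
driver.quit()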

You should check out the requests module for getting data from HTML.
Here are the links for tutorial purposes:
http://docs.python-requests.org/en/master/user/quickstart/
https://media.readthedocs.org/pdf/requests/master/requests.pdf
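As a very small illustration of the basic pattern from those tutorials (just a sketch of a plain GET request):
import requests

r = requests.get("http://exam.cusat.ac.in/")
print(r.status_code)  # 200 means the request succeeded
print(r.text[:500])   # first 500 characters of the returned HTML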

You can use Python's requests library for sending the requests and BeautifulSoup to parse the html that you receive.
First, you need to inspect the page using your browser's dev tools. If you do that, you will find that each link row is a form element -
<form id="myForm0121x1" action="..." method="post">
<input name="month" value="..." type="hidden">
<input name="year" value="..." type="hidden">
<input name="sem" value="..." type="hidden">
<input name="reg_type" value="..." type="hidden">
<input name="dn" value="..." type="hidden">
<input name="status1" value="..." type="hidden">
</form>
Each link is a POST request to the action attribute's url value along with the input elements. To do this programmatically using requests -
r = requests.post('url',data={'month':'...','year':'...','sem':'...','reg_type':'...','dn':'...','status1':'...'})
If you then check r.content, you will have the source of the second page. Repeat the process for this page, this time changing the data parameter's keys/values accordingly (use the inspector) and adding an extra 'regno': 'xyz' (where xyz is a registration number), and you will receive the final HTML content for a student's result page. Parse this using BeautifulSoup and pick out whatever you need.
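Roughly, that second request and the parsing step could look like this (the field values are placeholders; copy the real action URL and hidden-input values from the inspector):
import requests
from bs4 import BeautifulSoup

# placeholder values - read the real ones off the hidden inputs with the inspector
data = {'month': 'November', 'year': '2016', 'sem': '5', 'reg_type': 'Regular',
        'dn': 'B.Tech', 'status1': '...', 'regno': '12153600'}
r = requests.post('action-url-from-the-form', data=data)

soup = BeautifulSoup(r.content, 'html.parser')
for cell in soup.find_all('td'):          # e.g. walk the cells of the result table
    print(cell.get_text(strip=True))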

Related

BeautifulSoup returns No Data Recorded when getting table from web

New to web scraping.
I need to get the data from the Daily Observations table (the long table at the end of the page) on this page:
https://www.wunderground.com/history/daily/us/tx/greenville/KGVT/date/2015-01-05?cm_ven=localwx_history
The HTML of the table starts with <table _ngcontent-c16="" class="tablesaw-sortable" id="history-observation-table">
My code is:
url = "https://www.wunderground.com/history/daily/us/tx/greenville/KGVT/date/2015-01-05?cm_ven=localwx_history"
html = urlopen(url)
soup = BeautifulSoup(html,'lxml')
soup.findAll(class_="region-content-observation")
And the output is:
[<div class="region-content-observation">
<city-history-observation _nghost-c34=""><div _ngcontent-c34="">
<div _ngcontent-c34="" class="observation-title">Daily Observations</div>
<!-- -->
No Data Recorded
<!-- -->
</div></city-history-observation>
</div>]
So it's not getting the table and returns "No Data Recorded", though it did get the title.
And when I tried
soup.findAll(class_="tablesaw-sortable")
or
soup.findAll('tr')
it only returned an empty list.
Does anyone know what went wrong?
If you open the web page in Firefox, you can use the Network tab from its Developer Tools to see all the different web resources that are downloaded. The data you are interested in is actually provided by this JSON file – which can be retrieved and then parsed using Python’s json library.
Note: I’ve never scraped a site that uses API keys so I’m not sure about the ethics or best practice in this situation. As a test, I was able to download the JSON file without any problems. However, I suspect Weather Underground wouldn’t want you using their key too many times – and it looks like they no longer provide free weather API keys.
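A minimal sketch of that approach (the URL below is a stand-in for whatever JSON request you find in the Network tab, and I'm not assuming anything about the response structure):
import requests

# stand-in URL - copy the real JSON request URL (including its apiKey) from the Network tab
json_url = "https://api.weather.com/...your-observation-request..."
observations = requests.get(json_url).json()   # parsed straight into Python dicts/lists

print(observations.keys())  # inspect the structure first, then drill down to the table data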

form, input problems when parsing webpage with selenium

I'm parsing webpages with Selenium and beautifulsoup4, and I have a problem parsing a specific webpage.
I get different HTML source depending on whether I view the page source in the browser or parse the page with Selenium or bs4.
The difference is the existence of a form and an input.
When I parse that page, I get HTML with
<form action="" method="post" name="fmove">
<input name="goAction" style="display:none" type="submit"/>
</form>
I can't find what to input or submit.
Please help me understand this problem.
Thanks!
I'm going to concentrate on '[finding] what to input or submit' without touching on wider questions. Even so, what I tell you is not guaranteed to yield answers if code associated with that page does not arrange to fill the form's action attribute and/or some of its input elements with name and value pairs.
First, open the page in the Chrome browser. Use the item in the context menu to 'Inspect' the element on the screen to find the Javascript that finally submits that form. Put a breakpoint on the line in the code where this happens. Now reload the page (F5) and exercise the form. The code should stop at the breakpoint. You should be able to see the properties of the form element, including action and the name-value pairs, in the rightmost portion of the screen where you can copy them for use in your own code.
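Once you have copied the form's action URL and its name-value pairs from the debugger, you can replay the submission outside the browser; a minimal sketch (all names and values below are hypothetical placeholders):
import requests

# hypothetical placeholders - substitute what you copied from the debugger
action_url = "https://example.com/path/from/the/action/attribute"
form_data = {"goAction": "", "someHiddenField": "someValue"}

r = requests.post(action_url, data=form_data)
print(r.text[:500])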
PS: I really must mention that it's difficult to be sure of a lot of this without knowing what site you're scraping. Good luck!

Using POST to use built in search form on webpage

I'm fairly new to Python, and this is my first post to Stack Overflow. As a starting project I'm trying to write a program that will gather the prices of board games from different websites that sell them. As part of this I'm trying to write a function that will use a website's built-in search function to find the webpage I want for a game that I input.
The code I'm using so far is:
import requests
body = {'keywords':'galaxy trucker'}
con = requests.post('http://www.thirstymeeples.co.uk/', data=body)
print(con.content)
My problem is that the webpage it returns is not the webpage I get when I manually input and search for 'galaxy trucker' on the website itself.
The html for the search form in question is
<form method="post" action="http://www.thirstymeeples.co.uk/">
<input type="search" name="keywords" id="keywords" class="searchinput" value>
</form>
I have read this, but there the difference seems to be that the search form actually appears on the webpage, whereas with mine, the web address given in the action attribute does not itself display a search bar. In that example there is also no id attribute in the HTML, whereas in mine there is; does this make a difference?
There's no search form on the index page, but if you do a "manual" search from the "games" page (which does have a form), you end up on a page with this URL:
http://www.thirstymeeples.co.uk/games/search/q?collection=search_games|search_designers|search_publishers&loose_ends=right&search_mode=all&keywords=galaxy+trucker
Notice that this page does take GET params, and that if you change the keywords from "galaxy+trucker" to anything else you get an updated result page. In other words, you want to do a GET request to http://www.thirstymeeples.co.uk/games/search/q:
r = requests.get("http://www.thirstymeeples.co.uk/games/search/q", params={"keywords": "galaxy trucker"})
print(r.content)

A python script that automatically input some text in a website and get its source code

I am doing biomedical named entity extraction using Python.
Now I have to cross-check the results by inputting the text at http://text0.mib.man.ac.uk/software/geniatagger/ and parsing the source code of the HTML page that I get back after submitting the text.
I want the same thing to be done from my GUI itself, i.e. take the input from the GUI that I have made, submit the text to this website, and get the source code back, so that I don't have to visit the site in a browser each time for cross-checking.
Thanks in advance
Actually, this is a great question!
The first thing you have to do is explore the source code of the website a little bit.
If you look at the source code of the website, you see this block of code:
<form method="POST" action="a.cgi">
<p>
Please enter a text that you want to analyze.
</p>
<p>
<textarea name="paragraph" rows="15" cols="80" wrap="soft">
... some text here ...
### This is a sample. Replace this with your own text.
</textarea>
</p>
<p>
<input type="submit" value="Submit Text" />
<input type="reset" />
</p>
</form>
What you see is that the request is sent to the a.cgi address. Since we are already at
http://text0.mib.man.ac.uk/software/geniatagger/
the data we want to send will go to that address with a.cgi appended:
http://text0.mib.man.ac.uk/software/geniatagger/a.cgi
But what are we going to send there?
We need data. The data is sent as the "paragraph" POST parameter; you can see that because the form's method attribute has the value POST, and the name of the textarea is "paragraph".
We send this request using the following Python 2 code:
import urllib
import urllib2

text = """
Further, while specific constitutive binding to the peri-kappa B site is seen in monocytes, stimulation with phorbol esters induces additional, specific binding. Understanding the monocyte-specific function of the peri-kappa B factor may ultimately provide insight into the different role monocytes and T-cells play in HIV pathogenesis.
### This is a sample. Replace this with your own text.
"""

data = {"paragraph": text}
encoded_data = urllib.urlencode(data)  # URL-encode the POST body
content = urllib2.urlopen("http://text0.mib.man.ac.uk/software/geniatagger/a.cgi",
                          encoded_data)  # passing data makes this a POST request
print content.readlines()
And what do we have so far? We have an "engine" for your GUI program.
What you can do next is parse this content variable with Python's HTMLParser (optional).
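Note that the code above is Python 2 (urllib2 does not exist in Python 3). A rough Python 3 equivalent of the same POST using the requests library, just as a sketch:
import requests

text = "Further, while specific constitutive binding to the peri-kappa B site ..."  # your text here
r = requests.post("http://text0.mib.man.ac.uk/software/geniatagger/a.cgi",
                  data={"paragraph": text})
print(r.text)  # the HTML page containing the tagger's output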
And you mentioned that you want to display this in a GUI?
You can do this using GTK or Qt and map this functionality to a single button; read a tutorial, it's really easy for this purpose. If you have problems, just comment on this post and I can extend this answer with the GUI part.

Programmatic Form Submit

I want to scrape the contents of a webpage. The contents are produced after a form on that site has been filled in and submitted.
I've read up on how to scrape the resulting content/webpage, but how do I programmatically submit the form?
I'm using Python and have read that I might need to get the original webpage with the form, parse it, get the form parameters and then do X?
Can anyone point me in the right direction?
Using Python, I think it takes the following steps:
parse the web page that contains the form, and find out the form's submit address and the submit method ("post" or "get").
this explains form elements in an HTML file
Use urllib2 to submit the form. You may need functions like "urlencode" and "quote" from urllib to generate the URL and data for the POST method. Read the library docs for details.
You'll need to generate an HTTP request containing the data for the form.
The form will look something like:
<form action="submit.php" method="POST"> ... </form>
This tells you the url to request is www.example.com/submit.php and your request should be a POST.
In the form will be several input items, eg:
<input type="text" name="itemnumber"> ... </input>
You need to create a string of all these input name=value pairs, URL-encoded and appended to the end of your requested URL, which now becomes
www.example.com/submit.php?itemnumber=5234&otherinput=othervalue etc...
This will work fine for GET. POST is a little trickier.
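As a sketch of both cases in modern Python's urllib (the URL and field names are just the example values from above, not a real site):
from urllib.parse import urlencode
from urllib.request import urlopen

params = {"itemnumber": "5234", "otherinput": "othervalue"}  # example fields from above

# GET: encode the name=value pairs into the query string
get_url = "http://www.example.com/submit.php?" + urlencode(params)
with urlopen(get_url) as resp:
    print(resp.read()[:200])

# POST: send the same encoded pairs as the request body instead
with urlopen("http://www.example.com/submit.php", data=urlencode(params).encode()) as resp:
    print(resp.read()[:200])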
Just follow S.Lott's links for some much easier to use library support :P
From a similar question - options-for-html-scraping - you can learn that with Python you can use Beautiful Soup.
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:
Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
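For this question in particular, here is a quick sketch of using Beautiful Soup to pull a form's submit address and input fields out of a fetched page (the html string below just reuses the example form from the other answer):
from bs4 import BeautifulSoup

html = "<form action='submit.php' method='POST'><input type='text' name='itemnumber'></form>"
soup = BeautifulSoup(html, "html.parser")

form = soup.find("form")
print(form.get("action"), form.get("method"))   # where and how to submit

for inp in form.find_all("input"):
    print(inp.get("name"), inp.get("value"))    # the fields you need to fill in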
You can do it with JavaScript. If the form is something like:
<form name='myform' ...
Then you can do this in javascript:
<script language="JavaScript">
function submitform()
{
document.myform.submit();
}
</script>
You can use the "onClick" attribute of links or buttons to invoke this code. To invoke it automatically when a page is loaded, use the "onLoad" attribute of the <body> element:
<body onLoad="submitform()" ...>
