Using POST to use a built-in search form on a webpage - Python

I'm fairly new to Python, and this is my first post to Stack Overflow. As a starting project, I'm trying to write a program that will gather the prices of board games from the different websites that sell them. As part of this, I'm trying to write a function that will use a website's built-in search function to find the webpage I want for a game that I input.
The code I'm using so far is:
import requests
body = {'keywords':'galaxy trucker'}
con = requests.post('http://www.thirstymeeples.co.uk/', data=body)
print(con.content)
My problem is that the webpage it returns is not the webpage I get when I manually input and search for 'galaxy trucker' on the website itself.
The html for the search form in question is
<form method="post" action="http://www.thirstymeeples.co.uk/">
<input type="search" name="keywords" id="keywords" class="searchinput" value>
</form>
I have read this, but the difference there seems to be that the search form actually appears on the target webpage, whereas with mine, the web address given in the action attribute does not itself display a search bar. Also, in that example there is no id attribute in the HTML, whereas in mine there is; does this make a difference?

There is no search form on the index page, but if you do a "manual" search from the "games" page (which does have a form), you end up on a page with this URL:
http://www.thirstymeeples.co.uk/games/search/q?collection=search_games|search_designers|search_publishers&loose_ends=right&search_mode=all&keywords=galaxy+trucker
Notice that this page does take GET params, and that if you change the keywords from "galaxy+trucker" to anything else you get an updated result page. In other words, you want to do a GET request to http://www.thirstymeeples.co.uk/games/search/q:
r = requests.get("http://www.thirstymeeples.co.uk/games/search/q", params={"keywords": "galaxy trucker"})
print(r.content)
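If the other query parameters matter for the results, you could pass them along too. A slightly fuller sketch, assuming the same parameters seen in the manual-search URL above are wanted (requests will URL-encode them for you):
import requests

params = {
    "collection": "search_games|search_designers|search_publishers",
    "loose_ends": "right",
    "search_mode": "all",
    "keywords": "galaxy trucker",
}
r = requests.get("http://www.thirstymeeples.co.uk/games/search/q", params=params)
print(r.status_code)
print(r.text[:500])  # first part of the result page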

Related

How to distinguish between product's page and a regular page

I am trying to scrape:
https://www.lanebryant.com/
My crawler starts from a URL and then follows all the links mentioned on that page. I previously scraped another site, and my logic worked by checking whether the URL contained the string "products" before downloading the product's information. On this site there is no such pattern. How do I distinguish between a product page and a regular page? (All it requires is an if statement.) I hope my question is clear. For the record, here is a product page for this site:
https://www.lanebryant.com/faux-wrap-maxi-dress/prd-358414#color/0000081590
Something that might be helpful in this case is to go through several product pages (visually at first) and look for similarities in their HTML. If you're new to this, just go to the page and do something like right-click + "View Page Source" (that's how to do it in Chrome). For the page example you gave, a probably relevant element would be:
<input type="submit" class="cta-btn btn btn--full mar-add-to-bag asc-bag-action grid__item" value="Add to Bag">
which corresponds to the "Add to Bag" button.
Then you might look into how to use BeautifulSoup to actually go through the HTML elements of the page and do your filtering based on this.
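For example, a minimal sketch along those lines, assuming requests is used to fetch the page and that the "Add to Bag" button reliably marks a product page:
import requests
from bs4 import BeautifulSoup

def is_product_page(url):
    """Heuristic: a product page contains an 'Add to Bag' submit button."""
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("input", {"type": "submit", "value": "Add to Bag"}) is not None

print(is_product_page("https://www.lanebryant.com/faux-wrap-maxi-dress/prd-358414"))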
Hope that helps!

Scrape table from html (<tr> and ID method not working)

I'm currently trying to do a web scrape of a table from this website: http://pusatdata.kontan.co.id/reksadana/produk/469/Schroder-90-Plus-Equity-Fund
Specifically the grey table with the headers "TANGGAL/NAB/DIVIDEN/DAILY RETURN (%)".
Below is the code that I use:
import requests
import urllib.request
from bs4 import BeautifulSoup
quote_page = "http://pusatdata.kontan.co.id/reksadana/produk/469/Schroder-90-Plus-Equity-Fund"
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, "html.parser")
table = soup.find('div',id='table_flex')
print (table.text)
But no output was generated at all. Appreciate your help. Thank you very much!
When you don't get the results you expect from your code, you need to backtrack to figure out where your code broke.
In this case, your first step would be to check the value of table. As it turns out, table is not None (which would signify a bad selector/soup.find call), so you at least know that you got that much right.
Instead, what you'll notice is that the table_flex div is empty. This isn't terribly surprising to me, but let's pretend I don't know anything and this doesn't make any sense. So the next step would be to pull up a browser and to double check that the DOM (via your browser's inspect tool) has content in the table_flex div.
It does, so now you have to do some real digging. If you run a simple search on the DOM in the inspect window for "table_flex", you'll first see the div that we already know about, but then you'll see some Javascript/jQuery further down the page that references "#table_flex".
This Javascript is part of a $.ajax() call (which you would google and find out is basically a query to a webserver for information). You'll also note that $("#table_flex") has an html() method (which, after more googling, you find out sets the html content for a particular element).
And now we have your answer for why the div is empty: when the webserver is queried for that page, the server sends back a document that has a blank table. The querying party is then expected to execute the Javascript to fill in the rest of the page. Generally speaking, Python modules don't run Javascript (for several reasons), so the table never gets populated.
This tends to be standard operating procedure for dynamic content, as "template" webpages can be cached and quickly distributed (since no additional information is needed) and then the rest of the information is supplied as the user needs it. This can also allow the same document to be used for multiple urls and query arguments without having to generate new documents.
Ultimately, what will probably be easiest for you is to determine whether you can access that API directly yourself and simply query that url instead.
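As a hedged illustration of that last step: once the browser's network tab (or the jQuery code itself) reveals the URL the $.ajax() call requests, you can query it directly with requests. The endpoint and parameter below are purely hypothetical placeholders, not the site's real API:
import requests

ajax_url = "http://pusatdata.kontan.co.id/SOME-AJAX-ENDPOINT"  # placeholder; copy the real URL from the network tab
r = requests.get(ajax_url, params={"produk": "469"})           # hypothetical parameter
print(r.status_code)
print(r.text[:300])  # the fragment the page would have inserted into #table_flex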
There was no output generated because there is no text within the <div> with the id table_flex. So this shouldn't be a surprise.
The "table" in question can be found under a <div> with the id manajemen_reksadana. The two rows are not directly under that <div>, and the whole "table" is made of <div>s, so it's best to navigate to the known header/label texts and address the <div> containing the value relative to the <div> with the header/label text:
fund_management_node = soup.find('div', id='manajemen_reksadana')
for label_text in ['PRODUK', 'KATEGORI', 'NAB', 'DAILY RETURN']:
    # Find the <div> holding the label text, then print the value in its sibling <div>
    label_node = fund_management_node.find(text=label_text).parent
    print(label_node.find_next_sibling('div').text)

form, input problems when parsing webpage with selenium

I'm parsing webpages with Selenium and beautifulsoup4, and I have a problem parsing one specific webpage.
I get different HTML source depending on whether I view the page source in the browser or parse it with Selenium or bs4.
The difference is the presence of a form and an input.
When I parse that page, I get HTML containing
<form action="" method="post" name="fmove">
<input name="goAction" style="display:none" type="submit"/>
</form>
I can't figure out what to input or submit.
Please help me understand this problem.
Thanks!
I'm going to concentrate on '[finding] what to input or submit' without touching on wider questions. Even so, what I tell you is not guaranteed to yield answers if code associated with that page does not arrange to fill the form's action attribute and/or some of its input elements with name and value pairs.
First, open the page in the Chrome browser. Use the item in the context menu to 'Inspect' the element on the screen to find the Javascript that finally submits that form. Put a breakpoint on the line in the code where this happens. Now reload the page (F5) and exercise the form. The code should stop at the breakpoint. You should be able to see the properties of the form element, including action and the name-value pairs, in the rightmost portion of the screen where you can copy them for use in your own code.
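A minimal sketch of what that might look like once you have those pairs, assuming you then replay the request with requests rather than Selenium; the URL and the commented-out field are placeholders, and only goAction comes from the form shown in the question:
import requests

page_url = "http://example.com/the-page"  # placeholder; action="" means the form posts back to the same page
payload = {
    "goAction": "",  # the hidden submit input from the question's form
    # "field_copied_at_breakpoint": "value_copied_at_breakpoint",  # hypothetical extra fields
}
r = requests.post(page_url, data=payload)
print(r.status_code, len(r.text))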
PS: I really must mention that it's difficult to be sure of a lot of this without knowing what site you're scraping. Good luck!

Fetching data from a website using Python

I would like to know whether I can use Python to fetch data from a website by supplying some specific inputs.
I know I should write some code myself, but I'm starting from scratch and a bit confused; I hope you'll understand.
Explanation:
This is our university website:
http://exam.cusat.ac.in/
I want to click on the first link through the program, which is given on the website as:
Download/View Result for B.Tech V Semester November 2016 - Regular Examination
Then the next page has an option to enter the registration number; since I know the registration number, I can assign it to a variable.
Here I want to get the results of multiple students, that is the main aim of the program.
eg: The results starting from 12153600 to 12153660 should be retrieved one by one from the website.
The last thing: if I can get the results, can I convert them into PDF? If possible, can I combine all of those results into a single PDF file as different pages?
These are my observations with the site:
The site uses a form to display a student's result
The form URL is http://exam.cusat.ac.in/erp5/cusat/CUSAT-RESULT/Result_Declaration/display_sup_result
The form method is POST
The data passed to the URL are regno, deg_name, semester, year, result_type
So you need to make a POST request to that URL with the above-mentioned parameters. You can do that with plain Python and requests.
import requests  # to make requests
import pdfkit    # for saving as PDF (needs the wkhtmltopdf binary installed)

url = "http://exam.cusat.ac.in/erp5/cusat/CUSAT-RESULT/Result_Declaration/display_sup_result"  # form URL
pdfs = []  # collect the generated file names here in case you want to merge them later
payload = {"deg_name": "B.Tech", "semester": "5", "month": "November", "year": "2016", "result_type": "Regular"}
option = {'quiet': ''}
for i in range(12153600, 12153660 + 1):
    payload.update({"regno": str(i)})
    response = requests.post(url, data=payload)
    pdfkit.from_string(response.text, str(i) + ".pdf", options=option)  # saves to 12153600.pdf - 12153660.pdf
    pdfs.append(str(i) + ".pdf")
    with open("result_" + str(i) + ".html", "w") as f:  # also keep the raw HTML per registration number
        f.write(response.text)
This creates 61 separate PDF files (one per registration number).
To save the responses as PDF files you can use pdfkit;
refer to this for installation and this for a tutorial. I want you to go through the PDF-saving part hands-on, so I'm skipping the details here. If you find it difficult, there are a number of packages for saving data as PDF in Python which you can google. I prefer pdfkit because it accepts a list of input files, so you can add all the responses to a list and use it to create a single PDF file, as sketched below.
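A hedged sketch of that single-file idea, assuming wkhtmltopdf is installed and the per-student HTML files from the loop above exist; pdfkit.from_file accepts a list of input files:
import pdfkit

html_files = ["result_" + str(i) + ".html" for i in range(12153600, 12153660 + 1)]
pdfkit.from_file(html_files, "all_results.pdf", options={"quiet": ""})  # combines all results into one PDF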
You should check out the Selenium python library.
You'll be able to achieve what you want with that library. Specifically, you would use Selenium's get function to load the website, its click function to click the first link, and so on.
A lot of researchers use it to simulate click events on websites such as Facebook and gather the resulting data.
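A minimal Selenium sketch of that idea, assuming Chrome is installed and that the link text matches what the question quotes:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://exam.cusat.ac.in/")

# Click the first result link by its (partial) link text, as given in the question
driver.find_element(By.PARTIAL_LINK_TEXT, "B.Tech V Semester November 2016").click()

# From here you could fill in the registration number field, submit,
# and hand driver.page_source to a parser.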
You should check out the Requests module for getting data from HTML.
Please find below the links for tutorial purposes:
http://docs.python-requests.org/en/master/user/quickstart/
https://media.readthedocs.org/pdf/requests/master/requests.pdf
You can use Python's requests library for sending the requests and BeautifulSoup to parse the html that you receive.
First, you need to inspect the page using your browser's dev tools. If you do that, you will find that each link row is a form element -
<form id="myForm0121x1" action="..." method="post">
<input name="month" value="..." type="hidden">
<input name="year" value="..." type="hidden">
<input name="sem" value="..." type="hidden">
<input name="reg_type" value="..." type="hidden">
<input name="dn" value="..." type="hidden">
<input name="status1" value="..." type="hidden">
</form>
Each link is a POST request to the URL in the action attribute, along with the input elements. To do this programmatically using requests -
r = requests.post('url',data={'month':'...','year':'...','sem':'...','reg_type':'...','dn':'...','status1':'...'})
If you then check r.content, you will have received the source of the second page. Repeat the above process for this page, this time changing the data parameter's keys/values accordingly (use the inspector) and adding an extra 'regno': 'xyz' (where xyz is a registration number), and you will receive the final HTML content for a student's result page. Parse this using BeautifulSoup and pick up whatever you need.
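A hedged sketch of that second request; the field values are left as "..." placeholders exactly as in the inspected form, and the result URL is whatever the second form's action attribute points to:
import requests
from bs4 import BeautifulSoup

result_url = "..."  # the action attribute of the second form (copy it from the inspector)
data = {
    "month": "...", "year": "...", "sem": "...",
    "reg_type": "...", "dn": "...", "status1": "...",
    "regno": "12153600",  # a registration number from the question's range
}
r = requests.post(result_url, data=data)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.get_text()[:500])  # pick out whatever you need from here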

Programmatic Form Submit

I want to scrape the contents of a webpage. The contents are produced after a form on that site has been filled in and submitted.
I've read about how to scrape the resulting content/webpage - but how do I programmatically submit the form?
I'm using python and have read that I might need to get the original webpage with the form, parse it, get the form parameters and then do X?
Can anyone point me in the right direction?
Using python, I think it takes the following steps:
Parse the web page that contains the form; find out the form's submit address and the submit method ("post" or "get").
This explains form elements in an HTML file.
Use urllib2 to submit the form. You may need functions like "urlencode" and "quote" from urllib to generate the URL and data for the POST method. Read the library docs for details.
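A minimal sketch of those steps; the form URL and field name below are placeholders (urllib2 in Python 2 became urllib.request/urllib.parse in Python 3, which is what this uses):
from urllib.parse import urlencode
from urllib.request import urlopen

form_action = "http://www.example.com/submit.php"       # taken from the form's action attribute
form_data = urlencode({"itemnumber": "5234"}).encode()  # POST bodies must be bytes

with urlopen(form_action, data=form_data) as response:  # passing data= makes it a POST
    html = response.read().decode()
print(html[:200])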
You'll need to generate an HTTP request containing the data for the form.
The form will look something like:
<form action="submit.php" method="POST"> ... </form>
This tells you the url to request is www.example.com/submit.php and your request should be a POST.
In the form will be several input items, eg:
<input type="text" name="itemnumber"> ... </input>
You need to create a string of all these input name=value pairs, encoded for a URL and appended to the end of your requested URL, which now becomes
www.example.com/submit.php?itemnumber=5234&otherinput=othervalue etc...
This will work fine for GET. POST is a little trickier.
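Building that query string by hand is easy to get wrong, so here's a small sketch using urllib's urlencode; the names and values just echo the example above:
from urllib.parse import urlencode

query = urlencode({"itemnumber": "5234", "otherinput": "othervalue"})
url = "http://www.example.com/submit.php?" + query
print(url)  # http://www.example.com/submit.php?itemnumber=5234&otherinput=othervalue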
Just follow S.Lott's links for some much easier to use library support :P
From a similar question - options-for-html-scraping - you can learn that with Python you can use Beautiful Soup.
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:
Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
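A tiny illustrative Beautiful Soup snippet in the spirit of the form-scraping discussed here (the markup is made up):
from bs4 import BeautifulSoup

html = "<form action='submit.php'><input name='itemnumber' value='5234'></form>"
soup = BeautifulSoup(html, "html.parser")
form = soup.find("form")
print(form["action"])                                                # submit.php
print({i["name"]: i.get("value") for i in form.find_all("input")})   # {'itemnumber': '5234'}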
The unusual name caught the attention of our host, November 12, 2008.
You can do it with javascript. If the form is something like:
<form name='myform' ...
Then you can do this in javascript:
<script language="JavaScript">
function submitform()
{
    document.myform.submit();
}
</script>
You can use the "onClick" attribute of links or buttons to invoke this code. To invoke it automatically when a page is loaded, use the "onLoad" attribute of the <body> element:
<body onLoad="submitform()" ...>
