I want to scrape the contents of a webpage. The contents are produced after a form on that site has been filled in and submitted.
I've read up on how to scrape the end-result content/webpage, but how do I programmatically submit the form?
I'm using Python and have read that I might need to get the original webpage with the form, parse it, get the form parameters and then do X?
Can anyone point me in the right direction?
Using Python, I think it takes the following steps:
Parse the web page that contains the form; find out the form's submit address and the submit method ("post" or "get").
This explains form elements in an HTML file.
Use urllib2 to submit the form. You may need some functions like "urlencode" and "quote" from urllib to generate the URL and the data for the POST method. Read the library doc for details; a minimal sketch follows.
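A minimal sketch, assuming a hypothetical form at www.example.com/submit.php with an itemnumber field (Python 2, since the answer names urllib2; in Python 3 the equivalents live in urllib.request and urllib.parse):

import urllib
import urllib2

# Hypothetical form target and field name
url = 'http://www.example.com/submit.php'
data = urllib.urlencode({'itemnumber': '5234'})

# For method="post", pass the encoded data as the request body
response = urllib2.urlopen(url, data)

# For method="get", append the encoded data to the URL instead:
# response = urllib2.urlopen(url + '?' + data)

print(response.read())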
You'll need to generate an HTTP request containing the data for the form.
The form will look something like:
<form action="submit.php" method="POST"> ... </form>
This tells you that the URL to request is www.example.com/submit.php and that your request should be a POST.
In the form will be several input items, e.g.:
<input type="text" name="itemnumber">
You need to create a string of all these input name=value pairs, URL-encoded, and append it to the end of your requested URL, which now becomes:
www.example.com/submit.php?itemnumber=5234&otherinput=othervalue etc...
This will work fine for GET. POST is a little trickier: the encoded pairs go in the request body instead of the URL. A sketch of both follows.
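A minimal sketch of the difference, assuming the same hypothetical submit.php form (Python 2's httplib; the host, path, and field names are illustrative):

import urllib
import httplib

# The same name=value pairs, URL-encoded
params = urllib.urlencode({'itemnumber': '5234', 'otherinput': 'othervalue'})

# GET: the pairs travel in the URL itself
get_url = 'http://www.example.com/submit.php?' + params

# POST: the pairs travel in the request body instead
conn = httplib.HTTPConnection('www.example.com')
conn.request('POST', '/submit.php', params,
             {'Content-Type': 'application/x-www-form-urlencoded'})
print(conn.getresponse().read())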
Just follow S.Lott's links for some much easier-to-use library support :P
From a similar question - options-for-html-scraping - you can learn that with Python you can use Beautiful Soup.
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:
Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
The unusual name caught the attention of our host, November 12, 2008.
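As a quick illustration, a minimal Beautiful Soup sketch (using the bs4 package; the markup here is made up):

from bs4 import BeautifulSoup

# Made-up markup standing in for a scraped page
html = "<table><tr><td>itemnumber</td><td>5234</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Walk the parse tree and pull out the cell text
for row in soup.find_all("tr"):
    print([td.get_text() for td in row.find_all("td")])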
You can do it with JavaScript. If the form is something like:
<form name='myform' ...
Then you can do this in javascript:
<script language="JavaScript">
function submitform()
{
    document.myform.submit();
}
</script>
You can use the "onClick" attribute of links or buttons to invoke this code. To invoke it automatically when a page is loaded, use the "onLoad" attribute of the body element:
<body onLoad="submitform()" ...>
I'm trying to read the price of a fund which is not available through an API. The fund is listed here https://bors.e24.no/#!/instrument/KL-AFMI2.OSE
At first I thought this would be a simple task, so I looked at BeautifulSoup, but realized that what I wanted was not returned. As far as I can tell this is due to the:
<!-- ngIf: $root.allowStreamingToggle -->
I'm a beginner so hoping someone can help me with an easy way to get this value.
I see JSON being returned from the following endpoint in the network tab:
import requests

headers = {'user-agent': 'Mozilla/5.0'}
url = ('https://bors.e24.no/server/components/graphdata/(PRICE)/DAY/'
       'KL-AFMI2.OSE?points=500&stop=2019-07-30&period=1weeks')
r = requests.get(url, headers=headers).json()
The price is then:
r['rows'][0]['values']['series']['c1']['data'][3][1]
The tag "ngIf" almost certainly means that the website you are attempting to scrape is an AngularJS app... in which case, the data is almost certainly NOT in the HTML page you are pulling and attempting to parse with BeautifulSoup.
Rather, the page is probably pulling the data later -- say, via AJAX -- and then rendering it INTO the page via Angular's client-side code.
If all that is right... then BeautifulSoup is not the right tool.
You might have some hope if you can identify the AJAX call that the page is calling, then call THAT directly. Inspect it to see the data structure; if you are lucky maybe it is JSON and then super easy to parse. If that looks promising, then you can probably simply use the requests library, and skip BeautifulSoup. But you have to do the reverse engineering to figure out WHAT you should be calling.
Here, try this: I did a little snooping with the browser console. Is this the data you are looking for? get info for KL-AFMI2.OSE
If so.. then just use that URL directly in requests.
I'm fairly new to Python, and this is my first post to Stack Overflow. As a starting project, I'm trying to write a program that will gather the prices of board games from different websites that sell them. As part of this I'm trying to write a function that will use a website's built-in search function to find the webpage I want for a game that I input.
The code I'm using so far is:
import requests
body = {'keywords':'galaxy trucker'}
con = requests.post('http://www.thirstymeeples.co.uk/', data=body)
print(con.content)
My problem is that the webpage it returns is not the webpage I get when I manually input and search for 'galaxy trucker' on the website itself.
The html for the search form in question is
<form method="post" action="http://www.thirstymeeples.co.uk/">
<input type="search" name="keywords" id="keywords" class="searchinput" value>
</form>
I have read this, but there the difference seems to be that the search form actually appears on the webpage, whereas with mine, the web address given in the action attribute does not itself display a search bar. In that example there is also no id attribute in the HTML, whereas in mine there is; does this make a difference?
There is no search form on the index page, but if you do a "manual" search from the "games" page (which does have a form), you end up on a page with this url:
http://www.thirstymeeples.co.uk/games/search/q?collection=search_games|search_designers|search_publishers&loose_ends=right&search_mode=all&keywords=galaxy+trucker
Notice that this page does take GET params, and that if you change the keywords from "galaxy+trucker" to anything else you get an updated result page. IOW, you want to do a GET request at http://www.thirstymeeples.co.uk/games/search/q:
r = requests.get("http://www.thirstymeeples.co.uk/games/search/q", params={"keywords": "galaxy trucker"})
print(r.content)
I am having some unknown trouble when using XPath to retrieve text from an HTML page with the lxml library.
The page url is www.mangapanda.com/one-piece/1/1
I want to extract the selected chapter name text from the drop-down select tag. For now I just want the first option, so the XPath to find that is pretty easy. That is:
.//*[@id='chapterMenu']/option[1]/text()
I verified the above using FirePath and it gives the correct data, but when I try to use lxml for the purpose I get no data at all.
from lxml import html
import requests
r = requests.get("http://www.mangapanda.com/one-piece/1/1")
page = html.fromstring(r.text)
name = page.xpath(".//*[@id='chapterMenu']/option[1]/text()")
But nothing is stored in name. I even tried other XPaths, like:
//div/select[@id='chapterMenu']/option[1]/text()
//select[@id='chapterMenu']/option[1]/text()
The above were also verified using FirePath. I am unable to figure out what the problem could be and would appreciate some assistance.
But it is not that nothing works. An XPath that does work with lxml here is:
.//img[@id='img']/@src
Thank you.
I've had a look at the html source of that page and the content of the element with the id chapterMenu is empty.
I think your problem is that it is filled in using JavaScript, and JavaScript will not be automatically evaluated just by reading the HTML with lxml.html.
You might want to have a look at this:
Evaluate javascript on a local html file (without browser)
Maybe you're able to trick it, though... In the end, the JavaScript also needs to fetch the information using a GET request. In this case it requests: http://www.mangapanda.com/actions/selector/?id=103&which=191919
That is JSON and can easily be turned into a Python dict/list using the json library.
But you have to find out how to get the id and the which parameter if you want to automate this.
The id is part of the HTML: look for document['mangaid'] within one of the script tags. The which parameter has to be 0; when it is 0 you will be redirected to the proper URL.
So there you go ;)
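A minimal sketch of that approach (the mangaid value 103 is the one from the URL above; the shape of the returned JSON is whatever the endpoint serves):

import requests

# id comes from document['mangaid'] in the page's script tags;
# which=0 redirects to the proper URL (see above)
r = requests.get('http://www.mangapanda.com/actions/selector/',
                 params={'id': 103, 'which': 0})

chapters = r.json()  # JSON parses straight into Python lists/dicts
print(chapters)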
The source document of the page you are requesting is in a default namespace:
<html xmlns="http://www.w3.org/1999/xhtml">
even if Firepath does not tell you about this. The proper way to deal with namespaces is to redeclare them in your code, which means associating them with a prefix and then prefixing element names in XPath expressions.
name = page.xpath('//*[@id="chapterMenu"]/xhtml:option[1]/text()',
                  namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'})
Then, the piece of the document the path expression above is concerned with is:
<select id="chapterMenu" name="chapterMenu"></select>
As you can see, there is no option element inside it. Please tell us what exactly you'd like to find.
I'm trying to parse some web pages for future use. For parsing webpages I've used different modules, like urllib, lxml, BeautifulSoup, and HTMLParser, to reach my goal.
I didn't run into any problems while parsing web pages until I faced hidden tags.
When I opened the page with a chrome browser and used the developer tools to see elements of page, I was able to see the <embed> part of the code:
<embed type="..." src="..." ID="..." >
and can simply copy/paste it manually.
I need to parse the ID from this hidden tag. Why can't I parse this part of the site using Python? Is there any way to parse these hidden parts?
I know it's not possible to see some code parts like PHP and ASP in the HTML source, but I suppose that's not the case here.
This "hidden" code is probably generated by JavaScript at runtime.
You might have better luck finding out how the JavaScript works and where it gets its data (the URLs) than attempting to have something run the script and then parse the resulting DOM tree...
I would like to access any element in a web page. I know how to do that when I have a form (form = cgi.FieldStorage()), but not when I have, for example, a table.
How can I do that?
Thanks
If you are familiar with JavaScript, you should be familiar with the DOM. This should help you to get the information you want, seeing how it parses HTML, among other things. Then it's up to you to extract the information you need.
Use HTML parsing with either HTMLParser or Beautiful Soup if you're trying to get data from a web page. You can't really write data to an HTML table like you could do with CGI and forms, so I'm hoping this is what you want.
I personally recommend Beautiful Soup if you want intelligent parsing behavior.
The way to access a table is to parse the HTML. This is different from accessing form data, in that it's not dynamic. Since you mentioned CGI, I'm assuming you're working on the server side of things and that you have the ability to serve whatever content you want. So you could use whatever data you're representing in the table in its raw form before turning it into HTML too.
You can only access data posted by a form (or sent as GET parameters).
So you can extract the data you need using JavaScript and post it through a form.