How can I access any element in a web page with Python?

I would like to access any element in a web page. I know how to do that when I have a form (form = cgi.FieldStorage()), but not when I have, for example, a table.
How can I do that?
Thanks

If you are familiar with JavaScript, you should be familiar with the DOM. An HTML parser exposes the same kind of document structure, which should help you get the information you want; it's then up to you to extract the pieces you need.

Use HTML parsing, with either HTMLParser or Beautiful Soup, if you're trying to get data out of a web page. You can't really write data to an HTML table the way you can with CGI and forms, so I'm hoping reading is what you want.
I personally recommend Beautiful Soup if you want intelligent parsing behavior.
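If you'd rather stay in the standard library, here is a minimal Python 3 sketch using html.parser's HTMLParser to collect the cells of a table (the table markup is a made-up example):

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collects the text of every table cell, grouped by row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])        # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1].append(data.strip())

html = """<table>
<tr><th>Task</th><th>Status</th></tr>
<tr><td>build</td><td>done</td></tr>
</table>"""

parser = TableParser()
parser.feed(html)
print(parser.rows)  # [['Task', 'Status'], ['build', 'done']]
```

Beautiful Soup gives you the same result with less code (e.g. `soup.find_all("tr")`), which is why it's the usual recommendation.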

The way to access a table is to parse the HTML; unlike form data, it isn't handed to you dynamically. Since you mentioned CGI, I'm assuming you're working on the server side and can serve whatever content you want, so you could also work with the raw data behind the table before it's turned into HTML.

You can only access data posted by a form (or passed as GET parameters).
So you can extract the data you need using JavaScript and post it back through a form.

Related

How to find the correct URL when you made some choices on the web page?

I'm very new to web scraping. Using an XPath selector, I am trying to get data from this web page: https://seffaflik.epias.com.tr/transparency/uretim/planlama/kgup.xhtml
But the problem is, whenever you change the date or the power plant name, the URL does not change, so when you fetch the response you always get the same, wrong answer. Is there a way to find the correct URL, or anything else related to the HTML markup, etc.?
For a scraping operation like this, you'll need to do a bit more than just load the document and then grab the content. The document in question relies on JavaScript to load new information from some other resource after the user has defined a particular set of parameters and updated the form.
After loading the document, you'll need to define your search parameters. You can do this via JavaScript injection or via your browser's console. For example, if you were trying to define the value for the first date field, you could use
document.querySelectorAll('#j_idt199 input')[1].value = "Some/New/Date";
Repeat this process for the other fields you wish to define in your search, and then run the following code to programmatically execute your search:
document.querySelector('#j_idt199 button').click();
After that, you can either grab the information you want using plain JS query selectors, or you can implement a scraping library like artoo.js to help you interpret the data and export it.
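If you're driving the page from Python, the same snippets can be injected through Selenium's execute_script. A sketch, assuming the '#j_idt199' ids shown above are still what the live page uses (they are build-specific, so verify them in the DOM first):

```python
# JavaScript snippets from the answer above; arguments[0] is filled in
# by Selenium when execute_script is called.
SET_DATE_JS = 'document.querySelectorAll("#j_idt199 input")[1].value = arguments[0];'
SUBMIT_JS = 'document.querySelector("#j_idt199 button").click();'

def run_search(driver, date_value):
    """Fill the first date field, trigger the search, return the new HTML."""
    driver.get("https://seffaflik.epias.com.tr/transparency/uretim/planlama/kgup.xhtml")
    driver.execute_script(SET_DATE_JS, date_value)
    driver.execute_script(SUBMIT_JS)
    return driver.page_source

# Usage (needs a real browser/driver installed):
#   from selenium import webdriver
#   html = run_search(webdriver.Firefox(), "01.01.2020")
```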

Python multiple web pages scraping with same starting url string

I am trying to read review data from the alexaskillstore.com website using BeautifulSoup. For this, I am specifying the target URL as https://www.alexaskillstore.com/Business-Leadership-Series/B078LNGS5T, where the string after Business-Leadership-Series/ changes for each skill.
I want to know how I can use a regular expression or similar code for my input URL so that I am able to read every link that starts with https://www.alexaskillstore.com/Business-Leadership-Series/.
You can't. The web is client-server based, so unless the server is kind enough to map the content for you, you have no way to know which URLs will be responsive and which won't.
You may be able to scrape some index page(s) to find the keys (B078LNGS5T and the like) you need. Once you have them all, actually generating the URLs is a simple matter of string substitution.
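To illustrate the second part: once you've collected the keys, generating the URLs is plain string substitution. B078LNGS5T is from the question; the second key below is a made-up placeholder.

```python
BASE = "https://www.alexaskillstore.com/Business-Leadership-Series/{key}"

# Keys would normally come from scraping an index page; these are examples.
keys = ["B078LNGS5T", "B01FAKEKEY0"]
urls = [BASE.format(key=k) for k in keys]
print(urls[0])
# https://www.alexaskillstore.com/Business-Leadership-Series/B078LNGS5T
```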

Get dynamic html table using selenium & parse it using beautifulsoup

I'm trying to get the content of a HTML table generated dynamically by JavaScript in a webpage & parse it using BeautifulSoup to use certain values from the table.
Since the content is generated by JavaScript it's not available in source (driver.page_source).
Is there any other way to obtain the content and use it? It's a table containing a list of tasks; I need to parse it and check whether the specific task I'm searching for is present.
As mentioned by Julian, I'd rather check the "Net" tab in Firebug (or the similar tool in other browsers) and get the data that way. If the data is JSON, just use json.loads(); if it's HTML, you can parse it with BeautifulSoup or any other library, as you say. You might also like to try my dummy lib, which simplifies this and returns tables as tablib objects that you can export as CSV, Excel, JSON, etc.
You'd need to figure out what HTTP requests the JavaScript is making, and make the same ones in your Python code. You can do this with your favorite browser's development tools, or with Wireshark if forced.
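A sketch of that approach, assuming the endpoint you find in the Net tab returns JSON; the payload shape below is hypothetical, so adapt it to the real response:

```python
import json
from urllib.request import urlopen

def fetch_json(url):
    """Replicate the XHR found in the browser's Net tab and parse it."""
    with urlopen(url) as resp:  # live network call
        return json.loads(resp.read().decode("utf-8"))

# Hypothetical example of what such an endpoint might return:
sample = '{"tasks": [{"name": "backup", "done": true}]}'
data = json.loads(sample)
tasks = [t["name"] for t in data["tasks"]]
print(tasks)  # ['backup']
```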

Grabbing non-HTML data from a website using python

I'm trying to get the current contract prices on this page to a string: http://www.cmegroup.com/trading/equity-index/us-index/e-mini-sandp500.html
I would really like a Python 2.6 solution.
It was easy to get the page HTML using urllib, but this number seems to be live and isn't in the HTML. I inspected the element in Chrome and it's inside some td element with a class.
But I don't know how to get at this with Python. I tried Beautiful Soup (but after several attempts gave up getting a tar.gz to work on my Windows x64 system), and then ElementTree, but really my programming interest is data analysis. I'm not a website designer and don't really want to become one, so it's all kind of a foreign language. Is this live price XML?
Any assistance gratefully received. Ideally a simple to install module and some actual code, but all hints and tips very welcome.
It looks like the numbers in the table are filled in by JavaScript, so just fetching the HTML with urllib or another library won't be enough, since that doesn't run the JavaScript. You'll need a library like PyQt to simulate the browser rendering the page and executing the JS to fill in the numbers, then scrape the resulting HTML.
See this blog post on working with PyQt: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/
If you look at that website with something like Firebug, you can see the AJAX calls it's making. For instance, the initial values are being filled in with an AJAX call (at least for me) to:
http://www.cmegroup.com/CmeWS/md/MDServer/V1/Venue/G/Exchange/XCME/FOI/FUT/Product/ES?currentTime=1292780678142&contractCDs=,ESH1,ESM1,ESU1,ESZ1,ESH2,ESH1,ESM1,ESU1,ESZ1,ESH2
This returns a JSON response, which JavaScript then parses to fill in the table. It would be pretty simple to do that yourself with urllib and then use simplejson to parse the response.
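A sketch of that, rebuilding the AJAX URL from the answer above; the response shape here is a hypothetical stand-in, so check the real payload in Firebug first:

```python
import json
import time
from urllib.request import urlopen

# Rebuild the request the page makes; currentTime is a millisecond timestamp.
url = ("http://www.cmegroup.com/CmeWS/md/MDServer/V1/Venue/G/Exchange/XCME"
       "/FOI/FUT/Product/ES?currentTime=%d&contractCDs=,ESH1,ESM1"
       % int(time.time() * 1000))

# fetched = json.loads(urlopen(url).read().decode())  # live call; sketch only
sample = '{"quotes": [{"code": "ESH1", "last": "1242.50"}]}'  # hypothetical
prices = {q["code"]: q["last"] for q in json.loads(sample)["quotes"]}
print(prices["ESH1"])  # 1242.50
```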
Also, you should read this disclaimer very carefully. What you are trying to do is probably not cool with the owners of the web-site.
It's hard to know what to tell you without knowing where the number is coming from. It could also be produced by PHP or ASP, so you are going to have to figure out which server-side technology generates it.

Programmatic Form Submit

I want to scrape the contents of a webpage. The contents are produced after a form on that site has been filled in and submitted.
I've read about how to scrape the resulting content/webpage, but how do I programmatically submit the form?
I'm using Python and have read that I might need to get the original web page with the form, parse it, get the form parameters, and then do X.
Can anyone point me in the right direction?
Using python, I think it takes the following steps:
Parse the web page that contains the form; find out the form's submit address and the submit method ("post" or "get").
This explains form elements in an HTML file.
Use urllib2 to submit the form. You may need some functions like "urlencode", "quote" from urllib to generate the url and data for post method. Read the library doc for details.
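A minimal sketch of that step; the field names are hypothetical, and the example uses the Python 3 names (urllib2's Request and urlopen now live in urllib.request, urlencode in urllib.parse):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Build the POST body from the form's input names (hypothetical fields):
form_data = urlencode({"itemnumber": "5234", "otherinput": "othervalue"})
req = Request("http://www.example.com/submit.php",
              data=form_data.encode("ascii"))  # supplying data makes it a POST

print(req.get_method())  # POST
# urllib.request.urlopen(req) would actually submit the form; sketch only.
```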
You'll need to generate an HTTP request containing the data for the form.
The form will look something like:
<form action="submit.php" method="POST"> ... </form>
This tells you the url to request is www.example.com/submit.php and your request should be a POST.
In the form will be several input items, eg:
<input type="text" name="itemnumber"> ... </input>
You need to create a string of all these input name=value pairs, URL-encoded and appended to the end of your requested URL, which now becomes
www.example.com/submit.php?itemnumber=5234&otherinput=othervalue etc...
This will work fine for GET. POST is a little trickier.
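For the GET case, urllib's urlencode handles the encoding for you, using the example fields from above:

```python
from urllib.parse import urlencode

# Encode the input name=value pairs and append them to the action URL:
query = urlencode({"itemnumber": "5234", "otherinput": "othervalue"})
url = "http://www.example.com/submit.php?" + query
print(url)  # http://www.example.com/submit.php?itemnumber=5234&otherinput=othervalue
```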
Just follow S.Lott's links for some much easier to use library support :P
From a similar question - options-for-html-scraping - you can learn that with Python you can use Beautiful Soup.
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:
Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
You can do it with javascript. If the form is something like:
<form name='myform' ...
Then you can do this in javascript:
<script language="JavaScript">
function submitform()
{
document.myform.submit();
}
</script>
You can use the "onClick" attribute of links or buttons to invoke this code. To invoke it automatically when a page is loaded, use the "onLoad" attribute of the element:
<body onLoad="submitform()" ...>
