python - XPath syntax for second occurrence

<input name="utf8" type="hidden" value="✓" />
<input name="ohboy" type="hidden" value="I_WANT_THIS" />
<label for="user_email">Email</label>
<input class="form-control" id="user_email" name="user[email]" size="30" type="email" value="" />
I'm kind of stuck here. I was originally going to use find() instead of xpath() because the input tag appears in several places in the source, but I figured out that find() only returns the first occurrence in the source.

Use find(), passing an XPath expression that specifies an integer index for the element:
from lxml.html import fromstring
html_data = """<input name="utf8" type="hidden" value="✓" />
<input name="ohboy" type="hidden" value="I_WANT_THIS" />
<label for="user_email">Email</label>
<input class="form-control" id="user_email" name="user[email]" size="30" type="email" value="" />"""
tree = fromstring(html_data)
print tree.find('.//input[2]').attrib['value']
prints:
I_WANT_THIS
But even better (and cleaner) would be to find the input by its name attribute:
print tree.find('.//input[@name="ohboy"]').attrib['value']

Related

Is there a way to get a value using BeautifulSoup

I'm trying to add all the values from an HTML table into a list. I figured out how to do it here using soup.find_all('a'), which gives me CSCI 101:
<td><font size="-1" face="Verdana" color="#000080">CSCI 101</font></td>
Now I need to do the same thing here: get the number 22481, but I couldn't find a way to do so.
<input type="hidden" name="sel_term" value="202120">
<input type="hidden" name="del_crn" value="00000">
<input type="hidden" name="save_crn" value="">
<td><input type="submit" name="sel_crn" value="22481" style="background-color:transparent;cursor:hand;border:none;color:#8A2BE2"></td>
Any ideas?
soup.find('input', {'type': 'submit'})['value']
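Since the goal is to add all the values from the table into a list, a minimal sketch using find_all instead of find (assuming every row's CRN sits in a submit input like the one above):
from bs4 import BeautifulSoup

html = '<td><input type="submit" name="sel_crn" value="22481"></td>'
soup = BeautifulSoup(html, 'html.parser')

# Collect the value of every submit input on the page into a list.
crns = [tag['value'] for tag in soup.find_all('input', {'type': 'submit'})]
print(crns)  # ['22481']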

How to parse an HTML string using Python Scrapy

I have a list of HTML input elements as below.
lists=[<input type="hidden" name="csrf_token" value="jZdkrMumEBeXQlUTbOWfInDwNhtVHGSxKyPvaipoAFsYqCgRLJzc">,
<input type="text" class="form-control" id="username" name="username">,
<input type="password" class="form-control" id="password" name="password">,
<input type="submit" value="Login" class="btn btn-primary">]
From these I need to extract the values of the name, type, and value attributes.
For example:
Consider the input <input type="hidden" name="csrf_token" value="jZdkrMumEBeXQlUTbOWfInDwNhtVHGSxKyPvaipoAFsYqCgRLJzc">
then I need the output in the following dictionary format:
{'csrf_token':('hidden',"jZdkrMumEBeXQlUTbOWfInDwNhtVHGSxKyPvaipoAFsYqCgRLJzc")}
Could anyone please provide some guidance on how to solve this?
I recommend using the Beautiful Soup Python library (https://pypi.org/project/beautifulsoup4/) to parse the HTML content and read the values of the elements. It already has functions for exactly that purpose.
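For example, a minimal sketch with Beautiful Soup, assuming the inputs are available as an HTML string rather than a list of already-extracted tags:
from bs4 import BeautifulSoup

html = '''<input type="hidden" name="csrf_token" value="jZdkrMumEBeXQlUTbOWfInDwNhtVHGSxKyPvaipoAFsYqCgRLJzc">
<input type="text" class="form-control" id="username" name="username">
<input type="password" class="form-control" id="password" name="password">
<input type="submit" value="Login" class="btn btn-primary">'''

soup = BeautifulSoup(html, 'html.parser')

# Map each input's name to a (type, value) tuple; missing attributes fall back to ''.
result = {tag.get('name', ''): (tag.get('type', ''), tag.get('value', ''))
          for tag in soup.find_all('input')}
print(result['csrf_token'])  # ('hidden', 'jZdkrMumEBeXQlUTbOWfInDwNhtVHGSxKyPvaipoAFsYqCgRLJzc')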

Python BeautifulSoup returning wrong list of inputs from find_all()

I have Python 2.7.3 and bs4 version 4.4.1.
For some reason this code
from bs4 import BeautifulSoup # parsing
html = """
<html>
<head id="Head1"><title>Title</title></head>
<body>
<form id="form" action="login.php" method="post">
<input type="text" name="fname">
<input type="text" name="email" >
<input type="button" name="Submit" value="submit">
</form>
</body>
</html>
"""
html_proc = BeautifulSoup(html, 'html.parser')
for form in html_proc.find_all('form'):
    for input in form.find_all('input'):
        print "input:" + str(input)
returns a wrong list of inputs:
input:<input name="fname" type="text">
<input name="email" type="text">
<input name="Submit" type="button" value="submit">
</input></input></input>
input:<input name="email" type="text">
<input name="Submit" type="button" value="submit">
</input></input>
input:<input name="Submit" type="button" value="submit">
</input>
It's supposed to return
input: <input name="fname" type="text">
input: <input type="text" name="email">
input: <input type="button" name="Submit" value="submit">
What happened?
To me, this looks like an artifact of the html parser. Using 'lxml' for the parser instead of 'html.parser' seems to make it work. The downside of this is that you (or your users) then need to install lxml -- The upside is that lxml is a better/faster parser ;-).
As for why 'html.parser' doesn't seem to work correctly in this case, I think it has something to do with the fact that input tags are self-closing. If you explicitly close your inputs, it works:
<input type="text" name="fname" ></input>
<input type="text" name="email" ></input>
<input type="button" name="Submit" value="submit" ></input>
I would be curious to see if we could modify the source code to handle this case ... Doing a little experiment to monkey-patch bs4 indicates that we can do this:
from bs4 import BeautifulSoup
from bs4.builder import _htmlparser
# Monkey-patch the Beautiful soup HTML parser to close input tags automatically.
BeautifulSoupHTMLParser = _htmlparser.BeautifulSoupHTMLParser
class FixedParser(BeautifulSoupHTMLParser):
    def handle_starttag(self, name, attrs):
        # Old-style class... No super :-(
        result = BeautifulSoupHTMLParser.handle_starttag(self, name, attrs)
        if name.lower() == 'input':
            self.handle_endtag(name)
        return result
_htmlparser.BeautifulSoupHTMLParser = FixedParser
html = """
<html>
<head id="Head1"><title>Title</title></head>
<body>
<form id="form" action="login.php" method="post">
<input type="text" name="fname" >
<input type="text" name="email" >
<input type="button" name="Submit" value="submit" >
</form>
</body>
</html>
"""
html_proc = BeautifulSoup(html, 'html.parser')
for form in html_proc.find_all('form'):
    for input in form.find_all('input'):
        print "input:" + str(input)
Obviously, this isn't a true fix (I wouldn't submit this as a patch to the BS4 folks), but it does demonstrate the problem. Since there is no end-tag, the handle_endtag method is never getting called. If we call it ourselves, things tend to work out (as long as the html doesn't also have a closing input tag ...).
I'm not really sure whose responsibility this bug should be, but I suppose that you could start by submitting it to bs4 -- They might then forward you on to report a bug on the python tracker, I'm not sure...
Don't use a nested loop for this, and use lxml. Change your code to this:
inp = []
html_proc = BeautifulSoup(html, 'lxml')
for form in html_proc.find_all('form'):
    inp.extend(form.find_all('input'))
for item in inp:
    print "input:" + str(item)

extracting values in HTML data

I have data in this HTML format in Python:
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" >
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="ky6272M5yMyLqwLSiOD7282n7W/4c5S+PsBnbknDUX8d4iGsUDPboCpQG3F86cgBN3u3/nrEYLDN43eRdevxKrBv6MBnwC8l0l3WLxFOKGpqGUl5KzodoLbQB44LtcSYLudbO+lczSjwyEzsHOrw3IW4VT1HAT/OjPJI36AIf/BAXY/UoKT38X1yrDNE0sf0jk5WOPq+v+wh+Dsw9F6dojZXucY5dmGdNWaigKKn6VSG6tkzqsCFVjYEkzTjj1ItCdstnDZv2LVHRJpQ654Zvcf2IkQOR7p+V+TLRYdR9yOngXh2p/qt6UXYrR4DVUPkgxiCuIjFpSpYvGmHuw3+ocadeLklAtAQZbQF63c+xyogyV4Dm2fW2BT1+fhW+lqoo5aTFcWM+2v2SwfSsRKOMUH9MudewVDP0ro/3w9+OPq1q8hHGDzzbwDJh7nOvyW67DYY1AEp2NV1lCbDwazCX0DHpW/prlmuFMj1zt+mamjoGERWNujqr6FQNgSG1n62VrJMdBhEwYdHNYuWEQorD/EA3ze/5Pmxv7j6PngmoNv9uVtOwq4M3RhtgjS4OY5RsBO8l+Ij74Mqihh5xa0T3D2p5VIBZJW5M3nb6c1yuNqgcNgstqNU2BDwE/T1h+sF8wK7BG0YKQd6BrilABj1+AZZElrS9SdDtjuyKFGWEx2qLHUpWrkys4yy3Icq7xSsf/eDsg==" />
I would like a way to extract the contents of the value attribute using regular expressions in Python.
The HTML can be much more complicated.
from bs4 import BeautifulSoup
html = '<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" >'
soup = BeautifulSoup(html, 'lxml')
input_tag = soup.find('input')
input_tag['value']
With BeautifulSoup, you can use the find method of the BeautifulSoup class and extract the value attribute like so:
from bs4 import BeautifulSoup
x = """<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" >"""
soup = BeautifulSoup(x, 'html.parser')
print soup.find('input')['value']
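Following the same BeautifulSoup approach (rather than regular expressions), a minimal sketch that collects every hidden input's value keyed by its name attribute, assuming the markup is available as a string:
from bs4 import BeautifulSoup

# Example markup; the long __VIEWSTATE value is shortened here.
html = '''<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="">
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="">
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="ky6272...">'''

soup = BeautifulSoup(html, 'html.parser')

# Build a {name: value} dict for every hidden input.
values = {tag.get('name'): tag.get('value', '')
          for tag in soup.find_all('input', {'type': 'hidden'})}
print(values['__VIEWSTATE'])  # 'ky6272...'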

POST.getlist() processing

I'm having problems with processing custom form data...
<input type="text" name="client[]" value="client1" />
<input type="text" name="address[]" value="address1" />
<input type="text" name="post[]" value="post1" />
...
<input type="text" name="client[]" value="clientn" />
<input type="text" name="address[]" value="addressn" />
<input type="text" name="post[]" value="postn" />
... (this repeats a couple of times...)
If I do
request.POST.getlist('client[]')
request.POST.getlist('address[]')
request.POST.getlist('post[]')
I get
{u'client': [client1, client2, clientn, ...]}
{u'address': [address1, address2, addressn, ...]}
{u'post': [post1, post2, postn, ...]}
But I need something like this
{
{0:{client1,address1,post1}}
{1:{client2,address2,post2}}
{2:{client3,address3,post3}}
...
}
So that I can save this data to the model. This is probably pretty basic but I'm having problems with it.
Thank you!
Firstly, please drop the [] in the field names. That's a PHP-ism that has no place in Django.
Secondly, if you want related items grouped together, you'll need to change your form. You need to give each field a separate name:
<input type="text" name="client_1" value="client1" />
<input type="text" name="address_1" value="address1" />
<input type="text" name="post_1" value="post1" />
...
<input type="text" name="client_n" value="clientn" />
<input type="text" name="address_n" value="addressn" />
<input type="text" name="post_n" value="postn" />
Now request.POST will contain a separate entry for each field, and you can iterate through:
for i in range(1, n+1):
    client = request.POST['client_%s' % i]
    address = request.POST['address_%s' % i]
    post = request.POST['post_%s' % i]
    ... do something with these values ...
Now at this point, you probably want to look at model formsets, which can generate exactly this set of forms and create the relevant objects from the POST.
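A rough sketch of that direction, assuming a hypothetical Order model with client, address and post fields (the model and view names are illustrative, not from the question):
from django.forms import modelformset_factory
from django.shortcuts import render
from myapp.models import Order  # hypothetical model with client, address, post fields

# One form per (client, address, post) group; Django names the fields form-0-client, form-1-client, ...
OrderFormSet = modelformset_factory(Order, fields=('client', 'address', 'post'), extra=3)

def orders_view(request):
    if request.method == 'POST':
        formset = OrderFormSet(request.POST)
        if formset.is_valid():
            formset.save()  # creates one Order per submitted form
    else:
        formset = OrderFormSet(queryset=Order.objects.none())
    return render(request, 'orders.html', {'formset': formset})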
