extracting values in HTML data

I have data in the following HTML format in Python:
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" >
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="ky6272M5yMyLqwLSiOD7282n7W/4c5S+PsBnbknDUX8d4iGsUDPboCpQG3F86cgBN3u3/nrEYLDN43eRdevxKrBv6MBnwC8l0l3WLxFOKGpqGUl5KzodoLbQB44LtcSYLudbO+lczSjwyEzsHOrw3IW4VT1HAT/OjPJI36AIf/BAXY/UoKT38X1yrDNE0sf0jk5WOPq+v+wh+Dsw9F6dojZXucY5dmGdNWaigKKn6VSG6tkzqsCFVjYEkzTjj1ItCdstnDZv2LVHRJpQ654Zvcf2IkQOR7p+V+TLRYdR9yOngXh2p/qt6UXYrR4DVUPkgxiCuIjFpSpYvGmHuw3+ocadeLklAtAQZbQF63c+xyogyV4Dm2fW2BT1+fhW+lqoo5aTFcWM+2v2SwfSsRKOMUH9MudewVDP0ro/3w9+OPq1q8hHGDzzbwDJh7nOvyW67DYY1AEp2NV1lCbDwazCX0DHpW/prlmuFMj1zt+mamjoGERWNujqr6FQNgSG1n62VrJMdBhEwYdHNYuWEQorD/EA3ze/5Pmxv7j6PngmoNv9uVtOwq4M3RhtgjS4OY5RsBO8l+Ij74Mqihh5xa0T3D2p5VIBZJW5M3nb6c1yuNqgcNgstqNU2BDwE/T1h+sF8wK7BG0YKQd6BrilABj1+AZZElrS9SdDtjuyKFGWEx2qLHUpWrkys4yy3Icq7xSsf/eDsg==" />
I would like a way to extract the contents of the value attribute using regular expressions in Python.

HTML can be much more complicated than a regular expression can reliably handle, so it is safer to use a parser:
from bs4 import BeautifulSoup
html = '<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" >'
soup = BeautifulSoup(html, 'lxml')
input_tag = soup.find('input')
input_tag['value']

With BeautifulSoup, you can use the find method of the BeautifulSoup class and extract the value attribute like so:
from bs4 import BeautifulSoup
x = """<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" >"""
soup = BeautifulSoup(x, 'html.parser')
print(soup.find('input')['value'])
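
The same idea extends to every hidden field in the question at once. The following is only a minimal sketch, assuming the form markup is available as a string; the helper name hidden_values is hypothetical and the __VIEWSTATE value is shortened for readability.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the markup in the question
# (the real __VIEWSTATE value is much longer).
html = """
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" >
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="ky6272M5yMyLqwLS+PsBnbknDUX8d4iGs==" />
"""


def hidden_values(markup):
    """Map each hidden input's name to its value attribute (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    return {
        tag["name"]: tag.get("value", "")
        for tag in soup.find_all("input", attrs={"type": "hidden"})
        if tag.has_attr("name")  # skip inputs that have no name
    }


values = hidden_values(html)
print(values["__VIEWSTATE"])
```

Looking fields up by name avoids depending on the order of the tags in the page.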

Related

In Python, how do I get the value from the fifth line instead of the second line?

A website has this source code:
<div class="estimated-delivery">
<input id="currentCity" type="hidden" value="City">
<input id="currentCityISO" type="hidden" value="CY">
<input id="idUnit" type="hidden" value="3910200523">
<input id="availableOffers" type="hidden" value="4">
and I wrote this in Python:
lastvalue = page_soup.find("div", {"class": "estimated-delivery"})
lastvalue.input["value"]
The result that came back was:
'City'
I do not need the first value, which comes from the second line of the source code.
I need the value from the fifth line of the source code, which is
'4'
Thanks

I would find this option much easier:
last_value = soup.find("input", {"id": "availableOffers"})["value"]
Full example:
from bs4 import BeautifulSoup
doc = r"""<div class="estimated-delivery">
<input id="currentCity" type="hidden" value="City"></input>
<input id="availableOffers" type="hidden" value="4"></input>
</div>"""
soup = BeautifulSoup(doc, 'html.parser')
print(soup.find("input", {"id": "availableOffers"})["value"])  # prints 4

The point I was trying to make earlier, in a comment on zerecees's answer, was that given this HTML:
<div class="estimated-delivery">
<input id="currentCity" type="hidden" value="City">
<input id="currentCityISO" type="hidden" value="CY">
<input id="idUnit" type="hidden" value="3910200523">
<input id="availableOffers" type="hidden" value="4">
</div>
You can first locate the div container, and then search the div for input elements with a specific id.
from bs4 import BeautifulSoup
page_soup = BeautifulSoup(doc, 'html.parser')
div = page_soup.find("div", {"class": "estimated-delivery"})
inp_val = div.find("input", {"id": "availableOffers"})["value"]
Or, if you are not sure about the ids of the input elements and you want the value of the 4th input element (index 3), you can use find_all and do something like this:
inp_val = div.find_all("input")[3]["value"]
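
One thing to watch with both approaches: find returns None when nothing matches, so indexing the result directly raises a TypeError on pages where the container or the input is missing. The sketch below shows one way to guard against that. The helper name input_value and its default parameter are hypothetical, and the closing </div> is assumed.

```python
from bs4 import BeautifulSoup

# Markup from the question, with the closing </div> assumed.
doc = """<div class="estimated-delivery">
<input id="currentCity" type="hidden" value="City">
<input id="currentCityISO" type="hidden" value="CY">
<input id="idUnit" type="hidden" value="3910200523">
<input id="availableOffers" type="hidden" value="4">
</div>"""


def input_value(soup, container_class, input_id, default=None):
    """Return the value of an input inside a div, or default if absent."""
    container = soup.find("div", {"class": container_class})
    if container is None:
        return default
    tag = container.find("input", {"id": input_id})
    if tag is None:
        return default
    return tag.get("value", default)


page_soup = BeautifulSoup(doc, "html.parser")
offers = input_value(page_soup, "estimated-delivery", "availableOffers")
missing = input_value(page_soup, "estimated-delivery", "noSuchId", default="n/a")
print(offers, missing)
```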

How to parse a html string using python scrapy

I have a list of html input elements as below.
lists=[<input type="hidden" name="csrf_token" value="jZdkrMumEBeXQlUTbOWfInDwNhtVHGSxKyPvaipoAFsYqCgRLJzc">,
<input type="text" class="form-control" id="username" name="username">,
<input type="password" class="form-control" id="password" name="password">,
<input type="submit" value="Login" class="btn btn-primary">]
From these I need to extract the values of the name, type, and value attributes.
For example:
Consider the input <input type="hidden" name="csrf_token" value="jZdkrMumEBeXQlUTbOWfInDwNhtVHGSxKyPvaipoAFsYqCgRLJzc">
For it, I need the output in the following dictionary format:
{'csrf_token':('hidden',"jZdkrMumEBeXQlUTbOWfInDwNhtVHGSxKyPvaipoAFsYqCgRLJzc")}
Could anyone please give me some guidance on how to solve this?

I recommend using the Beautiful Soup Python library (https://pypi.org/project/beautifulsoup4/) to parse the HTML content and read the attribute values of the elements. It already provides functions for that purpose.
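
As a rough illustration of that suggestion, the sketch below builds the requested dictionary with Beautiful Soup. It assumes the inputs are available as one HTML string rather than as Scrapy selector objects, and it skips inputs that have no name attribute (such as the submit button), since they have no key to store them under. The function name inputs_to_dict is hypothetical.

```python
from bs4 import BeautifulSoup

# The inputs from the question, joined into a single HTML string (an assumption).
html = """
<input type="hidden" name="csrf_token" value="jZdkrMumEBeXQlUTbOWfInDwNhtVHGSxKyPvaipoAFsYqCgRLJzc">
<input type="text" class="form-control" id="username" name="username">
<input type="password" class="form-control" id="password" name="password">
<input type="submit" value="Login" class="btn btn-primary">
"""


def inputs_to_dict(markup):
    """Return {name: (type, value)} for every named input (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    result = {}
    for tag in soup.find_all("input"):
        name = tag.get("name")
        if name is None:
            continue  # no name attribute, so there is no dictionary key
        # .get returns None when the attribute is missing
        result[name] = (tag.get("type"), tag.get("value"))
    return result


fields = inputs_to_dict(html)
print(fields["csrf_token"])
```

Inputs without a value attribute, such as username and password here, end up with None in the second position of the tuple.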

Python BeautifulSoup returning wrong list of inputs from find_all()

I have Python 2.7.3 and bs.version is 4.4.1
For some reason this code
from bs4 import BeautifulSoup # parsing
html = """
<html>
<head id="Head1"><title>Title</title></head>
<body>
<form id="form" action="login.php" method="post">
<input type="text" name="fname">
<input type="text" name="email" >
<input type="button" name="Submit" value="submit">
</form>
</body>
</html>
"""
html_proc = BeautifulSoup(html, 'html.parser')
for form in html_proc.find_all('form'):
    for input in form.find_all('input'):
        print("input:" + str(input))
returns a wrong list of inputs:
input:<input name="fname" type="text">
<input name="email" type="text">
<input name="Submit" type="button" value="submit">
</input></input></input>
input:<input name="email" type="text">
<input name="Submit" type="button" value="submit">
</input></input>
input:<input name="Submit" type="button" value="submit">
</input>
It's supposed to return
input: <input name="fname" type="text">
input: <input type="text" name="email">
input: <input type="button" name="Submit" value="submit">
What happened?

To me, this looks like an artifact of the html parser. Using 'lxml' for the parser instead of 'html.parser' seems to make it work. The downside of this is that you (or your users) then need to install lxml -- The upside is that lxml is a better/faster parser ;-).
As for why 'html.parser' doesn't seem to work correctly in this case, I think it has something to do with the fact that input tags are self-closing. If you explicitly close your inputs, it works:
<input type="text" name="fname" ></input>
<input type="text" name="email" ></input>
<input type="button" name="Submit" value="submit" ></input>
I would be curious to see if we could modify the source code to handle this case ... Doing a little experiment to monkey-patch bs4 indicates that we can do this:
from bs4 import BeautifulSoup
from bs4.builder import _htmlparser
# Monkey-patch the Beautiful soup HTML parser to close input tags automatically.
BeautifulSoupHTMLParser = _htmlparser.BeautifulSoupHTMLParser
class FixedParser(BeautifulSoupHTMLParser):
    def handle_starttag(self, name, attrs):
        # Old-style class... No super :-(
        result = BeautifulSoupHTMLParser.handle_starttag(self, name, attrs)
        if name.lower() == 'input':
            self.handle_endtag(name)
        return result
_htmlparser.BeautifulSoupHTMLParser = FixedParser
html = """
<html>
<head id="Head1"><title>Title</title></head>
<body>
<form id="form" action="login.php" method="post">
<input type="text" name="fname" >
<input type="text" name="email" >
<input type="button" name="Submit" value="submit" >
</form>
</body>
</html>
"""
html_proc = BeautifulSoup(html, 'html.parser')
for form in html_proc.find_all('form'):
    for input in form.find_all('input'):
        print("input:" + str(input))
Obviously, this isn't a true fix (I wouldn't submit this as a patch to the BS4 folks), but it does demonstrate the problem. Since there is no end-tag, the handle_endtag method is never getting called. If we call it ourselves, things tend to work out (as long as the html doesn't also have a closing input tag ...).
I'm not really sure whose responsibility this bug should be, but I suppose that you could start by submitting it to bs4 -- They might then forward you on to report a bug on the python tracker, I'm not sure...

Don't use a nested loop for this, and use lxml. Change your code to this:
inp = []
html_proc = BeautifulSoup(html, 'lxml')
for form in html_proc.find_all('form'):
    inp.extend(form.find_all('input'))
for item in inp:
    print("input:" + str(item))

Scraping 'N' pages with Beautifulsoup and Requests (How to obtain the true page number)

I want to get all the titles on this website:
http://www.shyan.gov.cn/zwhd/web/webindex.action
Right now, my code successfully scrapes only one page. However, the site above has multiple pages that I would like to scrape.
For example, with the url above, when I click the link to "page 2", the overall url does NOT change. I looked at the page source and saw javascript code to advance to the next page like this: javascript:gotopage(2) or javascript:void(0).
My code is here (get page 1)
from bs4 import BeautifulSoup
import requests
url = 'http://www.shyan.gov.cn/zwhd/web/webindex.action'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
titles = soup.select('td.tit3 > a')
for title in titles:
    print(title.get_text())
How can my code be changed to scrape titles from all the available listed pages?
Thank you very much!

Try to use the following URL format:
http://www.shiyan.gov.cn/zwhd/web/webindex.action?keyWord=&searchType=3&page.currentpage=2&page.pagesize=15&page.pagecount=2357&docStatus=&sendOrg=
The site is using javascript to pass hidden page information to the server to request the next page. When you view the source you will find:
<form action="/zwhd/web/webindex.action" id="searchForm" name="searchForm" method="post">
<div class="item">
<div class="titlel">
<span>留言查询</span>
<label class="dow"></label>
</div>
<input type="text" name="keyWord" id="keyword" value="" class="text"/>
<div class="key">
<ul>
<li><span><input type="radio" checked="checked" value="3" name="searchType"/></span><p>编号</p></li>
<li><span><input type="radio" value="2" name="searchType"/></span><p>关键字</p></li>
</ul>
</div>
<input type="button" class="btn1" onclick="search();" value="查询"/>
</div>
<input type="hidden" id="pageIndex" name="page.currentpage" value="2"/>
<input type="hidden" id="pageSize" name="page.pagesize" value="15"/>
<input type="hidden" id="pageCount" name="page.pagecount" value="2357"/>
<input type="hidden" id="docStatus" name="docStatus" value=""/>
<input type="hidden" id="sendorg" name="sendOrg" value=""/>
</form>
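
The URL format above can be generated for any page number from the names of those form fields. This is only a sketch under a few assumptions: that the server accepts these fields as query parameters, as the suggested URL implies, and that the page size and page count shown in the hidden inputs still apply. The function name page_url is hypothetical.

```python
from urllib.parse import urlencode

# Base URL taken from the suggested URL format above.
BASE_URL = "http://www.shiyan.gov.cn/zwhd/web/webindex.action"


def page_url(page, page_size=15, page_count=2357):
    """Build the listing URL for one page (hypothetical helper).

    The defaults mirror the hidden inputs in the form and may be out of date.
    """
    params = [
        ("keyWord", ""),
        ("searchType", "3"),
        ("page.currentpage", page),
        ("page.pagesize", page_size),
        ("page.pagecount", page_count),
        ("docStatus", ""),
        ("sendOrg", ""),
    ]
    return BASE_URL + "?" + urlencode(params)


# URLs for the first three pages
urls = [page_url(n) for n in range(1, 4)]
for u in urls:
    print(u)
```

Each generated URL could then be passed to requests.get and parsed with the same td.tit3 > a selector used for the first page.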

python - XPath syntax for second occurrence

<input name="utf8" type="hidden" value="✓" />
<input name="ohboy" type="hidden" value="I_WANT_THIS" />
<label for="user_email">Email</label>
<input class="form-control" id="user_email" name="user[email]" size="30" type="email" value="" />
I'm kind of stuck here. I was originally going to use find() instead of xpath() because the input tag appears in several places in the source, but I figured out that find() only returns the first occurrence in the source.

Use find(), passing the xpath expression specifying an integer index of an element:
from lxml.html import fromstring
html_data = """<input name="utf8" type="hidden" value="✓" />
<input name="ohboy" type="hidden" value="I_WANT_THIS" />
<label for="user_email">Email</label>
<input class="form-control" id="user_email" name="user[email]" size="30" type="email" value="" />"""
tree = fromstring(html_data)
print(tree.find('.//input[2]').attrib['value'])
prints:
I_WANT_THIS
But, even better (and cleaner) would be to find the input by name attribute:
print(tree.find('.//input[@name="ohboy"]').attrib['value'])
