Scrapy - How to get XPATH node as text

I want to get a node as a string from an XHTML document.
For example, given:
<html>
<body>
<div>
<input type="text" value="myText"></input>
<input type="text" value="myText" disabled="true"></input>
</div>
</body>
</html>
I want to get the XHTML text of the two inputs as strings in an array, like this:
['<input type="text" value="myText"></input>',
'<input type="text" value="myText" disabled="true"></input>']
Is this possible with Scrapy's XPath selector?

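This is not exactly what you're asking for: it only shows the opening tag, and it drops ="true" from the last input. Most likely this is because the selector serializes the node as HTML, where input has no closing tag and disabled is a boolean attribute. It might still be enough, depending on what you're using it for.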
>>> response.xpath('//div/input').extract()
[u'<input type="text" value="myText">', u'<input type="text" value="myText" disabled>']
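If the exact XHTML text (closing tag and disabled="true" included) is what matters, one option outside Scrapy is the standard library's xml.etree.ElementTree. This is only a sketch: it assumes the markup is well-formed XHTML, and nodes_as_text is a hypothetical helper name, not part of Scrapy.

```python
import xml.etree.ElementTree as ET

XHTML = """<html>
<body>
<div>
<input type="text" value="myText"></input>
<input type="text" value="myText" disabled="true"></input>
</div>
</body>
</html>"""


def nodes_as_text(markup, path):
    """Return each element matched by `path` serialized as an XML string."""
    root = ET.fromstring(markup)
    return [
        # short_empty_elements=False keeps the explicit </input> closing tag;
        # strip() drops the whitespace that follows each element in the source.
        ET.tostring(node, encoding="unicode", short_empty_elements=False).strip()
        for node in root.findall(path)
    ]


inputs = nodes_as_text(XHTML, ".//div/input")
for text in inputs:
    print(text)
```

ElementTree only understands a subset of XPath, so this fits simple paths like the one in the question rather than arbitrary expressions.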

Related

How to use python selenium find_element_by_xpath find out above key word

I am trying to find a string, but it doesn't seem to work.
HTML:
<form name="form1" method="post" action="?cz=del&wbid=7683290543&zjt=aaa&lx=CNAME&xl=%C4%AC%C8%CF&fs=" onSubmit="return b_ifsf('delete？');" id="form1">
<td style="width:120px">
<input type="hidden" name="ip" value="aaa.xxx.com.a.bdydns.com." >
<input type="submit" name="rpt$btnDelete" value="delete" />
</td>
</form>
<form name="form1" method="post" action="?cz=del&wbid=2324242122&zjt=bbb&lx=CNAME&xl=%C4%AC%C8%CF&fs=" onSubmit="return b_ifsf('delete？');" id="form1">
<td style="width:120px">
<input type="hidden" name="ip" value="bbb.xxx.com.a.bdydns.com." >
<input type="submit" name="rpt$btnDelete" value="delete" />
</td>
</form>
<form name="form1" method="post" action="?cz=del&wbid=2324242553&zjt=ccc&lx=CNAME&xl=%C4%AC%C8%CF&fs=" onSubmit="return b_ifsf('delete？');" id="form1">
<td style="width:120px">
<input type="hidden" name="ip" value="ccc.xxx.com.a.bdydns.com." >
<input type="submit" name="rpt$btnDelete" value="delete" />
</td>
</form>
How can I find the keyword bbb.xxx.com.a.bdydns.com. and then click its submit button to delete it?

@EVNRaja's solution was in the right direction.
To locate the input whose value is bbb.xxx.com.a.bdydns.com. and then click the associated element whose value attribute is delete, you can use either of the following solutions:
Using xpath and click():
driver.find_element_by_xpath("//form[@id='form1' and @name='form1']//input[@name='ip' and @value='bbb.xxx.com.a.bdydns.com.']//following::input[1]").click()
Using xpath and submit():
driver.find_element_by_xpath("//form[@id='form1' and @name='form1']//input[@name='ip' and @value='bbb.xxx.com.a.bdydns.com.']//following::input[1]").submit()
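Since the same lookup is needed for each host name, the XPath can be built from a parameter. The sketch below is an assumption-laden illustration: delete_button_xpath and click_delete are hypothetical helper names, the value is assumed to contain no single quote, and the driver is called through find_element("xpath", ...), which is the form newer Selenium releases expect (By.XPATH is the string "xpath") now that the find_element_by_* helpers have been removed.

```python
def delete_button_xpath(ip_value):
    """Build an XPath for the first input that follows the hidden 'ip' input."""
    # Assumption: the value has no single quote, so it can be embedded as-is.
    if "'" in ip_value:
        raise ValueError("ip_value must not contain a single quote")
    return "//input[@name='ip' and @value='%s']/following::input[1]" % ip_value


def click_delete(driver, ip_value):
    """Find the delete button that belongs to ip_value and click it."""
    # "xpath" is the locator strategy string that Selenium's By.XPATH stands for.
    driver.find_element("xpath", delete_button_xpath(ip_value)).click()


print(delete_button_xpath("bbb.xxx.com.a.bdydns.com."))
```

With a real WebDriver instance this would be called as click_delete(driver, "bbb.xxx.com.a.bdydns.com.").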

The host name you are trying to identify is the value of a hidden element.
The HTML you have provided:
<input type="hidden" name="ip" value="bbb.xxx.com.a.bdydns.com." >
Hidden elements in a page usually serve a purpose.
Example:
Consider a text field that does not accept numeric values as input: if the end user enters a numeric value, an error message is displayed next to the text field.
Here, until a numeric value is entered, the error message (the text inside the HTML element) stays hidden.
In the HTML you have shared, the value you want to inspect is in an input tag with type="hidden" name="ip" value="bbb.xxx.com.a.bdydns.com.", so we can write a compound XPath as follows:
An example with multiple conditions:
//input[@type = 'hidden' and @name = 'ip' and contains(@value, 'bbb.xxx.com.a.bdydns.com.')]/following-sibling::input
or
A simple example:
//input[contains(@value, 'bbb.xxx.com.a.bdydns.com.')]/following-sibling::input
With this XPath we can directly identify the submit button, and in the next step you can click it.

You should be able to use a css selector combination of:
[value='bbb.xxx.com.a.bdydns.com.'] + input
Code:
driver.find_element_by_css_selector("[value='bbb.xxx.com.a.bdydns.com.'] + input").click() #.submit()
The first part is an attribute = value CSS selector, the "+" is the adjacent sibling combinator, and input is an element (type) selector. Together they say: find the input element that is the adjacent sibling of the element whose value attribute is bbb.xxx.com.a.bdydns.com.

Selecting xpath input given label

So I normally use driver.find_element_by_xpath('//input[@id=(//*[contains(text(), "User Name")]/@for)]') to enter information into text boxes, but this hasn't been working for the following code:
<div class="form-group text-entry required ">
<div class="label-set" id="answerTextBox-30918642">Primary User Name (First and Last Name)</div>
<div class="group-set" role="group" aria-labelledby="answerTextBox-30918642">
<input id="pageResponse_Responses_Index" name="pageResponse.Responses.Index" type="hidden" value="fta30918642">
<input class="freeTextAnswerId form-control" id="pageResponse_Responses_fta30918642__FreeTextAnswerId" name="pageResponse.Responses[fta30918642].FreeTextAnswerId" type="hidden" value="30918642">
<label for="answerTextBox-30918642-free" class="sr-only">Write-In Answer</label>
<input class="form-control free-text" id="answerTextBox-30918642-free" name="pageResponse.Responses[fta30918642].FreeText" type="text" value="">
</div>
</div>
I tried messing with the xpath to select the input,
driver.find_element_by_xpath("//label[contains(.,'User Name')]/following-sibling::input[1]")
but nothing I've tried so far has worked correctly. This works to find the element containing the label, driver.find_element_by_xpath("//*[contains(text(), 'User Name')]"), but my issue is then selecting the input to send keys to.

You can simplify your XPath to
//div[contains(.,'User Name')]/following::input[@type='text']
The definition of the following axis, from MDN:
The following axis indicates all the nodes that appear after the context node, except any descendant, attribute, and namespace nodes.
The other INPUTs are hidden, so specifying @type='text' will find only the one you want.
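To see what the following axis selects here, its definition can be imitated with the standard library. This is only an illustration, not what Selenium or the browser runs: the snippet is a trimmed copy of the question's HTML with the input tags closed so it parses as XML, following is a hypothetical helper, and it looks at elements only (the real axis also covers text and other node types).

```python
import xml.etree.ElementTree as ET

SNIPPET = """<div class="form-group">
  <div class="label-set" id="answerTextBox-30918642">Primary User Name (First and Last Name)</div>
  <div class="group-set">
    <input id="pageResponse_Responses_Index" type="hidden" />
    <input id="FreeTextAnswerId" type="hidden" />
    <label for="answerTextBox-30918642-free">Write-In Answer</label>
    <input id="answerTextBox-30918642-free" type="text" />
  </div>
</div>"""


def following(root, context):
    """Elements after `context` in document order, excluding its descendants."""
    ordered = list(root.iter())  # iter() walks the tree in document order
    inside = {id(node) for node in context.iter()}  # context plus descendants
    position = next(i for i, node in enumerate(ordered) if node is context)
    return [node for node in ordered[position + 1:] if id(node) not in inside]


root = ET.fromstring(SNIPPET)
# The div whose own text contains the label.
label_div = next(el for el in root.iter("div") if el.text and "User Name" in el.text)
# First following input that is a visible text box.
target = next(
    el for el in following(root, label_div)
    if el.tag == "input" and el.get("type") == "text"
)
print(target.get("id"))
```

The two hidden inputs also come after the label div in document order, which is why the type='text' filter is needed.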

So I was able to get it working by integrating Santhosh's suggestion:
//*[contains(text(), "User Name")]/../div//input[@class='form-control free-text']

Python BeautifulSoup returning wrong list of inputs from find_all()

I have Python 2.7.3 and Beautiful Soup (bs4) version 4.4.1.
For some reason this code
from bs4 import BeautifulSoup # parsing
html = """
<html>
<head id="Head1"><title>Title</title></head>
<body>
<form id="form" action="login.php" method="post">
<input type="text" name="fname">
<input type="text" name="email" >
<input type="button" name="Submit" value="submit">
</form>
</body>
</html>
"""
html_proc = BeautifulSoup(html, 'html.parser')
for form in html_proc.find_all('form'):
    for input in form.find_all('input'):
        print "input:" + str(input)
returns a wrong list of inputs:
input:<input name="fname" type="text">
<input name="email" type="text">
<input name="Submit" type="button" value="submit">
</input></input></input>
input:<input name="email" type="text">
<input name="Submit" type="button" value="submit">
</input></input>
input:<input name="Submit" type="button" value="submit">
</input>
It's supposed to return
input: <input name="fname" type="text">
input: <input type="text" name="email">
input: <input type="button" name="Submit" value="submit">
What happened?

To me, this looks like an artifact of the html parser. Using 'lxml' for the parser instead of 'html.parser' seems to make it work. The downside of this is that you (or your users) then need to install lxml -- The upside is that lxml is a better/faster parser ;-).
As for why 'html.parser' doesn't seem to work correctly in this case, I think it has something to do with the fact that input is a void element, normally written without a closing tag. If you explicitly close your inputs, it works:
<input type="text" name="fname" ></input>
<input type="text" name="email" ></input>
<input type="button" name="Submit" value="submit" ></input>
I would be curious to see if we could modify the source code to handle this case ... Doing a little experiment to monkey-patch bs4 indicates that we can do this:
from bs4 import BeautifulSoup
from bs4.builder import _htmlparser
# Monkey-patch the Beautiful Soup HTML parser to close input tags automatically.
BeautifulSoupHTMLParser = _htmlparser.BeautifulSoupHTMLParser
class FixedParser(BeautifulSoupHTMLParser):
    def handle_starttag(self, name, attrs):
        # Old-style class... No super :-(
        result = BeautifulSoupHTMLParser.handle_starttag(self, name, attrs)
        if name.lower() == 'input':
            self.handle_endtag(name)
        return result
_htmlparser.BeautifulSoupHTMLParser = FixedParser
html = """
<html>
<head id="Head1"><title>Title</title></head>
<body>
<form id="form" action="login.php" method="post">
<input type="text" name="fname" >
<input type="text" name="email" >
<input type="button" name="Submit" value="submit" >
</form>
</body>
</html>
"""
html_proc = BeautifulSoup(html, 'html.parser')
for form in html_proc.find_all('form'):
    for input in form.find_all('input'):
        print "input:" + str(input)
Obviously, this isn't a true fix (I wouldn't submit this as a patch to the BS4 folks), but it does demonstrate the problem. Since there is no end-tag, the handle_endtag method is never getting called. If we call it ourselves, things tend to work out (as long as the html doesn't also have a closing input tag ...).
I'm not really sure whose responsibility this bug should be, but I suppose that you could start by submitting it to bs4 -- They might then forward you on to report a bug on the python tracker, I'm not sure...
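The claim that handle_endtag is never called for an unclosed input can be checked against the standard library parser on its own, without Beautiful Soup. A minimal sketch in Python 3 (EventLogger is a hypothetical class name used only for this check):

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record which start-tag and end-tag callbacks the parser fires."""

    def __init__(self):
        super().__init__()
        self.started = []
        self.ended = []

    def handle_starttag(self, tag, attrs):
        self.started.append(tag)

    def handle_endtag(self, tag):
        self.ended.append(tag)


logger = EventLogger()
logger.feed(
    '<form id="form" action="login.php" method="post">'
    '<input type="text" name="fname">'
    '<input type="text" name="email">'
    '</form>'
)
logger.close()
print(logger.started)
print(logger.ended)
```

The parser reports a start tag for each input but an end tag only for form, so whatever builds the tree on top of it has to decide for itself that input closes immediately.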

Don't use a nested loop for this, and use lxml. Change your code to this:
inp = []
html_proc = BeautifulSoup(html, 'lxml')
for form in html_proc.find_all('form'):
    inp.extend(form.find_all('input'))
for item in inp:
    print "input:" + str(item)

How to search for a word and then replace text after it using regular expressions in python?

I'm trying to write a script that will search through a html file and then replace the form action. So in this basic code:
<html>
<head>
<title>Forms</title>
</head>
<body>
<form action="login.php" method="post">
Username: <input type="text" name="username" value="" />
<br />
Password: <input type="password" name="password" value="" />
<br />
<input type="submit" name="submit" value="Submit">
</form>
</body>
</html>
I would like the script to search for form action="login.php" but then replace only the login.php, with say newlogin.php. The key thing is that the form action might change from file to file, i.e. in another HTML file the login.php might be something totally different, so the regular expression has to search for the form action= and replace the text after it (maybe using the " as delimiters?)
My knowledge of regular expressions is pretty basic, for example I'd know how to replace just login.php:
(re.sub('login.php', 'newlogin.php', line))
but obviously it's no use as mentioned above if the login.php changes from file to file.
Any help is much appreciated!
Thanks all =)

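You can use regex, or just simple string manipulation. Here is a simple test case:
for line in open("file"):
    line = line.rstrip()
    if "form action" in line:
        # split the line at the start of the action value
        a = line.split('<form action="')
        # put the new action first, then keep the last attribute of the tag
        a[-1] = '"newlogin" ' + a[-1].split()[-1]
        line = '<form action='.join(a)
    print line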

Make the re catch 2 groups, the form and everything leading up to the 1st quote after action, and the action content.
Use the 1st group for the replacement, followed by the new action:
re.sub(r'(<form.*?action=")([^"]+)', r'\1newlogin.php', content)
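A runnable Python 3 sketch of the two-group idea follows. It is a variation rather than the exact pattern above: it uses [^>]* so the match cannot run past the end of the form tag, and a function as the replacement so that backslashes in the new action are not interpreted. replace_form_action is a hypothetical helper name.

```python
import re

# Group 1 captures everything from "<form" up to the opening quote of action.
FORM_ACTION = re.compile(r'(<form\b[^>]*?\baction=")[^"]*')


def replace_form_action(html, new_action):
    """Replace the action value of every form tag with new_action."""
    return FORM_ACTION.sub(lambda match: match.group(1) + new_action, html)


page = '<body>\n<form action="login.php" method="post">\n</form>\n</body>'
print(replace_form_action(page, "newlogin.php"))
```

Because the old value is matched by [^"]* rather than spelled out, the same call works when the original action is something other than login.php.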

You can try this technique:
(<form[^>]*action=")[^"]*
pseudo-code:
regex.replace(input, pattern, concat(\1, new_value))
In regex engines that support variable-length lookbehind (Python's re module does not), you can use this regex:
(?<=<form[^>]*action=")[^"]*
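Whether a given engine accepts the lookbehind form is easy to test. The sketch below checks it against Python's standard re module, which requires fixed-width lookbehind; compiles is a hypothetical helper name.

```python
import re

LOOKBEHIND = r'(?<=<form[^>]*action=")[^"]*'
CAPTURE_GROUP = r'(<form[^>]*action=")[^"]*'


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


print(compiles(LOOKBEHIND))
print(compiles(CAPTURE_GROUP))
```

In Python the capture-group technique is therefore the one to use with re.sub.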

HTML input array parsing in Python (GAE)

I'm two days into Python and GAE; thanks in advance for the help.
I have an input array in HTML like this:
<input type="text" name="p_item[]">
<input type="text" name="p_item[]">
<input type="text" name="p_item[]">
I want to parse the input in Python, and I'm trying this, which isn't working:
items = self.request.get('p_item')
for n in range(1,len(items)):
    self.response.out.write('Item '+n+': '+items[n])
What is the correct way to do this?

Change your html to this
<input type="text" name="p_item">
<input type="text" name="p_item">
<input type="text" name="p_item">
and use the self.request.get_all() method http://code.google.com/appengine/docs/python/tools/webapp/requestclass.html#Request_get_all
P.S. For reference, there is no concept of arrays in GET/POST data; your form gets transformed into a key=value string separated by '&', e.g.
p_item=1&p_item=3&p_item=15
etc. It's up to the web framework to interpret whether a parameter is an array.
Edit: oops, just read the comments that you figured this out already, oh well :P
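The point about repeated keys can be tried outside GAE with the Python 3 standard library. This is an illustration only, not the webapp API: urllib.parse.parse_qs plays the role that get_all() plays in the handler, and repeated_values is a hypothetical helper name.

```python
from urllib.parse import parse_qs


def repeated_values(query_string, key):
    """Return every value submitted under `key`, in order."""
    # parse_qs collects repeated keys into a list per key.
    return parse_qs(query_string).get(key, [])


items = repeated_values("p_item=1&p_item=3&p_item=15", "p_item")
# enumerate starts counting at 1 without skipping the first item, and
# %d formatting avoids concatenating a str with an int.
for n, item in enumerate(items, start=1):
    print("Item %d: %s" % (n, item))
```

The same loop shape should carry over to the list returned by self.request.get_all('p_item').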

I would recommend doing some debugging if this sort of issue comes up. Make things simple and write out your variable values and ensure you get what you expect at each step. Do something like the following:
<form method="get">
<input type="text" name="single_key" />
<input type="text" name="array_key[some_key]" />
<input type="submit" />
</form>
And see what happens when running the following Python on the backend:
single_value = self.request.get('single_key')
self.response.out.write(str(single_value))
array_value = self.request.get('array_key')
self.response.out.write(str(array_value))
Based on the output, you should have a better idea of how to get the desired results, or of how to add more detail to your question if you still don't understand a certain behavior.
