I'd like to extract the red characters from the markup below using Selenium and then use them. Any advice on how to do it?
The characters change randomly on every attempt.
<td>
<input type="text" name="WKey" id="As_wkey" value="" maxlength="10" class="inputType" style="width: 300px;" title="password" />
<span id="myspam" style="padding: 2px;">
<span style="font-size: 12pt; font-weight: bold; color: red;">H</span>123
<span style="font-size: 12pt; font-weight: bold; color: red;">R</span>
<span style="font-size: 12pt; font-weight: bold; color: red;">8</span>6789
</span>
(type red word.)
</td>
Here is my code:
red_characters_element = driver.find_element(By.ID, 'myspam')
red_characters_elements = red_characters_element.find_elements(by = By.CSS_SELECTOR, value="span[style='font-size: 12pt; font-weight: bold; color: red;']")
print(red_characters_elements)
Result: []
Since all the red characters are inside <span> tags, you can retrieve them by tag name:
red_characters_element = driver.find_element(By.ID, 'myspam')
red_characters_elements = red_characters_element.find_elements(By.TAG_NAME, 'span')
for red_character in red_characters_elements:
    print(red_character.text)
Results :
H
R
8
If you need only the red letters, you can run JavaScript from Selenium:
driver.execute_script('return document.querySelectorAll("[style*=red]")')
This returns a list of the elements whose style attribute contains "red"; with a for loop you can then read the text or anything else from each one.
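To make the "for loop" step concrete, here is a minimal sketch that filters elements on their inline style and joins the text. The helper name `red_text` and the `FakeElement` stand-in are made up for illustration; in a real session the list would come from `find_elements(By.TAG_NAME, 'span')` on the `myspam` element, and it assumes that `get_attribute('style')` still contains the word "red" in your browser.

```python
class FakeElement:
    """Stand-in for a Selenium WebElement (hypothetical, for illustration only)."""

    def __init__(self, text, style):
        self.text = text
        self._style = style

    def get_attribute(self, name):
        return self._style if name == "style" else None


def red_text(elements):
    """Join the text of every element whose inline style mentions 'red'."""
    parts = []
    for element in elements:
        style = (element.get_attribute("style") or "").lower()
        if "red" in style:
            parts.append(element.text.strip())
    return "".join(parts)


# Sample data modelled on the markup in the question, plus one non-red span.
spans = [
    FakeElement("H", "font-size: 12pt; font-weight: bold; color: red;"),
    FakeElement("R", "font-size: 12pt; font-weight: bold; color: red;"),
    FakeElement("x", "color: blue;"),
    FakeElement("8", "font-size: 12pt; font-weight: bold; color: red;"),
]
code = red_text(spans)
print(code)
```

The joined string could then be typed into the text box from the question, for example with `send_keys(code)` on the element whose id is `As_wkey`.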
I'm trying to fill a field with text inputs from a CSV. send_keys works fine for all fields except the one below:
<div class="col-xs-12 col-md-6">
<div class="custom-select" data-qa="work-tags" data-testid="work-tags" aria-disabled="false">
<div class="custom-select__label">Tags</div>
<div class=" css-2b097c-container">
<div class=" css-yk16xz-control">
<div class=" css-1hwfws3">
<div class=" css-1wa3eu0-placeholder">Select</div>
<div class="css-1g6gooi">
<div class="" style="display: inline-block;"><input autocapitalize="none" autocomplete="off" autocorrect="off" id="react-select-10-input" spellcheck="false" tabindex="0" type="text" aria-autocomplete="list" value="" style="box-sizing: content-box; width: 2px; background: 0px center; border: 0px; font-size: inherit; opacity: 1; outline: 0px; padding: 0px; color: inherit;">
<div
style="position: absolute; top: 0px; left: 0px; visibility: hidden; height: 0px; overflow: scroll; white-space: pre; font-size: 14px; font-family: "Open Sans", sans-serif; font-weight: 400; font-style: normal; letter-spacing: normal; text-transform: none;"></div>
</div>
</div>
</div>
<div class=" css-1wy0on6"><span class=" css-1okebmr-indicatorSeparator"></span>
<div aria-hidden="true" class=" css-tlfecz-indicatorContainer"><svg height="20" width="20" viewBox="0 0 20 20" aria-hidden="true" focusable="false" class="css-19bqh2r"><path d="M4.516 7.548c0.436-0.446 1.043-0.481 1.576 0l3.908 3.747 3.908-3.747c0.533-0.481 1.141-0.446 1.574 0 0.436 0.445 0.408 1.197 0 1.615-0.406 0.418-4.695 4.502-4.695 4.502-0.217 0.223-0.502 0.335-0.787 0.335s-0.57-0.112-0.789-0.335c0 0-4.287-4.084-4.695-4.502s-0.436-1.17 0-1.615z"></path></svg></div>
</div>
</div>
</div>
</div>
</div>
From the UI I can simply type the text and save.
I have tried the following, but it didn't work:
driver.find_element_by_xpath("//div[@data-qa='work-tags']//div[@class=' css-2b097c-container']//div[@class=' css-yk16xz-control']").click()
time.sleep(1)
driver.find_element_by_xpath("//div[@data-qa='work-tags']//div[@class=' css-2b097c-container']//div[@class=' css-yk16xz-control']").send_keys(SSID_rows[SSIDs][1],Keys.TAB)
Thank you
You're trying to send text to a div. Use the input node instead:
driver.find_element_by_id("react-select-10-input").send_keys(SSID_rows[SSIDs][1],Keys.TAB)
This is a link to HTML I want to scrape
https://pk.khaadi.com/unstitched/r20206-red-r20206-red-pk.html
<div class="swatch-attribute-options clearfix">
  <div class="swatch-option color selected" option-type="1" option-id="61" option-label="RED" option-tooltip-thumb="" option-tooltip-value="#ee0000" "="" style="background: #ee0000 no-repeat center; background-size: initial;">
  </div>
  <div class="swatch-option color selected" option-type="1" option-id="73" option-label="YELLOW" option-tooltip-thumb="" option-tooltip-value="#feed00" "="" style="background: #feed00 no-repeat center; background-size: initial;">
  </div>
</div>
Color = S_Driver.find_elements_by_xpath('//*[@id="product-options-wrapper"]/div/div/div[1]/div')
The XPath points to the outer div that contains both color divs.
for c in Color:
    n_Color.append(c.get_attribute('option-label'))
print(n_Color)
This is how I tried to extract the colors through the 'option-label' attribute.
Change the XPath to:
//div[@class='swatch-option color']
This is based on the provided screenshot; hopefully nothing else on the page matches it. If something does, use this instead:
//div[@class='swatch-option color' and @option-type='1']
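One caveat worth checking: an XPath test like `@class='...'` compares the whole attribute string, and the markup pasted in the question has `class="swatch-option color selected"`. The sketch below illustrates this with the standard library's ElementTree rather than Selenium (an assumption made so the example runs without a browser), on a simplified, well-formed copy of the swatch markup.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the swatch markup from the question; the stray `"=""`
# and the style attribute are dropped so the snippet parses as XML.
SNIPPET = """
<div class="swatch-attribute-options clearfix">
  <div class="swatch-option color selected" option-type="1" option-id="61" option-label="RED"></div>
  <div class="swatch-option color selected" option-type="1" option-id="73" option-label="YELLOW"></div>
</div>
"""

root = ET.fromstring(SNIPPET)

# Exact comparison: "swatch-option color" is not "swatch-option color selected".
exact = root.findall(".//div[@class='swatch-option color']")

# Matching on a different attribute is unaffected by extra class names.
by_type = root.findall(".//div[@option-type='1']")
labels = [div.get("option-label") for div in by_type]

print(len(exact), labels)
```

In a browser-backed XPath you could also use the standard `contains()` function, e.g. `//div[contains(@class, 'swatch-option')]`, if the class list varies between elements.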
I am building an HTML table from a list with lxml.builder and trying to put a link in one of the table cells.
The list is generated as follows:
from lxml import etree
with open('some_file.html', 'r') as f:
    table = etree.parse(f)
p_list = list()
rows = table.iter('div')
p_list.append([c.text for c in rows])
rows = table.xpath("body/table")[0].findall("tr")
for row in rows[2:]:
    p_list.append([c.text for c in row.getchildren()])
The HTML file I parse is the same one that lxml generates further down, i.e. I set up a kind of round trip for testing purposes.
And here is how I build the table:
from lxml.builder import E
page = (
    E.html(
        E.head(
            E.title("title")
        ),
        E.body(
            ....
            *[E.tr(
                *[
                    E.td(E.a(E.img(src=str(col)))) if ind == 8 else
                    E.td(E.a(str(col), href=str(col))) if ind == 9 else
                    E.td(str(col)) for ind, col in enumerate(row)
                ]
            ) for row in p_list ]
When I specify the link with literals, everything works fine:
E.td(E.a("link", href="url_address"))
However, when I try to output a list element's value (which is https://blahblahblah.com) as a link
E.td(E.a(str(col), href=str(col)))
the cell is empty; nothing is shown in it.
If I specify the link text as a literal and put str(col) into href, the link is shown normally, but instead of the real href it contains the name of the generated HTML file.
If I output just that col value as a string
E.td(str(col))
it is shown normally, i.e. it is not empty. What is wrong with the E.a and E.img elements?
I just noticed that this happens only when I build the list from the HTML file. When I build the list manually, like this, everything is output fine:
p_list = []
p_element = ['id']
p_element.append('value')
p_element.append('value2')
p_list.append(p_element)
Current output (note the <a> and <img> elements and their href and src attributes):
<html>
<head>
<title>page</title>
</head>
<body>
<style type="text/css">
th {
background-color: DeepSkyBlue;
text-align: center;
vertical-align: bottom;
height: 150px;
padding-bottom: 3px;
padding-left: 5px;
padding-right: 5px;
}
.vertical {
text-align: center;
vertical-align: middle;
width: 20px;
margin: 0px;
padding: 0px;
padding-left: 3px;
padding-right: 3px;
padding-top: 10px;
white-space: nowrap;
-webkit-transform: rotate(-90deg);
-moz-transform: rotate(-90deg);
}</style>
<h1>title</h1>
<p>This is another paragraph, with a</p>
<table border="2">
<tr>
<th>
<div class="vertical">ID</div>
</th>
...
<th>
<div class="vertical">I blacklisted him</div>
</th>
</tr>
<tr>
<td>1020</td>
<td>ТаисияСтрахолет</td>
<td>No</td>
<td>Female</td>
<td>None</td>
<td>Санкт-Петербург</td>
<td>Росiя</td>
<td>None</td>
<td>
<a>
<img src="
"/>
</a>
</td>
<td>
<a href="
">
</a>
</td>
...
</tr>
</table>
</body>
</html>
Desired output
<html>
<head>
<title>page</title>
</head>
<body>
<style type="text/css">
th {
background-color: DeepSkyBlue;
text-align: center;
vertical-align: bottom;
height: 150px;
padding-bottom: 3px;
padding-left: 5px;
padding-right: 5px;
}
.vertical {
text-align: center;
vertical-align: middle;
width: 20px;
margin: 0px;
padding: 0px;
padding-left: 3px;
padding-right: 3px;
padding-top: 10px;
white-space: nowrap;
-webkit-transform: rotate(-90deg);
-moz-transform: rotate(-90deg);
}</style>
<h1>title</h1>
<p>This is another paragraph, with a</p>
<table border="2">
<tr>
<th>
<div class="vertical">ID</div>
</th>
...
<th>
<div class="vertical">I blacklisted him</div>
</th>
</tr>
<tr>
<td>1019</td>
<td>МихаилПавлов</td>
<td>No</td>
<td>Male</td>
<td>None</td>
<td>Санкт-Петербург</td>
<td>Росiя</td>
<td>C.-Петербург</td>
<td>
<a>
<img src="http://i.imgur.com/rejChZW.jpg"/>
</a>
</td>
<td>
link
</td>
...
</tr>
</table>
</body>
</html>
Got it myself. The problem was not in generating the HTML but in parsing it. The parsing function didn't fetch the IMG and A tags nested in TD, so those elements of the list were empty. Because the program's logic is fairly involved (fetching from a file plus fetching from the site API), I wasn't able to spot the cause at first.
The correct parsing logic should be:
for row in rows[1:]:
    data.append([
        c.find("a").text if c.find("a") is not None else
        c.find("img").attrib['src'] if c.find("img") is not None else
        c.text
        for c in row.getchildren()
    ])
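To illustrate why plain `c.text` came back empty, here is a small sketch using the standard library's ElementTree as a stand-in for lxml (the `find`, `text` and `get` calls behave the same way here). The sample row, the `cell_value` helper and the `https://example.com/profile` URL are made up for illustration. Unlike the snippet above, it looks for a nested `<img>` first, on the assumption that the image sits inside an `<a>` as in the generated table.

```python
import xml.etree.ElementTree as ET

# Hypothetical row shaped like the generated table: plain text, image, link.
ROW = """
<tr>
  <td>1019</td>
  <td><a><img src="http://i.imgur.com/rejChZW.jpg"/></a></td>
  <td><a href="https://example.com/profile">link</a></td>
</tr>
"""


def cell_value(cell):
    """Return the image source, the link text, or the plain cell text."""
    img = cell.find(".//img")  # search nested elements, not only direct children
    if img is not None:
        return img.get("src")
    link = cell.find(".//a")
    if link is not None:
        return link.text
    return cell.text


row = ET.fromstring(ROW)

# .text only covers text before the first child element, so cells that
# start with <a> or <img> yield None here.
naive = [cell.text for cell in row]
values = [cell_value(cell) for cell in row]

print(naive)
print(values)
```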
Just a guess but it looks like 'str()' could be escaping it. Try E.td(E.a(col, href=col))
I'm getting this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2661' in position 1409: ordinal not in range(128)
I'm still very new to programming, so have mercy on me and my ignorance. As I understand it, the error means the code can't handle Unicode characters. There's at least one Unicode character in that feed, but countless others could turn up.
I've looked for others who've had similar problems, but I can't find a solution that I understand or can make work.
#import library to do http requests:
import urllib
from xml.dom.minidom import parseString, parse
f = open('games.html', 'w')
document = urllib.urlopen('https://itunes.apple.com/us/rss/topfreemacapps/limit=300/genre=12006/xml')
dom = parse(document)
image = dom.getElementsByTagName('im:image')
title = dom.getElementsByTagName('title')
price = dom.getElementsByTagName('im:price')
address = dom.getElementsByTagName('id')
imglist = []
titlist = []
pricelist = []
addlist = []
i = 0
j = 20
k = 40
f.write('''\
<!DOCTYPE html>
<html>
<head>
<style type="text/css">
<!--
A:link {text-decoration: none; color: #246DA8;}
A:visited {text-decoration: none; color: #246DA8;}
A:active {text-decoration: none; color: #40A9E3;}
A:hover {text-decoration: none; color: #40A9E3;}
.box {
vertical-align:middle;
width: 180px;
height: 120px;
border: 1px solid #99c;
padding: 5px;
margin: 0px;
margin-left: auto;
margin-right: auto;
-moz-border-radius: 5px;
border-radius: 5px;
-webkit-border-radius: 5px;
background-color:#ffffff;
font-family: Arial, Helvetica, sans-serif; color: black;
font-size: small;
font-weight: bold;
}
-->
</style>
</head>
<body>
''')
for i in range(0,len(image)):
    if image[i].getAttribute('height') == '53':
        imglist.append(image[i].firstChild.nodeValue)
for i in range(1,len(title)):
    titlist.append(title[i].firstChild.nodeValue)
for i in range(0,len(price)):
    pricelist.append(price[i].firstChild.nodeValue)
for i in range(1,len(address)):
    addlist.append(address[i].firstChild.nodeValue)
for i in range(0,20):
    f.write('''
<div style="width: 600px;">
<div style="float: left; width: 200px;">
<div class="box" align="center">
<div align="center">
''' + titlist[i] + '''<br>
<img src="''' + imglist[i] + '''" alt="" width="53" height="53" border="0" ><br>
<span>''' + pricelist[i] + '''</span>
</div>
</div>
</div>
<div style="float: left; width: 200px;">
<div class="box" align="center">
<div align="center">
''' + titlist[i+j] + '''<br>
<img src="''' + imglist[i+j] + '''" alt="" width="53" height="53" border="0" ><br>
<span>''' + pricelist[i+j] + '''</span>
</div>
</div>
</div>
<div style="float: left; width: 200px;">
<div class="box" align="center">
<div align="center">
''' + titlist[i+k] + '''<br>
<img src="''' + imglist[i+k] + '''" alt="" width="53" height="53" border="0" ><br>
<span>''' + pricelist[i+k] + '''</span>
</div>
</div>
</div>
<br style="clear: left;" />
</div>
<br>
''')
f.write('''</body>''')
f.close()
The basic problem is that you're concatenating the Unicode strings with ordinary byte-strings without converting them using a proper encoding; in these cases, ASCII is used by default (which, clearly, can't handle extended characters).
The line in your script that does this is too long to quote, but another practical example which displays the same problem could look like this:
import sys
parameter = u"foo \u2661"
sys.stdout.write(parameter + " bar\n")
You will need to instead encode the Unicode strings with an explicitly specified encoding, e.g. like this:
parameter = u"foo \u2661"
sys.stdout.write(parameter.encode("utf8") + " bar\n")
In your case, you can do this in your loops so as to not have to specify it on every concatenation:
for i in range(1,len(title)):
    titlist.append(title[i].firstChild.nodeValue.encode("utf8"))
--
Also, while we're at it, you can improve your code by not iterating through the elements using an integer index. For instance, instead of this:
title = dom.getElementsByTagName('title')
for i in range(1,len(title)):
    titlist.append(title[i].firstChild.nodeValue.encode("utf8"))
... you can do this instead:
# [1:] skips the first title, as range(1, ...) did above
for title in dom.getElementsByTagName('title')[1:]:
    titlist.append(title.firstChild.nodeValue.encode("utf8"))
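For anyone reading this on Python 3, where `str` is already Unicode, a rough equivalent is to keep the strings as text and give the output file an explicit encoding. The sketch below writes to a temporary directory so it can be run anywhere; the `games.html` name is reused from the question and the sample title is made up. In Python 2, `io.open(path, 'w', encoding='utf-8')` offers the same `encoding` parameter.

```python
import os
import tempfile

title = "foo \u2661"  # contains U+2661 WHITE HEART SUIT, which ASCII cannot represent

# Encoding to ASCII fails in the same way as the error in the question.
try:
    title.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 succeeds and yields bytes.
utf8_bytes = title.encode("utf-8")

# Letting the file object do the encoding avoids per-string .encode() calls.
path = os.path.join(tempfile.mkdtemp(), "games.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<span>" + title + "</span>\n")

with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print(ascii_ok, utf8_bytes)
```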
To start off here's my current code in its entirety:
import urllib
from BeautifulSoup import BeautifulSoup
import sgmllib
import re
page = 'http://www.sec.gov/Archives/edgar/data/\
8177/000114036111018563/form10k.htm'
sock = urllib.urlopen(page)
raw = sock.read()
soup = BeautifulSoup(raw)
tablelist = soup.findAll('table')
class MyParser(sgmllib.SGMLParser):
    def parse(self, segment):
        self.feed(segment)
        self.close()
    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.descriptions = []
        self.inside_td_element = 0
        self.starting_description = 0
    def start_td(self, attributes):
        for name, value in attributes:
            if name == "valign":
                self.inside_td_element = 1
                self.starting_description = 1
            else:
                self.inside_td_element = 1
                self.starting_description = 1
    def end_td(self):
        self.inside_td_element = 0
    def handle_data(self, data):
        if self.inside_td_element:
            if self.starting_description:
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                self.descriptions[-1] += data
    def get_descriptions(self):
        return self.descriptions
counter = 0
trlist = []
dtablelist = []
while counter < len(tablelist):
    trsegment = tablelist[counter].findAll('tr')
    trlist.append(trsegment)
    strsegment = str(trsegment)
    myparser = MyParser()
    myparser.parse(strsegment)
    sub = myparser.get_descriptions()
    dtablelist.append(sub)
    counter = counter + 1
ex = []
dtablelist = [s for s in dtablelist if s != ex]
What I want to accomplish is to take all the tables from an HTML document and reproduce them in an Excel spreadsheet. When I create trlist, the output looks like this:
print trlist[1]
[<tr>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT- SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline"> </font></td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Title of each class</font></div>
</td>
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Name of exchange</font></td>
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline"> </font></td>
</tr>, <tr>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"> </font></td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="DISPLAY: inline; FONT-WEIGHT: bold">Common Stock, par value</font> </font></div>
</td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="FONT-WEIGHT: bold"><font style="FONT-WEIGHT: bold">< <font style="FONT-WEIGHT: bold">NASDAQ Global Market</font></font></font></font></div>
</div>
</td>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"> </font></td>
</tr>,...
As you can see, each item in trlist is an individual row (<tr> . . . </tr>) of the table, which is what I want. But when I run each trlist item through my sgmllib parser to retrieve the contents between the tags, I get this output:
print dtablelist[1]
['\nTitle of each class\n', 'Name of exchange', '\nCommon Stock, par value\n', '\n\nNASDAQ Global Market\n\n', '\n$1.00 per share\n']
As you can see, the output has each cell's contents as its own string, instead of a list of the contents of each table row (<tr>). So essentially I want this output:
[['\nTitle of each class\n', 'Name of exchange'], ['\nCommon Stock, par value\n', '\n\nNASDAQ Global Market\n\n'], ['\n$1.00 per share\n']]
Is it because I have to turn trlist into a string before I parse it with MyParser? Does anyone know any way around this, allowing me to parse lists within lists (aka Inception shit)?
Using lxml.html:
>>> import lxml.html
>>> data = ["<tr><td>test</td><td>help</td></tr>", "<tr><td>data1</td><td>data2</td></tr>"]
>>> [lxml.html.fromstring(tr).xpath(".//text()") for tr in data]
[['test', 'help'], ['data1', 'data2']]
And here is some more complete code. It stores the text in a list containing a list of tables, and each table has a list of tr's, and each tr has a list of all the text.
import urllib
import lxml.html
data = urllib.urlopen('http://www.sec.gov/Archives/edgar/data/8177/000114036111018563/form10k.htm').read()
tree = lxml.html.fromstring(data)
tables = []
for tbl in tree.iterfind('.//table'):
    tele = []
    tables.append(tele)
    for tr in tbl.iterfind('.//tr'):
        text = [e.strip() for e in tr.xpath('.//text()') if len(e.strip()) > 0]
        tele.append(text)
print tables
Hope this helps, cheers!
If somebody is looking for a solution to the same problem but is using Python 3:
You don't have to use an external library to parse an HTML table, even in Python 3. There, the SGMLParser class was replaced by HTMLParser from html.parser. I've written a simple class derived from HTMLParser. It is here in a github repo. It simply keeps track of the current scope of a <td>, <tr> or <table> tag. The advantages over using etree are that it runs correctly on non-XML-compliant HTML and that it doesn't use external libraries.
You can use that class (here named HTMLTableParser) as follows:
import urllib.request
from html_table_parser import HTMLTableParser
target = 'http://www.twitter.com'
# get website content
req = urllib.request.Request(url=target)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')
# instantiate the parser and feed it
p = HTMLTableParser()
p.feed(xhtml)
print(p.tables)
The output is a list of 2D lists representing the tables. It may look like this:
[[[' ', ' Anmelden ']],
[['Land', 'Code', 'Für Kunden von'],
['Vereinigte Staaten', '40404', '(beliebig)'],
['Kanada', '21212', '(beliebig)'],
...
['3424486444', 'Vodafone'],
[' Zeige SMS-Kurzwahlen für andere Länder ']]]
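For readers who can't reach the repository, the idea of tracking the current table, row and cell scope can be sketched with the standard library alone. This is not the class from the answer: the name `SimpleTableParser` and its details are assumptions made for illustration, and it deliberately ignores nested tables.

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Collect tables as lists of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []   # finished and in-progress tables
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text only counts while we are inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


sample = (
    "<table>"
    "<tr><td>test</td><td>help</td></tr>"
    "<tr><td><b>data1</b></td><td> data2 </td></tr>"
    "</table>"
)
parser = SimpleTableParser()
parser.feed(sample)
print(parser.tables)
```

Each table keeps its rows grouped, which is the nesting the original question asked for (tables, then rows, then cell text).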