I'm working on a crypto trading system. I don't have access to the exchange API at the moment, so I decided to try a solution using Selenium automation.
What I cannot figure out is how to move the Vue slider on the exchange page (to set the buying amount to 100%).
This is my code:
from selenium.webdriver import ActionChains
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep
import time
from selenium.webdriver.common.keys import Keys
import io
import subprocess
#proc = subprocess.Popen("./ChannelMessages.py", stdout=subprocess.PIPE)
chrome_options = Options()
#chrome_options.add_argument('--no-sandbox')
chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
driver = webdriver.Chrome(options=chrome_options)
executor_url = driver.command_executor._url
session_id = driver.session_id
print(session_id)
print(executor_url)
driver.get("https://www.hotbit.io/exchange?symbol=XRP_USDT")
time.sleep(10)
en = driver.find_element('xpath', '//*[@id="app"]/div[1]/div[2]/div/div/div/div[1]/div[1]/div[4]/ul/li/form[1]/section[1]/div[4]/div/div/div[1]')
move = ActionChains(driver)
move.click_and_hold(en).move_by_offset(50, 0).release().perform()
This is a slider code:
<div data-v-33e6e6c8="" class="percent-box"><div class="vue-slider vue-slider-ltr v-left-slider" style="padding: 7px 0px; width: auto; height: 4px;"><div class="vue-slider-rail"><div class="vue-slider-process" style="height: 100%; top: 0px; left: 0%; width: 0%; transition-property: width, left; transition-duration: 0.5s;"></div><div class="vue-slider-marks"><div class="vue-slider-mark vue-slider-mark-active" style="height: 100%; width: 4px; left: 0%;"><div class="vue-slider-mark-step vue-slider-mark-step-active"></div></div><div class="vue-slider-mark" style="height: 100%; width: 4px; left: 25%;"><div class="vue-slider-mark-step"></div></div><div class="vue-slider-mark" style="height: 100%; width: 4px; left: 50%;"><div class="vue-slider-mark-step"></div></div><div class="vue-slider-mark" style="height: 100%; width: 4px; left: 75%;"><div class="vue-slider-mark-step"></div></div><div class="vue-slider-mark" style="height: 100%; width: 4px; left: 100%;"><div class="vue-slider-mark-step"></div></div></div><div aria-valuetext="0%" class="vue-slider-dot vue-slider-dot-hover" role="slider" aria-valuenow="0" aria-valuemin="0" aria-valuemax="100" aria-orientation="horizontal" tabindex="0" style="width: 14px; height: 14px; transform: translate(-50%, -50%); top: 50%; left: 0%; transition: left 0.5s ease 0s;"><div class="vue-slider-dot-handle"></div><div class="vue-slider-dot-tooltip vue-slider-dot-tooltip-top"><div class="vue-slider-dot-tooltip-inner vue-slider-dot-tooltip-inner-top"><span class="vue-slider-dot-tooltip-text">0%</span></div></div></div></div></div></div>
Maybe I'm passing the wrong element to "driver.find_element". I tried different elements, though, and I'm not sure.
P.S. I tried to locate elements by "name" and by "xpath", and tried basically every level of the class hierarchy, but I still couldn't move the slider, or even select it.
Any help will be much appreciated!
P.P.S: Resolved!
Needed to add a line:
from selenium.webdriver.common.by import By
and to modify "driver.find_element" line:
en = driver.find_element(By.CLASS_NAME, 'v-left-slider')
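The fixed 50-pixel offset in the code above will not necessarily bring the slider to 100%. A minimal sketch of computing the drag distance from the rail width instead is shown here. The helper name drag_offset is hypothetical, and the commented Selenium usage assumes the class names visible in the slider HTML (vue-slider-rail, vue-slider-dot) and that the page responds to a plain drag; it has not been verified against the live site.

```python
def drag_offset(rail_width_px, current_percent, target_percent):
    """Horizontal distance in pixels needed to move the handle along the rail."""
    return round(rail_width_px * (target_percent - current_percent) / 100)


# Possible usage with Selenium (assumption: class names taken from the HTML above):
# rail = driver.find_element(By.CLASS_NAME, 'vue-slider-rail')
# dot = driver.find_element(By.CLASS_NAME, 'vue-slider-dot')
# current = float(dot.get_attribute('aria-valuenow'))
# offset = drag_offset(rail.size['width'], current, 100)
# ActionChains(driver).click_and_hold(dot).move_by_offset(offset, 0).release().perform()
```

Reading aria-valuenow afterwards is one way to check whether the slider actually reached the intended value.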
I'd like to extract the red characters from the HTML below using Selenium, so please give me some advice on how to do it.
The characters change randomly on every attempt.
<td>
<input type="text" name="WKey" id="As_wkey" value="" maxlength="10" class="inputType" style="width: 300px;" title="password" />
<span id="myspam" style="padding: 2px;">
<span style="font-size: 12pt; font-weight: bold; color: red;">H</span>123
<span style="font-size: 12pt; font-weight: bold; color: red;">R</span>
<span style="font-size: 12pt; font-weight: bold; color: red;">8</span>6789
</span>
(type red word.)
</td>
Here is my code:
red_characters_element = driver.find_element(By.ID, 'myspam')
red_characters_elements = red_characters_element.find_elements(by = By.CSS_SELECTOR, value="span[style='font-size: 12pt; font-weight: bold; color: red;']")
print(red_characters_elements)
Result: []
Since all the red characters are inside <span> tags, you can retrieve them by tag name.
red_characters_element = driver.find_element(By.ID, 'myspam')
red_characters_elements = red_characters_element.find_elements(By.TAG_NAME, 'span')
for red_character in red_characters_elements:
    print(red_character.text)
Results :
H
R
8
If you need only the red characters, you can try running JavaScript through Selenium:
driver.execute_script('return document.querySelectorAll("[style*=red]")')
This returns a list of the elements whose style attribute contains "red"; with a for loop you can read their text or anything else you need.
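As a rough sketch of that idea, the script can also join the text inside the browser and return a single string. The function name read_red_characters and the selector scoped to #myspam are assumptions based on the HTML in the question, not something the answer specifies.

```python
# JavaScript that collects the text of every red <span> inside #myspam.
RED_TEXT_JS = """
return Array.from(document.querySelectorAll('#myspam span[style*="red"]'))
    .map(function (el) { return el.textContent.trim(); })
    .join('');
"""


def read_red_characters(driver):
    """Return the red characters as one string, e.g. to type into the input."""
    return driver.execute_script(RED_TEXT_JS)


# Possible usage (assumes `driver` is an open Selenium WebDriver on that page):
# code = read_red_characters(driver)
# driver.find_element('id', 'As_wkey').send_keys(code)
```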
I have an HTML tag like the following:
print(tag)
<td style="background-color: #e5e5e5;">
<p style="
margin: 0;
font-size: 12px;
line-height: 16px;
font-family: Arial, sans-serif;
text-align: center;
">10</p>
</td>
I can update the value from 10 to 15 in the tag with BeautifulSoup:
tag.p.contents[0].replaceWith(str(15))
However, I haven't figured out a way to update the values in the style tags; because they seem to be a part of their parent, or 'base' tags.
For example, how would I update the tag to the following?
print(tag) -->
<td style="background-color: #762157;">
<p style="
margin: 0;
font-size: 12px;
line-height: 17px;
font-family: Arial, sans-serif;
text-align: center;
">10</p>
</td>
That is, I want to change the background-color to #762157 and the line-height to 17px.
In the context of the html you posted, style is not a tag, but an attribute of the p tag. Here is one way to modify that p attribute (and you can apply the same for td style attribute):
from bs4 import BeautifulSoup as bs
html = '''
<td style="background-color: #e5e5e5;">
<p style="
margin: 0;
font-size: 12px;
line-height: 16px;
font-family: Arial, sans-serif;
text-align: center;
">10</p>
</td>
'''
soup = bs(html, 'html.parser')
print('OLD SOUP')
print(soup.prettify())
print('______________')
p_style_attribute = soup.select_one('p').get('style')
new_p_style_attr = '''
margin: 3;
font-size: 17px;
line-height: 26px;
font-family: Arial, sans-serif;
text-align: center;
'''
soup.select_one('p')['style'] = new_p_style_attr
print('NEW SOUP')
print(soup.prettify())
This will print out in terminal:
OLD SOUP
<td style="background-color: #e5e5e5;">
<p style="
margin: 0;
font-size: 12px;
line-height: 16px;
font-family: Arial, sans-serif;
text-align: center;
">
10
</p>
</td>
______________
NEW SOUP
<td style="background-color: #e5e5e5;">
<p style="
margin: 3;
font-size: 17px;
line-height: 26px;
font-family: Arial, sans-serif;
text-align: center;
">
10
</p>
</td>
Here is the documentation for BeautifulSoup:
https://beautiful-soup-4.readthedocs.io/en/latest/index.html
A regex approach: collect the new style values in a dictionary of (tag_name, declaration) pairs and pass it to the update_style function, which modifies the soup in place.
The substitution is insensitive to surrounding whitespace and to a trailing ;. Passing the pairs as **subs keyword arguments would also be possible.
from bs4 import BeautifulSoup as bs
import re
def update_style(soup, subs):
    for tag_name, attr in subs.items():
        tag = soup.find(tag_name)
        k, v = attr.split(':')
        tag['style'] = re.sub(rf'{k}:(.+);', f'{k}: {v.strip(" ;")};', tag['style'])
html = '''
<td style="background-color: #e5e5e5;">
<p style="
margin: 0;
font-size: 12px;
line-height: 16px;
font-family: Arial, sans-serif;
text-align: center;
">10</p>
</td>
'''
soup = bs(html, 'lxml')
subs = {'td': 'background-color: #762157;', 'p': 'line-height: 17px;'}
update_style(soup, subs)
print(soup.prettify())
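As an alternative to the regex, the style string could be split into declarations, updated like a dictionary, and joined again. This is only a sketch using the standard library: the helper name merge_style is hypothetical, it normalizes whitespace rather than preserving the original line breaks, and it assumes no property value contains a ";".

```python
def merge_style(style, **changes):
    """Return `style` with the given CSS properties replaced or added.

    Keyword names use underscores for hyphens (line_height -> line-height).
    """
    declarations = {}
    for part in style.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            declarations[name.strip()] = value.strip()
    for name, value in changes.items():
        declarations[name.replace("_", "-")] = value
    if not declarations:
        return ""
    return "; ".join(f"{k}: {v}" for k, v in declarations.items()) + ";"


# Possible usage with the soup from the examples above:
# soup.td['style'] = merge_style(soup.td['style'], background_color='#762157')
# soup.p['style'] = merge_style(soup.p['style'], line_height='17px')
```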
I'm trying to upload an image to https://www.alibaba.com/ through Selenium.
So I find the element which allows me to do that:
driver = webdriver.Chrome(r'C:\Users\migue\Desktop\WorkerBot\Drivers\chromedriver')
driver.maximize_window()
driver.get('https://www.alibaba.com/');
time.sleep(5)
#Open menu to upload image
wait = WebDriverWait(driver, 5)
x = True
while x:
    x = False
    try:
        search_camara = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'i.ui-searchbar-imgsearch-icon')))
        search_camara.click()
    except:
        x = True
        driver.refresh()
time.sleep(5)
searcher1 = driver.find_element_by_xpath('//*[@id="J_SC_header"]/header/div[2]/div[2]/div/div/form/div[2]/div[3]/div[1]/div/div')
print(searcher1.get_attribute('innerHTML'))
When I print out the innerHTML of searcher1 I get:
<div class="upload-btn-wrapper"><div class="upload-btn" style="z-index: 1;">Upload Image</div><div id="html5_1cu85jlnu116m14sle1omtch5s3_container" class="moxie-shim moxie-shim-html5" style="position: absolute; top: 14px; left: 183px; width: 109px; height: 28px; overflow: hidden; z-index: 0;"><input id="html5_1cu85jlnu116m14sle1omtch5s3" type="file" style="font-size: 999px; opacity: 0; position: absolute; top: 0px; left: 0px; width: 100%; height: 100%;" multiple="" accept="image/jpeg,image/png,image/bmp"></div>
Max 2MB per Image
That's the element I need to upload the image, but when I try any of the following:
Option 1
searcher1.find_element_by_class_name('.moxie-shim moxie-shim-html5')
Option 2
searcher1.find_element_by_class_name('upload-btn')
Option 3
searcher1.find_element_by_xpath('/div')
I get the following (for option 3, for example):
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"/div"}
What's the problem? I'm stuck :(
For a relative XPath, you need to put a . in front of it. Try:
searcher1.find_element_by_xpath('./div')
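Building on the relative XPath, a minimal sketch of the upload step itself might look like the following. It assumes the file input sits somewhere under the container, as in the innerHTML printed in the question, and that the site accepts a path sent directly to that input; the function name upload_image and the example path are hypothetical.

```python
def upload_image(container, image_path):
    """Send a local file path to the file input nested under `container`."""
    # The leading "." keeps the XPath relative to `container`;
    # "//" then searches all of its descendants, not only direct children.
    file_input = container.find_element("xpath", ".//input[@type='file']")
    # For <input type="file">, send_keys with a path selects that file.
    file_input.send_keys(image_path)
    return file_input


# Possible usage (assumes `searcher1` is the element located in the question):
# upload_image(searcher1, r'C:\path\to\image.png')
```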
I am automating our Web application using Python with Selenium Webdriver.
I log into the application and I want to click the Administration button.
When I run my code it cannot find the Administration button by my XPath. I have tried a few different ways.
If I enter //div[7]/div/div in Selenium IDE and click Find, it highlights the Administration button. I do not know why it won't find it when I run the code.
I would prefer to use CSS, as I understand it is faster than XPath.
I need some help please.
I get the following error:
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: {"method":"xpath","selector":"html/body/div[2]/div[2]/div/div[2]/div/div[2]/div/div[7]/div/div"}
I inspect the HTML element. The full HTML is as follows:
<html style="overflow: hidden;">
<head>
<body style="margin: 0px;">
<html style="overflow: hidden;">
<head>
<body style="margin: 0px;">
<iframe id="__gwt_historyFrame" style="position: absolute; width: 0; height: 0; border: 0;" tabindex="-1" src="javascript:''">
<html>
</iframe>
<noscript> <div style="width: 22em; position: absolute; left: 50%; margin-left: -11em; color: red; background-color: white; border: 1px solid red; padding: 4px; font-family: sans-serif;"> Your web browser must have JavaScript enabled in order for this application to display correctly.</div> </noscript>
<script src="spinner.js" type="text/javascript">
<script type="text/javascript">
<script src="ClearCore/ClearCore.nocache.js" type="text/javascript">
<script defer="defer">
<iframe id="ClearCore" src="javascript:''" style="position: absolute; width: 0px; height: 0px; border: medium none;" tabindex="-1">
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<script>
<script type="text/javascript">
<script type="text/javascript">
</head>
<body>
</html>
</iframe>
<div style="position: absolute; z-index: -32767; top: -20cm; width: 10cm; height: 10cm; visibility: hidden;" aria-hidden="true"> </div>
<div style="position: absolute; left: 0px; top: 0px; right: 0px; bottom: 0px;">
<div style="position: absolute; z-index: -32767; top: -20ex; width: 10em; height: 10ex; visibility: hidden;" aria-hidden="true"> </div>
<div style="position: absolute; overflow: hidden; left: 0px; top: 0px; right: 0px; bottom: 0px;">
<div style="position: absolute; left: 0px; top: 0px; right: 0px; bottom: 0px;">
<div style="position: absolute; z-index: -32767; top: -20ex; width: 10em; height: 10ex; visibility: hidden;" aria-hidden="true"> </div>
<div style="position: absolute; overflow: hidden; left: 1px; top: 1px; right: 1px; bottom: 1px;">
<div class="gwt-TabLayoutPanel" style="position: absolute; left: 0px; top: 0px; right: 0px; bottom: 0px;">
<div style="position: absolute; z-index: -32767; top: -20ex; width: 10em; height: 10ex; visibility: hidden;" aria-hidden="true"> </div>
<div style="position: absolute; overflow: hidden; left: 0px; top: 0px; right: 0px; height: 30px;">
<div class="gwt-TabLayoutPanelTabs" style="position: absolute; left: 0px; right: 0px; bottom: 0px; width: 16384px;">
<div class="gwt-TabLayoutPanelTab GEGQEWXCK gwt-TabLayoutPanelTab-selected" style="background-color: rgb(254, 255, 238);">
<div class="gwt-TabLayoutPanelTab GEGQEWXCK" style="background-color: rgb(254, 255, 238);">
<div class="gwt-TabLayoutPanelTab GEGQEWXCK" style="background-color: rgb(254, 255, 238);">
<div class="gwt-TabLayoutPanelTab GEGQEWXCK" style="background-color: rgb(254, 255, 238);">
<div class="gwt-TabLayoutPanelTab GEGQEWXCK" style="background-color: rgb(254, 255, 238);">
<div class="gwt-TabLayoutPanelTab GEGQEWXCK" style="background-color: rgb(254, 255, 238);">
<div class="gwt-TabLayoutPanelTab GEGQEWXCK" style="background-color: rgb(254, 255, 238);">
<div class="gwt-TabLayoutPanelTabInner">
<div class="gwt-HTML">Administration</div>
</div>
</div>
</div>
</div>
<div style="position: absolute; overflow: hidden; left: 0px; top: 30px; right: 0px; bottom: 0px;">
</div>
</div>
<div style="position: absolute; overflow: hidden; top: 1px; right: 1px; width: 30px; height: 25px;">
<div style="position: absolute; overflow: hidden; left: 0px; top: -25px; right: 0px; height: 25px;">
</div>
</div>
</div>
<div style="display: none;" aria-hidden="true"></div>
</body>
</html>
My code is as follows:
element.py
from selenium.webdriver.support.ui import WebDriverWait
class BasePageElement(object):
    def __set__(self, obj, value):
        driver = obj.driver
        WebDriverWait(driver, 100).until(
            lambda driver: driver.find_element_by_name(self.locator))
        driver.find_element_by_name(self.locator).send_keys(value)
    def __get__(self, obj, owner):
        driver = obj.driver
        WebDriverWait(driver, 100).until(
            lambda driver: driver.find_element_by_name(self.locator))
        element = driver.find_element_by_name(self.locator)
        return element.get_attribute("value")
locators.py
from selenium.webdriver.common.by import By
class MainPageLocators(object):
    Submit_button = (By.ID, 'submit')
    usernameTxtBox = (By.ID, 'unid')
    passwordTxtBox = (By.ID, 'pwid')
    submitButton = (By.ID, 'button')
    AdministrationButton = (By.CSS_SELECTOR, 'div.gwt-HTML.firepath-matching-node')
    AdministrationButtonXpath = (By.XPATH, '//html/body/div[2]/div[2]/div/div[2]/div/div[2]/div/div[7]/div/div')
    AdministrationButtonCSS = (By.CSS_SELECTOR, '/body/div[2]/div[2]/div/div[2]/div/div[2]/div/div[7]/div/div')
    AdministrationButtonXpath2 = (By.XPATH, 'html/body/div[2]/div[2]/div/div[2]/div/div[2]/div/div[7]/div/div/text()')
    AdministrationButtonXpath3 = (By.XPATH, '//div[7]/div/div')
    contentFrame = (By.ID, 'ClearCore')
Page.py
from element import BasePageElement
from locators import MainPageLocators
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
class SearchTextElement(BasePageElement):
    pass  # class body not shown in the post
class BasePage(object):
    def __init__(self, driver):
        self.driver = driver
class LoginPage(BasePage):
    search_text_element = SearchTextElement()
    def userLogin_valid(self):
        userName_textbox = self.driver.find_element(*MainPageLocators.usernameTxtBox)
        userName_textbox.clear()
        userName_textbox.send_keys("riaz.ladhani")
        password_textbox = self.driver.find_element(*MainPageLocators.passwordTxtBox)
        password_textbox.clear()
        password_textbox.send_keys("test123")
        submitButton = self.driver.find_element(*MainPageLocators.submitButton)
        submitButton.click()
        #mydriver.find_element_by_xpath(xpaths['usernameTxtBox']).clear()
    def clickAdministration_button(self):
        #administrationButton = self.driver.find_element(*MainPageLocators.AdministrationButton)
        content_frame = self.driver.find_element(*MainPageLocators.contentFrame)
        self.driver.switch_to.frame(content_frame)
        #self.driver.switch_to.frame(*MainPageLocators.contentFrame)
        #self.driver.Switch_to().Frame(*MainPageLocators.contentFrame)
        #administrationButtonCSS = self.driver.find_element(*MainPageLocators.AdministrationButtonCSS)
        #administrationButtonXpath= self.driver.find_element(*MainPageLocators.AdministrationButtonXpath)
        #administrationButtonXpath= self.driver.find_element(*MainPageLocators.AdministrationButton_CSS_regex)
        #administrationButtonCSS2 = self.driver.find_element(*MainPageLocators.AdministrationButtonCSS2)
        adminButton = self.driver.find_element(*MainPageLocators.AdministrationButtonXpath3)
        adminButton.click()
LoginPage_TestCase.py
import unittest
from selenium import webdriver
import page
class LoginPage_TestCase(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.get("http://my-pc.company.local:8080/clearcore")
    def test_login_valid_user(self):
        login_page = page.LoginPage(self.driver)
        login_page.userLogin_valid()
        login_page.clickAdministration_button()
    def tearDown(self):
        self.driver.close()
if __name__ == "__main__":
    unittest.main()
The "Administration" button is located inside the frame whose id is "ClearCore", not in the top-level page. That is why the element cannot be located when the code runs.
So before clicking that button you need to switch to that frame, using either:
1. driver.switch_to_window("windowName")
2. driver.switch_to_frame("frameName")
Once we are done with working on frames, we will have to come back to the parent frame which can be done using:
driver.switch_to_default_content()
I have finally managed to solve my issue. The dev said I had to wait for the page to finish loading completely. The page was still loading its JavaScript functions even though all the elements were already displayed on the screen.
I first tried time.sleep(30) and then clicked the button. It worked, but waiting 30 seconds every time is not efficient. I then used WebDriverWait, which is more efficient.
Here is the code I used:
WebDriverWait(mydriver, 10).until(lambda d: d.find_element_by_xpath("//div[. = 'Administration']")).click()
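To illustrate why an explicit wait beats a fixed sleep, here is a small standard-library sketch of the polling idea. It is not Selenium's implementation, and in real tests WebDriverWait should be used; the name wait_until and its parameters are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5, ignored=(Exception,)):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value as soon as it appears, so the caller waits only as
    long as needed. Raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except ignored:
            pass  # e.g. the element is not in the DOM yet; try again
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Possible usage (assumes `mydriver` is a Selenium WebDriver):
# button = wait_until(
#     lambda: mydriver.find_element('xpath', "//div[. = 'Administration']"))
# button.click()
```

Note that the condition returns the element and the click happens after the wait; a condition that ends in .click() returns None and would never be seen as satisfied.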
You have to use
driver.switch_to_frame("__gwt_historyFrame")
before your Administration button click code. This takes WebDriver into the frame; only then is WebDriver able to find the button inside the frame.
If you want to come out of the frame to navigate outside it, use
driver.switch_to_default_content()
("__gwt_historyFrame" is your frame name.)
To start off here's my current code in its entirety:
import urllib
from BeautifulSoup import BeautifulSoup
import sgmllib
import re
page = 'http://www.sec.gov/Archives/edgar/data/\
8177/000114036111018563/form10k.htm'
sock = urllib.urlopen(page)
raw = sock.read()
soup = BeautifulSoup(raw)
tablelist = soup.findAll('table')
class MyParser(sgmllib.SGMLParser):
    def parse(self, segment):
        self.feed(segment)
        self.close()
    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.descriptions = []
        self.inside_td_element = 0
        self.starting_description = 0
    def start_td(self, attributes):
        for name, value in attributes:
            if name == "valign":
                self.inside_td_element = 1
                self.starting_description = 1
            else:
                self.inside_td_element = 1
                self.starting_description = 1
    def end_td(self):
        self.inside_td_element = 0
    def handle_data(self, data):
        if self.inside_td_element:
            if self.starting_description:
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                self.descriptions[-1] += data
    def get_descriptions(self):
        return self.descriptions
counter = 0
trlist = []
dtablelist = []
while counter < len(tablelist):
    trsegment = tablelist[counter].findAll('tr')
    trlist.append(trsegment)
    strsegment = str(trsegment)
    myparser = MyParser()
    myparser.parse(strsegment)
    sub = myparser.get_descriptions()
    dtablelist.append(sub)
    counter = counter + 1
ex = []
dtablelist = [s for s in dtablelist if s != ex]
So what I want to accomplish is take all the tables from an html document, then reprint them onto an Excel spreadsheet. So when I create trlist the output looks like this:
print trlist[1]
[<tr>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT- SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline"> </font></td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Title of each class</font></div>
</td>
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Name of exchange</font></td>
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline"> </font></td>
</tr>, <tr>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"> </font></td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="DISPLAY: inline; FONT-WEIGHT: bold">Common Stock, par value</font> </font></div>
</td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="FONT-WEIGHT: bold"><font style="FONT-WEIGHT: bold">< <font style="FONT-WEIGHT: bold">NASDAQ Global Market</font></font></font></font></div>
</div>
</td>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"> </font></td>
</tr>,...
As you can see, each item in trlist is an individual row (<tr> . . . </tr>) of the table, which is what I want. But when I run each trlist item through my sgmllib parser to retrieve the contents between the tags, I get this output:
print dtablelist[1]
['\nTitle of each class\n', 'Name of exchange', '\nCommon Stock, par value\n', '\n\nNASDAQ Global Market\n\n', '\n$1.00 per share\n']
As you can see, the output is each cell's contents as its own individual string, instead of a list of the contents of each table row (<tr>). So essentially I want the output:
[['\nTitle of each class\n', 'Name of exchange'], ['\nCommon Stock, par value\n', '\n\nNASDAQ Global Market\n\n'], ['\n$1.00 per share\n']]
Is it because I have to turn trlist into a string before I parse it with MyParser? Does anyone know any way around this, allowing me to parse lists within lists (aka Inception shit)?
Using lxml.html:
>>> import lxml.html
>>> data = ["<tr><td>test</td><td>help</td></tr>", "<tr><td>data1</td><td>data2</td></tr>"]
>>> [lxml.html.fromstring(tr).xpath(".//text()") for tr in data]
[['test', 'help'], ['data1', 'data2']]
And here is some more complete code. It stores the text in a list containing a list of tables, and each table has a list of tr's, and each tr has a list of all the text.
import urllib
import lxml.html
data = urllib.urlopen('http://www.sec.gov/Archives/edgar/data/8177/000114036111018563/form10k.htm').read()
tree = lxml.html.fromstring(data)
tables = []
for tbl in tree.iterfind('.//table'):
    tele = []
    tables.append(tele)
    for tr in tbl.iterfind('.//tr'):
        text = [e.strip() for e in tr.xpath('.//text()') if len(e.strip()) > 0]
        tele.append(text)
print tables
Hope this helps, cheers!
If somebody is searching for a solution to the same problem but is using Python 3:
You don't have to use an external library to parse an HTML table, even on Python 3. There, the SGMLParser class was replaced by HTMLParser from html.parser. I've written code for a simple class derived from HTMLParser. It is here in a github repo. It simply remembers the current scope of a <td>, <tr> or <table> tag. The advantages over using etree are that it runs correctly on non-XML-compliant HTML and that it doesn't use external libraries.
You can use that class (here named HTMLTableParser) the following way:
import urllib.request
from html_table_parser import HTMLTableParser
target = 'http://www.twitter.com'
# get website content
req = urllib.request.Request(url=target)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')
# instantiate the parser and feed it
p = HTMLTableParser()
p.feed(xhtml)
print(p.tables)
The output is a list of 2D lists, one per table. It might look like this:
[[[' ', ' Anmelden ']],
[['Land', 'Code', 'Für Kunden von'],
['Vereinigte Staaten', '40404', '(beliebig)'],
['Kanada', '21212', '(beliebig)'],
...
['3424486444', 'Vodafone'],
[' Zeige SMS-Kurzwahlen für andere Länder ']]]
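The scope-tracking idea described above could be sketched with the standard library alone, roughly as follows. This is not the code from the linked repository: the class name SimpleTableParser is hypothetical, and the sketch ignores nested tables.

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Collect tables as lists of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []    # finished tables
        self._table = None  # rows of the table currently open
        self._row = None    # cells of the row currently open
        self._cell = None   # text fragments of the cell currently open

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None and self._table is not None:
            self._table.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()
        elif tag == "table" and self._table is not None:
            self._close_row()
            self.tables.append(self._table)
            self._table = None


parser = SimpleTableParser()
parser.feed("<table><tr><td>test</td><td>help</td></tr>"
            "<tr><td>data1</td><td> data2 </td></tr></table>")
print(parser.tables)  # [[['test', 'help'], ['data1', 'data2']]]
```

Because each row is closed into its own list, the result keeps the per-row grouping that the original question was asking for.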