Specific HTML block fetch and parse using Regular Expression (Python) - python

I am trying to parse an HTML file using Python without any external module. The reason is that I am triggering a Jenkins job and running into import issues with lxml and BeautifulSoup (I tried resolving them, and I think I am over-engineering somewhere just to get this done).
Input:
<tr class="test">
<td class="test">
BA
</td>
<td class="duration">
0.000s
</td>
<td class="zero number">0</td>
<td class="zero number">0</td>
<td class="zero number">0</td>
<td class="passRate">
N/A
</td>
</tr>
<tr class="test">
<td class="test">
Aa
</td>
<td class="duration">
0.000s
</td>
<td class="zero number">0</td>
<td class="zero number">0</td>
<td class="zero number">0</td>
<td class="passRate">
N/A
</td>
</tr>
<tr class="test">
<td class="test">
GG
</td>
<td class="duration">
0.390s
</td>
<td class="zero number">0</td>
<td class="zero number">0</td>
<td class="zero number">0</td>
<td class="passRate">
N/A
</td>
</tr>
<tr class="suite">
<td colspan="2" class="totalLabel">Total</td>
<td class="zero number">271</td>
<td class="zero number">0</td>
<td class="fail number">3</td>
<td class="zero number">4</td>
<td class="passRate suite">
98%
</td>
</tr>
Output:
I want to take the specific tr block with the class "suite" (see the end of the input), pull the values of all its td tags, and assign them to variables.
~~~~~~~~~~~~~~~~~~~~~~~~~~
Eg. The output will be:
271
0
3
4
98%
Finally I want to assign these values to the variables...so my final output will be:
A = 271
B = 0
C = 3
D = 4
E = 98%
(all variables in new lines)
~~~~~~~~~~~~~~~~~~~~~~~~~~
Here is what I tried with lxml:
from lxml.html import parse  # assumed import; the question does not show it
tree = parse(HTML_FILE)
tds = tree.xpath("//tr[@class='suite']//td/text()")
val = map(str.strip, tds)
This works locally, but I really want to do this without any external dependencies. Should I use strip(), or open the file after checking it with os.path.isfile()? I may be wrong, so please advise or walk me through a possible solution.
**The most difficult part I can think of: in the last tr block of my input, several of the td tags share class="zero number", so how do I handle that?
**The approach I can think of is to take out that block, remove all the tags so that only the content remains, and then assign the values line by line. However, I am not good at regular expressions.
This is not a duplicate of "Parse HTML file using Python without external module"; the input and the expected output are different.
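The approach described above (take out the suite block, strip the tags, assign the values) can be sketched with only the standard library's re module. This is a minimal sketch, not a general HTML parser: it assumes the row is written exactly as `<tr class="suite">` and that cells are not nested, and the variable names a to e are placeholders. In the Jenkins job the html string would come from reading the report file instead of the inline sample.

```python
import re

# Abbreviated copy of the report from the question (only the last two rows).
html = """
<tr class="test">
<td class="test">
GG
</td>
<td class="duration">
0.390s
</td>
<td class="zero number">0</td>
<td class="zero number">0</td>
<td class="zero number">0</td>
<td class="passRate">
N/A
</td>
</tr>
<tr class="suite">
<td colspan="2" class="totalLabel">Total</td>
<td class="zero number">271</td>
<td class="zero number">0</td>
<td class="fail number">3</td>
<td class="zero number">4</td>
<td class="passRate suite">
98%
</td>
</tr>
"""

# 1. Isolate the <tr class="suite"> ... </tr> block (non-greedy, spans newlines).
row = re.search(r'<tr\s+class="suite">(.*?)</tr>', html, re.DOTALL)
if row is None:
    raise ValueError('no <tr class="suite"> row found')

# 2. Take the text of every <td>, whatever its class, and trim whitespace.
cells = [cell.strip()
         for cell in re.findall(r'<td[^>]*>(.*?)</td>', row.group(1), re.DOTALL)]

# 3. Drop the "Total" label and unpack the remaining five values.
a, b, c, d, e = cells[1:]
for name, value in zip("ABCDE", (a, b, c, d, e)):
    print(f"{name} = {value}")
```

Because the td pattern ignores the class attribute, the repeated class="zero number" cells need no special handling; the values are identified purely by their position in the row.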

Related

How to select <tr> tags inside of a div with a specific css attribute via beautifulsoup?

I want to scrape several columns of text contained in td tags with a common CSS attribute, inside tr tags with a common CSS attribute, inside a table with a specific class, inside a div.
For example, this is exactly how the website is structured:
<div class="stats-table">
<table class="stats_table">
<tbody>
<tr data-row="0">
<td data-stat="games">38</td>
<td data-stat="wins">29</td>
<td data-stat="draws">6</td>
<td data-stat="losses">3</td>
<td data-stat="points">93</td>
</tr>
<tr data-row="1">
<td data-stat="games">38</td>
<td data-stat="wins">28</td>
<td data-stat="draws">8</td>
<td data-stat="losses">2</td>
<td data-stat="points">92</td>
</tr>
.
.
.
<tr data-row="19">
<td data-stat="games">38</td>
<td data-stat="wins">5</td>
<td data-stat="draws">7</td>
<td data-stat="losses">26</td>
<td data-stat="points">22</td>
</tr>
</tbody>
</table>
</div>
I want to get the text enclosed in the td tags.
I have tried solving this problem with the code below:
import requests
from bs4 import BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
data = soup.select(".stats_table")
all_data = [l.get_text(strip=True) for l in soup.select(".stats_table:has(> [data-row])")]
print(all_data)
But when I try to execute this code, I get an empty list. I need your help on this matter, thanks.
Why did your solution not work?
The > combinator matches only direct children of the element on its left. In your case the element carrying data-row is a tr whose parent is tbody, not the element with class .stats_table, so nothing matches. If you name the actual parent (tbody) in the selector, it works as expected; the tr in the selector below is optional.
Also, :has() selects the element it is attached to, not the nested one: .stats_table:has(> [data-row]) matches an element with class .stats_table that directly contains an element with a data-row attribute.
soup.select(".stats_table tbody:has(> tr[data-row])")
But this won't give you the expected output either, because it returns the tbody element rather than the individual cells. To get the expected output, follow the steps below.
Solution
I see that you specifically want the text of every td inside the rows that carry a data-row attribute, within the table with class stats_table.
There are two ways to do this.
Using regex
import re
from bs4 import BeautifulSoup
html = '''
<div class="stats-table">
<table class="stats_table">
<tbody>
<tr data-row="0">
<td data-stat="games">38</td>
<td data-stat="wins">29</td>
<td data-stat="draws">6</td>
<td data-stat="losses">3</td>
<td data-stat="points">93</td>
</tr>
<tr data-row="1">
<td data-stat="games">38</td>
<td data-stat="wins">28</td>
<td data-stat="draws">8</td>
<td data-stat="losses">2</td>
<td data-stat="points">92</td>
</tr>
<tr data-row="19">
<td data-stat="games">38</td>
<td data-stat="wins">5</td>
<td data-stat="draws">7</td>
<td data-stat="losses">26</td>
<td data-stat="points">22</td>
</tr>
</tbody>
</table>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
datastats = soup.find_all("td", {"data-stat" : re.compile(r".*")})
for stat in datastats:
    print(stat.text)
which prints the expected values, one per line (shown here as a list):
[38,29,6,3,93,38,28,8,2,92,38,5,7,26,22]
Using CSS Selector
The selector below selects every element that has a data-stat attribute inside the table with class stats_table. Prefixing [data-stat] with td is optional, as in "table.stats_table td[data-stat]".
datastats = soup.select("table.stats_table [data-stat]")
for stat in datastats:
    print(stat.text)
which gives us the same output
[38,29,6,3,93,38,28,8,2,92,38,5,7,26,22]
You can find more information on CSS_SELECTOR here

How to get the text of an element with id with Selenium

Consider:
<tr id="pair_12">
<td class="left first">
<span class="ceFlags USD"> </span>
USD
</td>
<td class="" id="last_12_12">1</td>
<td class="pid-2124-last" id="last_12_17">0,8979</td>
<td class="pid-2126-last" id="last_12_3">0,7695</td>
<td class="pid-3-last" id="last_12_2">109,94</td>
<td class="pid-4-last" id="last_12_4">0,9708</td>
<td class="pid-7-last" id="last_12_15">1,3060</td>
<td class="pid-2091-last greenBg" id="last_12_1">1,4481</td>
<td class="pid-18-last greenBg" id="last_12_9">5,8637</td>
</tr>
I want to access, for example, the "5,8637" value, which also refreshes every second or so. Here is the website, in case it helps: link.
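from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()  # assumes chromedriver can be located automatically
driver.get("https://tr.investing.com/currencies/exchange-rates-table")
# Selenium 4 removed find_element_by_id; find_element(By.ID, ...) replaces it
eur_usd = driver.find_element(By.ID, "last_17_12").text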
worked for me!
In Java, build the locator and call getText() on the element it finds:
By id = By.id("ANY_ID");
String text = driver.findElement(id).getText();

How do I loop over this outerHTML code to get out certain data? (I don't know how to webscrape it so I want to try this)

I am trying to get a list that matches India's districts to their district codes as they were during the 2011 population census. Below is a small subset of the outerHTML I copied from a government website. I am trying to loop over it, extract a string and an int from each HTML block, and ideally store them on the same row of a pandas DataFrame. The HTML blocks look like this; I show two of them here, and there are around 700 in my txt file:
<tr>
<td width="5%">1</td>
<td>603</td>
<td align="left">**NICOBARS**</td>
<td align="left">NICOBARS </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NIC</td>
<td align="left">02</td>
<td align="left">**638**</td>
<td align="left">
Not Covered
</td>
<td width="5%" align="center"><i class="fa fa-eye" aria-hidden="true"></i>
</td>
<td width="5%" align="center"><i class="fa fa-history" aria-hidden="true"></i>
</td>
<td width="5%" align="center">
</td>
<td width="3%" align="center">
<!-- Merging issue revert beck 05/10/2017 -->
<i class="fa fa-map-marker" aria-hidden="true"></i>
</td>
</tr>
<tr>
<td width="5%">2</td>
<td>632</td>
<td align="left">**NORTH AND MIDDLE ANDAMAN**</td>
<td align="left">NORTH AND MIDDLE ANDAMAN </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NMA</td>
<td align="left"></td>
<td align="left">**639**</td>
<td align="left">
Not Covered
I have put ** around the values that I want to get from the text file. I was wondering how I could loop through this text to extract the data. I thought about starting a count at a marker in each block and then extracting the 1st and 6th values after it, but I don't know how to code this. I hope someone is willing to help out. Or if anyone already has this list, that would be great!
If you're able to get the text of the entire HTML table, you can use dfs = pd.read_html(html_text_string), which returns a list of DataFrames, one per table found. 50% of the time, it works every time!
pd.read_html <-- docs
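If pd.read_html is not an option (it looks for complete table elements and relies on an HTML parser library being installed), the loop the question describes can be sketched with the standard library's html.parser. This is only a sketch: the class name RowCollector is made up for illustration, the sample is the question's two rows with the ** markers removed and the trailing icon cells omitted, and it assumes the district name and code are always the 3rd and 8th cells of a row.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Empty cells are kept as "" so later cells keep their position.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


sample = """
<tr>
<td width="5%">1</td>
<td>603</td>
<td align="left">NICOBARS</td>
<td align="left">NICOBARS </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NIC</td>
<td align="left">02</td>
<td align="left">638</td>
<td align="left">
Not Covered
</td>
</tr>
<tr>
<td width="5%">2</td>
<td>632</td>
<td align="left">NORTH AND MIDDLE ANDAMAN</td>
<td align="left">NORTH AND MIDDLE ANDAMAN </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NMA</td>
<td align="left"></td>
<td align="left">639</td>
<td align="left">
Not Covered
</td>
</tr>
"""

parser = RowCollector()
parser.feed(sample)
parser.close()

# Cells 3 and 8 (indexes 2 and 7) hold the district name and the census code.
districts = [(row[2], int(row[7])) for row in parser.rows if len(row) > 7]
print(districts)
```

The resulting list of tuples can be passed to pandas.DataFrame together with a columns argument (for example columns=["district", "code"], names chosen here for illustration) to get one row per district.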

How to locate web element (Xpath: multiple conditions)

I am trying to locate a specific element with XPath (for a script that I am writing in Python with the Selenium module). I've looked it up on the internet but I can't find a solution to my problem, which is:
There is a table that consists of many 'tr' elements. Each 'tr' consists of a couple of 'td' elements that share the same class (namely "Zelle"). I'd like to find a 'tr' that meets the following conditions: it has to contain the text "auto" in one td, it has to contain the text "Abge" in ANOTHER td, and finally it has to contain an "a" element of class "Tabelle".
I wrote something like this:
("//tr//td[@class='Zelle'][contains(.,'auto')]//following::td[@class='Zelle'][contains(.,'Abge')]//following::a[@class='Tabelle']")
But when I try it in the developer console I get all tr's with a td that contains "auto" (even if the second td doesn't contain "Abge"). How do I write a statement that returns ONLY tr's with BOTH "auto" and "Abge"?
Sample HTML:
<tr>
<td class="Zelle">auto</td>
<td class="Zelle">Abge</td>
<td class="Zelle">Some characters</td>
<td class="Zelle">
<a class="Tabelle" href="blablahblah.aspx?xxxx=Number"></a></td>
</tr>
<tr>
<td class="Zelle">auto</td>
<td class="Zelle">Abge</td>
<td class="Zelle">Some characters</td>
<td class="Zelle">
<a class="Tabelle" href="blablahblah.aspx?xxxx=Number"></a></td>
</tr>
<tr>
<td class="Zelle">auto</td>
<td class="Zelle">Some text(but no "Abge")</td>
<td class="Zelle">Some characters</td>
<td class="Zelle">
<a class="Tabelle" href="blablahblah.aspx?xxxx=Number"></a></td>
</tr>
<tr>
<td class="Zelle">some text (but not "auto")</td>
<td class="Zelle">Abge</td>
<td class="Zelle">Some characters</td>
<td class="Zelle">
<a class="Tabelle" href="blablahblah.aspx?xxxx=Number"></a></td>
</tr>
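Try the XPath below. It puts all three conditions into predicates on the same tr, whereas the following:: axis matches any later node in the document, including cells in other rows:
//tr[td[contains(., "auto")] and td[contains(., "Abge")] and .//a[@class="Tabelle"]]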
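To see why per-row predicates behave differently from following::, the same three conditions can be spelled out in plain Python. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a small XPath subset (no contains()), so the text checks are done in Python; it is an illustration of the logic, not a replacement for the XPath in Selenium. The rows are shortened versions of the question's sample and the href values are made up so the rows can be told apart.

```python
import xml.etree.ElementTree as ET

# Shortened sample rows wrapped in <table> so the snippet is well-formed XML.
table = ET.fromstring("""
<table>
<tr><td class="Zelle">auto</td><td class="Zelle">Abge</td>
    <td class="Zelle"><a class="Tabelle" href="row1.aspx"></a></td></tr>
<tr><td class="Zelle">auto</td><td class="Zelle">Some text</td>
    <td class="Zelle"><a class="Tabelle" href="row2.aspx"></a></td></tr>
<tr><td class="Zelle">other</td><td class="Zelle">Abge</td>
    <td class="Zelle"><a class="Tabelle" href="row3.aspx"></a></td></tr>
</table>
""")


def cell_contains(tr, needle):
    """Mirror td[contains(., needle)]: some direct <td> child holds the text."""
    return any(needle in "".join(td.itertext()) for td in tr.findall("td"))


# Every condition is evaluated against the same <tr>, as in the predicate form.
matches = [
    tr for tr in table.findall("tr")
    if cell_contains(tr, "auto")
    and cell_contains(tr, "Abge")
    and tr.find(".//a[@class='Tabelle']") is not None
]

hrefs = [tr.find(".//a[@class='Tabelle']").get("href") for tr in matches]
print(hrefs)
```

Only the first row satisfies all three checks, because none of them is allowed to look outside the row being tested.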

how to remove table row based on contents of one of the cells in the row using python?

I have an HTML document with a table like:
<tr>
<td width="3%"><input type="checkbox", name="chk"></td>
<td width="10%">101</td>
<td width="4%">Fix</td>
<td width="5%">2.00</td>
<td width="6%">09:28:03</td>
<td width="5%">5</td>
<td width="9%">6026866.421</td>
<td width="9%">6525118.804</td>
<td width="5%">149.124</td>
<td width="8%">3533692.676</td>
<td width="8%">1174580.462</td>
<td width="8%">5161083.095</td>
<td width="5%">0.009</td>
<td width="5%">0.016</td>
<td width="5%">2.14</td>
<td width="7%">07/09</td></tr>
<br>
<tr>
<td width="3%"><input type="checkbox", name="chk"></td>
<td width="10%">101</td>
<td width="4%">Fix</td>
<td width="5%">0.00</td>
<td width="6%">09:28:03</td>
<td width="5%">5</td>
<td width="9%">6026866.421</td>
<td width="9%">6525118.804</td>
<td width="5%">149.124</td>
<td width="8%">3533692.676</td>
<td width="8%">1174580.462</td>
<td width="8%">5161083.095</td>
<td width="5%">0.009</td>
<td width="5%">0.016</td>
<td width="5%">2.14</td>
<td width="7%">07/09</td></tr>
and so on...
I need to remove the rows where the fourth cell content is '0.00' and keep only those with '2.00'; or maybe it would be easier to remove only the even rows.
What is the simplest way to achieve this using Python?
Using Beautiful Soup (this is just a start, there's much to improve, like how to check for zero and you also have to make up your mind if you want to check the third or the fourth cell):
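from bs4 import BeautifulSoup
with open('yourhtml.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')
for tr in soup('tr'):
    cells = tr('td')
    # index 3 is the fourth cell; skip rows that have fewer cells
    if len(cells) > 3 and cells[3].text.strip() == '0.00':
        tr.extract()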
You might want to look at Beautiful Soup, a python parser for HTML and XML.
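Where Beautiful Soup is not available, a rougher alternative is to filter the rows with the standard library's re module. This is a sketch under strong assumptions: rows are not nested, every row is closed with </tr>, and the value to test is always the fourth td. The sample is the question's table trimmed to four cells per row, and the helper name keep_row is made up for illustration.

```python
import re

html = """<table>
<tr>
<td width="3%"><input type="checkbox" name="chk"></td>
<td width="10%">101</td>
<td width="4%">Fix</td>
<td width="5%">2.00</td>
</tr>
<tr>
<td width="3%"><input type="checkbox" name="chk"></td>
<td width="10%">101</td>
<td width="4%">Fix</td>
<td width="5%">0.00</td>
</tr>
</table>"""

# Captures the inner content of each <td>, whatever attributes it carries.
TD = re.compile(r"<td[^>]*>(.*?)</td>", re.DOTALL)


def keep_row(match):
    """Return the row unchanged, or '' when its fourth cell is 0.00."""
    cells = TD.findall(match.group(0))
    if len(cells) > 3 and cells[3].strip() == "0.00":
        return ""
    return match.group(0)


# Visit every <tr>...</tr> block (plus trailing whitespace) and filter it.
cleaned = re.sub(r"<tr\b[^>]*>.*?</tr>\s*", keep_row, html, flags=re.DOTALL)
print(cleaned)
```

A regular expression is fragile against markup it was not written for, so a parser-based approach like the Beautiful Soup loop above is the safer choice when the dependency is acceptable.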
