How to Extract nth line String after Pattern match in Python?

I have a text file with the content below. All I need is to extract 29565618, which appears just before a specific string match (highlighted in bold below):
<div title="Available on both MOS and OTN">Oracle JDK 8 Update 212 <strong>(public)</strong></div>
Note: The href tag is on the 2nd line above this pattern match in the input text file.
Input Text File:
<tr>
<td class="km">29565618</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle **JDK** 8 Update 212 <strong>(**public**)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206839</td>
<td class="km">Oracle JRE 8 Update 211 Enterprise Installer</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206838</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle SERVER JRE 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206859</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle Java SE Embedded 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
Expected Output:
29565618
My Code:
with open('file.txt') as f:
    my_list = list(f)
try:
    if my_list.index('JDK') > 0 and my_list.index('public') > 0:
        print(string[4:-4])
except:
    pass

You can do it with Beautiful Soup like this:
from bs4 import BeautifulSoup
html_doc = """
<tr>
<td class="km">29565618</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle **JDK** 8 Update 212 <strong>(**public**)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206839</td>
<td class="km">Oracle JRE 8 Update 211 Enterprise Installer</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206838</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle SERVER JRE 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206859</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle Java SE Embedded 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>"""
soup = BeautifulSoup(html_doc, 'html.parser')
trs = soup.find_all('tr')
for tr in trs:
    if tr.div:
        div_text = tr.div.get_text()
        if "JDK" in div_text and "public" in div_text:
            for td in tr.find_all('td'):
                td_text = td.get_text()
                if td_text.isdigit():
                    print(td_text)
Output:
29565618

If data is your HTML snippet from the question, this script:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for a in soup.select('td.km:has(~ td.km) > a'):
    if re.findall(r' JDK.*?\(public\)', a.find_next('td', class_='km').text):
        print(a.text)
prints:
29565618

soup = BeautifulSoup(html_doc, 'html.parser')
match = soup.find(text=lambda t: "JDK" in t)
if match and 'public' in match.parent.text:
    print(match.find_previous('a').text)
Thanks to @Andrej Kesely.

You can use:
(?=<a.*?>(.*)</a>)
Check here, it uses your data to confirm the match: https://regex101.com/r/W2wV2I/1/
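
If you want to stay with regular expressions only, here is a minimal sketch using the standard `re` module. It assumes the rows look like the ones shown in the question, with the number inside a `<td class="km">` cell; the sample text is shortened and the bold markers are left out. Splitting the text into `<tr>` blocks first keeps a match from running across two rows. The variable names are only for illustration.

```python
import re

# Shortened sample based on the rows shown in the question (bold markers omitted).
text = '''<tr>
<td class="km">29565618</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle JDK 8 Update 212 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206838</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle SERVER JRE 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>'''

ids = []
# Work row by row so a match can never span two <tr> blocks.
for row in re.findall(r"<tr>.*?</tr>", text, flags=re.DOTALL):
    # Keep only rows that mention JDK followed later by "(public)".
    if re.search(r"\bJDK\b.*\(public\)", row, flags=re.DOTALL):
        # The first all-digit "km" cell in the row is taken as the wanted value.
        m = re.search(r'<td class="km">(\d+)</td>', row)
        if m:
            ids.append(m.group(1))

print(ids)  # ['29565618']
```

For anything beyond a fixed, predictable layout, an HTML parser such as the Beautiful Soup answers above is likely to be more robust than a regex.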

What about this
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = ''' <tr>
<td class="km">29565618</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle **JDK** 8 Update 212 <strong>(**public**)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206839</td>
<td class="km">Oracle JRE 8 Update 211 Enterprise Installer</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206838</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle SERVER JRE 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206859</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle Java SE Embedded 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>'''
doc = SimplifiedDoc(html)
trs = doc.trs.contains(['JDK','public'])
for tr in trs:
    print(tr.a.text) # 29565618

Related

Returning None when scraping href using Python

Hi, I'm trying to scrape the text "151 Heavy Duty Rubber Gloves - Ex Large" from a table with the following markup (copied from the browser inspector). Can someone please help with the right Python script?
[<table border="0" class="ProductBox" id="Added0">
<tr>
<td align="center" colspan="2">
<div style="width:100%;float:left;display:inline;float:left;height:37px;"><div style="float:left;font-size:16px;font-family: 'Roboto Condensed', sans-serif;color:white;margin-top:4%;margin-left:6%;"> </div></div>
</td></tr><tr>
<td align="center" colspan="2" height="60px;" valign="top">
<div class="PromoPriceText"> <br/><br/></div><div class="StdPrice">£0.69</div><div class="UnitCost">(£0.69/Unit)</div>
</td>
</tr>
<tr>
<td align="center" colspan="2" height="185">
<a href="/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211" style=" line-height: 20px; padding-left: 0px;">
<img alt="" class="effectfront" id="prod" src="/~uldir/104373t.jpg" style="height:165px !important;"/></a>
</td>
</tr>
<tr>
<td class="ProdDetails" style="padding-left:10px;padding-right:10px;margin-bottom:5px;"><input name="product_code" type="hidden" value="104373"/>104373</td>
<td align="right" class="ProdDetails" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
</td>
</tr>
<tr>
<td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
POR 0%
</td>
<td align="right" class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
VAT 20%
</td>
</tr>
<tr>
<td class="ProdDetails" colspan="2" style="padding-left:10px;padding-right:10px;margin-bottom:5px;height:50px;">
<a href="/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211" style=" line-height: 20px; padding-left: 0px;">
**151 Heavy Duty Rubber Gloves - Ex Large**</a></td>
</tr>
<tr>
<td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
1s x 1
</td>
<td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;float:right;width:98%;text-align:right;">
<div class="tooltip">
<div class="IconWishNS" id="IconWishNS104373" onclick="AddToWish('104373','A')" style="display:inline-block;">
<span class="tooltiptext tooltip-bottom" style="font-size:12px;">Add to Wish List</span></div>
</div>
<span class="OKStatus">In Stock </span>
</td>
</tr>
<tr>
<td colspan="2" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
<table style="margin-top : 10px;" width="100%">
<tr>
<td>
<img align="middle" alt="Take 1 Off Qty" src="/images/minus.png"/>
</td>
<td>
<input class="iQtyBox" id="104373_qty" maxlength="4" name="104373_qty" oninput="this.value=(parseInt(this.value)||'')" tabindex="1" type="text" value="1"/>
</td>
<td>
<img align="middle" alt="Add 1 To Qty" src="/images/add.png"/>
</td>
<td align="right">
<button class="subBlackButtonDiv subButtonDiv" style="width:70px;margin:0px;" type="button" value="add">Add</button>
</td>
</tr>
</table>
I tried to use the following:
r = s.get(url)
soup = BeautifulSoup(r.text, 'lxml')
table = soup.find_all('table')
for i in table:
    links = [link.get('href') for link in i.find_all('a')]
    print(links)
which unfortunately returns: ['/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211', '/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211', '#', '#', '#']

You can use the td.ProdDetails a selector (an a tag inside a td with the class ProdDetails) to target the text you are interested in, then call .strip() a few times to remove the extra characters:
DATA = """<table border="0" class="ProductBox" id="Added0">
<tr>
...
</table>"""
from bs4 import BeautifulSoup
from typing import Optional
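def extract_name(data: str) -> Optional[str]:
    # Parse the markup and look for links inside the product-details cells
    soup = BeautifulSoup(data, "html.parser")
    links = soup.select("td.ProdDetails a")
    if len(links) >= 1:
        # Trim whitespace, then the surrounding asterisks, then whitespace again
        return links[0].text.strip().strip("*").strip()
    else:
        return None
print(extract_name(DATA))
# like above
r = s.get(url)
soup = BeautifulSoup(r.text, 'lxml')
tables = soup.find_all('table')
# extract_name expects a string, so convert the Tag back to markup first
text = extract_name(str(tables[0]))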
Output: 151 Heavy Duty Rubber Gloves - Ex Large

Getting value from page with BeautifulSoup

I have the following page structure:
<tr class="small data-row" bgcolor="#f9f9f9">.</tr>
<td class="stats1" align="right">0</td>
<td class="stats1" align="right">0</td>
<td class="stats1" align="right">0</td>
<td class="stats1 stats-dash" align="right">-</td>
.
.
.
<tr class="small data-row" bgcolor="#ffffff">.</tr>
<tr class="small data-row" bgcolor="#f9f9f9">.</tr>
<tr class="small" bgcolor="#eff6ef">.</tr>
<td class="stats1" align="right">215</td>
<td class="stats1" align="right">183</td>
<td class="stats1" align="right">0</td>
<td class="stats1 stats-dash" align="right">-</td>
</tr>
I would like to get the second value (183), but I am not sure how to do it. I tried it this way:
content = driver.page_source
soup = BeautifulSoup(content)
for elm in soup.select(".stats1"):
    val=elm.get("align")
and the output is:
right
<td align="right" class="stats1">215</td>
If I got 183 instead of 215 I could use .split, but in this case I only get the first value.

.select() returns a list of elements. Just pick the element you want by index:
from bs4 import BeautifulSoup
html = '''<tr class="small data-row" bgcolor="#f9f9f9">.</tr>
<tr class="small" bgcolor="#ffffff">.</tr>
<td class="stats1" align="right">215</td>
<td class="stats1" align="right">183</td>
<td class="stats1" align="right">0</td>
<td class="stats1 stats-dash" align="right">-</td>
</tr>'''
soup = BeautifulSoup(html, 'html.parser')
elm = soup.select(".stats1")[1]
print(elm.text)
Output:
183

how do we select the child element tbody after extracting the entire html?

I'm still a Python noob trying to learn BeautifulSoup. I looked at solutions on Stack Overflow but was unsuccessful. Please help me understand this better.
I have extracted the HTML shown below:
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
I tried to use find_all('tbody') but was unsuccessful:
#table = bs.find("table", {"id": "ContentPlaceHolder1_dlDetails"})
html = browser.page_source
soup = bs(html, "lxml")
table = soup.find_all('table', {'id':'ContentPlaceHolder1_dlDetails'})
table_body = table.find('tbody')
rows = table.select('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
I'm trying to save the values in the "listmaintext" class.
Error message:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

Another way to do this is with next_sibling:
from bs4 import BeautifulSoup as bs
html ='''
<html>
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
</html>'''
soup = bs(html, 'lxml')
data = [
    ' '.join((item.text, item.next_sibling.next_sibling.text))
    for item in soup.select('#ContentPlaceHolder1_dlDetails tr .listmaintext:first-child')
    if item.text != ''
]
print(data)

from bs4 import BeautifulSoup
data = '''<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>'''
soup = BeautifulSoup(data, 'lxml')
s = soup.select('.listmaintext')
for td1, td2 in zip(s[::2], s[1::2]):
    print('{} [{}]'.format(td1.text.strip(), td2.text.strip()))
Prints:
ATM ID: [DAGR00401111111]
ATM Centre: []
Site Location: [ADA Building - Agra]
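
As for the AttributeError in the question: find_all() returns a ResultSet (a list of tags), which has no find() method, while find() returns a single Tag that further searches can be chained on. A minimal sketch of the difference, assuming a simplified, non-nested version of the table from the question (the trimmed markup and variable names are mine):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page: one table, no nested table.
html = '''<table id="ContentPlaceHolder1_dlDetails">
<tr><td class="listmaintext">ATM ID: </td><td class="listmaintext">DAGR00401111111</td></tr>
<tr><td class="listmaintext">Site Location: </td><td class="listmaintext">ADA Building - Agra</td></tr>
</table>'''

soup = BeautifulSoup(html, "html.parser")

# find() returns one Tag (or None), so .find_all() can be called on the result.
table = soup.find("table", {"id": "ContentPlaceHolder1_dlDetails"})

rows = []
for tr in table.find_all("tr"):
    # Collect the stripped text of every "listmaintext" cell in this row.
    cells = [td.get_text(strip=True) for td in tr.find_all("td", class_="listmaintext")]
    if cells:
        rows.append(cells)

print(rows)
# [['ATM ID:', 'DAGR00401111111'], ['Site Location:', 'ADA Building - Agra']]
```

With the real page, where one table is nested inside another, the outer row would also match the inner cells, so the selector-based answers above are probably the safer choice there.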

Web Scraping tables from an HTML file

Hello all, I am hoping to get some help with taking the tables in my HTML file and importing them into a CSV file. I am very new to web scraping, so forgive me if I am completely wrong with my code. The HTML file holds three separate tables I am trying to extract: estimate, sampling error, and number of non-zero plots in estimate.
My code is shown below:
#import necessary libraries
import urllib2
import pandas as pd
#specify URL
table = "file:///C:/Users/TMccw/Anaconda2/FiaAPI/outFArea18.html"
#Query the website & return the html to the variable 'page'
page = urllib2.urlopen(table)
#import the bs4 functions to parse the data returned from the website
from bs4 import BeautifulSoup
#Parse the html in the 'page' variable & store it in bs4 format
soup = BeautifulSoup(page, 'html.parser')
#Print out the html code with the function prettify
print soup.prettify()
#Find the tables & check type
table2 = soup.find_all('table')
print(table2)
print type(table2)
#Create new table as a dataframe
new_table = pd.DataFrame(columns=range(0,4))
#Extract the info from the HTML code
soup.find('table').find_all('td'),{'align':'right'}
#Remove the tags and extract table info into CSV
???
Here is the HTML for the first table "Estimate":
` Estimate:
</b>
</caption>
<tr>
<td>
</td>
<td align="center" colspan="5">
<b>
Ownership group
</b>
</td>
</tr>
<tr>
<th>
<b>
Forest type group
</b>
</th>
<td>
<b>
Total
</b>
</td>
<td>
<b>
National Forest
</b>
</td>
<td>
<b>
Other federal
</b>
</td>
<td>
<b>
State and local
</b>
</td>
<td>
<b>
Private
</b>
</td>
</tr>
<tr>
<td nowrap="">
<b>
Total
</b>
</td>
<td align="right">
4,875,993
</td>
<td align="right">
195,438
</td>
<td align="right">
169,500
</td>
<td align="right">
392,030
</td>
<td align="right">
4,119,025
</td>
</tr>
<tr>
<td nowrap="">
<b>
White / red / jack pine group
</b>
</td>
<td align="right">
40,492
</td>
<td align="right">
3,426
</td>
<td align="right">
-
</td>
<td align="right">
10,850
</td>
<td align="right">
26,217
</td>
</tr>
<tr>
<td nowrap="">
<b>
Loblolly / shortleaf pine group
</b>
</td>
<td align="right">
38,267
</td>
<td align="right">
11,262
</td>
<td align="right">
997
</td>
<td align="right">
4,015
</td>
<td align="right">
21,993
</td>
</tr>
<tr>
<td nowrap="">
<b>
Other eastern softwoods group
</b>
</td>
<td align="right">
25,181
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
25,181
</td>
</tr>
<tr>
<td nowrap="">
<b>
Exotic softwoods group
</b>
</td>
<td align="right">
5,868
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
662
</td>
<td align="right">
5,206
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / pine group
</b>
</td>
<td align="right">
144,238
</td>
<td align="right">
9,592
</td>
<td align="right">
-
</td>
<td align="right">
21,475
</td>
<td align="right">
113,171
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / hickory group
</b>
</td>
<td align="right">
3,480,272
</td>
<td align="right">
152,598
</td>
<td align="right">
123,900
</td>
<td align="right">
285,305
</td>
<td align="right">
2,918,470
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / gum / cypress group
</b>
</td>
<td align="right">
76,302
</td>
<td align="right">
-
</td>
<td align="right">
12,209
</td>
<td align="right">
9,311
</td>
<td align="right">
54,782
</td>
</tr>
<tr>
<td nowrap="">
<b>
Elm / ash / cottonwood group
</b>
</td>
<td align="right">
652,001
</td>
<td align="right">
7,105
</td>
<td align="right">
25,431
</td>
<td align="right">
46,096
</td>
<td align="right">
573,369
</td>
</tr>
<tr>
<td nowrap="">
<b>
Maple / beech / birch group
</b>
</td>
<td align="right">
346,718
</td>
<td align="right">
10,871
</td>
<td align="right">
818
</td>
<td align="right">
12,748
</td>
<td align="right">
322,281
</td>
</tr>
<tr>
<td nowrap="">
<b>
Other hardwoods group
</b>
</td>
<td align="right">
21,238
</td>
<td align="right">
585
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
20,653
</td>
</tr>
<tr>
<td nowrap="">
<b>
Exotic hardwoods group
</b>
</td>
<td align="right">
2,441
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
2,441
</td>
</tr>
<tr>
<td nowrap="">
<b>
Nonstocked
</b>
</td>
<td align="right">
42,975
</td>
<td align="right">
-
</td>
<td align="right">
6,144
</td>
<td align="right">
1,570
</td>
<td align="right">
35,261
</td>
</tr>
</table>
<br/>
<table border="4" cellpadding="4" cellspacing="4">
<caption>
<b>`

I made four tables almost identical to yours and put them into a fairly respectable page of HTML. Then I ran this code.
>>> import bs4
>>> import pandas as pd
>>> soup = bs4.BeautifulSoup(open('temp.htm').read(), 'html.parser')
>>> tables = soup.findAll('table')
>>> for t, table in enumerate(tables):
...     df = pd.read_html(str(table), skiprows=2)
...     df[0].to_csv('table%s.csv' % t)
The results were four files like this, named table0.csv through table3.csv.
,0,1,2,3,4,5
0,Total,4875993,195438,169500,392030,4119025
1,White / red / jack pine group,40492,3426,-,10850,26217
2,Loblolly / shortleaf pine group,38267,11262,997,4015,21993
3,Other eastern softwoods group,25181,-,-,-,25181
4,Exotic softwoods group,5868,-,-,662,5206
5,Oak / pine group,144238,9592,-,21475,113171
6,Oak / hickory group,3480272,152598,123900,285305,2918470
7,Oak / gum / cypress group,76302,-,12209,9311,54782
8,Elm / ash / cottonwood group,652001,7105,25431,46096,573369
9,Maple / beech / birch group,346718,10871,818,12748,322281
10,Other hardwoods group,21238,585,-,-,20653
11,Exotic hardwoods group,2441,-,-,-,2441
12,Nonstocked,42975,-,6144,1570,35261
Perhaps the main thing I should mention is that I skipped the same number of rows in each table that BeautifulSoup delivered. If the number of header lines in the tables varies then you will have to do something more clever or just discard lines in the output files and omit the skiprows parameter.

Unsure as to what the exact question is here but right off the bat I can see an error that will throw you off a bit.
new_table = pd.DataFrame(columns=range(0-4))
Needs to be
new_table = pd.DataFrame(columns=range(0,4))
The result of range(0-4) is actually range(-4) which evaluates to range(0,-4) whereas you want range(0,4). You can just pass range(4) as the parameter or range(0,4).
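
A quick way to see the difference, as a small sketch using only built-ins (the variable names are just for illustration):

```python
# 0 - 4 is evaluated first, so range(0-4) is range(-4), which is empty.
wrong = list(range(0 - 4))

# Two equivalent ways to get the four column labels 0..3.
right = list(range(0, 4))
also_right = list(range(4))

print(wrong)       # []
print(right)       # [0, 1, 2, 3]
print(also_right)  # [0, 1, 2, 3]
```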

BeautifulSoup not parsing every tag of the html

I'm having a problem with BeautifulSoup not completely parsing the html received. I tried with both lxml and html5lib parsers and I had the same problem.
html = '<td style="vertical-align: top">1</td> <td style="vertical-align: top"><span class="ui-icon country flg-fr"></span>\t</td><td class="pn"><a class="player-link" href="/Players/25604">Hugo Lloris <span class="incident-wrapper"></span> </a><span class="player-meta-data">29</span><span class="player-meta-data">, GK </span></td> <td class="ShotsTotal ">0\t</td><td class="ShotOnTarget ">0\t</td><td class="KeyPassTotal ">0\t</td><td class="PassSuccessInMatch ">88\t</td><td class="DuelAerialWon ">0\t</td><td class="Touches ">35\t</td><td class="rating ">6.24</td> <td style="text-align: left"><span class="incident-wrapper"></span></td> '
parsed_html = ipdb> BeautifulSoup(html, 'html5lib')
<html><head></head><body>1 <span class="ui-icon country flg-fr"></span> <a class="player-link" href="/Players/25604">Hugo Lloris <span class="incident-wrapper"></span> </a><span class="player-meta-data">29</span><span class="player-meta-data">, GK </span> 0 0 0 88 0 35 6.24 <span class="incident-wrapper"></span> </body></html>

It works for me. I ran the following code (using beautifulsoup4==4.4.1):
from bs4 import BeautifulSoup
html = """
<td style="vertical-align: top">1</td>
<td style="vertical-align: top"><span class="ui-icon country flg-fr"></span>\t</td>
<td class="pn"><a class="player-link" href="/Players/25604">Hugo Lloris <span class="incident-wrapper"></span> </a><span
class="player-meta-data">29</span><span class="player-meta-data">, GK </span></td>
<td class="ShotsTotal ">0\t</td>
<td class="ShotOnTarget ">0\t</td>
<td class="KeyPassTotal ">0\t</td>
<td class="PassSuccessInMatch ">88\t</td>
<td class="DuelAerialWon ">0\t</td>
<td class="Touches ">35\t</td>
<td class="rating ">6.24</td>
<td style="text-align: left"><span class="incident-wrapper"></span></td>
"""
parsed_html = BeautifulSoup(html, 'html5lib')
print(html)
And I've got the following html printed:
<td style="vertical-align: top">1</td>
<td style="vertical-align: top"><span class="ui-icon country flg-fr"></span> </td>
<td class="pn"><a class="player-link" href="/Players/25604">Hugo Lloris <span class="incident-wrapper"></span> </a><span
class="player-meta-data">29</span><span class="player-meta-data">, GK </span></td>
<td class="ShotsTotal ">0 </td>
<td class="ShotOnTarget ">0 </td>
<td class="KeyPassTotal ">0 </td>
<td class="PassSuccessInMatch ">88 </td>
<td class="DuelAerialWon ">0 </td>
<td class="Touches ">35 </td>
<td class="rating ">6.24</td>
<td style="text-align: left"><span class="incident-wrapper"></span></td>
Don't see anything missing.
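
One thing worth checking: the snippet above prints html (the input string) rather than parsed_html, so it cannot show what the parser dropped. As far as I know, html5lib follows the browser parsing rules, under which td cells that are not inside a table are discarded, which would match the output in the question. Below is a minimal sketch, assuming a shortened copy of the fragment (the names fragment, lenient and wrapped are mine). It shows that Python's built-in html.parser keeps the bare cells, and that wrapping the fragment in table and tr tags gives it a valid context, which is the usual workaround for stricter parsers.

```python
from bs4 import BeautifulSoup

# Shortened version of the fragment from the question: bare <td> cells, no <table>/<tr>.
fragment = (
    '<td style="vertical-align: top">1</td>'
    '<td class="pn"><a class="player-link" href="/Players/25604">Hugo Lloris</a></td>'
    '<td class="rating ">6.24</td>'
)

# The built-in parser does not enforce table nesting rules,
# so the <td> tags are kept exactly as written.
lenient = BeautifulSoup(fragment, "html.parser")
cells = lenient.find_all("td")
print(len(cells))                                 # 3
print(lenient.find("td", class_="rating").text)   # 6.24

# Wrapping the cells in <table><tr> makes the fragment structurally valid.
# (html.parser is used here only to keep the sketch dependency-free.)
wrapped = BeautifulSoup("<table><tr>" + fragment + "</tr></table>", "html.parser")
print(len(wrapped.find_all("td")))                # 3
```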

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.
