Can this beautifulsoup script be simplified with Regex?

I wrote some BeautifulSoup scripts, and one part seems really redundant. I am wondering whether it can be simplified with regex.
All posts on this forum are marked with different colors, and I search for each color with a separate line. For six colors I wrote six lines that differ by only one word.
red = soup.find_all('a', style="font-weight: bold;color: red")
blue = soup.find_all('a', style="font-weight: bold;color: blue")
green = soup.find_all('a', style="font-weight: bold;color: green")
purple = soup.find_all('a', style="font-weight: bold;color: purple")
orange = soup.find_all('a', style="font-weight: bold;color: orange")
lime = soup.find_all('a', style="color: green")
I am not sure whether this can be simplified. Maybe something like:
re.compile("(color: red|blue|green|purple|orange)", re.(whatever the letter is))
If regex is not the right tool, could it be done some other way?
This is part of the DOM:
<th class="common">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
 </label>
<em>[美臀]</em> <span id="thread_10431427">(本中)(HND-???) 二宮ひかり</span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
</th>
<td class="author">
<cite>
第一會所新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>6 </cite>
<em>2019-4-22</em>
</td>
<td class="nums"><strong>2</strong> / <em>12234</em></td>
<td class="nums">5.02G / MP4
</td>
<td class="lastpost">
<em>2019-4-23 20:22</em>
<cite>by zj376104288</cite>
</td>
</tr>
</tbody><!-- 三級置頂分開 -->
<!-- 三級置頂分開 -->
<tbody id="stickthread_10431424">
<tr>
<td class="folder"><img src="images/green001/folder_common.gif"/></td>
<td class="icon">
  </td>
<th class="common">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
 </label>
<em>[VR]</em> <span id="thread_10431424">(WAAP)(WPVR-???)葵百合香</span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
</th>
<td class="author">
<cite>
第一會所新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>5 </cite>
<em>2019-4-22</em>
</td>
<td class="nums"><strong>0</strong> / <em>7265</em></td>
<td class="nums">3.85G / MP4
</td>
<td class="lastpost">
<em>2019-4-22 20:57</em>
<cite>by 第一會所新片</cite>
</td>
</tr>
</tbody><!-- 三級置頂分開 -->
<!-- 三級置頂分開 -->
<tbody id="stickthread_10431423">
<tr>
<td class="folder"><img src="images/green001/folder_common.gif"/></td>
<td class="icon">
  </td>
<th class="common">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
 </label>
<em>[VR]</em> <span id="thread_10431423">(KMP)(SAVR-???)舞島あかり</span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
</th>
<td class="author">
<cite>
第一會所新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>4 </cite>
<em>2019-4-22</em>
</td>
<td class="nums"><strong>0</strong> / <em>6226</em></td>
<td class="nums">23.39G / MP4
</td>
<td class="lastpost">
<em>2019-4-22 20:57</em>
<cite>by 第一會所新片</cite>
</td>
</tr>
</tbody><!-- 三級置頂分開 -->
<!-- 三級置頂分開 -->
<tbody id="stickthread_10431422">
<tr>
<td class="folder"><img src="images/green001/folder_common.gif"/></td>
<td class="icon">
  </td>

You can pass a comma-separated list of attribute selectors to CSS select, using the ends-with operator ($=):
[style$='color: red'],[style$='color: green'],[style$='color: blue'],[style$='color: purple'],[style$='color: orange']
So:
items = soup.select("[style$='color: red'],[style$='color: green'],[style$='color: blue'],[style$='color: purple'],[style$='color: orange']")
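If you would rather stay with regex, as the question suggests, find_all also accepts a compiled pattern as an attribute value. Here is a minimal sketch, assuming the style values always end with the color as in the question. The sample markup, the COLORS list and the by_color grouping are made up for illustration. Note that the alternation has to be grouped after "color: ", otherwise the pattern would match a bare "blue" anywhere in the style.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample markup; only the style values mirror the question.
HTML = """
<a style="font-weight: bold;color: red">red post</a>
<a style="font-weight: bold;color: blue">blue post</a>
<a style="color: green">plain green post</a>
<a style="font-weight: bold;color: black">other post</a>
<a>unstyled post</a>
"""

COLORS = ["red", "blue", "green", "purple", "orange"]
# The group limits the alternation to the color name, and $ anchors the
# match to the end of the style value, like the $= selector does.
COLOR_RE = re.compile(r"color: (" + "|".join(COLORS) + r")$")

soup = BeautifulSoup(HTML, "html.parser")
# bs4 searches each tag's style attribute with the compiled pattern;
# tags without a style attribute are skipped.
links = soup.find_all("a", style=COLOR_RE)

# Group the link texts by the color that matched.
by_color = {}
for link in links:
    color = COLOR_RE.search(link["style"]).group(1)
    by_color.setdefault(color, []).append(link.get_text())

print(by_color)
```

This makes one pass over the document instead of one find_all call per color, and the grouping step is only needed if you still want the posts separated by color.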

Related

Returning None when scraping href using Python

Hi, I'm trying to scrape the text "151 Heavy Duty Rubber Gloves - Ex Large" from a table with the following markup, copied from the inspector. Can someone please help with the right Python script?
[<table border="0" class="ProductBox" id="Added0">
<tr>
<td align="center" colspan="2">
<div style="width:100%;float:left;display:inline;float:left;height:37px;"><div style="float:left;font-size:16px;font-family: 'Roboto Condensed', sans-serif;color:white;margin-top:4%;margin-left:6%;"> </div></div>
</td></tr><tr>
<td align="center" colspan="2" height="60px;" valign="top">
<div class="PromoPriceText"> <br/><br/></div><div class="StdPrice">£0.69</div><div class="UnitCost">(£0.69/Unit)</div>
</td>
</tr>
<tr>
<td align="center" colspan="2" height="185">
<a href="/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211" style=" line-height: 20px; padding-left: 0px;">
<img alt="" class="effectfront" id="prod" src="/~uldir/104373t.jpg" style="height:165px !important;"/></a>
</td>
</tr>
<tr>
<td class="ProdDetails" style="padding-left:10px;padding-right:10px;margin-bottom:5px;"><input name="product_code" type="hidden" value="104373"/>104373</td>
<td align="right" class="ProdDetails" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
</td>
</tr>
<tr>
<td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
POR 0%
</td>
<td align="right" class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
VAT 20%
</td>
</tr>
<tr>
<td class="ProdDetails" colspan="2" style="padding-left:10px;padding-right:10px;margin-bottom:5px;height:50px;">
<a href="/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211" style=" line-height: 20px; padding-left: 0px;">
**151 Heavy Duty Rubber Gloves - Ex Large**</a></td>
</tr>
<tr>
<td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
1s x 1
</td>
<td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;float:right;width:98%;text-align:right;">
<div class="tooltip">
<div class="IconWishNS" id="IconWishNS104373" onclick="AddToWish('104373','A')" style="display:inline-block;">
<span class="tooltiptext tooltip-bottom" style="font-size:12px;">Add to Wish List</span></div>
</div>
<span class="OKStatus">In Stock </span>
</td>
</tr>
<tr>
<td colspan="2" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
<table style="margin-top : 10px;" width="100%">
<tr>
<td>
<img align="middle" alt="Take 1 Off Qty" src="/images/minus.png"/>
</td>
<td>
<input class="iQtyBox" id="104373_qty" maxlength="4" name="104373_qty" oninput="this.value=(parseInt(this.value)||'')" tabindex="1" type="text" value="1"/>
</td>
<td>
<img align="middle" alt="Add 1 To Qty" src="/images/add.png"/>
</td>
<td align="right">
<button class="subBlackButtonDiv subButtonDiv" style="width:70px;margin:0px;" type="button" value="add">Add</button>
</td>
</tr>
</table>
I tried to use the following:
r = s.get(url)
soup = BeautifulSoup(r.text, 'lxml')
table = soup.find_all('table')
for i in table:
    links = [link.get('href') for link in i.find_all('a')]
    print(links)
which unfortunately returns: ['/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211', '/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211', '#', '#', '#']

You can use the td.ProdDetails a selector (an a tag inside a td with the class ProdDetails) to target the text you are interested in, then call .strip() a few times to remove the extra characters:
DATA = """<table border="0" class="ProductBox" id="Added0">
<tr>
...
</table>"""
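from bs4 import BeautifulSoup
from typing import Optional
def extract_name(data: str) -> Optional[str]:
    # Return the text of the first link inside a td.ProdDetails cell, or None.
    soup = BeautifulSoup(data, "html.parser")
    links = soup.select("td.ProdDetails a")
    if len(links) >= 1:
        # Strip whitespace, then the surrounding asterisks, then whitespace again.
        return links[0].text.strip().strip("*").strip()
    else:
        return None
print(extract_name(DATA))
# like above
r = s.get(url)
soup = BeautifulSoup(r.text, 'lxml')
tables = soup.find_all('table')
# extract_name expects markup as a string, so convert the Tag back to text
text = extract_name(str(tables[0]))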
Output: 151 Heavy Duty Rubber Gloves - Ex Large

Xpath Python Extract Data From Table Between Two Headings

I'm trying to extract data from a table that lies between two headings in an HTML file using Python. In this case, the id to look up is in a span inside a heading (I need id="Perlis"; the table lies between the Perlis and Kedah headings):
<h2>
<span class="mw-headline" id="Perlis">Perlis</span>
<span class="mw-editsection">
<span class="mw-editsection-bracket">[</span>
edit
<span class="mw-editsection-bracket">]</span>
</span>
</h2>
<table class="wikitable" style="text-align:center; font-size:90%; width:100%;">
<tbody>
<tr>
<th width="30"># </th>
<th width="150">Constituency s </th>
<th width="150">Winner </th>
<th width="80">Votes </th>
<th width="80">Majority </th>
<th width="150">Opponent(s) </th>
<th width="80">Votes </th>
<th width="150">Incumbent </th>
<th width="80">
<b>Incumbent Majority</b>
</th>
</tr>
<tr>
<td colspan="13">
BN
<b>2</b> | GS
<b>0</b> | PH
<b>1</b> | Independent
<b>0</b>
</td>
</tr>
<tr align="center">
<td rowspan="2">P1 </td>
<td rowspan="2">
Padang Besar
</td>
<td rowspan="2" bgcolor="#B5BED9">
Zahidi Zainul Abidin
<br /> ( <b>BN</b>- <b>UMNO</b>)
</td>
<td rowspan="2">
<b>15,032</b>
</td>
<td rowspan="2">
<b>1,438</b>
</td>
<td bgcolor="#F18A8F">Izizam Ibrahim <br /> ( <b>PH</b>- <b>PPBM</b>) </td>
<td>
<b>13,594</b>
</td>
<td rowspan="2" bgcolor="#B5BED9">
Zahidi Zainul Abidin
<br /> ( <b>BN</b>- <b>UMNO</b>)
</td>
<td rowspan="2">
<b>7,426</b>
</td>
</tr>
<tr>
<td bgcolor="#B2DBB2">Mokhtar Senik <br /> ( <b>GS</b>- <b>PAS</b>) </td>
<td>
<b>7,874</b>
</td>
</tr>
<tr align="center">
<td rowspan="2">P2 </td>
<td rowspan="2">
Kangar
</td>
<td rowspan="2" bgcolor="#C7F2F2">Noor Amin Ahmad <br /> ( <b>PH</b>- <b>PKR</b>) </td>
<td rowspan="2">
<b>20,909</b>
</td>
<td rowspan="2">
<b>5,603</b>
</td>
<td bgcolor="#B5BED9">Ramli Shariff <br /> ( <b>BN</b>- <b>UMNO</b>) </td>
<td>
<b>15,306</b>
</td>
<td rowspan="2" bgcolor="#B5BED9">
Shaharuddin Ismail
<br /> ( <b>BN</b>- <b>UMNO</b>)
</td>
<td rowspan="2">
<b>4,037</b>
</td>
</tr>
<tr>
<td bgcolor="#B2DBB2">Mohamad Zahid Ibrahim <br /> ( <b>GS</b>- <b>PAS</b>) </td>
<td>
<b>8,465</b>
</td>
</tr>
</tbody>
</table>
<h2>
<span class="mw-headline" id="Kedah">Kedah</span>
<span class="mw-editsection">
<span class="mw-editsection-bracket">[</span>
edit
<span class="mw-editsection-bracket">]</span>
</span>
</h2>
<table class="wikitable" style="text-align:center; font-size:90%; width:100%;"></table>
This is the resulting JSON that I am trying to construct:
[
  {
    "state": "Perlis",
    "constituencies": [
      {
        "id": "P1",
        "name": "Padang Besar"
      },
      {
        "id": "P2",
        "name": "Kangar"
      }
    ]
  }
]
I'd like to know how to reference the specific table so I can extract the data into JSON format. I have used Scrapy before but am not sure how to do it in this case. This is what I had in mind:
import scrapy
class PostSpider(scrapy.Spider):
    name = 'manual_spider'
    start_urls = [
        '%URL%'
    ]
    def parse(self, response):
        doc = response.xpath('//comment()').getall()  # This is the bit I need
        # code continues here
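One way to reference the table is to start from the heading span and step to the table that follows its h2. In Scrapy that could be an XPath along the lines of //h2[span[@id="Perlis"]]/following-sibling::table[1]. Below is a minimal sketch of the same idea with BeautifulSoup rather than Scrapy, run on a trimmed copy of the markup above. The function name constituencies_for and the P\d+ pattern used to recognise constituency rows are assumptions for illustration, not something the page guarantees.

```python
import json
import re
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (most columns left out).
HTML = """
<h2><span class="mw-headline" id="Perlis">Perlis</span></h2>
<table class="wikitable">
<tbody>
<tr><th># </th><th>Constituency </th><th>Winner </th></tr>
<tr><td colspan="13">BN <b>2</b> | GS <b>0</b></td></tr>
<tr align="center"><td rowspan="2">P1 </td><td rowspan="2">
Padang Besar
</td><td rowspan="2">Zahidi Zainul Abidin</td></tr>
<tr><td bgcolor="#B2DBB2">Mokhtar Senik</td><td><b>7,874</b></td></tr>
<tr align="center"><td rowspan="2">P2 </td><td rowspan="2">
Kangar
</td><td rowspan="2">Noor Amin Ahmad</td></tr>
<tr><td bgcolor="#B2DBB2">Mohamad Zahid Ibrahim</td><td><b>8,465</b></td></tr>
</tbody>
</table>
<h2><span class="mw-headline" id="Kedah">Kedah</span></h2>
<table class="wikitable"></table>
"""


def constituencies_for(soup, state_id):
    """Return {"state", "constituencies"} for the table after the given heading."""
    headline = soup.find("span", id=state_id)
    if headline is None:
        return None
    heading = headline.find_parent("h2")
    # The first table sibling after the h2 is the one between the two headings.
    table = heading.find_next_sibling("table") if heading else None
    if table is None:
        return None
    found = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Assumption: constituency rows start with an id such as P1 or P2.
        if len(cells) >= 2 and re.fullmatch(r"P\d+", cells[0]):
            found.append({"id": cells[0], "name": cells[1]})
    return {"state": headline.get_text(strip=True), "constituencies": found}


soup = BeautifulSoup(HTML, "html.parser")
result = [constituencies_for(soup, "Perlis")]
print(json.dumps(result, indent=2))
```

The summary row and the second opponent rows are skipped because their first cell does not look like a constituency id.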

Python Beautiful Soup Iterate over Multiple Tables

I'm trying to find multiple tables by their CSS class names, but at first I only get the CSS back in the output. I want to loop over each of the small tables; each row contains player info, with the tds holding the attributes of each player. Why doesn't what I have print the table contents to begin with? I want to confirm I have this first step right before I go on to the tr and tds for each mini table. I think part of the issue is that the first table.
My program -
import requests
from bs4 import BeautifulSoup
#url = 'https://www.skysports.com/premier-league-table'
base_url = 'https://www.skysports.com'
# Squad Data
squad_url = base_url + '/liverpool-squad'
squad_r = requests.get(squad_url)
print(squad_r.status_code)
premier_squad_soup = BeautifulSoup(squad_r.text, 'html.parser')
premier_squad_table = premier_squad_soup.find_all = ('table', {'class': 'table -small no-wrap football-squad-table '})
print(premier_squad_table)
HTML -
each table looks like the following but with a different title
<table class="table -small no-wrap football-squad-table " title="Goalkeeper">
<colgroup>
<col class="" style="">
<col class="digit-4 -bp30-hdn">
<col class="digit-3 ">
<col class="digit-3 ">
<col class="digit-3 ">
</colgroup>
<thead>
<tr class="text-s -interact text-h6" style="">
<th class=" text-h4 -txt-left" title="">Goalkeeper</th>
<th class=" text-h6" title="Played">Pld</th>
<th class=" text-h6" title="Goals">G</th>
<th class=" text-h6" title="Yellow Cards ">YC</th>
<th class=" text-h6" title="Red Cards">RC</th>
</tr>
</thead>
<tbody>
<tr class="text-h6 -center">
<td>
<a href="/football/player/141016/alisson-ramses-becker">
<div class="row-table -2cols">
<span class="col span4/5 -txt-left"><h6 class=" text-h5">Alisson Ramses Becker</h6></span>
</div>
</a>
</td>
<td>
13 (0) </td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr class="text-h6 -center">
<td>
<a href="/simon-mignolet">
<div class="row-table -2cols">
<span class="col span4/5 -txt-left"><h6 class=" text-h5">Simon Mignolet</h6></span>
</div>
</a>
</td>
<td>
1 (0) </td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr class="text-h6 -center">
<td>
<a href="/football/player/153304/kamil-grabara">
<div class="row-table -2cols">
<span class="col span4/5 -txt-left"><h6 class=" text-h5">Kamil Grabara</h6></span>
</div>
</a>
</td>
<td>
1 (1) </td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
Output -
200
('table', {'class': 'table -small no-wrap football-squad-table '})

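I had to find the div first and then get the tables inside it. Note also that find_all has to be called: the original line assigned a tuple to premier_squad_soup.find_all (find_all = (...)), which is why the tuple itself was printed.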
premier_squad_div = premier_squad_soup.find('div', {'class': '-bp30-box col span1/1'})
premier_squad_table = premier_squad_div.find_all('table', {'class': 'table -small no-wrap football-squad-table '})
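For the next step (looping over each small table, then its rows and cells), here is a minimal sketch that runs on a trimmed copy of the markup instead of the live page. The Goalkeeper rows come from the question; the Defender table and its player are made up to show that several tables are handled, and the function name squad_players and the dictionary keys are hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed markup. The second table and "Example Defender" are invented.
HTML = """
<table class="table -small no-wrap football-squad-table " title="Goalkeeper">
<thead><tr><th>Goalkeeper</th><th title="Played">Pld</th><th>G</th><th>YC</th><th>RC</th></tr></thead>
<tbody>
<tr class="text-h6 -center">
<td><a href="/football/player/141016/alisson-ramses-becker"><h6 class=" text-h5">Alisson Ramses Becker</h6></a></td>
<td>
13 (0) </td><td>0</td><td>0</td><td>0</td>
</tr>
<tr class="text-h6 -center">
<td><a href="/simon-mignolet"><h6 class=" text-h5">Simon Mignolet</h6></a></td>
<td>
1 (0) </td><td>0</td><td>0</td><td>0</td>
</tr>
</tbody>
</table>
<table class="table -small no-wrap football-squad-table " title="Defender">
<tbody>
<tr class="text-h6 -center">
<td><a href="/example-defender"><h6 class=" text-h5">Example Defender</h6></a></td>
<td>
5 (2) </td><td>1</td><td>2</td><td>0</td>
</tr>
</tbody>
</table>
"""


def squad_players(html):
    """Collect one dict per player row across all squad tables."""
    soup = BeautifulSoup(html, "html.parser")
    players = []
    # Selecting on a single class avoids depending on the full class string.
    for table in soup.select("table.football-squad-table"):
        position = table.get("title")
        for tr in table.select("tbody tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if len(cells) != 5:
                continue  # skip rows that do not have the expected columns
            name, played, goals, yellow, red = cells
            players.append({
                "position": position,
                "name": name,
                "played": played,
                "goals": goals,
                "yellow_cards": yellow,
                "red_cards": red,
            })
    return players


for player in squad_players(HTML):
    print(player)
```

Against the real page you would pass squad_r.text to the function; whether the tables still need to be reached through the surrounding div depends on the live markup, which is not shown here.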

Not able to scrape checkbox value from each table tr

Please see the below html table
<table width=900 cellspacing=0 border=0 cellpadding=5 style='border-top:1px solid silver;border-left:1px solid silver;border-right:1px solid silver;'>
<tr >
<td style='border-bottom:1px solid silver;background:#ffffff;'>
<input checked type=checkbox name=jobs[] value='610974'>
<table border=0 cellpadding=2 cellspacing=0 style='border:4px #70797a; border-radius: 5px;'>
<tr>
<td style='background:lightgreen;' valign=top>
<img src='../images/checkwhite.png' style='width:30px;'>
</td>
<td style='background:lightgreen;'> 123 Charter Rd Wethersfield CT 06109 </td>
<tr>
<td>Your Input</td>
<td>123 CHARTER RD WETHERSFIELD CT 06109</td>
</tr>
</table>
<br clear=all>
<div style='margin-left:40px;'>09/11/2018
<br>Exterior BPO - Light Photo Set (3 photos*)
<br>$9.00 We found a rep 6.2 miles from job.
<span style='color:silver'> 640x480 Add Datestamp, </span>
<br clear=all>
<div style=float:left;'></div>
</div>
</td>
</td>
<tr >
<td style='border-bottom:1px solid silver;background:#ffffff;'>
<div style='color:red; font-weight:bold; '>Warning... Duplicate Found!</div>
<input checked type=checkbox name=jobs[] value='610975'>
<table border=0 cellpadding=2 cellspacing=0 style='border:4px #70797a; border-radius: 5px;'>
<tr>
<td style='background:lightgreen;' valign=top>
<img src='../images/checkwhite.png' style='width:30px;'>
</td>
<td style='background:lightgreen;'> 123 Charter Rd Wethersfield CT 06109 </td>
<tr>
<td>Your Input</td>
<td>123 CHARTER RD WETHERSFIELD CT 06109</td>
</tr>
</table>
<br clear=all>
<div style='margin-left:40px;'>09/11/2018
<br>Exterior BPO - Light Photo Set (3 photos*)
<br>$9.00 We found a rep 6.2 miles from job.
<span style='color:silver'> 640x480 Add Datestamp, </span>
<br clear=all>
<div style=float:left;'>
I need the output as:
id="610974" and Address="123 CHARTER RD WETHERSFIELD CT 06109" [the 1st checkbox value is the id, followed by the corresponding address]
id="610975" and Address="123 CHARTER RD WETHERSFIELD CT 06109" [the 2nd checkbox value is the id, followed by the corresponding address]
etc.
soup = BeautifulSoup(bodystrip, "lxml")
for tr in response.find_all('tr'):
    tds = tr.find_all('td')
    print(tds[0].text)
    jobid = tds[0].find('input')
    print(jobid)
this is getting error on address are properly getting

With Scrapy:
for input_node in response.xpath('//input[@name="jobs[]"]'):
    id = input_node.xpath('./@value').extract_first()
    address = input_node.xpath('./following-sibling::table[1]//td[.="Your Input"]/following-sibling::td[1]/text()').extract_first()

With BeautifulSoup this should work:
for job in soup.find_all('input',attrs={"type":"checkbox"}):
    print(job['value'])
    print(job.parent.find_all('td',attrs={'style':True})[1].text)
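If you want the address from the "Your Input" row, as in the requested output, a variation is to walk from each checkbox to the table that follows it and read the cell next to the "Your Input" label. This is only a sketch on a trimmed copy of the markup from the question; the function name jobs_with_addresses is hypothetical, and it assumes every job has such a label cell.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, unquoted attributes included.
HTML = """
<table width=900>
<tr>
<td>
<input checked type=checkbox name=jobs[] value='610974'>
<table border=0>
<tr><td style='background:lightgreen;'> 123 Charter Rd Wethersfield CT 06109 </td></tr>
<tr><td>Your Input</td><td>123 CHARTER RD WETHERSFIELD CT 06109</td></tr>
</table>
</td>
</tr>
<tr>
<td>
<div style='color:red; font-weight:bold; '>Warning... Duplicate Found!</div>
<input checked type=checkbox name=jobs[] value='610975'>
<table border=0>
<tr><td style='background:lightgreen;'> 123 Charter Rd Wethersfield CT 06109 </td></tr>
<tr><td>Your Input</td><td>123 CHARTER RD WETHERSFIELD CT 06109</td></tr>
</table>
</td>
</tr>
</table>
"""


def jobs_with_addresses(html):
    """Return (id, address) pairs, one per jobs[] checkbox."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    # "name" clashes with find_all's own first parameter, so use attrs.
    for box in soup.find_all("input", attrs={"name": "jobs[]"}):
        table = box.find_next("table")
        label = table.find("td", string="Your Input") if table else None
        value_cell = label.find_next_sibling("td") if label else None
        address = value_cell.get_text(strip=True) if value_cell else None
        jobs.append((box.get("value"), address))
    return jobs


for job_id, address in jobs_with_addresses(HTML):
    print(f'id="{job_id}" and Address="{address}"')
```

Starting from the input rather than from each tr avoids the rows of the nested tables, which have no checkbox in their first cell.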

How to scrape span ids' texts in beautifulsoup in the following html?

<div align="justify" style="text-align: center">
<div>
<table cellspacing="0" rules="all" border="1" id="ContentPlaceHolder1_grd_reminder" style="width:555px;border-collapse:collapse;">
<tr>
<th class="grdheading2" scope="col">Book</th>
<th class="grdheading2" scope="col">Issue Date</th>
<th class="grdheading2" scope="col">Submition Date</th>
</tr>
<tr>
<td>
<span id="ContentPlaceHolder1_grd_reminder_Label1_0">Engineering Mechanics</span>
</td>
<td>
<span id="ContentPlaceHolder1_grd_reminder_Label2_0">17-Oct-2016</span>
</td>
<td>
<span id="ContentPlaceHolder1_grd_reminder_Label3_0">31-Oct-2016</span>
</td>
</tr>
<tr>
<td>
<span id="ContentPlaceHolder1_grd_reminder_Label1_1">ATB of Engineering Mathematics</span>
</td>
<td>
<span id="ContentPlaceHolder1_grd_reminder_Label2_1">17-Oct-2016</span>
</td>
<td>
<span id="ContentPlaceHolder1_grd_reminder_Label3_1">31-Oct-2016</span>
</td>
</tr>
</table>
</div>
</div>
I want to extract the text Engineering Mechanics and its corresponding date (text) 31-Oct-2016, and the text ATB of Engineering Mathematics and its corresponding date (text) 31-Oct-2016. All of these are located in spans with ids. How can I extract and print them? I'm new to web scraping.

First, use find_all() to find all tr tags; then, in a loop, use find_all() to find all span tags in each tr. This way you control which scraped data you use:
html = '''<div align="justify" style="text-align: center">
<div>
<table cellspacing="0" rules="all" border="1" id="ContentPlaceHolder1_grd_reminder" style="width:555px;border-collapse:collapse;">
<tr>
<th class="grdheading2" scope="col">Book</th><th class="grdheading2" scope="col">Issue Date</th><th class="grdheading2" scope="col">Submition Date</th>
</tr><tr>
<td>
<span id="ContentPlaceHolder1_grd_reminder_Label1_0">Engineering Mechanics</span>
</td><td>
<span id="ContentPlaceHolder1_grd_reminder_Label2_0">17-Oct-2016</span>
</td><td>
<span id="ContentPlaceHolder1_grd_reminder_Label3_0">31-Oct-2016</span>
</td>
</tr><tr>
<td>
<span id="ContentPlaceHolder1_grd_reminder_Label1_1">ATB of Engineering Mathematics</span>
</td><td>
<span id="ContentPlaceHolder1_grd_reminder_Label2_1">17-Oct-2016</span>
</td><td>
<span id="ContentPlaceHolder1_grd_reminder_Label3_1">31-Oct-2016</span>
</td>
</tr>
</table>
</div>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
trs = soup.find_all('tr')
for tr in trs:
    spans = tr.find_all('span')
    # the header row has no spans, so skip it
    if spans:
        print('title:', spans[0].text)
        print('date:', spans[2].text)
Result
title: Engineering Mechanics
date: 31-Oct-2016
title: ATB of Engineering Mathematics
date: 31-Oct-2016
