Extracting values from HTML in Python

My project involves web scraping using Python. I need to get data about a vehicle given its registration. I have managed to get the HTML from the site into Python, but I am struggling to extract the values.
I am using this website: https://www.carcheck.co.uk/audi/N18CTN
from bs4 import BeautifulSoup
import requests
url = "https://www.carcheck.co.uk/audi/N18CTN"
r = requests.get(url)
soup = BeautifulSoup(r.text)
print(soup)
I need to get this information about the vehicle
<tr>
<th>Make</th>
<td>AUDI</td>
</tr>
<tr>
<th>Model</th>
<td>A3</td>
</tr>
<tr>
<th>Colour</th>
<td>Red</td>
</tr>
<tr>
<th>Year of manufacture</th>
<td>2017</td>
</tr>
<tr>
<th>Top speed</th>
<td>147 mph</td>
</tr>
<tr>
<th>Gearbox</th>
<td>6 speed automatic</td>
How would I go about doing this?

Even if you don't have extensive experience with BeautifulSoup, you can match the table containing the car information with a CSS selector, then extract the header and data cells and combine them into a dictionary:
import requests
from bs4 import BeautifulSoup
url = "https://www.carcheck.co.uk/audi/N18CTN"
soup = BeautifulSoup(requests.get(url).text, "lxml")
# Select the table containing the car information via a CSS selector
table = soup.select_one("div.page:nth-child(2) > div:nth-child(4) > div:nth-child(1) > table:nth-child(1)")
# Extract the header cells (<th>) of the table into a list
headers = [th.text for th in table.select("th")]
# Extract the data cells (<td>) of the table into a list
data = [td.text for td in table.select("td")]
# Pair the header cells with the data cells in a dictionary
car_info = dict(zip(headers, data))
print(car_info)
Output:
{'Make': 'AUDI', 'Model': 'A3', 'Colour': 'Red', 'Year of manufacture': '2017', 'Top speed': '147 mph', 'Gearbox': '6 speed automatic'}
In order to obtain the CSS selector pattern of the table you can use the devtools of your browser:

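If that auto-generated nth-child chain ever breaks, a more robust alternative (my sketch, not part of the original answer; it assumes the section heading text stays stable) is to locate the table relative to its heading:
# Sketch: find the heading that reads "General information", then take
# the first table that follows it (the heading level is an assumption)
heading = soup.find(lambda tag: tag.name in ("h1", "h2") and "General information" in tag.get_text())
table = heading.find_next("table")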
You can use this example as a starting point for getting information from this page:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.carcheck.co.uk/audi/N18CTN'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for row in soup.select('tr:has(th):has(td):not(:has(table))'):
    header = row.find_previous('h1').text.strip()
    title = row.th.text.strip()
    text = row.td.text.strip()
    all_data.append((header, title, text))
df = pd.DataFrame(all_data, columns = ['Header', 'Title', 'Value'])
print(df.head(20).to_markdown(index=False))
Prints:
| Header                    | Title                   | Value             |
|:--------------------------|:------------------------|:------------------|
| General information       | Make                    | AUDI              |
| General information       | Model                   | A3                |
| General information       | Colour                  | Red               |
| General information       | Year of manufacture     | 2017              |
| General information       | Top speed               | 147 mph           |
| General information       | Gearbox                 | 6 speed automatic |
| Engine & fuel consumption | Power                   | 135 kW / 184 HP   |
| Engine & fuel consumption | Engine capacity         | 1.968 cc          |
| Engine & fuel consumption | Cylinders               | 4                 |
| Engine & fuel consumption | Fuel type               | Diesel            |
| Engine & fuel consumption | Consumption city        | 42.0 mpg          |
| Engine & fuel consumption | Consumption extra urban | 52.3 mpg          |
| Engine & fuel consumption | Consumption combined    | 48.0 mpg          |
| Engine & fuel consumption | CO2 emission            | 129 g/km          |
| Engine & fuel consumption | CO2 label               | D                 |
| MOT history               | MOT expiry date         | 2023-10-27        |
| MOT history               | MOT pass rate           | 83 %              |
| MOT history               | MOT passed              | 5                 |
| MOT history               | Failed MOT tests        | 1                 |
| MOT history               | Total advice items      | 11                |
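If you'd rather have one dictionary per section instead of a long table, the same frame can be regrouped (a small sketch of mine, building on the df above):
# Sketch: one {Title: Value} dict per section header
by_section = {header: dict(zip(group['Title'], group['Value']))
              for header, group in df.groupby('Header')}
print(by_section['General information']['Make'])  # AUDI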

Related

Web Scraping data with BS4 - Python

I have been trying to export a web scraped document from the below code.
import pandas as pd
import requests
from bs4 import BeautifulSoup
url="https://www.marketwatch.com/tools/markets/stocks/country/sri-lanka/1"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html5lib')
cse = pd.DataFrame(columns=["Name", "Exchange", "Sector"])
for row in soup.find('tbody').find('tr'): ##for row in soup.find("tbody").find_all('tr'):
    col = row.find("td")
    Name = col[0].text
    Exchange = col[1].text
    Sector = col[2].text
    cse = cse.append({"Name":Company_Name,"Exchange":Exchange_code,"Sector":Industry}, ignore_index=True)
but I am receiving the error TypeError: 'int' object is not subscriptable. Can anyone help me figure this out?
You need to know the difference between .find() and .find_all().
The only difference is that find_all() returns a list containing the single result, and find() just returns the result.
Since you are using col = row.find("td"), col is not a list. So you get this error -
TypeError: 'int' object is not subscriptable
Since you need to iterate over all the <tr> and, in turn, the <td> inside every <tr>, you have to use find_all().
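To see where the int comes from: iterating over soup.find('tbody').find('tr') walks the children of the first row, including bare strings, and on a string .find is the built-in str.find, which returns an integer. A minimal demo (my sketch, not part of the original answer):
from bs4 import BeautifulSoup

row = BeautifulSoup("<tr><td>a</td><td>b</td></tr>", "html.parser").tr
print(row.find("td"))            # <td>a</td> -> a single Tag
print(row.find_all("td"))        # [<td>a</td>, <td>b</td>] -> a list of Tags
print("plain text".find("td"))   # -1 -> str.find returns an int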
You can try this out.
import pandas as pd
import requests
from bs4 import BeautifulSoup
url="https://www.marketwatch.com/tools/markets/stocks/country/sri-lanka/1"
data = requests.get(url).text
soup = BeautifulSoup(data, 'lxml')
cse = pd.DataFrame(columns=["Name", "Exchange", "Sector"])
for row in soup.find('tbody').find_all('tr'):
    col = row.find_all("td")
    Company_Name = col[0].text
    Exchange_code = col[1].text
    Industry = col[2].text
    cse = cse.append({"Name":Company_Name,"Exchange":Exchange_code,"Sector":Industry}, ignore_index=True)
Output:
Name ... Sector
0 Abans Electricals PLC (ABAN.N0000) ... Housewares
1 Abans Finance PLC (AFSL.N0000) ... Finance Companies
2 Access Engineering PLC (AEL.N0000) ... Construction
3 ACL Cables PLC (ACL.N0000) ... Industrial Electronics
4 ACL Plastics PLC (APLA.N0000) ... Industrial Products
.. ... ... ...
145 Lanka Hospital Corp. PLC (LHCL.N0000) ... Healthcare Provision
146 Lanka IOC PLC (LIOC.N0000) ... Specialty Retail
147 Lanka Milk Foods (CWE) PLC (LMF.N0000) ... Food Products
148 Lanka Realty Investments PLC (ASCO.N0000) ... Real Estate Developers
149 Lanka Tiles PLC (TILE.N0000) ... Building Materials/Products
[150 rows x 3 columns]
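One caveat (my note, not part of the original answer): DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas it is safer to collect the rows in a list and build the frame once:
# Sketch: same scrape, but without the removed DataFrame.append
rows = []
for row in soup.find('tbody').find_all('tr'):
    col = row.find_all('td')
    rows.append({"Name": col[0].text, "Exchange": col[1].text, "Sector": col[2].text})
cse = pd.DataFrame(rows, columns=["Name", "Exchange", "Sector"])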

Scrape zoho-analytics externally stored table. Is it possible?

I am trying to scrape a zoho-analytics table from this webpage for a university project. At the moment I have no ideas. I can't see the values in the inspector, and therefore I cannot use BeautifulSoup in Python (my favourite).
Does anybody have any idea?
Thanks a lot,
Joseph
I tried it with BeautifulSoup; it seems you can't parse the values inside the table because they are not on the website itself but stored externally(?)
EDIT:
https://analytics.zoho.com/open-view/938032000481034014
This is the link where the table and its data are stored.
So I tried scraping from it with bs4 and it works.
The class of the rows is "zdbDataRowDiv"
Try:
container = page_soup.findAll("div", {"class": "zdbDataRowDiv"})
Code explanation:
container # the variable where your data is stored, name it how you like
page_soup # your html page you souped with BeautifulSoup
findAll("tag",{"attribute":"value"}) # this function finds every tag which has the specific value inside its attribute
They are stored within the <script> tags in json format. Just a matter of pulling those out and parsing:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import json
url = 'https://flo.uri.sh/visualisation/4540617/embed'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
    if 'var _Flourish_data_column_names = ' in script.text:
        json_str = script.text
        col_names = json_str.split('var _Flourish_data_column_names = ')[-1].split(',\n')[0]
        cols = json.loads(col_names)
        data = json_str.split('_Flourish_data = ')[-1].split(',\n')[0]

# the JSON blob ends with JavaScript code; trim trailing ';...' chunks until it parses
loop = True
while loop == True:
    try:
        jsonData = json.loads(data)
        loop = False
        break
    except:
        data = data.rsplit(';', 1)[0]

rows = []
headers = cols['rows']['columns']
for row in jsonData['rows']:
    rows.append(row['columns'])

table = pd.DataFrame(rows, columns=headers)
for col in headers[1:]:
    table.loc[table[col] != '', col] = 'A'
Output:
print (table)
Company Climate change Forests Water security
0 Danone A A A
1 FIRMENICH SA A A A
2 FUJI OIL HOLDINGS INC. A A A
3 HP Inc A A A
4 KAO Corporation A A A
.. ... ... ... ...
308 Woolworths Limited A
309 Workspace Group A
310 Yokogawa Electric Corporation A A
311 Yuanta Financial Holdings A
312 Zalando SE A
[313 rows x 4 columns]

Script produces wrong results when linebreak comes into play

I've written a script in Python to scrape some disorganized content located within b tags and their next_sibling from a webpage. The thing is, my script fails when line breaks come into play. I'm trying to extract the titles and their corresponding descriptions from that page, starting from CHIEF COMPLAINT: Bright red blood per rectum to just before Keywords:.
Website address
I've tried so far with:
import requests
from bs4 import BeautifulSoup
url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
res = requests.get(url)
soup = BeautifulSoup(res.text,'lxml')
for item in soup.select_one("hr").find_next_siblings('b'):
    print(item.text, item.next_sibling)
The portion of output giving me unwanted results are like:
LABS: <br/>
CBC: <br/>
CHEM 7: <br/>
How can I get the titles and their corresponding descriptions?
Here's a scraper that's more robust compared to yesterday's solutions.
How to loop through scraping multiple documents on multiple web pages using BeautifulSoup?
How can I grab the entire body text from a web page using BeautifulSoup?
It extracts the title, description, and all sections properly:
import re
import copy
import requests
from bs4 import BeautifulSoup, Tag, Comment, NavigableString
from pprint import pprint
def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url)
    res.raise_for_status()
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    return soup

def clean_soup(soup: BeautifulSoup) -> BeautifulSoup:
    soup = copy.copy(soup)
    h1 = soup.select_one('h1')
    kw_re = re.compile('.*Keywords.*', flags=re.IGNORECASE)
    kw = soup.find('b', text=kw_re)
    # drop everything before the <h1> and everything after the "Keywords:" label
    for el in (*h1.previous_siblings, *kw.next_siblings):
        el.extract()
    kw.extract()
    for ad in soup.select('[id*="ad"]'):
        ad.extract()
    # remove every <script> tag (soup.script would only reach the first one)
    for script in soup.find_all('script'):
        script.extract()
    for c in h1.parent.children:
        if isinstance(c, Comment):
            c.extract()
    return h1.parent

def extract_meta(soup: BeautifulSoup) -> dict:
    h1 = soup.select_one('h1')
    title = h1.text.strip()
    desc_parts = []
    desc_re = re.compile('.*Description.*', flags=re.IGNORECASE)
    desc = soup.find('b', text=desc_re)
    hr = soup.select_one('hr')
    # collect everything between the "Description" label and the first <hr>
    for s in desc.next_siblings:
        if s is hr:
            break
        if isinstance(s, NavigableString):
            desc_parts.append(str(s).strip())
        elif isinstance(s, Tag):
            desc_parts.append(s.text.strip())
    description = '\n'.join(p.strip() for p in desc_parts if p.strip())
    return {
        'title': title,
        'description': description
    }

def extract_sections(soup: BeautifulSoup) -> list:
    # section titles are the all-uppercase <b> tags
    titles = [b for b in soup.select('b') if b.text.isupper()]
    parts = []
    for t in titles:
        title = t.text.strip(': ').title()
        text_parts = []
        for s in t.next_siblings:
            # walk forward until we see another title
            if s in titles:
                break
            if isinstance(s, Comment):
                continue
            if isinstance(s, NavigableString):
                text_parts.append(str(s).strip())
            if isinstance(s, Tag):
                text_parts.append(s.text.strip())
        text = '\n'.join(p for p in text_parts if p.strip())
        p = {
            'title': title,
            'text': text
        }
        parts.append(p)
    return parts

def extract_page(url: str) -> dict:
    soup = make_soup(url)
    clean = clean_soup(soup)
    meta = extract_meta(clean)
    sections = extract_sections(clean)
    return {
        **meta,
        'sections': sections
    }
url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
page = extract_page(url)
pprint(page, width=2000)
output:
{'description': 'Status post colonoscopy. After discharge, experienced bloody bowel movements and returned to the emergency department for evaluation.\n(Medical Transcription Sample Report)',
'sections': [{'text': 'Bright red blood per rectum', 'title': 'Chief Complaint'},
# some elements removed for brevity
{'text': '', 'title': 'Labs'},
{'text': 'WBC count: 6,500 per mL\nHemoglobin: 10.3 g/dL\nHematocrit:31.8%\nPlatelet count: 248 per mL\nMean corpuscular volume: 86.5 fL\nRDW: 18%', 'title': 'Cbc'},
{'text': 'Sodium: 131 mmol/L\nPotassium: 3.5 mmol/L\nChloride: 98 mmol/L\nBicarbonate: 23 mmol/L\nBUN: 11 mg/dL\nCreatinine: 1.1 mg/dL\nGlucose: 105 mg/dL', 'title': 'Chem 7'},
{'text': 'PT 15.7 sec\nINR 1.6\nPTT 29.5 sec', 'title': 'Coagulation Studies'},
{'text': 'The patient receive ... ula.', 'title': 'Hospital Course'}],
'title': 'Sample Type / Medical Specialty: Gastroenterology\nSample Name: Blood per Rectum'}
Code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
res = urlopen(url)
html = res.read()
soup = BeautifulSoup(html,'html.parser')
# Cut out the division containing the required text; used Right Click -> Inspect Element in the browser to find the respective div/tag
sampletext_div = soup.find('div', {'id': "sampletext"})
print(sampletext_div.find('h1').text) # TO print header
Output:
Sample Type / Medical Specialty: Gastroenterology
Sample Name: Blood per Rectum
Code:
# Find all the <b> tag
b_all=sampletext_div.findAll('b')
for b in b_all[4:]:
    print(b.text, b.next_sibling)
Output:
CHIEF COMPLAINT: Bright red blood per rectum
HISTORY OF PRESENT ILLNESS: This 73-year-old woman had a recent medical history significant for renal and bladder cancer, deep venous thrombosis of the right lower extremity, and anticoagulation therapy complicated by lower gastrointestinal bleeding. Colonoscopy during that admission showed internal hemorrhoids and diverticulosis, but a bleeding site was not identified. Five days after discharge to a nursing home, she again experienced bloody bowel movements and returned to the emergency department for evaluation.
REVIEW OF SYMPTOMS: No chest pain, palpitations, abdominal pain or cramping, nausea, vomiting, or lightheadedness. Positive for generalized weakness and diarrhea the day of admission.
PRIOR MEDICAL HISTORY: Long-standing hypertension, intermittent atrial fibrillation, and hypercholesterolemia. Renal cell carcinoma and transitional cell bladder cancer status post left nephrectomy, radical cystectomy, and ileal loop diversion 6 weeks prior to presentation, postoperative course complicated by pneumonia, urinary tract infection, and retroperitoneal bleed. Deep venous thrombosis 2 weeks prior to presentation, management complicated by lower gastrointestinal bleeding, status post inferior vena cava filter placement.
MEDICATIONS: Diltiazem 30 mg tid, pantoprazole 40 mg qd, epoetin alfa 40,000 units weekly, iron 325 mg bid, cholestyramine. Warfarin discontinued approximately 10 days earlier.
ALLERGIES: Celecoxib (rash).
SOCIAL HISTORY: Resided at nursing home. Denied alcohol, tobacco, and drug use.
FAMILY HISTORY: Non-contributory.
PHYSICAL EXAM: <br/>
LABS: <br/>
CBC: <br/>
CHEM 7: <br/>
COAGULATION STUDIES: <br/>
HOSPITAL COURSE: The patient received 1 liter normal saline and diltiazem (a total of 20 mg intravenously and 30 mg orally) in the emergency department. Emergency department personnel made several attempts to place a nasogastric tube for gastric lavage, but were unsuccessful. During her evaluation, the patient was noted to desaturate to 80% on room air, with an increase in her respiratory rate to 34 breaths per minute. She was administered 50% oxygen by nonrebreadier mask, with improvement in her oxygen saturation to 89%. Computed tomographic angiography was negative for pulmonary embolism.
Keywords:
gastroenterology, blood per rectum, bright red, bladder cancer, deep venous thrombosis, colonoscopy, gastrointestinal bleeding, diverticulosis, hospital course, lower gastrointestinal bleeding, nasogastric tube, oxygen saturation, emergency department, rectum, thrombosis, emergency, department, gastrointestinal, blood, bleeding, oxygen,
NOTE : These transcribed medical transcription sample reports and examples are provided by various users and
are for reference purpose only. MTHelpLine does not certify accuracy and quality of sample reports.
These transcribed medical transcription sample reports may include some uncommon or unusual formats;
this would be due to the preference of the dictating physician. All names and dates have been
changed (or removed) to keep confidentiality. Any resemblance of any type of name or date or
place or anything else to real world is purely incidental.

How to write csv file from scraped data from web in python

I am trying to scrape data from web pages, and I am able to scrape it.
After running the script below I get all of the div class data, but I am confused about how to write that data into a CSV file, like:
first-name data in the First Name column
last-name data in the Last Name column
and so on.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'
page = urlopen(html)
data = BeautifulSoup(page, 'html.parser')
name_box = data.findAll('div', attrs={'class': 'col-md-3 col-sm-3'})
for i in range(len(name_box)):
    data = name_box[i].text.strip()
Data:
Information Type
Individual
First Name
KACHAM
Middle Name
Last Name
RAJESHWAR
Father Full Name
RAMAIAH
Do you have any Past Experience ?
No
Do you have any registration in other State than registred State?
No
House Number
8-2-293/82/A/446/1
Building Name
SAI KRUPA
Street Name
ROAD NO 20
Locality
JUBILEE HILLS
Landmark
JUBILEE HILLS
State
Telangana
Division
Division 1
District
Hyderabad
Mandal
Shaikpet
Village/City/Town
Pin Code
500033
Office Number
04040151614
Fax Number
Website URL
Authority Name
Plan Approval Number
1/18B/06558/2018
Project Name
SKV S ANANDA VILAS
Project Status
New Project
Proposed Date of Completion
17/04/2024
Litigations related to the project ?
No
Project Type
Residential
Are there any Promoter(Land Owner/ Investor) (as defined by Telangana RERA Order) in the project ?
Yes
Sy.No/TS No.
00
Plot No./House No.
10-2-327
Total Area(In sqmts)
526.74
Area affected in Road widening/FTL of Tanks/Nala Widening(In sqmts)
58.51
Net Area(In sqmts)
1
Total Building Units (as per approved plan)
1
Proposed Building Units(as per agreement)
1
Boundaries East
PLOT NO 213
Boundaries West
PLOT NO 215
Boundaries North
PLOT NO 199
Boundaries South
ROAD NO 8
Approved Built up Area (In Sqmts)
1313.55
Mortgage Area (In Sqmts)
144.28
State
Telangana
District
Hyderabad
Mandal
Maredpally
Village/City/Town
Street
ROAD NO 8
Locality
SECUNDERABAD COURT
Pin Code
500026
Above is the data I get after running the code above.
Edit
for i in range(len(name_box)):
    data = name_box[i].text.strip()
    print(data)

fname = 'out.csv'
with open(fname) as f:
    next(f)
    for line in f:
        head = []
        value = []
        for row in line:
            head.append(row)
            print(row)
Expected
Information Type | First | Middle Name | Last Name | ......
Individual | KACHAM | | RAJESHWAR | .....
I have 200 URLs, but the data is not the same for all of them; some fields are missing. If data is not available, I want to write nothing and just leave the field blank.
Please suggest. Thank you in advance
To write to a CSV you need to know which values belong in the header and which in the body; in this case, a header value is an HTML element that contains a <label> tag.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'
page = urlopen(html)
data = BeautifulSoup(page, 'html.parser')
name_box = data.findAll('div', attrs={'class': 'col-md-3 col-sm-3'})

heads = []
values = []
for i in range(len(name_box)):
    data = name_box[i].text.strip()
    dataHTML = str(name_box[i])
    if 'PInfoType' in dataHTML:
        # <div class="col-md-3 col-sm-3" id="PInfoType">
        # empty value, maybe additional data for "Information Type"
        continue
    if 'for="2"' in dataHTML:
        # <label for="2">No</label>
        # it looks like a head but is actually a value
        values.append(data)
    elif '<label' in dataHTML:
        # <label for="PersonalInfoModel_InfoTypeValue">Information Type</label>
        # head, i.e. the top row
        heads.append(data)
    else:
        # <div class="col-md-3 col-sm-3">Individual</div>
        # value for the second row
        values.append(data)

csvData = ', '.join(heads) + '\n' + ', '.join(values)
with open("results.csv", 'w') as f:
    f.write(csvData)
print("finish.")
Question: How to write csv file from scraped data
Read the data into a dict and use csv.DictWriter(...) to write it to the CSV file.
Documentation:
csv.DictWriter
while
next
break
Mapping Types — dict
Skip the first line, as it's the title
Loop over the data lines:
key = next(data)
value = next(data)
Break the loop if there is no further data
Build dict[key] = value
After finishing the loop, write the dict to the CSV file (see the sketch below)
Output:
{'Individual': '', 'Father Full Name': 'RAMAIAH', 'First Name': 'KACHAM', 'Middle Name': '', 'Last Name': 'RAJESHWAR',... (omitted for brevity)
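A minimal sketch of those steps (my code, not the answerer's; it assumes the label and value lines strictly alternate in name_box from the question, which, as the question notes, is not always true across all 200 URLs):
import csv

lines = iter(d.text.strip() for d in name_box)  # the scraped div texts from the question
next(lines)                                     # skip the first line, it's the title
record = {}
while True:
    try:
        key = next(lines)        # label line
        value = next(lines)      # value line
    except StopIteration:
        break                    # no further data
    record[key] = value

with open("out.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=record.keys())
    writer.writeheader()
    writer.writerow(record)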

Scrape only selected text from tables using Python/Beautiful soup/pandas

I am new to Python and am using beautiful soup for web scraping for a project.
I am hoping to only get parts of the text in a list/dictionary. I started with the following code:
url = "http://eng.mizon.co.kr/productlist.asp"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
tables = soup.find_all('table')
This helped me parse the data into tables, and ONE of the items from the table looked as below:
<table border="0" cellpadding="0" cellspacing="0" width="235">
<tr>
<td align="center" height="238"><img alt="LL IN ONE SNAIL REPAIR CREAM, SNAIL REPAIR BLEMISH BALM, WATERMAX MOISTURE B.B CREAM, WATERMAX AQUA GEL CREAM, CORRECT COMBO CREAM, GOLD STARFISH ALL IN ONE CREAM, S-VENOM WRINKLE TOX CREAM, BLACK SNAIL ALL IN ONE CREAM, APPLE SMOOTHIE PEELING GEL, REAL SOYBEAN DEEP CLEANSING OIL, COLLAGEN POWER LIFTING CREAM, SNAIL RECOVERY GEL CREAM" border="0" src="http://www.mizon.co.kr/images/upload/product/20150428113514_3.jpg" width="240"/></td>
</tr>
<tr>
<td align="center" height="43" valign="middle"><a href="javascript:fnMoveDetail(7499)" onfocus="this.blur()"><span class="style3">ENJOY VITAL-UP TIME Lift Up Mask <br/>
Volume:25ml</span></a></td>
</tr>
</table>
For each such item in the table, I would like to extract only the following from the last data cell in the table above:
1) the four-digit number in a href = javascript:fnMoveDetail(7499)
2) the name under class style3
3) the volume under class style3
The next lines in my code were as follows:
df = pd.read_html(str(tables), skiprows={0}, flavor="bs4")[0]
a_links = soup.find_all('a', attrs={'class': 'style3'})
stnid_dict = {}
for a_link in a_links:
    cid = ((a_link['href'].split("javascript:fnMoveDetail("))[1].split(")")[0])
    stnid_dict[a_link.text] = cid
My objective is to use the numbers to go to individual links and then match the info scraped on this page to each link.
What would be the best way to approach this?
Use the a tag which contains the javascript href as an anchor: find all the span tags, then get each one's parent tag.
url = "http://eng.mizon.co.kr/productlist.asp"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
spans = soup.select('td > a[href*="javascript:fnMoveDetail"] > span')
for span in spans:
href = span.find_parent('a').get('href').strip('javascript:fnMoveDetail()')
name, volume = span.get_text(strip=True).split('Volume:')
print(name, volume, href)
Output:
Dust Clean up Peeling Toner 150ml 8235
Collagen Power Lifting EX Toner 150ml 8067
Collagen Power Lifting EX Emulsion 150ml 8068
Barrier Oil Toner 150ml 8059
Barrier Oil Emulsion 150ml 8060
BLACK CLEAN UP PORE WATER FINISHER 150ml 7650
Vita Lemon Sparkling Toner 150ml 7356
INTENSIVE SKIN BARRIER TONER 150ml 7110
INTENSIVE SKIN BARRIER EMULSION 150ml 7111
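To connect this back to the question's stnid_dict, the same loop can key the name and volume by the product id (my sketch, reusing the split logic from the question rather than the strip() call above):
stnid_dict = {}
for span in spans:
    href = span.find_parent('a')['href']
    cid = href.split('fnMoveDetail(')[1].split(')')[0]   # e.g. '7499'
    name, volume = span.get_text(strip=True).split('Volume:')
    stnid_dict[cid] = {'name': name.strip(), 'volume': volume.strip()}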
