I am using lxml's XPath in Python. I can extract text if I give the full path to an HTML tag, but I can't extract all the text from a tag and its child elements into a list. For example, given this HTML, I would like to get all the texts of the "example" class:
<div class="example">
"Some text"
<div>
"Some text 2"
<p>"Some text 3"</p>
<p>"Some text 4"</p>
<span>"Some text 5"</span>
</div>
<p>"Some text 6"</p>
</div>
I would like to get:
["Some text", "Some text 2", "Some text 3", "Some text 4", "Some text 5", "Some text 6"]
mzjn's answer is correct. After some trial and error I've managed to get it working; this is what the end code looks like. You need to put //text() at the end of the XPath. It is without refactoring for the moment, so there will definitely be some mistakes and bad practices, but it works.
import urllib.request

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup
from lxml import html

# Session with a small retry policy
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)
page = session.get("The url you are webscraping")
content = page.content

# (The BeautifulSoup part is left over and not used for the XPath below.)
htmlsite = urllib.request.urlopen("The url you are webscraping")
soup = BeautifulSoup(htmlsite, 'lxml')
htmlsite.close()

tree = html.fromstring(content)
scraped = tree.xpath('//html[contains(@class, "no-js")]/body/div[contains(@class, "container")]/div[contains(@class, "content")]/div[contains(@class, "row")]/div[contains(@class, "col-md-6")]/div[contains(@class, "clearfix")]//text()')
I've tried it out on the team introduction page of keeleyteton.com. It returned the following list, which is correct (although it needs lots of amending!) because the texts are in different tags, some of them child tags. Thank you for the help!
['\r\n ', '\r\n ', 'Nicholas F. Galluccio', '\r\n ', '\r\n ', 'Managing Director and Portfolio Manager', '\r\n ', 'Teton Small Cap Select Value', '\r\n ', 'Keeley Teton Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Scott R. Butler', '\r\n ', '\r\n ', 'Senior Vice President and Portfolio Manager ', '\r\n ', 'Teton Small Cap Select Value', '\r\n ', 'Keeley Teton Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Thomas E. Browne, Jr., CFA', '\r\n ', '\r\n ', 'Portfolio Manager', '\r\n ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Brian P. Leonard, CFA', '\r\n ', '\r\n ', 'Portfolio Manager', '\r\n ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Robert M. Goldsborough', '\r\n ', '\r\n ', 'Research Analyst', '\r\n ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n ', '\r\n ', '\r\n ', 'Brian R. Keeley, CFA', '\r\n ', '\r\n ', 'Portfolio Manager', '\r\n ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Edward S. Borland', '\r\n ', '\r\n ', 'Research Analyst', '\r\n ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Kevin M. Keeley', '\r\n ', '\r\n ', 'President', '\r\n ', '\r\n ', '\r\n ', 'Deanna B. Marotz', '\r\n ', '\r\n ', 'Chief Compliance Officer', '\r\n ']
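As a follow-up, the '\r\n' padding can be stripped out with a plain list comprehension. A small sketch using a slice of the output above:

```python
# A slice of the scraped output above; whitespace-only entries are mixed in.
raw = ['\r\n            ', 'Nicholas F. Galluccio', '\r\n            ',
       'Managing Director and Portfolio Manager', '\r\n            ',
       'Teton Small Cap Select Value']

# strip() each entry and drop the ones that are only whitespace
cleaned = [t.strip() for t in raw if t.strip()]
print(cleaned)
# ['Nicholas F. Galluccio', 'Managing Director and Portfolio Manager', 'Teton Small Cap Select Value']
```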
I am trying to scrape PFF.com for football grades with Selenium; specifically, a grade for every quarterback. The problem is that it doesn't seem to capture the text: .text isn't returning anything, but I am not getting any NoSuchElementException either.
Here's my code:
from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

service = Service(executable_path="C:\\chromedriver.exe")
op = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service, options=op)
driver.get("https://premium.pff.com/nfl/positions/2022/REG/passing?position=QB")
sleep(2)
sign_in = driver.find_element(By.XPATH, '/html/body/div/div/header/div[3]/button')
sign_in.click()
sleep(2)
email = driver.find_element(By.XPATH, '/html/body/div/div/div/div/div/div/form/div[1]/input')
email.send_keys(my_email)
password = driver.find_element(By.XPATH,
'/html/body/div/div/div/div/div/div/form/div[2]/input')
password.send_keys(my_password)
sleep(2)
sign_in_2 = driver.find_element(By.XPATH,
'/html/body/div/div/div/div/div/div/form/button')
sign_in_2.click()
sleep(2)
all_off_grades = driver.find_elements(By.CSS_SELECTOR, '.kyber-table .kyber-grade-badge__info-text div')
all_qb_names = driver.find_elements(By.CSS_SELECTOR, '.kyber-table .p-1 a')
qb_grades = []
qb_names = []
for grade in all_off_grades:
qb_grades.append(grade.text)
for qb_name in all_qb_names:
qb_names.append(qb_name.text)
print(qb_grades)
print(qb_names)
The lists keep showing as empty.
Here are the elements I am trying to pull for one QB; I already confirmed the other QBs have the same class names for their grade and name.
<div class="kyber-grade-badge__info-text">91.5</div>
I need to pull the 91.5.
<a class="p-1" href="/nfl/players/2022/REG/josh-allen/46601/passing">Josh Allen</a>
I need to pull "Josh Allen".
@Jbuck3 I tried modifying the locator and it works for me. I am also including the output I got. Let me know if that is what you were expecting.
all_off_grades = driver.find_elements(By.CSS_SELECTOR, '.kyber-table-body__scrolling-rows-container .kyber-grade-badge__info-text')
all_qb_names = driver.find_elements(By.CSS_SELECTOR, "a[data-gtm-id = 'player_name']")
And the output I got is:
['91.5', '90.3', '74.6', '-', '-', '60.0', '84.3', '78.3', '78.1', '-', '-', '60.0', '82.8', '83.4', '-', '-', '-', '60.0']
['Josh Allen ', 'Geno Smith ', 'Kirk Cousins ', 'Marcus Mariota ', 'Jameis Winston ', 'Trey Lance ', 'Derek Carr ', 'Justin Fields ', 'Trevor Lawrence ', 'Russell Wilson ', 'Ryan Tannehill ', 'Tom Brady ', 'Tua Tagovailoa ', 'Mac Jones ', 'Davis Mills ', 'Matthew Stafford ', 'Baker Mayfield ', 'Lamar Jackson ', 'Joe Flacco ', 'Matt Ryan ', 'Jalen Hurts ', 'Daniel Jones ', 'Kyler Murray ', 'Justin Herbert ', 'Joe Burrow ', 'Aaron Rodgers ', 'Patrick Mahomes ', 'Mitchell Trubisky ', 'Dak Prescott ', 'Jacoby Brissett ', 'Carson Wentz ', 'Jared Goff ']
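Once both lists come back, pairing them up is straightforward; note the trailing spaces in the scraped names. A sketch using the first few values from the output above:

```python
# First few values from the scraped output above
qb_grades = ['91.5', '90.3', '74.6']
qb_names = ['Josh Allen ', 'Geno Smith ', 'Kirk Cousins ']

# Strip the trailing space from each name, then zip names and grades together
grades_by_name = dict(zip((n.strip() for n in qb_names), qb_grades))
print(grades_by_name)
# {'Josh Allen': '91.5', 'Geno Smith': '90.3', 'Kirk Cousins': '74.6'}
```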
I have the following list of tuples.
lst =
[
('LexisNexis', ['IT Services and IT Consulting ', ' New York City, NY']),
('AbacusNext', ['IT Services and IT Consulting ', ' La Jolla, California']),
('Aderant', ['Software Development ', ' Atlanta, GA']),
('Anaqua', ['Software Development ', ' Boston, MA']),
('Thomson Reuters Elite', ['Software Development ', ' Eagan, Minnesota']),
('Litify', ['Software Development ', ' Brooklyn, New York'])
]
I want to flatten the lists in each tuple so their elements become part of the tuples of lst.
I found How do I make a flat list out of a list of lists? but have no idea how to adapt it to my case.
You can use unpacking:
lst = [('LexisNexis', ['IT Services and IT Consulting ', ' New York City, NY']),
('AbacusNext', ['IT Services and IT Consulting ', ' La Jolla, California']),
('Aderant', ['Software Development ', ' Atlanta, GA']),
('Anaqua', ['Software Development ', ' Boston, MA']),
('Thomson Reuters Elite', ['Software Development ', ' Eagan, Minnesota']),
('Litify', ['Software Development ', ' Brooklyn, New York'])]
output = [(x, *l) for (x, l) in lst]
print(output)
# [('LexisNexis', 'IT Services and IT Consulting ', ' New York City, NY'),
# ('AbacusNext', 'IT Services and IT Consulting ', ' La Jolla, California'),
# ('Aderant', 'Software Development ', ' Atlanta, GA'),
# ('Anaqua', 'Software Development ', ' Boston, MA'),
# ('Thomson Reuters Elite', 'Software Development ', ' Eagan, Minnesota'),
# ('Litify', 'Software Development ', ' Brooklyn, New York')]
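If you'd rather avoid the * unpacking syntax, tuple concatenation gives the same result (a sketch using two of the tuples from the question):

```python
lst = [('LexisNexis', ['IT Services and IT Consulting ', ' New York City, NY']),
       ('Aderant', ['Software Development ', ' Atlanta, GA'])]

# (x,) is a one-element tuple; adding tuple(l) appends the list's items to it
output = [(x,) + tuple(l) for (x, l) in lst]
print(output)
# [('LexisNexis', 'IT Services and IT Consulting ', ' New York City, NY'),
#  ('Aderant', 'Software Development ', ' Atlanta, GA')]
```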
I've also found the answer by Deacon using abc from collections. It is worth trying too.
from collections import abc

def flatten(obj):
    for o in obj:
        # Flatten any iterable class except for strings.
        if isinstance(o, abc.Iterable) and not isinstance(o, str):
            yield from flatten(o)
        else:
            yield o

[tuple(flatten(i)) for i in lst]
Out[47]:
[('LexisNexis', 'IT Services and IT Consulting ', ' New York City, NY'),
('AbacusNext', 'IT Services and IT Consulting ', ' La Jolla, California'),
('Aderant', 'Software Development ', ' Atlanta, GA'),
('Anaqua', 'Software Development ', ' Boston, MA'),
('Thomson Reuters Elite', 'Software Development ', ' Eagan, Minnesota'),
('Litify', 'Software Development ', ' Brooklyn, New York')]
I'm new to web scraping with Scrapy and I've been trying to get the right XPath for a portion of the HTML on this website. I've been using these Scrapy commands:
response.xpath('//*[@id="companycontent"]/div/div/div[2]/div/div[6]/div').getall()
This is the Output:
['<div class="address">\r\n <h4>Address <span>1</span></h4>\r\n <strong>Office : </strong>1715 , 1714<br>\r\n <strong>Floor : </strong>Floor 17<br>\r\n <strong>Building : </strong>Shatha Tower<br>\r\n Dubai Internet City<br><br>\r\n \t\t</div>']
response.xpath('//*[@id="companycontent"]/div/div/div[2]/div/div[6]/div').get()
'\r\n Address 1\r\n Office : 1715 , 1714\r\n Floor : Floor 17\r\n Building : Shatha Tower\r\n Dubai Internet City\r\n \t\t'
And this one:
response.xpath('//div[contains(@class, "address")]/text()').extract()
with the output:
['\r\n \r\n \r\n \t\t\t\t\t\t\t\t ', '\r\n ', '\r\n ', '1715 , 1714', '\r\n ', 'Floor 17', '\r\n ', 'Shatha Tower', '\r\n Dubai Internet City', '\r\n \t\t', ' \r\n \t\t\r\n\r\n\t\t\t\t\t\t \r\n ']
response.xpath('//div[contains(@class, "address")]/text()').getall()
['\r\n \r\n \r\n \t\t\t\t\t\t\t\t ', '\r\n ', '\r\n ', '1715 , 1714', '\r\n ', 'Floor 17', '\r\n ', 'Shatha Tower', '\r\n Dubai Internet City', '\r\n \t\t', ' \r\n \t\t\r\n\r\n\t\t\t\t\t\t \r\n ']
I'm sure the first command will do the job, but I was wondering if there's a shorter XPath to run in the script.
Hope anyone can help me.
You can find text by XPath in the form //tag-name[@class="class-name"]. Follow this approach to find the data.
Code:
from selenium import webdriver

# Use a raw string so the backslashes in the path are not treated as escapes
path = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(path)
driver.get("https://tecomgroup.ae/directory/company.php?company=0016F00001wcgFJQAY&csrt=2648526569298119449")
data = driver.find_element_by_xpath('//div[@class="address"]')
data.text.split("\n")
Output:
['ADDRESS 1',
'Office : 1715 , 1714',
'Floor : Floor 17',
'Building : Shatha Tower',
'Dubai Internet City']
You could also use a CSS selector:
print([x.strip() for x in response.css('div.address > div.address ::text').getall() if x.strip()])
I extract information from a website and am able to get the data. However, I am unable to extend the data for 'K', while 'J' and 'L' work. (This is Python 2 code.)
from lxml import html
import requests
import csv
import pandas as pd
import MySQLdb as mdb
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

bursa = ['J', 'K', 'L']
bursalist = []
print bursalist

for x in range(len(bursa)):
    try:
        page = requests.get('https://www.malaysiastock.biz/Listed-Companies.aspx?type=A&value=' + bursa[x])
        tree = html.fromstring(page.content)
        tree1 = tree.xpath('//td/h3/a[contains(text(),"(")][not(contains(text(),"(F"))]/text()')
        tree2 = tree.xpath('.//td/h3/text()')
        alist = []
        for y in range(len(tree1)):
            a = tree1[y].split('(')
            b = a[1].split(')')
            alist.append(a[0])
            alist.append(b[0])
            alist.append(tree2[2 * y])
            alist.append(tree2[2 * y + 1])
        bursalist.extend(alist)
        print bursalist
    except Exception:
        print "no data"
Notice that print bursalist only shows data from 'J' and 'L', while 'K' is missing. But if I print alist, the 'K' data is shown; it just never gets extended into bursalist.
Please advise if there is a robust way to do this.
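Not the original code, but one more robust pattern is to build one record (tuple) per company from the two xpath result lists, so an indexing error for one entry can't silently drop a whole letter's data. parse_entries is a hypothetical helper; the sample inputs mimic the shape of the xpath results in the question:

```python
def parse_entries(tree1, tree2):
    # tree1 holds strings like 'KAREX (5247)';
    # tree2 holds alternating full-name / sector text nodes.
    rows = []
    for y, entry in enumerate(tree1):
        short, _, rest = entry.partition('(')
        code = rest.split(')')[0]
        rows.append((short.strip(), code,
                     tree2[2 * y].strip(), tree2[2 * y + 1].strip()))
    return rows

# Sample values shaped like the xpath results in the question
tree1 = ['KAREX (5247)', 'KAWAN (7216)']
tree2 = ['KAREX BERHAD', 'Personal Goods ', 'KAWAN FOOD BERHAD', 'Food & Beverages ']
print(parse_entries(tree1, tree2))
# [('KAREX', '5247', 'KAREX BERHAD', 'Personal Goods'),
#  ('KAWAN', '7216', 'KAWAN FOOD BERHAD', 'Food & Beverages')]
```

Keeping one tuple per company also makes it easy to spot which record (not which letter) failed when an exception is raised.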
print alist shows:
['K1 ', '0111', 'K-ONE TECHNOLOGY BERHAD', 'Technology Equipment ', 'KAB ', '0193', 'KEJURUTERAAN ASASTERA BERHAD', 'Industrial Engineering ', 'KAMDAR ', '8672', 'KAMDAR GROUP (M) BERHAD', 'Retailers ', 'KANGER ', '0170', 'KANGER INTERNATIONAL BERHAD', 'Household Goods ', 'KAREX ', '5247', 'KAREX BERHAD', 'Personal Goods ', 'KARYON ', '0054', 'KARYON INDUSTRIES BERHAD', 'Chemicals ', 'KAWAN ', '7216', 'KAWAN FOOD BERHAD', 'Food & Beverages ', 'KEINHIN ', '7199', 'KEIN HING INTERNATIONAL BERHAD', 'Metals ', 'KEN ', '7323', 'KEN HOLDINGS BERHAD', 'Property ', 'KENANGA ', '6483', 'KENANGA INVESTMENT BANK BERHAD', 'Other Financials ', 'KERJAYA ', '7161', 'KERJAYA PROSPEK GROUP BERHAD', 'Construction ', 'KESM ', '9334', 'KESM INDUSTRIES BERHAD', 'Semiconductors ', 'KEYASIC ', '0143', 'KEY ASIC BERHAD', 'Semiconductors ', 'KFIMA ', '6491', 'KUMPULAN FIMA BERHAD', 'Diversified Industrials ', 'KGB ', '0151', 'KELINGTON GROUP BERHAD', 'Industrial Engineering ', 'KGROUP ', '0036', 'KEY ALLIANCE GROUP BERHAD', 'Technology Equipment ', 'KHEESAN ', '6203', 'KHEE SAN BERHAD', 'Food & Beverages ', 'KHIND ', '7062', 'KHIND HOLDINGS BERHAD', 'Household Goods ', 'KHJB ', '0210', 'KIM HIN JOO (MALAYSIA) BERHAD', 'Retailers ', 'KIALIM ', '6211', 'KIA LIM BERHAD', 'Building Materials ', 'KIMHIN ', '5371', 'KIM HIN INDUSTRY BERHAD', 'Building Materials ', 'KIMLUN ', '5171', 'KIMLUN CORPORATION BERHAD', 'Construction ', 'KINSTEL ', '5060', 'KINSTEEL BHD', 'Metals ', 'KIPREIT ', '5280', 'KIP REAL ESTATE INVESTMENT TRUST', 'Real Estate Investment Trusts ', 'KKB ', '9466', 'KKB ENGINEERING BERHAD', 'Industrial Engineering ', 'KLCC ', '5235SS', 'KLCC PROPERTY HOLDINGS BERHAD', 'Real Estate Investment Trusts ', 'KLCI1XI ', '0835EA', 'KENANGA KLCI DAILY (-1X) INVERSE ETF', 'KENANGA KLCI DAILY 2X LEVERAGED ETF', 'KLCI2XL ', '0834EA', 'KUALA LUMPUR KEPONG BERHAD', 'Plantation ', 'KLK ', '2445', 'KLUANG RUBBER COMPANY (MALAYA) BERHAD', 'Plantation ', 'KLUANG ', '2453', 'KIM LOONG RESOURCES BERHAD', 'Plantation ', 'KMLOONG ', '5027', 'KNM GROUP BERHAD', 'Other Energy Resources ', 'KNM ', '7164', 'KNUSFORD BERHAD', 'Industrial Services ', 'KNUSFOR ', '5035', 'KOBAY TECHNOLOGY BERHAD', 'Industrial Materials, Components & Equipment ', 'KOBAY ', '6971', 'KOMARKCORP BERHAD', 'Packaging Materials ', 'KOMARK ', '7017', 'KOSSAN RUBBER INDUSTRIES BERHAD', 'Health Care Equipment & Services ', 'KOSSAN ', '7153', 'KOTRA INDUSTRIES BERHAD', 'Pharmaceuticals ', 'KOTRA ', '0002', 'KPJ HEALTHCARE BERHAD', 'Health Care Providers ', 'KPJ ', '5878', 'KUMPULAN POWERNET BERHAD', 'Personal Goods ', 'KPOWER ', '7130', 'KERJAYA PROSPEK PROPERTY BERHAD', 'Property ', 'KPPROP ', '7077', 'KUMPULAN PERANGSANG SELANGOR BERHAD', 'Diversified Industrials ', 'KPS ', '5843', 'KPS CONSORTIUM BERHAD', 'Wood & Wood Products ', 'KPSCB ', '9121', 'KRETAM HOLDINGS BERHAD', 'Plantation ', 'KRETAM ', '1996', 'KRONOLOGI ASIA BERHAD', 'Digital Services ', 'KRONO ', '0176', 'KECK SENG (MALAYSIA) BERHAD', 'Diversified Industrials ', 'KSENG ', '3476', 'KSL HOLDINGS BERHAD', 'Property ', 'KSL ', '5038', 'K.SENG SENG CORPORATION BERHAD', 'Industrial Materials, Components & Equipment ', 'KSSC ', '5192', 'K-STAR SPORTS LIMITED', 'Personal Goods ', 'KSTAR ', '5172', 'KONSORTIUM TRANSNASIONAL BERHAD', 'Travel, Leisure & Hospitality ', 'KTB ', '4847', 'KIM TECK CHEONG CONSOLIDATED BERHAD', 'Consumer Services ', 'KTC ', '0180', 'KUB MALAYSIA BERHAD', 'Industrial Services ', 'KUB ', '6874', 'KUCHAI DEVELOPMENT BERHAD', 'Other Financials ', 'KUCHAI ', '2186', 'KWANTAS CORPORATION BERHAD', 'Plantation ', 'KWANTAS ', '6572', 'KYM HOLDINGS BERHAD', 'Packaging Materials ', 'KYM ', '8362']
I am trying to read a CSV file using pandas:
import numpy as np
import pandas as pd
This is the data that I scraped. Please note that each line starts and ends with square brackets [] (maybe each line is a list). What should I write so the entire data ends up in table form? I don't know how to separate the brackets from the data.
[]
['Auburn University (Online Master of Business Administration with concentration in Business Analytics)', ' Masters ', ' US', ' AL', ' /Campus ', ' Raymond J. Harbert College of Business ']
['Auburn University (Data Science)', ' Bachelors ', ' US', ' AL', ' /Campus ', ' Business ']
['The University of Alabama (Master of Science in Marketing, Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Manderson Graduate School of Business ']
['The University of Alabama (MS in Operations Management - Decision Analytics Track)', ' Masters ', ' US', ' AL', ' /Campus ', ' Manderson Graduate School of Business ']
['The University of Alabama (M.S. degree in Applied Statistics, Data Mining Track)', ' Masters ', ' US', ' AL', ' /Campus ', ' Manderson Graduate School of Business ']
['The University of Alabama (MBA with concentration in Business Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Culverhouse College of Commerce ']
['Arkansas Tech University (Business Data Analytics)', ' Bachelors ', ' US', ' AR', ' /Campus ', ' Business ']
['University of Arkansas (Graduate Certificate in Business Analytics)', ' Certificate ', ' US', ' AR', ' Online/ ', ' Sam M. Walton College of Business ']
['University of Arkansas (Master of Information Systems with Business Analytics Concentration)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of Business ']
['University of Arkansas (Professional Master of Information Systems)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of
How should I read the CSV file? I want all the data in table form. Please help.
Your problem is exactly what the error message is telling you it is. The error is in parsing this line:
['The University of Alabama (Master of Science in Marketing,
Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', '
Online/ ', ' Manderson Graduate School of Business ']
The code ignores quote characters and breaks the line up into fields, making a break wherever it finds the delimiter ", ". You're expecting this to be a single field:
The University of Alabama (Master of Science in Marketing,
Specialization in Marketing Analytics
but this "field" has an instance of the delimiter ", " in it, which the CSV parser will honor because it is ignoring the fact that you have this value in quotes. So this piece of data is broken into two fields:
['The University of Alabama (Master of Science in Marketing
and
Specialization in Marketing Analytics)'
This results in the line being broken into 7 fields, and your code is expecting only 6.
Note that in addition, your items are going to include the quotes, which may not be what you're expecting either, and those square brackets don't belong there. In short, this isn't a well-formed CSV file.
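Since each line is actually a Python list literal, another way out (not part of this answer's regex approach) is ast.literal_eval, which honors the quotes for you:

```python
import ast

# One problematic line from the data, as a single string
line = ("['The University of Alabama (Master of Science in Marketing, "
        "Specialization in Marketing Analytics)', ' Masters ', ' US', "
        "' AL', ' Online/ ', ' Manderson Graduate School of Business ']")

# literal_eval parses the line as a Python list, so the comma inside
# the quoted school name is not treated as a field separator.
fields = [s.strip() for s in ast.literal_eval(line)]
print(len(fields))  # 6
```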
UPDATE: I'm a regex weenie. I do everything with regex expressions, and can't ignore a challenge like this. Here's a regex-based solution that will read exactly what you want out of this data. If you want it to recognize the last line of your data, you should add "']" to the end of that line.
import regex
from pprint import pprint

def parse_file(file):
    linepat = regex.compile(r"\[\s*('([^']*)')?(\s*,\s*'([^']*)')*\s*\]")
    with open(file) as f:
        r = []
        while True:
            line = f.readline()
            if not line:
                break
            line = line.strip()
            if len(line) == 0:
                continue
            m = linepat.match(line)
            if m and m.captures(4):
                fields = [m.group(2)] + [s.strip() for s in m.captures(4)]
                r.append(fields)
        return r

def main():
    r = parse_file("/tmp/blah.csv")
    pprint(r)

main()
Result:
[['Auburn University (Online Master of Business Administration with '
'concentration in Business Analytics)',
'Masters',
'US',
'AL',
'/Campus',
'Raymond J. Harbert College of Business'],
...
['University of Arkansas (Professional Master of Information Systems)',
'Masters',
'US',
'AR',
'/Campus',
'Sam M. Walton College of']]
Note that this doesn't use the built-in 're' module. That module doesn't deal with repeating groups, which is a must for this kind of problem. Also note that this doesn't involve Pandas. I don't know anything about that module, I assume it is trivial to feed the clean, parsed data from this code into Pandas if that's where you really want it.
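For completeness: if only the built-in re module is available and the fields never contain single quotes themselves, a looser approach is to grab the quoted substrings directly (an assumption about the data, not part of the answer above):

```python
import re

# One line of the data from the question
line = "['Auburn University (Data Science)', ' Bachelors ', ' US', ' AL', ' /Campus ', ' Business ']"

# findall returns every '...'-delimited substring; strip the padding
fields = [s.strip() for s in re.findall(r"'([^']*)'", line)]
print(fields)
# ['Auburn University (Data Science)', 'Bachelors', 'US', 'AL', '/Campus', 'Business']
```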
The basic method to read file.csv:
def process(string):
    print("Processing:", string)

for line in open("file.csv"):
    line = line.replace("\n", "")
    process(line)
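For data that really is CSV, the standard library's csv module handles delimiters and quoting for you. A minimal sketch reading from an in-memory file in place of file.csv:

```python
import csv
import io

# An in-memory stand-in for file.csv; note the quoted field with a comma
data = io.StringIO('company,city\nAderant,"Atlanta, GA"\n')

# csv.reader splits each line into fields, honoring the quotes
rows = list(csv.reader(data))
print(rows)
# [['company', 'city'], ['Aderant', 'Atlanta, GA']]
```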