Scrape a <p> with optional <span>s with regex - Python

I am trying to scrape a table like this:
<table><tr>
<td width="100"><p><span style=" font-family:'MS Shell Dlg 2'; font-size:8.25pt;">My title example:</span></p></td>
<td width="440"><p><span style=" font-family:'MS Shell Dlg 2'; font-size:8.25pt;">My text example.</span></p></td>
</tr>
<tr>
<td width="100">My second title:</p></td>
<td width="440"><p>My <span style=" font-family:'MS Shell Dlg 2'; font-size:8.25pt; text-decoration: underline;">second</span> text example.</p></td>
</tr></table>
To show the output in a simple list of dictionaries like this:
[
{"title": "My title example", "text": "My text example"},
{"title": "My other example", "text": "My <u>second</u> text example"},
{"title": "My title example", "text": "My new example"},
]
But I need to sanitize the code and swap the underline sections to tags. So this is the code that I have so far:
from bs4 import BeautifulSoup
import re

data = []
# Find the rows in the table
for table_row in html.select("table tr"):
    cells = table_row.findAll('td')
    if len(cells) > 0:
        row_title = cells[0].text.strip()
        paragraphs = []
        # Find all spans in a row
        for run in cells[1].findAll('span'):
            print(run)
            if "text-decoration: underline" in str(run):
                paragraphs.append("{0}{1}{2}".format("<u>", run.text, "</u>"))
            else:
                paragraphs.append(run.text)
        # Build up a sanitized string with all the runs.
        row_text = "".join(paragraphs)
        row = {"title": row_title, "text": row_text}
        data.append(row)
print(data)
The issue: as you may have noticed, it scrapes the row with spans perfectly (the first example) but fails on the second one, where it only scrapes the underlined parts (because the rest of the text is not inside span tags). So I was thinking that instead of trying to find spans, I would just remove all the tags and replace the ones that I need with a regex, something like this:
# Find all runs in a row
for paragraph in cells[1].findAll('p'):
    re.sub('<.*?>', '', str(paragraph))
And that would create text with no tags, but also without underline formatting, and that's where I am stuck.
I don't know how to add such a condition on regex. Any help is welcome.
Expected output: Remove all tags from paragraph but replace spans where text-decoration: underline is found with <u></u> tags.
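For what it's worth, the conditional replacement can be done in pure regex with two passes: first rewrite the underlined spans as <u>...</u>, then strip every remaining tag. A minimal sketch (the html string here is a stand-in for str(paragraph), and it assumes spans are never nested):

```python
import re

html = '<p>My <span style="text-decoration: underline;">second</span> text example.</p>'

# Pass 1: rewrite underlined spans as <u>...</u> (non-greedy match on the contents)
html = re.sub(
    r'<span[^>]*text-decoration:\s*underline[^>]*>(.*?)</span>',
    r'<u>\1</u>',
    html,
)
# Pass 2: strip every remaining tag except <u> and </u>
html = re.sub(r'<(?!/?u>)[^>]*>', '', html)
print(html)  # My <u>second</u> text example.
```

The negative lookahead `(?!/?u>)` is what protects the freshly inserted <u> tags from the tag-stripping pass.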

One idea would be to use .replace_with() to replace the "underline" span elements with the u elements and then use .encode_contents() to get the inner HTML of the "text" cells:
result = []
for row in soup.select("table tr"):
    title_cell, data_cell = row('td')[:2]
    for span in data_cell('span'):
        if 'underline' in span.get('style', ''):
            u = soup.new_tag("u")
            u.string = span.get_text()
            span.replace_with(u)
        else:
            # replacing the "span" element with its contents
            span.unwrap()
    # replacing the "p" element with its contents
    data_cell.p.unwrap()
    result.append({
        "title": title_cell.get_text(strip=True),
        "text": data_cell.encode_contents().decode()
    })

When you find a <span> tag with the underline attribute, you can change its text to add the <u>...</u> tags using span.string = '<u>{}</u>'.format(span.text). After modifying the text, you can remove the <span> tag using unwrap().
result = []
for row in soup.select('table tr'):
    columns = row.find_all('td')
    title = columns[0]
    txt = columns[1]
    for span in txt.find_all('span', style=lambda s: s and 'text-decoration: underline' in s):
        span.string = '<u>{}</u>'.format(span.text)
        span.unwrap()
    result.append({'title': title.text, 'text': txt.text})
print(result)
# [{'title': 'My title example:', 'text': 'My text example.'}, {'title': 'My second title:', 'text': 'My <u>second</u> text example.'}]
Note: This approach won't actually change the tag. It modifies the string and removes the tag.


Regular text between nested elements using LXML / Etree [duplicate]

Python 2.7 using lxml
I have some annoyingly formed html that looks like this:
<td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>
So basically it's a single td with a ton of stuff in it. I'm trying to compile a list or dict of the names and their addresses.
So far what I've done is gotten a list of nodes with names using tree.xpath('//td/b'). So let's assume I'm currently on the b node for John.
I'm trying to get whatever.xpath('string()') for everything following the current node but preceding the next b node (Sally). I've tried a bunch of different xpath queries but can't seem to get this right. In particular, any time I use an and operator in an expression that has no [] brackets, it returns a bool rather than a list of all nodes meeting the conditions. Can anyone help out?
This should work:
from lxml import etree

p = etree.HTMLParser()
html = open(r'./test.html', 'r')
data = html.read()
tree = etree.fromstring(data, p)
my_dict = {}
for b in tree.iter('b'):
    br = b.getnext().tail.replace('\n', '')
    my_dict[b.text.replace('\n', '')] = br
print my_dict
This code prints:
{'"John"': '"123 Main st."', '"Sally"': '"101 California St."'}
(You may want to strip the quotation marks out!)
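Stripping those quotation marks is a one-liner with str.strip, e.g. over the dict shown above:

```python
my_dict = {'"John"': '"123 Main st."', '"Sally"': '"101 California St."'}
# strip('"') removes leading/trailing double quotes from keys and values
cleaned = {k.strip('"'): v.strip('"') for k, v in my_dict.items()}
print(cleaned)  # {'John': '123 Main st.', 'Sally': '101 California St.'}
```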
Rather than using xpath, you could use one of lxml's parsers in order to easily navigate the HTML. The parser will turn the HTML document into an "etree", which you can navigate with provided methods. The lxml module provides a method called iter() which allows you to pass in a tag name and receive all elements in the tree with that name. In your case, if you use this to obtain all of the <b> elements, you can then manually navigate to the <br> element and retrieve its tail text, which contains the information you need. You can find information about this in the "Elements contain text" header of the lxml.etree tutorial.
Why not use the getchildren function on each td? For example:
from lxml import html
s = """
<td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>
"""
records = []
cur_record = -1
cur_field = 1
FIELD_NAME = 0
FIELD_STREET = 1
FIELD_CITY = 2
doc = html.fromstring(s)
td = doc.xpath('//td')[0]
for child in td.getchildren():
    if child.tag == 'b':
        cur_record += 1
        record = dict()
        record['name'] = child.text.strip()
        records.append(record)
        cur_field = 1
    elif child.tag == 'br':
        if cur_field == FIELD_STREET:
            records[cur_record]['street'] = child.tail.strip()
            cur_field += 1
        elif cur_field == FIELD_CITY:
            records[cur_record]['city'] = child.tail.strip()
And the results are:
records = [
{'city': '"New York\n"', 'name': '"John"\n', 'street': '"123 Main st.\n"'},
{'city': '"San Francisco\n"', 'name': '\n"Sally"\n', 'street': '"101 California St.\n"'}
]
Note: use tag.tail when you want the text that follows a void (non-closing) HTML tag such as <br>; its .text attribute will be empty.
Hope this would be helpful.
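To illustrate the .text vs .tail distinction from the note above, here is a standalone snippet (shown with the stdlib xml.etree.ElementTree, which shares lxml's text/tail model):

```python
import xml.etree.ElementTree as ET

td = ET.fromstring('<td><b>John</b><br/>123 Main st.</td>')
b, br = td.find('b'), td.find('br')
print(b.text)   # 'John'          -- text inside the element
print(br.text)  # None            -- <br/> has no inner text
print(br.tail)  # '123 Main st.'  -- the text *following* the tag
```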

Extracting text section from (Edgar 10-K filings) HTML

I am trying to extract a certain section from HTML files. To be specific, I look for the "ITEM 1" section of the 10-K filings (a US company's annual business report). E.g.:
https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002
Problem: However, I am not able to find the "ITEM 1" section, nor do I have an idea how to tell my algorithm to search from that point "ITEM 1" to another point (e.g. "ITEM 1A") and extract the text in between.
I am super thankful for any help.
Among others, I have tried this (and similar), but my bd is always empty:
try:
    # bd = soup.body.findAll(text=re.compile('^ITEM 1$'))
    # bd = soup.find_all(name="ITEM 1")
    # bd = soup.find_all(["ITEM 1", "ITEM1", "Item 1", "Item1", "item 1", "item1"])
    print(" Business Section (Item 1): ", bd.content)
except:
    print("\n Section not found!")
Using Python 3.7 and Beautifulsoup4
Regards Heka
As I mentioned in a comment, because of the nature of EDGAR, this may work on one filing but fail on another. The principles, though, should generally work (after some adjustments...)
import requests
import lxml.html

url = 'https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002'
source = requests.get(url)
doc = lxml.html.fromstring(source.text)
# in this filing, Item 1 is hiding in a series of <p> tags following a table with an <a> tag
# that has a "name" attribute with a value of "a_002"
tabs = doc.xpath('//table[./tr/td/font/a[@name="a_002"]]/following-sibling::p/font')
flag = ''
for i in tabs:
    if flag == 'stop':
        break
    if i.text is not None:  # we now start extracting the text from each <p> tag and move to the next
        print(i.text_content().strip().replace('\n', ''))
    nxt = i.getparent().getnext()
    # the following detects when the <p> tags of Item 1 end and the next Item begins, then stops
    if nxt is not None and nxt.tag == 'table':
        for j in nxt.iterdescendants():
            if j.tag == 'a' and j.values()[0] == 'a_003':
                # we have hit the <a> tag whose "name" attribute is "a_003", marking the
                # beginning of the next Item; so we stop
                flag = 'stop'
The output is the text of Item 1 in this filing.
There are special characters in the headings; remove them first:
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc

html = requests.get('https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002').text
doc = SimplifiedDoc(html)
doc.loadHtml(doc.replaceReg(doc.html, r'ITEM\s+', 'ITEM '))
item1 = doc.getElementByText('ITEM 1')
print(item1)  # {'tag': 'B', 'html': 'ITEM 1. BUSINESS'}
# Here's what you might use
table = item1.getParent('TABLE')
trs = table.TRs
for tr in trs:
    print(tr.TDs)
If you use the latest version, you can use the following methods
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc

html = requests.get('https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002').text
doc = SimplifiedDoc(html)
item1 = doc.getElementByReg(r'ITEM\s+1')  # incoming regex
print(item1, item1.text)  # {'tag': 'B', 'html': 'ITEM\n 1. BUSINESS'} ITEM 1. BUSINESS
# Here's what you might use
table = item1.getParent('TABLE')
trs = table.TRs
for tr in trs:
    print(tr.TDs)

Python Scraping: How to separate multiple attributes in one cell (td)?

When scraping an HTML table, if a cell (td) in the table has multiple attributes (See HTML snippet for example) how can you separate the two and/or how could you select just one?
HTML snippet:
<td class="playerName md align-left pre in post" style="display: table-cell;"><span ...</span>
<a role="button" class="full-name">Dustin Johnson</a>
<a role="button" class="short-name">D. Johnson</a></td>
Code I'm trying:
import requests
import bs4

url = 'http://www.espn.com/golf/leaderboard?tournamentId=3742'
req = requests.get(url)
soup = bs4.BeautifulSoup(req.text, 'lxml')
table = soup.find(id='leaderboard-view')
headings = [th.get_text() for th in table.find('tr').find_all('th')]
dataset = []
for row in table.find_all('tr'):
    a = [td.get_text() for td in row.find_all('td')]
    dataset.append(a)
Any advice on how to either a) select just one of the names, or b) separate the cell in to two cells would be appreciated.
Thank you.
If you want the full name and short name, you can try this:
for td in row.find_all('td'):
    full_name = td.find('a', {'class': 'full-name'}).text
    short_name = td.find('a', {'class': 'short-name'}).text
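A self-contained version against the snippet from the question (the td markup is abbreviated here, with the leading span dropped):

```python
import bs4

html = '''<td class="playerName md align-left pre in post">
<a role="button" class="full-name">Dustin Johnson</a>
<a role="button" class="short-name">D. Johnson</a></td>'''

td = bs4.BeautifulSoup(html, 'html.parser').td
# each name lives in its own <a>, distinguished by its class attribute
full_name = td.find('a', {'class': 'full-name'}).text
short_name = td.find('a', {'class': 'short-name'}).text
print(full_name)   # Dustin Johnson
print(short_name)  # D. Johnson
```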
Try using a regex to match the tr:
players = the_soup.findAll('tr', {'class': re.compile("player-overview")})
for p in players:
    name = p.find('a', {'class': 'full-name'}).get_text()

Using requests and Beautifulsoup to find text in page (With CSS)

I'm doing a request to a webpage and I'm trying to retrieve some text on it. The text is split up with span tags like this:
<span class="ed">This</span>
<span class="1">is</span>
<span class="12">jvgviehrgjfne</span>
<span class="dfe">my</span>
<span class="fd">gt4ugirdfgr</span>
<span class="df">string</span>
There are "inline style sheets" (CSS sheets) that says if we have to print or not the text to the screen and thus, not print the gibberish text on the screen. This is an example of 1 of the sheet:
.ed{display:inline}
.1{display:inline}
.12{display:none}
.dfe{display:inline}
.fd{display:none}
.df{display:inline}
but there are more CSS files like this. So I don't know if there is a better way to achieve my goal (print the text that shows on screen, without the gibberish that is not displayed).
My script is able to print the text, but all of it (gibberish included), like this: "This is jvgviehrgjfne my gt4ugirdfgr string"
If I understood you right, what you should do is parse the CSS files with a regex for class names associated with display:inline and provide the results to the Beautiful Soup API. Here is a way:
import re
import bs4
page_txt = """
<span class="ed">This</span>
<span class="1">is</span>
<span class="12">jvgviehrgjfne</span>
<span class="dfe">my</span>
<span class="fd">gt4ugirdfgr</span>
<span class="df">string</span>
"""
css_file_read_output = """
.ed{display:inline}
.1{display:inline}
.12{display:none}
.dfe{display:inline}
.fd{display:none}
.df{display:inline}"""
css_file_lines = css_file_read_output.splitlines()
css_lines_text = []
for line in css_file_lines:
    inline_search = re.search(".*inline.*", line)
    if inline_search is not None:
        inline_group = inline_search.group()
        class_name_search = re.search(r"\..*\{", inline_group)
        class_name_group = class_name_search.group()
        class_name_group = class_name_group[1:-1]  # getting rid of the leading . and trailing {
        css_lines_text.append(class_name_group)

page_bs = bs4.BeautifulSoup(page_txt, "lxml")
wanted_text_list = []
for line in css_lines_text:
    wanted_line = page_bs.find("span", class_=line)
    wanted_text = wanted_line.get_text(strip=True)
    wanted_text_list.append(wanted_text)
wanted_string = " ".join(wanted_text_list)

Python + BeautifulSoup - Limiting text extraction on a specific table (multiple tables on a webpage)

Hello all… I am trying to use BeautifulSoup to pick up the content of "Date of Employment:" on a webpage. The webpage contains 5 similar tables that look like below.
<table class="table1"><thead><tr><th style="width: 140px;" class="CII">Design Team</th><th class="top">Top</th></tr></thead><tbody><tr><td style="width:20px;">Designer:</td><td>Michael Linnen</td></tr>
<tr><td style="width:20px;">Date of Employment:</td><td>07 Jan 2012</td></tr>
<tr><td style="width:20px;">No of Works:</td><td>6</td></tr>
<tr><td style="width: 15px">No of teams:</td><td vAlign="top">2<br>Combined</td></tr>
<table class="table1"><thead><tr><th style="width: 140px;" class="CII">Operation Team</th><th class="top">Top</th></tr></thead><tbody><tr><td style="width:20px;">Manager:</td><td>Nich Sharmen</td></tr>
<tr><td style="width:20px;">Date of Employment:</td><td>02 Nov 2005</td></tr>
<tr><td style="width:20px;">Zones:</td><td>6</td></tr>
<tr><td style="width: 15px">No of teams:</td><td vAlign="top">2<br>Combined</td></tr>
The text I want is in the 3rd table, whose table header is "Design Team".
I am using the code below:
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
aa = soup.find_all(text=re.compile("Date of Employment:"))
bb = aa[2].findNext('td')
print bb.text
the problem is that, the “Date of Employment:” in this table sometimes is not available. when it's not there, the code picks the "Date of Employment:" in the next table.
How do I restrict my code to pick only the wanted ones in the table named “Design Team”? thanks.
Rather than finding all the "Date of Employment:" occurrences and calling findNext, you can directly find the table whose th is "Design Team":
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
aa = soup.find("th", text="Design Team")
nexttr = aa.find_parent("tr").find_next("tr")
if nexttr.td.text == "Date of Employment:":
    print nexttr.td.next_sibling.text
else:
    print "No Date of Employment:"
soup.find("th", text="Design Team") finds the header cell (find_all would return a list, which cannot be navigated from).
aa.find_parent("tr").find_next("tr") climbs to the header's row and then moves to the next tr within the table.
if nexttr.td.text == "Date of Employment:": ensures that the text within the first td of that tr really is "Date of Employment:"
nexttr.td.next_sibling extracts the td immediately following the "Date of Employment:" cell, and printing its .text prints the date.
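Alternatively, here is a self-contained sketch against a trimmed version of the question's snippet: it scopes the lookup to the table whose header is "Design Team" and handles the case where the row is missing (the "sometimes not available" problem):

```python
from bs4 import BeautifulSoup

html = """<table class="table1"><thead><tr><th>Design Team</th></tr></thead><tbody>
<tr><td>Designer:</td><td>Michael Linnen</td></tr>
<tr><td>Date of Employment:</td><td>07 Jan 2012</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
# Find the header cell, climb to its table, then search only inside that table.
header = soup.find("th", string="Design Team")
table = header.find_parent("table")
label = table.find("td", string="Date of Employment:")
if label is not None:
    print(label.find_next_sibling("td").text)  # 07 Jan 2012
else:
    print("No Date of Employment:")
```

Because the search starts from the matched table, a missing "Date of Employment:" row can never pick up the value from a neighbouring table.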
