Extracting a text section from (EDGAR 10-K filings) HTML - Python

I am trying to extract a certain section from HTML files. To be specific, I am looking for the "ITEM 1" section of 10-K filings (a US business report of a certain company). E.g.:
https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002
Problem: I am not able to find the "ITEM 1" section, nor do I have an idea how to tell my algorithm to search from that point ("ITEM 1") to another point (e.g. "ITEM 1A") and extract the text in between.
I am super thankful for any help.
Among others, I have tried this (and similar variants), but my bd is always empty:
try:
    # bd = soup.body.findAll(text=re.compile('^ITEM 1$'))
    # bd = soup.find_all(name="ITEM 1")
    # bd = soup.find_all(["ITEM 1", "ITEM1", "Item 1", "Item1", "item 1", "item1"])
    print(" Business Section (Item 1): ", bd.content)
except:
    print("\n Section not found!")
Using Python 3.7 and Beautifulsoup4
Regards Heka

As I mentioned in a comment, because of the nature of EDGAR, this may work on one filing but fail on another. The principles, though, should generally work (after some adjustments...)
import requests
import lxml.html

url = 'https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002'
source = requests.get(url)
doc = lxml.html.fromstring(source.text)
tabs = doc.xpath('//table[./tr/td/font/a[@name="a_002"]]/following-sibling::p/font')
# in this filing, Item 1 is hiding in a series of <p> tags following a table with an
# <a> tag whose "name" attribute has a value of "a_002"

flag = ''
for i in tabs:
    if flag == 'stop':
        break
    if i.text is not None:  # extract the text from each <p> tag and move to the next
        print(i.text_content().strip().replace('\n', ''))
    nxt = i.getparent().getnext()
    # the following detects when the <p> tags of Item 1 end and the next Item begins, then stops
    if nxt is not None and nxt.tag == 'table':
        for j in nxt.iterdescendants():
            if j.tag == 'a' and j.values()[0] == 'a_003':
                # we have encountered the <a> tag whose "name" attribute has a value of
                # "a_003", indicating the beginning of the next Item; so we stop
                flag = 'stop'
The output is the text of Item 1 in this filing.
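If all you need is the text between the "ITEM 1" and "ITEM 1A" headings, a cruder but more portable sketch is to flatten the document to plain text and slice between the two markers. The regexes here are illustrative and will need tuning per filing; the table of contents usually matches as well, so the last occurrence is taken rather than the first:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# get_text() flattens nested tags, which is why exact-match searches like
# find_all(name="ITEM 1") fail on the raw markup but can work on this string
text = soup.get_text(' ', strip=True)

# illustrative patterns; heading punctuation/whitespace varies between filings
starts = list(re.finditer(r'ITEM\s+1\.?\s*BUSINESS', text, re.I))
ends = list(re.finditer(r'ITEM\s+1A\.?\s*RISK\s+FACTORS', text, re.I))
if starts and ends:
    item1 = text[starts[-1].end():ends[-1].start()]
    print(item1[:500])
else:
    print('Markers not found')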

There are special (whitespace) characters inside the headings. Normalize them first:
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc

html = requests.get('https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002').text
doc = SimplifiedDoc(html)
doc.loadHtml(doc.replaceReg(doc.html, 'ITEM[\s]+', 'ITEM '))
item1 = doc.getElementByText('ITEM 1')
print(item1)  # {'tag': 'B', 'html': 'ITEM 1. BUSINESS'}

# Here's what you might use
table = item1.getParent('TABLE')
trs = table.TRs
for tr in trs:
    print(tr.TDs)
If you use the latest version, you can use the following method:
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc

html = requests.get('https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002').text
doc = SimplifiedDoc(html)
item1 = doc.getElementByReg('ITEM[\s]+1')  # pass in a regex
print(item1, item1.text)  # {'tag': 'B', 'html': 'ITEM\n 1. BUSINESS'} ITEM 1. BUSINESS

# Here's what you might use
table = item1.getParent('TABLE')
trs = table.TRs
for tr in trs:
    print(tr.TDs)

Related

I want to scrape all the text (headings, bullets, paragraphs) from an article, except some <p> tags at the start and the end of the article

I want to scrape the article from this site:
https://www.traveloffpath.com/covid-19-travel-insurance-everything-you-need-to-know/
and https://www.traveloffpath.com/what-to-do-if-your-flight-is-delayed-or-canceled/?swcfpc=1
I am stuck on the "p" tags, because I don't want the "p" tags from the start and the end of the article: the "Share the article" and "Last updated" paragraphs, and some "p" tags from the bottom text that are not part of the article.
Articletext = soup.find(class_="article")
for items in soup.find_all(class_="article"):
    Gather = '\n'.join([item.text for item in items.find_all(["h6","h5","h4","h3","h2","h1","p","li"])])
    filtered = Gather.split("↓ Join the community ↓")
    Content = filtered[0].split("Email")
    while True:
        try:
            Content = filtered[0].split("Email")
        except:
            Content = Content[1].split("ago")
        else:
            break
    # try:
    #     Content = filtered[0].split("Email")
    # except:
    #     Content = filtered[0].split("ago")
    # Content = re.split('ago | Read More:', Gather)
    print("Content: ", Content[1])
You could filter within the list comprehension and then find where to slice off the unwanted parts at the end:
for items in soup.select('article.article'):
    tags = [
        t for t in items.find_all(["h6","h5","h4","h3","h2","h1","p","li"])
        if not (t.name in ['p', 'li'] and (
            ('class' in t.attrs and t.attrs['class']) or
            ('id' in t.attrs and t.attrs['id'])
        ))
    ]  # filters out "Share..." and "Last Updated..."
    tLen = len(tags)
    for i in list(range(tLen))[::-1]:  # counting down from the last tag
        if tags[i].name == 'h3':
            tags = tags[:i]
            break
    articleText = '\n'.join([t.text for t in tags])
    print(articleText)
With that, you'll be able to get rid of the paragraph with the list of links for further reading. If you want everything up to just before the "↓ Join the community ↓" part, as in your code, change the condition to if tags[i].name == 'h5': instead of h3. And if you want to go all the way to the end, skipping only the "subscribe..." section, change that if block to:
if tags[i].name == 'h5':
    tags = tags[:i] + tags[i+1:]
    break
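Putting it together, a minimal runnable sketch, assuming the page still serves the article.article structure used above:

import requests
from bs4 import BeautifulSoup

url = 'https://www.traveloffpath.com/covid-19-travel-insurance-everything-you-need-to-know/'
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content
soup = BeautifulSoup(html, 'html.parser')

for items in soup.select('article.article'):
    # keep only "plain" p/li tags; the "Share..." and "Last Updated..."
    # paragraphs carry a class or id, so they get filtered out
    tags = [
        t for t in items.find_all(["h6", "h5", "h4", "h3", "h2", "h1", "p", "li"])
        if not (t.name in ['p', 'li'] and (t.get('class') or t.get('id')))
    ]
    # drop everything from the last h3 onward (the "Read More:" link list)
    for i in range(len(tags) - 1, -1, -1):
        if tags[i].name == 'h3':
            tags = tags[:i]
            break
    print('\n'.join(t.text for t in tags))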

How to fetch/scrape all elements from a html "class" which is inside "span"?

I am trying to scrape data from a website, collecting data from all elements under a "class" which is inside a "span", using this piece of code. But I end up fetching only one element instead of all.
expand_hits = soup.findAll("a", {"class": "sold-property-listing"})
apartments = []
for hit_property in expand_hits:
    #element = soup.findAll("div", {"class": "sold-property-listing__location"})
    place_name = expand_hits[1].find("div", {"class": "sold-property-listing__location"}).findAll("span", {"class": "item-link"})[1].getText()
    print(place_name)
    apartments.append(final_str)
Expected result for print(place_name)
Stockholm
Malmö
Copenhagen
...
..
.
The result which is am getting for print(place_name)
Malmö
Malmö
Malmö
...
..
.
When I try to fetch the contents from expand_hits[1] I get only one element. If I don't specify the index, the scraper throws an error about the usage of find(), find_all() and findAll(). As far as I understand, I have to iterate over the contents of the elements.
Any help is much appreciated.
Thanks in Advance!
Use the loop variable rather than indexing into the same collection with the same index (expand_hits[1]), and append place_name, not final_str:
expand_hits = soup.findAll("a", {"class": "sold-property-listing"})
apartments = []
for hit_property in expand_hits:
    place_name = hit_property.find("div", {"class": "sold-property-listing__location"}).find("span", {"class": "item-link"}).getText()
    print(place_name)
    apartments.append(place_name)
You then only need find() and no indexing.
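Equivalently, a CSS-selector sketch of the same fix (class names as in your code; note the guard, since a listing may lack the span, as the next answer points out):

apartments = []
for hit in soup.select('a.sold-property-listing'):
    span = hit.select_one('div.sold-property-listing__location span.item-link')
    if span:  # some listings have no item-link span
        apartments.append(span.get_text(strip=True))
print(apartments)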
Add a User-Agent header to ensure results. Also, note that I have to pick a parent node, because at least one result will not be captured by using the class item-link, e.g. Övägen 6C. I use replace to get rid of the hidden text that comes along with selecting the parent node.
from bs4 import BeautifulSoup
import requests
import re

url = "https://www.hemnet.se/salda/bostader?location_ids%5B%5D=474035"
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.content, 'html.parser')

for result in soup.select('.sold-results__normal-hit'):
    location = result.select_one('.sold-property-listing__location h2 + div').text
    hidden = result.select_one('.hide-element').text.strip()
    print(re.sub(r'\s{2,}', ' ', location).replace(hidden, ''))
If you only want the area within Malmö, e.g. Limhamns Sjöstad, you need to check how many child span tags there are for each listing:
for result in soup.select('.sold-results__normal-hit'):
    nodes = result.select('.sold-property-listing__location h2 + div span')
    if len(nodes) == 2:
        place = nodes[1].text.strip()
    else:
        place = 'not specified'
    print(place)

Unable to Locate and Delete Items inside a Table using bs4

Hello everyone, I am facing a situation where I need to locate and remove items from a table on a website (https://jobassam.in/indian-army-recruitment-rally-mariani-jorhat/).
How do I remove every <tr> below the text OUR SOCIAL CONNECTIONS, including that text itself?
This is the code I wrote to locate these elements and delete them:
getDetails = soup2.find('div', class_='entry-content single-content')
getTables = getDetails.find_all('table')
for i in getTables:
    print(i.tr)
    i.decompose()
But the issue is that these elements are still there. How do I achieve this? Please guide. Thanks
You can simply check the index of the text you want to remove and start decomposing it from there.
So first let's get the table we're interested in:
working_table = next((table for table in getTables if table.text.find("OUR SOCIAL CONNECTIONS") != -1), None)
Then let's iterate over all the tr elements in it to find the index where we cut off:
tables_itens = working_table.find_all('tr')
cut_line = [tables_itens.index(i) for i in tables_itens if i.text == "OUR SOCIAL CONNECTIONS"][0]
Then we start decomposing from there:
for tr in tables_itens:
    if tables_itens.index(tr) >= cut_line:
        tr.decompose()
We can simply check this with:
for i in tables_itens:
    print(i.text)
And it will output:
Download Official NotificationClick Here
Online Application LinkClick Here
Visit Official WebsiteClick Here
Full code:
getDetails = soup2.find('div', class_='entry-content single-content')
getTables = getDetails.find_all('table')
working_table = next((table for table in getTables if table.text.find("OUR SOCIAL CONNECTIONS") != -1), None)
tables_itens = working_table.find_all('tr')
cut_line = [tables_itens.index(i) for i in tables_itens if i.text == "OUR SOCIAL CONNECTIONS"][0]
for tr in tables_itens:
    if tables_itens.index(tr) >= cut_line:
        tr.decompose()
for i in tables_itens:
    print(i.text)
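A slightly tidier variant of the same cut-off logic, using enumerate instead of repeated index() calls (same assumptions about the marker text):

tables_itens = working_table.find_all('tr')
# index of the first row whose text is the marker, or None if absent
cut_line = next((idx for idx, tr in enumerate(tables_itens)
                 if tr.text == "OUR SOCIAL CONNECTIONS"), None)
if cut_line is not None:
    # remove the marker row and everything after it
    for tr in tables_itens[cut_line:]:
        tr.decompose()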

How do I get the first 3 sentences of a webpage in python?

I have an assignment where one of the things I can do is find the first 3 sentences of a webpage and display them. Finding the webpage text is easy enough, but I'm having problems figuring out how to find the first 3 sentences.
import requests
from bs4 import BeautifulSoup

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)

output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script'
]

for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

tempout = output.split('.')
for i in range(tempout):
    if i >= 3:
        tempout.remove(i)
output = '.'.join(tempout)
print(output)
Finding sentences out of text is difficult. Normally you would look for characters that might complete a sentence, such as '.' and '!'. But a period ('.') could appear in the middle of a sentence as in an abbreviation of a person's name, for example. I use a regular expression to look for a period followed by either a single space or the end of the string, which works for the first three sentences, but not for any arbitrary sentence.
import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
paragraphs = soup.select('section.article_text p')

sentences = []
for paragraph in paragraphs:
    matches = re.findall(r'(.+?[.!])(?: |$)', paragraph.text)
    needed = 3 - len(sentences)
    found = len(matches)
    n = min(found, needed)
    for i in range(n):
        sentences.append(matches[i])
    if len(sentences) == 3:
        break
print(sentences)
Prints:
['Many people will land on this page after learning that their email address has appeared in a data breach I\'ve called "Collection #1".', "Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.", "Let's start with the raw numbers because that's the headline, then I'll drill down into where it's from and what it's composed of."]
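If third-party libraries are allowed, a sentence tokenizer is more robust than a hand-rolled regex. A sketch with NLTK (assumes nltk is installed and its punkt model has been downloaded once):

import nltk
import requests
from bs4 import BeautifulSoup

# nltk.download('punkt')  # one-time download of the sentence model

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
text = ' '.join(p.text for p in soup.select('section.article_text p'))
print(nltk.sent_tokenize(text)[:3])  # first three sentences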
To scrape the first three sentences, just add these lines to your code:
section = soup.find('section', class_="article_text post")  # finds the section tag with class "article_text post"
txt = section.p.text  # gets the text within the first p tag of that section
print(txt)
Output:
Many people will land on this page after learning that their email address has appeared in a data breach I've called "Collection #1". Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.
Hope this helps!
Actually, using BeautifulSoup you can filter by the class "article_text post", as seen in the source code:
myData=soup.find('section',class_ = "article_text post")
print(myData.p.text)
And get the inner text of the p element.
Use this instead of soup = BeautifulSoup(html_page, 'html.parser')

Webscraping Issue w/ BeautifulSoup

I am new to Python web scraping, and I am scraping productreview.com for reviews. The following code pulls all the data I need for a single review:
#Scrape TrustPilot for User Reviews (Rating, Comments)
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
import json
import requests
import datetime as dt

final_list = []
url = 'https://www.productreview.com.au/listings/world-nomads'
r = requests.get(url)
soup = bs(r.text, 'lxml')

for div in soup.find('div', class_='loadingOverlay_24D'):
    try:
        name = soup.find('h4', class_='my-0_27D align-items-baseline_kxl flex-row_3gP d-inline-flex_1j8 text-muted_2v5')
        name = name.find('span').text
        location = soup.find('h4').find('small').text
        policy = soup.find('div', class_='px-4_1Cw pt-4_9Zz pb-2_1Ex card-body_2iI').find('span').text
        title = soup.find('h3').find('span').text
        content = soup.find('p', class_='mb-0_2CX').text
        rating = soup.find('div', class_='mb-4_2RH align-items-center_3Oi flex-wrap_ATH d-flex_oSG')
        rating = rating.find('div')['title']
        final_list.append([name, location, policy, rating, title, content])
    except AttributeError:
        pass

reviews = pd.DataFrame(final_list, columns=['Name', 'Location', 'Policy', 'Rating', 'Title', 'Content'])
print(reviews)
But when I edit
for div in soup.find('div', class_ = 'loadingOverlay_24D'):
to
for div in soup.findAll('div', class_ = 'loadingOverlay_24D'):
I don't get all reviews, I just get the same entry looped over and over.
Any help would be much appreciated.
Thanks!
Issue 1: Repeated data inside the loop
Your loop has the following form:
for div in soup.find('div', ...):
    name = soup.find('h4', ...)
    policy = soup.find('div', ...)
    ...
Notice that you are calling find inside the loop for the soup object. This means that each time you try to find the value for name, it will search the whole document from the beginning and return the first match, in every iteration.
This is why you are getting the same data over and over.
To fix this, you need to call find on the current review div you are iterating over. That is:
for div in soup.find('div', ...):
    name = div.find('h4', ...)
    policy = div.find('div', ...)
    ...
Issue 2: Missing data and error handling
In your code, any errors inside the loop are ignored. However, there are many errors that are actually happening while parsing and extracting the values. For example:
location = div.find('h4').find('small').text
Not all reviews have location information. Hence, the code will extract h4, then try to find small, but won't find any, returning None. Then you are calling .text on that None object, causing an exception. Hence, this review will not be added to the result data frame.
To fix this, you need to add more error checking. For example:
locationDiv = div.find('h4').find('small')
if locationDiv:
    location = locationDiv.text
else:
    location = ''
Issue 3: Identifying and extracting data
The page you're trying to parse has broken HTML, and uses CSS classes that seem random or at least inconsistent. You need to find the correct and unique identifiers for the data that you are extracting such that they strictly match all the entries.
For example, you are extracting the review-container div using CSS class loadingOverlay_24D. This is incorrect. This CSS class seems to be for a "loading" placeholder div or something similar. Actual reviews are enclosed in div blocks that look like this:
<div itemscope="" itemType="http://schema.org/Review" itemProp="review">
....
</div>
Notice that the uniquely identifying property is the itemProp attribute. You can extract those div blocks using:
soup.find_all('div', {'itemprop': 'review'})
Similarly, you have to find the correct identifying properties of the other data you want to extract to ensure you get all your data fully and correctly.
One more thing, when a tag has more than one CSS class, usually only one of them is the identifying property you want to use. For example, for names, you have this:
name = soup.find('h4', class_ = 'my-0_27D align-items-baseline_kxl flex-row_3gP d-inline-flex_1j8 text-muted_2v5')
but in reality, you don't need all these classes. The first class, in this case, is sufficient to identify the name h4 blocks:
name = soup.find('h4', class_ = 'my-0_27D')
Example:
Here's an example to extract the author names from review page:
for div in soup.find_all('div', {'itemprop': 'review'}):
    name = div.find('h4', class_='my-0_27D')
    if name:
        name = name.find('span').text
    else:
        name = '-'
    print(name)
Output:
Aidan
Bruno M.
Ba. I.
Luca Evangelista
Upset
Julian L.
Alison Peck
...
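Putting the three fixes together, a sketch that collects a couple of the fields with guards (the CSS class names are machine-generated and may change, so treat them as assumptions):

final_list = []
for div in soup.find_all('div', {'itemprop': 'review'}):
    # Issue 1: search inside the review div, not the whole soup
    name_tag = div.find('h4', class_='my-0_27D')
    # Issue 2: guard against missing tags instead of swallowing exceptions
    name = name_tag.find('span').text if name_tag else '-'
    location_tag = div.find('h4')
    small = location_tag.find('small') if location_tag else None
    location = small.text if small else ''
    final_list.append([name, location])
print(final_list)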
The page serves broken HTML, and html.parser is better at dealing with it.
Change soup = bs(r.text, 'lxml') to soup = bs(r.text, 'html.parser').
