Very slow loading time for parsing annual reports - python

I am trying to scrape the SEC Edgar S&P500 annual reports with Python 3 and receive very slow loading time with some links. My current code works well for most of the report links, but returns only half of the website content for other links (e.g., the link below).
Is there a way around this? I am happy if the result is a text file without any weird html characters and only contains all text for the "end-user"
# import libraries
from simplified_scrapy import SimplifiedDoc,req,utils
# define the url to specific html_text file
new_html_text = r"https://www.sec.gov/Archives/edgar/data/718877/000104746919000788/0001047469-19-000788.txt"
html = req.get(new_html_text)
doc = SimplifiedDoc(html)
textfile = doc.body.text
textfile = doc.body.unescape() # Converting HTML entities
utils.saveFile("test.txt", textfile)

I found that your data contains multiple bodies. I'm sorry I didn't notice that before. See if the following code works.
from simplified_scrapy import SimplifiedDoc,req,utils
# define the url to specific html_text file
new_html_text = r"https://www.sec.gov/Archives/edgar/data/718877/000104746919000788/0001047469-19-000788.txt"
html = req.get(new_html_text,timeout=300) # Add timeout
doc = SimplifiedDoc(html)
texts = []
bodys = doc.selects('body|BODY') # Get all
for body in bodys:
texts.append(body.unescape()) # Converting HTML entities
utils.saveFile("test.txt", "\n".join(texts))

Related

Accessing text data in web-hosted GIS Map (ESRI) via python

I would like to interact with a web-hosted GIS map application here to scrape data contained therein. The data is behind a toggle button.
Normally, creating a soup of the websites text via BeautifulSoup and requests.get() suffices to where the text data is parse-able, however this method returns some sort of esri script, and none of the desired html or text data.
Snapshot of the website with desired element inspected:
Snapshot of the button toggled, showing the text data I'd like to scrape:
The code's first mis(steps):
import requests
from bs4 import BeautifulSoup
site = 'https://dwrapps.utah.gov/fishing/fStart'
soup = BeautifulSoup(requests.get(site).text.lower(), 'html.parser')
The return of said soup is too lengthy to post here, but there is no way to access the html data behind the toggle shown above.
I assume use of selenium would do the trick, but was curious if there was an easier method of interacting directly with the application.
the site is get json from https://dwrapps.utah.gov/fishing/GetRegionReports (in the js function getForecastData)
so you can use it in requests:
from json import dump
from typing import List
import requests
url = "https://dwrapps.utah.gov/fishing/GetRegionReports"
json:List[dict] = requests.get(url).json()
with open("gis-output.json","w") as io:
dump(json,io,ensure_ascii=False,indent=4) # export full json from to the filename gis-output.json
for dt in json:
reportData = dt.get("reportData",'') # the full text
displayName = dt.get("displayName",'')
# do somthing with the data.
"""
you can acsses also this fields:
regionAdm = dt.get("regionAdm",'')
updateDate = dt.get("updateDate",'')
dwrRating = dt.get("dwrRating",'')
ageDays = dt.get("ageDays",'')
publicCount = dt.get("publicCount",'')
finalRating = dt.get("finalRating",'')
lat = dt.get("lat",'')
lng = dt.get("lng",'')
"""

Although I have the file open in the same folder as my code, it wont execute

A small disclaimer for all, this is my first language for programming and I am still getting used to it, so any suggestions are recommended.
The problem that was given is as follows:
Scraping Numbers from HTML using BeautifulSoup In this assignment you will write a Python program similar to http://www.py4e.com/code3/urllink2.py. The program will use urllib to read the HTML from the data files below, and parse the data, extracting numbers and compute the sum of the numbers in the file.
We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.
Sample data: http://py4e-data.dr-chuck.net/comments_42.html (Sum=2553)
Actual data: http://py4e-data.dr-chuck.net/comments_97465.html (Sum ends with 19)
You do not need to save these files to your folder since your program will read the data directly from the URL. Note: Each student will have a distinct data url for the assignment - so only use your own data url for analysis.
Data Format
The file is a table of names and comment counts. You can ignore most of the data in the file except for lines like the following:
<tr><td>Modu</td><td><span class="comments">90</span></td></tr>
<tr><td>Kenzie</td><td><span class="comments">88</span></td></tr>
<tr><td>Hubert</td><td><span class="comments">87</span></td></tr>
You are to find all the tags in the file and pull out the numbers from the tag and sum the numbers.
Look at the sample code provided. It shows how to find all of a certain kind of tag, loop through the tags and extract the various aspects of the tags.
...
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
# Look at the parts of a tag
print 'TAG:',tag
print 'URL:',tag.get('href', None)
print 'Contents:',tag.contents[0]
print 'Attrs:',tag.attrs
You need to adjust this code to look for span tags and pull out the text content of the span tag, convert them to integers and add them up to complete the assignment.
I have written this:
import urllib
from bs4 import BeautifulSoup
page = urllib.urlopen(input("Enter URL: "))
soup = BeautifulSoup(page, "html.parser")
spans = soup('span')
numbers = []
for span in spans:
numbers.append(int(span.string))
print (sum(numbers))
The error message is that bs4 is not a module even though it is and it is not asking me for the url and not giving me output. I have no clue on how to solve it.
import urllib.request
import urllib.parse
import urllib.error
from bs4 import BeautifulSoup
import ssl
ctx= ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter -')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
numbers = []
spans = soup('span')
for span in spans:
numbers.append(int(span.string))
print (sum(numbers))
Use this code, works for me.

How to extract text from a webpage using python 2.7?

I'm trying to programmatically extract text from this webpage which describes a genome assembly in the public archive:
http://www.ebi.ac.uk/ena/data/view/ERS019623
I have thousands of assemblies that I want to track down and extract the study accession, which is the code on the far left of the table beginning with "PRJ". The URL for each of these assemblies is of the same format as the one above, i.e. "http://www.ebi.ac.uk/ena/data/view/ERS******". I have the ERS code for each of my assemblies so I can construct the URL for each one.
I've tried a few different methods, firstly if you add "&display=XML" to the end of the URL it prints the XML (or at least I'm presuming that it's printing the XML for the entire page, because the problem is that the study accession "PRJ******" is no where to be seen here). I had utilised this to extract another code that I needed from the same webpage, the run accession which is always of the format "ERR******" using the below code:
import urllib2
from bs4 import BeautifulSoup
import re
import csv
with open('/Users/bj5/Desktop/web_scrape_test.csv','rb') as f:
reader = csv.reader(f) #opens csv containig list of ERS numbers
for row in reader:
sample = row[0] #reads index 0 (1st row)
ERSpage = "http://www.ebi.ac.uk/ena/data/view/" + sample + "&display=xml" #creates URL using ERS number from 1st row
page = urllib2.urlopen(ERSpage) #opens url and assigns it to variable page
soup = BeautifulSoup(page, "html.parser") #parses the html/xml from page and assigns it to variable called soup
page_text = soup.text #returns text from variable soup, i.e. no tags
ERS = re.search('ERS......', page_text, flags=0).group(0) #returns first ERS followed by six wildcards
ERR = re.search('ERR......', page_text, flags=0).group(0) #retursn first ERR followed by six wildcards
print ERS + ',' + ERR + ',' + "http://www.ebi.ac.uk/ena/data/view/" + sample #prints ERS,ERR,URL
This worked very well, but as the study accession is not in the XML I can't use it to access this.
I also attempted to use BeautifulSoup again to download the HTML by doing this:
from bs4 import BeautifulSoup
from urllib2 import urlopen
BASE_URL = "http://www.ebi.ac.uk/ena/data/view/ERS019623"
def get_category_links(section_url):
html = urlopen(section_url).read()
soup = BeautifulSoup(html, "lxml")
print soup
get_category_links(BASE_URL)
But again I can't see the study accession in the output from this either...
I have also attempted to use a different python module, lxml, to parse the XML and HTML but haven't had any luck there either.
When I right click and inspect element on the page I can find the study accession by doing ctrl+F -> PRJ.
So my question is this: what is the code that I'm looking at in inspect element, XML or HTML (or something else)? Why does it look different to the code that prints in my console when I try and use BeautifulSoup to parse HTML? And finally how can I scrape the study accessions (PRJ******) from these webpages?
(I've only been coding for a couple of months and I'm entirely self-taught so apologies for the slightly confused nature of this question but I hope I've got across what it is that I'm trying to do. Any suggestions or advice would be much appreciated.)
from bs4 import BeautifulSoup
import requests
import re
r = requests.get('http://www.ebi.ac.uk/ena/data/view/ERS019623&display=xml')
soup = BeautifulSoup(r.text, 'lxml')
ERS = soup.find('primary_id').text
ERR = soup.find('id', text=re.compile(r'^ERR')).text
url = 'http://www.ebi.ac.uk/ena/data/view/{}'.format(ERS)
print(ERS, ERR, url)
out:
ERS019623 ERR048142 http://www.ebi.ac.uk/ena/data/view/ERS019623
bs4 can parse xml file, just treat it like html, they are all the same, so their is no need to use regex to extract info.
i find a TEXT download link:
http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=ERS019623&result=read_run&fields=study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,tax_id,scientific_name,instrument_model,library_layout,fastq_ftp,fastq_galaxy,submitted_ftp,submitted_galaxy,sra_ftp,sra_galaxy,cram_index_ftp,cram_index_galaxy&download=txt
this link's fileds can be changed to get the data you want, like this:
http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=ERS019623&result=read_run&fields=study_accession&download=txt
by doing so, you can get all you data in a text file
In you sample soup is a BeautifulSoup object: a representation of the parsed document.
If you want to print the entire HTML of the document, you can call print(soup.prettify()) or if you want the text within it print(soup.get_text()).
The soup object has other possibilities to access parts of the document you are interested in: to navigate the parsed tree, to search in it ...

BeautifulSoup's "find" acting inconsistently (bs4)

I'm scraping the NFL's website for player statistics. I'm having an issue when parsing the web page and trying to get to the HTML table which contains the actual information I'm looking for. I successfully downloaded the page and saved it into the directory I'm working in. For reference, the page I've saved can be found here.
# import relevant libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("1998.html"))
result = soup.find(id="result")
print result
I found that at one point, I ran the code and result printed the correct table I was looking for. Every other time, it doesn't contain anything! I'm assuming this is user error, but I can't figure out what I'm missing. Using "lxml" returned nothing and I can't get html5lib to work (parsing library??).
Any help is appreciated!
First, you should read the contents of your file before passing it to BeautifulSoup.
soup = BeautifulSoup(open("1998.html").read())
Second, verify manually that the table in question exists in the HTML by printing the contents to screen. The .prettify() method makes the data easier to read.
print soup.prettify()
Lastly, if the element does in fact exist, the following will be able to find it:
table = soup.find('table',{'id':'result'})
A simple test script I wrote cannot reproduce your results.
import urllib
from bs4 import BeautifulSoup
def test():
# The URL of the page you're scraping.
url = 'http://www.nfl.com/stats/categorystats?tabSeq=0&statisticCategory=PASSING&conference=null&season=1998&seasonType=REG&d-447263-s=PASSING_YARDS&d-447263-o=2&d-447263-n=1'
# Make a request to the URL.
conn = urllib.urlopen(url)
# Read the contents of the response
html = conn.read()
# Close the connection.
conn.close()
# Create a BeautifulSoup object and find the table.
soup = BeautifulSoup(html)
table = soup.find('table',{'id':'result'})
# Find all rows in the table.
trs = table.findAll('tr')
# Print to screen the number of rows found in the table.
print len(trs)
This outputs 51 every time.

Checking webpage for results with python and beautifulsoup

I need to check a webpage search results and compare them to user input.
ui = raw_input() #for example "Niels Bohr"
link = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=90&k=10"
stranica=urllib.urlopen(link)
soup = BeautifulSoup(stranica, from_encoding="utf-8")
beauty = soup.prettify()
print beauty
since there is 1502 results, my idea was to change the k=10 to k=1502. Now I need some kind of function to check if search results contain my user input. I know that my names are the text after TEXT
so how to do it? maybe using regex?
the second part is if there are matching results to get the link of the results. Again, I know that link is inside that href="", but how to get it out and make it usable=
Finding if Niels Bohr is listed is as easy as using a large batch number and loading the resulting page:
import sys
import urllib2
from bs4 import BeautifulSoup
url = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=0&k={}".format(sys.maxint)
name = u'Bohr, Niels'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
for link in soup.find_all(class_='AllWordsTextHit', text=name):
print link
This produces any links that contain the text 'Bohr, Niels' as the link text. You can use a regular expression if you need a partial match.
The link object has a (relative) href attribute you can then use to load the next page:
professor_page = 'http://www.enciklopedija.hr/' + link['href']

Categories

Resources