Web scraping SEC filings - python

I am working on web scraping 10-Q documents from SEC EDGAR.
This is the URL: https://www.sec.gov/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm
If you inspect the page, you can find the element that contains the company address.
I need to extract "1600 Amphitheatre Parkway" without using the id. Below is a code snippet that extracts the text using the id attribute. However, I need to use the name attribute instead.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
page = session.get('https://www.sec.gov/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm')
soup = BeautifulSoup(page.content, 'html.parser')
content = soup.find(id="d92517213e644-wk-Fact-0B11263160365DBABCF89969352EE602")
print(content.text)
Instead of the id attribute, I would like to use the name attribute, but I am not able to extract the information using it. Please help.
See the HTML information:
How can I use the name attribute instead of the id attribute to extract the contents?
Thanks

You can find elements based on attribute values like this:
soup.find('html_tag', {"attribute": "value"})
In your case, the name attribute exists on the ix:nonnumeric tag:
content = soup.find('ix:nonnumeric', {"name": "dei:EntityAddressAddressLine1"})
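As a minimal offline sketch of the lookup above, the snippet below parses a small hand-written fragment instead of downloading the filing. The fragment, its id values and the variable names are assumptions made for illustration; only the tag name and the name attribute value come from the question. html.parser lowercases tag names, which is why the lookup uses ix:nonnumeric.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the inline XBRL markup in the filing
sample_html = """
<div>
  <ix:nonNumeric id="example-fact-id" name="dei:EntityAddressAddressLine1">1600 Amphitheatre Parkway</ix:nonNumeric>
  <ix:nonNumeric id="another-fact-id" name="dei:EntityAddressCityOrTown">Mountain View</ix:nonNumeric>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "name" is also the first parameter of find(), so the HTML attribute
# called name has to be passed through the attrs dictionary.
address_tag = soup.find("ix:nonnumeric", {"name": "dei:EntityAddressAddressLine1"})
address = address_tag.get_text(strip=True) if address_tag else None
print(address)
```

Checking for None before reading the text avoids an AttributeError when the element is missing from a particular filing.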

Related

How to use web scraping to get visible text on the webpage?

This is the link of the webpage I want to scrape:
https://www.tripadvisor.in/Restaurants-g494941-Indore_Indore_District_Madhya_Pradesh.html
I have also applied additional filters by clicking on the encircled heading.
This is how the webpage looks after clicking on the heading.
I want to get the names of all the places displayed on the webpage, but I am having trouble because the URL doesn't change when the filter is applied.
I am using Python's urllib for this.
Here is my code:
from urllib.request import urlopen
url = "https://www.tripadvisor.in/Hotels-g494941-Indore_Indore_District_Madhya_Pradesh-Hotels.html"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)
You can use bs4 (BeautifulSoup), a Python module for extracting data from web pages. This will get the text from the page:
from bs4 import BeautifulSoup as bs
soup = bs(html, features='html5lib')
text = soup.get_text()
print(text)
If you would like to get something other than the full text, such as elements with a certain tag, you can also use bs4:
soup.find_all('p')  # get all p tags
soup.find_all('p', class_='Title')  # get all p tags with a class of Title
Find out which tag and class the place names use, and then apply the calls above to get all the place names.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
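A small sketch of the tag-and-class lookup described above, run against hand-written markup. The class names (listing, place-name) and the place names are hypothetical; the real ones have to be read from the page source in the browser's inspector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real tag and class names must be read from the page
sample_html = """
<div class="listing"><a class="place-name" href="/r1">Cafe One</a></div>
<div class="listing"><a class="place-name" href="/r2">Cafe Two</a></div>
<p class="footer">Not a place</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Keep only the anchors that carry the class used for place names
place_names = [a.get_text(strip=True) for a in soup.find_all("a", class_="place-name")]
print(place_names)
```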

Scraping href which includes ads information

I would like to count how many ads there are in this website: https://www.lastampa.it/?refresh_ce
I am using BeautifulSoup to do this. I need to extract information from elements like the following:
<a id="aw0" target="_blank" href="https://googleads.g.doubleclick.net/pcs/click?xai=AKAOjssYz5VxTdwhxCBCrbtSi0dfGqGd25s7Ub6CCjsHLqd__OqfDKLyOWi6bKE3CL4XIJ0xDHy3ey-PGjm3_yVqTe0_IZ1g9AsvZmO1u8gciKpEKYMj1TIvl6KPivBuwgpfUDf8g2EvMyCD5r6tQ8Mx6Oa4G4yZoPYxFRN7ieFo7UbMr8FF2k6FL6R2qegawVLKVB5WHVAbwNQu4rVx4GE8KuxowGjcfecOnagp9uAHY2qiDE55lhdGqmXmuIEAK8UdaIKeRr6aBBVCR40LzY4&sig=Cg0ArKJSzEIRw7NDzCe7&adurl=https://track.adform.net/C/%3Fbn%3D38337867&nm=3&nx=357&ny=-4&mb=2" onfocus="ss('aw0')" onmousedown="st('aw0')" onmouseover="ss('aw0')" onclick="ha('aw0')"><img src="https://tpc.googlesyndication.com/simgad/5262715044200667305" border="0" width="990" height="30" alt="" class="img_ad"></a>
i.e. parts containing ads information.
The code that I am using is the following:
from bs4 import BeautifulSoup
import requests
from lxml import html
r = requests.get("https://www.lastampa.it/?refresh_ce")
soup = BeautifulSoup(r.content, "html.parser")
ads_div = soup.find('div')
if ads_div:
    for link in ads_div.find_all('a'):
        print(link['href'])
It does not scrape any information because I am considering the wrong tag/href. How could I get ads information in order to count how many ads there are in that webpage?
How about using a regular expression to match "googleads" and counting how many matches you get?
Recursively searching from the body gives you all the links in the whole page. If you want to search in a specific div you can supply parameters such as the class or id that you want to match as a dictionary.
You can filter the links once you obtain them.
body = soup.find('body')
if body:
    for link in body.find_all('a'):
        # .get avoids a KeyError for anchors that have no href attribute
        href = link.get('href', '')
        if "ad" in href:
            print(href)
When looking at the response that I get, I notice there are no ads at all. This could be because the ads are loaded by a script, which means they are never rendered and requests won't download them. To get around this, you can use a webdriver with Selenium. That should do the trick.
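The counting step can be sketched offline as follows, assuming the rendered HTML has already been obtained by some means (for example from a browser driver). The rendered_html string and its shortened URLs are made up for illustration; only the "googleads" pattern comes from the discussion above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical rendered markup with two ad links and two ordinary anchors
rendered_html = """
<body>
  <a href="https://googleads.g.doubleclick.net/pcs/click?xai=abc"><img class="img_ad"></a>
  <a href="https://www.example.com/news/article-1">News</a>
  <a>Anchor without href</a>
  <a href="https://googleads.g.doubleclick.net/pcs/click?xai=def"><img class="img_ad"></a>
</body>
"""

soup = BeautifulSoup(rendered_html, "html.parser")

# A compiled pattern passed as an attribute filter is applied with search(),
# so any href containing "googleads" matches; anchors without href are skipped.
ad_links = soup.find_all("a", href=re.compile(r"googleads"))
ad_count = len(ad_links)
print(ad_count)
```

Matching on "googleads" is narrower than testing for the substring "ad", which would also catch unrelated URLs that happen to contain those two letters.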

how to get title of youtube video using beautifulsoup?

I am new to Python and BeautifulSoup. I want to get the title and description of a video.
I am getting the description using this code:
import requests
from bs4 import BeautifulSoup
x='https://www.youtube.com/watch?v=NjG5ZwuY0Rc'
source = requests.get(x).text
soup = BeautifulSoup(source, 'lxml')
for p in soup.find_all('p', id='eow-description'):
    print(p.get_text('\n'))
How can I get title of the video?
To fetch any desired text from an HTML page:
1. Get the tag name by inspecting the element in the browser (in Chrome, right-click the page and choose Inspect), if it is not already known.
2. Get the id of the desired tag.
Once you have the details from steps 1 and 2, it is easy to get the text of that tag using get_text:
for title in soup.find_all('span', id="eow-title"):
    print(title.get_text('\n'))
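The same id-based lookup can be tried offline on a hand-written fragment. The markup below is an assumption that mirrors the ids mentioned in this thread (eow-title and eow-description); whether the live page still serves them has to be checked in the inspector.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment using the ids mentioned in the question and answer
sample_html = """
<h1><span id="eow-title" title="Example video">
    Example video
</span></h1>
<p id="eow-description">First line<br>Second line</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# An id is unique within a page, so find() is enough; guard against a miss
title_tag = soup.find(id="eow-title")
title = title_tag.get_text(strip=True) if title_tag else None

description_tag = soup.find("p", id="eow-description")
description = description_tag.get_text("\n") if description_tag else None

print(title)
print(description)
```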

Why is BeautifulSoup's findAll returning an empty list when I search by class?

I am trying to web-scrape using an h2 tag, but BeautifulSoup returns an empty list.
<h2 class="iCIMS_InfoMsg iCIMS_InfoField_Job">
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://careersus-endologix.icims.com/jobs/2034/associate-supplier-quality-engineer/job")
bs0bj = BeautifulSoup(html, "lxml")
nameList = bs0bj.findAll("h2", {"class": "iCIMS_InfoMsg iCIMS_InfoField_Job"})
print(nameList)
The content is inside an iframe and is updated via JavaScript, so it is not present in the initial response. You can request the same URL the page uses to obtain the iframe content (the iframe src). Then extract the string from the script tag that holds the info, load it with json, extract the description (which is HTML), and pass it back to BeautifulSoup to select the h2 tags. The rest of the info is also available in the second soup object if required.
import requests
from bs4 import BeautifulSoup as bs
import json
r = requests.get('https://careersus-endologix.icims.com/jobs/2034/associate-supplier-quality-engineer/job?mobile=false&width=1140&height=500&bga=true&needsRedirect=false&jan1offset=0&jun1offset=60&in_iframe=1')
soup = bs(r.content, 'lxml')
script = soup.select_one('[type="application/ld+json"]').text
data = json.loads(script)
soup = bs(data['description'], 'lxml')
headers = [item.text for item in soup.select('h2')]
print(headers)
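The two-stage parse above can be illustrated without a network request. In this sketch the job_posting dictionary and the surrounding page are invented stand-ins for the iframe response; the real JSON-LD block carries more fields.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical JSON-LD payload whose description field is itself HTML
job_posting = {
    "title": "Example Engineer",
    "description": "<h2>Overview</h2><p>Some text.</p><h2>Responsibilities</h2><p>More text.</p>",
}
iframe_html = (
    '<html><head><script type="application/ld+json">'
    + json.dumps(job_posting)
    + "</script></head><body></body></html>"
)

# Stage 1: locate the JSON-LD script tag and decode its contents
outer_soup = BeautifulSoup(iframe_html, "html.parser")
script_tag = outer_soup.select_one('script[type="application/ld+json"]')
data = json.loads(script_tag.string)

# Stage 2: the description is HTML, so parse it again to reach the h2 tags
inner_soup = BeautifulSoup(data["description"], "html.parser")
headers = [h2.get_text(strip=True) for h2 in inner_soup.select("h2")]
print(headers)
```

Parsing twice is needed because the h2 elements only exist as text inside the JSON string, not as nodes in the outer document.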
The answer lies in two elements:
JavaScript-rendered content: the content is added after document.onload.
In particular, the content managed by JavaScript comes after a comment in the page source and is rendered by JavaScript. The line where the block starts is: "< ! - -BEGIN ICIMS - - >" (spaces added so that it does not render as blank).
As you can imagine, the h2 with the iCIMS class does NOT exist when you call the bs4 methods.
The solution?
In my opinion, the best way to achieve what you want is to use Selenium to get the fully rendered web page.
See also:
Web-scraping JavaScript page with Python

Finding video id on html website using python

I am scraping an html file, each page has a video on it, and in the html there is the video id. I want to print out the video id.
I know that if I want to print a headline from a div class, I would do this:
from bs4 import BeautifulSoup
with open('yeehaw.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
article = soup.find('div', class_='article')
headline = article.h2.a.text
print(headline)
However, the id for the video is found inside a data-id='qe67234' attribute.
I don't know how to access this 'qe67234' and print it out.
Please help, thank you!
Assuming that the tag carrying data-id is a div:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div class="_article" data-id="qe67234"></div>', 'html.parser')
# Passing True matches every div that has a data-id attribute, whatever its value
results = soup.find_all("div", {"data-id": True})
print('output:', results[0]['data-id'])
# output: qe67234
Assuming that the data-id is on a div:
BeautifulSoup's find returns the matching element as a Tag object, whose attributes can be read like dictionary entries. You can therefore navigate it using standard means to get access to the text (as you did in your question) as well as to attribute values (as shown in the code below):
soup = BeautifulSoup('<div class="_article" data-id="qe67234">', 'html.parser')
soup.find("div", {"class": "_article"})['data-id']
Note that video elements often require JavaScript for playback, and you might not be able to find the necessary element if the page was fetched with a client that does not run JavaScript (e.g. Python requests).
If this happens, you have to use a tool like Selenium with PhantomJS to load the website together with its JavaScript before scraping.
EDIT
If the tag holding the data-id is not constant, you should look into the lxml library as a replacement for BeautifulSoup and use XPath expressions to find the element that you need.
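If "not constant" is read as the element type varying from page to page, one alternative that stays within BeautifulSoup (rather than the lxml and XPath route suggested above) is a CSS attribute selector, which matches any element that has a data-id attribute. The markup and the second id value below are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page where data-id appears on different kinds of elements
sample_html = """
<div class="article"><h2><a href="/story">Headline</a></h2></div>
<section class="player" data-id="qe67234"></section>
<span data-id="zx00001"></span>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "[data-id]" selects every element carrying the attribute, whatever its tag
video_ids = [tag["data-id"] for tag in soup.select("[data-id]")]
print(video_ids)
```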
