I am trying to scrape the website ftp://ftp.sec.gov/edgar/daily-index/ using the code shown below:
from bs4 import BeautifulSoup
import urllib.request
html = urllib.request.urlopen("ftp://ftp.sec.gov/edgar/daily-index/")
soup = BeautifulSoup(html, "lxml")
soup.a  # returns None; soup.find_all('a') returns [] as well -- neither works
Please help, I am really frustrated by this. My suspicion is that some tag is causing the problem. The site's HTML looks well formatted (matched tags), so I am lost as to why BeautifulSoup doesn't find anything. Thanks.
The ftp://ftp.sec.gov/edgar/daily-index/ URL leads to an FTP directory, not an HTML page.
Your browser could generate HTML based on the FTP directory contents, but the server does not send you HTML when you load that resource with urllib.request.
You probably want to use the ftplib module directly instead to read the directory listing, or inspect the return value of urlopen(...).read() first.
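A minimal sketch of the ftplib approach, using only the standard library. The host and path are the ones from the question, but the SEC has since retired its public FTP service, so treat them as illustrative:

```python
from ftplib import FTP

def parse_list_name(line):
    """Pull the name field out of a Unix-style LIST line, e.g.
    '-rw-r--r-- 1 ftp ftp 1024 Jan 01 2017 form.idx' -> 'form.idx'."""
    return line.split(None, 8)[-1]

def list_directory(host, path):
    """Return the entry names under `path` on an FTP server."""
    lines = []
    ftp = FTP(host)
    ftp.login()                                 # anonymous login
    ftp.retrlines('LIST ' + path, lines.append) # one callback per line
    ftp.quit()
    return [parse_list_name(l) for l in lines]

# Hypothetical usage (requires network access):
# print(list_directory('ftp.sec.gov', '/edgar/daily-index/'))
```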
I'd like to make an auto-login program for an internal network website.
So, I tried to parse that site using the requests and BeautifulSoup libraries.
It works... but I get some HTML a lot shorter than that site's HTML.
What's the problem? Maybe a security issue?
Please help me.
import requests
from bs4 import BeautifulSoup as bs
page = requests.get("http://test.com")
soup = bs(page.text, "html.parser")  # note: "html.parser", not "html.parse"
print(soup)  # prints HTML a lot shorter than the page seen in the browser
I am learning BeautifulSoup and I tried to extract all the "a" tags from a website. I am getting a lot of "a" tags, but a few of them are ignored, and I am confused why that is happening. Any help will be highly appreciated.
Link i used is : https://www.w3schools.com/python/
img : https://ibb.co/mmEKTK
The red box in the image is a section that has been totally ignored by bs4. It does contain "a" tags.
Code:
import requests
import bs4  # html5lib only needs to be installed for the 'html5lib' parser

res = requests.get('https://www.w3schools.com/python/')
soup = bs4.BeautifulSoup(res.text, 'html5lib')
try:
    links_with_text = []
    for a in soup.find_all('a', href=True):
        print(a['href'])
except Exception:
    print('none')
Sorry for the code indentation; I am new here.
The links which are being ignored by bs4 are dynamically rendered, i.e. advertisements etc. were not present in the HTML code but have been inserted by scripts based on your browsing habits. The requests package only fetches static HTML content; you need to simulate a browser to get the dynamic content.
Selenium can be used with any browser like Chrome, Firefox, etc. If you want to achieve the same results on a server (without a UI), use a headless browser like PhantomJS.
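A minimal sketch of that approach, assuming the selenium package and a matching chromedriver are installed (PhantomJS has since been discontinued, so headless Chrome is used here; `fetch_rendered_html` is a hypothetical helper name):

```python
def fetch_rendered_html(url):
    """Load `url` in a headless browser and return the page source
    after client-side JavaScript has run."""
    from selenium import webdriver  # deferred so the sketch imports cleanly

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

# The rendered HTML can then be handed to BeautifulSoup as usual:
# soup = bs4.BeautifulSoup(fetch_rendered_html('https://www.w3schools.com/python/'), 'html5lib')
```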
I am currently trying to practice with the requests and BeautifulSoup Modules in Python 3.6 and have run into an issue that I can't seem to find any info on in other questions and answers.
It seems that at some point in the page, Beautiful Soup stops recognizing tags and IDs. I am trying to pull play-by-play data from a page like this:
http://www.pro-football-reference.com/boxscores/201609080den.htm
import requests, bs4
source_url = 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
res = requests.get(source_url)
if '404' in res.url:
    raise Exception('No data found for this link: ' + source_url)
soup = bs4.BeautifulSoup(res.text,'html.parser')
#this works
all_pbp = soup.findAll('div', {'id' : 'all_pbp'})
print(len(all_pbp))
#this doesn't
table = soup.findAll('table', {'id' : 'pbp'})
print(len(table))
Using the inspector in Chrome, I can see that the table definitely exists. I have also tried to use it on 'div's and 'tr's in the latter half of the HTML, and it doesn't seem to work. I have tried the standard 'html.parser' as well as lxml and html5lib, but nothing seems to work.
Am I doing something wrong here, or is there something in the HTML or its formatting that prevents BeautifulSoup from correctly finding the later tags? I have run into issues with similar pages run by this company (hockey-reference.com, basketball-reference.com), but have been able to use these tools properly on other sites.
If it is something with the HTML, is there any better tool/library for helping to extract this info out there?
Thank you for your help,
BF
BS4 can't execute a web page's JavaScript after doing the GET request for a URL. I think the table in question is loaded asynchronously by client-side JavaScript.
As a result, the client-side JavaScript needs to run before the HTML is scraped. This post describes how to do so!
OK, I found the problem: you're trying to parse a comment, not an ordinary HTML element.
For such cases you should use Comment from BeautifulSoup, like this:
import requests
from bs4 import BeautifulSoup, Comment

source_url = 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
res = requests.get(source_url)
if '404' in res.url:
    raise Exception('No data found for this link: ' + source_url)
soup = BeautifulSoup(res.content, 'html.parser')

# The table is wrapped in an HTML comment, so search the comment
# nodes and re-parse each one as its own document.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment = BeautifulSoup(str(comment), 'html.parser')
    search_play = comment.find('table', {'id': 'pbp'})
    if search_play:
        play_to_play = search_play
I am working on a script for scraping video titles from this webpage:
https://www.google.com.eg/trends/hotvideos
The problem is that the titles are hidden in the HTML source page, but I can see them when I look with the inspector.
Here is my code. It works fine with ("class": "wrap"),
but when I use it with the hidden one, "class": "hotvideos-single-trend-title-container", it gives me no output.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.google.com.eg/trends/hotvideos').read()
soup = BeautifulSoup(html, 'html.parser')  # pass a parser explicitly
print(soup.findAll('div', {"class": "hotvideos-single-trend-title-container"}))
# the "wrap" class works; this one prints []
The page is generated and populated using JavaScript.
BeautifulSoup won't help you here; you need a tool that can execute the JavaScript that generates the HTML. See here for a list, or have a look at Selenium.
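The effect is easy to reproduce with a toy document: the server sends only the static markup plus a script, and a parser that never executes the script sees only what was in the static HTML. A sketch with a stand-in string for the served page:

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML the server actually sends: only the static
# "wrap" markup is present; the title containers are created later
# by the script running in the browser.
static_html = """
<div class="wrap"><a href="/trends/hotvideos">Hot Videos</a></div>
<script>/* builds .hotvideos-single-trend-title-container nodes at runtime */</script>
"""
soup = BeautifulSoup(static_html, 'html.parser')
print(soup.find_all('div', {'class': 'hotvideos-single-trend-title-container'}))  # []
print(soup.find('div', {'class': 'wrap'}).a.get_text())  # Hot Videos
```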
I'm a newbie to HTML parsers. I'm trying to parse the source code of the webpage at http://www.quora.com/How-many-internships-are-necessary-for-a-B-Tech-student to get the answer_count.
I tried it in the following way:
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.quora.com/How-many-internships-are-necessary-for-a-B-Tech-student'
q = urllib2.urlopen(url)
soup = BeautifulSoup(q, 'html.parser')  # pass a parser explicitly
divs = soup.find_all('div', class_='answer_count')
But the list 'divs' comes back empty. Why is that? Where am I going wrong? How do I implement it to get the result '2 Answers'?
Maybe you don't see the same page as us in your browser (because you are logged in, for example).
When I look at the webpage you provided in Google Chrome, 'answer_count' appears nowhere in the source code. So if Google Chrome doesn't find it, BeautifulSoup won't either.
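One way to build that check into the script is to search the raw response for the class name before parsing; if the substring is absent, no parser can find the element. A sketch with stand-in strings for the response body (`find_marker` is a hypothetical helper):

```python
from bs4 import BeautifulSoup

def find_marker(raw_html, class_name):
    """Return divs carrying `class_name`, or [] immediately if the
    marker never occurs in the raw HTML (i.e. it is added client-side
    or only served to logged-in users)."""
    if class_name not in raw_html:
        return []
    soup = BeautifulSoup(raw_html, 'html.parser')
    return soup.find_all('div', class_=class_name)

# Stand-in for the logged-out page source, which lacks the marker:
served = '<div class="question">How many internships are necessary?</div>'
print(find_marker(served, 'answer_count'))  # []

# If the marker is present, parsing works as expected:
present = '<div class="answer_count">2 Answers</div>'
print(find_marker(present, 'answer_count')[0].get_text())  # 2 Answers
```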