I'm trying to write a Python script that scrapes the RAW Paste Data section of saved Pastebin pages. But I'm running into an AttributeError: 'NoneType' object has no attribute 'text'. I'm using BeautifulSoup in my project. I also tried to pip install spider-egg so I could use that as well, but there were issues downloading the package from the server.
I need to be able to grab multiple different lines from the RAW Paste Data section and then print them back out to me.
first_string = raw_box.text.strip()
second_string = raw_box2.text.strip()
From the Pastebin page I have the element for the RAW Paste Data section, which is:
<textarea id="paste_code" class="paste_code" name="paste_code" onkeydown="return catchTab(this,event)">
Taking the class name paste_code, I then have this:
raw_box = soup.find('first_string ', attrs={'class': 'paste_code'})
raw_box2 = soup.find('second_string ', attrs={'class': 'paste_code'})
I thought that should have been it, but apparently not, because I get the error I mentioned above. After parsing the stripped data I need to print what it got and then redirect that into a file. I'd also like to make this Python 3 compatible, but that would take a little more work, since there are quite a few differences between Python 2.7.12 and 3.5.2.
The following approach should help to get you started:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://pastebin.com/hGeHMBQf')
soup = BeautifulSoup(r.text, "html.parser")
raw = soup.find('textarea', id='paste_code').text
print(raw)
Which for this example should display:
hello world
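Since you also mentioned wanting to redirect the parsed text into a file after printing it, here is a minimal sketch of that step, assuming raw holds the text from the snippet above (paste_dump.txt is just an example filename):

# Also write the scraped paste out to a file
with open('paste_dump.txt', 'w') as out_file:
    out_file.write(raw)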
I'm following a web tutorial trying to use BeautifulSoup4 to extract data from an HTML file (stored on my local PC) in JupyterLab as follows:
from bs4 import BeautifulSoup
with open('simple.html') as html_file:
    simple = BeautifulSoup('html_file', 'lxml')
    print(simple.prettify())
I'm getting the following output, irrespective of what is in the HTML file, instead of the expected HTML:
<html>
 <body>
  <p>
   html_file
  </p>
 </body>
</html>
I've also tried it using html.parser, and I simply get html_file as the output.
I know it can find the file because when I run the code after removing it from the directory I get a FileNotFoundError.
It works perfectly well when I run Python interactively from the same directory. I'm able to run other BeautifulSoup code to parse web pages.
I'm using Fedora 32 Linux with Python 3, JupyterLab, BeautifulSoup4, requests, and lxml installed in a virtual environment using pipenv.
Any help to get to the bottom of the problem is welcome.
Your problem is in this line:
simple = BeautifulSoup('html_file','lxml')
In particular, you're telling BeautifulSoup to parse the literal string 'html_file' instead of the contents of the variable html_file.
Changing it to:
simple = BeautifulSoup(html_file,'lxml')
(note the lack of quotes surrounding html_file) should give the desired result.
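For reference, the corrected snippet from the question would then read:

from bs4 import BeautifulSoup

with open('simple.html') as html_file:
    simple = BeautifulSoup(html_file, 'lxml')
    print(simple.prettify())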
I am trying to use the following Python code to get some data from the EDGAR database.
import requests
from bs4 import BeautifulSoup

html1 = 'https://www.sec.gov/Archives/edgar/data/320193/000032019317000070/aapl-20170930.xml'
xbrl_resp = requests.get(html1)
xbrl_str = xbrl_resp.text

soup1 = BeautifulSoup(xbrl_str, 'lxml')
mytag = soup1.find('us-gaap:StockholdersEquity', {'contextRef': 'FI2017Q4'})
print(mytag)
It returns None even though the tag exists in the XML file. Any suggestions would be appreciated.
There are a couple of issues that you are running into. First, pass the content of the request rather than the text. Second, use the xml parser instead of the lxml parser. Finally, the way you're filtering on contextRef within the 'us-gaap:StockholdersEquity' search needs adjusting.
import requests
from bs4 import BeautifulSoup

html1 = 'https://www.sec.gov/Archives/edgar/data/320193/000032019317000070/aapl-20170930.xml'
xbrl_resp = requests.get(html1)
xbrl_str = xbrl_resp.content

soup1 = BeautifulSoup(xbrl_str, 'xml')
mytag = soup1.find('us-gaap:StockholdersEquity', contextRef='FI2017Q4')
print(mytag)
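Once the tag is found, reading the reported figure out of it is just a matter of taking its text; a minimal sketch, assuming the tag's text is a plain number:

if mytag is not None:
    # The tag's text is the reported amount as a string
    value = float(mytag.text)
    print(value)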
The XML parser converts the XML tags to lowercase; see https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-xml. Therefore you need to search with lowercase names, such as:
mytag = soup1.find('us-gaap:stockholdersequity', contextref='FI2017Q4')
I had the same issue of soup.find('table') returning None.
This issue occurred in an environment where the lxml package version was 3.4.4.
On another environment with lxml version 3.7.3 the same code worked fine.
So, I went back to the 'bad' environment and upgraded the lxml package version.
pip install lxml --upgrade
soup.find('table') started working after that.
Hope this helps!
I'm new to Python and I need to get the data from a table on a webpage and send it to a list.
I've tried everything, and the best I got is:
import urllib.request
from bs4 import BeautifulSoup

url = "http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-taxas-referenciais-bmf-enUS.asp?Data=11/22/2017&Data1=20171122&slcTaxa=APR#"
f = urllib.request.urlopen(url)
soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'lxml')
rows = list()
for tr in soup.findAll('table'):
    rows.append(tr)
Any suggestions?
You're not that far off!
First make sure to import the proper version of BeautifulSoup, which is BeautifulSoup4, by doing apt-get install python3-bs4 (assuming you're on Ubuntu or Debian and running Python 3).
Then isolate the td elements of the HTML table and clean the data a bit, for example by removing the first 3 elements of the list, which are useless, and removing the ugly '\n' entries:
import urllib.request
from bs4 import BeautifulSoup

url = "http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-taxas-referenciais-bmf-enUS.asp?Data=11/22/2017&Data1=20171122&slcTaxa=APR#"
soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'lxml')

rows = list()
for tr in soup.findAll('table'):
    for td in tr:
        rows.append(td.string)

# Drop the first 3 useless elements and the stray '\n' strings
temp_list = rows[3:]
final_list = [element for element in temp_list if element != '\n']
I don't know which data you want to extract precisely. Now you need to work on your Python list (called final_list here)!
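For example, a minimal sketch of turning the flat list back into rows, assuming the table has a fixed number of columns (the value 3 is just a placeholder; adjust it to the real table):

n_cols = 3  # placeholder: set to the actual number of columns in the table
table_rows = [final_list[i:i + n_cols] for i in range(0, len(final_list), n_cols)]
for row in table_rows[:5]:
    print(row)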
Hope it's clear.
There is a Download option at the end of the webpage. If you can download the file manually, you are good to go.
If you want to access different dates automatically, and since the page is rendered with JavaScript, I suggest using Selenium to download the xlsx files through Python.
With the xlsx file you can then use a spreadsheet library (for example openpyxl or pandas, since XlsxWriter only writes files) to read the data and do what you want.
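A minimal Selenium sketch of that idea is below; the link text "Download", the use of Chrome, and the short wait are assumptions, so adjust them to the real page and your setup:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

url = "http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-taxas-referenciais-bmf-enUS.asp?Data=11/22/2017&Data1=20171122&slcTaxa=APR#"

driver = webdriver.Chrome()  # assumes chromedriver is installed and on PATH
driver.get(url)

# Assumed: the export link is labelled "Download"; inspect the page to confirm
driver.find_element(By.LINK_TEXT, "Download").click()

time.sleep(10)  # give the download a moment to finish before closing the browser
driver.quit()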
I'm doing a project where I need to store the date that a YouTube video was published.
The problem is that I'm having some difficulty finding this data in the middle of the HTML source code.
Here's my code attempt:
import requests
from bs4 import BeautifulSoup as BS
url = "https://www.youtube.com/watch?v=XQgXKtPSzUI&t=915s"
response = requests.get(url)
soup = BS(response.content, "html.parser")
response.close()
dia = soup.find_all('span',{'class':'date'})
print(dia)
Output:
[]
I know that the arguments I'm sending to .find_all() are wrong.
I'm saying this because I was able to store other information from the video using the same code, such as the title and the views.
I've tried different arguments with .find_all() but didn't figure out how to find it.
If you use Python with pafy, the object you'll get has the published date easily accessible.
Install pafy: "pip install pafy"
import pafy
vid = pafy.new("www.youtube.com/watch?v=2342342whatever")
published_date = vid.published
print(published_date) #Python3 print statement
Check out the pafy docs for more info:
https://pythonhosted.org/Pafy/
The reason I leave the doc link is because it's a really neat module, it handles getting the data without external request modules and also exposes a bunch of other useful properties of the video, like the best format download link, etc.
It seems that YouTube uses JavaScript to add the date to that span, so the information is not in the rendered HTML that requests sees. You could try using Selenium to scrape the page, or pull the date out of the embedded JS/JSON data, since it is present in the raw page source.
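A minimal sketch of that second idea, assuming the watch page still embeds a meta tag with itemprop="datePublished" in its raw source (YouTube's markup changes often, so this may need adjusting):

import requests
from bs4 import BeautifulSoup as BS

url = "https://www.youtube.com/watch?v=XQgXKtPSzUI&t=915s"
response = requests.get(url)
soup = BS(response.content, "html.parser")

# Assumed: the publish date is exposed via an itemprop="datePublished" meta tag
meta = soup.find('meta', itemprop='datePublished')
if meta is not None:
    print(meta['content'])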
Try adding the attrs argument as shown below:
dia = soup.find_all('span', attrs={'class': 'date'})
Okay, so in a terminal, after importing and making the necessary objects--I typed:
for links in soup.find_all('a'):
    print(links.get('href'))
which gave me all the links on a Wikipedia page (roughly 250). No problems.
However, in a program I am coding, I only receive about 60 links (scraping the same Wikipedia page), and the ones I DO get are mostly not worth anything. I double checked that I initialized both exactly the same--the only difference is the names of the variables. For reference, here is the function where I set up the BS4 object and grab the desired page:
def get_site(hyperLink):
    userSite = urllib3.PoolManager()
    siteData = userSite.request("GET", hyperLink)
    bsd = BeautifulSoup(siteData.data)
    return bsd
Later, I grab the elements and append them to a list I will then manipulate:
def find_urls(bsd, urls, currentNetloc):
    for links in bsd.find_all('a'):
        urls.append(links.get('href'))
    return urls
Other relevant info:
I am using Python 3.3
I am using urllib3, BeautifulSoup 4, and urlparse (from urllib)
I am working in PyCharm (for the actual program)
Using Lubuntu, if it matters.
After running a command line instance of python3 and importing "sys" I typed and received:
$ sys.executable
'/usr/bin/python3'
$ sys.path
['', '/usr/local/lib/python3.3/dist-packages/setuptools-1.1.5-py3.3.egg', '/usr/local/lib/python3.3/dist-packages/pip-1.4.1-py3.3.egg', '/usr/local/lib/python3.3/dist-packages/beautifulsoup4-4.3.2-py3.3.egg', '/usr/lib/python3.3', '/usr/lib/python3.3/plat-i386-linux-gnu', '/usr/lib/python3.3/lib-dynload', '/usr/local/lib/python3.3/dist-packages', '/usr/lib/python3/dist-packages']
After running these commands in a Pycharm project, I received exactly the same results, with the exception that the directories containing my pycharm projects were included in the list.
This is not my answer. I got it from here, which has helped me before.
from bs4 import BeautifulSoup
import csv
# Create .csv file with headers
f=csv.writer(open("nyccMeetings.csv","w"))
f.writerow(["Name", "Date", "Time", "Location", "Topic"])
# Use python html parser to avoid truncation
htmlContent = open("nyccMeetings.html")
soup = BeautifulSoup(htmlContent,"html.parser")
# Find each row
rows = soup.find_all('tr')
for tr in rows:
    cols = tr.find_all('td')  # Find each column
    try:
        names = cols[0].get_text().encode('utf-8')
        date = cols[1].get_text().encode('utf-8')
        time = cols[2].get_text().encode('utf-8')
        location = cols[3].get_text().encode('utf-8')
        topic = cols[4].get_text().encode('utf-8')
    except:
        continue
    # Write to .csv file
    f.writerow([names, date, time, location, topic])
I think it would be useful to note some of the troubles I ran into while writing this script:
Specify your parser. It is very important to specify the type of HTML parser that BeautifulSoup will use to parse the HTML tree. The HTML file that I read into Python was not formatted correctly, so BeautifulSoup truncated the HTML and I was only able to access about a quarter of the records. By telling BeautifulSoup to explicitly use the built-in Python html parser, I was able to avoid this issue and retrieve all records.
Encode to UTF-8. get_text() had some issues with encoding the text inside the html tags. As such, I was unable to write data to the comma-delimited file. By explicitly telling the program to encode to UTF-8, we avoid this issue altogether.
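As a side note, if you run this under Python 3, the explicit .encode('utf-8') calls are generally unnecessary; a minimal sketch of the Python 3 way to open the output file (this assumes Python 3 and is not part of the original script):

import csv

# newline='' and encoding='utf-8' let csv handle Unicode text directly
with open("nyccMeetings.csv", "w", newline="", encoding="utf-8") as out:
    f = csv.writer(out)
    f.writerow(["Name", "Date", "Time", "Location", "Topic"])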
I have encountered many problems in my web scraping projects; however, BeautifulSoup was never the culprit.
I highly suspect you are having the same problem I had scraping Wikipedia. Wikipedia did not like my user-agent and was returning a page other than what I requested. Try adding a user-agent in your code e.g.
Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36
You mentioned you were using urllib3 so here is where you can read on how to use a custom user-agent.
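Here is a minimal sketch of passing a custom user-agent with urllib3, adapted to the get_site function from the question (the user-agent string is just the example above, and naming the parser explicitly is an extra precaution):

import urllib3
from bs4 import BeautifulSoup

USER_AGENT = ("Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36")

def get_site(hyperLink):
    userSite = urllib3.PoolManager(headers={"User-Agent": USER_AGENT})
    siteData = userSite.request("GET", hyperLink)
    bsd = BeautifulSoup(siteData.data, "html.parser")
    return bsd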
Also, if you want to diagnose your problem, try this: in the terminal where you said everything was working fine, add an extra line print(len(html)). Then do the same in your program to see if you are in fact getting the links from the same page.