I cannot scrape anything with BeautifulSoup - Python

I'm using BeautifulSoup to scrape some web content.
I'm learning with this example code, but I always get a "None" response.
Code:
import urllib2
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.velocidadcuchara.com/2011/08/helado-platano-light.html').read())
post = soup.find('div', attrs={'id': 'topmenucontainer'})
print post
Any idea what I'm doing wrong?
Thanks!!

I don't think you are doing anything wrong.
It is the second script tag that is confusing BeautifulSoup. The tag looks like this:
<script type='text/javascript'>
<!--//--><![CDATA[//><!--
var arVersion = navigator.appVersion.split("MSIE")
var version = parseFloat(arVersion[1])
function fixPNG(myImage)
{
    if ((version >= 5.5) && (version < 7) && (document.body.filters))
    {
        var imgID = (myImage.id) ? "id='" + myImage.id + "' " : ""
        var imgClass = (myImage.className) ? "class='" + myImage.className + "' " : ""
        var imgTitle = (myImage.title) ?
            "title='" + myImage.title + "' " : "title='" + myImage.alt + "' "
        var imgStyle = "display:inline-block;" + myImage.style.cssText
        var strNewHTML = "<span " + imgID + imgClass + imgTitle
            + " style=\"" + "width:" + myImage.width
            + "px; height:" + myImage.height
            + "px;" + imgStyle + ";"
            + "filter:progid:DXImageTransform.Microsoft.AlphaImageLoader"
            + "(src=\'" + myImage.src + "\', sizingMethod='scale');\"></span>"
        myImage.outerHTML = strNewHTML
    }
}
//--><!]]>
</script>
but BeautifulSoup seems to think it is still in a comment or something and includes the rest of the file as content of the script tag.
Try:
print str(soup.findAll('script')[1])[:2000]
and you'll see what I mean.
If you remove the CDATA then you should find the page parses correctly:
soup = BeautifulSoup(
    urllib2.urlopen('http://www.velocidadcuchara.com/2011/08/helado-platano-light.html')
    .read()
    .replace('<![CDATA[', '').replace('<!]]>', ''))
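Putting the pieces together, a minimal sketch of the full corrected scrape (assuming the page still serves the same markup):
import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen('http://www.velocidadcuchara.com/2011/08/helado-platano-light.html').read()
# strip the CDATA markers that confuse BeautifulSoup before parsing
html = html.replace('<![CDATA[', '').replace('<!]]>', '')
soup = BeautifulSoup(html)
post = soup.find('div', attrs={'id': 'topmenucontainer'})
print post  # should now print the div rather than None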

Something weird is going on with your HTML. BeautifulSoup tries its best, but sometimes it just can't parse it.
Try moving the first <link> element inside the <head>; that might help.

You could try the lxml library instead.
from lxml.html import parse

doc = parse('http://www.velocidadcuchara.com/2011/08/helado-platano-light.html').getroot()
post = doc.cssselect('div#topmenucontainer')
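cssselect returns a list of matching elements, so you can check that the div was found and pull its text (note: in newer lxml releases CSS selection requires the separate cssselect package, an assumption about your setup):
if post:
    print post[0].text_content()  # text of the first matching div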

Related

np.arange returns <Response 200> instead of value

I'm trying to write a script that scrapes the text of multiple webpages with slightly differing URLs. I want to go through the pages with an np.arange function that inserts a string into the URL. But there must be something wrong with the URL the script is composing: the document that stores the scraped text contains only messages like "this site does not exist anymore". The steps I have taken to get closer to a solution are detailed below. Here is my code.
import requests
from bs4 import BeautifulSoup
import numpy as np
import datetime
from time import sleep
from random import randint

datum = datetime.datetime.now()
pages = np.arange(1, 20, 1)
datum_jetzt = datum.strftime("%Y") + "-" + datum.strftime("%m") + "-" + datum.strftime("%d")
url = "https://www.shabex.ch/pub/" + datum_jetzt + "/index-"
results = requests.get(url)
file_name = "" + datum.strftime("%Y") + "-" + datum.strftime("%m") + "-" + datum.strftime("%d") + "-index.htm"
for page in pages:
    page = requests.get("https://www.shabex.ch/pub/" + datum_jetzt + "/index-" + str(page) + ".htm")
    soup = BeautifulSoup(results.text, "html.parser")
    texte = soup.get_text()
    sleep(randint(2,5))
    f = open(file_name, "a")
    f.write(texte)
    f.close
I found that if I enter print("https://www.shabex.ch/pub/" + datum_jetzt + "/index-" + str(page) + ".htm") in the console, I get https://www.shabex.ch/pub/2020-05-18/index-<Response [200]>.htm. So the np.arange function returns the response of the webserver instead of the value I seek.
Where have I gone wrong?
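np.arange is not at fault here: the loop body reassigns the loop variable page to the requests Response object (page = requests.get(...)), so after the loop str(page) yields <Response [200]>. The loop also always parses results, the response fetched before the loop, which is why every saved page looks the same. A minimal sketch of the corrected loop, keeping the original names:
for page in pages:
    # use a different name for the response so `page` keeps its numeric value
    response = requests.get("https://www.shabex.ch/pub/" + datum_jetzt + "/index-" + str(page) + ".htm")
    soup = BeautifulSoup(response.text, "html.parser")
    texte = soup.get_text()
    sleep(randint(2, 5))
    f = open(file_name, "a")
    f.write(texte)
    f.close()  # close() needs parentheses, otherwise the file is never closed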

Excluding 'duplicated' scraped URLs in Python app?

I've never used Python before so excuse my lack of knowledge, but I'm trying to scrape a XenForo forum for all of the threads. So far so good, except for the fact it's picking up multiple URLs for each page of the same thread. I've posted some data below to explain what I mean.
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-9
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-10
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-11
Really, what I would ideally want to scrape is just one of these.
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
Here is my script:
from bs4 import BeautifulSoup
import requests

def get_source(url):
    return requests.get(url).content

def is_forum_link(self):
    return self.find('special string') != -1

def fetch_all_links_with_word(url, word):
    source = get_source(url)
    soup = BeautifulSoup(source, 'lxml')
    return soup.select("a[href*=" + word + "]")

main_url = "http://example.com/forum/"
forumLinks = fetch_all_links_with_word(main_url, "forums")
forums = []
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        forums.append(link.attrs['href'])
print('Fetched ' + str(len(forums)) + ' forums')
threads = {}
for link in forums:
    threadLinks = fetch_all_links_with_word(main_url + link, "threads")
    for threadLink in threadLinks:
        print(link + ': ' + threadLink.attrs['href'])
        threads[link] = threadLink
print('Fetched ' + str(len(threads)) + ' threads')
This solution assumes that what should be removed from the URL to check for uniqueness is always going to be "/page-#...". If that is not the case, this solution will not work.
Instead of using a list to store your URLs you can use a set, which only keeps unique values. Then, before adding a URL to the set, remove the last instance of "page" and anything after it, provided it is in the format "/page-#", where # is any number.
forums = set()
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        url = link.attrs['href']
        position = url.rfind('/page-')
        if position > 0 and url[position + 6:position + 7].isdigit():
            url = url[:position + 1]
        forums.add(url)
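As a quick sanity check, here is the same normalization applied to the sample URLs from the question (a standalone snippet, not part of the answer's code):
for url in ['threads/my-gap-year-uni-story.13846/',
            'threads/my-gap-year-uni-story.13846/page-9',
            'threads/my-gap-year-uni-story.13846/page-10',
            'threads/my-gap-year-uni-story.13846/page-11']:
    position = url.rfind('/page-')
    if position > 0 and url[position + 6:position + 7].isdigit():
        url = url[:position + 1]
    print(url)  # every line prints 'threads/my-gap-year-uni-story.13846/'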

Looping through pages always gets the same result

I'm trying to loop through pages and save a specific image:
import urllib.request
from bs4 import BeautifulSoup as bs

frontstring = 'http://www.haz.de/'
for i in range(1, 50):
    url = 'http://www.haz.de/Hannover/Aus-der-Region/Lehrte/Fotostrecken/' \
          + 'Digitales-Daumenkino-So-waechst-das-Parkhaus#p' + str(i)
    with urllib.request.urlopen(url) as page:
        soup = bs(page)
        galleryimage = soup.findAll('img', {'class': 'pda-fullgallery-large photo'})
        for imgtag in galleryimage:
            try:
                imgurl = frontstring + imgtag['src']
                imgname = 'img/fullgallery-large' + str(i) + '.jpg'
                urllib.request.urlretrieve(imgurl, imgname)
                print('saving image from ' + imgurl + ' to ' + imgname)
            except Exception as e:
                raise
            else:
                pass
However, the image is always the same and I don't know where it went wrong. If I open the URL in the browser it is the correct page and image, but the soup always seems to contain the same code. It's probably something really stupid and simple, but I'm not seeing it even after looking for the mistake for a long time.
http://www.haz.de/Hannover/Aus-der-Region/Lehrte/Fotostrecken/Digitales-Daumenkino-So-waechst-das-Parkhaus
http://www.haz.de/Hannover/Aus-der-Region/Lehrte/Fotostrecken/Digitales-Daumenkino-So-waechst-das-Parkhaus/(offset)/1
http://www.haz.de/Hannover/Aus-der-Region/Lehrte/Fotostrecken/Digitales-Daumenkino-So-waechst-das-Parkhaus/(offset)/2
http://www.haz.de/Hannover/Aus-der-Region/Lehrte/Fotostrecken/Digitales-Daumenkino-So-waechst-das-Parkhaus/(offset)/3
Those are the real URLs; the URL you see in the browser is generated by JavaScript. You should disable JavaScript in your browser before scraping any site, so that you see the URLs the server actually understands.
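A minimal sketch of the loop rewritten against those real URLs (the #p fragment is never sent to the server, which is why every request returned the same page); the (offset) numbering below is inferred from the URLs listed above:
import urllib.request
from bs4 import BeautifulSoup as bs

base = ('http://www.haz.de/Hannover/Aus-der-Region/Lehrte/Fotostrecken/'
        'Digitales-Daumenkino-So-waechst-das-Parkhaus')
for i in range(1, 50):
    # page 1 is the bare URL; later pages append /(offset)/N
    url = base if i == 1 else base + '/(offset)/' + str(i - 1)
    with urllib.request.urlopen(url) as page:
        soup = bs(page, 'html.parser')
        for imgtag in soup.findAll('img', {'class': 'pda-fullgallery-large photo'}):
            imgurl = 'http://www.haz.de/' + imgtag['src']
            urllib.request.urlretrieve(imgurl, 'img/fullgallery-large' + str(i) + '.jpg')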

How to use Python to interpret a URL

I'm writing code that is attempting to extract the text from the Library of Babel.
They basically use a system of Hexes, Walls, Shelves, Volumes and Pages to split up their library of randomly generated text files. Here is an example (https://libraryofbabel.info/book.cgi?2-w1-s2-v22:1).
Here we have Hex: 2, Wall: 1, Shelf: 2, Volume: 22, Page: 1.
I would ideally like to randomly generate a page across all these variables to extract text from, however I am not getting the output I would expect.
Here is my code:
import requests
from bs4 import BeautifulSoup
from urlparse import urlparse
import random
hex = str(random.randint(0, 6))
wall = str(random.randint(1, 4))
shelf = str(random.randint(1, 5))
vol = str(random.randint(1, 32))
page = str(random.randint(1, 410))
print("Fetching: " + " Hex: " + hex + ", Wall: " + wall + ", Shelf: " + shelf + ", Vol: " + vol + ", Page: " + page)
babel_url = str("https://libraryofbabel.info/browse.cgi?" + hex + "-w" + wall + "-s" + shelf + "-v" + vol + ":" + page)
r = requests.get(babel_url)
soup = BeautifulSoup(r.text)
print(soup.get_text())
My output would be identical if I changed the URL to https://libraryofbabel.info/browse.cgi. print(babel_url) shows me that the URL I built looks fine, but something isn't interpreting what I have written in the way I want.
I've found that just pasting https://libraryofbabel.info/book.cgi?2-w1-s2-v22:1 into Chrome drops me at https://libraryofbabel.info/book.cgi. But if I navigate to https://libraryofbabel.info/book.cgi?2-w1-s2-v22:1 (or any other page) I can move between pages at will.
The only thing I get in the output worth mentioning is:
It appears your browser has javascript disabled. Follow this link to browse without javascript.
Put on your glasses:
You are requesting browse.cgi instead of book.cgi
https://libraryofbabel.info/browse.cgi?2-w2-s1-v10:72
instead of
https://libraryofbabel.info/book.cgi?2-w2-s1-v10:72
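With that fixed, the URL construction from the question becomes (a minimal sketch, everything else unchanged):
babel_url = str("https://libraryofbabel.info/book.cgi?" + hex + "-w" + wall + "-s" + shelf + "-v" + vol + ":" + page)  # book.cgi, not browse.cgi
r = requests.get(babel_url)
soup = BeautifulSoup(r.text)
print(soup.get_text())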

HTTPError: HTTP Error 400: Bad request urllib2

I am a beginner to Python. I am the developer of the Easy APIs Project (http://gcdc2013-easyapisproject.appspot.com) and was doing a Python implementation of a weather API using my project. Visit http://gcdc2013-easyapisproject.appspot.com/APIs_Doc.html to see the Weather API. Below is my implementation, but it returns an HTTPError: HTTP Error 400: Bad request error.
import urllib2

def celsius(a):
    responsex = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/unitconversion?q='+a+' in celsius')
    htmlx = responsex.read()
    responsex.close()
    htmlx = html[1:] #remove first {
    htmlx = html[:-1] #remove last }
    htmlx = html.split('}{') #split and put each result into an array
    return str(htmlx[1]);

print "Enter a city name:",
q = raw_input() #get word from user
response = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/weather?q='+q)
html = response.read()
response.close()
html = html[1:] #remove first {
html = html[:-1] #remove last }
html = html.split('}{') #split and put each result into an array
print "Today weather is " + html[1]
print "Temperature is " + html[3]
print "Temperature is " + celsius(html[3])
Please help me..
The query string should be quoted using urllib.quote or urllib.quote_plus:
import urllib
import urllib2

def celsius(a):
    responsex = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/unitconversion?q=' + urllib.quote(a + ' in celsius'))
    html = responsex.read()
    responsex.close()
    html = html[1:] #remove first {
    html = html[:-1] #remove last }
    html = html.split('}{') #split and put each result into an array
    return html[0]

print "Enter a city name:",
q = raw_input() #get word from user
response = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/weather?q='+urllib.quote(q))
html = response.read()
print repr(html)
response.close()
html = html[1:] #remove first {
html = html[:-1] #remove last }
html = html.split('}{') #split and put each result into an array
print "Today weather is " + html[1]
print "Temperature is " + html[3]
print "Temperature is " + celsius(html[3].split()[0])
In addition to that, I modified celsius to use html consistently; the original code mixed use of html and htmlx.
I have found the answer. The query should be quoted with urllib2.quote(q).
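For reference, quoting simply percent-encodes characters that may not appear raw in a URL (the spaces in queries like "London in celsius" are what triggered the 400 here) — a quick Python 2 illustration:
import urllib
print urllib.quote('New York in celsius')       # 'New%20York%20in%20celsius'
print urllib.quote_plus('New York in celsius')  # 'New+York+in+celsius'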
