I was following this tutorial and the code worked perfectly.
Now, after doing some other projects, I went back and wanted to re-run the same code. Suddenly I was getting an error message that forced me to pass features="html.parser" to the BeautifulSoup call.
So I did, but now when I run the code, literally nothing happens. Why is that? What am I doing wrong?
I checked whether I might have uninstalled the beautifulsoup4 module, but no, it is still there. I re-typed the whole code from scratch, but nothing seems to work.
import requests
from bs4 import BeautifulSoup

def spider():
    url = "https://www.amazon.de/s?k=laptop+triton&__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss"
    source = requests.get(url)
    plain_text = source.text
    soup = BeautifulSoup(plain_text, features="html.parser")
    for mylink in soup.findAll('img', {'class': 's-image'}):
        mysrc = mylink.get('src')
        print(mysrc)

spider()
Ideally I'd want the crawler to print about 10-20 lines of src = "..." from the Amazon page in question. This code worked a couple of hours ago...
The solution is to add headers={'User-Agent':'Mozilla/5.0'} to requests.get() (without it, Amazon doesn't send the correct page):
import requests
from bs4 import BeautifulSoup

def spider():
    url = "https://www.amazon.de/s?k=laptop+triton&__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss"
    source = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    plain_text = source.text
    soup = BeautifulSoup(plain_text, features="html.parser")
    for mylink in soup.findAll('img', {'class': 's-image'}):
        mysrc = mylink.get('src')
        print(mysrc)

spider()
Prints:
https://m.media-amazon.com/images/I/71YPEDap2lL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81kXH-OA6tL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81kXH-OA6tL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81fyVgZuQxL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81kXH-OA6tL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81kXH-OA6tL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71VmlANJMOL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71rAT5E7DfL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71cEKKNfb3L._AC_UL436_.jpg
https://m.media-amazon.com/images/I/61aWXuLIEBL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71B7NyjuU9L._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81s822PQUcL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71fBKuAiQzL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71hXTUR-oRL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81-Lf6jX-OL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81B85jUARqL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/8140E7+uhZL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/8140E7+uhZL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71ROCddvJ2L._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71ROCddvJ2L._AC_UL436_.jpg
https://m.media-amazon.com/images/I/41bB8HuoBYL._AC_UL436_.jpg
Related
I'm trying to extract the Google search page HTML in Python, using the requests module.
import requests
from bs4 import BeautifulSoup

url = "https://www.google.com/search?q=how+to+get+google+search+page+source+code+by+python"

resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup)

search = soup.find_all('div', class_="yuRUbf")
print(search)
But I can't find the class_="yuRUbf" anywhere in the output, so I don't think it gives me the real source code. How can I make this work?
I also used resp.content, but it didn't work.
I also tried Selenium, but that didn't work either.
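Given the pattern in the answer above, a likely culprit is that Google serves simplified markup to clients that don't look like a browser. A minimal sketch of that check, where the Mozilla/5.0 User-Agent is an assumption and yuRUbf is Google's own class name from the question, which may change at any time:

import requests
from bs4 import BeautifulSoup

url = "https://www.google.com/search?q=how+to+get+google+search+page+source+code+by+python"

# Without a browser-like User-Agent, Google tends to return stripped-down
# markup that lacks classes such as yuRUbf (assumed fix, per the answer above).
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(resp.text, "html.parser")
print(soup.find_all("div", class_="yuRUbf"))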
I have been following this Python tutorial for a while, and I made a web crawler similar to the one in the video.
Language: Python
import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.aliexpress.com/category/7/computer-office.html?trafficChannel=main&catName=computer-office&CatId=7&ltype=wholesale&SortType=default&g=n&page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'class': 'item-title'}):
            href = link.get('href')
            title = link.string
            print(href)
        page += 1

spider(1)
And this is the output that the program gives:
PS D:\development> & C:/Users/hirusha/AppData/Local/Programs/Python/Python38/python.exe "d:/development/Python/TheNewBoston/Python/one/web scrawler.py"
PS D:\development>
What can I do?
Before this, I had an error. The code was:
soup = BeautifulSoup(plain_text)
I changed this to
soup = BeautifulSoup(plain_text, 'html.parser')
and the error was gone.
The error I got was:
d:/development/Python/TheNewBoston/Python/one/web scrawler.py:10: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 10 of the file d:/development/Python/TheNewBoston/Python/one/web scrawler.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.
soup = BeautifulSoup(plain_text)
Any help is appreciated. Thank you!
There are no results because the class you are targeting is not present until the page is rendered, which doesn't happen with requests.
The data is retrieved dynamically from a script tag. You can regex out the JavaScript object holding the data and parse it with json to get that info.
The error you show was due to a parser not being specified originally, which you rectified.
import re, json, requests
import pandas as pd

r = requests.get('https://www.aliexpress.com/category/7/computer-office.html?trafficChannel=main&catName=computer-office&CatId=7&ltype=wholesale&SortType=default&g=n&page=1')

# Pull the window.runParams JavaScript object out of the page source
data = json.loads(re.search(r'window\.runParams = (\{".*?\});', r.text, re.S).group(1))

# Each item carries the product title and a protocol-relative detail URL
df = pd.DataFrame([(item['title'], 'https:' + item['productDetailUrl']) for item in data['items']])
print(df)
I have tried the code below.
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://myip.ms/'
page = 1

req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')
print(soup)
This code works for some websites, but it doesn't work for most websites, like myip.ms.
Works for me. But what, essentially, are you trying to achieve here? Your code appends "1" to the end of the URL and then visits it. If a page with those URL parameters doesn't exist on the server, it will give you errors. In this case https://myip.ms/1 exists, but it's no surprise that other pages could give you errors.
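One way to make such failures visible, as a minimal sketch, is to check the status code before parsing instead of printing whatever comes back; the browser-like User-Agent header here is an assumption, since some sites refuse the default python-requests identifier outright:

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://myip.ms/'
page = 1

# Assumed header: some sites reject the default python-requests User-Agent
req = requests.get(URL + str(page), headers={'User-Agent': 'Mozilla/5.0'})
req.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page

soup = bs(req.text, 'html.parser')
print(soup.title)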
I am trying to take some data off an intranet site at work. I am testing the code below, which looks fine to me, but it almost seems like it is going to the wrong URL. If I right-click on the page and click 'View Page Source', I can see a bunch of links (anchors) that I want to scrape, but what Python actually prints out is completely different from what I'm seeing in 'View Page Source'.
from bs4 import BeautifulSoup as bs
import requests
from lxml import html
import urllib.request

REQUEST_URL = 'https://corp-intranet-internal.com/admin/?page=0'

response = requests.get(REQUEST_URL, auth=('fname.lname@gmail.com', 'my_pass'))
xml_data = response.text.encode('utf-8', 'ignore')

html_page = urllib.request.urlopen(REQUEST_URL)
delay = 5  # seconds

soup = bs(html_page, "lxml")
for link in soup.findAll('a'):
    print(link.get('href'))
I tested the same idea using Selenium, and I'm getting results that don't match 'View Page Source' either. Any idea what could be wrong here? Thanks.
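One detail that stands out in the snippet: the soup is built from the urllib.request.urlopen() response, which is fetched without the credentials passed to requests.get(), so what gets parsed may be a login or error page rather than the page seen in 'View Page Source'. A minimal sketch that parses the authenticated response instead, assuming the site accepts HTTP Basic auth and reusing the question's placeholder credentials:

from bs4 import BeautifulSoup as bs
import requests

REQUEST_URL = 'https://corp-intranet-internal.com/admin/?page=0'

# Parse the *authenticated* response; a separate unauthenticated fetch
# would likely receive a login page instead of the admin page.
response = requests.get(REQUEST_URL, auth=('fname.lname@gmail.com', 'my_pass'))
soup = bs(response.text, 'lxml')

for link in soup.findAll('a'):
    print(link.get('href'))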
I'm looking to parse an HTML page for images (from http://www.z-img.com), and when I load the page into BeautifulSoup (bs4), Python crashes. The "problem details" show that etree.pyd was the "Fault Module Name", which means it's probably a parsing error, but so far I can't quite nail down the cause of it.
Here's the simplest code I can boil it down to, on Python 2.7:
import requests, bs4

url = r"http://z-img.com/search.php?&ssg=off&size=large&q=test"
r = requests.get(url)
html = r.content
# or:
# import urllib2
# html = urllib2.urlopen(url).read()

soup = bs4.BeautifulSoup(html)
along with a sample output on PasteBin (http://pastebin.com/XYT9g4Lb), after I had passed it through JsBeautifier.com.
This is a bug that was fixed in lxml version 2.3.5. Upgrade to version 2.3.5 or later.
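To confirm which version is installed before and after upgrading (pip install --upgrade lxml pulls the latest), a quick check:

# Print the installed lxml version; the crash described here is fixed in 2.3.5+
import lxml.etree
print(lxml.etree.__version__)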
Oh, there you go, naturally the first thing I try after I submit the question is the solution: the <!DOCTYPE> tag seems to be at the root of it. I created a new HTML file, temp.html:
<!DOCTYPE>
<html>
</html>
and passed that to BeautifulSoup as an HTML string, and that was enough to crash Python again. So I just need to remove that tag before I pass the HTML to BeautifulSoup in the future:
import requests, bs4

url = r"http://z-img.com/search.php?&ssg=off&size=large&q=test"
r = requests.get(url)
html = r.content
# or:
# import urllib2
# html = urllib2.urlopen(url).read()

# Replace the declaration with nothing, and my problems are solved
html = html.replace(r"<!DOCTYPE>", "")
soup = bs4.BeautifulSoup(html)
Hope this saves someone else some time.
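As an alternative to stripping the tag, a sketch of the same fetch that sidesteps lxml entirely by naming the pure-Python stdlib parser, assuming its output is acceptable for this page:

import requests, bs4

url = r"http://z-img.com/search.php?&ssg=off&size=large&q=test"
html = requests.get(url).content

# html.parser is pure Python, so the buggy lxml C extension is never invoked
soup = bs4.BeautifulSoup(html, "html.parser")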