Page scraper parser error? - python

Hello everyone, new Python user here I am having a strange error when building a very basic page scraper.
I am using BeautifulSoup4 to assist me and when I execute my code I get this error
"UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 13 of the file C:/Users/***/PycharmProjects/untitled1/s.py. To get rid of this warning, change code that looks like this:"
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "html.parser")
markup_type=markup_type))
If anyone has any help to fix this I would greatly appreciate it!
Code Follows
import requests
from bs4 import BeautifulSoup
def trade_spider():
url = 'http://buckysroom.org/trade/search.php?page=' # Could add a + pls str(pagesomething) to add on to the url so that it would update
source_code = requests.get(url) #requests the data from the site
plain_text = source_code.text #imports all of the data gathered
soup = BeautifulSoup(plain_text) #This hold all of the data, and allows you to sort through all of the data, converts it
for link in soup.find_all( 'a', {'class' : 'item-name'}):
href = link.get('href')
print(href)
trade_spider()

You could try to change the following line to:
soup = BeautifulSoup(plain_text, "html.parser")
or whatever other parser you need to use...

Related

Can't get for loop to work while parsing HTML using Beautiful Soup 4

I'm using the Beautiful Soup documentation to help me understand how to implement it. I'm not too familiar with Python as a whole, so maybe I'm making a syntax error, but I don't believe so. The code below should print out any links from the main Etsy page, but it's not doing that. The documentation states something similar to this, but maybe I'm missing something. Here's my code:
#!/usr/bin/python3
# import library
from bs4 import BeautifulSoup
import requests
import os.path
from os import path
# Request to website and download HTML contents
url='https://www.etsy.com/?utm_source=google&utm_medium=cpc&utm_term=etsy_e&utm_campaign=Search_US_Brand_GGL_ENG_General-Brand_Core_All_Exact&utm_ag=A1&utm_custom1=_k_Cj0KCQiAi8KfBhCuARIsADp-A54MzODz8nRIxO2LnGcB8Ezc3_q40IQk9HygcSzz9fPmPWnrITz8InQaAt5oEALw_wcB_k_&utm_content=go_227553629_16342445429_536666953103_kwd-1818581752_c_&utm_custom2=227553629&gclid=Cj0KCQiAi8KfBhCuARIsADp-A54MzODz8nRIxO2LnGcB8Ezc3_q40IQk9HygcSzz9fPmPWnrITz8InQaAt5oEALw_wcB'
req=requests.get(url)
content=req.text
soup=BeautifulSoup(content, 'html.parser')
for x in soup.head.find_all('a'):
print(x.get('href'))
The HTML prints if I set it up that way, but I can't get the for loop to work.
If you're trying to get all tags from the specified URL then:
url = 'https://www.etsy.com/?utm_source=google&utm_medium=cpc&utm_term=etsy_e&utm_campaign=Search_US_Brand_GGL_ENG_General-Brand_Core_All_Exact&utm_ag=A1&utm_custom1=_k_Cj0KCQiAi8KfBhCuARIsADp-A54MzODz8nRIxO2LnGcB8Ezc3_q40IQk9HygcSzz9fPmPWnrITz8InQaAt5oEALw_wcB_k_&utm_content=go_227553629_16342445429_536666953103_kwd-1818581752_c_&utm_custom2=227553629&gclid=Cj0KCQiAi8KfBhCuARIsADp-A54MzODz8nRIxO2LnGcB8Ezc3_q40IQk9HygcSzz9fPmPWnrITz8InQaAt5oEALw_wcB'
with requests.get(url) as r:
r.raise_for_status()
soup = BeautifulSoup(r.text, 'lxml')
if (body := soup.body):
for a in body.find_all('a', href=True):
print(a['href'])

How to read link from beautifulsoup output python

I am trying to pass a link I extracted from beautifulsoup.
import requests
r = requests.get('https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]') ]
print(links[1])
This is the link I am wanting.
Output: https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data_07202021.zip
Now I am trying to pass this link through so I can download the contents.
# make a folder if it doesn't already exist
if not os.path.exists(folder_name):
os.makedirs(folder_name)
# pass the url
url = r'link from beautifulsoup result needs to go here'
response = requests.get(url, stream = True)
# extract contents
with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
for elem in zf.namelist():
zf.extract(elem, '../data')
My overall goal is trying to take the link that I webscraped and place it in the url variable because the link is always changing on this website. I want to make it dynamic so I don't have to manually search for this link and change it when its changing and instead it changes dynamically. I hope this makes sense and appreciate any help I can get.
If I manually enter my code as the following I know it works
url = r'https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data_07202021.zip'
If I can get my code to pass that exactly I know it'll work I'm just stuck with how to accomplish this.
I think you can do it with the find_all() method in Beautiful Soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content)
for a in soup.find_all('a'):
url = a.get('href')

why doesn't my web scraper work? Python3 - requests, BeautifulSoup

I have been following this python tutorial for a while, and I made a web scrawler, similar to the one in the video.
Language: Python
import requests
from bs4 import BeautifulSoup
def spider(max_pages):
page = 1
while page <= max_pages:
url = 'https://www.aliexpress.com/category/7/computer-office.html?trafficChannel=main&catName=computer-office&CatId=7&ltype=wholesale&SortType=default&g=n&page=' + str(page)
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')
for link in soup.findAll('a', {'class':'item-title'}):
href = link.get('href')
title = link.string
print(href)
page += 1
spider(1)
And this is the output that the program gives:
PS D:\development> & C:/Users/hirusha/AppData/Local/Programs/Python/Python38/python.exe "d:/development/Python/TheNewBoston/Python/one/web scrawler.py"n/TheNewBoston/Python/one/web scrawler.py"
PS D:\development>
What can I do?
Before this, I had an error, the code was:
soup = BeautifulSoup(plain_text)
i changed this to
soup = BeautifulSoup(plain_text, 'html.parser')
and the error was gone,
the error i got here was:
d:/development/Python/TheNewBoston/Python/one/web scrawler.py:10: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 10 of the file d:/development/Python/TheNewBoston/Python/one/web scrawler.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.
soup = BeautifulSoup(plain_text)
Any help is appreciated, Thank You!
There are no results as the class you are targeting is not present until the webpage is rendered, which doesn't happen with requests.
Data is dynamically retrieved from a script tag. You can regex the JavaScript object holding the data and parse with json to get that info.
The error you show was due to a parser not being specified originally; which you rectified.
import re, json, requests
import pandas as pd
r = requests.get('https://www.aliexpress.com/category/7/computer-office.html?trafficChannel=main&catName=computer-office&CatId=7&ltype=wholesale&SortType=default&g=n&page=1')
data = json.loads(re.search(r'window\.runParams = (\{".*?\});', r.text, re.S).group(1))
df = pd.DataFrame([(item['title'], 'https:' + item['productDetailUrl']) for item in data['items']])
print(df)

warning in building webcrawler in python using beautifulsoup

I am trying to build a simple web crawler that gives the URLs of every legion product displayed on amazon.in if the key searched is 'legion'. I am using the following code:
import requests
from bs4 import BeautifulSoup
def legion_spider(max_pages):
page = 1
while page <= max_pages:
url = 'https://www.amazon.in/s?k=legion&qid=1588862016&swrs=82DF79C1243AF6D61651CCAA9F883EC4&ref=sr_pg_'+ str(page)
source_code = requests.get(url)
plain_txt = source_code.text
soup = BeautifulSoup(plain_txt)
for link in soup.findAll('a',{'class': 'a-size-medium a-color-base a-text-normal'}):
href = link.get('href')
print(href)
page += 1
legion_spider(1)
and the output I am getting is this:
C:\Users\lenovo\AppData\Local\Programs\Python\Python38-32\python.exe "E:/Python Practice/web_crawler.py"
E:/Python Practice/web_crawler.py:10: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 10 of the file E:/Python Practice/web_crawler.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
soup = BeautifulSoup(plain_txt)
Process finished with exit code 0
You are missing the parser! Follow this part of the BS documentation!
BeautifulSoup(markup, <parser>)

Grabbing instagram feed using Python

I'm trying to get all Instagram posts by a specific user in Python. Below my code:
import requests
from bs4 import BeautifulSoup
def get_images(user):
url = "https://www.instagram.com/" + str(user)
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for image in soup.findAll('img'):
href = image.get('src')
print(href)
get_images('instagramuser')
However, I'm getting the error:
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 14 of the file C:/Users/Bedri/PycharmProjects/untitled1/main.py. To get rid of this warning, change code that looks like this:
BeautifulSoup([your markup])
to this: BeautifulSoup([your markup], "html.parser") markup_type=markup_type))
So my question, what am I doing wrong?
You should pass parser to BeautifulSoup, it's not an error, just a warning.
soup = BeautifulSoup(plain_text, "html.parser")
soup = BeautifulSoup(plain_text,'lxml')
I would recommend using > lxml < instead of > html.parser <
Instead of requests.get use urlopen
here's the code all in one line
from urllib import request
from bs4 import BeautifulSoup
def get_images(user):
soup = BeautifulSoup(request.urlopen("https://www.instagram.com/"+str(user)),'lxml')
for image in soup.findAll('img'):
href = image.get('src')
print(href)
get_images('user')

Categories

Resources