I'm trying to get all Instagram posts by a specific user in Python. Below is my code:
import requests
from bs4 import BeautifulSoup
def get_images(user):
    url = "https://www.instagram.com/" + str(user)
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for image in soup.findAll('img'):
        href = image.get('src')
        print(href)
get_images('instagramuser')
However, I'm getting the error:
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 14 of the file C:/Users/Bedri/PycharmProjects/untitled1/main.py. To get rid of this warning, change code that looks like this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "html.parser")
So my question is: what am I doing wrong?
You should pass a parser to BeautifulSoup. It's not an error, just a warning.
soup = BeautifulSoup(plain_text, "html.parser")
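To see the explicit-parser form in isolation, here is a minimal sketch that parses a small inline string instead of a live page. The sample markup and the helper name image_sources are made up for illustration; only the BeautifulSoup(markup, "html.parser") call comes from the answer.

```python
from bs4 import BeautifulSoup


def image_sources(markup, parser="html.parser"):
    """Return the src attribute of every <img> tag in the markup."""
    # Naming the parser explicitly avoids the "no parser specified" warning.
    soup = BeautifulSoup(markup, parser)
    return [img.get("src") for img in soup.find_all("img")]


# Made-up markup standing in for a downloaded page.
sample_markup = '<div><img src="a.jpg"><img src="b.png" alt="b"></div>'
print(image_sources(sample_markup))
```

Passing "lxml" as the second argument should work the same way, assuming the lxml package is installed.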
soup = BeautifulSoup(plain_text, 'lxml')
I would recommend using lxml instead of html.parser.
Instead of requests.get, use urlopen.
Here's the code, with the request and the parsing done in one line:
from urllib import request
from bs4 import BeautifulSoup
def get_images(user):
    soup = BeautifulSoup(request.urlopen("https://www.instagram.com/" + str(user)), 'lxml')
    for image in soup.findAll('img'):
        href = image.get('src')
        print(href)
get_images('user')
I have been following this Python tutorial for a while, and I made a web crawler similar to the one in the video.
import requests
from bs4 import BeautifulSoup
def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.aliexpress.com/category/7/computer-office.html?trafficChannel=main&catName=computer-office&CatId=7&ltype=wholesale&SortType=default&g=n&page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'class':'item-title'}):
            href = link.get('href')
            title = link.string
            print(href)
        page += 1
spider(1)
And this is the output that the program gives:
PS D:\development> & C:/Users/hirusha/AppData/Local/Programs/Python/Python38/python.exe "d:/development/Python/TheNewBoston/Python/one/web scrawler.py"
PS D:\development>
What can I do?
Before this, I had an error. The code was:
soup = BeautifulSoup(plain_text)
I changed this to
soup = BeautifulSoup(plain_text, 'html.parser')
and the error was gone.
The error I got was:
d:/development/Python/TheNewBoston/Python/one/web scrawler.py:10: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 10 of the file d:/development/Python/TheNewBoston/Python/one/web scrawler.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.
soup = BeautifulSoup(plain_text)
Any help is appreciated, Thank You!
There are no results as the class you are targeting is not present until the webpage is rendered, which doesn't happen with requests.
Data is dynamically retrieved from a script tag. You can regex the JavaScript object holding the data and parse with json to get that info.
The error you show was due to a parser not being specified originally, which you rectified.
import re, json, requests
import pandas as pd
r = requests.get('https://www.aliexpress.com/category/7/computer-office.html?trafficChannel=main&catName=computer-office&CatId=7&ltype=wholesale&SortType=default&g=n&page=1')
data = json.loads(re.search(r'window\.runParams = (\{".*?\});', r.text, re.S).group(1))
df = pd.DataFrame([(item['title'], 'https:' + item['productDetailUrl']) for item in data['items']])
print(df)
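To show the regex-plus-json step without hitting the site, here is a small offline sketch. The page source is a made-up stand-in for the real response (which embeds a much larger object), and extract_run_params is a hypothetical helper name. Unlike the snippet above, it checks for a failed match instead of calling .group(1) directly.

```python
import json
import re

# Made-up page source; the real page embeds a much larger object.
sample_html = """
<html><head><script>
window.runParams = {"items": [{"title": "USB hub", "productDetailUrl": "//example.com/item/1.html"}]};
</script></head><body></body></html>
"""


def extract_run_params(page_source):
    """Return the object assigned to window.runParams, or None if absent."""
    # Same pattern as the answer: capture the object literal up to the first "};".
    match = re.search(r'window\.runParams = (\{".*?\});', page_source, re.S)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_run_params(sample_html)
links = [(item["title"], "https:" + item["productDetailUrl"]) for item in data["items"]]
print(links)
```

If the site changes how it embeds the data, the pattern will stop matching, so the None check gives you a place to handle that.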
I am trying to scrape a book from a website and while parsing it with Beautiful Soup I noticed that there were some errors. For example this sentence:
"You have more… direct control over your skaa here. How many woul "Oh, a half dozen or so,"
The "more…" and " woul" are both errors that occurred somewhere in the script.
Is there any way to automatically clean up mistakes like this?
Example code of what I have is below.
import requests
from bs4 import BeautifulSoup
url = 'http://thefreeonlinenovel.com/con/mistborn-the-final-empire_page-1'
res = requests.get(url)
text = res.text
soup = BeautifulSoup(text, 'html.parser')
print(soup.prettify())
trin = soup.tr.get_text()
final = str(trin)
print(final)
You need to convert (unescape) the HTML entities. To apply this in your situation and still retain the text, you can use stripped_strings:
import requests
from bs4 import BeautifulSoup
import html
url = 'http://thefreeonlinenovel.com/con/mistborn-the-final-empire_page-1'
res = requests.get(url)
text = res.text
soup = BeautifulSoup(text, 'lxml')
for r in soup.select_one('table tr').stripped_strings:
    s = html.unescape(r)
    print(s)
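The unescaping step can be tried on its own with just the standard library. In this sketch the entity-encoded string is an assumption about what the raw text might look like (the exact entities on that page are not shown in the question).

```python
import html

# Assumed raw text containing a numeric and a named HTML entity.
raw = "You have more&#8230; direct control over your skaa here. &quot;Oh, a half dozen or so,&quot;"

cleaned = html.unescape(raw)
print(cleaned)
```

html.unescape handles both numeric references such as &#8230; and named ones such as &quot;.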
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'html5lib')
After running several times, the process gets stuck at BeautifulSoup(html.read(), 'html5lib'). I have tried changing the parser from 'html5lib' to 'lxml' and 'html.parser', but the problem persists. Is there a bug in BeautifulSoup? How can I solve this problem?
Update
I added some logging to the program, like this:
print('open the url')
html = urlopen(url)
print('create BeautifulSoup Object')
bs = BeautifulSoup(html.read(), 'html5lib')
The console prints "create BeautifulSoup Object" and then just stays there with a blinking cursor.
I've encountered the same problem, and I found that the program got stuck at html.read(), which may be because the urlopen() resource is not closed correctly when the response has errors.
You can change it like this:
with urlopen(url) as html:
    html = html.read()
bs = BeautifulSoup(html, "lxml")
Or you can use the requests package, which is better than urllib, like this:
import requests
html = requests.get(url).text
bs = BeautifulSoup(html, "lxml")
Hope this solves your problem.
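A small sketch of the with-block approach, wrapped in a function. The timeout argument and the name fetch_bytes are my own additions, not part of the answer; the idea is that a read that hangs would eventually raise instead of blocking forever. The data: URL is only there so the demo runs without a network connection.

```python
from urllib.request import urlopen


def fetch_bytes(url, timeout=10):
    """Read the whole response body and always close the connection."""
    # The with-block closes the response even if read() raises.
    # timeout (in seconds) is an assumed value; tune it for your site.
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# A data: URL keeps this demo offline; swap in a real http(s) URL in practice.
body = fetch_bytes("data:text/plain,hello")
print(body)
```

The returned bytes can then be passed to BeautifulSoup(body, "lxml") as in the answer.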
I'm trying to scrape a webpage with BeautifulSoup using the code below:
import urllib.request
from bs4 import BeautifulSoup
with urllib.request.urlopen("http://en.wikipedia.org//wiki//Markov_chain.htm") as url:
    s = url.read()
soup = BeautifulSoup(s)
with open("scraped.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text())
    f.close()
The problem is that it saves Wikipedia's main page instead of that specific article. Why doesn't the address work, and how should I change it?
The correct url for the page is http://en.wikipedia.org/wiki/Markov_chain:
>>> import urllib.request
>>> from bs4 import BeautifulSoup
>>> url = "http://en.wikipedia.org/wiki/Markov_chain"
>>> soup = BeautifulSoup(urllib.request.urlopen(url))
>>> soup.title
<title>Markov chain - Wikipedia, the free encyclopedia</title>
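If you build such addresses from article titles, a tiny helper can keep stray double slashes and the .htm suffix out. This is only a sketch: article_url and WIKI_BASE are hypothetical names, and the base is the one used in the corrected URL above.

```python
from urllib.parse import quote

# Base taken from the corrected URL in this answer.
WIKI_BASE = "http://en.wikipedia.org/wiki/"


def article_url(title):
    """Build an article URL from a human-readable title."""
    # Wikipedia article paths use underscores for spaces and no file suffix.
    return WIKI_BASE + quote(title.replace(" ", "_"))


print(article_url("Markov chain"))
```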
@alecxe's answer will generate:
**GuessedAtParserWarning**:
No parser was explicitly specified, so I'm using the best
available HTML parser for this system ("html.parser"). This usually isn't a problem,
but if you run this code on another system, or in a different virtual environment, it
may use a different parser and behave differently. The code that caused this warning
is on line 25 of the file crawl.py.
To get rid of this warning, pass the additional argument 'features="html.parser"' to
the BeautifulSoup constructor.
Here is a solution without GuessedAtParserWarning using requests:
# crawl.py
from os import path
import requests
from bs4 import BeautifulSoup
url = 'https://www.sap.com/belgique/index.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
file = path.join(path.dirname(__file__), 'downl.txt')
# Either print the title/text or save it to a file
print(soup.title)
# download the text
with open(file, 'w') as f:
    f.write(soup.text)
I have this script:
import urrlib2
from bs4 import BeautifulSoup
url = "http://www.shoptop.ru/"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
divs = soup.findAll('a')
print divs
For this website, it prints an empty list. What could be the problem? I am running Ubuntu 12.04.
Actually, there are quite a few bugs in BeautifulSoup that might raise unknown errors. I had a similar issue when working on Apache using the lxml parser.
So just try one of the other parsers mentioned in the documentation:
soup = BeautifulSoup(page, "html.parser")
This should work!
It looks like you have a few mistakes in your code: urrlib2 should be urllib2. I've fixed the code for you, and this works using BeautifulSoup 3:
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.shoptop.ru/"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
divs = soup.findAll('a')
print divs
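For readers on Python 3, where urllib2 and the print statement no longer exist, a rough bs4 equivalent of the link extraction might look like the sketch below. It parses a made-up inline string rather than fetching the site, and extract_links is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def extract_links(markup):
    """Return the href of every <a> tag in the markup."""
    # Explicit parser, so no parser-guessing warning is raised.
    soup = BeautifulSoup(markup, "html.parser")
    return [a.get("href") for a in soup.find_all("a")]


# Made-up markup standing in for the downloaded page.
sample = '<p><a href="/one">One</a> <a href="/two">Two</a></p>'
print(extract_links(sample))
```

If this returns an empty list for a real page, it is worth checking whether the links are added by JavaScript after the page loads.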