Web Scraping with Python - Errors Occurred - python

I want to write the code as below:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html)
for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)
But whenever I run the line bsObj = BeautifulSoup(html), it throws an error, as shown in the attached screenshot.
I hope someone can help me out with this.
Thanks

Use
html = urlopen("http://www.your.url/here")
instead of
html = urlopen(("http://www.your.url/here")
Note the doubled (( to the right of urlopen in the second line: the extra opening parenthesis is never closed.
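Another common cause of an error on exactly that line is calling BeautifulSoup(html) without naming a parser, which on recent versions of bs4 at least produces a warning. A minimal offline sketch with the parser passed explicitly, using an inline stand-in for the page3.html table (the table contents here are illustrative, not the real page):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the giftList table on page3.html, so the
# sibling-iteration logic can be shown without a network call.
html = """
<table id="giftList">
  <tr><th>Item Title</th><th>Cost</th></tr>
  <tr><td>Vegetable Basket</td><td>$15.00</td></tr>
  <tr><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>
</table>
"""
bsObj = BeautifulSoup(html, "html.parser")  # name the parser explicitly

# .tr is the first (header) row; its next_siblings are whitespace
# strings and the remaining <tr> tags, so filter to tags only.
rows = [s for s in bsObj.find("table", {"id": "giftList"}).tr.next_siblings
        if getattr(s, "name", None) == "tr"]
print(len(rows))  # the two data rows after the header row
```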

Weird text indentation when web scraping with BeautifulSoup4 in Python

I'm trying to web scrape GitHub.
This is the code:
import requests as req
from bs4 import BeautifulSoup

urls = [
    "https://github.com/moom825/Discord-RAT",
    "https://github.com/freyacodes/Lavalink",
    "https://github.com/KagChi/lavalink-railways",
    "https://github.com/KagChi/lavalink-repl",
    "https://github.com/Devoxin/Lavalink.py",
    "https://github.com/karyeet/heroku-lavalink",
]

r = req.get(urls[0])
soup = BeautifulSoup(r.content, "lxml")
title = str(soup.find("p", attrs={"class": "f4 mt-3"}).text)
print(title)
When I run the program I don't get any errors, but the printed text has very weird indentation.
Can anyone please help me with this problem?
I'm using Replit.
GitHub has a really good API for this.
Alternatively, you can call .strip() after .text; it will remove the surrounding whitespace.
import requests as req
from bs4 import BeautifulSoup

urls = [
    "https://github.com/moom825/Discord-RAT",
    "https://github.com/freyacodes/Lavalink",
    "https://github.com/KagChi/lavalink-railways",
    "https://github.com/KagChi/lavalink-repl",
    "https://github.com/Devoxin/Lavalink.py",
    "https://github.com/karyeet/heroku-lavalink",
]

r = req.get(urls[0])
soup = BeautifulSoup(r.content, "lxml")
title = str(soup.find("p", attrs={"class": "f4 mt-3"}).text.strip())
print(title)
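The .strip() fix can be seen in isolation on an offline stand-in for the description paragraph (the markup and text below are made up for illustration; only the class name f4 mt-3 is from the question):

```python
from bs4 import BeautifulSoup

# Stand-in for GitHub's description paragraph; the newlines and spaces
# mimic the whitespace present in the real markup.
html = '<p class="f4 mt-3">\n      Repository description here\n    </p>'
soup = BeautifulSoup(html, "html.parser")

raw = soup.find("p", attrs={"class": "f4 mt-3"}).text
print(repr(raw))      # the leading/trailing whitespace comes from the markup
print(raw.strip())    # .strip() removes it
```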

Trying to scrape Aliexpress

So I am trying to scrape the price of a product on Aliexpress. I tried inspecting the element which looks like
<span class="product-price-value" itemprop="price" data-spm-anchor-id="a2g0o.detail.1000016.i3.fe3c2b54yAsLRn">US $14.43</span>
I'm trying to run the following code
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
url = 'https://www.aliexpress.com/item/32981494236.html?spm=a2g0o.productlist.0.0.44ba26f6M32wxY&algo_pvid=520e41c9-ba26-4aa6-b382-4aa63d014b4b&algo_expid=520e41c9-ba26-4aa6-b382-4aa63d014b4b-22&btsid=0bb0623b16170222520893504e9ae8&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_'
source = urlopen(url).read()
soup = BeautifulSoup(source, 'lxml')
soup.find('span', class_='product-price-value')
but I keep getting a blank output. I must be doing something wrong but these methods seem to work in the tutorials I've seen.
Here is what I found. If I understood correctly, the price on the page you gave is filled in by scripts; the original HTML doesn't contain it directly, only the script tags, so I just used split to extract it from them. Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://aliexpress.ru/item/1005002281350811.html?spm=a2g0o.productlist.0.0.42d53b59T5ddTM&algo_pvid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5&algo_expid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5-1&btsid=0b8b035c16170960366785062e33c0&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_&sku_id=12000019900010138'
data = requests.get(url)
soup = BeautifulSoup(data.content, features="lxml")
res = soup.findAll("script")
total_value = str(res[-3]).split("totalValue:")[1].split("}")[0].replace("\"", "").replace(".", "").strip()
print(total_value)
It works fine; I tried it on a few pages from AliExpress.
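The split chain can be demonstrated offline on a stand-in for the inline script text (the key name totalValue is from the answer above; the rest of the string is made up for illustration):

```python
# Stand-in for the embedded <script> text that carries the price.
script_text = 'window.runParams = {priceModule: {totalValue: "US $14.43"}}'

# Take everything after "totalValue:", cut at the first closing brace,
# then drop the quotes and surrounding whitespace.
total_value = (script_text.split("totalValue:")[1]
               .split("}")[0]
               .replace('"', "")
               .strip())
print(total_value)  # US $14.43
```

This is fragile by nature: if the site renames the key or reorders the script tags, the index and the key both need updating.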

Reading in Content from URLs in a File

I'm trying to get other subset URLs from a main URL. However, as I print to check whether I got the content, I noticed that I am only getting the HTML, not the URLs within it.
import urllib.request

file = 'http://example.com'
with urllib.request.urlopen(file) as url:
    collection = url.read().decode('UTF-8')
I think this is what you are looking for.
You can use the Beautiful Soup library, and this code should work with Python 3:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_all_urls(url):
    page = urlopen(url)  # renamed from "open" to avoid shadowing the built-in
    url_html = BeautifulSoup(page, 'html.parser')
    for link in url_html.find_all('a'):
        href = str(link.get('href'))
        if href.startswith('http'):
            print(href)
        else:
            print(url + str(href))

get_all_urls('http://example.com')  # placeholder URL; urlopen needs a scheme
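One caveat with the url + href concatenation above: for relative links it can produce malformed URLs (missing or doubled slashes). The standard library's urllib.parse.urljoin handles both absolute and relative hrefs; a small offline sketch with made-up markup:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Inline HTML with one absolute and one relative link.
html = '<a href="http://example.com/a">A</a> <a href="/b">B</a>'
base = "http://example.com"

# urljoin leaves absolute URLs alone and resolves relative ones
# against the base URL.
links = [urljoin(base, a.get("href"))
         for a in BeautifulSoup(html, "html.parser").find_all("a")]
print(links)  # ['http://example.com/a', 'http://example.com/b']
```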

I have BeautifulSoup4 from Anaconda, but I can't seem to utilize it to save URLs to TXT

I have this view in Anaconda.
However, I can't seem to utilize BeautifulSoup in my script.
import string
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
#import BeautifulSoup as bs

alphabets = string.ascii_lowercase
for i in alphabets:
    #print(i)
    html = urlopen("http://www.airlineupdate.com/content_public/codes/airportcodes/airports-by-iata/iata-" + i + ".htm")
    print(html)
    for j in html:
        #soup = bs4(html, "html.parser")
        soup = bs(html, "html.parser")
f = open('C:\\Users\\Excel\\Desktop\\URL.txt', 'w')
When I try to run the code above, I get the following error:
ModuleNotFoundError: No module named 'BeautifulSoup4'
Can someone enlighten me as to what's going on here?
From the documentation, the import is
from bs4 import BeautifulSoup
and based on your code, it seems like you want to use it as bs(), so:
from bs4 import BeautifulSoup as bs

Python BeautifulSoup cannot find table ID

I am running into some trouble scraping a table using BeautifulSoup. Here is my code
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")
stats = soup.find('table', id = 'totals')
>>> print(stats)
None
When I right click on the table to inspect the element the HTML looks as I'd expect, however when I view the source the only element with id = 'totals' is commented out. Is there a way to scrape a table from the commented source code?
I have referenced this post but can't seem to replicate their solution.
Here is a link to the webpage I am interested in. I'd like to scrape the table labeled "Totals" and store it as a data frame.
I am relatively new to Python, HTML, and web scraping. Any help would be greatly appreciated.
Thanks in advance.
Michael
Comments are string instances in BeautifulSoup. You can use BeautifulSoup's find method with a regular expression to find the particular string that you're after. Once you have the string, have BeautifulSoup parse that and there you go.
In other words,
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")
stats_html = soup.find(string=re.compile('id="totals"'))
stats_soup = BeautifulSoup(stats_html, "html.parser")
print(stats_soup.table.caption.text)
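The same idea can be shown offline with bs4's Comment class, on a made-up stand-in for the commented-out table (the real page's markup will differ; only the all_totals/totals ids are from the question):

```python
from bs4 import BeautifulSoup, Comment

# Stand-in: the site ships the table inside an HTML comment, roughly like this.
html = ('<div id="all_totals">'
        '<!-- <table id="totals"><caption>Totals</caption></table> -->'
        '</div>')
soup = BeautifulSoup(html, "html.parser")

# Comments are NavigableString subclasses, so find(string=...) with an
# isinstance check pulls the comment out of the wrapper div.
comment = soup.find("div", id="all_totals").find(
    string=lambda s: isinstance(s, Comment))

# Re-parse the comment's contents to get a normal soup of the table.
stats_soup = BeautifulSoup(comment, "html.parser")
print(stats_soup.table.caption.text)  # Totals
```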
You can do this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page, "lxml")
stats = soup.find_all('div', id='all_totals')
print(stats)
Please let me know if this helped!
