I am trying to extract all the links to the store locations on a website: https://www.ulta.com/stores/directory
The structure of the page looks like this (screenshot omitted).
I want to extract all the links under the ul (class="sl-all-locations__wrapper") using BeautifulSoup as below, but I get nothing from the code:
import requests
import bs4 as bs

resp = requests.get('https://www.ulta.com/stores/directory')
soup = bs.BeautifulSoup(resp.text, 'lxml')
# returns an empty list
print(soup.find_all('ul', {'class': 'sl-all-locations__wrapper'}))
Also, queries like
soup.find_all('div', id='sl-all-locations')
do not work either. I am wondering whether there is anything wrong with my code or the website is blocking scrapers. Can anyone help me with this?
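A quick way to check this (a diagnostic sketch, not part of the question) is to test whether the class name appears in the raw response at all; if it does not, the store list is rendered client-side by JavaScript, and requests alone will never see it:

import requests

resp = requests.get(
    'https://www.ulta.com/stores/directory',
    headers={'User-Agent': 'Mozilla/5.0'},  # some sites reject the default requests user agent
)

# If the class name is absent from the raw HTML, the list is built by
# JavaScript after page load, so BeautifulSoup has nothing to find.
print(resp.status_code)
print('sl-all-locations__wrapper' in resp.text)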
I'd like to make an auto-login program for an internal network website, so I tried to parse that site using the requests and BeautifulSoup libraries. It works, but I get HTML that is a lot shorter than the site's actual HTML. What's the problem? Maybe a security issue? Please help me.
import requests
from bs4 import BeautifulSoup as bs

page = requests.get("http://test.com")
soup = bs(page.text, "html.parser")  # the parser name is "html.parser", not "html.parse"
print(soup)  # I get HTML a lot shorter than the site's actual HTML
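A shorter response than what the browser shows usually means the extra content is either rendered by JavaScript or only served after you log in. A minimal sketch of a form login with a persistent session (the login URL and form field names here are hypothetical; inspect the real login form in your browser for the actual action URL and input names):

import requests
from bs4 import BeautifulSoup

# Hypothetical endpoint and field names -- replace with the values from
# your site's login <form>.
LOGIN_URL = "http://test.com/login"
credentials = {"username": "me", "password": "secret"}

with requests.Session() as session:  # a Session keeps cookies between requests
    session.post(LOGIN_URL, data=credentials)
    page = session.get("http://test.com")
    soup = BeautifulSoup(page.text, "html.parser")
    print(len(page.text))  # compare with the size of the HTML your browser shows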
I'm new to Python and I'm struggling to find a solution for the following. I need two things from a website that I can get from inspect element: the link to the .m3u8 file, which can be found in the HTML (Elements tab), and a link to a .ts file (it doesn't matter which one) from the Network tab. Does anybody know how to do this? Thanks in advance!
Use BS4 and requests:
import requests
from bs4 import BeautifulSoup
URL = 'https://stackoverflow.com/questions/64828046/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='question-header')
print(results)
from urllib.request import urlopen
import lxml.html

connection = urlopen('http://yourwebsite')
dom = lxml.html.fromstring(connection.read())

# //a/@href selects the href attribute of every anchor tag
for link in dom.xpath('//a/@href'):
    if link.endswith(".m3u8") or link.endswith(".ts"):
        print(link)
You can use other conditions to check whether something is in the link, for example:

if "m3u8" in link:
    print(link)
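One caveat: stream URLs often live inside a <script> tag rather than an <a> href, and the .ts segments shown in the Network tab are typically requested by the player at runtime, so they may never appear in the page HTML at all. A sketch that scans the raw source for .m3u8 URLs with a regular expression instead:

import re
import requests

html = requests.get('http://yourwebsite').text

# Match absolute URLs ending in .m3u8 anywhere in the source,
# including inside inline <script> blocks.
for match in re.findall(r'https?://[^\s"\']+\.m3u8[^\s"\']*', html):
    print(match)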
I'm trying to read a news webpage to get the titles of their stories. I'm attempting to put them in a list, but I keep getting an empty list. Can someone please point me in the right direction here? What am I missing? Please see the code below. Thanks.
import requests
from bs4 import BeautifulSoup

url = 'https://nypost.com/'
ttl_lst = []

soup = BeautifulSoup(requests.get(url).text, "lxml")
titles = soup.find_all('h2', {'class': 'story-heading'})
for row in titles:
    ttl_lst.append(row.text)

print(ttl_lst)
The requests module only returns the initial HTML document the server sends. Sites like nypost load their articles with AJAX requests after the page loads, so you will have to use something like Selenium, which runs the page's JavaScript and lets those AJAX requests complete.
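A minimal Selenium sketch of that idea (assumes chromedriver is available on PATH; the 'story-heading' class comes from the question and may have changed on the live site):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get('https://nypost.com/')

# driver.page_source holds the DOM *after* JavaScript has run,
# so the AJAX-loaded headlines are present.
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

ttl_lst = [h2.text for h2 in soup.find_all('h2', {'class': 'story-heading'})]
print(ttl_lst)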
I have scoured these pages for days without success, so I hope this is not a duplicate; if it is, I apologize.
I have a device on a local network that provides a live-updating data readout in HTML. So far my BeautifulSoup and urllib2 attempts at parsing this data have been unsuccessful.
Any help would be appreciated.
This is the source code, with the data of interest circled (screenshot omitted):
This is the resultant output (screenshot omitted):
from bs4 import BeautifulSoup
import urllib2  # Python 2

url = 'http://192.168.1.2/index.html#home-view'

usock = urllib2.urlopen(url)
data = usock.read()
usock.close()

soup = BeautifulSoup(data, "html.parser")
result = soup.findAll('p', {'class': 'gas-conc'})
print result
SOLVED! Thank you for the assistance. With Selenium I was able (painfully) to scrape out this data. However, I had to run the source code through BeautifulSoup's prettify function and manually count out which characters to splice out.
I'm 90% sure that you won't get this data unless you manage to render the JavaScript somehow.
Check out this post for more info on how to make that happen.
In a nutshell, you can use one of the following (see the sketch after this list):
selenium
PyQt5
Dryscrape
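A minimal Selenium sketch for this page (the URL and the 'gas-conc' class come from the question; the fixed sleep is a crude assumption, and an explicit wait would be more robust):

from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get('http://192.168.1.2/index.html#home-view')
time.sleep(2)  # crude wait for the live readout to render

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

for p in soup.find_all('p', {'class': 'gas-conc'}):
    print(p.get_text(strip=True))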
I have written a small web scraper with BS4. With this code I am able to scrape one page at a time; here is the relevant part:
import csv
from bs4 import BeautifulSoup
import requests
html = requests.get("http://www.gbgb.org.uk/resultsMeeting.aspx?id=129867").text
soup = BeautifulSoup(html,'lxml')
This code scrapes one page, but I want to scrape more than one page at a time (a range), so I tried adding a for loop like this:
import csv
from bs4 import BeautifulSoup
import requests
for ace in range(129867, 129869):
    html = requests.get("http://www.gbgb.org.uk/resultsMeeting.aspx?id= {ace}").text
    soup = BeautifulSoup(html, 'lxml')
Nothing happens when I run the code, and I don't even get any of the usual cryptic messages hinting at what went wrong. Could it be the syntax, or is it something else? Any help appreciated.
You should do everything inside the loop now. Also, you are not inserting the ace value into the URL, and there is an extra space after the id=. It might also be a good idea to establish a web-scraping session and use the params keyword of the get() method.
Fixed version:
import csv
from bs4 import BeautifulSoup
import requests
with requests.Session() as session:
    for ace in range(129867, 129869):
        url = "http://www.gbgb.org.uk/resultsMeeting.aspx"
        html = session.get(url, params={'id': ace}).text
        soup = BeautifulSoup(html, 'lxml')
        # process `soup` here, inside the loop
Note this code is still of a blocking nature; it processes the pages one at a time. If you want to speed things up, look into the Scrapy web-scraping framework, or see the thread-pool sketch below for a lighter-weight option.
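A sketch of that lighter-weight option, fetching the pages concurrently with a standard-library thread pool (the worker function and pool size are illustrative assumptions):

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

URL = "http://www.gbgb.org.uk/resultsMeeting.aspx"

def fetch(ace):
    # Each worker fetches and parses one results page.
    html = requests.get(URL, params={'id': ace}).text
    return ace, BeautifulSoup(html, 'lxml')

with ThreadPoolExecutor(max_workers=4) as pool:
    for ace, soup in pool.map(fetch, range(129867, 129869)):
        print(ace, soup.title.string if soup.title else None)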