I am trying to get the hottest wallpaper from Reddit's wallpaper subreddit.
I am using Beautiful Soup to get the HTML layout of the first wallpaper, and then a regex to get the URL from the anchor tag. But more often than not I get a URL that my regex doesn't match. Here's the code I am using:
import re
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.reddit.com/r/wallpapers")
if r.status_code == 200:
    print(r.status_code)
    text = r.text
    soup = BeautifulSoup(text, "html.parser")
    search_string = str(soup.find('a', {'class': 'title'}))
    photo_url = str(re.search('[htps:/]{7,8}[a-zA-Z0-9._/:.]+[a-zA-Z0-9./:.-]+', search_string).group())
Is there any way around it?
Here's a better way to do it:
Appending .json to a Reddit URL returns JSON instead of HTML.
For example, https://www.reddit.com/r/wallpapers serves HTML content, but https://www.reddit.com/r/wallpapers/.json gives you a JSON object that you can easily work with using the json module in Python.
Here's the same program for getting the hottest wallpaper:
>>> import urllib
>>> import json
>>> data = urllib.urlopen('https://www.reddit.com/r/wallpapers/.json')
>>> wallpaper_dict = json.loads(data.read())
>>> wallpaper_dict['data']['children'][1]['data']['url']
u'http://i.imgur.com/C49VtMu.jpg'
>>> wallpaper_dict['data']['children'][1]['data']['title']
u'Space Shuttle'
>>> wallpaper_dict['data']['children'][1]['data']['domain']
u'i.imgur.com'
Not only is it much cleaner, it will also spare you a headache if Reddit changes its HTML layout or someone posts a URL that your regex can't handle. As a rule of thumb, it's generally smarter to use JSON instead of scraping the HTML.
PS: The index into ['children'] picks the wallpaper: the listing is zero-indexed from the top, and the first entry is often a stickied post, which is why [1] is used above for the hottest one.
Therefore ['data']['children'][2]['data']['url'] will give you the link for the second-hottest wallpaper. You get the gist? :)
PPS: What's more, with this method you can use the default urllib module. Generally, when you're scraping Reddit you have to create a fake User-Agent header and pass it with the request (otherwise it gives you a 429 response code), but that's not the case with this method.
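If you're on Python 3, here is a minimal sketch of the same idea using the requests library (the User-Agent value below is a made-up example, and whether Reddit still serves this endpoint without one is worth verifying):

import requests

# Fetch the subreddit listing as JSON instead of scraping HTML.
response = requests.get(
    "https://www.reddit.com/r/wallpapers/.json",
    headers={"User-Agent": "wallpaper-fetcher-example"},  # hypothetical UA string
)
listing = response.json()

# children[0] is the topmost post; print the first few titles and URLs.
for child in listing["data"]["children"][:3]:
    post = child["data"]
    print(post["title"], post["url"])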
Here is the correct way to do it with your method, though Jarwins' method is better. You should not be using regex when working with HTML; you simply had to reference the href attribute to get the URL:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.reddit.com/r/wallpapers")
if r.status_code == 200:
    soup = BeautifulSoup(r.text, "html.parser")
    url = str(soup.find_all('a', {'class': 'title'})[1]["href"])
    print(url)
I'm beginning to learn coding and can't understand why video tutorials always use a simple method of entering multiple search strings to return tags embedded within a tag, and they get results, yet my len() check always comes back with a big fat 0 when I do the same thing, with nearly the exact same code. Ultimately, for this post, let's say I want to return the URLs. On this page they sit behind "div" tags, then "h3" tags, and then "href" attributes. But for example, let's just try to narrow down the "h3" text behind the "div" tags.
Example:
import csv
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
response = requests.get('https://www.youtube.com/playlist?list=PLHnSLOMOPT11ORMDapNppzDKBYnWWP66O')
print(response)
# <Response [200]>
soup = BeautifulSoup(response.text, 'html.parser')
videos = soup.find_all('div')
print(len(videos))
# 95
That request finds 95 div tags. However, when I add any second string to narrow those down further, I get 0 back. Let's try adding the h3 tag; it should return a much smaller number, but I get zero.
import csv
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
response = requests.get('https://www.youtube.com/playlist?list=PLHnSLOMOPT11ORMDapNppzDKBYnWWP66O')
print(response)
# <Response [200]>
soup = BeautifulSoup(response.text, 'html.parser')
videos = soup.find_all('div', 'h3')
print(len(videos))
# 0
What takes me aback is that tutorials and videos use the same simplistic ('query 1', 'query 2') method, and they get search results that filter accordingly on each tag's embedded text. I would appreciate code suggestions for filtering on the embedded tags' text, as well as an explanation of what I may be doing incorrectly that these videos do right, since the simple method they demonstrate hasn't worked for me.
I've even tried the same method on a simple Wikipedia page, but again, anything more than one string gets a 0 response.
Pass a list as the first argument. It doesn't look very different, but it's definitely not the exact same thing:
videos = soup.find_all(['div', 'h3'])
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-list
This is the function you are calling in the bad code:
find_all(name, attrs, recursive, string, limit, **kwargs)
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
You are asking for name='div', attrs='h3', so 0 is the answer you should expect. This is the kind of thing you are actually telling it to look for (which nobody would actually put in their HTML):
<div class="h3">...</div>
A full example demonstrating the proper syntax:
from bs4 import BeautifulSoup
from io import BytesIO
content = BytesIO(bytes("""<html>
<body>
<h3><div id='first'>content</div></h3>
<h3><div id='second'>content</div></h3>
<h3><div id='third'>content</div></h3>
</body>
</html>
""", 'utf-8'))
soup = BeautifulSoup(content, 'html.parser')
videos = soup.find_all(['div', 'h3'])
print(len(videos))
Output:
6
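One caveat: find_all(['div', 'h3']) matches either tag anywhere in the document. If what you actually want is tags nested inside other tags, as in the div-then-h3-then-href structure from the question, CSS selectors express descendant matching directly. A minimal sketch, assuming markup of that shape:

from bs4 import BeautifulSoup

content = """<html><body>
<div><h3><a href="https://example.com/video1">first</a></h3></div>
<div><h3><a href="https://example.com/video2">second</a></h3></div>
</body></html>"""

soup = BeautifulSoup(content, 'html.parser')
# 'div h3 a[href]' selects anchors with an href, inside an h3, inside a div.
for a in soup.select('div h3 a[href]'):
    print(a['href'])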
I have been trying to get the values of some fields of a web page:
from urllib.request import urlopen
from bs4 import BeautifulSoup

itemPage = 'https://dadosabertos.camara.leg.br/api/v2/legislaturas/1'
url = urlopen(itemPage)
soupItem = BeautifulSoup(url, 'lxml')
dataInicio = soupItem.find('dataInicio')
dataFim = soupItem.find('dataFim')
However, dataInicio and dataFim are empty. What am I doing wrong?
There are a couple of issues here. First, BeautifulSoup expects markup as input; check your url variable and you'll see it's actually an <http.client.HTTPResponse object at 0x036D7770>. You can read() it, which produces a JSON byte string, and that is usable. But if you'd prefer to stick with XML parsing, I'd recommend using Python's requests library to obtain a raw XML string (pass in the correct headers to request XML).
Secondly, when you create your soup object, you need to pass in features="xml" instead of "lxml".
Putting it all together:
import requests
from bs4 import BeautifulSoup
item_page = "https://dadosabertos.camara.leg.br/api/v2/legislaturas/1"
response = requests.get(item_page, headers={"accept": "application/xml"})
soup = BeautifulSoup(response.text, "xml")
data_inicio = soup.find("dataInicio")
data_fim = soup.find("dataFim")
print(data_inicio)
print(data_fim)
Output:
<dataInicio>1826-04-29</dataInicio>
<dataFim>1830-04-24</dataFim>
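Since the API serves JSON by default, the json route is also an option. A sketch, assuming the record sits under a top-level "dados" key (worth verifying against the actual response):

import requests

item_page = "https://dadosabertos.camara.leg.br/api/v2/legislaturas/1"
response = requests.get(item_page, headers={"accept": "application/json"})

# "dados" is assumed to be the wrapper key for the record in this API.
dados = response.json()["dados"]
print(dados["dataInicio"])
print(dados["dataFim"])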
What I'm trying to do is search Stack Overflow for answers. I know it's probably been done before, but I'd like to do it again, with a GUI. Anyway, that's a little way down the road; right now I'm just trying to get to the page with the most votes for a question. While looking at how to get into a nested div to grab the link for the first answer, I noticed my search was off and taking me to the wrong place. I am using BeautifulSoup, Requests, and Python 3.
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup
payload = {'q': 'open GL cube'}
page = requests.get("https://stackoverflow.com/search",params=payload)
print(" URL IS ", page.url)
data = page.content
soup = BeautifulSoup(data, 'lxml')
top = soup.find('a', {'title':'Highest voted search results'})['href']
print(top)
page2 = requests.get("https://stackoverflow.com",params=top)
print(page2.url)
data2 = page2.content
topSoup = BeautifulSoup(data2, 'lxml')
for div in topSoup.find_all('div', {'class':'result-link'}):
    print(div.text)
I get the link and it outputs /search?tab=votes&q=open%20GL%20cube,
but when I pass it in with the params it becomes
https://stackoverflow.com/?/search?tab=votes&q=open%20GL%20cube
and I would like to get rid of the /?/.
Don't pass it as parameters, just add it to the URL:
page2 = requests.get("https://stackoverflow.com" + top)
Once you pass params, requests appends a ? to the URL before concatenating the parameters onto it.
Requests - Passing Parameters In URLs
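Alternatively, urllib.parse.urljoin handles the leading slash for you, which is a little more robust than plain concatenation:

import requests
from urllib.parse import urljoin

top = "/search?tab=votes&q=open%20GL%20cube"  # the href scraped above
page2 = requests.get(urljoin("https://stackoverflow.com", top))
print(page2.url)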
Also, as stated, you should really use the API.
Why not use the API?
There are plenty of search options (https://api.stackexchange.com/docs/advanced-search), and you get the response in JSON, no need for ugly HTML parsing.
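For example, here is a minimal sketch against the search/advanced endpoint (parameter names are from the linked docs; the response fields used below are assumptions worth double-checking):

import requests

params = {
    "q": "open GL cube",
    "sort": "votes",       # highest-voted first
    "order": "desc",
    "site": "stackoverflow",
}
resp = requests.get("https://api.stackexchange.com/2.3/search/advanced", params=params)

# Each item in the JSON response should carry the question's score, title and link.
for item in resp.json().get("items", [])[:5]:
    print(item["score"], item["title"], item["link"])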
My goal is to write a Python script that takes an artist's name as string input and appends it to the base URL of the Genius search query, then retrieves all the lyrics links from the returned web page (the required subset for this problem, each of which contains the artist's name). I am in the initial phase right now and have only managed to retrieve all the links from the web page, including the ones I don't want in my subset. I tried to find a simple solution but failed repeatedly.
import requests
# The Requests library.
from bs4 import BeautifulSoup
from lxml import html
user_input = input("Enter Artist Name = ").replace(" ","+")
base_url = "https://genius.com/search?q="+user_input
header = {'User-Agent':''}
response = requests.get(base_url, headers=header)
soup = BeautifulSoup(response.content, "lxml")
for link in soup.find_all('a', href=True):
    print(link['href'])
This returns the complete list below, while I only need the ones that end with "lyrics" and contain the artist's name (here, for instance, Drake). Those are the links from which I should be able to retrieve the lyrics.
https://genius.com/
/signup
/login
https://www.facebook.com/geniusdotcom/
https://twitter.com/Genius
https://www.instagram.com/genius/
https://www.youtube.com/user/RapGeniusVideo
https://genius.com/new
https://genius.com/Drake-hotline-bling-lyrics
https://genius.com/Drake-one-dance-lyrics
https://genius.com/Drake-hold-on-were-going-home-lyrics
https://genius.com/Drake-know-yourself-lyrics
https://genius.com/Drake-back-to-back-lyrics
https://genius.com/Drake-all-me-lyrics
https://genius.com/Drake-0-to-100-the-catch-up-lyrics
https://genius.com/Drake-started-from-the-bottom-lyrics
https://genius.com/Drake-from-time-lyrics
https://genius.com/Drake-the-motto-lyrics
/search?page=2&q=drake
/search?page=3&q=drake
/search?page=4&q=drake
/search?page=5&q=drake
/search?page=6&q=drake
/search?page=7&q=drake
/search?page=8&q=drake
/search?page=9&q=drake
/search?page=672&q=drake
/search?page=673&q=drake
/search?page=2&q=drake
/embed_guide
/verified-artists
/contributor_guidelines
/about
/static/press
mailto:brands@genius.com
https://eventspace.genius.com/
/static/privacy_policy
/jobs
/developers
/static/terms
/static/copyright
/feedback/new
https://genius.com/Genius-how-genius-works-annotated
https://genius.com/Genius-how-genius-works-annotated
My next step would be to use Selenium to emulate scrolling, which in the case of genius.com yields the entire set of search results. Any suggestions or resources would be appreciated. I would also like a few comments on the way I plan to proceed with this solution. Can we make it more generic?
P.S. I may not have explained my problem very lucidly, but I have tried my best. Questions about any ambiguities are welcome too. I am new to scraping, Python, and programming in general, so I just want to make sure I am following the right path.
Use Python's re module to match only the links you want.
import re
import requests
from bs4 import BeautifulSoup

user_input = input("Enter Artist Name = ").replace(" ", "+")
base_url = "https://genius.com/search?q=" + user_input
header = {'User-Agent': ''}
response = requests.get(base_url, headers=header)
soup = BeautifulSoup(response.content, "lxml")
# Match any link whose path ends in "-lyrics".
pattern = re.compile(r"\S+-lyrics$")
for link in soup.find_all('a', href=True):
    if pattern.match(link['href']):
        print(link['href'])
Output:
https://genius.com/Drake-hotline-bling-lyrics
https://genius.com/Drake-one-dance-lyrics
https://genius.com/Drake-hold-on-were-going-home-lyrics
https://genius.com/Drake-know-yourself-lyrics
https://genius.com/Drake-back-to-back-lyrics
https://genius.com/Drake-all-me-lyrics
https://genius.com/Drake-0-to-100-the-catch-up-lyrics
https://genius.com/Drake-started-from-the-bottom-lyrics
https://genius.com/Drake-from-time-lyrics
https://genius.com/Drake-the-motto-lyrics
This just checks whether each link matches the pattern ending in -lyrics. You can use similar logic to filter on the user_input variable as well.
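For instance, here is a sketch that also anchors on the artist name (it assumes Genius capitalizes the first letter of the artist in its URL slugs, which is worth verifying):

import re

user_input = "drake"  # as read from input() above, before the "+" replacement
# Match links like https://genius.com/Drake-...-lyrics for this artist only.
pattern = re.compile(
    r"https://genius\.com/" + re.escape(user_input.capitalize()) + r"\S*-lyrics$"
)
links = [
    "https://genius.com/Drake-hotline-bling-lyrics",
    "/search?page=2&q=drake",
]
print([link for link in links if pattern.match(link)])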
Hope this helps.
Python/web scraping beginner here, so please bear with me. I'm trying to grab all the product names from this URL.
Unfortunately, nothing gets returned when I run my code. The same code works fine for most other sites, but I've tried dozens of variations and I can't make it work for this one.
Is this URL even scrapable using BeautifulSoup? Any feedback is appreciated.
import bs4
import requests
url = 'http://www.rakuten.com/sr/searchresults.aspx?qu'
payload = {'q': 'Python',}
r = requests.get(url % payload)
soup = bs4.BeautifulSoup(r.text)
titles = [a.attrs.get('href') for a in soup.findAll('div.productscontainer a[href^=/prod]')]
for t in titles:
    print(t)
import bs4
import requests
url = 'http://www.rakuten.com/sr/searchresults.aspx?qu'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text)
titles = [td.text for td in soup.findAll('td', attrs={'class': 'searchlist'})]
for t in titles:
    print(t)
If this formatting is correct, is the JS for sure preventing me from pulling anything?
First of all, your string formatting is likely wrong. Look at this:
>>> url = 'http://www.rakuten.com/sr/searchresults.aspx?qu'
>>> payload = {'q': 'Python',}
>>> url % payload
'http://www.rakuten.com/sr/searchresults.aspx?qu'
I guess that's not what you want. You should look up how string formatting works in Python, and then come up with a proper way to construct your URL.
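With requests, the idiomatic way is to let params build the query string. A sketch (the parameter name 'qu' is guessed from the URL above, so confirm it in the browser first):

import requests

payload = {'qu': 'Python'}  # 'qu' is an assumption taken from the original URL
r = requests.get('http://www.rakuten.com/sr/searchresults.aspx', params=payload)
print(r.url)  # e.g. http://www.rakuten.com/sr/searchresults.aspx?qu=Python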
Secondly, that "search engine" makes heavy use of JavaScript. Probably you will not be able to retrieve the information you want by just looking at the initially retrieved HTML content.