Extracting specific elements from a list (Python 2.7)

I am working on a bot that extracts the URLs from a specific page. I have extracted all the links and put them in a list, but now I can't seem to pull out the real URLs (ones leading to other sites, starting with http or https) and append them to another list, or delete the ones that don't start with http. Thanks in advance.
import urllib2
import requests
from bs4 import BeautifulSoup

def main():
    # get all the links from Bing about cancer
    site = "http://www.bing.com/search?q=cancer&qs=n&form=QBLH&pq=cancer&sc=8-4&sp=-1&sk=&cvid=E56491F36028416EB41694212B7C33F2"
    urls = []
    true_links = []
    r = requests.get(site)
    html_content = r.content
    soup = BeautifulSoup(html_content, 'html.parser')
    links = soup.find_all("a")
    for link in links:
        link = link.get("href")
        urls.append(str(link))
        #urls.append(link.get("href"))
    #print map(str, urls)
    # REMOVE GARBAGE LINKS
    print len(urls)
    print urls

main()

You can use urlparse.urljoin:
link = urlparse.urljoin(site, link.get("href"))
This will create absolute URLs out of relative ones.
You should also be using html_content = r.text instead of html_content = r.content. r.text takes care of using the proper encoding.
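Putting both together, here is a minimal sketch (Python 2.7, reusing the site, links, and true_links names from the question) that resolves each href against the page URL and keeps only links leading to other sites over http or https; the startswith filter is an addition, not part of the original answer:
import urlparse  # Python 2 module; in Python 3 this lives in urllib.parse

for link in links:
    href = link.get("href")
    if href is None:  # some <a> tags carry no href at all
        continue
    absolute = urlparse.urljoin(site, href)  # make relative URLs absolute
    if absolute.startswith(("http://", "https://")):
        true_links.append(absolute)

print true_links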

Related

Beautifulsoup: Removing German Umlauts

I'm new to all of this, so I need a little bit of help. For a uni project I am trying to extract ingredients from a website, and in general the code works how it should, but I just don't know how to get "Bärlauch" instead of "B%C3%A4rlauch" in the end.
I used beautifulsoup with the following code:
URL = [...]
links = []
for url in range(0, 10):
    req = requests.get(URL[url])
    soup = bs(req.content, 'html.parser')
    for link in soup.findAll('a'):
        links.append(str(link.get('href')))
I don't get why it doesn't work as it should, even though the encoding is already UTF-8.
Maybe someone knows better.
Thanks!
The URLs are URL-encoded. Also, the response of a request is a response, not a req(uest).
import urllib.parse
import requests
from bs4 import BeautifulSoup as bs

URLS = [...]
links = []
for url in URLS:
    response = requests.get(url)
    soup = bs(response.content, 'html.parser')
    for link in soup.find_all('a'):
        links.append(urllib.parse.unquote(link.get('href')))
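For example, urllib.parse.unquote is what undoes the percent-encoding:
>>> import urllib.parse
>>> urllib.parse.unquote("B%C3%A4rlauch")
'Bärlauch'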

Trying to scrape the links on the page www.zath.co.uk using python

I am trying to scrape a website www.zath.co.uk, and extract the links to all of the articles using Python 3. Looking at the raw html file I identified one of the sections I am interested in, displayed below using BeautifulSoup.
<article class="post-32595 post type-post status-publish format-standard has-post-thumbnail category-games entry" itemscope="" itemtype="https://schema.org/CreativeWork">
  <header class="entry-header">
    <h2 class="entry-title" itemprop="headline">
      <a class="entry-title-link" href="https://www.zath.co.uk/family-games-day-night-event-giffgaff/" rel="bookmark">
        A Family Games Night (& Day) With giffgaff
      </a>
I then wrote this code to execute this. I started by setting up a list of URLs from the website to scrape.
urlList = ["https://www.zath.co.uk/","https://www.zath.co.uk/page/2/",....."https://www.zath.co.uk/page/35/"]
Then (after importing the necessary libraries) I defined a function to get all Zath articles.
def getAllZathPosts(url, links):
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    soup = BeautifulSoup(response)
    for a in soup.findAll('a'):
        url = a['href']
        c = a['class']
        if c == "entry-title-link":
            print(url)
            links.append(url)
    return
Then call the function.
links = []
zathPosts = {}
for url in urlList:
    zathPosts = getAllZathPosts(url, links)
The code runs with no errors, but the links list remains empty and no URLs are printed, as if the class never equals "entry-title-link". I have tried adding an else case:
else:
    print(url + " not article")
and all the links from the pages printed as expected. Any suggestions?
You can simply iterate over the pages using range and extract the article tags:
import requests
from bs4 import BeautifulSoup

for page_no in range(1, 36):  # pages 1 through 35 (range(35) would start at page 0 and miss page 35)
    page = requests.get("https://www.zath.co.uk/page/{}/".format(page_no))
    parser = BeautifulSoup(page.content, 'html.parser')
    for article in parser.findAll('article'):
        print(article.h2.a['href'])
You can do something like the below code:
import requests
from bs4 import BeautifulSoup

def getAllZathPosts(url, links):
    response = requests.get(url).text
    soup = BeautifulSoup(response, 'html.parser')
    results = soup.select("a.entry-title-link")
    #for i in results:
    #    print(i.text)
    #    links.append(url)
    if len(results) > 0:
        links.append(url)

links = []
urlList = ["https://www.zath.co.uk/", "https://www.zath.co.uk/page/2/", "https://www.zath.co.uk/page/35/"]
for url in urlList:
    getAllZathPosts(url, links)
print(set(links))
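A side note on why the original check never matched: BeautifulSoup treats class as a multi-valued attribute, so a['class'] returns a list such as ['entry-title-link'], which never compares equal to the string "entry-title-link". A minimal sketch of the fix, reusing the soup and links names from the question, is to test membership instead of equality:
for a in soup.findAll('a'):
    # a.get('class', []) is a list of CSS classes; test membership, not equality
    if "entry-title-link" in a.get('class', []):
        links.append(a['href'])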

Soup works on one IMDb page but not on another. How to solve?

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # assumed placeholder; the question defines headers elsewhere

url1 = "https://www.imdb.com/user/ur34087578/watchlist"
url = "https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv"
results1 = requests.get(url1, headers=headers)
results = requests.get(url, headers=headers)
soup1 = BeautifulSoup(results1.text, "html.parser")
soup = BeautifulSoup(results.text, "html.parser")

# using the unique tag for each movie in the respective link
movie_div1 = soup1.find_all('div', class_='lister-item-content')
movie_div = soup.find_all('div', class_='lister-item mode-advanced')

print(movie_div1)  # empty list
print(movie_div)   # gives a perfect list
Why is movie_div1 giving an empty list? I am not able to identify any difference in the URL structures to indicate the code should be different. All leads appreciated.
Unfortunately, the div you want is produced by JavaScript, so you can't get it by scraping the raw HTML response.
You can instead fetch the JSON your browser requests, which means you won't need to parse the page with BeautifulSoup at all, making your script much faster.
The second option is using Selenium.
Good luck.
As #SakuraFreak mentioned, you could parse the JSON received. However, this JSON response is embedded within the HTML itself and is only converted into HTML elements by the browser's JS (this is what you see as <div class="lister-item-content">...</div>).
For example, this is how you would extract the JSON content from the HTML to display movie/show names from the watchlist:
import requests
from bs4 import BeautifulSoup
import json

url = "https://www.imdb.com/user/ur34087578/watchlist"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

details = str(soup.find('span', class_='ab_widget'))
json_initial = "IMDbReactInitialState.push("
json_leftover = ");\n"
json_start = details.find(json_initial) + len(json_initial)
details = details[json_start:]
json_end = details.find(json_leftover)
json_data = json.loads(details[:json_end])

imdb_titles = json_data["titles"]
for item in imdb_titles.values():
    print(item["primary"]["title"])

Scraping links from href on Sephora website

Hi, so I am trying to scrape the links for all the products on a specific page on Sephora. My code only gives me the first 12 links while there are 48 products on the page. I think this is because Sephora is a user-interactive website (please correct me if I am wrong), so it doesn't load the rest. But I do not know how to get the rest. Please send some help! Thank you!!!
Here is my code:
from bs4 import BeautifulSoup
import requests

url = "https://www.sephora.com/brand/estee-lauder/skincare"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

link_list = []
keyword = 'product'
for link in soup.findAll('a'):
    href = link.get('href')
    if href and keyword in href:  # guard against <a> tags that have no href
        link_list.append('https://www.sephora.com/' + href)
    else:
        continue
If you take a look at the page source, you will see the data stored as a JSON object. You can get that JSON object like this:
from bs4 import BeautifulSoup
import requests
import json

url = "https://www.sephora.com/brand/estee-lauder/skincare"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

data = json.loads(soup.find('script', id='linkJSON').text)
products = data[3]['props']['products']
prefix = "https://www.sephora.com"
url_links = [prefix + p["targetUrl"] for p in products]
print(url_links)
By investigating the JSON data, you can find where the links are stored. To view the JSON data more clearly, I use this website: https://codebeautify.org/jsonviewer
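Alternatively, a quick way to inspect the structure without leaving the terminal (continuing from the data variable in the code above) is to pretty-print a slice of it:
# Pretty-print part of the parsed JSON to reveal its nesting;
# slicing keeps the output short enough to read.
print(json.dumps(data[3]['props'], indent=2)[:2000])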

Beautifulsoup "findAll()" does not return the tags

I am trying to build a scraper to get some abstracts of academic papers and their corresponding titles on this page.
The problem is that my for link in bsObj.findAll('a', {'class': 'search-track'}) does not return the links I need to further build my scraper. In my code, the check is like this:
for link in bsObj.findAll('a', {'class': 'search-track'}):
    print(link)
The for loop above does not print out anything, even though the href links should be inside the <a class="search-track" ...</a> tags.
I have referred to this post, but changing the BeautifulSoup parser did not solve the problem. I am using "html.parser" in my BeautifulSoup constructor: bsObj = bs(html.content, features="html.parser").
And the print(len(bsObj)) prints out "3" while it prints out "2" for both "lxml" and "html5lib".
Also, I started off using urllib.request.urlopen to get the page and then tried requests.get() instead. Unfortunately the two approaches give me the same bsObj.
Here is the code I've written:
#from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup as bs
import ssl

'''
The Elsevier search is kind of a tree structure:
keyword --> a list of journals (a journal contains many articles) --> lists of articles
'''

address = input("Please type in your keyword: ")  # My keyword is catalyst for water splitting
# https://www.elsevier.com/en-xs/search-results?
#     query=catalyst%20for%20water%20splitting&labels=journals&page=1
address = address.replace(" ", "%20")
address = "https://www.elsevier.com/en-xs/search-results?query=" + address + "&labels=journals&page=1"

journals = []
articles = []

def getJournals(url):
    global journals
    #html = urlopen(url)
    html = requests.get(url)
    bsObj = bs(html.content, features="html.parser")
    #print(len(bsObj))
    #testFile = open('testFile.txt', 'wb')
    #testFile.write(bsObj.text.encode(encoding='utf-8', errors='strict') + '\n'.encode(encoding='utf-8', errors='strict'))
    #testFile.close()
    for link in bsObj.findAll('a', {'class': 'search-track'}):
        print(link)
        ######## does not print anything ########
        '''
        if 'href' in link.attrs and link.attrs['href'] not in journals:
            newJournal = link.attrs['href']
            journals.append(newJournal)
        '''
    return None

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

getJournals(address)
print(journals)
Can anyone tell me what the problem in my code is that makes the for loop not print out any links? I need to store the links of the journals in a list and then visit each link to scrape the abstracts of the papers. By rights the abstract of a paper is free, so the website shouldn't have blocked my ID for accessing it.
This page is dynamically loaded with JavaScript, so BeautifulSoup can't handle it directly. You may be able to do it using Selenium, but in this case you can do it by tracking the API calls made by the page (for more, see, as one of many examples, here).
In your particular case it can be done this way:
from bs4 import BeautifulSoup as bs
import requests
import json

# this is where the data is hiding:
url = "https://site-search-api.prod.ecommerce.elsevier.com/search?query=catalyst%20for%20water%20splitting&labels=journals&start=0&limit=10&lang=en-xs"
html = requests.get(url)
soup = bs(html.content, features="html.parser")
data = json.loads(str(soup))  # the response is in JSON format, so we load it into a dictionary
Note: in this case, it's also possible to dispense with Beautifulsoup altogether and load the response directly, as in data = json.loads(html.content). From this point:
hits = data['hits']['hits']  # the target URLs are hidden deep inside nested dictionaries and lists
for hit in hits:
    print(hit['_source']['url'])
Output:
https://www.journals.elsevier.com/water-research
https://www.journals.elsevier.com/water-research-x
etc.
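Building on the note above, here is a minimal sketch that dispenses with BeautifulSoup entirely: requests can assemble the query string from a params dict (so no manual %20 escaping is needed) and parse the JSON body with .json():
import requests

params = {
    "query": "catalyst for water splitting",  # requests URL-encodes this for us
    "labels": "journals",
    "start": 0,
    "limit": 10,
    "lang": "en-xs",
}
response = requests.get("https://site-search-api.prod.ecommerce.elsevier.com/search", params=params)
for hit in response.json()['hits']['hits']:
    print(hit['_source']['url'])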
