I need to check a webpage's search results and compare them to user input.
import urllib
from bs4 import BeautifulSoup

ui = raw_input()  # for example "Niels Bohr"
link = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=90&k=10"
stranica = urllib.urlopen(link)
soup = BeautifulSoup(stranica, from_encoding="utf-8")
beauty = soup.prettify()
print beauty
Since there are 1502 results, my idea was to change the k=10 to k=1502. Now I need some kind of function to check whether the search results contain my user input. I know that the names I am looking for appear as the link text.
So how do I do it? Maybe using a regex?
The second part: if there are matching results, how do I get the link for each result? Again, I know the link is inside the href="" attribute, but how do I get it out and make it usable?
Finding out whether Niels Bohr is listed is as easy as using a large batch size and loading the resulting page:
import sys
import urllib2
from bs4 import BeautifulSoup
url = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=0&k={}".format(sys.maxint)
name = u'Bohr, Niels'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
for link in soup.find_all(class_='AllWordsTextHit', text=name):
    print link
This produces any links that contain the text 'Bohr, Niels' as the link text. You can use a regular expression if you need a partial match.
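For example, a minimal sketch of the partial-match variant, reusing the soup object from above and passing a compiled regular expression instead of an exact string (the pattern here is only an illustration):
import re

# Match any hit whose link text contains "Bohr", regardless of the rest of the name.
for link in soup.find_all(class_='AllWordsTextHit', text=re.compile(u'Bohr')):
    print link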
The link object has a (relative) href attribute you can then use to load the next page:
professor_page = 'http://www.enciklopedija.hr/' + link['href']
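From there, a minimal sketch of fetching and parsing that page with the same urllib2/BeautifulSoup setup used above (the variable names are just illustrative):
professor_soup = BeautifulSoup(urllib2.urlopen(professor_page).read())
print professor_soup.title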
I am working with the following web site: https://inmuebles.mercadolibre.com.mx/venta/, and I am trying to get the link from the "ver todos" button in the "Inmueble" section (shown in red in my screenshot). However, the "Tour virtual" and "Publicados hoy" sections (in blue) may or may not appear when visiting the site.
As shown in the image below, the ui-search-filter-dl classes contain the specific sections of the menu from the image above, while the ui-search-filter-container classes contain the sub-sections displayed by the site (e.g. Casas, Departamento & Terrenos for Inmueble). To obtain the link from the "ver todos" button of the "Inmueble" section, I was using this line of code:
ver_todos = response.css('div.ui-search-filter-dl')[2].css('a.ui-search-modal__link').attrib['href']
But since "Tour virtual" and "Publicados hoy" are not always in the page, I cannot be sure that ui-search-filter-dl at index 2 is always the index corresponding to "ver todos" button.
I was trying to get the link from "ver todos" by using this line of code:
response.css(''':contains("Inmueble") ~ .ui-search-filter-dt-title
.ui-search-modal__link::attr(href)''').extract()
Basically, I was trying to get the href from a ui-search-filter-dt-title class that contains the title "Inmueble". Unfortunately, the output is an empty list. I would like to find the link from "ver todos" by using css and regex but I'm having trouble with it. How may I achieve that?
I think xpath makes it easier to select the target elements in most cases:
Code:
xpath = "//div[contains(text(), 'Inmueble')]/following-sibling::ul//a[contains(#class,'ui-search-modal__link')]/#href"
url = response.xpath(xpath).extract()[0]
Actually, I didn't create a scrapy project to check your code. Instead, I verified the xpath with the following standalone code:
from lxml import html
import requests
res = requests.get( "https://inmuebles.mercadolibre.com.mx/venta/")
dom = html.fromstring(res.text)
xpath = "//div[contains(text(), 'Inmueble')]/following-sibling::ul//a[contains(#class,'ui-search-modal__link')]/#href"
url = dom.xpath(xpath)[0]
assert url == 'https://inmuebles.mercadolibre.com.mx/venta/_FiltersAvailableSidebar?filter=PROPERTY_TYPE'
Since the xpath should be the same between scrapy and lxml, I expect the code shown at the beginning to work fine in your scrapy project as well.
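For reference, a minimal sketch of how that same xpath could sit inside a scrapy callback (the spider name and start URL here are illustrative assumptions, not part of your project):
import scrapy

class InmueblesSpider(scrapy.Spider):
    # Hypothetical spider name and start URL; adjust to your own project.
    name = "inmuebles"
    start_urls = ["https://inmuebles.mercadolibre.com.mx/venta/"]

    def parse(self, response):
        xpath = ("//div[contains(text(), 'Inmueble')]/following-sibling::ul"
                 "//a[contains(@class,'ui-search-modal__link')]/@href")
        ver_todos = response.xpath(xpath).get()
        yield {"ver_todos": ver_todos}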
An easy way to do it is to get all the <a> links and then check whether any of their text matches "ver todos".
import requests
from bs4 import BeautifulSoup
link = "https://inmuebles.mercadolibre.com.mx/venta/"
def main():
    res = requests.get(link)
    if res.status_code == 200:
        soup = BeautifulSoup(res.text, "html.parser")
        links = [a["href"] for a in soup.select("a") if a.text.strip().lower() == "ver todos"]
        print(links)

if __name__ == "__main__":
    main()
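If you need specifically the "Inmueble" entry rather than every "ver todos" link on the page, here is a sketch of one way to narrow it down with BeautifulSoup (the class names are taken from the question and may of course change on the live site):
# Walk each filter section, keep only the one whose title mentions "Inmueble".
for section in soup.select("div.ui-search-filter-dl"):
    title = section.select_one(".ui-search-filter-dt-title")
    if title and "Inmueble" in title.get_text():
        modal_link = section.select_one("a.ui-search-modal__link")
        if modal_link:
            print(modal_link["href"])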
A small disclaimer for all: this is my first programming language and I am still getting used to it, so any suggestions are welcome.
The problem that was given is as follows:
Scraping Numbers from HTML using BeautifulSoup. In this assignment you will write a Python program similar to http://www.py4e.com/code3/urllink2.py. The program will use urllib to read the HTML from the data files below, parse the data, extract the numbers, and compute the sum of the numbers in the file.
We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.
Sample data: http://py4e-data.dr-chuck.net/comments_42.html (Sum=2553)
Actual data: http://py4e-data.dr-chuck.net/comments_97465.html (Sum ends with 19)
You do not need to save these files to your folder since your program will read the data directly from the URL. Note: Each student will have a distinct data url for the assignment - so only use your own data url for analysis.
Data Format
The file is a table of names and comment counts. You can ignore most of the data in the file except for lines like the following:
<tr><td>Modu</td><td><span class="comments">90</span></td></tr>
<tr><td>Kenzie</td><td><span class="comments">88</span></td></tr>
<tr><td>Hubert</td><td><span class="comments">87</span></td></tr>
You are to find all the <span> tags in the file, pull out the numbers from each tag, and sum the numbers.
Look at the sample code provided. It shows how to find all of a certain kind of tag, loop through the tags and extract the various aspects of the tags.
...
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print 'TAG:', tag
    print 'URL:', tag.get('href', None)
    print 'Contents:', tag.contents[0]
    print 'Attrs:', tag.attrs
You need to adjust this code to look for span tags and pull out the text content of the span tag, convert them to integers and add them up to complete the assignment.
I have written this:
import urllib
from bs4 import BeautifulSoup
page = urllib.urlopen(input("Enter URL: "))
soup = BeautifulSoup(page, "html.parser")
spans = soup('span')
numbers = []
for span in spans:
    numbers.append(int(span.string))
print (sum(numbers))
The error message says that bs4 is not a module even though it is installed, and the program is not asking me for the URL or giving me any output. I have no clue how to solve it.
import urllib.request
import urllib.parse
import urllib.error
from bs4 import BeautifulSoup
import ssl
ctx= ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter -')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
numbers = []
spans = soup('span')
for span in spans:
    numbers.append(int(span.string))
print (sum(numbers))
Use this code; it works for me.
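For reference, running it against the sample data URL (http://py4e-data.dr-chuck.net/comments_42.html) should print 2553, the sum given in the assignment description.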
I am quite new to Python and am working on a scraping-based project where I am supposed to extract all the contents from links containing a particular search term and place them in a CSV file. As a first step, I wrote this code to extract all the links from a website based on a search term entered. I only get a blank screen as output and I am unable to find my mistake.
import urllib
import mechanize
from bs4 import BeautifulSoup
import datetime
def searchAP(searchterm):
    newlinks = []
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders = [('User-agent', 'Firefox')]
    text = ""
    start = 0
    while "There were no matches for your search" not in text:
        url = "http://www.marketing-interactive.com/" + "?s=" + searchterm
        text = urllib.urlopen(url).read()
        soup = BeautifulSoup(text, "lxml")
        results = soup.findAll('a')
        for r in results:
            if "rel=bookmark" in r['href']:
                newlinks.append("http://www.marketing-interactive.com" + str(r["href"]))
        start += 10
    return newlinks
print searchAP("digital marketing")
You made four mistakes:
You are defining start but you never use it. (Nor can you, as far as I can see on http://www.marketing-interactive.com/?s=something: there is no URL-based pagination.) So you are endlessly looping over the first set of results.
"There were no matches for your search" is not the no-results string returned by that site. So it would go on forever anyway.
You are appending the link, which already includes http://www.marketing-interactive.com, to http://www.marketing-interactive.com. So you would end up with http://www.marketing-interactive.comhttp://www.marketing-interactive.com/astro-launches-digital-marketing-arm-blaze-digital/
Concerning the rel=bookmark selection: arif's solution is the proper way to go. But if you really want to do it this way, you'd need to do something like this:
for r in results:
    if r.attrs.get('rel') and r.attrs['rel'][0] == 'bookmark':
        newlinks.append(r["href"])
This first checks whether rel exists and then checks whether its first element is "bookmark", because r['href'] simply does not contain the rel; that's not how BeautifulSoup structures things.
To scrape this specific site you can do two things:
You could do something with Selenium or something else that supports Javascript and press that "Load more" button. But this is quite a hassle.
You can use this loophole: http://www.marketing-interactive.com/wp-content/themes/MI/library/inc/loop_handler.php?pageNumber=1&postType=search&searchValue=digital+marketing
This is the URL that feeds the list. It has pagination, so you can easily loop over all results, as sketched below.
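A minimal sketch of looping over that endpoint with requests (the stop condition, an empty response body past the last page, is an assumption about how the endpoint behaves):
import requests
from bs4 import BeautifulSoup

base = ("http://www.marketing-interactive.com/wp-content/themes/MI/library/inc/"
        "loop_handler.php?pageNumber={}&postType=search&searchValue=digital+marketing")

links = []
page = 1
while True:
    html = requests.get(base.format(page)).text
    if not html.strip():
        # Assumed stop condition: the endpoint returns an empty body past the last page.
        break
    soup = BeautifulSoup(html, "lxml")
    for a in soup.find_all("a", href=True):
        if "bookmark" in (a.get("rel") or []):
            links.append(a["href"])
    page += 1

print(links)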
The following script extracts all the links from the web page for a given search key. But it does not explore beyond the first page, although it can easily be modified to get results from multiple pages by manipulating the page number in the URL (as described by Rutger de Knijf in the other answer).
from pprint import pprint
import requests
from BeautifulSoup import BeautifulSoup
def get_url_for_search_key(search_key):
    base_url = 'http://www.marketing-interactive.com/'
    response = requests.get(base_url + '?s=' + search_key)
    soup = BeautifulSoup(response.content)
    return [url['href'] for url in soup.findAll('a', {'rel': 'bookmark'})]
Usage:
pprint(get_url_for_search_key('digital marketing'))
Output:
[u'http://www.marketing-interactive.com/astro-launches-digital-marketing-arm-blaze-digital/',
u'http://www.marketing-interactive.com/singapore-polytechnic-on-the-hunt-for-digital-marketing-agency/',
u'http://www.marketing-interactive.com/how-to-get-your-bosses-on-board-your-digital-marketing-plan/',
u'http://www.marketing-interactive.com/digital-marketing-institute-launches-brand-refresh/',
u'http://www.marketing-interactive.com/entropia-highlights-the-7-original-sins-of-digital-marketing/',
u'http://www.marketing-interactive.com/features/futurist-right-mindset-digital-marketing/',
u'http://www.marketing-interactive.com/lenovo-brings-board-new-digital-marketing-head/',
u'http://www.marketing-interactive.com/video/discussing-digital-marketing-indonesia-video/',
u'http://www.marketing-interactive.com/ubs-melvin-kwek-joins-credit-suisse-as-apac-digital-marketing-lead/',
u'http://www.marketing-interactive.com/linkedins-top-10-digital-marketing-predictions-2017/']
Hope this is what you wanted as the first step for your project.
My goal is to write a Python script that takes an artist's name as a string input and appends it to the base URL for the Genius search query. It should then retrieve all the lyrics from the returned web page's links (the required subset of links, each of which will contain the artist's name). I am in the initial phase right now and have only been able to retrieve all the links from the web page, including the ones I don't want in my subset. I tried to find a simple solution but failed continuously.
import requests
# The Requests library.
from bs4 import BeautifulSoup
from lxml import html
user_input = input("Enter Artist Name = ").replace(" ","+")
base_url = "https://genius.com/search?q="+user_input
header = {'User-Agent':''}
response = requests.get(base_url, headers=header)
soup = BeautifulSoup(response.content, "lxml")
for link in soup.find_all('a', href=True):
    print (link['href'])
This returns the complete list below, while I only need the ones that end with "lyrics" and contain the artist's name (here, for instance, Drake). Those will be the links from which I should be able to retrieve the lyrics.
https://genius.com/
/signup
/login
https://www.facebook.com/geniusdotcom/
https://twitter.com/Genius
https://www.instagram.com/genius/
https://www.youtube.com/user/RapGeniusVideo
https://genius.com/new
https://genius.com/Drake-hotline-bling-lyrics
https://genius.com/Drake-one-dance-lyrics
https://genius.com/Drake-hold-on-were-going-home-lyrics
https://genius.com/Drake-know-yourself-lyrics
https://genius.com/Drake-back-to-back-lyrics
https://genius.com/Drake-all-me-lyrics
https://genius.com/Drake-0-to-100-the-catch-up-lyrics
https://genius.com/Drake-started-from-the-bottom-lyrics
https://genius.com/Drake-from-time-lyrics
https://genius.com/Drake-the-motto-lyrics
/search?page=2&q=drake
/search?page=3&q=drake
/search?page=4&q=drake
/search?page=5&q=drake
/search?page=6&q=drake
/search?page=7&q=drake
/search?page=8&q=drake
/search?page=9&q=drake
/search?page=672&q=drake
/search?page=673&q=drake
/search?page=2&q=drake
/embed_guide
/verified-artists
/contributor_guidelines
/about
/static/press
mailto:brands@genius.com
https://eventspace.genius.com/
/static/privacy_policy
/jobs
/developers
/static/terms
/static/copyright
/feedback/new
https://genius.com/Genius-how-genius-works-annotated
https://genius.com/Genius-how-genius-works-annotated
My next step would be to use Selenium to emulate scrolling, which in the case of genius.com yields the entire set of search results. Any suggestions or resources would be appreciated. I would also like a few comments on the way I wish to proceed with this solution. Can we make it more generic?
P.S. I may not have explained my problem very lucidly, but I have tried my best. Questions about any ambiguities are welcome too. I am new to scraping, Python, and programming in general, so I just wanted to make sure that I am following the right path.
Use the re module to match only the links you want.
import requests
# The Requests library.
from bs4 import BeautifulSoup
from lxml import html
import re

user_input = input("Enter Artist Name = ").replace(" ", "+")
base_url = "https://genius.com/search?q=" + user_input
header = {'User-Agent': ''}
response = requests.get(base_url, headers=header)
soup = BeautifulSoup(response.content, "lxml")
pattern = re.compile(r"\S+-lyrics$")
for link in soup.find_all('a', href=True):
    if pattern.match(link['href']):
        print(link['href'])
Output:
https://genius.com/Drake-hotline-bling-lyrics
https://genius.com/Drake-one-dance-lyrics
https://genius.com/Drake-hold-on-were-going-home-lyrics
https://genius.com/Drake-know-yourself-lyrics
https://genius.com/Drake-back-to-back-lyrics
https://genius.com/Drake-all-me-lyrics
https://genius.com/Drake-0-to-100-the-catch-up-lyrics
https://genius.com/Drake-started-from-the-bottom-lyrics
https://genius.com/Drake-from-time-lyrics
https://genius.com/Drake-the-motto-lyrics
This just checks whether the link matches a pattern ending in -lyrics. You may use similar logic to filter using the user_input variable as well.
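For example, here is a sketch of also requiring the artist name in the link, built from the user_input variable (this assumes the name the user typed matches the slug Genius uses in its URLs, e.g. "Drake"):
# Hypothetical refinement: require the artist slug as well as the -lyrics suffix.
artist_slug = user_input.replace("+", "-")
artist_pattern = re.compile(r"https://genius\.com/" + re.escape(artist_slug) + r"\S*-lyrics$", re.IGNORECASE)
for link in soup.find_all('a', href=True):
    if artist_pattern.match(link['href']):
        print(link['href'])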
Hope this helps.
I'm trying to programmatically extract text from this webpage which describes a genome assembly in the public archive:
http://www.ebi.ac.uk/ena/data/view/ERS019623
I have thousands of assemblies that I want to track down and extract the study accession, which is the code on the far left of the table beginning with "PRJ". The URL for each of these assemblies is of the same format as the one above, i.e. "http://www.ebi.ac.uk/ena/data/view/ERS******". I have the ERS code for each of my assemblies so I can construct the URL for each one.
I've tried a few different methods. Firstly, if you add "&display=XML" to the end of the URL it prints the XML (or at least I'm presuming that it's printing the XML for the entire page; the problem is that the study accession "PRJ******" is nowhere to be seen there). I had used this to extract another code that I needed from the same webpage, the run accession, which is always of the format "ERR******", using the code below:
import urllib2
from bs4 import BeautifulSoup
import re
import csv
with open('/Users/bj5/Desktop/web_scrape_test.csv','rb') as f:
    reader = csv.reader(f)  # opens csv containing list of ERS numbers
    for row in reader:
        sample = row[0]  # reads index 0 (1st row)
        ERSpage = "http://www.ebi.ac.uk/ena/data/view/" + sample + "&display=xml"  # creates URL using ERS number from 1st row
        page = urllib2.urlopen(ERSpage)  # opens url and assigns it to variable page
        soup = BeautifulSoup(page, "html.parser")  # parses the html/xml from page and assigns it to variable called soup
        page_text = soup.text  # returns text from variable soup, i.e. no tags
        ERS = re.search('ERS......', page_text, flags=0).group(0)  # returns first ERS followed by six wildcards
        ERR = re.search('ERR......', page_text, flags=0).group(0)  # returns first ERR followed by six wildcards
        print ERS + ',' + ERR + ',' + "http://www.ebi.ac.uk/ena/data/view/" + sample  # prints ERS,ERR,URL
This worked very well, but as the study accession is not in the XML I can't use it to access this.
I also attempted to use BeautifulSoup again to download the HTML by doing this:
from bs4 import BeautifulSoup
from urllib2 import urlopen
BASE_URL = "http://www.ebi.ac.uk/ena/data/view/ERS019623"
def get_category_links(section_url):
    html = urlopen(section_url).read()
    soup = BeautifulSoup(html, "lxml")
    print soup

get_category_links(BASE_URL)
But again I can't see the study accession in the output from this either...
I have also attempted to use a different python module, lxml, to parse the XML and HTML but haven't had any luck there either.
When I right click and inspect element on the page I can find the study accession by doing ctrl+F -> PRJ.
So my question is this: what is the code that I'm looking at in inspect element, XML or HTML (or something else)? Why does it look different to the code that prints in my console when I try and use BeautifulSoup to parse HTML? And finally how can I scrape the study accessions (PRJ******) from these webpages?
(I've only been coding for a couple of months and I'm entirely self-taught so apologies for the slightly confused nature of this question but I hope I've got across what it is that I'm trying to do. Any suggestions or advice would be much appreciated.)
from bs4 import BeautifulSoup
import requests
import re
r = requests.get('http://www.ebi.ac.uk/ena/data/view/ERS019623&display=xml')
soup = BeautifulSoup(r.text, 'lxml')
ERS = soup.find('primary_id').text
ERR = soup.find('id', text=re.compile(r'^ERR')).text
url = 'http://www.ebi.ac.uk/ena/data/view/{}'.format(ERS)
print(ERS, ERR, url)
Output:
ERS019623 ERR048142 http://www.ebi.ac.uk/ena/data/view/ERS019623
bs4 can parse an XML file; just treat it like HTML, since it is handled the same way, so there is no need to use a regex to extract the info.
I found a TEXT download link:
http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=ERS019623&result=read_run&fields=study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,tax_id,scientific_name,instrument_model,library_layout,fastq_ftp,fastq_galaxy,submitted_ftp,submitted_galaxy,sra_ftp,sra_galaxy,cram_index_ftp,cram_index_galaxy&download=txt
This link's fields can be changed to get the data you want, like this:
http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=ERS019623&result=read_run&fields=study_accession&download=txt
By doing so, you can get all your data in a text file.
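A minimal sketch of using that report endpoint from Python to pull the study accession for each ERS number (the tab-separated layout with a header row followed by one value per run is an assumption about how the download=txt report is formatted):
import requests

def study_accession_for(ers):
    # Hypothetical helper: ask the filereport endpoint for just the study_accession field.
    url = ("http://www.ebi.ac.uk/ena/data/warehouse/filereport"
           "?accession={}&result=read_run&fields=study_accession&download=txt".format(ers))
    lines = requests.get(url).text.strip().splitlines()
    # Assumed format: a header line, then one study accession per run.
    return lines[1].strip() if len(lines) > 1 else None

print(study_accession_for("ERS019623"))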
In your sample, soup is a BeautifulSoup object: a representation of the parsed document.
If you want to print the entire HTML of the document, you can call print(soup.prettify()) or if you want the text within it print(soup.get_text()).
The soup object gives you other ways to access the parts of the document you are interested in: navigating the parsed tree, searching in it, and so on.
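For example, a few illustrative calls on a soup parsed from the question's BASE_URL (a minimal sketch; which tags you actually pull out depends on where the data you want lives in the markup):
from urllib2 import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen("http://www.ebi.ac.uk/ena/data/view/ERS019623").read(), "lxml")

# A few of the ways to pull pieces out of the parsed document:
print(soup.title.string)                                            # the <title> text
print(soup.find('a'))                                               # the first <a> tag in the tree
print([a.get('href') for a in soup.find_all('a', href=True)][:5])   # the first few link targets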