Retrieving information from a web page in Python?

I have the following page:
http://www.noncode.org/keyword.php
from which I want to extract some information by performing a search from Python. It may sound simple, but I have not programmed web applications before.
So I would like to put in the search box something like:
NONHSAT146018.2
to perform a search. The resulting webpage looks like this:
From the results I need to extract the field that says Sequence. I have read some information about the BeautifulSoup library and some examples, but none of them include a PHP form in the address. I will really appreciate your help on this matter. Thanks.
Update: Following the advice of the users and with the help of @Lukas Newman, I tried the following:
data="NONHSAT146018.2"
page = requests.get("http://www.noncode.org/show_rna.php?id=" + data)
soup=BS(page.content,'html.parser')
target = soup.find('h2',text='Sequence')
print(target)
target = soup.find('table',text='table-1')
print(target)
table = soup.find('table', attrs={'class'},text='table-1')
print(table)
When I inspect the results, I see that the sequence is in the following field:
How can I extract that part using Python?

Look at the URL:
http://www.noncode.org/show_rna.php?id=NONHSAT000002
The search term is just passed as a GET parameter. So to access the page, set the start URL to something like:
import requests
from bs4 import BeautifulSoup

id = "NONHSAT146018"
page = requests.get("http://www.noncode.org/show_rna.php?id=" + id)
soup = BeautifulSoup(page.content, "html.parser")

# Second table with class "table-1" -> its second row -> that row's first td holds the sequence
element = soup.findAll('table', class_="table-1")[1]
element2 = element.findAll('tr')[1]
element3 = element2.findNext('td')
your_data = str(element3.renderContents(), "utf-8")
print(your_data)
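If you prefer not to build the query string by hand, requests can encode the id parameter for you. A minimal sketch of the same lookup (assuming the page layout described above, with the sequence in the second table-1 table):

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.noncode.org/show_rna.php", params={"id": "NONHSAT146018"})
soup = BeautifulSoup(page.content, "html.parser")

# Same navigation as above: second table with class "table-1" -> second row -> first cell
cell = soup.findAll("table", class_="table-1")[1].findAll("tr")[1].findNext("td")
print(cell.get_text(strip=True))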

Related

How do I scrape a 'td' in web scraping

I am learning web scraping and I'm scraping the following website: ivmp servers. I am having trouble scraping the number of players on each server. Can someone help me? Here is the code of what I've done so far:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')
players = soup.find('table')
summary = players.find('div', class_='players')
print(summary)
Looking at the page you provided, I assume that the table you want to extract information from is the one with server names and IP addresses.
There are actually 4 "table" elements on this page.
Luckily for you, this table has an id (serverlist). You can easily find it in Chrome with right-click > Inspect.
players = soup.select_one('table#serverlist')
Now you want to get the td.
You can print all of them using:
for td in players.select("td"):
    print(td)
Or you can select the one you are interested in:
players.select("td.hostname")
for example.
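Putting the two selectors together, here is a minimal sketch that pairs each server name with its player count (assuming each data row of table#serverlist contains both a td.hostname and a td.players cell):

import requests
from bs4 import BeautifulSoup

source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')

serverlist = soup.select_one('table#serverlist')
for row in serverlist.select('tr'):
    hostname = row.select_one('td.hostname')
    players = row.select_one('td.players')
    if hostname and players:  # skip header/sorting rows that lack these cells
        print(hostname.get_text(strip=True), players.get_text(strip=True))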
Hope this helps.
Looking at the structure of the page, there are a few table cells (td) with the class "players". It looks like two of them are for sorting the table, so we'll assume you don't want those.
In order to extract the one(s) you do want, I would first query for all the td elements with the class "players", and then loop through them, adding only the ones we do want to a list.
Something like this:
import requests
from bs4 import BeautifulSoup

source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')

players = soup.find_all('td', class_='players')
summary = []

for cell in players:
    # Exclude the cells which are for sorting
    if cell.get_text() != 'Players':
        summary.append(cell.get_text())

print(summary)

Wrong search generated with Python requests

What I'm trying to do is search StackOverflow for answers. I know it's probably been done before, but I'd like to do it again, with a GUI. Anyway, that is a little bit down the road; right now I'm just trying to get to the page with the most votes for a question. While trying to see how to get into a nested div to get the link for the first answer, I noticed that my search was off and taking me to the wrong place. I am using BeautifulSoup, Requests, and Python 3 to do this.
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup

payload = {'q': 'open GL cube'}
page = requests.get("https://stackoverflow.com/search", params=payload)
print(" URL IS ", page.url)

data = page.content
soup = BeautifulSoup(data, 'lxml')
top = soup.find('a', {'title': 'Highest voted search results'})['href']
print(top)

page2 = requests.get("https://stackoverflow.com", params=top)
print(page2.url)
data2 = page2.content
topSoup = BeautifulSoup(data2, 'lxml')

for div in topSoup.find_all('div', {'class': 'result-link'}):
    print(div.text)
I get the link and it outputs /search?tab=votes&q=open%GL%20cube
but when I pass it in with the params it becomes
https://stackoverflow.com/?/search?tab=votes&q=open%GL%20cube
I would like to get rid of the /?/
Don't pass it as parameters; just add it to the URL:
page2 = requests.get("https://stackoverflow.com" + top)
When you pass params to requests, it appends a ? to the URL before concatenating the parameters onto it.
Requests - Passing Parameters In URLs
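As a small illustration (not from the original answer) of why the stray /?/ appears, you can inspect the URL requests would build without sending anything:

from requests import Request

path = '/search?tab=votes&q=open%20GL%20cube'

# Passed as params, the path is treated as a query string and lands after a '?'
print(Request('GET', 'https://stackoverflow.com', params=path).prepare().url)
# roughly: https://stackoverflow.com/?/search?tab=votes&q=open%20GL%20cube

# Concatenated onto the base URL, it stays a path and points at the search results page
print('https://stackoverflow.com' + path)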
Also, as stated, you should really use the API.
Why not use the API?
There are plenty of search options (https://api.stackexchange.com/docs/advanced-search), and you get the response in JSON, no need for ugly HTML parsing.
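A minimal sketch of the API route, using the /search/advanced endpoint from the linked docs (the JSON field names below come from that documentation):

import requests

resp = requests.get(
    "https://api.stackexchange.com/2.3/search/advanced",
    params={"site": "stackoverflow", "q": "open GL cube", "sort": "votes", "order": "desc"},
)
for item in resp.json().get("items", [])[:5]:
    print(item["score"], item["title"], item["link"])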

Beautiful Soup - Blank screen for a long time without any output

I am quite new to Python and am working on a scraping-based project, where I am supposed to extract all the contents from links containing a particular search term and place them in a CSV file. As a first step, I wrote this code to extract all the links from a website based on a search term entered. I only get a blank screen as output and I am unable to find my mistake.
import urllib
import mechanize
from bs4 import BeautifulSoup
import datetime

def searchAP(searchterm):
    newlinks = []
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders = [('User-agent', 'Firefox')]
    text = ""
    start = 0

    while "There were no matches for your search" not in text:
        url = "http://www.marketing-interactive.com/" + "?s=" + searchterm
        text = urllib.urlopen(url).read()
        soup = BeautifulSoup(text, "lxml")
        results = soup.findAll('a')
        for r in results:
            if "rel=bookmark" in r['href']:
                newlinks.append("http://www.marketing-interactive.com" + str(r["href"]))
        start += 10
    return newlinks

print searchAP("digital marketing")
You made four mistakes:
You are defining start but you never use it. (Nor can you, as far as I can see on http://www.marketing-interactive.com/?s=something; there is no URL-based pagination.) So you are endlessly looping over the first set of results.
"There were no matches for your search" is not the no-results string returned by that site. So it would go on forever anyway.
You are appending the link, including http://www.marketing-interactive.com to http://www.marketing-interactive.com. So you would end up with http://www.marketing-interactive.comhttp://www.marketing-interactive.com/astro-launches-digital-marketing-arm-blaze-digital/
Concerning the rel=bookmark selection: arif's solution is the proper way to go. But if you really want to do it this way, you'd need something like this:
for r in results:
    if r.attrs.get('rel') and r.attrs['rel'][0] == 'bookmark':
        newlinks.append(r["href"])
This first checks whether rel exists and then checks whether its first value is "bookmark", because r['href'] simply does not contain the rel. That's not how BeautifulSoup structures things.
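A quick illustration (not from the original answer) of how BeautifulSoup exposes those attributes; rel is a multi-valued attribute, so it comes back as a list:

from bs4 import BeautifulSoup

a = BeautifulSoup('<a href="/some-post" rel="bookmark">post</a>', 'html.parser').a
print(a['href'])     # '/some-post' -- the href holds only the URL
print(a.get('rel'))  # ['bookmark'] -- rel is returned as a list of values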
To scrape this specific site you can do two things:
You could use Selenium or something else that supports JavaScript and press that "Load more" button. But this is quite a hassle.
You can use this loophole: http://www.marketing-interactive.com/wp-content/themes/MI/library/inc/loop_handler.php?pageNumber=1&postType=search&searchValue=digital+marketing
This is the URL that feeds the list. It has pagination, so you can easily loop over all the results.
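A minimal sketch of looping over that loophole URL (assuming each page returns an HTML fragment containing the rel="bookmark" links, and that an empty page marks the end of the results):

import requests
from bs4 import BeautifulSoup

base = ("http://www.marketing-interactive.com/wp-content/themes/MI/library/inc/"
        "loop_handler.php?pageNumber={}&postType=search&searchValue=digital+marketing")

all_links = []
page_number = 1
while True:
    soup = BeautifulSoup(requests.get(base.format(page_number)).content, "lxml")
    links = [a["href"] for a in soup.find_all("a", attrs={"rel": "bookmark"})]
    if not links:  # no more results on this page, so stop
        break
    all_links.extend(links)
    page_number += 1

print(all_links)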
The following script extracts all the links from the web page for a given search key. However, it does not explore beyond the first page. The code can easily be modified to get all results from multiple pages by manipulating the page number in the URL (as described by Rutger de Knijf in the other answer).
from pprint import pprint
import requests
from BeautifulSoup import BeautifulSoup

def get_url_for_search_key(search_key):
    base_url = 'http://www.marketing-interactive.com/'
    response = requests.get(base_url + '?s=' + search_key)
    soup = BeautifulSoup(response.content)
    return [url['href'] for url in soup.findAll('a', {'rel': 'bookmark'})]
Usage:
pprint(get_url_for_search_key('digital marketing'))
Output:
[u'http://www.marketing-interactive.com/astro-launches-digital-marketing-arm-blaze-digital/',
u'http://www.marketing-interactive.com/singapore-polytechnic-on-the-hunt-for-digital-marketing-agency/',
u'http://www.marketing-interactive.com/how-to-get-your-bosses-on-board-your-digital-marketing-plan/',
u'http://www.marketing-interactive.com/digital-marketing-institute-launches-brand-refresh/',
u'http://www.marketing-interactive.com/entropia-highlights-the-7-original-sins-of-digital-marketing/',
u'http://www.marketing-interactive.com/features/futurist-right-mindset-digital-marketing/',
u'http://www.marketing-interactive.com/lenovo-brings-board-new-digital-marketing-head/',
u'http://www.marketing-interactive.com/video/discussing-digital-marketing-indonesia-video/',
u'http://www.marketing-interactive.com/ubs-melvin-kwek-joins-credit-suisse-as-apac-digital-marketing-lead/',
u'http://www.marketing-interactive.com/linkedins-top-10-digital-marketing-predictions-2017/']
Hope this is what you wanted as the first step for your project.

Scrape multiple pages with BeautifulSoup and Python

My code successfully scrapes the tr align=center tags from http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY and writes the td elements to a text file.
However, there are multiple pages available at the site above in which I would like to be able to scrape.
For example, with the URL above, when I click the link to "page 2" the overall URL does NOT change. I looked at the page source and saw JavaScript code to advance to the next page.
How can my code be changed to scrape data from all the available listed pages?
My code that works for page 1 only:
import bs4
import requests

response = requests.get('http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY')
soup = bs4.BeautifulSoup(response.text)
soup.prettify()

acct = open("/Users/it/Desktop/accounting.txt", "w")
for tr in soup.find_all('tr', align='center'):
    stack = []
    for td in tr.findAll('td'):
        stack.append(td.text.replace('\n', '').replace('\t', '').strip())
    acct.write(", ".join(stack) + '\n')
The trick here is to check the requests that are coming in and out of the page-change action when you click on the link to view the other pages. The way to check this is to use Chrome's inspection tool (via pressing F12) or installing the Firebug extension in Firefox. I will be using Chrome's inspection tool in this answer. See below for my settings.
Now, what we want to see is either a GET request to another page or a POST request that changes the page. While the tool is open, click on a page number. For a really brief moment, there will only be one request that will appear, and it's a POST method. All the other elements will quickly follow and fill the page. See below for what we're looking for.
Click on the above POST method. It should bring up a sub-window of sorts that has tabs. Click on the Headers tab. This page lists the request headers, pretty much the identification stuff that the other side (the site, for example) needs from you to be able to connect (someone else can explain this muuuch better than I do).
Whenever the URL has variables like page numbers, location markers, or categories, more often than not the site uses query strings. Long story short, it's similar to an SQL query (actually, sometimes it is an SQL query) that allows the site to pull the information you need. If this is the case, you can check the request headers for query string parameters. Scroll down a bit and you should find them.
As you can see, the query string parameters match the variables in our URL. A little bit below, you can see Form Data with pageNum: 2 beneath it. This is the key.
POST requests are more commonly known as form requests because these are the kind of requests made when you submit forms, log in to websites, etc. Basically, pretty much anything where you have to submit information. What most people don't see is that POST requests have a URL that they follow. A good example of this is when you log in to a website and, very briefly, see your address bar morph into some sort of gibberish URL before settling on /index.html or some such.
What the above paragraph basically means is that you can often (though not always) append the form data to your URL and it will carry out the POST request for you on execution. To know the exact string you have to append, click on view source.
Test if it works by adding it to the URL.
Et voila, it works. Now, the real challenge: getting the last page automatically and scraping all of the pages. Your code is pretty much there. The only things remaining to be done are getting the number of pages, constructing a list of URLs to scrape, and iterating over them.
Modified code is below:
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

base_url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'
r = rq.get(base_url)
soup = bsoup(r.text)

# Use regex to isolate only the links of the page numbers, the one you click on.
page_count_links = soup.find_all("a", href=re.compile(r".*javascript:goToPage.*"))
try:  # Make sure there is more than one page; otherwise set to 1.
    num_pages = int(page_count_links[-1].get_text())
except IndexError:
    num_pages = 1

# Add 1 because of Python's range.
url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]

# Open the text file. Use with to save yourself from grief.
with open("results.txt", "wb") as acct:
    for url_ in url_list:
        print "Processing {}...".format(url_)
        r_new = rq.get(url_)
        soup_new = bsoup(r_new.text)
        for tr in soup_new.find_all('tr', align='center'):
            stack = []
            for td in tr.findAll('td'):
                stack.append(td.text.replace('\n', '').replace('\t', '').strip())
            acct.write(", ".join(stack) + '\n')
We use regular expressions to get the proper links. Then, using a list comprehension, we build a list of URL strings. Finally, we iterate over them.
Results:
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=1...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=2...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=3...
[Finished in 6.8s]
Hope that helps.
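As a variation (not part of the original answer), the same page URLs can be built by letting requests encode the query parameters instead of formatting them into the string; a sketch assuming the same campId/termId/subjId/pageNum parameters:

import requests

base = "http://my.gwu.edu/mod/pws/courses.cfm"
common = {"campId": 1, "termId": 201501, "subjId": "ACCY"}

for page in range(1, 4):  # assuming 3 pages, as in the results above
    r = requests.get(base, params=dict(common, pageNum=page))
    print(r.url)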
EDIT:
Out of sheer boredom, I think I just created a scraper for the entire class directory. I also updated both the above and below code so that they do not error out when there is only a single page available.
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

spring_2015 = "http://my.gwu.edu/mod/pws/subjects.cfm?campId=1&termId=201501"
r = rq.get(spring_2015)
soup = bsoup(r.text)
classes_url_list = [c["href"] for c in soup.find_all("a", href=re.compile(r".*courses.cfm\?campId=1&termId=201501&subjId=.*"))]
print classes_url_list

# Open the text file. Use with to save yourself from grief.
with open("results.txt", "wb") as acct:
    for class_url in classes_url_list:
        base_url = "http://my.gwu.edu/mod/pws/{}".format(class_url)
        r = rq.get(base_url)
        soup = bsoup(r.text)
        # Use regex to isolate only the links of the page numbers, the one you click on.
        page_count_links = soup.find_all("a", href=re.compile(r".*javascript:goToPage.*"))
        try:
            num_pages = int(page_count_links[-1].get_text())
        except IndexError:
            num_pages = 1
        # Add 1 because of Python's range.
        url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]
        for url_ in url_list:
            print "Processing {}...".format(url_)
            r_new = rq.get(url_)
            soup_new = bsoup(r_new.text)
            for tr in soup_new.find_all('tr', align='center'):
                stack = []
                for td in tr.findAll('td'):
                    stack.append(td.text.replace('\n', '').replace('\t', '').strip())
                acct.write(", ".join(stack) + '\n')

Checking webpage for results with python and beautifulsoup

I need to check a webpage search results and compare them to user input.
import urllib
from bs4 import BeautifulSoup

ui = raw_input()  # for example "Niels Bohr"
link = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=90&k=10"
stranica = urllib.urlopen(link)
soup = BeautifulSoup(stranica, from_encoding="utf-8")
beauty = soup.prettify()
print beauty
Since there are 1502 results, my idea was to change k=10 to k=1502. Now I need some kind of function to check whether the search results contain my user input. I know that my names are the text after TEXT.
So how do I do it? Maybe using regex?
The second part: if there are matching results, I want to get the link of each result. Again, I know that the link is inside the href="", but how do I get it out and make it usable?
Finding if Niels Bohr is listed is as easy as using a large batch number and loading the resulting page:
import sys
import urllib2
from bs4 import BeautifulSoup

url = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=0&k={}".format(sys.maxint)
name = u'Bohr, Niels'

page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

for link in soup.find_all(class_='AllWordsTextHit', text=name):
    print link
This produces any links that contain the text 'Bohr, Niels' as the link text. You can use a regular expression if you need a partial match.
The link object has a (relative) href attribute you can then use to load the next page:
professor_page = 'http://www.enciklopedija.hr/' + link['href']
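A small sketch of the partial-match variant in Python 3 (the class name AllWordsTextHit and the relative href come from the answer above; the regex, the urljoin call, and the batch size are assumptions, since sys.maxint does not exist in Python 3):

import re
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Use a large enough batch size instead of sys.maxint
url = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=0&k=2000"
soup = BeautifulSoup(urlopen(url).read(), "html.parser")

# Match any hit whose text contains "Bohr", not only the exact 'Bohr, Niels' string
for link in soup.find_all(class_='AllWordsTextHit', text=re.compile(r'Bohr')):
    professor_page = urljoin('http://www.enciklopedija.hr/', link['href'])
    print(professor_page)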
