Changing the link in python mechanize - python

I am trying to write a python script that will generate the rank-list of my batch. For this I simply need to change the roll-number parameter of the link using inspect element feature in web-browser. The link(relative) looks something like:
/academic/utility/AcademicRecord.jsp?loginCode=000&loginnumber=000&loginName=name&Home=ascwebsite
I just need to change the loginCode to get the grade of my batch-mates. I am trying to use python to iterate through all the roll-numbers and generate a rank-list. I used mechanize library to open the site using python. The relevant portion of code:
br = mechanize.Browser()
br.set_handle_robots(False)
response = br.open('link_to_the_page')
I then do the necessary authentication and navigate to the appropriate page where the link to view the grades reside.
Then I find the relevant link like this:
for link in br.links(url_regex='/academic/utility/AcademicRecord.jsp?'):
Now inside this I change the url and attributes of the link appropriately.
And then I open the link using:
response=br.follow_link(link)
print response.read()
But it does not work. It opens the same link i.e. with the initial roll number. In fact I tried changing the url of the link to something very different like http://www.google.com.
link.url='http://www.google.com'
link.base_url='http://www.google.com'
It still opens the same page and not google's page.
Any help would be highly appreciated.

According to the source code, follow_link() and click_link() use link's absolute_url property that is set during the link initialization. And, you are setting only url and base_url properties.
The solution would be to change the absolute_url of a link in the loop:
BASE_URL = 'link_to_the_page'
for link in br.links(url_regex='/academic/utility/AcademicRecord.jsp?'):
modified_link = ...
link.absolute_url = mechanize.urljoin(BASE_URL, modified_link)
br.follow_link(link)
Hope that helps.

Related

Generate and download tsv from a website (with python)

I have this website and want to write a script which can execute a code which gives the same output as clicking on 'Export' -> 'Generate tsv' -> Wait to generate -> 'Download'.
The endgoal is to use this for a list of approx. 1700 proteins which I have in .txt (so extract a protein, in this case 'Q9BXF6' and put it in the url: https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table) and download all results in .tsv files.
I tried inspecting the 'Export' button but the sourcecode wasn't illuminating (or I didn't know where to look). I also tried this:
r = requests.get('https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table')
soup = BeautifulSoup(r.content, 'html.parser')
to locate what I need but it outputs a bunch of characters that I can't really understand.
I also tried downloading the whole page just like it is with the urllib library:
with
myurl = 'https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table'
urllib.request.urlopen() as f:
html = f.read().decode('utf-8')
or
urllib.urlretrieve (myurl, 'interpro.txt') # although this didn't work
It seems as if all content is written somewhere else and refered to and everything I've tried outputs something stupid, but I don't know anything about html and am really new to python (I only use R).
For your first question, you can use the URL of the following element to retrieve the protein value that you require for the next problem.
href="blob:https://www.ebi.ac.uk/806960aa-720c-4958-9392-f242adee627b"
The URL is set to the href tag which you can then use it to make the request to download the file. You can also obtain this by right-clicking on the download button for TSV and clicking Inspect-Element you will then be able to see the presence of this href tag.
Following that, download by doing e.g.
import urllib.request
url = 'https://www.ebi.ac.uk/806960aa-720c-4958-9392-f242adee627b'
urllib.request.urlretrieve(url, '/Users/abc/Downloads/file.tsv') # any dir to save
with open("/Users/abc/Downloads/file.tsv") as file_in:
for line in file_in:
#here make your calls for your second problem.
You can also use a Web-Automator such as selenium to gracefully solve this problem. If the latter is of interest do look into it - it's not hard.

Web Scraping youtube with Python 3

I'm doing a project where I need to store the date that a video in youtube was published.
The problem is that I'm having some difficulties trying to find this data in the middle of the HTML source code
Here's my code attempt:
import requests
from bs4 import BeautifulSoup as BS
url = "https://www.youtube.com/watch?v=XQgXKtPSzUI&t=915s"
response = requests.get(url)
soup = BS(response.content, "html.parser")
response.close()
dia = soup.find_all('span',{'class':'date'})
print(dia)
Output:
[]
I know that the arguments I'm sending to .find_all() are wrong.
I'm saying this because I was able to store other information from the video using the same code, such as the title and the views.
I've tried different arguments when using .find_all() but didn't figured out how to find it.
If you use Python with pafy, the object you'll get has the published date easily accessible.
Install pafy: "pip install pafy"
import pafy
vid = pafy.new("www.youtube.com/watch?v=2342342whatever")
published_date = vid.published
print(published_date) #Python3 print statement
Check out the pafy docs for more info:
https://pythonhosted.org/Pafy/
The reason I leave the doc link is because it's a really neat module, it handles getting the data without external request modules and also exposes a bunch of other useful properties of the video, like the best format download link, etc.
It seems that YouTube is using javascript to add the date, so that information is not in the source code. You should try using Selenium to scrape, or get the date from the js since it is directly in the source code.
Try adding attribute as shown below:
dia = soup.find_all('span', attr={'class':'date'})

Python iterating through pages google search

I am working on a larger code that will display the links of the results for a Google Newspaper search and then analyze those links for certain keywords and context and data. I've gotten everything this one part to work, and now when I try to iterate through the pages of results I come to a problem. I'm not sure how to do this without an API, which I do not know how to use. I just need to be able to iterate through multiple pages of search results so that I can then apply my analysis to it. It seems like there is a simple solution to iterating through the pages of results, but I am not seeing it.
Are there any suggestions on ways to approach this problem? I am somewhat new to Python and have been teaching myself all of these scraping techniques, so I'm not sure if I'm just missing something simple here. I know this may be an issue with Google restricting automated searches, but even pulling in the first 100 or so links would be beneficial. I have seen examples of this from regular Google searches but not from Google Newspaper searches
Here is the body of the code. If there are any lines where you have suggestions, that would be helpful. Thanks in advance!
def get_page_tree(url):
page = requests.get(url=url, verify=False)
return html.fromstring(page.text)
def find_other_news_sources(initial_url):
forwarding_identifier = '/url?q='
google_news_search_url = "https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=ohio+pay-to-play&oq=ohio+pay-to-play&gs_l=news-cc.3..43j43i53.2737.7014.0.7207.16.6.0.10.10.0.64.327.6.6.0...0.0...1ac.1.NAJRCoza0Ro"
google_news_search_tree = get_page_tree(url=google_news_search_url)
other_news_sources_links = [a_link.replace(forwarding_identifier, '').split('&')[0] for a_link in google_news_search_tree.xpath('//a//#href') if forwarding_identifier in a_link]
return other_news_sources_links
links = find_other_news_sources("https://www.google.com/search? hl=en&gl=us&tbm=nws&authuser=0&q=ohio+pay-to-play&oq=ohio+pay-to-play&gs_l=news-cc.3..43j43i53.2737.7014.0.7207.16.6.0.10.10.0.64.327.6.6.0...0.0...1ac.1.NAJRCoza0Ro")
with open('textanalysistest.csv', 'wt') as myfile:
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
for row in links:
print(row)
I'm looking into building a parser for a site with similar structure to google's (i.e. a bunch of consecutive results pages, each with a table of content of interest).
A combination of the Selenium package (for page-element based site navigation) and BeautifulSoup (for html parsing) seems like it's the weapon of choice for harvesting written content. You may find them useful too, although I have no idea what kinds of defenses google has in place to deter scraping.
A possible implementation for Mozilla Firefox using selenium, beautifulsoup and geckodriver:
from bs4 import BeautifulSoup, SoupStrainer
from bs4.diagnose import diagnose
from os.path import isfile
from time import sleep
import codecs
from selenium import webdriver
def first_page(link):
"""Takes a link, and scrapes the desired tags from the html code"""
driver = webdriver.Firefox(executable_path = 'C://example/geckodriver.exe')#Specify the appropriate driver for your browser here
counter=1
driver.get(link)
html = driver.page_source
filter_html_table(html)
counter +=1
return driver, counter
def nth_page(driver, counter, max_iter):
"""Takes a driver instance, a counter to keep track of iterations, and max_iter for maximum number of iterations. Looks for a page element matching the current iteration (how you need to program this depends on the html structure of the page you want to scrape), navigates there, and calls mine_page to scrape."""
while counter <= max_iter:
pageLink = driver.find_element_by_link_text(str(counter)) #For other strategies to retrieve elements from a page, see the selenium documentation
pageLink.click()
scrape_page(driver)
counter+=1
else:
print("Done scraping")
return
def scrape_page(driver):
"""Takes a driver instance, extracts html from the current page, and calls function to extract tags from html of total page"""
html = driver.page_source #Get html from page
filter_html_table(html) #Call function to extract desired html tags
return
def filter_html_table(html):
"""Takes a full page of html, filters the desired tags using beautifulsoup, calls function to write to file"""
only_td_tags = SoupStrainer("td")#Specify which tags to keep
filtered = BeautifulSoup(html, "lxml", parse_only=only_td_tags).prettify() #Specify how to represent content
write_to_file(filtered) #Function call to store extracted tags in a local file.
return
def write_to_file(output):
"""Takes the scraped tags, opens a new file if the file does not exist, or appends to existing file, and writes extracted tags to file."""
fpath = "<path to your output file>"
if isfile(fpath):
f = codecs.open(fpath, 'a') #using 'codecs' to avoid problems with utf-8 characters in ASCII format.
f.write(output)
f.close()
else:
f = codecs.open(fpath, 'w') #using 'codecs' to avoid problems with utf-8 characters in ASCII format.
f.write(output)
f.close()
return
After this, it is just a matter of calling:
link = <link to site to scrape>
driver, n_iter = first_page(link)
nth_page(driver, n_iter, 1000) # the 1000 lets us scrape 1000 of the result pages
Note that this script assumes that the result pages you are trying to scrape are sequentially numbered, and those numbers can be retrieved from the scraped page's html using 'find_element_by_link_text'. For other strategies to retrieve elements from a page, see the selenium documentation here.
Also, note that you need to download the packages on which this depends, and the driver that selenium needs in order to talk with your browser (in this case geckodriver, download geckodriver, place it in a folder, and then refer to the executable in 'executable_path')
If you do end up using these packages, it can help to spread out your server requests using the time package (native to python) to avoid exceeding a maximum number of requests allowed to the server off of which you are scraping. I didn't end up needing it for my own project, but see here, second answer to the original question, for an implementation example with the time module used in the fourth code block.
Yeeeeaaaahhh... If someone with higher rep could edit and add some links to beautifulsoup, selenium and time documentations, that would be great, thaaaanks.

Basic file uploading via website form using POST Requests in Python

I try to upload a file on a random website using Python and HTTP requests. For this, I use the handy library named Requests.
According to the documentation, and some answers on StackOverflow here and there, I just have to add a files parameter in my application, after studying the DOM of the web page.
The method is simple:
Look in the source code for the URL of the form ("action" attribute);
Look in the source code for the "name" attribute of the uploading
button ;
Look in the source code for the "name" and "value" attributes of the submit form button ;
Complete the Python code with the required parameters.
Sometimes this works fine. Indeed, I managed to upload a file on this site : http://pastebin.ca/upload.php
After looking in the source code, the URL of the form is upload.php, the buttons names are file and s, the value is Upload, so I get the following code:
url = "http://pastebin.ca/upload.php"
myFile = open("text.txt","rb")
r = requests.get(url,data={'s':'Upload'},files={'file':myFile})
print r.text.find("The uploaded file has been accepted.")
# ≠ -1
But now, let's look at that site: http://www.pictureshack.us/
The corresponding code is as follows:
url = "http://www.pictureshack.us/index2.php"
myFile = open("text.txt","rb")
r = requests.get(url,data={'Upload':'upload picture'},files={'userfile':myFile})
print r.text.find("Unsupported File Type!")
# = -1
In fact, the only difference I see between these two sites is that for the first, the URL where the work is done when submitting the form is the same as the page where the form is and where the files are uploaded.
But that does not solve my problem, because I still do not know how to submit my file in the second case.
I tried to make my request on the main page instead of the .php, but of course it does not work.
In addition, I have another question.
Suppose that some form elements do not have "name" attribute. How am I supposed to designate it at my request with Python?
For example, this site: http://imagesup.org/
The submitting form button looks like this: <input type="submit" value="Héberger !">
How can I use it in my data parameters?
The forms have another component you must honour: the method attribute. You are using GET requests, but the forms you are referring to use method="post". Use requests.post to send a POST request.

Download text from a URL in Python

I'm working on a school project currently which aim goal is to analyze scam mails with the Natural Language Toolkit package. Basically what I'm willing to do is to compare scams from different years and try to find a trend - how does their structure changed with time.
I found a scam-database: http://www.419scam.org/emails/
I would like to download the content of the links with python, but I am stuck.
My code so far:
from BeautifulSoup import BeautifulSoup
import urllib2, re
html = urllib2.urlopen('http://www.419scam.org/emails/').read()
soup = BeautifulSoup(html)
links = soup.findAll('a')
links2 = soup.findAll(href=re.compile("index"))
print links2
So I can fetch the links but I don't know yet how can I download the content. Any ideas? Thanks a lot!
You've got a good start, but right now you're simply retrieving the index page and loading it into the BeautifulSoup parser. Now that you have href's from the links, you essentially need to open all of those links, and load their contents into data structures that you can then use for your analysis.
This essentially amounts to a very simple web-crawler. If you can use other people's code, you may find something that fits by googling "python Web crawler." I've looked at a few of those, and they are straightforward enough, but may be overkill for this task. Most web-crawlers use recursion to traverse the full tree of a given site. It looks like something much simpler could suffice for your case.
Given my unfamiliarity with BeautifulSoup, this basic structure will hopefully get you on the right path, or give you for a sense for how the web crawling is done:
from BeautifulSoup import BeautifulSoup
import urllib2, re
emailContents = []
def analyze_emails():
# this function and any sub-routines would analyze the emails after they are loaded into a data structure, e.g. emailContents
def parse_email_page(link):
print "opening " + link
# open, soup, and parse the page.
#Looks like the email itself is in a "blockquote" tag so that may be the starting place.
#From there you'll need to create arrays and/or dictionaries of the emails' contents to do your analysis on, e.g. emailContents
def parse_list_page(link):
print "opening " + link
html = urllib2.urlopen(link).read()
soup = BeatifulSoup(html)
email_page_links = # add your own code here to filter the list page soup to get all the relevant links to actual email pages
for link in email_page_links:
parseEmailPage(link['href'])
def main():
html = urllib2.urlopen('http://www.419scam.org/emails/').read()
soup = BeautifulSoup(html)
links = soup.findAll(href=re.compile("20")) # I use '20' to filter links since all the relevant links seem to have 20XX year in them. Seemed to work
for link in links:
parse_list_page(link['href'])
analyze_emails()
if __name__ == "__main__":
main()

Categories

Resources