Extracting parts of a webpage with python

Extracting parts of a webpage with python - python

So I have a data retrieval/entry project and I want to extract a certain part of a webpage and store it in a text file. I have a text file of urls and the program is supposed to extract the same part of the page for each url.
Specifically, the program copies the legal statute following "Legal Authority:" on pages such as this. As you can see, there is only one statute listed. However, some of the urls also look like this, meaning that there are multiple separated statutes.
My code works for pages of the first kind:
from sys import argv
from urllib2 import urlopen
script, urlfile, legalfile = argv
input = open(urlfile, "r")
output = open(legalfile, "w")
def get_legal(page):
# this is where Legal Authority: starts in the code
start_link = page.find('Legal Authority:')
start_legal = page.find('">', start_link+1)
end_link = page.find('<', start_legal+1)
legal = page[start_legal+2: end_link]
return legal
for line in input:
pg = urlopen(line).read()
statute = get_legal(pg)
output.write(get_legal(pg))
Giving me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names. I've tried something like this:
def get_legal(page):
# this is where Legal Authority: starts in the code
end_link = ""
legal = ""
start_link = page.find('Legal Authority:')
while (end_link != '</a> '):
start_legal = page.find('">', start_link+1)
end_link = page.find('<', start_legal+1)
end2 = page.find('</a> ', end_link+1)
legal += page[start_legal+2: end_link]
if
break
return legal
Since every list of statutes ends with '</a> ' (inspect the source of either of the two links) I thought I could use that fact (having it as the end of the index) to loop through and collect all the statutes in one string. Any ideas?

I would suggest using BeautifulSoup to parse and search your html. This will be much easier than doing basic string searches.
Here's a sample that pulls all the <a> tags found within the <td> tag that contains the <b>Legal Authority:</b> tag. (Note that I'm using requests library to fetch page content here - this is just a recommended and very easy to use alternative to urlopen.)
import requests
from BeautifulSoup import BeautifulSoup
# fetch the content of the page with requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)
# parse the html
html = BeautifulSoup(response.content)
# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})
def fetch_parent_tag(tags):
# fetch the parent <td> tag of the first <a> tag
# whose "previous sibling" is the <b>Legal Authority:</b> tag.
for tag in tags:
sibling = tag.findPreviousSibling()
if not sibling:
continue
if sibling.getText() == 'Legal Authority:':
return tag.findParent()
# now, just find all the child <a> tags of the parent.
# i.e. finding the parent of one child, find all the children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')
for tag in tags_you_want:
print 'statute: ' + tag.getText()
If this isn't exactly what you needed to do, BeautifulSoup is still the tool you likely want to use for sifting through html.

They provide XML data over there, see my comment. If you think you can't download that many files (or the other end could dislike so many HTTP GET requests), I'd recommend asking their admins if they would kindly provide you with a different way of accessing the data.
I have done so twice in the past (with scientific databases). In one instance the sheer size of the dataset prohibited a download; they ran a SQL query of mine and e-mailed the results (but had previously offered to mail a DVD or hard disk). In another case, I could have done some million HTTP requests to a webservice (and they were ok) each fetching about 1k bytes. This would have taken long, and would have been quite inconvenient (requiring some error-handling, since some of these requests would always time out) (and non-atomic due to paging). I was mailed a DVD.
I'd imagine that the Office of Management and Budget could possibly be similar accomodating.

Related

Scraping csv file from url with React script

I want to scrape sample_info.csv file from https://depmap.org/portal/download/.
Since there is a React script on the website it's not that straightforward with BeautifulSoup and accessing the file via an appropriate tag. I did approach this from many angles and the one that gave me the best results looks like this and it returns the executed script where all downloaded files are listed together with other data. My then idea was to strip the tags and store the information in JSON. However, I think there must be some kind of mistake in the data because it is impossible to store it as JSON.
url = 'https://depmap.org/portal/download/'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
all_scripts = soup.find_all('script')
script = str(all_scripts[32])
last_char_index = script.rfind("}]")
first_char_index = script.find("[{")
script_cleaned = script[first_char_index:last_char_index+2]
script_json = json.loads(script_cleaned)
This code gives me an error
JSONDecodeError: Extra data: line 1 column 7250 (char 7249)
I know that my solution might not be elegant but it took me closest to the goal i.e. downloading the sample_info.csv file from the website. Not sure how to proceed here. If there are other options? I tried with selenium but this solution will not be feasible for the end-user of my script due to the driver path declaration

It is probably easier in this context to use regular expressions, since the string is invalid JSON.
This RegEx tool (https://pythex.org/) can be useful for testing expressions.
import re
re.findall(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"', script_cleaned)
#[
# ('https://ndownloader.figshare.com/files/26261524', 'CCLE_gene_cn.csv'),
# ('https://ndownloader.figshare.com/files/26261527', 'CCLE_mutations.csv'),
# ('https://ndownloader.figshare.com/files/26261293', 'Achilles_gene_effect.csv'),
# ('https://ndownloader.figshare.com/files/26261569', 'sample_info.csv'),
# ('https://ndownloader.figshare.com/files/26261476', 'CCLE_expression.csv'),
# ('https://ndownloader.figshare.com/files/17741420', 'primary_replicate_collapsed_logfold_change_v2.csv'),
# ('https://gygi.med.harvard.edu/publications/ccle', 'protein_quant_current_normalized.csv'),
# ('https://ndownloader.figshare.com/files/13515395', 'D2_combined_gene_dep_scores.csv')
# ]
Edit: This also works by passing the html_content directly (no need to BeautifulSoup).
url = 'https://depmap.org/portal/download/'
html_content = requests.get(url).text
re.findall(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"', html_content)

Regular Expression to read between HTML works in RegEx tester but not in my code

I'm fairly new to RegEx (and Python) in general and am trying to use it to read the temperature and description of weather via the HTML tags of a website.
I've attempted to rework examples of what I've been shown in class and read online to do this.
url = 'https://weather.com/en-AU/weather/today/l/-27.47,153.02'
contents = urllib.request.urlopen(url).read().decode("utf-8")
start_of_div = contents.find('<div class="today_nowcard-phrase">') # start of phrase line
end_of_div = start_of_div + contents[start_of_div:].find("</div>") + 6 # close of phrase line
phrase_area = contents[start_of_div:end_of_div]
print(phrase_area)
phrase = phrase_area.rfind(r'>(.*)<') # regex tester says this works
print(phrase)
There's then another section that gets the degrees which uses the same kind of layout.
It should print a phrase like 'Sunny' or 'Light Rain' or whatever else the weather is, as well as the current degrees (celsius). Instead it prints out:
<div class="today_nowcard-phrase">Sunny</div>
- 1
<div class="today_nowcard-temp"><span class="">21<sup>
- 1
Instead of -1 it should be 'Sunny' and '21' (at that point of time). The RegEx works when I put it into RegEx testing sites, but not in my actual program (probably because of some obvious error I can't see). Any help would be appreciated.

As mentioned in comments used an html parser. The elements all have nice distinctive class names you can use e.g. .today_nowcard-temp (where the leading . is a css class selector to match on element class name)
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://weather.com/en-AU/weather/today/l/-27.47,153.02')
soup = bs(r.content, 'html.parser')
temp = soup.select_one('.today_nowcard-temp').text
desc = soup.select_one('.today_nowcard-phrase').text
print(temp, desc)

How to fix "List index out of range" error while extracting the first and last page numbers in my web-scraping program?

I am trying to create my first Python web-scraper to automate one task for work - I need to write all vacancies from this website (only for health) to an Excel file. Using a tutorial, I have come up with the following program.
However, in step 6, I receive an error stating: IndexError: list index out of range.
I have tried using start_page = paging[2].text, as I thought that the first page may be the base page, but it results in the same error.
Here are the steps that I followed:
I checked that the website https://iworkfor.nsw.gov.au allows scraping
Imported the necessary libraries:
import requests
from bs4 import BeautifulSoup
import pandas
stored the URL as a variable:
base_url = "https://iworkfor.nsw.gov.au/nsw-health-jobs?divisionid=1"
Get the HTML content:
r = requests.get(base_url)`
c = r.content
parse HTML
soup = BeautifulSoup(c,"html.parser")
To extract the first and last page numbers
paging = soup.find("div",{"class":"pana jobResultPaging tab-paging-top"}).find_all("a")
start_page = paging[1].text
last_page = paging[len(paging)-2].text
Making an empty list to append all the content:
web_content_list = []
Making page links from the page numbers ,crawl through the pages and extract the contents from the corresponding tags
for page_number in range(int(start_page),int(last_page) + 1):
# To form the url based on page numbers
url = base_url+"&page="+str(page_number)
r = requests.get(base_url+"&page="+str(page_number))
c = r.content
soup = BeautifulSoup(c,"html.parser")
To extract the Title
vacancies_header = soup.find_all("div", {"class":"box-sec2-left"})
To extract the LHD, Job type and Job Reference number
vacancies_content = soup.find_all("div", {"class":"box-sec2-right"})
To process vacancy by vacancy by looping
for item_header,item_content in zip(vacancies_header,vacancies_content):
# To store the information to a dictionary
web_content_dict = {}
web_content_dict["Title"]=item_header.find("a").text.replace("\r","").replace("\n","")
web_content_dict["Date Posted"] = item_header.find("span").text
web_content_dict["LHD"] = item_content.find("h5").text
web_content_dict["Position Type"] = item_content.find("p").text
web_content_dict["Job Reference Number"] = item_content.find("span",{"class":"box-sec2-reference"}).text
# To store the dictionary to into a list
web_content_list.append(web_content_dict)
To make a dataframe with the list
df = pandas.DataFrame(web_content_list)
To write the dataframe to a csv file
df.to_csv("Output.csv")
Ideally, the program will write the data about all vacancies to a CSV file in a nice table with the columns: title, date posted, LHD, Position Type, Job reference number.

The problem is that your initial call to find() returns an empty <div>, and so your subsequent call to find_all returns an empty list:
>div = soup.find("div",{"class":"pana jobResultPaging tab-paging-top"
>div
<div class="pana jobResultPaging tab-paging-top">
</div>
>div.find_all("a")
[]
Update:
The reason you're unable to parse the contents of the <div> in question (i.e. why it's empty) has to do with the fact that the data retrieved from the server is "paginated" by client-side javascript (code in your browser). Your python code is parsing only the HTML that is returned by the request to iworkfor.nsw.gov.au; the data that is what you're after (and what is turned into "pages") is requested by that same javascript and returned by the server in a format called JSON.
So, the bad news is that the instructions that have been provided to you will not work. You will have to parse the JSON returned by the server and then decode the escaped HTML that it contains.

Bioinformatics : Programmatic Access to the BacDive Database

the resource at "BacDive" - ( http://bacdive.dsmz.de/) is a highly useful database for accessing bacterial knowledge, such as strain information, species information and parameters such as growth temperature optimums.
I have a scenario in which I have a set of organism names in a plain text file, and I would like to programmatically search them 1 by 1 against the Bacdive database (which doesnt allow a flat file to be downloaded) and retrieve the relevent information and populate my text file accordingly.
What are the main modules (such as beautifulsoups) that I would need to accomplish this? Is it straight forward? Is it allowed to programmatically access webpages ? Do I need permission?
A bacteria name would be "Pseudomonas putida" . Searching this would give 60 hits on bacdive. Clicking one of the hits, takes us to the specific page, where the line : "Growth temperature: [Ref.: #27] Recommended growth temperature : 26 °C " is the most important.
The script would have to access bacdive (which i have tried accessing using requests, but I feel they do not allow programmatic access, I have asked the moderator about this, and they said I should register for their API first).
I now have the API access. This is the page (http://www.bacdive.dsmz.de/api/bacdive/). This may seem quite simple to people who do HTML scraping, but I am not sure what to do now that I have access to the API.

Here is the solution...
import re
import urllib
from bs4 import BeautifulSoup
def get_growth_temp(url):
soup = BeautifulSoup(urllib.urlopen(url).read())
no_hits = int(map(float, re.findall(r'[+-]?[0-9]+',str(soup.find_all("span", class_="searchresultlayerhits"))))[0])
if no_hits > 1 :
letters = soup.find_all("li", class_="searchresultrow1") + soup.find_all("li", class_="searchresultrow2")
all_urls = []
for i in letters:
all_urls.append('http://bacdive.dsmz.de/index.php' + i.a["href"])
max_temp = []
for ind_url in all_urls:
soup = BeautifulSoup(urllib.urlopen(ind_url).read())
a = soup.body.findAll(text=re.compile('Recommended growth temperature :'))
if a:
max_temp.append(int(map(float, re.findall(r'[+-]?[0-9]+', str(a)))[0]))
print "Recommended growth temperature : %d °C:\t" % max(max_temp)
url = 'http://bacdive.dsmz.de/index.php?search=Pseudomonas+putida'
if __name__ == "__main__":
# TO Open file then iterate thru the urls/bacterias
# with open('file.txt', 'rU') as f:
# for url in f:
# get_growth_temp(url)
get_growth_temp(url)
Edit:
Here I am passing single url. if you want to pass multiple urls to get their growth temperature. call the function(url) by opening file. code is commented.
Hope it helped you..
Thanks

Scraping urbandictionary with Python

I'm currently working on an arcbot and I'm trying to make a command "!urbandictionary", it should scrape the meaning of a term, the first one which is provided by urbandictionary, if there's another solution, e.g. another dictionary site with a better api that's also good. Here's my code:
if Command.lower() == '!urban':
dictionary = Argument[1] #this is the term which the user provides, e.g. "scrape"
dictionaryscrape = urllib2.urlopen('http://www.urbandictionary.com/define.php?term='+dictionary).read() #plain html of the site
scraped = getBetweenHTML(dictionaryscrape, '<div class="meaning">','</div>') #Here's my problem, i'm not sure if it scrapes the first meaning or not..
messages.main(scraped,xSock,BotID) #Sends the meaning of the provided word (Argument[0])
How do I correctly scrape a meaning of a word in urbandictionary?

Just get the text from the meaning class:
import requests
from bs4 import BeautifulSoup
word = "scrape"
r = requests.get("http://www.urbandictionary.com/define.php?term={}".format(word))
soup = BeautifulSoup(r.content)
print(soup.find("div",attrs={"class":"meaning"}).text)
Gassing and breaking your car repeatedly really fast so that the front and rear bumpers "scrape" the pavement; while going hyphy

There is an unofficial api here apparently
`http://api.urbandictionary.com/v0/define?term={word}`
From https://github.com/zdict/zdict/wiki/Urban-dictionary-API-documentation

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting parts of a webpage with python - python

Related

Scraping csv file from url with React script

Regular Expression to read between HTML works in RegEx tester but not in my code

How to fix "List index out of range" error while extracting the first and last page numbers in my web-scraping program?

Bioinformatics : Programmatic Access to the BacDive Database

Scraping urbandictionary with Python

Categories

Resources