Compare string result from path & requests - python

I am scraping the HTML code from the URL defined, mainly focussing on the tag, to extract the results of it. Then, compare if string "example" exists in the script, if yes, print something and flag =1.
I am not able to compare the results extracted from the HTML.fromstring
Able to scrape the HTML content and view the full successfully, wanted to proceed further but not able to (compare strings)
import requests
from lxml import html
page = requests.get("http://econpy.pythonanywhere.com/ex/001.html")
tree = html.fromstring(page.text) #was page.content
# To get all the content in <script> of the webpage
scripts = tree.xpath('//script/text()')
# To get line of script that contains the string "location" (text)
keyword = tree.xpath('//script/text()[contains(., "location")]')
# To get the element ID of the script that contains the string "location"
keywordElement = tree.xpath('//script[contains(., "location")]')
print('\n<SCRIPT> is :\n', scripts)
# To print the Element ID
print('\n\KEYWORD script is discovered # ',keywordElement)
# To print the line of script that contain "location" in text form
print('Supporting lines... \n\n',keyword)
# ******************************************************
# code below is where the string comparison comes in
# to compare the "keyword" and display output to user
# ******************************************************
string = "location"
if string in keyword:
print('\nDANGER: Keyword detected in URL entered')
Flag = "Detected" # For DB usage
else:
print('\nSAFE: Keyword does not exist in URL entered')
Flag = "Safe" # For DB usage
# END OF PROGRAM
Actual result: able to retrieve all the necessary information including its element and content
Expected result: To print the DANGER / SAFE word to user and define the variable "Flag" which will then stored into database.

keyword is a list.
You need to index the list to get the string after which you will be able to search for the specific string
"location" in keyword[0] #gives True

Related

How to fix "List index out of range" error while extracting the first and last page numbers in my web-scraping program?

I am trying to create my first Python web-scraper to automate one task for work - I need to write all vacancies from this website (only for health) to an Excel file. Using a tutorial, I have come up with the following program.
However, in step 6, I receive an error stating: IndexError: list index out of range.
I have tried using start_page = paging[2].text, as I thought that the first page may be the base page, but it results in the same error.
Here are the steps that I followed:
I checked that the website https://iworkfor.nsw.gov.au allows scraping
Imported the necessary libraries:
import requests
from bs4 import BeautifulSoup
import pandas
stored the URL as a variable:
base_url = "https://iworkfor.nsw.gov.au/nsw-health-jobs?divisionid=1"
Get the HTML content:
r = requests.get(base_url)`
c = r.content
parse HTML
soup = BeautifulSoup(c,"html.parser")
To extract the first and last page numbers
paging = soup.find("div",{"class":"pana jobResultPaging tab-paging-top"}).find_all("a")
start_page = paging[1].text
last_page = paging[len(paging)-2].text
Making an empty list to append all the content:
web_content_list = []
Making page links from the page numbers ,crawl through the pages and extract the contents from the corresponding tags
for page_number in range(int(start_page),int(last_page) + 1):
# To form the url based on page numbers
url = base_url+"&page="+str(page_number)
r = requests.get(base_url+"&page="+str(page_number))
c = r.content
soup = BeautifulSoup(c,"html.parser")
To extract the Title
vacancies_header = soup.find_all("div", {"class":"box-sec2-left"})
To extract the LHD, Job type and Job Reference number
vacancies_content = soup.find_all("div", {"class":"box-sec2-right"})
To process vacancy by vacancy by looping
for item_header,item_content in zip(vacancies_header,vacancies_content):
# To store the information to a dictionary
web_content_dict = {}
web_content_dict["Title"]=item_header.find("a").text.replace("\r","").replace("\n","")
web_content_dict["Date Posted"] = item_header.find("span").text
web_content_dict["LHD"] = item_content.find("h5").text
web_content_dict["Position Type"] = item_content.find("p").text
web_content_dict["Job Reference Number"] = item_content.find("span",{"class":"box-sec2-reference"}).text
# To store the dictionary to into a list
web_content_list.append(web_content_dict)
To make a dataframe with the list
df = pandas.DataFrame(web_content_list)
To write the dataframe to a csv file
df.to_csv("Output.csv")
Ideally, the program will write the data about all vacancies to a CSV file in a nice table with the columns: title, date posted, LHD, Position Type, Job reference number.
The problem is that your initial call to find() returns an empty <div>, and so your subsequent call to find_all returns an empty list:
>div = soup.find("div",{"class":"pana jobResultPaging tab-paging-top"
>div
<div class="pana jobResultPaging tab-paging-top">
</div>
>div.find_all("a")
[]
Update:
The reason you're unable to parse the contents of the <div> in question (i.e. why it's empty) has to do with the fact that the data retrieved from the server is "paginated" by client-side javascript (code in your browser). Your python code is parsing only the HTML that is returned by the request to iworkfor.nsw.gov.au; the data that is what you're after (and what is turned into "pages") is requested by that same javascript and returned by the server in a format called JSON.
So, the bad news is that the instructions that have been provided to you will not work. You will have to parse the JSON returned by the server and then decode the escaped HTML that it contains.

Unable to get Facebook Group members after first page using Python

I am trying to get the names of members of a group I am a member of. I am able to get the names in the first page but not sure how to go to the next page:
My Code:
url = 'https://graph.facebook.com/v2.5/1671554786408615/members?access_token=<MY_CUSTOM_ACCESS_CODE_HERE>'
json_obj = urllib2.urlopen(url)
data = json.load(json_obj)
for each in data['data']:
print each['name']
Using the code above I am successfully getting all names on the first page but question is -- how do I go to the next page?
In the Graph API Explorer Output screen I see this:
What change does my code need to keep going to next pages and get names of ALL members of the group?
The JSON returned by the Graph API is telling you where to get the next page of data, in data['paging']['next']. You could give something like this a try:
def printNames():
json_obj = urllib2.urlopen(url)
data = json.load(json_obj)
for each in data['data']:
print each['name']
return data['paging']['next'] # Return the URL to the next page of data
url = 'https://graph.facebook.com/v2.5/1671554786408615/members?access_token=<MY_CUSTOM_ACCESS_CODE_HERE>'
url = printNames()
print "====END OF PAGE 1===="
url = printNames()
print "====END OF PAGE 2===="
You would need to add checks, for instance ['paging']['next'] will only be available in your JSON object if there is a next page, so you might want to modify your function to return a more complex structure to convey this information, but this should give you the idea.

Python adding a string to a match list with multiple items

The code I am working on is retrieving a list from an HTML page with 2 fields, URL, and title...
The URL anyway starts with /URL.... And I need to append the "http://website.com" to every returned vauled from a re.findall.
The code so far is this:
bsoup=bs(html)
tag=soup.find('div',{'class':'item'})
reg=re.compile('<a href="(.+?)" rel=".+?" title="(.+?)"')
links=re.findall(reg,str(tag))
*(append "http://website.com" to the href"(.+?)" field)*
return links
Try:
for link in tag.find_all('a'):
link['href'] = 'http://website.com' + link['href']
Then use one of these output methods:
return str(soup) gets you the document after the changes are applied.
return tag.find_all('a') gets you all the link elements.
return [str(i) for i in tag.find_all('a')] gets you all the link elements converted to strings.
Now, don't try to parse HTML with regex while you have a XML parser already working.

Get the id from url

I would like to get the id from the url ,I have stored a set of urls in a list,i would like to get a certail part of the url ,ie is the id part ,for thoose url that dont have an id part should print as none.The code so far i have tried
text=[u'/fhgj/b?ie=UTF8&node=2858778011',u'/gp/v/w/', u'/gp/v/l', u'/gp/fhghhgl?ie=UTF8&docId=1001423601']
text=text.rsplit(sep='&', maxsplit=-1)
print text
the output is
[u'2858778011',u'/gp/v/w/', u'/gp/v/l', u'1001423601']
i expect to get something like this
[u'2858778011',u'None', u'None', u'1001423601']
Use urlparse, or if you really want to use string libs then
prefix, sep, text = text.partition("&")
(or just text = text.partition("&")[2]).

Extracting parts of a webpage with python

So I have a data retrieval/entry project and I want to extract a certain part of a webpage and store it in a text file. I have a text file of urls and the program is supposed to extract the same part of the page for each url.
Specifically, the program copies the legal statute following "Legal Authority:" on pages such as this. As you can see, there is only one statute listed. However, some of the urls also look like this, meaning that there are multiple separated statutes.
My code works for pages of the first kind:
from sys import argv
from urllib2 import urlopen
script, urlfile, legalfile = argv
input = open(urlfile, "r")
output = open(legalfile, "w")
def get_legal(page):
# this is where Legal Authority: starts in the code
start_link = page.find('Legal Authority:')
start_legal = page.find('">', start_link+1)
end_link = page.find('<', start_legal+1)
legal = page[start_legal+2: end_link]
return legal
for line in input:
pg = urlopen(line).read()
statute = get_legal(pg)
output.write(get_legal(pg))
Giving me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names. I've tried something like this:
def get_legal(page):
# this is where Legal Authority: starts in the code
end_link = ""
legal = ""
start_link = page.find('Legal Authority:')
while (end_link != '</a> '):
start_legal = page.find('">', start_link+1)
end_link = page.find('<', start_legal+1)
end2 = page.find('</a> ', end_link+1)
legal += page[start_legal+2: end_link]
if
break
return legal
Since every list of statutes ends with '</a> ' (inspect the source of either of the two links) I thought I could use that fact (having it as the end of the index) to loop through and collect all the statutes in one string. Any ideas?
I would suggest using BeautifulSoup to parse and search your html. This will be much easier than doing basic string searches.
Here's a sample that pulls all the <a> tags found within the <td> tag that contains the <b>Legal Authority:</b> tag. (Note that I'm using requests library to fetch page content here - this is just a recommended and very easy to use alternative to urlopen.)
import requests
from BeautifulSoup import BeautifulSoup
# fetch the content of the page with requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)
# parse the html
html = BeautifulSoup(response.content)
# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})
def fetch_parent_tag(tags):
# fetch the parent <td> tag of the first <a> tag
# whose "previous sibling" is the <b>Legal Authority:</b> tag.
for tag in tags:
sibling = tag.findPreviousSibling()
if not sibling:
continue
if sibling.getText() == 'Legal Authority:':
return tag.findParent()
# now, just find all the child <a> tags of the parent.
# i.e. finding the parent of one child, find all the children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')
for tag in tags_you_want:
print 'statute: ' + tag.getText()
If this isn't exactly what you needed to do, BeautifulSoup is still the tool you likely want to use for sifting through html.
They provide XML data over there, see my comment. If you think you can't download that many files (or the other end could dislike so many HTTP GET requests), I'd recommend asking their admins if they would kindly provide you with a different way of accessing the data.
I have done so twice in the past (with scientific databases). In one instance the sheer size of the dataset prohibited a download; they ran a SQL query of mine and e-mailed the results (but had previously offered to mail a DVD or hard disk). In another case, I could have done some million HTTP requests to a webservice (and they were ok) each fetching about 1k bytes. This would have taken long, and would have been quite inconvenient (requiring some error-handling, since some of these requests would always time out) (and non-atomic due to paging). I was mailed a DVD.
I'd imagine that the Office of Management and Budget could possibly be similar accomodating.

Categories

Resources