My current Cloud Run URL returns a long string, matching the exact format described here.
When I run the following code in Google Apps Script, I get a log output of '1'. What happens is that the entire string is put in the [0][0] position of the data array instead of actually being parsed.
function myFunction() {
const token = ScriptApp.getIdentityToken();
const options = {
headers: {'Authorization': 'Bearer ' + token}
}
var responseString = UrlFetchApp.fetch("https://*myproject*.a.run.app", options).getContentText();
var data = Utilities.parseCsv(responseString, '\t');
Logger.log(data.length);
}
My expected output is a 2D array as described in the aforementioned link, with a logged output length of 18.
I have confirmed the output of my response by:
Logging the responseString
Copying the output log into a separate var -> var temp = "copied-output"
Changing the parseCsv line to -> var data = Utilities.parseCsv(temp, '\t')
Saving and running the new code. This then outputs a successful 2D array with a length of 18.
So why is it that my current code doesn't work?
Happy to try anything because I am out of ideas.
Edit: More information below.
Python script code
@app.route("/")
def hello_world():
# Navigate to webpage and get page source
driver.get("https://www.asxlistedcompanies.com/")
soup = BeautifulSoup(driver.page_source, 'html.parser')
# ##############################################################################
# Used by Google Apps Script to create Arrays
# This creates a two-dimensional array of the format [[a, b, c], [d, e, f]]
# var csvString = "a\tb\tc\nd\te\tf";
# var data = Utilities.parseCsv(csvString, '\t');
# ##############################################################################
long_string = ""
limit = 1
for row in soup.select('tr'):
if limit == 20:
break
else:
tds = [td.a.get_text(strip=True) if td.a else td.get_text(strip=True) for td in row.select('td')]
count = 0
for column in tds:
if count == 4:
linetext = column + r"\n"
long_string = long_string+linetext
else:
text = column + r"\t"
long_string = long_string+text
count = count+1
limit = limit+1
return long_string
GAS Code edited:
function myFunction() {
const token = ScriptApp.getIdentityToken();
const options = {
headers: {'Authorization': 'Bearer ' + token}
}
var responseString = UrlFetchApp.fetch("https://*myfunction*.a.run.app", options).getContentText();
Logger.log("The responseString: " + responseString);
Logger.log("responseString length: " + responseString.length)
Logger.log("responseString type: " + typeof(responseString))
var data = Utilities.parseCsv(responseString, '\t');
Logger.log(data.length);
}
GAS logs/output as requested:
6:17:11 AM Notice Execution started
6:17:22 AM Info The responseString: 14D\t1414 Degrees Ltd\tIndustrials\t21,133,400\t0.001\n1ST\t1ST Group Ltd\tHealth Care\t12,738,500\t0.001\n3PL\t3P Learning Ltd\tConsumer Discretionary\t104,613,000\t0.005\n4DS\t4DS Memory Ltd\tInformation Technology\t58,091,300\t0.003\n5GN\t5G Networks Ltd\t\t82,746,600\t0.004\n88E\t88 Energy Ltd\tEnergy\t42,657,800\t0.002\n8CO\t8COMMON Ltd\tInformation Technology\t11,157,900\t0.001\n8IH\t8I Holdings Ltd\tFinancials\t35,814,200\t0.002\n8EC\t8IP Emerging Companies Ltd\t\t3,199,410\t0\n8VI\t8VIC Holdings Ltd\tConsumer Discretionary\t13,073,200\t0.001\n9SP\t9 Spokes International Ltd\tInformation Technology\t21,880,100\t0.001\nACB\tA-Cap Energy Ltd\tEnergy\t7,846,960\t0\nA2B\tA2B Australia Ltd\tIndustrials\t95,140,200\t0.005\nABP\tAbacus Property Group\tReal Estate\t1,679,500,000\t0.082\nABL\tAbilene Oil and Gas Ltd\tEnergy\t397,614\t0\nAEG\tAbsolute Equity Performance Fund Ltd\t\t107,297,000\t0.005\nABT\tAbundant Produce Ltd\tConsumer Staples\t1,355,970\t0\nACS\tAccent Resources NL\tMaterials\t905,001\t0\n
6:17:22 AM Info responseString length: 1020
6:17:22 AM Info responseString type: string
6:17:22 AM Info 1.0
6:17:22 AM Notice Execution completed
Issue:
Using the r'' raw-string prefix makes \n and \t a literal backslash followed by n or t, rather than a newline or tab character. This explains why you were able to copy the "displayed" logs into a variable and parse them successfully: pasting the logged text back in as a normal string literal turned those escape sequences into real newlines and tabs.
Solution:
Don't use the r flag.
Snippet:
linetext = column + "\n" #no flag
long_string = long_string+linetext
else:
text = column + "\t" #no flag
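To make the difference concrete, here is a minimal standalone sketch (hypothetical row values, not part of the original answer): with the r prefix the delimiters are two literal characters, so parseCsv sees a single line.
# Sketch only: hypothetical row values, not the real scrape output.
row = ["14D", "1414 Degrees Ltd", "Industrials"]
bad = r"\t".join(row) + r"\n"    # literal backslash + t / backslash + n (two characters each)
good = "\t".join(row) + "\n"     # real tab and newline characters
print(len(bad.split("\n")))      # 1 -> everything stays on one "line"
print(len(good.split("\n")))     # 2 -> the newline actually splits the string
Building the response with "\t".join(...) per row and "\n".join(...) across rows would also remove the need for the manual count bookkeeping.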
Related
I'm running a script that writes a block of code into a textarea on a website. The first four "lines" write correctly until var url = "https:, at which point the cursor jumps to the upper left of the textarea and then continues writing. Each time the / character is encountered, the cursor returns to the upper left before writing continues.
How can I prevent the cursor from being affected?
I have tried \/, \\/, {/}, and similar ways to escape the slash.
self.driver.find_element_by_id('textarea').send_keys(
'\nvar device = this\n\nvar url = "' + baseurl + '/' + firmwarename + '"\n\nvar conv = TR69.createConnection(device)\n\ntry {'+
'var uuid = java.util.UUID.randomUUID().toString().replace("-","") \n'+
What it physically does:
myhiddenurl.comSG9C130016_prod-mycomponent-5260-8a.27.103-combined-squashfs.img.gsdf"
var conv = TR69.createConnection(device)
var device = this
var url = "http:
Notice that lines 3 and 4 should be 1 and 2. And that line 1 is the continuation of what is now line 4.
Here is sample code which shows the issue...
firmwarename = "tchrisdemo-code-3-2-3.gsdf"
self.driver.get("https://futureoftesting.us/os.html")
self.driver.find_element_by_id('textarea').clear()
baseurl = "http://myhiddendomain.com/"
self.driver.find_element_by_id('textarea').send_keys(
'\nvar device = this\n\nvar url = "' + baseurl + '/' +
firmwarename + '"\n\nvar conv = TR69.createConnection(device)\n\ntry {'+
'var uuid = java.util.UUID.randomUUID().toString().replace("-","") \n'+
'var dlRequest = new TR69.DownloadRequest() \n' )
Line 5 of the code is the problem...
I've tried a variety of changes akin to your comment. The .format one allowed one "/" through, then jumped to the top of the textarea and continued writing at the next one.
baseurl = r"http://myhiddendomain.com/"
url = "{}/{}".format(baseurl,firmwarename)
self.driver.find_element_by_id('textarea').send_keys(
'\nvar device = this\n\nvar url = "' + baseurl + firmwarename + '"\n\nvar conv = TR69.createConnection(device)\n\ntry {'+
'var uuid = java.util.UUID.randomUUID().toString().replace("-","") \n'+
'var dlRequest = new TR69.DownloadRequest() \nThis is formatting: ' + url)
which sadly generated this:
var dlRequest = new TR69.DownloadRequest()
This is formatting: http:/myhiddendomain.com/
var device = this
Not sure I fully get this solution.
It appears after more searching that the "jumping cursor" is a known problem and that the "devs have to fix it"
Backslash '\' is the escape character.
So try \/ (written here as \ / with a space in between so it doesn't look like a V, but you obviously want it without the space).
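As a side note (an addition, not from the answers above): when send_keys mangles particular characters like this, a workaround that is sometimes used is to bypass keystroke simulation entirely and set the textarea's value through JavaScript, optionally firing an input event in case the page listens for one. A minimal sketch, reusing the element id and variables from the snippets above:
# Sketch only: 'textarea', baseurl and firmwarename are taken from the code above.
element = self.driver.find_element_by_id('textarea')
block = '\nvar device = this\n\nvar url = "' + baseurl + firmwarename + '"\n'
self.driver.execute_script(
    "arguments[0].value = arguments[1];"
    "arguments[0].dispatchEvent(new Event('input', {bubbles: true}));",
    element, block)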
So I am creating a script to communicate with our asset-management API server and retrieve some information. I've found that the longest portion of the script's total run time is:
{method 'read' of '_ssl._SSLSocket' objects}
Currently we're pulling information for about 25 assets, and that portion alone takes 18.89 seconds.
Is there any way to optimize this so it doesn't take 45 minutes to do all 2,700 computers we have?
I can provide a copy of the actual code if that would be helpful.
import urllib2
import base64
import json
import csv
# Count Number so that process only runs for 25 assets at a time will be
# replaced with a variable that is determined by the number of computers added
# to the list
Count_Stop = 25
final_output_list = []
def get_creds():
# Credentials Function that retrieves username:pw from .file
with open('.cred') as cred_file:
cred_string = cred_file.read().rstrip()
return cred_string
print(cred_string)
def get_all_assets():
# Function to retrieve computer ID + computer names and store the ID in a
# new list called computers_parsed
request = urllib2.Request('jss'
'JSSResource/computers')
creds = get_creds()
request.add_header('Authorization', 'Basic ' + base64.b64encode(creds))
response = urllib2.urlopen(request).read()
# At this point the request for ID + name has been retrieved and now to be
# formatted in json
parsed_ids_json = json.loads(response)
# Then assign the parsed list (which has nested lists) at key 'computers'
# to a new list variable called computer_set
computer_set = parsed_ids_json['computers']
# New list to store just the computer ID's obtained in Loop below
computer_ids = []
# Count variable, when equal to max # of computers in Count_stop it stops.
count = 0
# This for loop iterates over ID + name in computer_set and returns the ID
# to the list computer_ids
for computers in computer_set:
count += 1
computer_ids.append(computers['id'])
# This IF condition allows for the script to be tested at 25 assets
# instead of all 2,000+ (comment out other announce_all_assets call)
if count == Count_Stop:
announce_all_assets(computer_ids, count)
# announce_all_assets(computer_ids, count)
def announce_all_assets(computer_ids, count):
print('Final list of ID\'s for review: ' + str(computer_ids))
print('Total number of computers to check against JSS: ' +
str(count))
extension_attribute_request(computer_ids, count)
def extension_attribute_request(computer_ids, count):
# Creating new variable, first half of new URL used in loop to get
# extension attributes using the computer ID's in computers_ids
base_url = 'jss'
what_we_want = '/subset/extensionattributes'
creds = get_creds()
print('Extension attribute function starts now:')
for ids in computer_ids:
request_url = base_url + str(ids) + what_we_want
request = urllib2.Request(request_url)
request.add_header('Authorization', 'Basic ' + base64.b64encode(creds))
response = urllib2.urlopen(request).read()
parsed_ext_json = json.loads(response)
ext_att_json = parsed_ext_json['computer']['extension_attributes']
retrieve_all_ext(ext_att_json)
def retrieve_all_ext(ext_att_json):
new_computer = {}
# new_computer['original_id'] = ids['id']
# new_computer['original_name'] = ids['name']
for computer in ext_att_json:
new_computer[str(computer['name'])] = computer['value']
add_to_master_list(new_computer)
def add_to_master_list(new_computer):
final_output_list.append(new_computer)
print(final_output_list)
def main():
# Function to run the get all assets function
get_all_assets()
if __name__ == '__main__':
# Function to run the functions in order: main > get all assets >
main()
I'd highly recommend using the 'requests' module over 'urllib2'. It handles a lot of stuff for you and will save you many a headache.
I believe it will also give you better performance, but I'd love to hear your feedback.
Here's your code using requests. (I've added newlines to highlight my changes. Note the built-in .json() decoder.):
# Requires requests module be installed.:
# `pip install requests` or `pip3 install requests`
# https://pypi.python.org/pypi/requests/
import requests
import base64
import json
import csv
# Count Number so that process only runs for 25 assets at a time will be
# replaced with a variable that is determined by the number of computers added
# to the list
Count_Stop = 25
final_output_list = []
def get_creds():
# Credentials Function that retrieves username:pw from .file
with open('.cred') as cred_file:
cred_string = cred_file.read().rstrip()
return cred_string
print(cred_string)
def get_all_assets():
# Function to retrieve computer ID + computer names and store the ID in a
# new list called computers_parsed
base_url = 'jss'
what_we_want = 'JSSResource/computers'
request_url = base_url + what_we_want
# NOTE the request_url is constructed based on your request assignment just below.
# As such, it is malformed as a URL, and I assume anonymized for your posting on SO.
# request = urllib2.Request('jss'
# 'JSSResource/computers')
#
creds = get_creds()
headers={
'Authorization': 'Basic ' + base64.b64encode(creds),
}
    response = requests.get( request_url, headers=headers )
parsed_ids_json = response.json()
    #[NO NEED FOR THE FOLLOWING. 'requests' HANDLES DECODING JSON. SEE ABOVE ASSIGNMENT.]
# At this point the request for ID + name has been retrieved and now to be
# formatted in json
# parsed_ids_json = json.loads(response)
# Then assign the parsed list (which has nested lists) at key 'computers'
# to a new list variable called computer_set
computer_set = parsed_ids_json['computers']
# New list to store just the computer ID's obtained in Loop below
computer_ids = []
# Count variable, when equal to max # of computers in Count_stop it stops.
count = 0
# This for loop iterates over ID + name in computer_set and returns the ID
# to the list computer_ids
for computers in computer_set:
count += 1
computer_ids.append(computers['id'])
# This IF condition allows for the script to be tested at 25 assets
# instead of all 2,000+ (comment out other announce_all_assets call)
if count == Count_Stop:
announce_all_assets(computer_ids, count)
# announce_all_assets(computer_ids, count)
def announce_all_assets(computer_ids, count):
print('Final list of ID\'s for review: ' + str(computer_ids))
print('Total number of computers to check against JSS: ' +
str(count))
extension_attribute_request(computer_ids, count)
def extension_attribute_request(computer_ids, count):
# Creating new variable, first half of new URL used in loop to get
# extension attributes using the computer ID's in computers_ids
base_url = 'jss'
what_we_want = '/subset/extensionattributes'
creds = get_creds()
print('Extension attribute function starts now:')
for ids in computer_ids:
request_url = base_url + str(ids) + what_we_want
headers={
'Authorization': 'Basic ' + base64.b64encode(creds),
}
        response = requests.get( request_url, headers=headers )
parsed_ext_json = response.json()
ext_att_json = parsed_ext_json['computer']['extension_attributes']
retrieve_all_ext(ext_att_json)
def retrieve_all_ext(ext_att_json):
new_computer = {}
# new_computer['original_id'] = ids['id']
# new_computer['original_name'] = ids['name']
for computer in ext_att_json:
new_computer[str(computer['name'])] = computer['value']
add_to_master_list(new_computer)
def add_to_master_list(new_computer):
final_output_list.append(new_computer)
print(final_output_list)
def main():
# Function to run the get all assets function
get_all_assets()
if __name__ == '__main__':
# Function to run the functions in order: main > get all assets >
main()
Please do let me know the relative performance time with your 25 assets in 18.89 seconds! I'm very curious.
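One more thought on the original timing question (an addition on my part, not from the answer above): most of those 18.89 seconds is per-request HTTPS latency, so beyond switching libraries, reusing a single connection and issuing the per-computer lookups concurrently usually gives the biggest win. A rough sketch with requests, reusing the anonymized base_url, creds and computer_ids names from the code above (concurrent.futures is built in on Python 3 and available on Python 2 as the 'futures' backport):
# Sketch only: base_url, creds and computer_ids mirror the anonymized code above.
import base64
from concurrent.futures import ThreadPoolExecutor
import requests

session = requests.Session()   # one Session reuses the underlying TLS connection
session.headers['Authorization'] = 'Basic ' + base64.b64encode(creds)

def fetch_extension_attributes(computer_id):
    url = base_url + str(computer_id) + '/subset/extensionattributes'
    return session.get(url).json()['computer']['extension_attributes']

with ThreadPoolExecutor(max_workers=10) as pool:   # ~10 requests in flight at a time
    results = list(pool.map(fetch_extension_attributes, computer_ids))
Whether ten workers is safe depends on what the server tolerates, so treat that number as a knob rather than a recommendation.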
I'd still recommend my other answer (the requests-based one) from a pure cleanliness perspective (requests is very clean to work with), but I recognize it may or may not address your original question.
If you want to try PyCurl, which likely will impact your original question, here's the same code implemented with that approach:
# Requires pycurl module be installed.:
# `pip install pycurl` or `pip3 install pycurl`
# https://pypi.python.org/pypi/pycurl/7.43.0
# NOTE: The syntax used herein for pycurl is python 3 compliant.
# Not python 2 compliant.
import pycurl
from io import BytesIO
import base64
import json
import csv
def pycurl_data( url, headers ):
buffer = BytesIO()
connection = pycurl.Curl()
connection.setopt( connection.URL, url )
connection.setopt(pycurl.HTTPHEADER, headers )
connection.setopt( connection.WRITEDATA, buffer )
connection.perform()
connection.close()
body = buffer.getvalue()
# NOTE: The following assumes a byte string and a utf8 format. Change as desired.
return json.loads( body.decode('utf8') )
# Count Number so that process only runs for 25 assets at a time will be
# replaced with a variable that is determined by the number of computers added
# to the list
Count_Stop = 25
final_output_list = []
def get_creds():
# Credentials Function that retrieves username:pw from .file
with open('.cred') as cred_file:
cred_string = cred_file.read().rstrip()
return cred_string
print(cred_string)
def get_all_assets():
# Function to retrieve computer ID + computer names and store the ID in a
# new list called computers_parsed
base_url = 'jss'
what_we_want = 'JSSResource/computers'
request_url = base_url + what_we_want
# NOTE the request_url is constructed based on your request assignment just below.
# As such, it is malformed as a URL, and I assume anonymized for your posting on SO.
# request = urllib2.Request('jss'
# 'JSSResource/computers')
#
creds = get_creds()
    headers = [ 'Authorization: Basic ' + base64.b64encode(creds.encode()).decode() ]
    response = pycurl_data( request_url, headers )
# At this point the request for ID + name has been retrieved and now to be
# formatted in json
    parsed_ids_json = response  # pycurl_data() already returns the decoded JSON
# Then assign the parsed list (which has nested lists) at key 'computers'
# to a new list variable called computer_set
computer_set = parsed_ids_json['computers']
# New list to store just the computer ID's obtained in Loop below
computer_ids = []
# Count variable, when equal to max # of computers in Count_stop it stops.
count = 0
# This for loop iterates over ID + name in computer_set and returns the ID
# to the list computer_ids
for computers in computer_set:
count += 1
computer_ids.append(computers['id'])
# This IF condition allows for the script to be tested at 25 assets
# instead of all 2,000+ (comment out other announce_all_assets call)
if count == Count_Stop:
announce_all_assets(computer_ids, count)
# announce_all_assets(computer_ids, count)
def announce_all_assets(computer_ids, count):
print('Final list of ID\'s for review: ' + str(computer_ids))
print('Total number of computers to check against JSS: ' +
str(count))
extension_attribute_request(computer_ids, count)
def extension_attribute_request(computer_ids, count):
# Creating new variable, first half of new URL used in loop to get
# extension attributes using the computer ID's in computers_ids
base_url = 'jss'
what_we_want = '/subset/extensionattributes'
creds = get_creds()
print('Extension attribute function starts now:')
for ids in computer_ids:
request_url = base_url + str(ids) + what_we_want
        headers = [ 'Authorization: Basic ' + base64.b64encode(creds.encode()).decode() ]
        response = pycurl_data( request_url, headers )
        parsed_ext_json = response  # pycurl_data() already returns the decoded JSON
ext_att_json = parsed_ext_json['computer']['extension_attributes']
retrieve_all_ext(ext_att_json)
def retrieve_all_ext(ext_att_json):
new_computer = {}
# new_computer['original_id'] = ids['id']
# new_computer['original_name'] = ids['name']
for computer in ext_att_json:
new_computer[str(computer['name'])] = computer['value']
add_to_master_list(new_computer)
def add_to_master_list(new_computer):
final_output_list.append(new_computer)
print(final_output_list)
def main():
# Function to run the get all assets function
get_all_assets()
if __name__ == '__main__':
# Function to run the functions in order: main > get all assets >
main()
I'm new here to StackOverflow, but I have found a LOT of answers on this site. I'm also a programming newbie, so I figured I'd join and finally become part of this community, starting with a question about a problem that's been plaguing me for hours.
I log in to a website and scrape a big body of text within the <b> tag, to be converted into a proper table. The layout of the resulting Output.txt looks like this:
BIN STATUS
8FHA9D8H 82HG9F RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
INVENTORY CODE: FPBC *SOUP CANS LENTILS
BIN STATUS
HA8DHW2H HD0138 RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU 00A123 #2956- INVALID STOCK COUPON CODE (MISSING).
93827548 096DBR RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
There are a bunch of pages with the exact same blocks, but i need them to be combined into an ACTUAL table that looks like this:
BIN INV CODE STATUS
HA8DHW2HHD0138 FPBC-*SOUP CANS LENTILS RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU00A123 FPBC-*SOUP CANS LENTILS #2956- INVALID STOCK COUPON CODE (MISSING).
93827548096DBR FPBC-*SOUP CANS LENTILS RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8FHA9D8H82HG9F SSXR-98-20LM NM CORN CREAM RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
Essentially, all separate text blocks in this example would become part of this table, with the inv code repeated alongside its BIN values. I would post my attempts at parsing this data (I have tried Pandas/bs/openpyxl/csv writer), but I'll admit they are a little embarrassing, as I cannot find any information on this specific problem. Is there any benevolent soul out there who can help me out? :)
(Also, I am using Python 2.7)
A simple custom parser like the following should do the trick.
from __future__ import print_function
def parse_body(s):
line_sep = '\n'
getting_bins = False
inv_code = ''
for l in s.split(line_sep):
if l.startswith('INVENTORY CODE:') and not getting_bins:
inv_data = l.split()
inv_code = inv_data[2] + '-' + ' '.join(inv_data[3:])
elif l.startswith('INVENTORY CODE:') and getting_bins:
print("unexpected inventory code while reading bins:", l)
        elif l.startswith('BIN') and l.endswith('STATUS'):  # header row in the sample reads "BIN STATUS"
getting_bins = True
elif getting_bins == True and l:
bin_data = l.split()
# need to add exception handling here to make sure:
# 1) we have an inv_code
# 2) bin_data is at least 3 items big (assuming two for
# bin_id and at least one for message)
# 3) maybe some constraint checking to ensure that we have
# a valid instance of an inventory code and bin id
bin_id = ''.join(bin_data[0:2])
message = ' '.join(bin_data[2:])
# we now have a bin, an inv_code, and a message to add to our table
print(bin_id.ljust(20), inv_code.ljust(30), message, sep='\t')
elif getting_bins == True and not l:
# done getting bins for current inventory code
getting_bins = False
inv_code = ''
A rather complex one, but this might get you started:
import re, pandas as pd
from pandas import DataFrame
rx = re.compile(r'''
(?:INVENTORY\ CODE:)\s*
(?P<inv>.+\S)
[\s\S]+?
^BIN.+[\n\r]
(?P<bin_msg>(?:(?!^\ ).+[\n\r])+)
''', re.MULTILINE | re.VERBOSE)
string = your_string_here
# set up the dataframe
df = DataFrame(columns = ['BIN', 'INV', 'MESSAGE'])
for match in rx.finditer(string):
inv = match.group('inv')
bin_msg_raw = match.group('bin_msg').split("\n")
rxbinmsg = re.compile(r'^(?P<bin>(?:(?!\ {2}).)+)\s+(?P<message>.+\S)\s*$', re.MULTILINE)
for item in bin_msg_raw:
for m in rxbinmsg.finditer(item):
# append it to the dataframe
df.loc[len(df.index)] = [m.group('bin'), inv, m.group('message')]
print(df)
Explanation
It looks for INVENTORY CODE and sets up the groups (inv and bin_msg) for further processing afterwards (note: it would be easier if you had only one line of bin/msg, as you need to split the group afterwards).
Afterwards, it splits the bin and msg part and appends all to the df object.
I had code written for a website-scraping task which may help you.
Basically what you need to do is right-click on the web page, go to the HTML (inspect the page source), find the tag for the table you are looking for, and extract the information using a parsing module (I am using BeautifulSoup). I am creating JSON because I need to store it in MongoDB; you can create a table instead.
#! /usr/bin/python
import sys
import requests
import re
from BeautifulSoup import BeautifulSoup
import pymongo
def req_and_parsing():
url2 = 'http://businfo.dimts.in/businfo/Bus_info/EtaByRoute.aspx?ID='
list1 = ['534UP','534DOWN']
for Route in list1:
final_url = url2 + Route
#r = requests.get(final_url)
#parsing_file(r.text,Route)
outdict = []
outdict = [parsing_file( requests.get(url2+Route).text,Route) for Route in list1 ]
print outdict
conn = f_connection()
for i in range(len(outdict)):
insert_records(conn,outdict[i])
def parsing_file(txt,Route):
soup = BeautifulSoup(txt)
table = soup.findAll("table",{"id" : "ctl00_ContentPlaceHolder1_GridView2"})
#trtags = table[0].findAll('tr')
tdlist = []
trtddict = {}
"""
for trtag in trtags:
print 'print trtag- ' , trtag.text
tdtags = trtag.findAll('td')
for tdtag in tdtags:
print tdtag.text
"""
divtags = soup.findAll("span",{"id":"ctl00_ContentPlaceHolder1_ErrorLabel"})
    for divtag in divtags:
        print "div tag - ", divtag.text
        if divtag.text in ("Currently no bus is running on this route",
                           "This is not a cluster (orange bus) route"):
            print "Page not displayed. Errored with below message for Route-", Route, " , ", divtag.text
            sys.exit()
trtags = table[0].findAll('tr')
for trtag in trtags:
tdtags = trtag.findAll('td')
if len(tdtags) == 2:
trtddict[tdtags[0].text] = sub_colon(tdtags[1].text)
return trtddict
def sub_colon(tag_str):
return re.sub(';',',',tag_str)
def f_connection():
try:
conn=pymongo.MongoClient()
print "Connected successfully!!!"
except pymongo.errors.ConnectionFailure, e:
print "Could not connect to MongoDB: %s" % e
return conn
def insert_records(conn,stop_dict):
db = conn.test
print db.collection_names()
mycoll = db.stopsETA
mycoll.insert(stop_dict)
if __name__ == "__main__":
req_and_parsing()
I have a question with using python and beautifulsoup.
My end result program basically fills out a form on a website and brings me back the results which I will eventually output to an lxml file. I'll be taking the results from https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS and I want to get a list for every city all into some excel documents.
Here is my code, I put it on pastebin:
http://pastebin.com/bZJfMp2N
MY RESULTS ARE ALMOST GOOD :D except that for my "correct value" I'm now getting the number wrapped in its <div> element instead of just 355, for example. I want to parse that and only show the number; you will see it when you run this in Python.
However, anything I have tried does NOT work. There is no way I can parse that values_2 variable, because the result is a bs4.element.ResultSet when I think I need to parse a string. Sorry if I am a noob, I am still learning and have worked very long on this program.
Would anyone have any input? Anything would be appreciated! I've read that my results are in a list or something and I can't parse lists? How would I go about doing this?
Here is the code:
__author__ = 'kennytruong'
#THE PROBLEM HERE IS TO PARSE THE RESULTS PROPERLY!!
import urllib.parse, urllib.request
import re
from bs4 import BeautifulSoup
URL = "https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS"
#Goes through these locations, strips the whitespace in the string and creates a list that starts at every new line
LOCATIONS = '''
ALAMEDA ALAMEDA
'''.strip().split('\n') #strip() basically removes whitespaces
print('Available locations to choose from:', LOCATIONS)
INSURANCE_TYPES = '''
HOMEOWNERS,CONDOMINIUM,MOBILEHOME,RENTERS,EARTHQUAKE - Single Family,EARTHQUAKE - Condominium,EARTHQUAKE - Mobilehome,EARTHQUAKE - Renters
'''.strip().split(',') #strips the whitespaces and starts a newline of the list every comma
print('Available insurance types to choose from:', INSURANCE_TYPES)
COVERAGE_AMOUNTS = '''
15000,25000,35000,50000,75000,100000,150000,200000,250000,300000,400000,500000,750000
'''.strip().split(',')
print('All options for coverage amounts:', COVERAGE_AMOUNTS)
HOME_AGE = '''
New,1-3 Years,4-6 Years,7-15 Years,16-25 Years,26-40 Years,41-70 Years
'''.strip().split(',')
print('All Home Age Options:', HOME_AGE)
def get_premiums(location, coverage_type, coverage_amt, home_age):
formEntries = {'location':location,
'coverageType':coverage_type,
'coverageAmount':coverage_amt,
'homeAge':home_age}
inputData = urllib.parse.urlencode(formEntries)
inputData = inputData.encode('utf-8')
request = urllib.request.Request(URL, inputData)
response = urllib.request.urlopen(request)
responseData = response.read()
soup = BeautifulSoup(responseData, "html.parser")
parseResults = soup.find_all('tr', {'valign':'top'})
for eachthing in parseResults:
parse_me = eachthing.text
name = re.findall(r'[A-z].+', parse_me) #find me all the words that start with a cap, as many and it doesn't matter what kind.
# the . for any character and + to signify 1 or more of it.
values = re.findall(r'\d{1,10}', parse_me) #find me any digits, however many #'s long as long as btwn 1 and 10
values_2 = eachthing.find_all('div', {'align':'right'})
print('raw code for this part:\n' ,eachthing, '\n')
print('here is the name: ', name[0], values)
print('stuff on sheet 1- company name:', name[0], '- Premium Price:', values[0], '- Deductible', values[1])
print('but here is the correct values - ', values_2) #NEEDA STRIP THESE VALUES
# print(type(values_2)) DOING SO GIVES ME <class 'bs4.element.ResultSet'>, NEEDA PARSE bs4.element type
# values_3 = re.split(r'\d', values_2)
# print(values_3) ANYTHING LIKE THIS WILL NOT WORK BECAUSE I BELIEVE RESULTS ARENT STRING
print('\n\n')
def main():
for location in LOCATIONS: #seems to be looping the variable location in LOCATIONS - each location is one area
print('Here are the options that you selected: ', location, "HOMEOWNERS", "150000", "New", '\n\n')
get_premiums(location, "HOMEOWNERS", "150000", "New") #calls function get_premiums and passes parameters
if __name__ == "__main__": #this basically prevents all the indent level 0 code from getting executed, because otherwise the indent level 0 code gets executed regardless upon opening
main()
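A side note on the values_2 question above (an addition, not from the original post): find_all() returns a ResultSet, which behaves like a list of Tag objects, so the bare numbers come out by calling get_text() on each element rather than trying to regex the ResultSet itself. A minimal sketch using the names from the code above:
# Sketch only: values_2 comes from eachthing.find_all('div', {'align': 'right'}) above.
clean_values = [div.get_text(strip=True) for div in values_2]
print('parsed numbers:', clean_values)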
I'm trying to get some results from UniProt, which is a protein database (details are not important). I'm trying to use some script that translates from one kind of ID to another. I was able to do this manually on the browser, but could not do it in Python.
In http://www.uniprot.org/faq/28 there are some sample scripts. I tried the Perl one and it seems to work, so the problem is my Python attempts. The (working) script is:
## tool_example.pl ##
use strict;
use warnings;
use LWP::UserAgent;
my $base = 'http://www.uniprot.org';
my $tool = 'mapping';
my $params = {
from => 'ACC', to => 'P_REFSEQ_AC', format => 'tab',
query => 'P13368 P20806 Q9UM73 P97793 Q17192'
};
my $agent = LWP::UserAgent->new;
push @{$agent->requests_redirectable}, 'POST';
print STDERR "Submitting...\n";
my $response = $agent->post("$base/$tool/", $params);
while (my $wait = $response->header('Retry-After')) {
print STDERR "Waiting ($wait)...\n";
sleep $wait;
print STDERR "Checking...\n";
$response = $agent->get($response->base);
}
$response->is_success ?
print $response->content :
die 'Failed, got ' . $response->status_line .
' for ' . $response->request->uri . "\n";
My questions are:
1) How would you do that in Python?
2) Will I be able to massively "scale" that (i.e., use a lot of entries in the query field)?
Question 1:
This can be done using python's urllibs:
import urllib, urllib2
import time
import sys
query = ' '.join(sys.argv[1:])  # skip the script name itself
# encode params as a list of 2-tuples
params = ( ('from','ACC'), ('to', 'P_REFSEQ_AC'), ('format','tab'), ('query', query))
# url encode them
data = urllib.urlencode(params)
url = 'http://www.uniprot.org/mapping/'
# fetch the data
try:
foo = urllib2.urlopen(url, data)
except urllib2.HTTPError, e:
if e.code == 503:
# blah blah get the value of the header...
wait_time = int(e.hdrs.get('Retry-after', 0))
print 'Sleeping %i seconds...' % (wait_time,)
time.sleep(wait_time)
foo = urllib2.urlopen(url, data)
# foo is a file-like object, do with it what you will.
foo.read()
You're probably better off using the Protein Identifier Cross Reference service from the EBI to convert one set of IDs to another. It has a very good REST interface.
http://www.ebi.ac.uk/Tools/picr/
I should also mention that UniProt has very good web services available. Though if you are tied to using simple HTTP requests for some reason, then it's probably not useful.
Let's assume that you are using Python 2.5.
We can use httplib to directly call the web site:
import httplib, urllib
querystring = {}
#Build the query string here from the following keys (query, format, columns, compress, limit, offset)
querystring["query"] = ""
querystring["format"] = "" # one of html | tab | fasta | gff | txt | xml | rdf | rss | list
querystring["columns"] = "" # the columns you want comma seperated
querystring["compress"] = "" # yes or no
## These may be optional
querystring["limit"] = "" # I guess if you only want a few rows
querystring["offset"] = "" # bring on paging
##From the examples - query=organism:9606+AND+antigen&format=xml&compress=no
##Delete the following and replace with your query
querystring = {}
querystring["query"] = "organism:9606 AND antigen"
querystring["format"] = "xml" #make it human readable
querystring["compress"] = "no" #I don't want to have to unzip
conn = httplib.HTTPConnection("www.uniprot.org")
conn.request("GET", "/uniprot/?"+ urllib.urlencode(querystring))
r1 = conn.getresponse()
if r1.status == 200:
data1 = r1.read()
print data1 #or do something with it
You could then make a function around creating the query string and you should be away.
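For example (an addition, not part of the answer above), the query-string construction could be wrapped in a small helper along these lines, using the same keys as above:
# Sketch only: same keys as the querystring dict above; limit/offset can go in optional.
def build_uniprot_query(query, format="tab", columns="", compress="no", **optional):
    params = {"query": query, "format": format, "columns": columns, "compress": compress}
    params.update(optional)
    return urllib.urlencode(params)

conn = httplib.HTTPConnection("www.uniprot.org")
conn.request("GET", "/uniprot/?" + build_uniprot_query("organism:9606 AND antigen", format="xml"))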
Check out bioservices; it interfaces with a lot of databases through Python.
https://pythonhosted.org/bioservices/_modules/bioservices/uniprot.html
conda install bioservices --yes
In complement to O.rka's answer:
Question 1:
from bioservices import UniProt
u = UniProt()
res = u.get_df("P13368 P20806 Q9UM73 P97793 Q17192".split())
This returns a dataframe with all information about each entry.
Question 2: same answer. This should scale up.
Disclaimer: I'm the author of bioservices
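On the scaling point, if the ID list gets long, one pragmatic approach (my assumption, not something claimed by bioservices itself) is to send it in chunks and concatenate the resulting dataframes:
# Sketch only: the chunk size of 100 is arbitrary; ids would be your full list of accessions.
import pandas as pd
from bioservices import UniProt

u = UniProt()
ids = "P13368 P20806 Q9UM73 P97793 Q17192".split()
chunks = [ids[i:i + 100] for i in range(0, len(ids), 100)]
res = pd.concat([u.get_df(chunk) for chunk in chunks], ignore_index=True)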
There is a Python package on PyPI which does exactly what you want:
pip install uniprot-mapper