.split() method for parsing URL paths in Python

How can I split urls like this (which are coming from a django object selection):
[<PathsOfDomain: www.somesite.com/>, <PathsOfDomain: somesite.com/prof.php?pID=589>, <PathsOfDomain: www.somesite.com/some/path/here/paramid=6>, <PathsOfDomain: www.somesite.com/prof.php?pID=317>, <PathsOfDomain: www.somesite.com/prof.php?pID=523>]
I have code:
if self.path_object is not None:
    dictpath = {}
    for path in self.path_object:
        print path #debugging only
        self.params = path.pathToScan.split("?")[1].split("&")
        out = list(map(lambda v: v.split("=")[0] +"=" + self.fuzz_vectors, self.params))
        dictpath[path] = out
    print dictpath
I'm getting an error of:
self.params = path.pathToScan.split("?")[1].split("&")
IndexError: list index out of range
What am I doing wrong here?
Thank you!

self.params = path.split("?")[1].split("&")
should be
self.params = path.path.split("?")[1].split("&")
path is the PathsOfDomain object, but you need path.path, which is the actual string containing the path.
You should also look at the urlparse module, which contains code to help with parsing URLs. You can use it to simplify your code here.
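Note also that split("?") returns a one-element list for a path with no query string (such as www.somesite.com/ in the list above), so indexing [1] on it raises IndexError regardless of which attribute is used. A minimal sketch of the urlparse-based approach follows, written for Python 3, where the module lives at urllib.parse (in Python 2 the same functions are in urlparse). The function name fuzz_params and the "FUZZ" placeholder are assumptions made for illustration, not part of the original code.

```python
from urllib.parse import urlsplit, parse_qsl


def fuzz_params(url, fuzz_vector):
    """Return 'name=<fuzz_vector>' for each query parameter in url.

    Hypothetical helper: URLs without a query string yield an empty list
    instead of raising IndexError.
    """
    query = urlsplit(url).query
    # keep_blank_values=True keeps parameters such as 'a=' that have no value
    pairs = parse_qsl(query, keep_blank_values=True)
    return ["%s=%s" % (name, fuzz_vector) for name, _ in pairs]


print(fuzz_params("www.somesite.com/prof.php?pID=589", "FUZZ"))
print(fuzz_params("www.somesite.com/", "FUZZ"))
```

The first call prints a list with one entry for pID, and the second prints an empty list, so the caller can decide what to do with paths that have nothing to fuzz.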

Related

unable to mock all the private methods using python unittest

I have a core class where I am trying to read the zip file, unzip, get a specific file, and get the contents of that file. This works fine but now I am trying to mock all the things I used along the way.
class ZipService:
    def __init__(self, path: str):
        self.path = path
    def get_manifest_json_object(self):
        s3 = boto3.resource('s3')
        bucket_name, key = self.__get_bucket_and_key()
        bucket = s3.Bucket(bucket_name)
        zip_object_reference = bucket.Object(key).get()["Body"]
        zip_object_bytes_stream = self.__get_zip_object_bytes_stream(zip_object_reference)
        zipf = zipfile.ZipFile(zip_object_bytes_stream, mode='r')
        return self.__get_manifest_json(zipf)
    def __get_bucket_and_key(self):
        pattern = "https:\/\/(.?[^\.]*)\.(.?[^\/]*)\/(.*)" # this regex is working but don't know how :D
        result = re.split(pattern, self.path)
        return result[1], result[3]
    def __get_zip_object_bytes_stream(self, zip_object_reference):
        return io.BytesIO(zip_object_reference.read())
    def __get_manifest_json(self, zipf):
        manifest_json_text = [zipf.read(name) for name in zipf.namelist() if "/manifest.json" in name][0].decode("utf-8")
        return json.loads(manifest_json_text)
For this I have written a test case that throws an error:
@patch('boto3.resource')
class TestZipService(TestCase):
    def test_zip_service(self, mock):
        s3 = boto3.resource('s3')
        bucket = s3.Bucket("abc")
        bucket.Object.get.return_value = "some-value"
        zipfile.ZipFile.return_value = "/some-path"
        inst = ZipService("/some-path")
        with mock.patch.object(inst, "_ZipService__get_manifest_json", return_value={"a": "b"}) as some_object:
            expected = {"a": "b"}
            actual = inst.get_manifest_json_object()
            self.assertIsInstance(expected, actual)
Error:
bucket_name, key = self.__get_bucket_and_key()
File "/Us.tox/py38/lib/python3.8/site-packages/services/zip_service.py", line 29, in __get_bucket_and_key
return result[1], result[3]
IndexError: list index out of range
What exactly is wrong here? Any hints would also be appreciated. TIA
You are giving your ZipService a path of "/some-path".
Then you test its get_manifest_json_object method, whose 2nd statement calls __get_bucket_and_key.
You are not mocking __get_bucket_and_key, so when it's called it tries to process that input path with a regex split, which won't give you a collection with 4 items that it needs to return result[1], result[3].
Hence, IndexError: list index out of range.
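The behaviour of the regex split can be checked in isolation. The sketch below reuses the pattern from the question; the example URL with my-bucket is a made-up value chosen only to have the https://host/key shape the pattern expects.

```python
import re

# Pattern copied from the question's __get_bucket_and_key
PATTERN = r"https:\/\/(.?[^\.]*)\.(.?[^\/]*)\/(.*)"

# No match: re.split returns the input unchanged as a single-item list,
# so result[1] and result[3] do not exist.
no_match = re.split(PATTERN, "/some-path")

# Match: the list holds the text before the match, the three captured
# groups, and the text after the match (five items in total).
matched = re.split(PATTERN, "https://my-bucket.s3.amazonaws.com/folder/file.zip")

print(no_match)
print(matched[1], matched[3])
```

With the second input, index 1 holds the first host label and index 3 holds everything after the host, which is what the method returns as bucket name and key.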
Either give your ZipService a proper path you'd expect, or mock all private methods used in get_manifest_json_object.
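A rough sketch of the second option, mocking every private method that the public method calls, is shown below. Service is a hypothetical stand-in for ZipService so that the example runs without boto3; the important detail is that a method named __x inside class Service is stored as _Service__x, and that mangled name is what patch.object needs.

```python
import unittest
from unittest.mock import patch


class Service:
    """Hypothetical stand-in for ZipService; it never touches S3."""

    def __init__(self, path):
        self.path = path

    def get_value(self):
        bucket, key = self.__get_bucket_and_key()
        return self.__load(bucket, key)

    def __get_bucket_and_key(self):
        # Mimics the failure seen in the question for an unexpected path
        raise IndexError("list index out of range")

    def __load(self, bucket, key):
        raise RuntimeError("would need network access")


class TestService(unittest.TestCase):
    def test_get_value(self):
        inst = Service("/some-path")
        # Patch both private methods through their name-mangled attributes
        with patch.object(inst, "_Service__get_bucket_and_key",
                          return_value=("bucket", "key")), \
             patch.object(inst, "_Service__load",
                          return_value={"a": "b"}) as load:
            self.assertEqual(inst.get_value(), {"a": "b"})
        load.assert_called_once_with("bucket", "key")
```

Because neither real private method runs, the path passed to the constructor no longer matters for this test.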

Python add to url

I have a URL as follows:
http://www.example.com/boards/results/current:entry1,current:entry2/modular/table/alltables/alltables/alltables/2011-01-01
I need to insert a node 'us' in this case, as follows:
http://www.example.com/boards/results/us/current:entry1,current:entry2/modular/table/alltables/alltables/alltables/2011-01-01
Using Python's urlparse library, I can get to the path as follows:
path = urlparse(url).path
... and then using a complicated and ugly routine involving splitting the path based on slashes and inserting the new node and then reconstructing the URL
>>> path = urlparse(url).path
>>> path.split('/')
['', 'boards', 'results', 'current:entry1,current:entry2', 'modular', 'table', 'alltables', 'alltables', 'alltables', '2011-01-01']
>>> ps = path.split('/')
>>> ps.insert(4, 'us')
>>> '/'.join(ps)
'/boards/results/current:entry1,current:entry2/us/modular/table/alltables/alltables/alltables/2011-01-01'
>>>
Is there a more elegant/pythonic way to accomplish this using default libraries?
EDIT:
The 'results' in the URL is not fixed - it can be 'results' or 'products' or 'prices' and so on. However, it will always be right after 'boards'.
path = "http://www.example.com/boards/results/current:entry1,current:entry2/modular/table/alltables/alltables/alltables/2011-01-01"
replace_start_word = 'results'
replace_word_length = len(replace_start_word)
replace_index = path.find(replace_start_word)
new_url = '%s/us%s' % (path[:replace_index + replace_word_length], path[replace_index + replace_word_length:])
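The snippet above hardcodes 'results', while the EDIT in the question says that segment varies but always follows 'boards'. A possible sketch that relies only on the position relative to 'boards' is given below, using urllib.parse from Python 3 (urlparse in Python 2). The function name insert_after_boards is an assumption made for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def insert_after_boards(url, node):
    """Insert node right after the path segment that follows 'boards'."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # 'boards' is followed by one variable segment (results, products, ...);
    # the new node goes immediately after that segment.
    position = segments.index("boards") + 2
    segments.insert(position, node)
    return urlunsplit(parts._replace(path="/".join(segments)))


url = ("http://www.example.com/boards/results/current:entry1,current:entry2"
       "/modular/table/alltables/alltables/alltables/2011-01-01")
print(insert_after_boards(url, "us"))
```

If 'boards' is missing from the path, list.index raises ValueError, which the caller would need to handle.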

urllib.urlopen() on variables that are assigned a string values wont work?

I'm writing some code to parse an XML file. I'm just wondering if someone could explain why this isn't working. If I pass the variable link to urllib.urlopen(), it does not seem to reach that URL. However, when I put "http://gdata.youtube.com/feeds/api/standardfeeds/top_rated?max-results=50&time=today" directly inside urllib.urlopen(), it works. Does it need to be a string literal and not a variable, or is there a way around it?
import urllib
from bs4 import BeautifulSoup
class Uel(object):
    def __init__(self, link):
        self.content_data = []
        self.num_likes = []
        self.num_dislikes = []
        self.favoritecount = []
        self.view_count = []
        self.link = link
        self.web_obj = urllib.urlopen(link)
        self.file = open('youtubequery.txt', 'w+')
        self.file.write(str(self.web_obj))
        for i in self.web_obj:
            self.file.write(i)
        with open("youtubequery.txt", "r") as myfile:
            self.file_2=myfile.read()
        self.soup = BeautifulSoup(self.file_2)
        for link in self.soup.find_all("content"):
            self.content_data.append(str(link.get("src")))
        for stat in self.soup.find_all("yt:statistics"):
            self.favoritecount.append(str(stat.get("favoritecount")))
        for views in self.soup.find_all("yt:statistics"):
            self.view_count.append(str(views.get("viewcount")))
        for numlikes in self.soup.find_all("yt:rating"):
            self.num_likes.append(str(numlikes.get("numlikes")))
        for numdislikes in self.soup.find_all("yt:rating"):
            self.num_dislikes.append(str(numdislikes.get("numdislikes")))
    def __str__(self):
        return str(self.content_data),str(self.num_likes), str(self.num_dislikes)
link = "http://gdata.youtube.com/feeds/api/standardfeeds/top_rated?max-results=50&time=5"
data = Uel(link)
print data.__str__()
In the code you've presented, you are using this URL:
http://gdata.youtube.com/feeds/api/standardfeeds/top_rated?max-results=50&time=5
a request to which produces:
Invalid value for time parameter: 5
But in the question itself, you've mentioned the following URL:
http://gdata.youtube.com/feeds/api/standardfeeds/top_rated?max-results=50&time=today
which has time=today. The code with this URL works for me.

Creating loop for __main__

I am new to Python, and I want your advice on something.
I have a script that runs one input value at a time, and I want it to be able to run a whole list of such values without me typing the values one at a time. I have a hunch that a "for loop" is needed for the main method listed below. The value is "gene_name", so effectively, I want to feed in a list of "gene_names" that the script can run through.
Hope I phrased the question correctly, thanks! The chunk in question seems to be
def get_probes_from_genes(gene_names)
import json
import urllib2
import os
import pandas as pd
api_url = "http://api.brain-map.org/api/v2/data/query.json"
def get_probes_from_genes(gene_names):
    if not isinstance(gene_names,list):
        gene_names = [gene_names]
    #in case there are white spaces in gene names
    gene_names = ["'%s'"%gene_name for gene_name in gene_names]
    api_query = "?criteria=model::Probe"
    api_query += ",rma::criteria,[probe_type$eq'DNA']"
    api_query += ",products[abbreviation$eq'HumanMA']"
    api_query += ",gene[acronym$eq%s]"%(','.join(gene_names))
    api_query += ",rma::options[only$eq'probes.id','name']"
    data = json.load(urllib2.urlopen(api_url + api_query))
    d = {probe['id']: probe['name'] for probe in data['msg']}
    if not d:
        raise Exception("Could not find any probes for %s gene. Check " \
                        "http://help.brain-map.org/download/attachments/2818165/HBA_ISH_GeneList.pdf?version=1&modificationDate=1348783035873 " \
                        "for list of available genes."%gene_name)
    return d
def get_expression_values_from_probe_ids(probe_ids):
    if not isinstance(probe_ids,list):
        probe_ids = [probe_ids]
    #in case there are white spaces in gene names
    probe_ids = ["'%s'"%probe_id for probe_id in probe_ids]
    api_query = "?criteria=service::human_microarray_expression[probes$in%s]"%(','.join(probe_ids))
    data = json.load(urllib2.urlopen(api_url + api_query))
    expression_values = [[float(expression_value) for expression_value in data["msg"]["probes"][i]["expression_level"]] for i in range(len(probe_ids))]
    well_ids = [sample["sample"]["well"] for sample in data["msg"]["samples"]]
    donor_names = [sample["donor"]["name"] for sample in data["msg"]["samples"]]
    well_coordinates = [sample["sample"]["mri"] for sample in data["msg"]["samples"]]
    return expression_values, well_ids, well_coordinates, donor_names
def get_mni_coordinates_from_wells(well_ids):
    package_directory = os.path.dirname(os.path.abspath(__file__))
    frame = pd.read_csv(os.path.join(package_directory, "data", "corrected_mni_coordinates.csv"), header=0, index_col=0)
    return list(frame.ix[well_ids].itertuples(index=False))
if __name__ == '__main__':
    probes_dict = get_probes_from_genes("SLC6A2")
    expression_values, well_ids, well_coordinates, donor_names = get_expression_values_from_probe_ids(probes_dict.keys())
    print get_mni_coordinates_from_wells(well_ids)
Whoa, first things first. Python ain't Java, so do yourself a favor and use a nice """xxx\nyyy""" string, with triple quotes, to span multiple lines.
api_query = """?criteria=model::Probe
,rma::criteria,[probe_type$eq'DNA']
...
"""
or something like that. You will get whitespace exactly as typed, including the newlines, so you may need to adjust.
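One way to sidestep the stray whitespace that a triple-quoted string would carry into the URL is to keep the pieces in a list and join them with commas. This is only a sketch of that alternative, not what the answer above proposes; the function name build_probe_query is made up for the example, and the query pieces are the ones from the question.

```python
def build_probe_query(gene_names):
    """Assemble the probe query string from its comma-separated parts."""
    # Quote each gene name, as the original code does
    quoted = ",".join("'%s'" % name for name in gene_names)
    parts = [
        "?criteria=model::Probe",
        "rma::criteria,[probe_type$eq'DNA']",
        "products[abbreviation$eq'HumanMA']",
        "gene[acronym$eq%s]" % quoted,
        "rma::options[only$eq'probes.id','name']",
    ]
    return ",".join(parts)


print(build_probe_query(["SLC6A2"]))
```

The result is a single line with no embedded newlines or spaces, so it can be appended to api_url as is.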
If, like suggested, you opt to loop on the call to your function through a file, you will need to either try/except your data-not-found exception or you will need to handle missing data without throwing an exception. I would opt for returning an empty result myself and letting the caller worry about what to do with it.
If you do opt for raise-ing an Exception, create your own, rather than using a generic exception. That way your code can catch your expected Exception first.
class MyNoDataFoundException(Exception):
    pass
#replace your current raise code with...
if not d:
    raise MyNoDataFoundException(your message here)
Clarification about catching exceptions, using the accepted answer as a starting point:
if __name__ == '__main__':
    with open(r"/tmp/genes.txt","r") as f:
        for line in f.readlines():
            #keep track of your input data
            search_data = line.strip()
            try:
                probes_dict = get_probes_from_genes(search_data)
            except MyNoDataFoundException as e:
                #and do whatever you feel you need to do here...
                print "bummer about search_data:%s:\nexception:%s" % (search_data, e)
                #skip this gene; probes_dict was not set for it
                continue
            expression_values, well_ids, well_coordinates, donor_names = get_expression_values_from_probe_ids(probes_dict.keys())
            print get_mni_coordinates_from_wells(well_ids)
You may want to create a file with Gene names, then read content of the file and call your function in the loop. Here is an example below
if __name__ == '__main__':
    with open(r"/tmp/genes.txt","r") as f:
        for line in f.readlines():
            probes_dict = get_probes_from_genes(line.strip())
            expression_values, well_ids, well_coordinates, donor_names = get_expression_values_from_probe_ids(probes_dict.keys())
            print get_mni_coordinates_from_wells(well_ids)
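Since the question asks about feeding in a list rather than a file, the same loop can be sketched over an in-memory list. The example below is written for Python 3 and is self-contained: NoProbesFound, run_for_genes, and fake_lookup are hypothetical names, and fake_lookup with its made-up probe data stands in for get_probes_from_genes so that nothing here calls the real API.

```python
class NoProbesFound(Exception):
    """Hypothetical custom exception for a gene that has no probes."""


def run_for_genes(gene_names, lookup):
    """Call lookup(name) for each gene; return (results, missing names)."""
    results = {}
    missing = []
    for name in gene_names:
        try:
            results[name] = lookup(name)
        except NoProbesFound:
            # Record the miss and carry on with the next gene
            missing.append(name)
    return results, missing


def fake_lookup(name):
    # Stand-in for get_probes_from_genes; the probe data is made up.
    known = {"SLC6A2": {1: "probe_a"}}
    if name not in known:
        raise NoProbesFound(name)
    return known[name]


results, missing = run_for_genes(["SLC6A2", "NOT_A_GENE"], fake_lookup)
print(results, missing)
```

Collecting the misses in a list lets the script finish the whole batch and report the failed names at the end.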

LXML Xpath does not seem to return full path

OK, I'll be the first to admit it is, just not the path I want, and I don't know how to get it.
I'm using Python 3.3 in Eclipse with the PyDev plugin, on both Windows 7 at work and Ubuntu 13.04 at home. I'm new to Python and have limited programming experience.
I'm trying to write a script to take in an XML Lloyds market insurance message, find all the tags and dump them in a .csv where we can easily update them and then reimport them to create an updated xml.
I have managed to do all of that except when I get all the tags it only gives the tag name and not the tags above it.
<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>
That is a fragment of the XML. What I want is to find all the tags and their paths. For example, for Count I want to show it as ItemsInGroupTotal/Count but can only get it as Count.
Here is my code:
xml = etree.parse(fullpath)
print( xml.xpath('.//*'))
all_xpath = xml.xpath('.//*')
every_tag = []
for i in all_xpath:
    single_tag = '%s,%s' % (i.tag, i.text)
    every_tag.append(single_tag)
print(every_tag)
This gives:
'{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupReference,8-2012-08-10', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupItemsTotal,\n', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}Count,13',
As you can see Count is shown as {namespace}Count, 13 and not {namespace}ItemsInGroupTotal/Count, 13
Can anyone point me towards what I need?
Thanks (hope my first post is OK)
Adam
EDIT:
This is my code now:
with open(fullpath, 'rb') as xmlFilepath:
    xmlfile = xmlFilepath.read()
fulltext = '%s' % xmlfile
text = fulltext[2:]
print(text)
xml = etree.fromstring(fulltext)
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
print(every_tag)
But this returns an error:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
I removed the first two chars as they are b' and it complained it didn't start with a tag.
Update:
I have been playing around with this and if I remove the xis: xxx tags and the namespace stuff at the top it works as expected. I need to keep the xis tags and be able to identify them as xis tags so can't just delete them.
Any help on how I can achieve this?
ElementTree objects have a method getpath(element), which returns a structural, absolute XPath expression to find that element.
Calling getpath on each element in an iter() loop should work for you:
from pprint import pprint
from lxml import etree
text = """
<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>
</TechAccount>
"""
xml = etree.fromstring(text)
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
prints:
['/TechAccount, \n',
'/TechAccount/UUId, 2EF40080-F618-4FF7-833C-A34EA6A57B73',
'/TechAccount/BrokerReference, HOY123/456',
'/TechAccount/ServiceProviderReference, 2012080921401A1',
'/TechAccount/CreationDate, 2012-08-10',
'/TechAccount/AccountTransactionType, premium',
'/TechAccount/GroupReference, 2012080921401A1',
'/TechAccount/ItemsInGroupTotal, \n',
'/TechAccount/ItemsInGroupTotal/Count, 1',
'/TechAccount/ServiceProviderGroupReference, 8-2012-08-10',
'/TechAccount/ServiceProviderGroupItemsTotal, \n',
'/TechAccount/ServiceProviderGroupItemsTotal/Count, 13']
UPD:
If your xml data is in the file test.xml, the code would look like:
from pprint import pprint
from lxml import etree
xml = etree.parse('test.xml').getroot()
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
Hope that helps.
getpath() does indeed return an xpath that's not suited for human consumption. From this xpath, you can build up a more useful one though. Such as with this quick-and-dirty approach:
def human_xpath(element):
    full_xpath = element.getroottree().getpath(element)
    xpath = ''
    human_xpath = ''
    for i, node in enumerate(full_xpath.split('/')[1:]):
        xpath += '/' + node
        element = element.xpath(xpath)[0]
        namespace, tag = element.tag[1:].split('}', 1)
        if element.getparent() is not None:
            nsmap = {'ns': namespace}
            same_name = element.getparent().xpath('./ns:' + tag,
                                                  namespaces=nsmap)
            if len(same_name) > 1:
                tag += '[{}]'.format(same_name.index(element) + 1)
        human_xpath += '/' + tag
    return human_xpath
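For comparison, readable paths can also be built with a plain recursive walk that strips the {namespace} prefix from each tag. The sketch below deliberately swaps lxml for the standard library's xml.etree.ElementTree so that it runs without extra packages; iter_paths and local_name are names invented for this example, and the sample document is a cut-down version of the fragment in the question with the namespace taken from its output. The sample is parsed from bytes, which is also how the "Unicode strings with encoding declaration" error in the question is normally avoided: pass the bytes read in 'rb' mode straight to the parser instead of formatting them into a str.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a leading '{namespace}' from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iter_paths(element, prefix=""):
    """Yield (path, text) for element and all of its descendants."""
    path = prefix + "/" + local_name(element.tag)
    yield path, (element.text or "").strip()
    for child in element:
        yield from iter_paths(child, path)


SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<TechAccount xmlns="http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1">
  <ItemsInGroupTotal><Count>1</Count></ItemsInGroupTotal>
</TechAccount>"""

root = ET.fromstring(SAMPLE)
paths = list(iter_paths(root))
for path, text in paths:
    print(path, text)
```

Unlike human_xpath above, this walk does not add [n] indexes for repeated sibling tags, so two siblings with the same name produce the same path.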
