I'd like to find a tool that does a good job of fuzzy matching URLs that are the same except for extra parameters. For instance, in my use case, these two URLs are the same:
atest = ('http://www.npr.org/templates/story/story.php?storyId=4231170', 'http://www.npr.org/templates/story/story.php?storyId=4231170&sc=fb&cc=fp')
At first blush, fuzz.partial_ratio and fuzz.token_set_ratio from fuzzywuzzy get the job done with a threshold of 100:
from fuzzywuzzy import fuzz

ratio = fuzz.ratio(atest[0], atest[1])
partialratio = fuzz.partial_ratio(atest[0], atest[1])
sortratio = fuzz.token_sort_ratio(atest[0], atest[1])
setratio = fuzz.token_set_ratio(atest[0], atest[1])
print('ratio: %s' % (ratio))
print('partialratio: %s' % (partialratio))
print('sortratio: %s' % (sortratio))
print('setratio: %s' % (setratio))
>>>ratio: 83
>>>partialratio: 100
>>>sortratio: 83
>>>setratio: 100
But this approach fails and returns 100 in other cases, like:
atest = ('yahoo.com', 'http://finance.yahoo.com/news/earnings-preview-monsanto-report-2q-174000816.html')
The URLs in my data and the parameters added vary a great deal. I'm interested to know if anyone has a better approach using URL parsing or similar?
If all you want is to check that all query parameters in the first URL are present in the second URL, you can do it in a simpler way by just taking the set difference:
import urllib.parse as urlparse
base_url = 'http://www.npr.org/templates/story/story.php?storyId=4231170'
check_url = 'http://www.npr.org/templates/story/story.php?storyId=4231170&sc=fb&cc=fp'
base_url_parameters = set(urlparse.parse_qs(urlparse.urlparse(base_url).query).keys())
check_url_parameters = set(urlparse.parse_qs(urlparse.urlparse(check_url).query).keys())
print(base_url_parameters - check_url_parameters)
This will print an empty set, but if you change the base URL to something like
base_url = 'http://www.npr.org/templates/story/story.php?storyId=4231170&test=1'
it will print {'test'}, which means that there are extra parameters in the base URL that are missing from the second URL.
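Building on that, the host/path comparison and the parameter check can be combined into one small helper. This is only a sketch (the function name same_page is mine, and it deliberately ignores the URL scheme and the parameter values; compare those too if your data calls for it):

import urllib.parse as urlparse

def same_page(base_url, check_url):
    """True if check_url is the same page as base_url, allowing extra query parameters."""
    base = urlparse.urlparse(base_url)
    check = urlparse.urlparse(check_url)
    # Host and path must match exactly; this is the part the fuzzy-matching
    # approach gets wrong for the yahoo.com example above.
    if (base.netloc, base.path) != (check.netloc, check.path):
        return False
    # Every query parameter of the base URL must appear in the other URL.
    base_params = set(urlparse.parse_qs(base.query))
    check_params = set(urlparse.parse_qs(check.query))
    return base_params <= check_params

With the two NPR URLs this returns True; with the yahoo.com pair it returns False, since the hosts differ.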
I am trying to get more than 20 restaurants from googlemaps.places_nearby(), so I am using the page_token as mentioned in a question here on Stack Overflow. Here's the code:
import time
import googlemaps

with open('apikey.txt') as f:
    api_key = f.readline().strip()

gmaps = googlemaps.Client(api_key)
locations = [(30.0416162646183, 31.187637297709912), (30.038828662868447, 31.21133524457125), (29.848956883337507, 31.334386571579085), (30.047845819479956, 31.262317130706496), (30.05312112490655, 31.24665544474578), (30.044482967408886, 31.23572953125353), (30.02023034028819, 31.176992671570066), (30.055592085960892, 31.18411299557052), (30.0512387364253, 31.20328697618034), (30.027741592767295, 31.174344307489818), (30.043337503059586, 31.17587613443309), (30.049286828183856, 31.181250916540794), (30.043423144171197, 31.187248209629644), (30.040934096091647, 31.183299998037857), (30.038296379882215, 31.189823130232988), (29.960107152991863, 31.250999388927262), (29.83911392727571, 31.30468827173587), (29.842752004034566, 31.332961535887694)]
search_string = "كشري"
distance = 10000
kosharies = []
for location in locations:
    response = gmaps.places_nearby(location=location, keyword=search_string, name='كشري', radius=distance)
    kosharies.extend(response.get('results'))
    next_page_token = response.get('next_page_token')
    while next_page_token:
        time.sleep(2)
        another_response = gmaps.places_nearby(location=location, keyword=search_string, name='كشري', radius=distance, page_token=next_page_token)
        kosharies.extend(another_response.get('results'))
        next_page_token = another_response.get('next_page_token')
I provided 18 different locations, and each request gives back at most 20 results.
I checked the 18 locations manually, and I know that each location has more than 20 restaurants in this category!
I tried a while loop over the page_tokens, but no luck: the shape of the dataset returned is (142, 18) (rows, columns).
I'd appreciate your help so much.
This seems to work just fine; please check your requests' parameters carefully.
First request:
location=30.0416162646183,31.187637297709912&radius=10000&keyword=كشري
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=30.0416162646183,31.187637297709912&radius=10000&keyword=%D9%83%D8%B4%D8%B1%D9%8A
Returns 20 results and a next_page_token. The second request uses only that token, passed as pagetoken (not page_token) in the raw HTTP API:
https://maps.googleapis.com/maps/api/place/nearbysearch/json?pagetoken=AfLeUgPos3GU6Ew0BWV52EyHv9ay7Q7H7N-44c0pSTfb0JN039qifhKotQiUPlF9O3P1jdeSJarnR72GHMmUW1jkBS0ErfYe_jpi9cUCx--XO1n1DUp_eF0XZ_Ue4KJ7_l5h6FjaE_1f2Z0G9q3lwGLg-0Ch5n4p7KaTKMq8CET8QX-lXa2_ssemCiFGXrOj6vn4wDXKNRqYAOGquNiaq9_3RWUs_k7Epe_uCicW94hC-PX90nisZxuW-zy3SBAmuRJpL4pV3CA9CWH0ygBigHWy88Sle6b1S-4k2GWK72n-eAMEEmmAOziqt1ETCy-li92pqjP4BgDv7jKYCD2uKgL3jRWdGyglroNkP02HFX49qHNbNrl0MfhKn_lTvw6zSjwF001nDOnYq5mgE8KCTe3b7cDxGgZVYWFFKwyNuswXiUPTy9D4lXhRJX7oRF6DH7YF-lH7faLpeh2eZBMP73AtFJlPX7B0c4_riCgeCK2C0Pvz_lBivx4VtkOehmuYfOVwBpi54rJW4-iZnrIjW0NrRD7HibL76MWyr_njLIf5eLx9Tl2PYiwTOj3Vd6Rjafry7b15M69Jhku1C22AVhy0R4HfYd5LHFn38N_ILL8PhDaMk3S2TKkzrohYyomrbvlffrBKUDij9Bbggvoy2s3iHrg6N-Em3SNrTPcWS65chZUALp_kve04rfU4wjhKow
This also returns 20 results and a final next_page_token. Third request:
https://maps.googleapis.com/maps/api/place/nearbysearch/json?pagetoken=AfLeUgPos3GU6Ew0BWV52EyHv9ay7Q7H7N-44c0pSTfb0JN039qifhKotQiUPlF9O3P1jdeSJarnR72GHMmUW1jkBS0ErfYe_jpi9cUCx--XO1n1DUp_eF0XZ_Ue4KJ7_l5h6FjaE_1f2Z0G9q3lwGLg-0Ch5n4p7KaTKMq8CET8QX-lXa2_ssemCiFGXrOj6vn4wDXKNRqYAOGquNiaq9_3RWUs_k7Epe_uCicW94hC-PX90nisZxuW-zy3SBAmuRJpL4pV3CA9CWH0ygBigHWy88Sle6b1S-4k2GWK72n-eAMEEmmAOziqt1ETCy-li92pqjP4BgDv7jKYCD2uKgL3jRWdGyglroNkP02HFX49qHNbNrl0MfhKn_lTvw6zSjwF001nDOnYq5mgE8KCTe3b7cDxGgZVYWFFKwyNuswXiUPTy9D4lXhRJX7oRF6DH7YF-lH7faLpeh2eZBMP73AtFJlPX7B0c4_riCgeCK2C0Pvz_lBivx4VtkOehmuYfOVwBpi54rJW4-iZnrIjW0NrRD7HibL76MWyr_njLIf5eLx9Tl2PYiwTOj3Vd6Rjafry7b15M69Jhku1C22AVhy0R4HfYd5LHFn38N_ILL8PhDaMk3S2TKkzrohYyomrbvlffrBKUDij9Bbggvoy2s3iHrg6N-Em3SNrTPcWS65chZUALp_kve04rfU4wjhKow
This also returns 20 results, but no next_page_token.
Bear in mind that there is no guarantee on the number of results for any given request. The API can return up to 60 results (in 3 pages), but it can also return anywhere from 0 to 60. This happens more easily with Places Nearby Search because, as the name indicates, it is meant to return only results that are nearby; even though it accepts a large radius value, it usually doesn't return results that are far away.
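For reference, the three raw requests above can be rolled into a loop with the requests library. A minimal sketch, assuming a placeholder YOUR_API_KEY and the first location from the question; note the raw parameter is pagetoken, and a fresh token needs a moment before it becomes valid:

import time
import requests

URL = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json'
params = {'key': 'YOUR_API_KEY',  # placeholder
          'location': '30.0416162646183,31.187637297709912',
          'radius': 10000,
          'keyword': 'كشري'}
results = []
while True:
    data = requests.get(URL, params=params).json()
    results.extend(data.get('results', []))
    token = data.get('next_page_token')
    if not token:
        break
    time.sleep(2)  # give the token time to become valid
    params = {'key': 'YOUR_API_KEY', 'pagetoken': token}
print(len(results))  # at most 60 per location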
I'm trying to get the result of an input from:
https://web2.0calc.com/
But I can't get the result. I've tried:
result = browser.find_element_by_id("input")
result.text
result.get_attribute("textContent")
result.get_attribute("innerHtml")
result.get_attribute("textContent")
But it doesn't work and returns an empty string...
The required element is a Base64 image, so you can either take the Base64 value from its src attribute, convert it to an image, and read the value out with a tool like PIL (quite a complicated approach), or you can get the result with a direct API call:
import requests
url = 'https://web2.0calc.com/calc'
data = {'in[]': '45*23'}  # pass your expression as the value
response = requests.post(url, data=data).json()
print(response['results'][0]['out'])
# 1035
If you need the value of #input:
print(browser.find_element_by_id('input').get_attribute('value'))
My preference would be for the POST example given above (+1 for that), but you can also grab the expression and evaluate it using asteval. asteval may have limitations, but it is safer than eval.
from selenium import webdriver
from asteval import Interpreter

d = webdriver.Chrome()
url = 'https://web2.0calc.com/'
d.get(url)
d.maximize_window()
d.find_element_by_css_selector('[name=cookies]').click()  # dismiss the cookie banner
d.find_element_by_id('input').send_keys(5)
d.find_element_by_id('BtnPlus').click()
d.find_element_by_id('input').send_keys(50)
d.find_element_by_id('BtnCalc').click()
expression = ''
while len(expression) == 0:  # poll until the result's title attribute is populated
    expression = d.find_element_by_id('result').get_attribute('title')
aeval = Interpreter()
print(aeval(expression))
d.quit()
schools=['GSGS','GSGL','JKG','JMG','MCGD','MANGD','SLSA','WHGR','WOG','GCG','LP',
'PGG', 'WVSG', 'ASGE','CZG', 'EAG','GI']
for i in range(1, 17):
    gmaps = googlemaps.Client(key='')
    distances = gmaps.distance_matrix((GSGS), (schools), mode="driving")['rows'][0]['elements'][0]['distance']['text']
    print(distances)
The elements of the list are schools. I didn't want to make the list too long, so I used these abbreviations.
I want to get all the distances between "GSGS" and the schools in the list. I don't know what to write inside the second bracket.
distances = gmaps.distance_matrix((GSGS), (schools)
If I run it like that, it outputs this error:
Traceback (most recent call last):
File "C:/Users/helpmecoding/PycharmProjects/untitled/distance.py", line 31, in
<module>
distances = gmaps.distance_matrix((GSGS), (schools), mode="driving")['rows'][0]['elements'][0]['distance']['text']
KeyError: 'distance'
I could do it one by one, but that's not what I want. If I write another school from the list schools and delete the for loop, it works fine.
I know I have to do a loop so that it cycles through the list, but I don't know how to do it. Behind every variable, for example "GSGS", is the address/location of the school.
I deleted the key just for safety.
My dad helped me and we solved the problem. Now I have what I want :) Next I have to build a list with all distances between the schools, and once I have that, I have to use Dijkstra's algorithm to find the shortest route between them. Thanks for helping!
import googlemaps

GSGS = (address)
GSGL = (address)
...
schools = (GSGS, GSGL, JKG, JMG, MCGD, MANGD, SLSA, WHGR, WOG, GCG, LP, PGG, WVSG, ASGE, CZG, EAG, GI)
school_names = ("GSGS", "GSGL", "JKG", "JMG", "MCGD", "MANGD", "SLSA", "WHGR", "WOG", "GCG", "LP", "PGG", "WVSG", "ASGE", "CZG", "EAG", "GI")
school_distances = ()

gmaps = googlemaps.Client(key='TOPSECRET')  # create the client once, not per request
for g in range(0, len(schools)):
    n = 0
    for i in schools:
        distances = gmaps.distance_matrix(schools[g], i)['rows'][0]['elements'][0]['distance']['text']
        if school_names[g] != school_names[n]:
            print(school_names[g] + " - " + school_names[n] + " " + distances)
        else:
            print(school_names[g] + " - " + school_names[n] + " " + "0 km")
        n = n + 1
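As a side note, the Distance Matrix API accepts lists of origins and destinations, so the nested loop above can be collapsed into one request per origin (subject to the API's per-request element limits). A minimal sketch, assuming the schools and school_names tuples from the code above:

import googlemaps

gmaps = googlemaps.Client(key='TOPSECRET')
for g in range(len(schools)):
    # One request returns the distances from schools[g] to every school.
    row = gmaps.distance_matrix(schools[g], list(schools))['rows'][0]
    for n, element in enumerate(row['elements']):
        dist = '0 km' if n == g else element['distance']['text']
        print(school_names[g] + " - " + school_names[n] + " " + dist)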
In my experience, it is sometimes difficult to know what is going on when you use a third-party API. Though I am not a proponent of reinventing the wheel, sometimes it is necessary to get a full picture of what is going on. So I recommend building the API endpoint request yourself and seeing if that works.
import requests

schools = ['GSGS', 'GSGL', 'JKG', 'JMG', 'MCGD', 'MANGD', 'SLSA', 'WHGR', 'WOG', 'GCG', 'LP', 'PGG', 'WVSG', 'ASGE', 'CZG', 'EAG', 'GI']

def gmap_dist(apikey, origins, destinations, **kwargs):
    units = kwargs.get("units", "imperial")
    mode = kwargs.get("mode", "driving")
    baseurl = "https://maps.googleapis.com/maps/api/distancematrix/json?"
    urlargs = {"key": apikey, "units": units, "origins": origins, "destinations": destinations, "mode": mode}
    req = requests.get(baseurl, params=urlargs)
    data = req.json()
    print(data)
    # Do this for each key and index pair until you find the one causing
    # the problem, if it is not immediately evident from the whole data print.
    print(data["rows"])
    print(data["rows"][0])
    # Check if there are elements
    try:
        distances = data['rows'][0]['elements'][0]['distance']
    except KeyError:
        raise KeyError("No elements found")
    except IndexError:
        raise IndexError("API Request Error. No response returned")
    else:
        return distances
Also, as a general rule of thumb, it is good to have a test case to make sure things are working as they should before running the whole list:
# test case
try:
    test = gmap_dist(apikey="", units="imperial", origins="GSGS", destinations="GSGL", mode="driving")
except Exception as err:
    raise Exception(err)
else:
    # The raw API expects multiple destinations as one pipe-separated string.
    dists = gmap_dist(apikey="", units="imperial", origins="GSGS", destinations="|".join(schools), mode="driving")
    print(dists)
Lastly, if you are testing the distance from "GSGS" to the other schools, you might want to take it out of your list of schools, since the distance will be 0.
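For example, with the schools list from above (again pipe-separated, as the raw API expects):

destinations = "|".join(s for s in schools if s != "GSGS")  # drop the origin itself
dists = gmap_dist(apikey="", origins="GSGS", destinations=destinations)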
Now, I suspect the reason you are getting this exception is that there are no JSON elements returned, probably because one of your parameters was improperly formatted.
If this function still raises a KeyError, check the address spelling and make sure your API key is valid; although if it were the API key, I would expect the API to return empty results rather than results with missing keys.
Hope this helps. Comment if it doesn't work.
I have a large list of URLs I would like to download (about 400K), and I would like to use the concurrent downloading capability of scrapy. The most basic Pipeline examples I have found have been too complicated.
Can you point me to a simple example that would take a list like this:
url_list = ['http://www.example.com/index.html',
'http://www.something.com/index.html']
and I would store them in a list of files like this:
file_list = ['../file1.html',
'../file2.html']
Rate-limiting would be a nice bonus so as not to overload a poor server.
Note: Does not need to be with scrapy if there is another way.
You can modify this snippet of code to do what you want:
import grequests

def exception_handler(request, exception):
    print("Request failed:", request.url, exception)

def chop(seq, size):
    """Chop a sequence into chunks of the given size."""
    return (seq[i:i + size] for i in range(0, len(seq), size))

def get_chunk(chunk):
    reqs = (grequests.get(u) for u in chunk)
    # size caps the number of concurrent requests -- a crude rate limit
    responses = grequests.map(reqs, size=10, exception_handler=exception_handler)
    for r in responses:
        if r is None:  # request failed; already reported by the handler
            continue
        player_id = r.request.url.split('=')[-1]  # derive a file name from the URL
        print(r.status_code, player_id, r.request.url, len(r.content))
        with open('data/%s.html' % player_id, 'wb') as f:
            f.write(r.content)

urls = [a.strip() for a in open('temp/urls.txt').read().split('\n') if a]
chunks = chop(urls, 150)
for chunk in chunks:
    get_chunk(chunk)
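If you would rather not pull in grequests (it relies on gevent monkey-patching), a plain requests download loop with a thread pool does the same job. A minimal sketch, where url_list, file_list, and max_workers are placeholders to adapt:

import concurrent.futures
import requests

def fetch(url, filename):
    # Download one URL to a file; raise on HTTP errors.
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(r.content)
    return url

url_list = ['http://www.example.com/index.html',
            'http://www.something.com/index.html']
file_list = ['file1.html', 'file2.html']

# max_workers doubles as a crude rate limit on concurrent connections.
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(fetch, u, f) for u, f in zip(url_list, file_list)]
    for fut in concurrent.futures.as_completed(futures):
        print(fut.result())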
Hey guys (and of course ladies),
I have this little script which should show me some nice rrd graphs, but it seems I can't find a way to get it to show me any stats. This is my script:
# Function: Simple ping plotter for rrd
import rrdtool, tempfile, commands, time, sys
from model import hosts

sys.path.append('/home/dirk/devel/python/stattool/stattool/lib')
import nurrd
from nurrd import RRDplot

class rrdPing(RRDplot):
    def __init__(self):
        self.DAY = 86400
        self.YEAR = 365 * self.DAY
        self.rrdfile = 'hostname.rrd'
        self.interval = 300
        self.probes = 5
        self.rrdList = []

    def create_rrd(self, interval):
        ret = rrdtool.create("%s" % self.rrdfile, "--step", "%s" % self.interval,
                             "DS:packets:COUNTER:600:U:U",
                             "RRA:AVERAGE:0.5:1:288",
                             "RRA:AVERAGE:0.5:1:336")

    def getHosts(self, userID):
        myHosts = hosts.query.filter_by(uid=userID).all()
        return myHosts.pop(0)

    def _doPing(self, host):
        for x in xrange(0, self.probes):
            ans, unans = commands.getstatusoutput("ping -c 3 -w 6 %s| grep rtt| awk -F '/' '{ print $5 }'" % host)
            print x
            self.probes -= 1
            self.rrdList.append(unans)
        return self.rrdList

    def plotRRD(self):
        self.create_rrd(self.interval)
        times = self._doPing(self.getHosts(3))
        for x in xrange(0, len(times)):
            loc = times.pop(0)
            rrdtool.update(self.rrdfile, '%d:%d' % (int(time.time()), int(float(loc))))
            print '%d:%d' % (int(time.time()), int(float(loc)))
            time.sleep(5)
        self.graph(60)

    def graph(self, mins):
        ret = rrdtool.graph("%s.png" % self.rrdfile, "--start", "-1", "--end", "+1", "--step", "300",
                            "--vertical-label=Bytes/s",
                            "DEF:inoctets=%s:packets:AVERAGE" % self.rrdfile,
                            "AREA:inoctets#7113D6:In traffic",
                            "CDEF:inbits=inoctets,8,*",
                            "COMMENT:\\n",
                            "GPRINT:inbits:AVERAGE:Avg In traffic\: %6.2lf \\r",
                            "COMMENT: ",
                            "GPRINT:inbits:MAX:Max In traffic\: %6.2lf")

if __name__ == "__main__":
    ping = rrdPing()
    ping.plotRRD()
    info = rrdtool.info('hostname.rrd')
    print info['last_update']
Could somebody please give me some advice or some tips on how to solve this?
(Sorry, the code is a little messy.)
Thanks in advance.
Kind regards,
Dirk
Several issues.
Firstly, you appear to be collecting only a single data sample and storing it before you try to generate a graph. You will need at least two samples, separated by about 300s, before you can get a single Primary Data Point and therefore something to graph.
Secondly, you do not post any information as to what data you are actually storing. Are you sure your rrdPing function is returning valid data to store? You are not testing the error status of the write either.
Thirdly, the data you are collecting seems to be ping times or similar, which is a GAUGE type value. Yet, your RRD DS definition uses a COUNTER type and your graph call is treating it as network traffic data. A COUNTER type assumes increasing values and converts to a rate of change, so if you give it ping RTT data you'll get either unknowns or zeroes stored, which will not show up on a graph.
Fourthly, your call to RRDGraph is specifying a start of -1 and an end of +1: from 1 second in the past to 1 second in the future? Since your step is 300s, this is an odd graph. Maybe you should have --end 'now-1' --start 'end-1day' or similar?
You should make your code test the return values for any error messages produced by the RRDTool library -- this is good practice anyway. When testing, print out the values you are updating with to be sure you are giving valid values. With RRDTool, you should collect several data samples at the step interval and store them before expecting to see a line on the graph. Also, make sure you are using the correct data type, GAUGE or COUNTER.
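To make the third and fourth points concrete, here is a sketch of what a GAUGE data source and a sane graphing window might look like. The DS name rtt, the 600s heartbeat, and the one-day window are assumptions, not taken from the original script:

import rrdtool

# GAUGE stores the measured value as-is, which is what ping RTTs need;
# COUNTER would treat them as an ever-increasing counter and store
# rates of change instead (mostly unknowns or zeroes for RTT data).
rrdtool.create('hostname.rrd', '--step', '300',
               'DS:rtt:GAUGE:600:U:U',
               'RRA:AVERAGE:0.5:1:288')

# ...after several updates spaced ~300s apart...
rrdtool.graph('hostname.png',
              '--end', 'now', '--start', 'end-1day',
              '--vertical-label=RTT (ms)',
              'DEF:rtt=hostname.rrd:rtt:AVERAGE',
              'LINE1:rtt#7113D6:ping rtt')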