How to classify users into different countries, based on the Location field - python

Most web applications have a Location field, in which users may enter a location of their choice.
How would you classify users into different countries, based on the location entered?
For example, I used the Stack Overflow dump of users.xml and extracted users' names, reputation and location:
['Jeff Atwood', '12853', 'El Cerrito, CA']
['Jarrod Dixon', '1114', 'Morganton, NC']
['Sneakers OToole', '200', 'Unknown']
['Greg Hurlman', '5327', 'Halfway between the boardwalk and Six Flags, NJ']
['Power-coder', '812', 'Burlington, Ontario, Canada']
['Chris Jester-Young', '16509', 'Durham, NC']
['Teifion', '7024', 'Wales']
['Grant', '3333', 'Georgia']
['TimM', '133', 'Alabama']
['Leon Bambrick', '2450', 'Australia']
['Coincoin', '3801', 'Montreal']
['Tom Grochowicz', '125', 'NJ']
['Rex M', '12822', 'US']
['Dillie-O', '7109', 'Prescott, AZ']
['Pete', '653', 'Reynoldsburg, OH']
['Nick Berardi', '9762', 'Phoenixville, PA']
['Kandis', '39', '']
['Shawn', '4248', 'philadelphia']
['Yaakov Ellis', '3651', 'Israel']
['redwards', '21', 'US']
['Dave Ward', '4831', 'Atlanta']
['Liron Yahdav', '527', 'San Rafael, CA']
['Geoff Dalgas', '648', 'Corvallis, OR']
['Kevin Dente', '1619', 'Oakland, CA']
['Tom', '3316', '']
['denny', '573', 'Winchester, VA']
['Karl Seguin', '4195', 'Ottawa']
['Bob', '4652', 'US']
['saniul', '2352', 'London, UK']
['saint_groceon', '1087', 'Houston, TX']
['Tim Boland', '192', 'Cincinnati Ohio']
['Darren Kopp', '5807', 'Woods Cross, UT']
using the following Python script:
from xml.etree import ElementTree

root = ElementTree.parse('SO Export/so-export-2009-05/users.xml').getroot()
items = ['DisplayName', 'Reputation', 'Location']

def loop1():
    for count, i in enumerate(root):
        det = [i.get(x) for x in items]
        print det
        if count > 30: break

loop1()
What is the simplest way to classify people into different countries? Are there any ready lookup tables available that provide me an output saying X location belongs to Y country?
The lookup table need not be totally accurate. Reasonably accurate answers are obtained by querying the location string on Google, or better still, Wolfram Alpha.

Your best bet is to use a geocoding API like geopy (see its examples).
The Google Geocoding API, for example, will return the country in the CountryNameCode field of the response.
With just this one location field the number of false matches will probably be relatively high, but maybe it is good enough.
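As an illustration, here is a minimal sketch of the geopy route using the free Nominatim geocoder (the addressdetails option and the structure of the raw response are assumptions based on Nominatim's behaviour, and the user_agent string is arbitrary):
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent='country-classifier')

def guess_country(location_string):
    """Best-effort country lookup for a free-form location string."""
    if not location_string or location_string == 'Unknown':
        return None
    place = geolocator.geocode(location_string, addressdetails=True)
    if place is None:
        return None
    # With addressdetails=True, Nominatim returns a structured address dict.
    return place.raw.get('address', {}).get('country')

print(guess_country('El Cerrito, CA'))               # likely 'United States'
print(guess_country('Burlington, Ontario, Canada'))  # likely 'Canada'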
If you had server logs, you could also try to look up the user's IP address with an IP geocoder (more information and pointers on Wikipedia).

Force users to specify country, because you'll have to deal with ambiguities. This would be the right way.
If that's not possible, at least make your best-guess in conjunction with their IP address.
For example, ['Grant', '3333', 'Georgia']
Is this Georgia, USA?
Or is this the Republic of Georgia?
If their IP address suggests somewhere in Central Asia or Eastern Europe, then chances are it's the Republic of Georgia. If it's North America, chances are pretty good they mean Georgia, USA.
Note that mapping an IP address to a country isn't 100% accurate, and the database needs to be updated regularly. In my opinion, that's far too much trouble.
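If you do go that route anyway, a minimal sketch with the geoip2 package might look like this (it assumes a locally downloaded GeoLite2-Country database; the file path and test address below are placeholders):
import geoip2.database
from geoip2.errors import AddressNotFoundError

# Path to a locally downloaded GeoLite2 database (placeholder).
reader = geoip2.database.Reader('/path/to/GeoLite2-Country.mmdb')

def country_from_ip(ip_address):
    """Return the ISO country code for an IP address, or None if unknown."""
    try:
        return reader.country(ip_address).country.iso_code
    except AddressNotFoundError:
        return None

# A user whose profile says "Georgia" but whose IP resolves to 'GE' is
# probably in the Republic of Georgia rather than Georgia, USA.
print(country_from_ip('203.0.113.7'))  # reserved documentation IP; prints None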

Related

Last item on the reduce Method

I have this list of countries:
countries = ['Estonia', 'Finland', 'Sweden', 'Denmark', 'Norway', 'Iceland']
I need to solve the following exercise: use reduce to concatenate all the countries and produce this sentence: Estonia, Finland, Sweden, Denmark, Norway, and Iceland are north European countries
from functools import reduce

def sentece(pais, pais_next):
    if pais_next == 'Iceland':
        return pais + ' and ' + pais_next + ' are north European countries'
    else:
        return pais + ', ' + pais_next

countries_reduce = reduce(sentece, countries)
print(countries_reduce)
The code runs perfectly, but if I want to do this in general, how do I know which element is the last one?
The reduce function doesn't have a way to tell it what to do about the last item, only what to do about the initialization.
There are two general ways to go about it:
Just do simple concatenation with a comma and a space, but only on the first n-1 items of the list, then manually append the correct format for the last item
Change the last item from Iceland to and Iceland are north European countries, then do the concatenation for the full list.
Figuring out which is the last element is a bad idea; any solution that would give you that information would be a royal hack.
Normally, you wouldn't use reduce to solve this at all (repeated concatenation is a form of Schlemiel the Painter's Algorithm, involving O(n²) work, where efficient algorithms can be O(n)), so you'd just use ', '.join, e.g.:
countries = ['Estonia', 'Finland', 'Sweden', 'Denmark', 'Norway', 'Iceland']
countries_str = f'{", ".join(countries[:-1])} and {countries[-1]} are north European countries'
or, premodifying countries[-1] to reduce the complexity to the point where an f-string isn't necessary (assuming an Oxford comma is okay):
countries = ['Estonia', 'Finland', 'Sweden', 'Denmark', 'Norway', 'Iceland']
countries[-1] = 'and ' + countries[-1] # Put the "and " prefix in front ahead of time
countries_str = ', '.join(countries) + ' are north European countries'
In the first version, join is used for the consistent separators between items, wrapped in an f-string that inserts the last item along with the rest of the formatting.
If you must use reduce, you'd still want to handle the final element separately, either by processing it completely separately at the end, e.g.
from functools import reduce
countries = ['Estonia', 'Finland', 'Sweden', 'Denmark', 'Norway', 'Iceland']
countries_str = f'{reduce(lambda x, y: f"{x}, {y}", countries[:-1])} and {countries[-1]} are north European countries'
print(countries_str)
or by manually tweaking it ahead of time so it can be used in a consistent manner (assuming you're okay with the Oxford comma):
from functools import reduce
countries = ['Estonia', 'Finland', 'Sweden', 'Denmark', 'Norway', 'Iceland']
countries[-1] = 'and ' + countries[-1] # Put the "and " prefix in front ahead of time
countries_str = f'{reduce(lambda x, y: f"{x}, {y}", countries)} are north European countries'
print(countries_str)
Again, reduce is a bad solution to the problem; str.join (using ', '.join) is an O(n) solution (on CPython, it pre-scans the items to join to compute the final length, then preallocates the complete final str, and copies each input exactly once), where reduce is O(n²) (and unlike an actual loop using +=, it can't even benefit from the CPython reference interpreter's implementation detail that sometimes allows concatenation to mutate in-place, reducing the number of data copies).

Python: select Wikipedia pages of locations and places using Wikidata

I have a list of Wikipedia pages related to some entities and I want to select only geographical places and locations (cities, provinces, but also regions, mountains, rivers and so on).
I can easily select pages with coordinates, but this is not enough, since many places in Wikipedia are not associated with coordinates. I guess I should use labels from Wikidata, but I have never used them and I am a bit lost with the Python API. For example, if I use wptools:
import wptools
page = wptools.page('Indianapolis')
print(page.get_wikidata())
I obtain this:
www.wikidata.org (wikidata) Indianapolis
www.wikidata.org (labels) Q1000136|P1830|P421|Q1093829|P163|Q2579...
www.wikidata.org (labels) Q537853|P281|P949|Q2494513|Q3166162|Q18...
www.wikidata.org (labels) P1036|Q499547|P1997|P31|P17|P268|Q62049...
en.wikipedia.org (imageinfo) File:IndianapolisC12.png
Indianapolis (en) data
{
aliases: <list(10)> Circle City, Indy, Naptown, Crossroads of Am...
claims: <dict(61)> P1082, P227, P1151, P31, P17, P131, P163, P41...
description: <str(109)> city in and county seat of Marion County...
image: <list(1)> {'file': 'File:IndianapolisC12.png', 'kind': 'w...
label: Indianapolis
labels: <dict(145)> Q1000136, P1830, P421, Q1093829, P163, Q2579...
modified: <dict(1)> wikidata
requests: <list(5)> wikidata, labels, labels, labels, imageinfo
title: Indianapolis
what: county seat
wikibase: Q6346
wikidata: <dict(61)> population (P1082), GND ID (P227), topic's ...
wikidata_pageid: 7459
wikidata_url: https://www.wikidata.org/wiki/Q6346
}
How can I extract only the labels?
I suppose there exists a label like "THIS IS A LOCATION", but how do I use it?
Thanks in advance
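For what it's worth, the Wikidata property that usually encodes this is P31 ("instance of"). A rough sketch of reading it with wptools, assuming the claims dict is keyed by property ID and the labels dict maps Q-ids to names, as the printout above suggests:
import wptools

page = wptools.page('Indianapolis')
page.get_wikidata()

# 'claims' appears to map property IDs to lists of Q-ids,
# and 'labels' maps those Q-ids to human-readable names.
instance_of = page.data.get('claims', {}).get('P31', [])
labels = page.data.get('labels', {})
print([labels.get(qid, qid) for qid in instance_of])
Note that Wikidata has no single "THIS IS A LOCATION" flag: classes form a hierarchy via P279 ("subclass of"), so deciding which of the returned classes count as geographical usually means checking them against a whitelist of Q-ids or walking the subclass chain.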

Cleaning up text data extracted from scanned .pdf

I am creating a script to extract text from a scanned PDF to build a JSON dictionary for later import into MongoDB. The issue I have run into is that extracting the text with tesseract-ocr via the textract module works, but all of the whitespace on the PDF is being turned into '\n', making it very hard to extract the necessary information.
I have tried cleaning it up with a bunch of lines of code, but it is still not very readable, and it strips out all the colons, which I feel would have made identifying the keys and values a lot easier.
stringedText = str(text)
cleanText = rmStop.replace('\n','')
splitText = re.split(r'\W+', cleanText)
caseingText = [word.lower() for word in splitText]
cleanOne = [word for word in caseingText if word != 'n']
dexStop = cleanOne.index("od260")
dexStart = cleanOne.index("sheet")
clean = cleanOne[dexStart + 1:dexStop]
I am still left with quite a bit of unclean, almost over-processed data, and at this point I am not sure how to use it.
This is how I extracted the data:
text = textract.process(filename, method="tesseract", language="eng")
I have tried NLTK as well, and that took out some data and made it a little easier to read, but there are still a lot of \n characters muddling up the data.
Here is the NLTK code:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# ENGLISH_STOP_WORDS presumably comes from sklearn.feature_extraction.text

stringedText = str(text)
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(stringedText)
rmStop = [i for i in tokens if i not in ENGLISH_STOP_WORDS]
Here is what I get from the first clean-up I tried:
['n21', 'feb', '2019', 'nsequence', 'lacz', 'rp', 'n5', 'gat', 'ctc', 'tac', 'cat', 'ggc', 'gca', 'cat', 'ttc', 'ccc', 'gaa', 'aag', 'tgc', '3', 'norder', 'no', '15775199', 'nref', 'no', '207335463', 'n25', 'nmole', 'dna', 'oligo', '36', 'bases', 'nproperties', 'amount', 'of', 'oligo', 'shipped', 'to', 'ntm', '50mm', 'nacl', '66', '8', 'xc2', 'xb0c', '11', '0', '32', '6', 'david', 'cook', 'ngc', 'content', '52', '8', 'd260', 'mmoles', 'kansas', 'state', 'university', 'biotechno', 'nmolecular', 'weight', '10', '965', '1', 'nnmoles']
From that I need a JSON array that looks like:
"lacz-rp" : {
    "Date" : "21-feb-2019",
    "Sequence" : "gatctctaccatggcgcacatttccccgaaaagtgc",
    "Order No." : "15775199",
    "Ref No." : "207335463"
}
and so on... I am just not sure what to do. I can also provide the raw output; this is what it looked like before I touched it. The above data is all the information I need to make a complete array.
b' \n\nIDT\nINTEGRATED DNA TECHNOLOGIES,\nOLIGONUCLEOTIDE SPECIFICATION SHEET\n\n \n\n21-Feb-2019\n\nSequence - LacZ-RP\n\n5\'- GAT CTC TAC CAT GGC GCA CAT TTC CCC GAA AAG TGC -3\'\n\nOrder No. 15775199\nref.No. 207335463\n\n25 nmole DNA Oligo, 36 bases\n\n \n\nProperties Amount Of Oligo Shipped To\nTm (50mM NaCl)*:66.8 \xc2\xb0C 11.0= 32.6 DAVID COOK\nGC Content: 52.8% D260 mmoles KANSAS STATE UNIVERSITY-BIOTECHNO.\n\nMolecular Weight: 10,965.1\nnmoles/OD260: 3.0\nug/OD260: 32.6\nExt. Coefficient: 336,200 L/(mole-cm)\n\nSecondary Structure Calcul\n\n \n\nns\nLowest folding free energy (kcal/mole): -3.53 at 25 \xc2\xb0C\n\nStrongest Folding Tm: 46.6 \xc2\xb0C\n\n \n\nOligo Base Types Quantity\n\nDi eo\nModifications and Services Quantity\nStandard Desalting 7\n\nMfg. 1D 289505556\n\n207335463 ~<<IDT\nD.cooK,\n\n2eosoesse 2uren20%9\n\n207335463 ~XIDT\nD.cooK,\n\n \n \n \n\n \n\nINSTRUCTIONS\n\n.d contents may appear as either a translucent film or a white powder.\nice does not affect the quality of the oligo,\n\n\xe2\x80\x9cPlease centrifuge tubes prior to opening. Some of the product may have been\ndislodged during shipping.\n\n\xe2\x80\x9cThe Tm shown takes no account of Mg?* and dNTP concentrations. Use the\nOligoAnalyzer\xc2\xae Program at www.idtdna.com/scitools to calculate accurate Tm for\nyour reaction conditions.\n\nFor 100 |M: add 326 [iL\n\nBURT HALL #207\n\nMANHATTAN, KS 66506\n\nUSA\n\n7855321362\n\nCustomer No. 378741 PO No.06BF3000\n\nDisclaimer\n\nSee on reverse page notes (I) (Il) & (lll) for usage, label\nlicense, and product warranties\n\x0cUse Restrictions: Oligonucleotides and nucleic acid products are manufactured and sold by IDT for the\ncustomer\'s research purposes only. Resale of IDT products requires the express written consent of IDT.\nUnless pursuant to a separate signed agreement by authorized IDT officials, IDT products are not sold\nfor (and have not been approved) for use in any clinical, diagnostic or therapeutic applications.\nObtaining license or approval to use IDT products in proprietary applications or in any non-research\n(clinical) applications is the customer\'s exclusive responsibility. DT will not be responsible or liable for\nany losses, costs, expenses, or other forms of liability arising out of the unauthorized or unlicensed use\nof IDT products. Purchasers of IDT products shall indemnify and hold IDT harmless for any and all\ndamages and/or liability, however characterized, related to the unauthorized or unlicensed use of IDT\nproducts. Under no circumstances shall IDT be liable for any consequential damages, resulting from\nany use (approved or otherwise) of IDT products. All orders received by IDT, and all sales of IDT\nproducts are made subject to the aforementioned use restrictions and customer indemnification of IDT.\n\nGeneral Warranty: IDT\'s products are guaranteed to meet or exceed our published specifications for\nidentity, purity and yield as measured under normal laboratory conditions. If our product fails to meet\nsuch specifications, IDT will promptly replace the product. 
A// other warranties are hereby expressly\ndisclaimed, including but not limited to, the implied warranties of merchantability and fitness for a\nparticular purpose, and any warranty that the products, or the use of products, manufactured by IDT will\nnot infringe the patents of one or more third-partiesAll orders received by IDT, and all sales of IDT\nproducts are made subject to the aforementioned disclaimers of warranties.\n\nSee http://www.idtdna.com/Catalog/Usage/Page1.aspx for further details\na) Cy Dyes: The purchase of this Product includes a limited non-exclusive sublicense under U.S\n\nPatent Nos. 5 556 959 and 5 808 044 and foreign equivalent patents and other foreign and U.S\ncounterpart applications to use the amidites in the Product to perform research. NO OTHER\nLICENSE IS GRANTED EXPRESSLY, IMPLIEDLY OR BY ESTOPPEL. Use of the Product for\ncommercial purposes is strictly prohibited without written permission from Amersham Biosciences\nCorp. For information concerning availability of additional licenses to practice the patented\nmethodologies, please contact Amersham Biosciences Corp, Business Licensing Manager,\nAmersham Place, Little Chalfont, Bucks, HP79NA.\n\nb) \xe2\x80\x94 BHQ: Black Hole Quencher, BHQ-0, BHQ-1, BHQ-2 and BHQ-3 are registered trademarks of\nBiosearch Technologies, Inc., Novato, California, U.S.A Patents are currently pending for the BHQ\ntechnology and such BHQ technology is licensed by the manufacturer pursuant to an agreement\nwith BTI and these products are sold exclusively for research and development use only. They\nmay not be used for human veterinary in vitro or clinical diagnostic purposes and they may not be\nre-sold, distributed or re-packaged. For information on licensing programs to permit use for human\nor veterinary in vitro or clinical diagnostic purposes, please contact Biosearch at\nlicensing#biosearchtech.com.\n\nc) MPI dyes: MPI dyes. This product is provided under license from Molecular Probes, Inc., for\nresearch use only, and is covered by pending and issued patents.\n\nd) Molecular Beacons: Molecular Beacons. This product is sold under license from the Public Health\nResearch Institute only for use in the purchaser\'s research and development activities.\n\ne) ddRNAi: This product is sold solely for use for research purposes in fields other than plants. This\nproduct is not transferable. If the purchaser is not willing to accept the conditions of this label\nlicense, supplier is willing to accept the return of the unopened product and provide the purchaser\nwith a full refund. However if the product is opened, then the purchaser agrees to be bound by the\nconditions of this limited use statement. This product is sold by supplier under license from\nBenitec Australia Ltd and CSIRO as co-owners of U.S Patent No. 6,573,099 and foreign\ncounterparts. For information regarding licenses to these patents for use of ddRNAi as a\ntherapeutic agent or as a method to treat/prevent human disease, please contact Benitec at\nlicensing#benitec.com. For the use of ddRNAi in other fields, please contact CSIRO at\nwww.pi.csiro.au/RNAi.\n\x0cf)\n\n9)\n\nh)\n\nk)\n\n))\n\nm)\n\nn)\n\nDicer Substrate RNAi:\n\n* These products are not for use in humans or non-human animals and may not be used for\nhuman or veterinary diagnostic, prophylactic or therapeutic purposes. 
Sold under license of\npatents pending jointly assigned to IDT and the City of Hope Medical Center.\n\nThis product is licensed under European Patents 1144623, 121945 and foreign equivalents\nfrom Alnylam Pharmaceuticals, Inc., Cambridge, USA and is provided only for use in\nacademic and commercial research whose purpose is to elucidate gene function, including\nresearch to validate potential gene products and pathways for drug discovery and\ndevelopment and to screen non-siRNA based compounds (but excluding the evaluation or\ncharacterization of this product as the potential basis for a siRNA based drug) and not for\nany other commercial purposes. Information about licenses for commercial use (including\ndiscovery and development of siRNA-based drugs) is available from Alnylam\nPharmaceuticals, Inc., 300 Third Street, Cambridge MA 02142, USA\n\nLicense under U.S. Patent # 6506559; Domestic and Foreign Progeny; including European\nPatent Application # 98964202\n\nLNAs: Protected by US. Pat No. 6,268,490 and foreign applications and patents owned or\ncontrolled by Exiqon A/S. For Research Use Only. Not for resale or for therapeutic use or use in\nhumans\n\nOther siRNA duplexes: This product is provided under license from Molecular Probes, Inc., for\nresearch use only, and is covered by pending and issued patents.\n\nAcrydite: IDT is licensed under U.S Patent Number 6,180,770 and 5,932,711 to sell this product\nfor use solely in the purchaser\'s own life sciences research and development activities. Resale, or\nuse of this product in clinical or diagnostic applications, or other commercial applications, requires\nseparate license from Mosaic, Inc.\n\nlso-Bases: Licensed under EraGen, Inc. United States Patents Number 5,432,272; 6,001,983;\n6,037,120; and 6,140,496. For research use Only.\n\nDig: Licensed from Roche Diagnostics GmbH\n\n5\' Nuclease Assay: The 5\' Nuclease Assay and other homogenous amplification methods used in\nconnection with the Polymerase Chain Reaction (PCR) process are covered by patents owned by\nRoche Molecular Systems, Inc. and F. Hoffman La-Roche Ltd (Roche). No license to use the 5"\nNuclease Assay or any Roche patented homogenous amplification process is conveyed expressly\nor by implication to the purchaser by the purchase of the above listed products or any other IDT\nproducts.\n\nlowa Black\xc2\xae FQ and RQ: lowa Black is a registered trademark of IDT, and lowa Black-labeled\noligos are covered by pending patents owned and controlled by IDT.\n\nIRDye\xc2\xae 700 and IRDye\xc2\xae 800: IRDye\xc2\xae 700 and IRDye\xc2\xae 800 are products manufactured under\nlicense from LI-COR\xc2\xae Biosciences, which expressly excludes the right to use this product in\nQPCR or AFLP applications.\n\x0c'
You can convert the escaped \n sequences into actual newlines. Use the following:
formatted_text = text.replace('\\n', '\n')
This will replace escaped newlines with actual newlines in the output.
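Building on that, here is a rough sketch of pulling a few of the needed fields straight out of the decoded textract output, keeping the line structure rather than stripping it; the regular expressions are guesses based on the raw dump above, not a tested parser:
import re

# 'text' is the bytes object returned by textract.process(...)
raw = text.decode('utf-8', errors='ignore')

def grab(pattern):
    """Return the first capture group of a pattern, or None if not found."""
    m = re.search(pattern, raw)
    return m.group(1) if m else None

record = {
    'Name': grab(r'Sequence - (\S+)'),                 # e.g. 'LacZ-RP'
    'Date': grab(r'(\d{2}-[A-Za-z]{3}-\d{4})'),        # e.g. '21-Feb-2019'
    'Sequence': (grab(r"5'- (.+?) -3'") or '').replace(' ', '').lower(),
    'Order No.': grab(r'Order No\. (\d+)'),
    'Ref No.': grab(r'ref\.No\. (\d+)'),
}
print(record)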

Python module for getting latitude and longitude from the name of a US city

I am looking for a Python module which can take the name of a city as input and return its latitude and longitude.
Have a look at geopy. In the "getting started" documentation it shows:
>>> from geopy import geocoders
>>> gn = geocoders.GeoNames()
>>> print gn.geocode("Cleveland, OH 44106")
(u'Cleveland, OH, US', (41.4994954, -81.6954088))
>>> gn.geocode("Cleveland, OH", exactly_one=False)[0]
(u'Cleveland, OH, US', (41.4994954, -81.6954088))
An example:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent='myapplication')
location = geolocator.geocode("Chicago Illinois")
print(location.address)
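Since the question is about coordinates, note that the returned location object also exposes them directly:
print(location.latitude, location.longitude)
# roughly 41.8756208 -87.6243706, matching the lat/lon in the raw output below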
Available info:
>>> location.raw
{u'display_name': u'Chicago, Cook County, Illinois, United States of America',
 u'importance': 1.0026476104889,
 u'place_id': u'97957568',
 u'lon': u'-87.6243706',
 u'lat': u'41.8756208',
 u'osm_type': u'relation',
 u'licence': u'Data \xa9 OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright',
 u'osm_id': u'122604',
 u'boundingbox': [u'41.6439170837402', u'42.0230255126953', u'-87.9401016235352', u'-87.5239791870117'],
 u'type': u'city',
 u'class': u'place',
 u'icon': u'http://nominatim.openstreetmap.org/images/mapicons/poi_place_city.p.20.png'}
You can also use geocoder with the Bing Maps API. The API returns latitude and longitude data for all addresses for me (unlike Nominatim), and it has a pretty good free tier for non-commercial use (at most 125,000 requests per year for free). To get a free API key, go to the Bing Maps Dev Center and click "Get a free Basic key". After getting your key, you can use it in the code below:
import geocoder # pip install geocoder
g = geocoder.bing('Tokyo', key='<API KEY>')
results = g.json
print(results['lat'], results['lng'])
Output:
>>> 35.68696212768555 139.7494659423828
The results contain much more information than longitude and latitude. Check it out.

How to map the most "similar" strings from one list to another in python?

Given are two lists containing strings.
One contains the names of organisations (mostly universities) from all around the world, not always written in English but always using the Latin alphabet.
The other list contains mostly full addresses, in which strings (organisations) from the first list may occur.
An Example:
addresses = [
    "Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium",
    "Machine Learning and Computational Biology Research Group, Max Planck Institutes Tübingen, Tübingen, Germany 72076",
    "Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185",
    "Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754",
    "Computer Science Department, University of California, Santa Barbara, USA 93106",
    "Fraunhofer IAIS, Sankt Augustin, Germany",
    "Department of Computer Science, Cornell University, Ithaca, NY",
    "University of Wisconsin-Madison"
]
organisations = [
    "Catholic University of Leuven",
    "Fraunhofer IAIS",
    "Cornell University of Ithaca",
    "Tübingener Max Plank Institut"
]
As you can see the desired mapping would be:
"Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium",
--> Catholic University of Leuven
"Machine Learning and Computational Biology Research Group, Max Planck Institutes Tübingen, Tübingen, Germany 72076",
--> Max Plank Institut Tübingen
"Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185",
--> --
"Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754",
--> Fraunhofer IAIS
"Computer Science Department, University of California, Santa Barbara, USA 93106",
"Fraunhofer IAIS, Sankt Augustin, Germany",
--> Fraunhofer IAIS
"Department of Computer Science, Cornell University, Ithaca, NY"
--> "Cornell University of Ithaca",
"University of Wisconsin-Madison",
--> --
My thinking was to use some kind of distance algorithm to calculate the similarity of the strings, since I cannot just look for an organisation in an address by doing if organisation in address, because it could be written slightly differently in different places. So my first guess was to use the difflib module, especially the difflib.get_close_matches() function, to select for every address the closest string from the organisations list. But I am not quite confident that the results will be accurate enough, and I don't know how high I should set the ratio, which seems to be a similarity measure.
Before spending too much time trying the difflib module, I thought of asking the more experienced people here whether this is the right approach or whether there is a more suitable tool or way to solve my problem. Thanks!
PS: I don't need an optimal solution.
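For reference, the difflib approach mentioned in the question would look roughly like this, using the addresses and organisations lists above (a sketch only; the cutoff of 0.4 is a guess, not a tuned value):
import difflib

for address in addresses:
    # Returns up to n matches from organisations scoring above cutoff (0..1).
    matches = difflib.get_close_matches(address, organisations, n=1, cutoff=0.4)
    print(address, '-->', matches[0] if matches else '--')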
Use the following as your string distance function (instead of plain Levenshtein distance):
# Assumes a levenshtein(a, b) edit-distance function is available,
# e.g. from the python-Levenshtein package: from Levenshtein import distance as levenshtein
def strdist(s1, s2):
    words1 = set(w for w in s1.split() if len(w) > 3)
    words2 = set(w for w in s2.split() if len(w) > 3)
    scores = [min(levenshtein(w1, w2) for w2 in words2) for w1 in words1]
    n_shared_words = len([s for s in scores if s <= 3])
    return -n_shared_words
Then use the Munkres (Hungarian) assignment algorithm, since there appears to be a 1:1 mapping between organisations and addresses.
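As a rough sketch of that assignment step, here is one way to do it with scipy's linear_sum_assignment in place of a dedicated Munkres package, reusing strdist and the two lists from the question (addresses with no real match would need an extra score cutoff, which is omitted here):
import numpy as np
from scipy.optimize import linear_sum_assignment

# Cost matrix: one row per address, one column per organisation.
# strdist returns the negated count of shared words, so minimising
# the total cost maximises word overlap.
cost = np.array([[strdist(addr, org) for org in organisations]
                 for addr in addresses])

# linear_sum_assignment handles rectangular matrices; each organisation
# is paired with at most one address.
rows, cols = linear_sum_assignment(cost)
for r, c in zip(rows, cols):
    print(addresses[r], '-->', organisations[c])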
You can use Soundex or Metaphone to translate each sentence into a list of phonemes, then compare the most similar lists.
There are Python implementations of the double-metaphone algorithm available.
