Data Cleanse: Grouping within variable company names - python

I'm working on some research on nursing homes, which are often owned by a chain. We have a list of 9,000+ nursing homes with their corporate ownership. If I were MERGING this data into anything, I don't think this would be too much of a challenge, but I am being asked to group the facilities that are associated with each other for another analysis.
For example:
ABCM
ABCM CORP
ABCM CORPORATION
ABCM CORPORATE
I have already removed all the extra spaces, stripped non-alphanumeric characters, and upcased everything. I'm trying to think of a way to do this with roughly 90% accuracy. Matching within the same variable is the part that is throwing me off. I do have some other details such as ownership, state, ZIP, etc. I use Stata, SAS, and Python, if that helps!

Welcome to SO.
String matching is, broadly speaking, a pain, whatever software you are using, and in most cases it needs human intervention to yield satisfactory results.
In Stata you may want to try matchit (ssc install matchit) for fuzzy string matching. I won't go into the details (I suggest you look at the help file, which is pretty well laid out), but the command returns each string matched with multiple similar entries, where "similar" depends on the chosen method, and you can specify a threshold for the level of similarity kept or discarded.
Even with all the above options, though, the final step is up to you: my personal experience tells me that no matter how restrictive you are, you'll always end up with several "false positives" that you'll have to weed out yourself!
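If you end up doing this in Python instead, here is a minimal sketch of one normalize-then-fuzzy-match approach, using only the standard library. The suffix list and the 0.9 threshold are my assumptions; tune both to your data.

import re
from difflib import SequenceMatcher

# Hypothetical, non-exhaustive suffix list - extend it for your data.
SUFFIXES = r"\b(CORPORATION|CORPORATE|CORP|COMPANY|CO|INCORPORATED|INC|LLC|LTD)\b"

def normalize(name):
    """Upcase, drop common corporate suffixes, collapse whitespace."""
    name = re.sub(SUFFIXES, "", name.upper())
    return re.sub(r"\s+", " ", name).strip()

def group_names(names, threshold=0.9):
    """Greedy clustering: a name joins the first group whose normalized
    key is similar enough, otherwise it starts a new group."""
    groups = {}  # normalized key -> original names
    for name in names:
        key = normalize(name)
        for existing in groups:
            if SequenceMatcher(None, key, existing).ratio() >= threshold:
                groups[existing].append(name)
                break
        else:
            groups[key] = [name]
    return groups

names = ["ABCM", "ABCM CORP", "ABCM CORPORATION", "ABCM CORPORATE"]
print(group_names(names))
# {'ABCM': ['ABCM', 'ABCM CORP', 'ABCM CORPORATION', 'ABCM CORPORATE']}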
Good luck!

Related

Using a Decision Tree to build a Recommendations Application

First of all, my apologies if I am not following some of the best practices of this site; as you will see, my home is mostly MSE (Math Stack Exchange).
I am currently working on a project where I build a vacation recommendation system. The initial idea was somewhat akin to 20 questions: we ask the user certain questions, such as "Do you like museums?", "Do you like architecture?", "Do you like nightlife?", etc., and then based on these answers decide for the user their best vacation destination. We answer these questions based on keywords scraped from websites, and the decision tree we would implement would allow us to effectively determine the next question to ask a user. However, we are having some difficulties with the implementation. Some examples of our difficulties are as follows:
There are issues with the granularity of questions. For example, to say that a city is good for "nature lovers" is great, but this does not mean much. Nature could involve, say, hot, sunny, and wet vacations for some, whereas for others nature could involve a brisk hike in cool woods. Fortunately, the API we are currently using provides us with a list of attractions in a city, down to a fairly granular level (for example, it distinguishes between different watersport activities such as jet skiing or white water rafting). My question is: do we need to create some sort of hierarchy like:
nature -> (Ocean, Mountain, Plains); Mountain -> (Hiking, Skiing, ...)
or would it be best to simply include the bottom-level results (the activities themselves) and just ask questions about those? I only ask because I am unfamiliar with exactly how the classification is done and how the final output is produced. Is there a better structure that should be used?
Thank you very much for your help.
I think using a decision tree is a great idea for this problem. It might be an idea to group your granular activities, and for the "nature lovers" category list a number of different climate types: dry and sunny, coastal, forests, etc., with subcategories within them.
For the activities, you could make categories called watersports, sightseeing, and so on. It sounds like your dataset is more granular than you want your decision tree to be, but you can keep dividing that granularity into more categories on the tree until you reach a level you're happy with. It might be an idea to include images too, of each place and activity, maybe even without descriptive text.
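To make the grouping concrete, here is a minimal Python sketch of such a hierarchy; the categories and activities are invented placeholders, not output from your API.

# Hypothetical category tree - replace with groupings built from your API data.
HIERARCHY = {
    "nature": {
        "mountain": ["hiking", "skiing", "climbing"],
        "ocean": ["jet skiing", "white water rafting", "snorkelling"],
    },
    "culture": {
        "museums": ["art museum", "history museum"],
        "architecture": ["cathedral tour", "old town walk"],
    },
}

def categories_for(activity):
    """Map a granular activity up to its (top, sub) category labels."""
    return [
        (top, sub)
        for top, subs in HIERARCHY.items()
        for sub, activities in subs.items()
        if activity in activities
    ]

print(categories_for("hiking"))  # [('nature', 'mountain')]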
Bins and sub-bins are a good idea, as is the nature -> ocean_nature structure.
I was thinking more about your problem last night; TripAdvisor would be a good source. What I would do is take the top 10 items on TripAdvisor and categorize them by type.
Or maybe your tree narrows it down to 10 cities, and you then rank those cities according to popularity or distance from the user.
I'm not sure how to decide which city would be best for watersports, etc. You could even have cities pay to be at the top of the list.

How to classify unseen text data?

I am training a text classifier for addresses, i.e. one that decides whether a given sentence is an address or not.
Sentence examples:
(1) Mirdiff City Centre, DUBAI United Arab Emirates
(2) Ultron Inc. <numb> Toledo Beach Rd #1189 La Salle, MI <numb>
(3) Avenger - HEAD OFFICE P.O. Box <numb> India
As addresses can take many forms, it's very difficult to build such a classifier. Is there any pre-trained model or database for this, or any other non-ML way?
As mentioned earlier, verifying that an address is valid is probably better formalized as an information retrieval problem rather than a machine learning problem (e.g. by using a validation service).
However, from the examples you gave, it seems like you have several entity types that recur, such as organizations and locations.
I'd recommend enriching the data with a NER such as spaCy, and using the entity types either as features or in a rule.
Note that named-entity recognizers rely more on context than the typical bag-of-words classifier, and are usually more robust to unseen data.
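A minimal sketch of that enrichment with spaCy; it assumes the small English model has been installed separately (python -m spacy download en_core_web_sm), and the exact entities returned will depend on the model version.

import spacy

nlp = spacy.load("en_core_web_sm")

def entity_features(sentence):
    """Return (text, label) pairs; counts of GPE/LOC/ORG/FAC entities
    can feed a downstream classifier as features, or drive a rule."""
    doc = nlp(sentence)
    return [(ent.text, ent.label_) for ent in doc.ents]

print(entity_features("Ultron Inc. Toledo Beach Rd #1189 La Salle, MI"))
# Output along the lines of: [('Ultron Inc.', 'ORG'), ('La Salle', 'GPE'), ('MI', 'GPE')]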
When I did this the last time, the problem was very hard, especially since I had international addresses and the variation across countries is enormous. Add to that the variation introduced by people, and the problem becomes quite hard even for humans.
I finally built a heuristic (does it contain something like PO BOX, a likely country name (grep Wikipedia), maybe city names) and then threw every remaining maybe-address into the Google Maps API. Google Maps is quite good at recognizing addresses, but even it will produce false positives, so manual checking will most likely be needed.
I did not use ML because my address db was "large" but not large enough for training; in particular, we lacked labeled training data.
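A rough Python sketch of that kind of heuristic pre-filter; the patterns and the country list are illustrative stand-ins, not a complete rule set.

import re

# Illustrative country list - in practice, grep a full one from Wikipedia.
COUNTRIES = {"INDIA", "UNITED ARAB EMIRATES", "GERMANY", "USA"}

PATTERNS = [
    re.compile(r"\bP\.?\s?O\.?\s?BOX\b", re.IGNORECASE),   # PO Box variants
    re.compile(r"\b\d{1,5}\s+\w+(\s+\w+)*\s+(RD|ROAD|ST|STREET|AVE|AVENUE)\b", re.IGNORECASE),
    re.compile(r"\b\d{5}(-\d{4})?\b"),                     # US-style ZIP code
]

def looks_like_address(text):
    """Cheap pre-filter: anything that hits a pattern or mentions a known
    country goes on to the Google Maps API for the real check."""
    upper = text.upper()
    return any(c in upper for c in COUNTRIES) or any(p.search(text) for p in PATTERNS)

print(looks_like_address("Avenger - HEAD OFFICE P.O. Box 123 India"))  # True
print(looks_like_address("Call me at nine in the morning"))            # False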
As you are asking for literature recommendations (by the way, this question is probably too broad for this site), I can recommend a few links:
https://www.reddit.com/r/datasets/comments/4jz7og/how_to_get_a_large_at_least_100k_postal_address/
https://www.red-gate.com/products/sql-development/sql-data-generator/
https://openaddresses.io/
You need to build labeled data, as @Christian Sauer has already mentioned, where you have examples with addresses. You probably need to create false examples with non-addresses as well, for example sentences containing only telephone numbers or the like. Either way this will be quite an imbalanced dataset, as you will have a lot of correct addresses and only a few that are not addresses. In total you would need around 1,000 examples to have a starting point.
Another option is to identify the basic addresses manually and do a similarity analysis to find the sentences that are closest to them.
As mentioned by Uri Goren, this is a named entity recognition problem, and there are a lot of trained models on the market. Still, one of the best you can get is the Stanford NER.
https://nlp.stanford.edu/software/CRF-NER.shtml
It is a conditional random field NER, implemented in Java.
If you are looking for a Python implementation of the same, have a look at:
How to install and invoke Stanford NERTagger?
Here you can gather info from sequences of tags, such as a consecutive run of location tags. Even if that doesn't give you exactly the right result, it will still somehow get you closer to any address in the whole document. That's a head start.
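A minimal sketch of driving Stanford NER from Python via NLTK; the two paths point into the Stanford NER download and will differ on your machine (Java must be installed).

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# Paths into the unpacked Stanford NER download - adjust for your setup.
st = StanfordNERTagger(
    "english.all.3class.distsim.crf.ser.gz",  # CRF model (PERSON/ORGANIZATION/LOCATION)
    "stanford-ner.jar",                       # the NER jar itself
)

sentence = "Mirdiff City Centre, DUBAI United Arab Emirates"
print(st.tag(word_tokenize(sentence)))
# A run of consecutive LOCATION tags is a strong hint that the text is an address.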
Thanks.

Automatic tagging of words or phrases

I want to automatically tag a word/phrase with one of the defined words/phrases from a list. My list contains about 230 words in column A, which are tagged in column B. There are around 16 unique tags, and each of those 230 words is tagged with one of these 16 tags.
Have a look at my list:
The words/phrases in column A are tagged as words/phrases in column B.
From time to time, new words are added, for which a tag has to be assigned manually.
I want to build a predictive algorithm/model to tag new words automatically (or suggest a tag). So if I enter a new word, let's say 'MIP Reserve' (A36), then it should predict the tag as 'Escrow Deposits' (B36) and not 'Operating Reserve' (B33). How can I predict the tag of a new word precisely, even when the word does not match the words under its actual tag?
If someone is willing to see the full list, I can happily share.
Short version
I think your question is a little ill-defined and doesn't have a short coding or macro answer. Given that each item contains such little information, I don't think it is possible to build a good predictive model from your source data. Instead, do the tagging exercise once and look at how you control tagging in the future.
Long version
Here are the steps I would take to create a predictive model and why I don't think you can do this.
Understand why you want to have a predictive program at all
Why do you need a predictive program? Are you sorting through hundreds or thousands of records, all of which are changing and need tagging? If so, I agree, you wouldn't want to do this manually.
If this is a one-off exercise, because over time the tags have become corrupted from their original meaning, your problem is that your tags have become corrupted, not that you need to somehow predict where each item should be tagged. You should be looking at controlling use of the tags, not at predicting how people in the future might mistag or misname something.
Don't forget that there are lots of tools in Excel to make the problem easier. Let's say you know for certain that all items with 'cash' definitely go to 'Operating Cash'. Put an AutoFilter on the list and filter on the word 'cash' - now just copy and paste 'Operating Cash' next to all of these. This way, you can quickly get rid of the obvious ones from your list and focus on the tricky ones.
Understand the characteristics of the tags you want to use.
Take time to look at the tags you are using - what do each of them mean? What are the unique features or combinations of features that this tag is representing?
For example, your tag 'Operating Cash' carries the characteristics of being cash (i.e. not tied up so available for use fairly quickly) and as being earmarked for operations. From these, we could possibly derive further characteristics that it is held in a certain place, or a certain person has responsibility for it.
If you had more source data to go on, you could perhaps use fields such as 'year created', or 'customer' to help you categorise further.
Understand what it is about the items you want to tag that could give you an idea of where they should go.
This is your biggest problem. A quick example - what in the string "MIP Reserve" gives any clues that it should be linked to "Escrow Deposits"? You have no easy way of matching many of the items in your list - many words appear in multiple items across multiple tags.
However, try and look for unique identifiers that will give you clues - for example, all items with the word 'developer' seem to be tagged to 'Developer Fee Note & Interest'. Do you have any more of those? Use these to reduce your problem, since they should be a straightforward mapping.
Any unique identifiers will allow you to set up rules for these strings. You don't even need to stick to one word - perhaps when you see several words, you can narrow down where it will end up e.g. when I see 'egg' this could go into 'bird' or 'reptile', but if 'egg' is paired with 'wing', I can be fairly confident it's 'bird'.
You need to match the characteristics of the items you want to tag with the unique identifiers of the tags you developed in step 1.
Write a program or macro to look for the identifiers in step 2 and return the relevant tag from step 1.
This is the straightforward bit. Look for the identifiers you want (e.g. uses 'cash', contains tag 'Really Important Customer') and look for the best match in the tags you have earlier.
Ensure you catch any errors - what happens if no tag is found? Does it create a new one? Does it recommend contacting you for help? What happens if more than one tag is relevant? What are your tiebreaking criteria?
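As a toy Python illustration of that lookup; the identifier sets and tags below are invented, and in practice they would come from the tag characteristics worked out in step 1.

# Each rule: a set of identifying words -> the tag it maps to (all invented).
RULES = [
    ({"egg", "wing"}, "Bird"),
    ({"egg"}, "Reptile"),
    ({"cash"}, "Operating Cash"),
    ({"developer"}, "Developer Fee Note & Interest"),
]

def suggest_tag(item):
    """Return the tag whose identifiers best overlap the item's words;
    None means no rule fired and a human should review the item."""
    words = set(item.lower().split())
    best_tag, best_hits = None, 0
    for identifiers, tag in RULES:
        hits = len(identifiers & words)
        if hits > best_hits:
            best_tag, best_hits = tag, hits
    return best_tag

print(suggest_tag("Developer fee 2019"))  # 'Developer Fee Note & Interest'
print(suggest_tag("Egg with wing"))       # 'Bird' (two hits beat one)
print(suggest_tag("MIP Reserve"))         # None -> flag for manual review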
But be aware of...
Understand how you will control use of these unique identifiers.
Imagine you somehow manage to come up with a list of unique identifiers. How will you control their use? If you have decided to send any item with the word 'cash' to the tag 'Operating Cash' and then in a year, someone comes along and makes an item 'Capital Cash', because they want somewhere to put cash that is about to be spent on capital items, how do you stop this? How are you going to control use of these words?
You will effectively need to take control of the item naming system and set up an agreed list of identifying words. Whenever anyone makes an item, they need to include your identifiers somewhere. I can tell you that this will not work. Either they will use the wrong words and you will end up manually doing it anyway, or they will ring you up confused and you will end up manually doing it anyway.
If you are the only person doing this, just do the exercise once, to your own standard (that you record) and stick to that standard. When you need to hand it over, it's clearly ordered and makes sense. If more than one person is doing this, do the exercise once between you and the team and then agree a way of controlling it.
Writing a predictive program sounds great and might save you some time. But consider why you are writing it. Are you likely to need to tag accounts constantly in the future? If so, control their naming centrally and make it so a tag is mandatory when they are made. If not, why are you writing a program to do this? Just do it once, manually.

Named entity recognition: for new/latest entities

Sorry for the weird question title, but I couldn't think of a more appropriate one.
I'm new to NLP concepts, so I used the NER demo (http://cogcomp.cs.illinois.edu/demo/ner/results.php). Now the issue is how, and in what ways, I can use the tagging done by NER. What answers or inferences can one draw from named entities that have been tagged into groups such as location, person, and organization? And if I have data that contains names of entirely new companies, places, etc., how am I going to produce NER tags for such data?
Please don't downvote or block me; I just need guidance/expert suggestions, that's it. Reading about a concept is one thing, while knowing where and when to apply it is another, which is why I'm asking for guidance. Thanks a ton!
A snippet from the demo:
Dogs have been used in cargo areas for some time, but have just been introduced recently in
passenger areas at LOC Newark and LOC JFK airports. LOC JFK has one dog and LOC Newark has a
handful, PER Farbstein said.
Usually NER is a step in a pipeline. For example, once all entities have been tagged, if you have many sentences like [PER John Smith], CEO of [ORG IBM] said..., then you can set up a table of Companies and CEOs. This is a form of knowledge base population.
There are plenty of other uses, though, depending on the type of data you already have and what you are trying to accomplish.
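As a toy illustration of that kind of knowledge base population, here is a sketch that scrapes (company, CEO) pairs out of NER output written in the bracketed format used above.

import re

# NER output in the bracketed format from the example above.
TAGGED = "[PER John Smith] , CEO of [ORG IBM] said ..."

# Capture a PER entity followed by ", CEO of" and an ORG entity.
PATTERN = re.compile(r"\[PER ([^\]]+)\]\s*,\s*CEO of\s*\[ORG ([^\]]+)\]")

ceos = {org: per for per, org in PATTERN.findall(TAGGED)}
print(ceos)  # {'IBM': 'John Smith'}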
I think there are two parts in your question:
What is the purpose of NER?
This is a vast question. Generally NER is used for Information Retrieval (IR) tasks such as indexing, document classification, and Knowledge Base Population (KBP), but also many, many others (speech recognition, translation, ...). It's quite hard to give an exhaustive list.
How can NER be extended to also recognize new/unknown entities?
That is, how can we recognize entities that the NER system has never seen? At a glance, two solutions are likely to work:
Say you have some linked database that is updated on a regular basis: then the system may rely on generic categories. For instance, suppose "Marina Silva" comes up in the news and is added to the lexicon under the category "POLITICIAN". Since the system knows that every POLITICIAN should be tagged as a person, i.e. it relies on categories rather than on lexical items, it will tag "Marina Silva" as a PERS named entity. You don't have to re-train the whole system, just update its lexicon (see the sketch after this list).
Using morphological and contextual clues, the system may guess new named entities that it has never seen (and that are not in the lexicon). For instance, a rule like "The presidential candidate XXX YYY" (or "Marina YYY") will guess that "XXX YYY" (or just "YYY") is a PERS (or part of a PERS). This involves, most of the time, probabilistic modeling.
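A toy Python sketch of the first solution, just to make the category indirection concrete; every name and category here is invented.

# The tagger keys on categories, not on the names themselves, so adding
# a new name to the lexicon is enough - no retraining required.
LEXICON = {"POLITICIAN": {"Angela Merkel"}, "CITY": {"Paris"}}
CATEGORY_TO_TAG = {"POLITICIAN": "PERS", "CITY": "LOC"}

def tag(span):
    for category, names in LEXICON.items():
        if span in names:
            return CATEGORY_TO_TAG[category]
    return "O"  # outside any known entity

LEXICON["POLITICIAN"].add("Marina Silva")  # lexicon update only
print(tag("Marina Silva"))  # 'PERS'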
Hope this helps :)

A good way to make Django geolocation aware? - Django/Geolocation

I would like to be able to associate various models (Venues, places, landmarks) with a City/Country.
But I am not sure what good ways of implementing this would be.
I'm following a simple route, I have implemented a Country and City model.
Whenever a new city or country is mentioned it is automatically created.
Unfortunately I have various problems:
The database can easily be polluted
Django has no real knowledge of what those City/Countries really are
Any tips or ideas? Thanks! :)
A good starting place would be to get a location dataset from a service like GeoNames. There is also GeoDjango, which came up in this question.
As you encounter new location names, check them against your larger dataset before adding them. For your 2nd point, you'll need to design this into your object model and write your code accordingly.
Here are some other things you may want to consider:
Aliases & Abbreviations
These come up more than you would think. People often use the names of suburbs or neighbourhoods that aren't official towns. You should also consider abbreviations like LA -> Los Angeles, MTL -> Montreal, MT. Forest -> Mount Forest, Saint vs. ST/st./ST-, etc.
Fuzzy Search
Looking up city names is much easier when differences in spelling are accounted for. This also helps reduce the number of duplicate names for the same place.
You can do this by pre-computing the Soundex or Double Metaphone values for the cities in your data set. When performing a lookup, compute the value for the search term and compare against the pre-computed values. This will work best for English, but you may have success with other romance language derivatives (unsure what your options are beyond these).
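A minimal sketch of that precomputed-key lookup in Python; it uses the third-party jellyfish library for Metaphone, but any Soundex/Double Metaphone implementation would do.

import jellyfish

CITIES = ["Montreal", "Mount Forest", "Los Angeles", "Brooklyn"]

# Precompute the phonetic key once and store it alongside each city row.
INDEX = {}
for city in CITIES:
    INDEX.setdefault(jellyfish.metaphone(city), []).append(city)

def fuzzy_lookup(query):
    """Spelling variants collide on the same phonetic key."""
    return INDEX.get(jellyfish.metaphone(query), [])

print(fuzzy_lookup("Montreall"))  # ['Montreal'] - same Metaphone key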
Location Equivalence/Inclusion
Be able to determine that Brooklyn is a borough of New York City.
At the end of the day, this is a hard problem, but applying these suggestions should greatly reduce the amount of data corruption and other headaches you have to deal with.
Geocoding datasets from Yahoo and Google can be a good starting point. Also take a look at the geopy library for use with Django.
