The title for this one was quite tricky.
I'm trying to solve the following scenario.
Imagine a survey was sent out to XXXXX people, asking them what their favourite football club was.
From the responses, it's obvious that while many favour the same club, they all "expressed" it in different ways.
For example,
For Manchester United, some variations include...
Man U
Man Utd.
Manchester U
Manchester Utd
All are obviously the same club; however, with a simple technique of just looking for an exact string match, each would be a separate result.
Now, to complicate the scenario further, let's say that because of the sheer volume of different clubs (e.g. Man City appearing as M. City, Manchester City, etc.), each plagued with the same problem, it's impossible to manually "enter" these variances and use them to build a custom filter that converts all of Man U -> Manchester United, Man Utd. -> Manchester United, and so on. Instead we want to automate this filter, to look for the most likely match and convert the data accordingly.
I'm trying to do this in Python (from a .csv file), but I welcome any pseudocode answers that outline a good approach to solving this.
Edit: Some additional information
This isn't working off a set list of clubs; the idea is to "cluster" the ones we have together.
The assumption is there are no spelling mistakes.
There is no assumed number of clubs.
And the survey list is long, long enough that it doesn't warrant doing this manually (1000s of responses).
Google Refine does just this, but I'll assume you want to roll your own.
Note, difflib is built into Python, and has lots of features (including eliminating junk elements). I'd start with that.
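For instance, SequenceMatcher (the engine behind get_close_matches) takes an isjunk argument; in this minimal sketch, treating spaces and periods as junk is my own choice for illustration:

import difflib

# compare two club strings, treating ' ' and '.' as junk characters
sm = difflib.SequenceMatcher(lambda ch: ch in " .", "Man Utd.", "Manchester United")
print(sm.ratio())  # similarity ratio in [0, 1]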
You probably don't want to do it in a completely automated fashion. I'd do something like this:
import difflib

# load corrections file, mapping user input -> canonical output, into `corrections`
# load survey responses into `survey`

possible_values = list(corrections.values())

for answer in survey:
    output = corrections.get(answer)
    if output is None:
        likely_outputs = difflib.get_close_matches(answer, possible_values)
        output = get_user_to_select_output_or_add_new(likely_outputs)
        corrections[answer] = output
        possible_values.append(output)

# save corrections as CSV
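To give a feel for get_close_matches on its own, here's a standalone sketch; the canonical club list and the cutoff value are invented for illustration (the default cutoff of 0.6 can be too strict for very short abbreviations):

import difflib

canonical = ["Manchester United", "Manchester City", "Arsenal"]

for raw in ["Man Utd.", "Manchester U", "M. City"]:
    # n=1: return at most one candidate; cutoff: minimum similarity ratio
    print(raw, "->", difflib.get_close_matches(raw, canonical, n=1, cutoff=0.3))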
Please edit your question with answers to the following:
You say "we want to automate this filter, to look for the most likely match" -- match to what?? Do you have a list of the standard names of all of the possible football clubs, or do the many variations of each name need to be clustered to create such a list?
How many clubs?
How many survey responses?
After doing very light normalisation (replace "." by a space, strip leading/trailing whitespace, replace runs of whitespace by a single space, convert to lower case [in that order]; see the sketch after this list) and counting, how many unique responses do you have?
Your focus seems to be on abbreviations of the standard name. Do you need to cope with nicknames e.g. Gunners -> Arsenal, Spurs -> Tottenham Hotspur? Acronyms (WBA -> West Bromwich Albion)? What about spelling mistakes, keyboard mistakes, SMS-dialect, ...? In general, what studies of your data have you done and what were the results?
You say """its impossible to manually "enter" these variances""" -- is it possible/permissible to "enter" some "variances" e.g. to cope with nicknames as above?
What are your criteria for success in this exercise, and how will you measure it?
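For concreteness, a minimal sketch of that light normalisation (the function name and the survey_responses list are hypothetical):

import re

def normalise(response):
    response = response.replace(".", " ")      # replace '.' by a space
    response = response.strip()                # strip leading/trailing whitespace
    response = re.sub(r"\s+", " ", response)   # collapse whitespace runs
    return response.lower()                    # lower-case last

# unique = {normalise(r) for r in survey_responses}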
It seems to me that you could convert many of these into a standard form by taking the string, lower-casing it, removing all punctuation, then comparing the start of each word.
If you had a list of all the actual club names, you could compare directly against that as well; and for strings which don't match first-n-letters to any actual team, you could try lexicographical comparison against any of the returned strings which actually do match.
It's not perfect, but it should get you 99% of the way there.
import string

def words(s):
    # lower-case and strip punctuation from the ends before splitting
    s = s.lower().strip(string.punctuation)
    return s.split()

def bestMatchingWord(word, matchWords):
    # score `word` against each candidate by counting positionally
    # matching characters, normalised by the word's length
    score, best = 0., ''
    for matchWord in matchWords:
        matchScore = sum(w == m for w, m in zip(word, matchWord)) / (len(word) + 0.01)
        if matchScore > score:
            score, best = matchScore, matchWord
    return score, best

def bestMatchingSentence(wordList, matchSentences):
    # pick the candidate team whose words best match `wordList` overall
    score, best = 0., []
    for matchSentence in matchSentences:
        total, matched = 0., []
        for word in wordList:
            s, w = bestMatchingWord(word, matchSentence)
            total += s
            matched.append(w)
        if total > score:
            score, best = total, matched
    return score, best

def main():
    data = (
        "Man U",
        "Man. Utd.",
        "Manch Utd",
        "Manchester U",
        "Manchester Utd",
    )
    teamList = (
        ('arsenal',),
        ('aston', 'villa'),
        ('birmingham', 'city', 'bham'),
        ('blackburn', 'rovers', 'bburn'),
        ('blackpool', 'bpool'),
        ('bolton', 'wanderers'),
        ('chelsea',),
        ('everton',),
        ('fulham',),
        ('liverpool',),
        ('manchester', 'city', 'cty'),
        ('manchester', 'united', 'utd'),
        ('newcastle', 'united', 'utd'),
        ('stoke', 'city'),
        ('sunderland',),
        ('tottenham', 'hotspur'),
        ('west', 'bromwich', 'albion'),
        ('west', 'ham', 'united', 'utd'),
        ('wigan', 'athletic'),
        ('wolverhampton', 'wanderers'),
    )
    for d in data:
        print("{0:20} {1}".format(d, bestMatchingSentence(words(d), teamList)))

if __name__ == "__main__":
    main()
Run on the sample data, this gets you:
Man U (1.9867767507647776, ['manchester', 'united'])
Man. Utd. (1.7448074166742613, ['manchester', 'utd'])
Manch Utd (1.9946817328797555, ['manchester', 'utd'])
Manchester U (1.989100008901989, ['manchester', 'united'])
Manchester Utd (1.9956787398647866, ['manchester', 'utd'])
Problem:
One day, Mary bought a one-way ticket from somewhere to somewhere, with some flight transfers.
For example: SFO->DFW DFW->JFK JFK->MIA MIA->ORD.
Obviously, transferring at a city twice or more doesn't make any sense, so Mary will not do that.
Unfortunately, after she received the tickets, she mixed them up and forgot the order of the tickets.
Help Mary rearrange the tickets into the correct order.
Input:
The first line contains the number of test cases T, after which T cases follow.
Each case starts with an integer N; N flight tickets follow.
For each ticket, the next two lines contain the source and destination airport codes.
Output:
For each test case, output one line containing "Case #x: itinerary", where x is the test case number (starting from 1) and the itinerary is a sorted list of flight tickets that represent the actual itinerary.
Each flight segment in the itinerary should be output as a pair of source-destination airport codes.
Sample Input:

2
1
SFO
DFW
4
MIA
ORD
DFW
JFK
SFO
DFW
JFK
MIA

Sample Output:

Case #1: SFO-DFW
Case #2: SFO-DFW DFW-JFK JFK-MIA MIA-ORD
My question:
I am a beginner in competitive programming. My question is how to interpret the given input in this case: how did Googlers program this input? When I write a function with a Python list as its argument, will that argument arrive in a ready-to-use list format, or will I need to deal with the above-mentioned T and N numbers in the input and then arrange the airport strings into a list myself before passing it to the function?
I have looked at the following official Google Kickstart Python solution to this problem and was confused by how they simply pass the ticket_list argument to the function. Don't they need to clear the input of the numbers T and N and then arrange the airport strings into a list, as I explained above?
Also, I could not understand how the attributes first and second can simply appear when no class has been initialized, but I think that should be another question...
def print_itinerary(ticket_list):
    arrival_map = {}
    destination_map = {}
    for ticket in ticket_list:
        arrival_map[ticket.second] += 1
        destination_map[ticket.first] += 1
    current = FindStart(arrival_map)
    while current in destination_map:
        next = destination_map[current]
        print current + "-" + next
        current = next
You need to implement it yourself to read data from standard input and write results to standard output.
Sample code for reading from standard input and writing to standard output can be found in the coding section of the FAQ on the KickStart Web site.
If you write the solution to this problem in Python, you can get T and N as follows:
T = int(input())
for t in range(1, T + 1):
    N = int(input())
    ...
Then, if you want to get the source and destination of each flight ticket as a list, you can use the same input method to collect them into a list.
ticket_list = [[input(), input()] for _ in range(N)]
# [['MIA', 'ORD'], ['DFW', 'JFK'], ['SFO', 'DFW'], ['JFK', 'MIA']]
If you want to use first and second, try a namedtuple.
from collections import namedtuple

Pair = namedtuple('Pair', ['first', 'second'])
ticket_list = [Pair(input(), input()) for _ in range(N)]
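Putting those pieces together, a complete read-solve-print loop might look like the sketch below; solve is a hypothetical stand-in for whatever itinerary-reconstruction logic you implement:

from collections import namedtuple

Pair = namedtuple('Pair', ['first', 'second'])

def solve(ticket_list):
    # hypothetical placeholder: reorder the tickets and return
    # "SRC-DST SRC-DST ..." as a single string
    raise NotImplementedError

T = int(input())
for t in range(1, T + 1):
    N = int(input())
    ticket_list = [Pair(input(), input()) for _ in range(N)]
    print("Case #{}: {}".format(t, solve(ticket_list)))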
I’m trying to extract specific substrings from larger phrases contained in my Pandas dataframe. I have rows formatted like so:
Appointment of DAVID MERRIGAN of Hammonds Plains, Nova Scotia, to be a member of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of CARLA R. CONKIN of Fort Steele, British Columbia, to be Vice-Chairman of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of JUDY A. WHITE, Q.C., of Conne River, Newfoundland and Labrador, to be Chairman of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of GRETA SITTICHINLI of Inuvik, Northwest Territories, to be a member of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
and I've been able to capture the capitalized names (e.g. DAVID MERRIGAN) with the regex below, but I'm struggling to capture the locations, i.e. the 'of' phrase following the capitalized name that ends at the second comma. I've tried to isolate the rest of the string that follows the name with the following code, but it just doesn't seem to work; I keep getting -1 as a response.
df_appointments['Name'] = df_appointments['Precis'].str.find(r'\b[A-Z]+(?:\s+[A-Z]+)')
df_appointments['Location'] = df_appointments['Precis'].str.find(r'\b[A-Z]+(?:\s+[A-Z]+)\b\s([^\n\r]*)')
Any help showing me how to isolate the location substring with regex (after that I can figure out how to get the position, etc) would be tremendously appreciated. Thank you.
The following pattern works for your sample set:
rgx = r'(?:\w\s)+([A-Z\s\.,]+)(?:\sof\s)([A-Za-z\s]+,\s[A-Za-z\s]+)'
It uses capture groups & non-capture groups to isolate only the names & locations from the strings. Rather than requiring two patterns, and having to perform two searches, you can then do the following to extract that information into two new columns:
df[['name', 'location']] = df['precis'].str.extract(rgx)
This then produces:
df
precis name location
0 Appointment of... DAVID MERRIGAN Hammonds Plains, Nova Scotia
1 Appointment of... CARLA R. CONKIN Fort Steele, British Columbia
2 Appointment of... JUDY A. WHITE, Q.C., Conne River, Newfoundland and...
3  Appointment of...  GRETA SITTICHINLI     Inuvik, Northwest Territories
Depending on the exact format of all of your precis values, you might have to tweak the pattern to suit perfectly, but hopefully it gets you going...
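As a small variation on the same idea, str.extract also accepts named groups, so the pattern itself can name the resulting columns; this is just the pattern above with (?P<...>) group names added:

rgx = r'(?:\w\s)+(?P<name>[A-Z\s\.,]+)(?:\sof\s)(?P<location>[A-Za-z\s]+,\s[A-Za-z\s]+)'
df[['name', 'location']] = df['precis'].str.extract(rgx)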
# Final Answer
import pandas as pd

data = pd.read_csv(r"C:\Users\yueheng.li\Desktop\Up\Question_20220824\Data.csv")

# split the precis on the first two occurrences of 'of'
data[['Field_Part1', 'Field_Part2', 'Field_Part3']] = data['Precis'].str.split('of', n=2, expand=True)
data['Address_part1'] = data['Field_Part3'].str.split(',').str[0]
data['Address_part2'] = data['Field_Part3'].str.split(',').str[1]
data['Address'] = data['Address_part1'] + ',' + data['Address_part2']
data.drop(['Field_Part1', 'Field_Part2', 'Field_Part3', 'Address_part1', 'Address_part2'], axis=1, inplace=True)
# Output Below
data
An easy way to understand it.
Thanks,
Leon
Here is my code
import json

data = []
with open("review.json") as f:
    for line in f:
        data.append(json.loads(line))

lst_string = []
lst_num = []
for i in range(len(data)):
    if data[i]["stars"] == 5.0:
        x = data[i]["text"]
        for word in x.split():
            if word in lst_string:
                lst_num[lst_string.index(word)] += 1
            else:
                lst_string.append(word)
                lst_num.append(1)

result = set(zip(lst_string, lst_num))
print(result)

with open("set.txt", "w") as g:
    g.write(str(result))
I'm trying to write a set of all words in reviews that were given 5 stars from a pulled in json file formatted like
{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,"useful":6,"funny":1,"cool":0,"text":"Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.","date":"2013-05-07 04:34:36"}
{"review_id":"GJXCdrto3ASJOqKeVWPi6Q","user_id":"yXQM5uF2jS6es16SJzNHfg","business_id":"NZnhc2sEQy3RmzKTZnqtwQ","stars":1.0,"useful":0,"funny":0,"cool":0,"text":"I *adore* Travis at the Hard Rock's new Kelly Cardenas Salon! I'm always a fan of a great blowout and no stranger to the chains that offer this service; however, Travis has taken the flawless blowout to a whole new level! \n\nTravis's greets you with his perfectly green swoosh in his otherwise perfectly styled black hair and a Vegas-worthy rockstar outfit. Next comes the most relaxing and incredible shampoo -- where you get a full head message that could cure even the very worst migraine in minutes --- and the scented shampoo room. Travis has freakishly strong fingers (in a good way) and use the perfect amount of pressure. That was superb! Then starts the glorious blowout... where not one, not two, but THREE people were involved in doing the best round-brush action my hair has ever seen. The team of stylists clearly gets along extremely well, as it's evident from the way they talk to and help one another that it's really genuine and not some corporate requirement. It was so much fun to be there! \n\nNext Travis started with the flat iron. The way he flipped his wrist to get volume all around without over-doing it and making me look like a Texas pagent girl was admirable. It's also worth noting that he didn't fry my hair -- something that I've had happen before with less skilled stylists. At the end of the blowout & style my hair was perfectly bouncey and looked terrific. The only thing better? That this awesome blowout lasted for days! \n\nTravis, I will see you every single time I'm out in Vegas. You make me feel beauuuutiful!","date":"2017-01-14 21:30:33"}
{"review_id":"2TzJjDVDEuAW6MR5Vuc1ug","user_id":"n6-Gk65cPZL6Uz8qRm3NYw","business_id":"WTqjgwHlXbSFevF32_DJVw","stars":1.0,"useful":3,"funny":0,"cool":0,"text":"I have to say that this office really has it together, they are so organized and friendly! Dr. J. Phillipp is a great dentist, very friendly and professional. The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable! I don't have dental insurance, but they have this insurance through their office you can purchase for $80 something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit! I highly recommend this office for the nice synergy the whole office has!","date":"2016-11-09 20:09:03"}
{"review_id":"yi0R0Ugj_xUx_Nek0-_Qig","user_id":"dacAIZ6fTM6mqwW5uxkskg","business_id":"ikCg8xy5JIg_NGPx-MSIDA","stars":1.0,"useful":0,"funny":0,"cool":0,"text":"Went in for a lunch. Steak sandwich was delicious, and the Caesar salad had an absolutely delicious dressing, with a perfect amount of dressing, and distributed perfectly across each leaf. I know I'm going on about the salad ... But it was perfect.\n\nDrink prices were pretty good.\n\nThe Server, Dawn, was friendly and accommodating. Very happy with her.\n\nIn summation, a great pub experience. Would go again!","date":"2018-01-09 20:56:38"}
{"review_id":"yi0R0Ugj_xUx_Nek0-_Qig","user_id":"dacAIZ6fTM6mqwW5uxkskg","business_id":"ikCg8xy5JIg_NGPx-MSIDA","stars":5.0,"useful":0,"funny":0,"cool":0,"text":"a b aa bb a b","date":"2018-01-09 20:56:38"}
but it uses up all the memory on my computer before it can write the output to a text file. How can I do this in a less memory-intensive way?
Only get text where stars == 5:
Data:
Based on the question, the data is a file with one JSON object per line.
Get the text into a list:
Given the data from the Yelp Challenge, getting the 5-star text into a list doesn't take that much memory.
The Windows resource manager showed an increase of about 1.3GB, but the object size of text_list was only about 25MB.
import json

text_list = list()
with open("review.json", encoding="utf8") as f:
    for line in f:
        line = json.loads(line)
        if line['stars'] == 5:
            text_list.append(line['text'])

print(text_list)
>>> ['Test text, example 1!', 'Test text, example 2!']
Extra:
Everything after loading the data seems to require a lot of memory that isn't being released.
When cleaning the text, Windows resource manager went up by 16GB, though the final size of clean_text was also only about 25MB.
Interestingly, deleting clean_text does not release the 16GB of memory.
In Jupyter Lab, restarting the Kernel will release the memory
In PyCharm, stopping the process also releases the memory
I tried manually running the garbage collector, but that didn't release the memory
Clean text_list:
import string

def clean_string(value: str) -> list:
    value = value.lower()
    value = value.translate(str.maketrans('', '', string.punctuation))
    value = value.split()
    return value

clean_text = [clean_string(item) for item in text_list]
print(clean_text)
>>> [['test', 'text', 'example', '1'], ['test', 'text', 'example', '2']]
Count words in clean_text:
from collections import Counter

words = Counter()
for item in clean_text:
    words.update(item)

print(words)
>>> Counter({'test': 2, 'text': 2, 'example': 2, '1': 1, '2': 1})
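If memory is still a concern, the three steps can be folded into a single streaming pass, so that only the Counter (one entry per unique word) is ever held in memory and no per-review data is retained. A sketch, assuming the same one-JSON-object-per-line review.json format:

import json
import string
from collections import Counter

words = Counter()
table = str.maketrans('', '', string.punctuation)

with open("review.json", encoding="utf8") as f:
    for line in f:
        review = json.loads(line)
        if review['stars'] == 5:
            # clean and count in one step; the raw text is discarded immediately
            words.update(review['text'].lower().translate(table).split())

with open("set.txt", "w") as g:
    g.write(str(words))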
Let's say I have a file containing data on users and their favourite movies.
Ace: FANTASTIC FOUR, IRONMAN
Jane: EXOTIC WILDLIFE, TRANSFORMERS, NARNIA
Jack: IRONMAN, FANTASTIC FOUR
and based on that, the program I'm about to write returns the names of the users who like the same movies.
Since Ace and Jack like the same movies, they will be partners, hence the program would output:
Movies: FANTASTIC FOUR, IRONMAN
Partners: Ace, Jack
Jane would be left out since she doesn't share the same interest in movies with anyone else.
The problem I'm having is figuring out how radix sort would help me achieve this, which I've been thinking about all day. I don't have much knowledge of radix sort, but I know that it compares elements one by one; I'm terribly confused by cases such as FANTASTIC FOUR being arranged first in Ace's data and second in Jack's data.
Would anyone kindly explain an algorithm I could understand to achieve this output?
Can you show us how you sort your lists? The quick and dirty code below gives me the same output for sorted Ace and Jack.
Ace = ["FANTASTIC FOUR", "IRONMAN"]
Jane = ["EXOTIC WILDLIFE", "TRANSFORMERS", "NARNIA"]
Jack = ["IRONMAN", "FANTASTIC FOUR"]
sorted_Ace = sorted(Ace)
print (sorted_Ace)
sorted_Jack = sorted(Jack)
print (sorted_Jack)
You could start comparing elements one by one from here.
I made you a quick solution that shows how you can proceed; it's not optimized at all and not generalized.
Ace = ["FANTASTIC FOUR", "IRONMAN"]
Jane = ["EXOTIC WILDLIFE", "TRANSFORMERS", "NARNIA"]
Jack = ["IRONMAN", "FANTASTIC FOUR"]

Movies = []
Partners = []

sorted_Ace = sorted(Ace)
sorted_Jane = sorted(Jane)
sorted_Jack = sorted(Jack)

for i in range(len(sorted_Ace)):
    if sorted_Ace[i] == sorted_Jack[i]:
        Movies.append(sorted_Ace[i])

if len(Movies) == len(sorted_Ace):
    Partners.append("Ace")
    Partners.append("Jack")

print(Movies)
print(Partners)
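To generalise this to any number of users, you can skip the element-by-element comparison entirely and group users by an order-independent key, such as a frozenset of their movies. A sketch using the same sample data:

from collections import defaultdict

users = {
    "Ace": ["FANTASTIC FOUR", "IRONMAN"],
    "Jane": ["EXOTIC WILDLIFE", "TRANSFORMERS", "NARNIA"],
    "Jack": ["IRONMAN", "FANTASTIC FOUR"],
}

groups = defaultdict(list)
for name, movies in users.items():
    groups[frozenset(movies)].append(name)

for movies, names in groups.items():
    if len(names) > 1:  # only report users who share an identical movie list
        print("Movies:", ", ".join(sorted(movies)))
        print("Partners:", ", ".join(names))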
I have a dataframe column with documents like
38909 Hotel is an old style Red Roof and has not bee...
38913 I will never ever stay at this Hotel again. I ...
38914 After being on a bus for -- hours and finally ...
38918 We were excited about our stay at the Blu Aqua...
38922 This hotel has a great location if you want to...
Name: Description, dtype: object
I have a bag of words like keys = ['Hotel','old','finally'], but the actual length of keys is 44312.
Currently I'm using
df.apply(lambda x : sum([i in x for i in keys ]))
which gives the following output based on the sample keys:
38909 2
38913 2
38914 3
38918 0
38922 1
Name: Description, dtype: int64
When I apply this to the actual data for just 100 rows, timeit gives
1 loop, best of 3: 5.98 s per loop
and I have 50000 rows. Is there a faster way of doing the same in nltk or pandas?
EDIT:
In case you're looking for the document array:
array([ 'Hotel is an old style Red Roof and has not been renovated up to the new standard, but the price was also not up to the level of the newer style Red Roofs. So, in overview it was an OK stay, and a safe',
'I will never ever stay at this Hotel again. I stayed there a few weeks ago, and I had my doubts the second I checked in. The guy that checked me in, I think his name was Julio, and his name tag read F',
"After being on a bus for -- hours and finally arriving at the Hotel Lawerence at - am, I bawled my eyes out when we got to the room. I realize it's suppose to be a boutique hotel but, there was nothin",
"We were excited about our stay at the Blu Aqua. A new hotel in downtown Chicago. It's architecturally stunning and has generally good reviews on TripAdvisor. The look and feel of the place is great, t",
'This hotel has a great location if you want to be right at Times Square and the theaters. It was an easy couple of blocks for us to go to theater, eat, shop, and see all of the activity day and night '], dtype=object)
The following code is not exactly equivalent to your (slow) version, but it demonstrates the idea:
keyset = frozenset(keys)
df.apply(lambda x : len(keyset.intersection(x.split())))
Differences/limitations:
In your version a word is counted even if it is contained as a substring in a word in the document. For example, had your keys contained the word tyl, it would be counted due to the occurrence of "style" in your first document.
My solution doesn't account for punctuation in the documents. For example, the word again in the second document comes out of split() with the full stop attached to it. That can be fixed by preprocessing the document (or postprocessing the result of the split()) with a function that removes the punctuation, as sketched below.
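For instance, a minimal sketch of that punctuation fix, stripping punctuation from each document before splitting (reusing keys and df from the question):

import string

table = str.maketrans('', '', string.punctuation)
keyset = frozenset(keys)
df.apply(lambda x: len(keyset.intersection(x.translate(table).split())))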
It seems you can just use np.char.count -
[np.count_nonzero(np.char.count(i, keys)) for i in arr]
Might be better to feed a boolean array for counting -
[np.count_nonzero(np.char.count(i, keys)!=0) for i in arr]
If you only need to check whether the values in the list are present:
from numpy.core.defchararray import find
v = df['col'].values.astype(str)
a = (find(v[:, None], keys) >= 0).sum(axis=1)
print (a)
[2 1 1 0 0]
Or:
df = pd.concat([df['col'].str.contains(x) for x in keys], axis=1).sum(axis=1)
print (df)
38909 2
38913 1
38914 1
38918 0
38922 0
dtype: int64