How to fuzzy match two lists in Python

I have two lists: ref_list and inp_list. How can one make use of FuzzyWuzzy to match the input list from the reference list?
inp_list = pd.DataFrame(['ADAMS SEBASTIAN', 'HAIMBILI SEUN', 'MUTESI JOHN',
                         'SHEETEKELA MATT', 'MUTESI JOHN KUTALIKA',
                         'ADAMS SEBASTIAN HAUSIKU', 'PETERS WILSON',
                         'PETERS MARIO', 'SHEETEKELA MATT NICKY'],
                        columns=['Names'])
ref_list = pd.DataFrame(['ADAMS SEBASTIAN HAUSIKU', 'HAIMBILI MIKE', 'HAIMBILI SEUN',
                         'MUTESI JOHN KUTALIKA', 'PETERS WILSON MARIO',
                         'SHEETEKELA MATT NICKY MBILI'],
                        columns=['Names'])
After some research, I modified some code I found on the internet. The problem is that it only works well on small samples; in my case inp_list and ref_list are 29k and 18k entries long respectively, and the script takes more than a day to run.
Below is the code. First, a helper function is defined:
def match_term(term, inp_list, min_score=0):
    # -1 score in case I don't get any matches
    max_score = -1
    # return empty for no match
    max_name = ''
    # iterate over all names in the other
    for term2 in inp_list:
        # find the fuzzy match score
        score = fuzz.token_sort_ratio(term, term2)
        # checking if I am above my threshold and have a better score
        if (score > min_score) & (score > max_score):
            max_name = term2
            max_score = score
    return (max_name, max_score)

# list for dicts for easy dataframe creation
dict_list = []
# iterating over the sales file
for name in inp_list:
    # use the defined function above to find the best match, also set the threshold to a chosen #
    match = match_term(name, ref_list, 94)
    # new dict for storing data
    dict_ = {}
    dict_.update({'passenger_name': name})
    dict_.update({'match_name': match[0]})
    dict_.update({'score': match[1]})
    dict_list.append(dict_)
Where can this code be improved so that it runs faster, and can it perhaps avoid re-evaluating items that have already been assessed?

You can try to vectorize the operations instead of evaluating the scores in a loop.
Make a DataFrame where the first column ref holds ref_list and the second column inp holds each name in inp_list. Then call df.apply(lambda row: process.extractOne(row['inp'], row['ref']), axis=1). You'll get the best matching name in ref_list and its score for each name in inp_list, as in the sketch below.
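A minimal sketch of that pattern, assuming fuzzywuzzy is installed and inp_list / ref_list are the DataFrames from the question (note that apply still loops under the hood, so this mainly tidies the code rather than guaranteeing a speed-up):
import pandas as pd
from fuzzywuzzy import process

# one row per input name; every row carries the full reference list
df = pd.DataFrame({
    'inp': inp_list['Names'],
    'ref': [ref_list['Names'].tolist()] * len(inp_list),
})

# process.extractOne returns a (best_match, score) tuple
results = df.apply(lambda row: process.extractOne(row['inp'], row['ref']), axis=1)
df['match_name'] = [r[0] for r in results]
df['score'] = [r[1] for r in results]
print(df[['inp', 'match_name', 'score']])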

The measures you are using are computationally demanding for that many pairs of strings. As an alternative to fuzzywuzzy, you could try the string_grouper library, which uses a much faster TF-IDF approach with cosine similarity to find similar strings. As an example:
import random, string, time
import pandas as pd
from string_grouper import match_strings

alphabet = list(string.ascii_lowercase)
from_r, to_r = 0, len(alphabet) - 1
random_strings_1 = ["".join(alphabet[random.randint(from_r, to_r)]
                            for i in range(6)) for j in range(5000)]
random_strings_2 = ["".join(alphabet[random.randint(from_r, to_r)]
                            for i in range(6)) for j in range(5000)]

series_1 = pd.Series(random_strings_1)
series_2 = pd.Series(random_strings_2)

t_1 = time.time()
matches = match_strings(series_1, series_2, min_similarity=0.6)
t_2 = time.time()

print(t_2 - t_1)
print(matches)
It takes less than one second to do 25,000,000 comparisons! For a more thorough test of the library, look here: https://bergvca.github.io/2017/10/14/super-fast-string-matching.html where it is claimed that
"Using this approach made it possible to search for near duplicates in a set of 663,000 company names in 42 minutes using only a dual-core laptop".
To tune your matching further, look at the **kwargs arguments you can pass to the match_strings function above; a sketch applied to the question's own data follows.
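As a rough sketch only (not a tested drop-in), the same call could be pointed at the question's DataFrames; the similarity threshold here is an arbitrary choice:
import pandas as pd
from string_grouper import match_strings

# inp_list and ref_list are the DataFrames from the question
matches = match_strings(ref_list['Names'], inp_list['Names'],
                        min_similarity=0.8)  # threshold is an assumption; tune as needed
print(matches)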

Related

Speeding up fuzzy match on large list

I am working on a project that uses fuzzy matching on a list of names that could reach about 100,000 unique records. In a recent screening we conducted, the function we use takes about 2.20 seconds per name on average, which means a list of 10,000 names could take 6 hours. That is far too long.
Is there a way that we can speed up our process? Here's the snippet of the script that we use.
# Importing packages
import pandas as pd
import Levenshtein as lev

# Reading cleaned datasets
df_name_reference = pd.read_csv('path_to_file')
df_name_to_screen = pd.read_csv('path_to_file')

# Function used in name screening
def get_similarity_score(s1, s2):
    ''' Return match percentage between 2 strings disregarding name swapping

    Parameters
    -----------
    s1 : str : name from df_name_reference (to be used within pandas apply)
    s2 : str : name from df_name_to_screen (ref_name variable)

    Return
    -----------
    float
    '''
    # Get sorted names
    s1_sort = ' '.join(sorted(s1.split(' '))).strip() if type(s1) == str else ''
    s2_sort = ' '.join(sorted(s2.split(' '))).strip() if type(s2) == str else ''

    # Get ratios and return the max value
    # THIS COULD BE THE BOTTLENECK OF OUR SCRIPT: MORE DETAILS BELOW
    return max([
        lev.ratio(s1, s2),
        lev.ratio(s1_sort, s2),
        lev.ratio(s1, s2_sort),
        lev.ratio(s1_sort, s2_sort)
    ])

# Returning file
screening_results = []
for row in range(df_name_to_screen.shape[0]):
    # Get name to screen
    ref_name = df_name_to_screen.loc[row, 'fullname']
    # Get scores
    scores = df_name_reference.fullname.apply(lev.ratio, args=(ref_name,))
    # Append results
    screening_results.append(pd.DataFrame({'screened_name': ref_name, 'scores': scores}))
I took four scores from lev.ratio to address variations in the arrangement of names, i.e. firstname-lastname and lastname-firstname formats. I know that the fuzzywuzzy package has token_sort_ratio, but I've noticed that it just splits the name parts and sorts them alphabetically, which leads to lower scores. Plus, fuzzywuzzy is slower than Levenshtein. So I had to capture the similarity scores of sorted and unsorted names manually.
Can anyone give an approach that I could try? Thanks!
EDIT: Here's a sample dataset that you may try. This is in Google Drive.
In case you don't need scores for all entries in the reference data but just the top N then you can use difflib.get_close_matches to remove the others before calculating any scores:
import difflib

screening_results = []
for row in range(df_name_to_screen.shape[0]):
    ref_name = df_name_to_screen.loc[row, 'fullname']
    skimmed = pd.DataFrame({
        'fullname': difflib.get_close_matches(
            ref_name,
            df_name_reference.fullname,
            N_RESULTS,  # number of top candidates to keep
            0           # cutoff of 0 so the N_RESULTS best candidates are always returned
        )
    })
    scores = skimmed.fullname.apply(lev.ratio, args=(ref_name,))
    screening_results.append(pd.DataFrame({'screened_name': ref_name, 'scores': scores}))
This takes about 50ms per row using the file you provided.

Python closest match between two string columns

I am looking to get the closest match between two columns of string data type in two separate tables. I don't think the content matters too much. There are words that I can match by pre-processing the data (lowercasing all letters, removing spaces and stop words, etc.) and doing a join. However, I only get around 80 matches out of over 350. It is important to know that the two tables have different lengths.
I did try to use some code I found online but it isn't working:
def Races_chien(df1, df2):
    myList = []
    total = len(df1)
    possibilities = list(df2['Rasse'])
    s = SequenceMatcher(isjunk=None, autojunk=False)
    for idx1, df1_str in enumerate(df1['Race']):
        my_str = ('Progress : ' + str(round((idx1 / total) * 100, 3)) + '%')
        sys.stdout.write('\r' + str(my_str))
        sys.stdout.flush()
        # get 1 best match that has a ratio of at least 0.7
        best_match = get_close_matches(df1_str, possibilities, 1, 0.7)
        s.set_seq2(df1_str, best_match)
        myList.append([df1_str, best_match, s.ratio()])
    return myList
It says: TypeError: set_seq2() takes 2 positional arguments but 3 were given
How can I make this work?
I think you need the s.set_seqs(df1_str, best_match) function instead of s.set_seq2(df1_str, best_match) (docs); see the sketch below.
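A minimal sketch of that fix, with the added caveat (mine, not the answer's) that get_close_matches returns a list, so the first element is what you want to compare:
from difflib import SequenceMatcher, get_close_matches

# stand-in values; in the question these come from df1['Race'] and df2['Rasse']
df1_str = 'border collie'
possibilities = ['border collie dog', 'poodle']

s = SequenceMatcher(isjunk=None, autojunk=False)
best_match = get_close_matches(df1_str, possibilities, 1, 0.7)
if best_match:                          # get_close_matches returns a (possibly empty) list
    s.set_seqs(df1_str, best_match[0])  # set both sequences in one call
    print([df1_str, best_match[0], s.ratio()])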
You can use the jellyfish library, which has useful tools for comparing how similar two strings are, if that is what you are looking for; a tiny example follows.
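For instance, using jellyfish's Levenshtein distance (other measures are listed in its docs):
import jellyfish

# 0 means identical; larger values mean more edits are needed
print(jellyfish.levenshtein_distance('PETERS WILSON', 'PETERS WILSON MARIO'))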
Try changing:
s = SequenceMatcher(isjunk=None, autojunk=False)
To:
s = SequenceMatcher(None, isjunk=None, autojunk=False)
Here is an answer I finally got:
from fuzzywuzzy import process, fuzz

value = []
similarity = []
for i in df1.col:
    ratio = process.extract(i, df2.col, limit=1)
    value.append(ratio[0][0])
    similarity.append(ratio[0][1])

df1['value'] = pd.Series(value)
df1['similarity'] = pd.Series(similarity)
This adds a column to df1 with the closest matching value from df2, together with the similarity percentage.

What is the fastest way to match, replace, and extract substrings from pandas dataframe with multiple criteria?

I have an approximately 1 million row pandas dataframe containing data parsed from federal appellate court opinions. I need to extract the names of judges hearing the cases. The data has an unknown number of judges per case (one row), contained in a string. That string (currently stored in a single column) contains a lot of excess text and has inconsistent formatting and capitalization. I use different dictionaries of judge names (with up to 2,575 regex keys) to match the judges listed, based on the multiple criteria described below. I use the dictionary with the most stringent matching criteria first and gradually loosen the criteria. I also remove the matched string from the source column. The current methods that I have tried are simply too slow (taking days, weeks, or even months).
The reason there are multiple possible dictionaries is that many judges share the same (last) names. The strings don't ordinarily include full names. I use data contained in two other columns to get the right match: year the case was decided and the court hearing the case (both integers). I also have higher and lower quality substring search terms. The dictionaries I use can be recreated at will in different formats besides regex if needed.
The fastest solution I have tried was crude and unpythonic. In the initial parsing of the data (extraction of sections and keywords from raw text files), which occurs on a case-by-case basis, I did the following: 1) removed excess text to the degree possible, 2) sorted the remaining text into a list stored within a pandas column, 3) concatenated as strings the year and court to each item in that list, and 4) matched that concatenated string to a dictionary that I had similarly prepared. That dictionary didn't use regular expressions and had approximately 800,000 keys. That process took about a day (with all of the other parsing involved as well) and was not as accurate as I would have liked (because it omitted certain name format permutations).
The code below contains my most recent attempt (which is currently running and looks to be among the slowest options yet). It creates subset dictionaries on the fly and still ends up iterating through those smaller dictionaries with regex keys. I've read through and tried to apply solutions from many stackoverflow questions, but couldn't find a workable solution. I'm open to any python-based idea. The data is real data that I've cleaned with a prior function.
import numpy as np
import pandas as pd

test_data = {'panel_judges': ['CHAGARES, VANASKIE, SCHWARTZ',
                              'Sidney R. Thomas, Barry G. Silverman, Raymond C. Fisher, Opinion by Thomas'],
             'court_num': [3, 9],
             'date_year': [2014, 2014]}
test_df = pd.DataFrame(data=test_data)

name_dict = {'full_name': ['Chagares, Michael A.',
                           'Vanaskie, Thomas Ignatius',
                           'Schwartz, Charles, Jr.',
                           'Schwartz, Edward Joseph',
                           'Schwartz, Milton Lewis',
                           'Schwartz, Murray Merle'],
             'court_num': [3, 3, 1061, 1097, 1058, 1013],
             'circuit_num': [3, 3, 5, 9, 9, 3],
             'start_year': [2006, 2010, 1976, 1968, 1979, 1974],
             'end_year': [2016, 2019, 2012, 2000, 2005, 2013],
             'hq_match': ['M(?=ICHAEL)? ?A?(?=\.)? ?CHAGARES',
                          'T(?=HOMAS)? ?I?(?=GNATIUS)? ?VANASKIE',
                          'C(?=HARLES)? SCHWARTZ',
                          'E(?=DWARD)? ?J?(?=OSEPH)? ?SCHWARTZ',
                          'M(?=ILTON)? ?L?(?=EWIS)? ?SCHWARTZ',
                          'M(?=URRAY)? ?M?(?=ERLE)? ?SCHWARTZ'],
             'lq_match': ['CHAGARES',
                          'VANASKIE',
                          'SCHWARTZ',
                          'SCHWARTZ',
                          'SCHWARTZ',
                          'SCHWARTZ']}
names = pd.DataFrame(data=name_dict)

in_col = 'panel_judges'
year_col = 'date_year'
out_col = 'fixed_panel'
court_num_col = 'court_num'

test_df[out_col] = ''
test_df[out_col].astype(list, inplace=True)

def judge_matcher(df, in_col, out_col, year_col, court_num_col,
                  size_column=None):
    general_cols = ['start_year', 'end_year', 'full_name']
    court_cols = ['court_num', 'circuit_num']
    match_cols = ['hq_match', 'lq_match']
    for match_col in match_cols:
        for court_col in court_cols:
            lookup_cols = general_cols + [court_col] + [match_col]
            judge_df = names[lookup_cols]
            for year in range(df[year_col].min(),
                              df[year_col].max() + 1):
                for court in range(df[court_num_col].min(),
                                   df[court_num_col].max() + 1):
                    lookup_subset = ((judge_df['start_year'] <= year)
                                     & (year < (judge_df['end_year'] + 2))
                                     & (judge_df[court_col] == court))
                    new_names = names.loc[lookup_subset]
                    df_subset = ((df[year_col] == year)
                                 & (df[court_num_col] == court))
                    df.loc[df_subset] = matcher(df.loc[df_subset],
                                                in_col, out_col, new_names, match_col)
    return df

def matcher(df, in_col, out_col, lookup, keys):
    patterns = dict(zip(lookup[keys], lookup['full_name']))
    for key, value in patterns.items():
        df[out_col] = (
            np.where(df[in_col].astype(str).str.upper().str.contains(key),
                     df[out_col] + value + ', ', df[out_col]))
        df[in_col] = df[in_col].astype(str).str.upper().str.replace(key, '')
    return df

df = judge_matcher(test_df, in_col, out_col, year_col, court_num_col)
The output I currently get is essentially right (although the names should be sorted and in a list). The proper "Schwartz" is picked and the matches are all correct. The problem is speed.
My goal is to have a deduplicated, alphabetically sorted list of judges on each panel, either stored in a single column or exploded into up to 15 separate columns (I presently do that in a separate vectorized function). I will then do other lookups on those judges based upon other demographic and biographical information. The produced data will be openly available to researchers in the area, and the code will be part of a free, publicly available platform usable for studying other courts as well. So accuracy and speed are both important considerations for users on many different machines.
For anyone who stumbles across this question and has a similar complex string matching issue in pandas, this is the solution I found to be the fastest.
It isn't fully vectorized like I wanted, but I used df.apply with this method within a class:
def judge_matcher(self, row, in_col, out_col, year_col, court_num_col,
                  size_col=None):
    final_list = []
    raw_list = row[in_col]
    cleaned_list = [x for x in raw_list if x]
    cleaned_list = [x.strip() for x in cleaned_list]
    for name in cleaned_list:
        name1 = self.convert_judge_name(row[year_col],
                                        row[court_num_col], name, 1)
        name2 = self.convert_judge_name(row[year_col],
                                        row[court_num_col], name, 2)
        if name1 in self.names_dict_list[0]:
            final_list.append(self.names_dict_list[0].get(name1))
        elif name1 in self.names_dict_list[1]:
            final_list.append(self.names_dict_list[1].get(name1))
        elif name2 in self.names_dict_list[2]:
            final_list.append(self.names_dict_list[2].get(name2))
        elif name2 in self.names_dict_list[3]:
            final_list.append(self.names_dict_list[3].get(name2))
        elif name in self.names_dict_list[4]:
            final_list.append(self.names_dict_list[4].get(name))
    final_list = list(unique_everseen(final_list))  # unique_everseen e.g. from more_itertools
    final_list.sort()
    row[out_col] = final_list
    if size_col and final_list:
        row[size_col] = len(final_list)
    return row

@staticmethod
def convert_judge_name(year, court, name, dict_type):
    if dict_type == 1:
        return str(int(court) * 10000 + int(year)) + name
    elif dict_type == 2:
        return str(int(year)) + name
    else:
        return name
Basically, it concatenates three columns together and performs hashed dictionary lookups (instead of regexes) with the concatenated strings. Multiplication is used to efficiently place the two numbers side by side as a single string. The dictionaries had similarly prepared keys (and the values are the desired strings). By using lists and then deduplicating, I didn't have to remove the matched strings. I didn't time this specific function, but the overall module took just over 10 hours to process ~1 million rows. When I run it again, I will try to remember to time this applied function specifically and post the results here. The method is ugly, but fairly effective.
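As an illustration only (not the author's actual preprocessing), the lookup dictionaries implied by convert_judge_name could be built along these lines; the column names and dictionary order here are assumptions:
# Hypothetical sketch: one dict keyed by court*10000+year plus last name,
# and one keyed by year plus last name, mirroring convert_judge_name above.
import pandas as pd

names = pd.DataFrame({
    'full_name': ['Chagares, Michael A.', 'Vanaskie, Thomas Ignatius'],
    'last_name': ['CHAGARES', 'VANASKIE'],   # assumed helper column
    'court_num': [3, 3],
    'start_year': [2006, 2010],
    'end_year': [2016, 2019],
})

court_year_dict = {}
year_dict = {}
for _, r in names.iterrows():
    for year in range(r['start_year'], r['end_year'] + 1):
        key_prefix = str(int(r['court_num']) * 10000 + year)
        court_year_dict[key_prefix + r['last_name']] = r['full_name']
        year_dict[str(year) + r['last_name']] = r['full_name']

# e.g. court_year_dict['32014CHAGARES'] -> 'Chagares, Michael A.'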

python: data cleaning - detect pattern for fraudulent email addresses

I am cleaning a data set with fraudulent email addresses that I am removing.
I established multiple rules for catching duplicates and fraudulent domains, but there is one scenario where I can't think of how to code a rule in Python to flag them.
So I have for example rules like this:
# delete punctuation
df['email'] = df['email'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation]))

# flag yopmail
pattern = "yopmail"
match = df['email'].str.contains(pattern)
df['yopmail'] = np.where(match, 'Y', '0')

# flag duplicates
df['duplicate'] = df.email.duplicated(keep=False)
This is the data where I can't figure out a rule to catch it. Basically, I am looking for a way to flag addresses that start the same way but then have consecutive numbers at the end.
abc7020@gmail.com
abc7020.1@gmail.com
abc7020.10@gmail.com
abc7020.11@gmail.com
abc7020.12@gmail.com
abc7020.13@gmail.com
abc7020.14@gmail.com
abc7020.15@gmail.com
attn1@gmail.com
attn12@gmail.com
attn123@gmail.com
attn1234@gmail.com
attn12345@gmail.com
attn123456@gmail.com
attn1234567@gmail.com
attn12345678@gmail.com
My solution isn't efficient or pretty, but check it out and see if it works for you @jeangelj. It definitely works for the examples you provided. Good luck!
import os
from random import shuffle
from difflib import SequenceMatcher

emails = [... ...]  # for example the 16 email addresses you gave in your question
shuffle(emails)          # everyday i'm shuffling
emails = sorted(emails)  # sort that shit!
names = [email.split('@')[0] for email in emails]

T = 0.7  # <- set your string similarity threshold here!!

split_indices = []
for i in range(1, len(emails)):
    if SequenceMatcher(None, emails[i], emails[i-1]).ratio() < T:
        split_indices.append(i)  # we want to remember where a dissimilar email address occurs

grouped = []
for i in split_indices:
    grouped.append(emails[:i])
    grouped.append(emails[i:])

# now we have similar email addresses grouped, we want to find the common prefix for each group
prefix_strings = []
for group in grouped:
    prefix_strings.append(os.path.commonprefix(group))

# finally
ham = []
spam = []
true_ids = [names.index(p) for p in prefix_strings]
for i in range(len(emails)):
    if i in true_ids:
        ham.append(emails[i])
    else:
        spam.append(emails[i])
In [30]: ham
Out[30]: ['abc7020@gmail.com', 'attn1@gmail.com']

In [31]: spam
Out[31]:
['abc7020.10@gmail.com',
 'abc7020.11@gmail.com',
 'abc7020.12@gmail.com',
 'abc7020.13@gmail.com',
 'abc7020.14@gmail.com',
 'abc7020.15@gmail.com',
 'abc7020.1@gmail.com',
 'attn12345678@gmail.com',
 'attn1234567@gmail.com',
 'attn123456@gmail.com',
 'attn12345@gmail.com',
 'attn1234@gmail.com',
 'attn123@gmail.com',
 'attn12@gmail.com']
# THE TRUTH YALL!
You can use a regular expression to do this; example below:
import re

a = "attn12345@gmail.com"
b = "abc7020.14@gmail.com"
c = "abc7020@gmail.com"
d = "attn12345678@gmail.com"

pattern = re.compile(r"[0-9]{3,500}\.?[0-9]{0,500}?@")

if pattern.search(a):
    print("spam1")
if pattern.search(b):
    print("spam2")
if pattern.search(c):
    print("spam3")
if pattern.search(d):
    print("spam4")
If you run the code you will see:
$ python spam.py
spam1
spam2
spam3
spam4
The benefit of this method is that it is standardized (regular expressions) and that you can adjust the strength of the match easily by adjusting the values within {}; this means you could keep a global configuration file where you set and adjust those values. You can also adjust the regular expression easily without having to rewrite code. A sketch of that idea follows.
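A small sketch of the configurable-bounds idea (the config dict and its keys are hypothetical, not part of the answer above):
import re

# hypothetical configuration values; in practice these might live in a config file
config = {"min_digits": 3, "max_digits": 500}

pattern = re.compile(
    r"[0-9]{{{min_digits},{max_digits}}}\.?[0-9]{{0,{max_digits}}}?@".format(**config)
)

print(bool(pattern.search("attn12345@gmail.com")))  # True -> flagged
print(bool(pattern.search("jane.doe@gmail.com")))   # False -> not flagged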
First take a look at regexp question here
Second, try to filter the email address like this:
# Let's say the email is 'attn1234@gmail.com'
email = 'attn1234@gmail.com'
email_name = email.split('@', maxsplit=1)[0]
# Here you get email_name = 'attn1234'

import re
m = re.search(r'\d+$', email_name)
# if the string ends in digits, m will be a Match object, or None otherwise
if m is not None:
    print('%s is good' % email)
else:
    print('%s is BAD' % email)
You could pick a diff threshold using edit distance (aka Levenshtein distance). In python:
$ pip install editdistance
$ ipython2
>>> import editdistance
>>> threshold = 5  # This could be anything, really
>>> data = ["attn1@gmail.com...", ...]  # set up data to be the set you gave
>>> fraudulent_emails = set([email for email in data for other in data
...                          if email != other and editdistance.eval(email, other) < threshold])
If you wanted to be smarter about it, you could run through the resulting list and, instead of turning it into a set, keep track of how many other email addresses each one was near, then use that count as a 'weight' to determine fake-ness; a sketch of that idea follows.
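A rough sketch of the weighting idea (my interpretation of the suggestion above, with an arbitrary threshold and made-up sample data):
import editdistance
from collections import Counter

threshold = 5
data = ["attn1@gmail.com", "attn12@gmail.com", "attn123@gmail.com", "real.person@gmail.com"]

# count, for each address, how many *other* addresses are within the threshold
near_counts = Counter()
for email in data:
    for other in data:
        if email != other and editdistance.eval(email, other) < threshold:
            near_counts[email] += 1

# a higher count suggests the address is part of a fraudulent cluster
for email, weight in near_counts.most_common():
    print(email, weight)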
This catches not only the given cases (where the fraudulent addresses all share a common start and differ only in a numerical suffix), but also number or letter padding at the beginning or in the middle of an email address.
import numpy as np

# email_list: the list of addresses to screen
ids = [s.split('@')[0] for s in email_list]
det = np.zeros((len(ids), len(ids)), dtype=bool)
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        mi = ids[i]
        mj = ids[j]
        if len(mj) == len(mi) + 1 and mj.startswith(mi):
            try:
                int(mj[-1])
                det[j, i] = True
                det[i, j] = True
            except ValueError:
                continue

spam_indices = np.where(np.sum(det, axis=0) != 0)[0].tolist()
Here's one way to approach it that should be pretty efficient.
We do it by grouping the email addresses by length, so that we only need to check whether each email address matches the level one character shorter, using a slice and a set membership check.
The code:
First, read in the data:
import pandas as pd
import numpy as np

string = '''
abc7020@gmail.com
abc7020.1@gmail.com
abc7020.10@gmail.com
abc7020.11@gmail.com
abc7020.12@gmail.com
abc7020.13@gmail.com
abc7020.14@gmail.com
abc7020.15@gmail.com
attn1@gmail.com
attn12@gmail.com
attn123@gmail.com
attn1234@gmail.com
attn12345@gmail.com
attn123456@gmail.com
attn1234567@gmail.com
attn12345678@gmail.com
foo123@bar.com
foo1@bar.com
'''
x = pd.DataFrame({'x': string.split()})

# remove duplicates:
x = x[~x.x.duplicated()]
We strip off the @foo.bar part, then filter to only those that end with a number, and add a 'length' column:
# split on @, expand means into two columns
emails = x.x.str.split('@', expand=True)

# filter to rows whose name part ends in a digit
emails = emails.loc[emails.loc[:, 0].str[-1].str.isdigit(), :]

# add a length-of-email column for the next step
emails['lengths'] = emails.loc[:, 0].str.len()
Now all we have to do is take each length and the length minus one, and see whether each email, with its last character dropped, appears in the set of emails that are one character shorter (and we also check the opposite direction, so we don't miss the shortest member of a group):
# unique lengths to check
lengths = emails.lengths.unique()

# mask to hold results
mask = pd.Series([0] * len(emails), index=emails.index)

# for each length
for j in lengths:
    # we subset those of that length
    totest = emails['lengths'] == j
    # and those who might be the shorter version
    against = emails['lengths'] == j - 1
    # we make a set of unique values, for a hashed lookup
    againstset = set([i for i in emails.loc[against, 0]])
    # we cut off the last char of each one to test
    tests = emails.loc[totest, 0].str[:-1]
    # we check matches, by checking the set
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value=0)
    # vice versa, otherwise we miss the smallest one in the group
    againstset = set([i for i in emails.loc[totest, 0].str[:-1]])
    tests = emails.loc[against, 0]
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value=0)
The resulting mask can be converted to boolean and used to subset the original (deduplicated) dataframe; the indices match the original indices, so we can subset like this:
x.loc[~mask.astype(bool), :]
                    x
0   abc7020@gmail.com
16     foo123@bar.com
17       foo1@bar.com
You can see that we have not removed your first value, since the '.' means it did not match; you could remove the punctuation first.
I have an idea on how to solve this: fuzzywuzzy.
Create a set of unique emails, loop over them, and compare them with fuzzywuzzy.
Example:
import re
from fuzzywuzzy import fuzz

for email in emailset:
    for row in data:
        emailcomp = re.search(pattern=r'(.+)@.+', string=email).groups()[0]
        rowemail = re.search(pattern=r'(.+)@.+', string=row['email']).groups()[0]
        if row['email'] == email:
            continue
        elif fuzz.partial_ratio(emailcomp, rowemail) > 80:
            'flagging operation'
I took some liberties with how the data is represented, but I feel the variable names are mnemonic enough for you to understand what I am getting at. It is a very rough piece of code, in that I have not thought through how to stop repetitive flagging.
Anyway, the elif part compares the two email addresses without @gmail.com (or any other domain, e.g. @yahoo.com); if the ratio is above 80 (play around with this number), use your flagging operation.
For example:
fuzz.partial_ratio("abc7020.1", "abc7020")
100

put stockprices into groups when they are within 0.5% of each other

Thanks for the answers. I have not used StackOverflow before, so I was surprised by the number of answers and the speed of them - it's fantastic.
I have not been through the answers properly yet, but thought I should add some information to the problem specification. See the image below.
I can't post an image here because I don't have enough points, but you can see one at http://journal.acquitane.com/2010-01-20/image003.jpg
This image may describe more closely what I'm trying to achieve. The horizontal lines across the page are price points on the chart. Where you get a clustering of lines within 0.5% of each other, this is considered a good thing, which is why I want to identify those clusters automatically. You can see on the chart that there are clusters at S2 & MR1 and at R2 & WPP1.
So every day I produce these price points, and I can then manually identify those that are within 0.5% - but the purpose of this question is how to do it with a Python routine.
I have reproduced the list again (see below) with labels. Just be aware that the list price points don't match the price points in the image because they are from two different days.
[YR3,175.24,8]
[SR3,147.85,6]
[YR2,144.13,8]
[SR2,130.44,6]
[YR1,127.79,8]
[QR3,127.42,5]
[SR1,120.94,6]
[QR2,120.22,5]
[MR3,118.10,3]
[WR3,116.73,2]
[DR3,116.23,1]
[WR2,115.93,2]
[QR1,115.83,5]
[MR2,115.56,3]
[DR2,115.53,1]
[WR1,114.79,2]
[DR1,114.59,1]
[WPP,113.99,2]
[DPP,113.89,1]
[MR1,113.50,3]
[DS1,112.95,1]
[WS1,112.85,2]
[DS2,112.25,1]
[WS2,112.05,2]
[DS3,111.31,1]
[MPP,110.97,3]
[WS3,110.91,2]
[50MA,110.87,4]
[MS1,108.91,3]
[QPP,108.64,5]
[MS2,106.37,3]
[MS3,104.31,3]
[QS1,104.25,5]
[SPP,103.53,6]
[200MA,99.42,7]
[QS2,97.05,5]
[YPP,96.68,8]
[SS1,94.03,6]
[QS3,92.66,5]
[YS1,80.34,8]
[SS2,76.62,6]
[SS3,67.12,6]
[YS2,49.23,8]
[YS3,32.89,8]
I did make a mistake with the original list in that Group C is wrong and should not be included. Thanks for pointing that out.
Also the 0.5% is not fixed this value will change from day to day, but I have just used 0.5% as an example for spec'ing the problem.
Thanks Again.
Mark
PS. I will get cracking on checking the answers now.
Hi:
I need to do some manipulation of stock prices. I have just started using Python, (but I think I would have trouble implementing this in any language). I'm looking for some ideas on how to implement this nicely in python.
Thanks
Mark
Problem:
I have a list of lists (FloorLevels, see below) where each sublist has two items (stockprice, weight). I want to put the stockprices into groups when they are within 0.5% of each other. A group's strength will be determined by its total weight. For example:
Group-A
115.93,2
115.83,5
115.56,3
115.53,1
-------------
TotalWeight:12
-------------
Group-B
113.50,3
112.95,1
112.85,2
-------------
TotalWeight:6
-------------
FloorLevels = [
    [175.24, 8],
    [147.85, 6],
    [144.13, 8],
    [130.44, 6],
    [127.79, 8],
    [127.42, 5],
    [120.94, 6],
    [120.22, 5],
    [118.10, 3],
    [116.73, 2],
    [116.23, 1],
    [115.93, 2],
    [115.83, 5],
    [115.56, 3],
    [115.53, 1],
    [114.79, 2],
    [114.59, 1],
    [113.99, 2],
    [113.89, 1],
    [113.50, 3],
    [112.95, 1],
    [112.85, 2],
    [112.25, 1],
    [112.05, 2],
    [111.31, 1],
    [110.97, 3],
    [110.91, 2],
    [110.87, 4],
    [108.91, 3],
    [108.64, 5],
    [106.37, 3],
    [104.31, 3],
    [104.25, 5],
    [103.53, 6],
    [99.42, 7],
    [97.05, 5],
    [96.68, 8],
    [94.03, 6],
    [92.66, 5],
    [80.34, 8],
    [76.62, 6],
    [67.12, 6],
    [49.23, 8],
    [32.89, 8],
]
I suggest a repeated use of k-means clustering -- let's call it KMC for short. KMC is a simple and powerful clustering algorithm... but it needs to "be told" how many clusters, k, you're aiming for. You don't know that in advance (if I understand you correctly) -- you just want the smallest k such that no two items "clustered together" are more than X% apart from each other. So, start with k equal 1 -- everything bunched together, no clustering pass needed;-) -- and check the diameter of the cluster (a cluster's "diameter", from the use of the term in geometry, is the largest distance between any two members of a cluster).
If the diameter is > X%, set k += 1, perform KMC with k as the number of clusters, and repeat the check, iteratively.
In pseudo-code:
def markCluster(items, threshold):
    k = 1
    clusters = [items]
    maxdist = diameter(items)
    while maxdist > threshold:
        k += 1
        clusters = Kmc(items, k)
        maxdist = max(diameter(c) for c in clusters)
    return clusters
assuming of course we have suitable diameter and Kmc Python functions.
Does this sound like the kind of thing you want? If so, then we can move on to show you how to write diameter and Kmc (in pure Python if you have a relatively limited number of items to deal with, otherwise maybe by exploiting powerful third-party add-on frameworks such as numpy) -- but it's not worthwhile to go to such trouble if you actually want something pretty different, whence this check!-)
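In the spirit of that answer, here is a minimal pure-Python sketch of what diameter and Kmc could look like for these 1-D price values (my illustration, not the answerer's code; Kmc is a very plain Lloyd-style k-means, and diameter is expressed as a percentage so it matches the 0.5% threshold):
import random

def diameter(items):
    # items is a list of [price, weight] pairs; diameter = largest price gap, in percent
    prices = [p for p, w in items]
    return (max(prices) - min(prices)) / min(prices) * 100

def Kmc(items, k, iterations=50):
    # plain 1-D k-means on the price values
    prices = sorted(p for p, w in items)
    centers = random.sample(prices, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for item in items:
            price = item[0]
            idx = min(range(k), key=lambda i: abs(price - centers[i]))
            clusters[idx].append(item)
        centers = [sum(p for p, w in c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

# usage with the markCluster pseudo-code above, 0.5% threshold:
# groups = markCluster(FloorLevels, 0.5)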
A stock s belongs in a group G if for each stock t in G, s * 1.05 >= t and s / 1.05 <= t, right?
How do we add the stocks to each group? If we have the stocks 95, 100, 101, and 105, and we start a group with 100, then add 101, we will end up with {100, 101, 105}. If we did 95 after 100, we'd end up with {100, 95}.
Do we just need to consider all possible permutations? If so, your algorithm is going to be inefficient.
You need to specify your problem in more detail. Just what does "put the stockprices into groups when they are within 0.5% of each other" mean?
Possibilities:
(1) each member of the group is within 0.5% of every other member of the group
(2) sort the list and split it where the gap is more than 0.5%
Note that 116.23 is within 0.5% of 115.93 -- abs((116.23 / 115.93 - 1) * 100) < 0.5 -- but you have put one number in Group A and one in Group C.
Simple example: a, b, c = (0.996, 1, 1.004) ... Note that a and b fit, b and c fit, but a and c don't fit. How do you want them grouped, and why? Is the order in the input list relevant?
Possibility (1) produces ab,c or a,bc ... tie-breaking rule, please
Possibility (2) produces abc (no big gaps, so only one group); a sketch of that approach is below.
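For illustration only, a small sketch of possibility (2) (sort the list and split it wherever the gap between consecutive prices exceeds the threshold), using the FloorLevels list from the question:
def split_at_gaps(floor_levels, threshold_pct=0.5):
    # floor_levels: list of [price, weight]; returns groups split wherever the
    # gap between consecutive (sorted) prices exceeds threshold_pct percent
    levels = sorted(floor_levels, key=lambda pw: pw[0], reverse=True)
    groups = [[levels[0]]]
    for prev, cur in zip(levels, levels[1:]):
        gap_pct = abs(prev[0] / cur[0] - 1) * 100
        if gap_pct > threshold_pct:
            groups.append([cur])
        else:
            groups[-1].append(cur)
    return groups

# each group's strength is its total weight
for group in split_at_gaps(FloorLevels):
    print(group, 'TotalWeight:', sum(w for p, w in group))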
You won't be able to classify them into hard "groups". If you have prices (1.0,1.05, 1.1) then the first and second should be in the same group, and the second and third should be in the same group, but not the first and third.
A quick, dirty way to do something that you might find useful:
def make_group_function(tolerance=0.05):
    from math import log10, floor
    # I forget why this works.
    tolerance_factor = -1.0 / (-log10(1.0 + tolerance))
    # well ... since you might ask
    # we want: log(x)*tf - log(x*(1+t))*tf = -1,
    # so every 5% change has a different group. The minus is just so groups
    # are ascending .. it looks a bit nicer.
    #
    # tf = -1/(log(x)-log(x*(1+t)))
    # tf = -1/(log(x/(x*(1+t))))
    # tf = -1/(log(1/(1*(1+t)))) # solved .. but let's just be more clever
    # tf = -1/(0-log(1*(1+t)))
    # tf = -1/(-log((1+t))
    def group_function(value):
        # don't just use int - it rounds up below zero, and down above zero
        return int(floor(log10(value) * tolerance_factor))
    return group_function
Usage:
group_function = make_group_function()

import random
groups = {}
for i in range(50):
    v = random.random() * 500 + 1000
    group = group_function(v)
    if group in groups:
        groups[group].append(v)
    else:
        groups[group] = [v]

for group in sorted(groups):
    print('Group', group)
    for v in sorted(groups[group]):
        print(v)
    print()
For a given set of stock prices, there is probably more than one way to group stocks that are within 0.5% of each other. Without some additional rules for grouping the prices, there's no way to be sure an answer will do what you really want.
Apart from the proper way to pick which values fit together, this is a problem where a little object orientation can make things a lot easier to deal with.
I made two classes here, with a minimum of desirable behaviors, but they can make the classification a lot easier -- you get a single point to play with it on the Group class.
I can see the code below is incorrect, in the sense that the limits for group inclusion vary as new members are added -- even if the separation criterion remains the same, you'd have to rewrite the get_groups method to use a multi-pass approach. It shouldn't be hard -- but the code would be too long to be helpful here, and I think this snippet is enough to get you going:
from copy import copy

class Group(object):
    def __init__(self, data=None, name=""):
        if data:
            self.data = data
        else:
            self.data = []
        self.name = name

    def get_mean_stock(self):
        return sum(item[0] for item in self.data) / len(self.data)

    def fits(self, item):
        if 0.995 < abs(item[0]) / self.get_mean_stock() < 1.005:
            return True
        return False

    def get_weight(self):
        return sum(item[1] for item in self.data)

    def __repr__(self):
        return "Group-%s\n%s\n---\nTotalWeight: %d\n\n" % (
            self.name,
            "\n".join("%.02f, %d" % tuple(item) for item in self.data),
            self.get_weight())

class StockGrouper(object):
    def __init__(self, data=None):
        if data:
            self.floor_levels = data
        else:
            self.floor_levels = []

    def get_groups(self):
        groups = []
        floor_levels = copy(self.floor_levels)
        name_ord = ord("A") - 1
        while floor_levels:
            seed = floor_levels.pop(0)
            name_ord += 1
            group = Group([seed], chr(name_ord))
            groups.append(group)
            to_remove = []
            for i, item in enumerate(floor_levels):
                if group.fits(item):
                    group.data.append(item)
                    to_remove.append(i)
            for i in reversed(to_remove):
                floor_levels.pop(i)
        return groups
testing:
floor_levels = [[stock, weight], ...]  # <paste the data above>
s = StockGrouper(floor_levels)
s.get_groups()
For the grouping element, could you use itertools.groupby()? As the data is sorted, a lot of the work of grouping it is already done; you could then test whether the current value in the iteration differs from the last by less than 0.5%, and have itertools.groupby() break into a new group every time your key function changes. A rough sketch of that idea is below.
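A rough sketch of that groupby idea (my illustration; the stateful key function is one assumption about how the suggestion might be implemented, using the FloorLevels list from the question):
from itertools import groupby

def gap_key(threshold_pct=0.5):
    # returns a key function that bumps a group id whenever the price drops
    # by more than threshold_pct percent relative to the previous price
    state = {'prev': None, 'group': 0}
    def key(item):
        price = item[0]
        if state['prev'] is not None and abs(state['prev'] / price - 1) * 100 > threshold_pct:
            state['group'] += 1
        state['prev'] = price
        return state['group']
    return key

levels = sorted(FloorLevels, key=lambda pw: pw[0], reverse=True)
for _, grp in groupby(levels, key=gap_key()):
    grp = list(grp)
    print(grp, 'TotalWeight:', sum(w for p, w in grp))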
