Format a python list and search for patterns - python

I am getting rows from a spreadsheet with mixtures of numbers, text and dates
I want to find elements within the list, some numbers and some text
for example
sg = [500782, u'BMOU9015488', u'SD4', u'CLOSED', -1, '', '', -1]
sg = map(str, sg)
#sg = map(unicode, sg) #option?
if any("-1" in s for s in sg):
#do something if matched
I don't feel this is the correct way to do this, I am also trying to match stuff like -1.5 and -1.5C and other unexpected characters like OPEN15 compared to 15
I have also looked at
sg.index("-1")
If positive then its a match (Only good for direct matches)
Some help would be appreciated

If you want to call a function for each case, I would do it this way:
def stub1(elem):
#do something for match of type '-1'
return
def stub2(elem):
#do something for match of type 'SD4'
return
def stub3(elem):
#do something for match of type 'OPEN15'
return
sg = [500782, u'BMOU9015488', u'SD4', u'CLOSED', -1, '', '', -1]
sg = map(unicode, sg)
patterns = {u"-1":stub1, u"SD4": stub2, u"OPEN15": stub3} # add more if you want
for elem in sg:
for k, stub in patterns.iteritems():
if k in elem:
stub(elem)
break
Where stub1, stub2, ... are the fonctions that contains the code for each case.
It will be called (max 1 time per strings) if the string contains a matching substring.

What do you mean by "I don't feel this is the correct way to do this" ? Are you not getting the result you expect ? Is it too slow ?
Maybe, you can organize your data by columns instead of rows and have a more specific filters. If you are looking for speed, I'd suggest using the numpy module which has a very intersting function called select()
Scipy select example
By transforming all your rows in a numpy array, you can test several columns in one pass. This function is amazingly efficient and powerful ! Basically it's used like this:
import numpy as np
a = array(...)
conds = [a < 10, a % 3 == 0, a > 25]
actions = [a + 100, a / 3, a * 10]
result = np.select(conds, actions, default = 0)
All values in a will be transformed as follow:
A value 100 will be added to any value of a which is smaller than 10
Any value in a which is a multiple of 3, will be divided by 3
Any value above 25 will be multiplied by 10
Any other value, not matching the previous conditions, will be set to 0
Bot conds and actions are lists, and must have the same number of arguments. The first element in conds has its action set as the first element of actions.
It could be used to determine the index in a vector for a particular value (eventhough this should be done using the nonzero() numpy function).
a = array(....)
conds = [a <= target, a > target]
actions = [1, 0]
index = select(conds, actions).sum()
This is probably a stupid way of getting an index, but it demonstrates how we can use select()... and it works :-)

Related

How to implement a for loop to iterate over this dataset

I'm implementing the following code to get match history data from an API:
my_matches = watcher.match.matchlist_by_puuid(
region=my_region,
puuid=me["puuid"],
count=100,
start=1)
The max items I can display per page is 100 (count), I would like my_matches to equal the first 1000 matches, thus looping start from 1 - 10.
Is there any way to effectively do this?
Based on the documentation (see page 17), this function returns a list of strings. The function can only return a 100 count max. Also, it accepts a start for where to start returning these matches (which defaults at 0). A possible solution for your problem would look like this:
allMatches = [] # will become a list containing 10 lists of matches
for match_page in range(9): # remember arrays start at 0!
countNum = match_page * 100 # first will be 0, second 100, third 200 etc...
my_matches = watcher.match.matchlist_by_puuid(
region=my_region,
puuid=me["puuid"],
count=100,
start=countNum)
# ^ Notice how we use countNum as the start for returning
allMatches.append(my_matches)
If you want to remain concise, and you want your matchesto be a 1000 long list of results, you can concatenate direclty all the outputs of size 100 as:
import itertools
matches = list(itertools.chain.from_iterable(watcher.match.matchlist_by_puuid(
region=my_region,
puuid=me["puuid"],
count=100,
start=i*100) for i in range(10)))

Python closest match between two string columns

I am looking to get the closest match between two columns of string data type in two separate tables. I don't think the content matters too much. There are words that I can match by pre-processing the data (lower all letters, replace spaces and stop words, etc...) and doing a join. However I get around 80 matches out of over 350. It is important to know that the length of each table is different.
I did try to use some code I found online but it isn't working:
def Races_chien(df1,df2):
myList = []
total = len(df1)
possibilities = list(df2['Rasse'])
s = SequenceMatcher(isjunk=None, autojunk=False)
for idx1, df1_str in enumerate(df1['Race']):
my_str = ('Progress : ' + str(round((idx1 / total) * 100, 3)) + '%')
sys.stdout.write('\r' + str(my_str))
sys.stdout.flush()
# get 1 best match that has a ratio of at least 0.7
best_match = get_close_matches(df1_str, possibilities, 1, 0.7)
s.set_seq2(df1_str, best_match)
myList.append([df1_str, best_match, s.ratio()])
return myList
It says: TypeError: set_seq2() takes 2 positional arguments but 3 were given
How can I make this work?
I think you need s.set_seqs(df1_str, best_match) function instead of s.set_seq2(df1_str, best_match) (docs)
You can use jellyfish library that has useful tools for comparing how similar two strings are if that is what you are looking for.
Try changing:
s = SequenceMatcher(isjunk=None, autojunk=False)
To:
s = SequenceMatcher(None, isjunk=None, autojunk=False)
Here is an answer I finally got:
from fuzzywuzzy import process, fuzz
value = []
similarity = []
for i in df1.col:
ratio = process.extract(i, df2.col, limit= 1)
value.append(ratio[0][0])
similarity.append(ratio[0][1])
df1['value'] = pd.Series(value)
df1['similarity'] = pd.Series(similarity)
This will add the value with the closest match from df2 in df1 together with the similarity %

Pandas Dataframe replace string based on length

I have a pandas dataframe (df2) with about 160,000 rows. I'm trying to change some of the values in a column (url).
The strings in this column have lengths between 108 and 150 characters. If the string is not 108 characters, I want to replace it with the same string, cutting off the last 10 characters. IF the string is 108 characters. I want to leave it alone. Please note that i'm not trying to make every string 108 characters, I'm just trying to cut off the last 10 characters of any string that isn't 108 characters.
example: len(s) = 114, replace with s[:-10]
I built a function that will do this, but it's insanely slow, probably because it rebuilds the dataframe in each loop.
for i in df2.url:
if len(i) != 108:
new_i = i[:-10]
df2 = df2.replace(i,new_i)
There has to be a faster way to do this, but I haven't been able to figure out how. I would love the expertise of someone more versed in pandas.
Below is an example of 200 rows of the column I'm trying to change:
['https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301108?gameHash=bde58669fc59c853&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291187?gameHash=f7fcd2d6ca775fb5&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291192?gameHash=005335984c8f8a3a&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301128?gameHash=fcbd2630c0faec49&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301159?gameHash=9a7726176fdabfde&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301169?gameHash=5d816e6d30d2b659&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301183?gameHash=396641afdcdd99d9&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1271494?gameHash=bd51798e1358c47f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130153?gameHash=00a7861ac0a23aef',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1271495?gameHash=0d828bbc9aa9996c',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1271497?gameHash=bd4810bb801abf24',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130166?gameHash=1cff679b64acb047',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130177?gameHash=1f92cbefd9a965e0',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1271500?gameHash=abbdae6c3e7b4006',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1271505?gameHash=7c970a84e132a578',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130182?gameHash=ccb50f6e86e4c3df',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130193?gameHash=0995997660a65721',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301262?gameHash=c594a9a52f46cc50',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130196?gameHash=31553f5bb6ba4420',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301270?gameHash=5b3babb5d392d78d',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130201?gameHash=3d2aa031c17d90ae',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301290?gameHash=31ce80069fdbc873',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130210?gameHash=91c7b22cded939ff',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301305?gameHash=3f8d664b3b988446',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130221?gameHash=a8580ee66ffbb525',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291406?gameHash=5220923eb35c42c6',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291426?gameHash=83c7c51530ea074e',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291442?gameHash=28f7b485f710168f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291458?gameHash=49cc14d02ccd0674',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291470?gameHash=f087c853097c2dd9',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1261474?gameHash=e6c01a288de5dc41',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130229?gameHash=1489421028163983',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1261475?gameHash=c984e795d6406cd5',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130243?gameHash=5491d110de253089',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1261482?gameHash=f2283324f82caa66&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130253?gameHash=f8e39ae785d11c0c',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130264?gameHash=a98718c088ce663c',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1261488?gameHash=6517011920487fbf&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291651?gameHash=5ec1b3473060dfd2',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291682?gameHash=a8f2c06d04117279',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291703?gameHash=cfb2d078f289825c',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291737?gameHash=cf67a15df43c2bb2',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291748?gameHash=7a3c085cf703d7bd',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291789?gameHash=51e5ed28085fd299',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291812?gameHash=e540d208bbc69bb3',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291835?gameHash=a75ab48a22470022',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291845?gameHash=2eab12f8ffd0dfd0',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130294?gameHash=ecf040ad60fa9726&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130299?gameHash=499a21480080a722&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130306?gameHash=d0e60bf49b6bf008&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292296?gameHash=3db885bd11a047bc',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292315?gameHash=2ecf71aaea031312',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292329?gameHash=5ed85b948b32b8e8',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292341?gameHash=7335d6ca06763dc0&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292345?gameHash=6f86444cce429244',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292348?gameHash=c6a4eec48810e8d5&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292353?gameHash=6db57c090ed235bd&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292354?gameHash=79845cdf9a6e88db',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120429?gameHash=436739b9e99a246e&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120436?gameHash=58bc4281a76534f3&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120440?gameHash=4b74592ff226c39f&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120447?gameHash=9358d210749ab778&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292579?gameHash=14865e88bd1e30a7',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292607?gameHash=0ae34d7f67620dc4',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292635?gameHash=f94944bb4f061f0d',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292648?gameHash=1338dde99c71877f&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302501?gameHash=f71748ae9cad5866&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302519?gameHash=672c1377c3d37ed0&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302531?gameHash=49cf9a8f3942b9c8&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302595?gameHash=314d39ea940b354f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302628?gameHash=0ab39ec364a3ff5b',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302635?gameHash=5625553825f5994e',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302651?gameHash=555c7cd73dff952d',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292960?gameHash=e3ce73c142354517',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292974?gameHash=ab79b8f6f354bc0b',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302827?gameHash=6a1a5de57a7ce6b9&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302855?gameHash=f9144d0822d68632&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302881?gameHash=369cd071defeadd9&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302906?gameHash=c65d2e76e9aa721e&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130488?gameHash=411522a3de69bb79',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130489?gameHash=51c4c81c13a484c7',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130496?gameHash=9575986535e4f4c2',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293312?gameHash=8e2209227e28843b',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120557?gameHash=f5bec07774ed5a5e',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293319?gameHash=762cb3a92744846f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130515?gameHash=548d7e528ef1f81e',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293370?gameHash=8a70038d2eba61de',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293393?gameHash=841d85edbfa78057',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130518?gameHash=6764d64a5ef8377e&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120578?gameHash=838a1db0f44411c8&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120583?gameHash=c542c3368048efd6&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120585?gameHash=925d9c523a0b0bdb&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293765?gameHash=53412e36eb2eab86',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303478?gameHash=8df5ef3d826ad211&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303509?gameHash=d0849b1ba82d4826&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293812?gameHash=48d825f1bb110b55',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303533?gameHash=3a712b015a672d8d&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293850?gameHash=0a29fdee10ed35d0',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293885?gameHash=1ffaffd98da7e806',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303581?gameHash=2bf61273d44c302f&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293897?gameHash=77ccf507e1eaa05c',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293899?gameHash=aa93723cded96f3b',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293901?gameHash=f5fb660360f96ad6',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293909?gameHash=245dbdf428788434',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303619?gameHash=2e2f2ff9c6a32595&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303626?gameHash=3bba86d0f9ff1d11&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293929?gameHash=f4b6f53e68bbbc86',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303641?gameHash=25ffa91aeb9ed707',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293950?gameHash=e2f3a99412844d36',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303655?gameHash=4ff2ebbe72e635bb',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293964?gameHash=10bd6ec239231196',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130548?gameHash=afd267703d3cbbb1&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303666?gameHash=a30e98d241d22eef',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130553?gameHash=f4360fb632593491&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130560?gameHash=e1e5bae936585a24&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120607?gameHash=f4b702f689f87c90',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130563?gameHash=43a7c73ecd281a63&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120622?gameHash=c87f08d06f392f3f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120629?gameHash=6b39ee929c2ebc47',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120638?gameHash=eb17c2013b9ee77d',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120649?gameHash=aab6f321110ef3ed',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294174?gameHash=8f5cb3f02bf790d7&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303861?gameHash=02847551947ca67d&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294191?gameHash=e574ac58bbe81abb&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303876?gameHash=e733bc45e47f4856&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303904?gameHash=d8aac7332b9edfe8',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294233?gameHash=6762c2c72bc47359&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303974?gameHash=28f566b2fa35a32a&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294260?gameHash=246841d34c9660aa&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303999?gameHash=619d5a2d571a1b01&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304020?gameHash=99508a2da285eb4c&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304032?gameHash=30e5b243a407326c&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304035?gameHash=f8e3702e77f87cc7',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304054?gameHash=db39c4bcb7c2320e&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304075?gameHash=a3d0d6acfb8b92f1&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304079?gameHash=8339c23d8d925f8b',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304081?gameHash=d3e5c8f0270ce96f&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304111?gameHash=64ded61e41c18ccd',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304122?gameHash=bf7e80351592ce98',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304132?gameHash=ff37582431bd7e7b&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130611?gameHash=a099e1df984018a1',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304158?gameHash=62b9c13c8cecf652',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294417?gameHash=746905a629b8f374',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130621?gameHash=9d171c9622870a7b',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304165?gameHash=c34ae80c4ee8c7bd',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304169?gameHash=ee6bc6a087a6bc36&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304193?gameHash=fe234e8ca7d2343f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130630?gameHash=b1b183ad3374db06',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304217?gameHash=b29c2b7461c7700f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304223?gameHash=7c70a52e69b01c56',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130643?gameHash=15bb88ac79a622a1&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120705?gameHash=a6532b3af6accaf2',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304264?gameHash=f5e69d8e2f6bae5e',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120711?gameHash=79659ad2d107f0d9',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304282?gameHash=f9ea42cec97e930f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130676?gameHash=f3e34e47140460ff',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304302?gameHash=617f3af3e7d2ab4d',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130696?gameHash=532d412c3f38c0c5',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304324?gameHash=01bf1a7465a412ba',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130709?gameHash=1491ee8228ad66ec',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304356?gameHash=fc584c5143087c0b',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304374?gameHash=175112ba57cbce5e',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304391?gameHash=72ad86120a14eb54',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304399?gameHash=2536b98ac19e617d',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304412?gameHash=a30a480459e9151c',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294710?gameHash=0d01dd80aa803997',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294725?gameHash=3ae63821918e2b43',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294738?gameHash=5fd20a0eec2c86f4',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304487?gameHash=0d5d1e4e719e8c46&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130723?gameHash=c5113e7f25839c2e&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304504?gameHash=fda4895c0bea1e8a&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294750?gameHash=fa9335e1a61165a5',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130734?gameHash=88de231d14ea4b07&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304521?gameHash=b70af7bde6c54520&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294784?gameHash=7d1bf4754cda9b46',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130739?gameHash=4ddb470392dd9248&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304544?gameHash=b14635dac9add7b4&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294813?gameHash=51f4579db6e7049f',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130748?gameHash=8f544ac73c53a606&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304574?gameHash=a5dcde67b90f29e3&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304609?gameHash=7a5a6778a7074f09',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304635?gameHash=4cda5972cf8dd6bd',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304671?gameHash=67e29eccbbc8f667',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304691?gameHash=1572b2bb76b73da1',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304700?gameHash=b50bc9265ac35f9f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304739?gameHash=b80bb99cbce5bd71',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304768?gameHash=6ed6e9d7108f27e0',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304785?gameHash=13054e8f14fcae76',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304794?gameHash=b4c27881f0c4481c',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304884?gameHash=29e8f7f002108b46',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304889?gameHash=7774d22665d9526d',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304894?gameHash=bac5c580914f7aaa',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304928?gameHash=b8b029b3d4002fbc',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130793?gameHash=a3f6e45612b56302',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1295356?gameHash=731afc76037bd245',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1295369?gameHash=e743122ca08b77d8',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1295383?gameHash=072ee1028f03f4c9',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1295402?gameHash=560c984fc1ba1168',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1295440?gameHash=32cdbf5ce1441159&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1295467?gameHash=bcc21c92fa78e889&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1295489?gameHash=95386dea5cf1ab09&tab=overview']
Basic Solution
The below solution makes use of a lambda function defined within a call to pandas.DataFrame.apply().
df['url'] = df['url'].apply(lambda x: x if len(x) == 108 else x[:-10])
Here, each value within df['url'] (x) remains the same if len(x) == 108, otherwise it is updated to be x[:-10].
Handling Exceptions
The below solution is similar to that provided above, however in this case some basic exception handling has been implemented within the url_trim() function called by pandas.DataFrame.apply().
This is more robust than the first solution and will not halt code execution when an exception is thrown within pandas.DataFrame.apply() due to unexpected values within df['url'] rows, in these cases the value is simply left unchanged - for example if numpy.nan is used for null values.
def url_trim(x):
try:
if len(x) != 108:
return x[:-10]
else:
return x
except:
return x
df['url'] = df['url'].apply(lambda x: url_trim(x))
The following code will check the length of the url columns and prune of last 10 characters of the string if the string is below 108.The modified url will be included in modified_url column.
# Get string length
df["string_length"] = df["url"].astype(str).str.len()
# Create a filter based on string length
filter_length = df["string_length"]<108
# Extract string for the filter
df["modified_url"]=df["url"]
df.loc[filter_length,"modified_url"]=df[filter_length]["url"].astype(str).str[:-10]

How to efficient find existent key-values of 2-dimensional dictionary in python which are between 4 values?

I have a little Problem in Python. I got a 2 dimensional dictionary. Lets call it dict[x,y] now. x and y are integers. I try to only select the key-pair-values, which match between 4 points. Function should look like this:
def search(topleft_x, topleft_y, bottomright_x, bottomright_y):
For example: search(20, 40, 200000000, 300000000)
Now are Dictionary-items should be returned that match to:
20 < x < 20000000000
AND 40 < y < 30000000000
Most of the key-pair-values in this huge matrix are not set (see picture - this is why i cant just iterate).
This function should return a shorted dictionary. In the example shown in the picture, it would be a new dictionary with the 3 green circled values. Is there any simple solution to realize this?
I recently used 2-for-loops. In this example they would look like this:
def search():
for x in range(20, 2000000000):
for y in range(40, 3000000000):
try:
#Do something
except:
#Well item just doesnt exist
Of course this is highly inefficient. So my question is: How to Boost up this simple thing in Python? In C# i used Linq for stuff like this... What to use in python?
Thanks for help!
Example Picture
You dont go over random number ranges and ask 4million times for forgiveness - you use 2 number range to specify your "filters" and go only over existing keys in the dictionary that fall into those ranges:
# get fancy on filtering if you like, I used explicit conditions and continues for clearity
def search(d:dict,r1:range, r2:range)->dict:
d2 = {}
for x in d: # only use existing keys in d - not 20k that might be in
if x not in r1: # skip it - not in range r1
continue
d2[x] = {}
for y in d[x]: # only use existing keys in d[x] - not 20k that might be in
if y not in r2: # skip it - not in range r2
continue
d2[x][y] = "found: " + d[x][y][:] # take it, its in both ranges
return d2
d = {}
d[20] = {99: "20",999: "200",9999: "2000",99999: "20000",}
d[9999] = { 70:"70",700:"700",7000:"7000",70000:"70000"}
print(search(d,range(10,30), range(40,9000)))
Output:
{20: {99: 'found: 20', 999: 'found: 200'}}
It might be useful to take a look at modules providing sparse matrices.

python: data cleaning - detect pattern for fraudulent email addresses

I am cleaning a data set with fraudulent email addresses that I am removing.
I established multiple rules for catching duplicates and fraudulent domains. But there is one screnario, where I can't think of how to code a rule in python to flag them.
So I have for example rules like this:
#delete punction
df['email'].apply(lambda x:''.join([i for i in x if i not in string.punctuation]))
#flag yopmail
pattern = "yopmail"
match = df['email'].str.contains(pattern)
df['yopmail'] = np.where(match, 'Y', '0')
#flag duplicates
df['duplicate']=df.email.duplicated(keep=False)
This is the data where I can't figure out a rule to catch it. Basically I am looking for a way to flag addresses that start the same way, but then have consecutive numbers in the end.
abc7020#gmail.com
abc7020.1#gmail.com
abc7020.10#gmail.com
abc7020.11#gmail.com
abc7020.12#gmail.com
abc7020.13#gmail.com
abc7020.14#gmail.com
abc7020.15#gmail.com
attn1#gmail.com
attn12#gmail.com
attn123#gmail.com
attn1234#gmail.com
attn12345#gmail.com
attn123456#gmail.com
attn1234567#gmail.com
attn12345678#gmail.com
My solution isn't efficient, nor pretty. But check it out and see if it works for you #jeangelj. It definitely works for the examples you provided. Good luck!
import os
from random import shuffle
from difflib import SequenceMatcher
emails = [... ...] # for example the 16 email addresses you gave in your question
shuffle(emails) # everyday i'm shuffling
emails = sorted(emails) # sort that shit!
names = [email.split('#')[0] for email in emails]
T = 0.7 # <- set your string similarity threshold here!!
split_indices=[]
for i in range(1,len(emails)):
if SequenceMatcher(None, emails[i], emails[i-1]).ratio() < T:
split_indices.append(i) # we want to remember where dissimilar email address occurs
grouped=[]
for i in split_indices:
grouped.append(emails[:i])
grouped.append(emails[i:])
# now we have similar email addresses grouped, we want to find the common prefix for each group
prefix_strings=[]
for group in grouped:
prefix_strings.append(os.path.commonprefix(group))
# finally
ham=[]
spam=[]
true_ids = [names.index(p) for p in prefix_strings]
for i in range(len(emails)):
if i in true_ids:
ham.append(emails[i])
else:
spam.append(emails[i])
In [30]: ham
Out[30]: ['abc7020#gmail.com', 'attn1#gmail.com']
In [31]: spam
Out[31]:
['abc7020.10#gmail.com',
'abc7020.11#gmail.com',
'abc7020.12#gmail.com',
'abc7020.13#gmail.com',
'abc7020.14#gmail.com',
'abc7020.15#gmail.com',
'abc7020.1#gmail.com',
'attn12345678#gmail.com',
'attn1234567#gmail.com',
'attn123456#gmail.com',
'attn12345#gmail.com',
'attn1234#gmail.com',
'attn123#gmail.com',
'attn12#gmail.com']
# THE TRUTH YALL!
You can use a regular expression to do this; example below:
import re
a = "attn12345#gmail.comf"
b = "abc7020.14#gmail.com"
c = "abc7020#gmail.com"
d = "attn12345678#gmail.com"
pattern = re.compile("[0-9]{3,500}\.?[0-9]{0,500}?#")
if pattern.search(a):
print("spam1")
if pattern.search(b):
print("spam2")
if pattern.search(c):
print("spam3")
if pattern.search(d):
print("spam4")
If you run the code you will see:
$ python spam.py
spam1
spam2
spam3
spam4
The benefit to this method is that its standardized (regular expressions) and that you can adjust the strength of the match easily by adjusting the values within {}; which means you can have a global configuration file where you set/adjust the values. You can also adjust the regular expression easily without having to rewrite code.
First take a look at regexp question here
Second, try to filter email address like that:
# Let's email is = 'attn1234#gmail.com'
email = 'attn1234#gmail.com'
email_name = email.split(',', maxsplit=1)[0]
# Here you get email_name = 'attn1234
import re
m = re.search(r'\d+$', email_name)
# if the string ends in digits m will be a Match object, or None otherwise.
if m is not None:
print ('%s is good' % email)
else:
print ('%s is BAD' % email)
You could pick a diff threshold using edit distance (aka Levenshtein distance). In python:
$pip install editdistance
$ipython2
>>> import editdistance
>>> threshold = 5 # This could be anything, really
>>> data = ["attn1#gmail.com...", ...]# set up data to be the set you gave
>>> fraudulent_emails = set([email for email in data for _ in data if editdistance.eval(email, _) < threshold])
If you wanted to be smarter about it, you could run through the resulting list and, instead of turning it into a set, keep track of how many other email addresses it was near - then use that as a 'weight' to determine fake-ness.
This gets you not only the given cases (where the fraudulent addresses all share a common start and differ only in numerical suffix, but additionally number or letter padding eg at the beginning or in the middle of an email address.
ids = [s.split('#')[0] for s in email_list]
det = np.zeros((len(ids), len(ids)), dtype=np.bool)
for i in range(len(ids)):
for j in range(i + 1, len(ids)):
mi = ids[i]
mj = ids[j]
if len(mj) == len(mi) + 1 and mj.startswith(mi):
try:
int(mj[-1])
det[j,i] = True
det[i,j] = True
except:
continue
spam_indices = np.where(np.sum(det, axis=0) != 0)[0].tolist()
Here's one way to approach it, that should be pretty efficient.
We do it by grouping the email address in lengths, so that we only need to check if each email address matches the level down, by a slice and set membership check.
The code:
First, read in the data:
import pandas as pd
import numpy as np
string = '''
abc7020#gmail.com
abc7020.1#gmail.com
abc7020.10#gmail.com
abc7020.11#gmail.com
abc7020.12#gmail.com
abc7020.13#gmail.com
abc7020.14#gmail.com
abc7020.15#gmail.com
attn1#gmail.com
attn12#gmail.com
attn123#gmail.com
attn1234#gmail.com
attn12345#gmail.com
attn123456#gmail.com
attn1234567#gmail.com
attn12345678#gmail.com
foo123#bar.com
foo1#bar.com
'''
x = pd.DataFrame({'x':string.split()})
#remove duplicates:
x = x[~x.x.duplicated()]
We strip off the #foo.bar part, and then filer to only those that end with a number, and add on a 'length' column:
#split on #, expand means into two columns
emails = x.x.str.split('#', expand = True)
#filter by last in string is a digit
emails = emails.loc[:,emails.loc[:,0].str[-1].str.isdigit()]
#add a length of email column for the next step
emails['lengths'] = emails.loc[:,0].str.len()
Now, all we have to do, is take each length, and length -1, and see if the length. with it's last character dropped, appears in a set of the n-1 lengths (and, we have to check if the opposite is true, in case it is the shortest repeat):
#unique lengths to check
lengths = emails.lengths.unique()
#mask to hold results
mask = pd.Series([0]*len(emails), index = emails.index)
#for each length
for j in lengths:
#we subset those of that length
totest = emails['lengths'] == j
#and those who might be the shorter version
against = emails['lengths'] == j -1
#we make a set of unique values, for a hashed lookup
againstset = set([i for i in emails.loc[against,0]])
#we cut off the last char of each in to test
tests = emails.loc[totest,0].str[:-1]
#we check matches, by checking the set
mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0)
#viceversa, otherwise we miss the smallest one in the group
againstset = set([i for i in emails.loc[totest,0].str[:-1]])
tests = emails.loc[against,0]
mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0)
The resulting mask can be converted to boolean, and used to subset the original (deduplicated) dataframe, and the indices should match the original indices to subset like that:
x.loc[~mask.astype(bool),:]
x
0 abc7020#gmail.com
16 foo123#bar.com
17 foo1#bar.com
You can see that we have not removed your first value, as the '.' means it did not match - you can remove the punctuation first.
I have an idea on how to solve this:
fuzzywuzzy
Create a set of unique emails, for-loop over them and compare them with fuzzywuzzy.
Example:
from fuzzywuzzy import fuzz
for email in emailset:
for row in data:
emailcomp = re.search(pattern=r'(.+)#.+',string=email).groups()[0]
rowemail = re.search(pattern=r'(.+)#.+',string=row['email']).groups()[0]
if row['email']==email:
continue
elif fuzz.partial_ratio(emailcomp,rowemail)>80:
'flagging operation'
I took some liberties with how the data is represented, but I feel the variable names are mnemonic enough for you to understand what I am getting at. It is a very rough piece of code, in that I have not thought through how to stop repetitive flagging.
Anyways, the elif part compares the two email addresses without #gmail.com (or any other email e.g. #yahoo.com), if the ratio is above 80 (play around with this number) use your flagging operation.
For example:
fuzz.partial_ratio("abc7020.1", "abc7020")
100

Categories

Resources