How to get match result by given range using regular expression? - python

I'm stucking with my code to get all return match by given range. My data sample is:
comment
0 [intj74, you're, whipping, people, is, a, grea...
1 [home, near, kcil2, meniaga, who, intj47, a, l...
2 [thematic, budget, kasi, smooth, sweep]
3 [budget, 2, intj69, most, people, think, of, e...
I want to get the result as: (where the given range is intj1 to intj75)
comment
0 [intj74]
1 [intj47]
2 [nan]
3 [intj69]
My code is:
df.comment = df.comment.apply(lambda x: [t for t in x if t=='intj74'])
df.ix[df.comment.apply(len) == 0, 'comment'] = [[np.nan]]
I'm not sure how to use regular expression to find the range for t=='range'. Or any other idea to do this?
Thanks in advance,
Pandas Python Newbie

you could replace [t for t in x if t=='intj74'] with, e.g.,
[t for t in x if re.match('intj[0-9]+$', t)]
or even
[t for t in x if re.match('intj[0-9]+$', t)] or [np.nan]
which would also handle the case if there are no matches (so that one wouldn't need to check for that explicitly using df.ix[df.comment.apply(len) == 0, 'comment'] = [[np.nan]]) The "trick" here is that an empty list evaluates to False so that the or in that case returns its right operand.

I am new to pandas as well. You might have initialized your DataFrame differently. Anyway, this is what I have:
import pandas as pd
data = {
'comment': [
"intj74, you're, whipping, people, is, a",
"home, near, kcil2, meniaga, who, intj47, a",
"thematic, budget, kasi, smooth, sweep",
"budget, 2, intj69, most, people, think, of"
]
}
print(df.comment.str.extract(r'(intj\d+)'))

Related

Python closest match between two string columns

I am looking to get the closest match between two columns of string data type in two separate tables. I don't think the content matters too much. There are words that I can match by pre-processing the data (lower all letters, replace spaces and stop words, etc...) and doing a join. However I get around 80 matches out of over 350. It is important to know that the length of each table is different.
I did try to use some code I found online but it isn't working:
def Races_chien(df1,df2):
myList = []
total = len(df1)
possibilities = list(df2['Rasse'])
s = SequenceMatcher(isjunk=None, autojunk=False)
for idx1, df1_str in enumerate(df1['Race']):
my_str = ('Progress : ' + str(round((idx1 / total) * 100, 3)) + '%')
sys.stdout.write('\r' + str(my_str))
sys.stdout.flush()
# get 1 best match that has a ratio of at least 0.7
best_match = get_close_matches(df1_str, possibilities, 1, 0.7)
s.set_seq2(df1_str, best_match)
myList.append([df1_str, best_match, s.ratio()])
return myList
It says: TypeError: set_seq2() takes 2 positional arguments but 3 were given
How can I make this work?
I think you need s.set_seqs(df1_str, best_match) function instead of s.set_seq2(df1_str, best_match) (docs)
You can use jellyfish library that has useful tools for comparing how similar two strings are if that is what you are looking for.
Try changing:
s = SequenceMatcher(isjunk=None, autojunk=False)
To:
s = SequenceMatcher(None, isjunk=None, autojunk=False)
Here is an answer I finally got:
from fuzzywuzzy import process, fuzz
value = []
similarity = []
for i in df1.col:
ratio = process.extract(i, df2.col, limit= 1)
value.append(ratio[0][0])
similarity.append(ratio[0][1])
df1['value'] = pd.Series(value)
df1['similarity'] = pd.Series(similarity)
This will add the value with the closest match from df2 in df1 together with the similarity %

Improving loop in loops with Numpy

I am using numpy arrays aside from pandas for speed purposes. However, I am unable to advance my codes using broadcasting, indexing etc. Instead, I am using loop in loops as below. It is working but seems so ugly and inefficient to me.
Basically what I am doing is, I am trying to imitate groupby of pandas at the step mydata[mydata[:,1]==i]. You may consider it as a firm id number. Then with respect to the lookup data, I am checking if it is inside the selected firm or not at the step all(np.isin(lookup[u],d[:,3])). But as I denoted at the beginning, I feel so uncomfortable about this.
out = []
for i in np.unique(mydata[:,1]):
d = mydata[mydata[:,1]==i]
for u in range(0,len(lookup)):
control = all(np.isin(lookup[u],d[:,3]))
if(control):
out.append(d[np.isin(d[:,3],lookup[u])])
It takes about 0.27 seconds. However there must exist some clever alternatives.
I also tried Numba jit() but it does not work.
Could anyone help me about that?
Thanks in advance!
Fake Data:
a = np.repeat(np.arange(100)+5000, np.random.randint(50, 100, 100))
b = np.random.randint(100,200,len(a))
c = np.random.randint(10,70,len(a))
index = np.arange(len(a))
mydata = np.vstack((index,a, b,c)).T
lookup = []
for i in range(0,60):
lookup.append(np.random.randint(10,70,np.random.randint(3,6,1) ))
I had some problems getting the goal of your Program, but I got a decent performance improvement, by refactoring your second for loop. I was able to compress your code to 3 or 4 lines.
f = (
lambda lookup: out1.append(d[np.isin(d[:, 3], lookup)])
if all(np.isin(lookup, d[:, 3]))
else None
)
out = []
for i in np.unique(mydata[:, 1]):
d = mydata[mydata[:, 1] == i]
list(map(f, lookups))
This resolves to the same output list you received previously and the code runs almost twice as quick (at least on my machine).

How to efficient find existent key-values of 2-dimensional dictionary in python which are between 4 values?

I have a little Problem in Python. I got a 2 dimensional dictionary. Lets call it dict[x,y] now. x and y are integers. I try to only select the key-pair-values, which match between 4 points. Function should look like this:
def search(topleft_x, topleft_y, bottomright_x, bottomright_y):
For example: search(20, 40, 200000000, 300000000)
Now are Dictionary-items should be returned that match to:
20 < x < 20000000000
AND 40 < y < 30000000000
Most of the key-pair-values in this huge matrix are not set (see picture - this is why i cant just iterate).
This function should return a shorted dictionary. In the example shown in the picture, it would be a new dictionary with the 3 green circled values. Is there any simple solution to realize this?
I recently used 2-for-loops. In this example they would look like this:
def search():
for x in range(20, 2000000000):
for y in range(40, 3000000000):
try:
#Do something
except:
#Well item just doesnt exist
Of course this is highly inefficient. So my question is: How to Boost up this simple thing in Python? In C# i used Linq for stuff like this... What to use in python?
Thanks for help!
Example Picture
You dont go over random number ranges and ask 4million times for forgiveness - you use 2 number range to specify your "filters" and go only over existing keys in the dictionary that fall into those ranges:
# get fancy on filtering if you like, I used explicit conditions and continues for clearity
def search(d:dict,r1:range, r2:range)->dict:
d2 = {}
for x in d: # only use existing keys in d - not 20k that might be in
if x not in r1: # skip it - not in range r1
continue
d2[x] = {}
for y in d[x]: # only use existing keys in d[x] - not 20k that might be in
if y not in r2: # skip it - not in range r2
continue
d2[x][y] = "found: " + d[x][y][:] # take it, its in both ranges
return d2
d = {}
d[20] = {99: "20",999: "200",9999: "2000",99999: "20000",}
d[9999] = { 70:"70",700:"700",7000:"7000",70000:"70000"}
print(search(d,range(10,30), range(40,9000)))
Output:
{20: {99: 'found: 20', 999: 'found: 200'}}
It might be useful to take a look at modules providing sparse matrices.

I'm trying to make a simple script that says two different two phrase lines(Python)

So, I'm just starting to program Python and I wanted to make a very simple script that will say something like "Gabe- Hello, my name is Gabe (Just an example of a sentence" + "Jerry- Hello Gabe, I'm Jerry" OR "Gabe- Goodbye, Jerry" + "Jerry- Goodbye, Gabe". Here's pretty much what I wrote.
answers1 = [
"James-Hello, my name is James!"
]
answers2 = [
"Jerry-Hello James, my name is Jerry!"
]
answers3 = [
"Gabe-Goodbye, Samuel."
]
answers4 = [
"Samuel-Goodbye, Gabe"
]
Jack1 = (answers1 + answers2)
Jack2 = (answers3 + answers4)
Jacks = ([Jack1,Jack2])
import random
for x in range(2):
a = random.randint(0,2)
print (random.sample([Jacks, a]))
I'm quite sure it's a very simple fix, but as I have just started Python (Like, literally 2-3 days ago) I don't quite know what the problem would be. Here's my error message
Traceback (most recent call last):
File "C:/Users/Owner/Documents/Test Python 3.py", line 19, in <module>
print (random.sample([Jacks, a]))
TypeError: sample() missing 1 required positional argument: 'k'
If anyone could help me with this, I would very much appreciate it! Other than that, I shall be searching on ways that may be relevant to fixing this.
The problem is that sample requires a parameter k that indicates how many random samples you want to take. However in this case it looks like you do not need sample, since you already have the random integer. Note that that integer should be in the range [0,1], because the list Jack has only two elements.
a = random.randint(0,1)
print (Jacks[a])
or the same behavior with sample, see here for an explanation.
print (random.sample(Jacks,1))
Hope this helps!
random.sample([Jacks, a])
This sample method should looks like
random.sample(Jacks, a)
However, I am concerted you also have no idea how lists are working. Can you explain why do you using lists of strings and then adding values in them? I am losing you here.
If you going to pick a pair or strings, use method described by Florian (requesting data by index value.)
k parameter tell random.sample function that how many sample you need, you should write:
print (random.sample([Jacks, a], 3))
which means you need 3 sample from your list. the output will be something like:
[1, jacks, 0]

Format a python list and search for patterns

I am getting rows from a spreadsheet with mixtures of numbers, text and dates
I want to find elements within the list, some numbers and some text
for example
sg = [500782, u'BMOU9015488', u'SD4', u'CLOSED', -1, '', '', -1]
sg = map(str, sg)
#sg = map(unicode, sg) #option?
if any("-1" in s for s in sg):
#do something if matched
I don't feel this is the correct way to do this, I am also trying to match stuff like -1.5 and -1.5C and other unexpected characters like OPEN15 compared to 15
I have also looked at
sg.index("-1")
If positive then its a match (Only good for direct matches)
Some help would be appreciated
If you want to call a function for each case, I would do it this way:
def stub1(elem):
#do something for match of type '-1'
return
def stub2(elem):
#do something for match of type 'SD4'
return
def stub3(elem):
#do something for match of type 'OPEN15'
return
sg = [500782, u'BMOU9015488', u'SD4', u'CLOSED', -1, '', '', -1]
sg = map(unicode, sg)
patterns = {u"-1":stub1, u"SD4": stub2, u"OPEN15": stub3} # add more if you want
for elem in sg:
for k, stub in patterns.iteritems():
if k in elem:
stub(elem)
break
Where stub1, stub2, ... are the fonctions that contains the code for each case.
It will be called (max 1 time per strings) if the string contains a matching substring.
What do you mean by "I don't feel this is the correct way to do this" ? Are you not getting the result you expect ? Is it too slow ?
Maybe, you can organize your data by columns instead of rows and have a more specific filters. If you are looking for speed, I'd suggest using the numpy module which has a very intersting function called select()
Scipy select example
By transforming all your rows in a numpy array, you can test several columns in one pass. This function is amazingly efficient and powerful ! Basically it's used like this:
import numpy as np
a = array(...)
conds = [a < 10, a % 3 == 0, a > 25]
actions = [a + 100, a / 3, a * 10]
result = np.select(conds, actions, default = 0)
All values in a will be transformed as follow:
A value 100 will be added to any value of a which is smaller than 10
Any value in a which is a multiple of 3, will be divided by 3
Any value above 25 will be multiplied by 10
Any other value, not matching the previous conditions, will be set to 0
Bot conds and actions are lists, and must have the same number of arguments. The first element in conds has its action set as the first element of actions.
It could be used to determine the index in a vector for a particular value (eventhough this should be done using the nonzero() numpy function).
a = array(....)
conds = [a <= target, a > target]
actions = [1, 0]
index = select(conds, actions).sum()
This is probably a stupid way of getting an index, but it demonstrates how we can use select()... and it works :-)

Categories

Resources