Slow code in "inner joins" lists in python - python

I have seen several posts about lists in Python here, but I haven't found a correct answer to my question, because it is about optimizing code.
I have a Python script that compares two lists. It has to find matching codes and modify the value in the second position. It works perfectly, but it takes a lot of time. In SQL this query takes 2 minutes, no more; here, however, it takes 15 minutes, so I don't understand whether it is a memory problem or badly written code.
I have two lists.
The first one is [code, points]; the second is [code, license].
If the first value (code) in the first list matches the first value (code) in the second list, it has to update the second value (points) of the first list when the license equals 'THIS'. For example:
itemswithscore = [[5675, 0], [6676, 0], [9898, 0], [4545, 0]]
itemswithlicense = [[9999, 'ATR'], [9191, 'OPOP'], [9898, 'THIS'], [2222, 'PLPL']]
for sublist1 in itemswithscore:
    for sublist2 in itemswithlicense:
        if sublist1[0] == sublist2[0]:  # this is the "inner join" :)
            if sublist2[1] == 'THIS':   # the license has to be 'THIS'
                sublist1[1] += 50       # add 50 to the score value
Finally, I have the list updated for code 9898:
itemswithscore = [5675, 0], [6676, 0], [9898, 50], [4545, 0]
It is true that each of the two lists has 80,000 values... :(
Thanks in advance!!!

I'd suggest transforming/keeping your data structures into/as dicts. That way, you won't need to iterate over both lists with nested for loops - an O(n²), or O(n × m), operation - searching for where the lists' code numbers align before updating the score value.
You'll simply update the value of the score where the key in the corresponding dict matches the search key:
dct_score = dict(itemswithscore)
dct_license = dict(itemswithlicense)

for k in dct_score:
    if dct_license.get(k) == 'THIS':  # use dict.get in case the key does not exist
        dct_score[k] += 50
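If the rest of your code still expects the list-of-lists shape, you can convert back afterwards (a small addition of mine, not strictly part of the answer):

itemswithscore = [[code, points] for code, points in dct_score.items()]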

It would be very efficient if you could use pandas. You can make two DataFrames and merge them on a single column. Something like this:
import pandas as pd

itemswithscore = [[5675, 0], [6676, 0], [9898, 0], [4545, 0]]
itemswithlicense = [[9999, 'ATR'], [9191, 'OPOP'], [9898, 'THIS'], [2222, 'PLPL']]

df1 = pd.DataFrame(itemswithscore, columns=['code', 'points'])
df2 = pd.DataFrame(itemswithlicense, columns=['code', 'license'])

df3 = pd.merge(df1, df2, on='code', how='inner')
df3 = df3.drop('points', axis=1)
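The merge alone doesn't update the scores, though; one way to finish the original task from here (my sketch, not part of the merge itself) is to add 50 points back in df1 for the codes whose merged license is 'THIS':

matched = df3.loc[df3['license'] == 'THIS', 'code']
df1.loc[df1['code'].isin(matched), 'points'] += 50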
Hope this helps, accept if correct
Cheers!

I'm pretty sure the slowness is mostly due to the looping itself, which is not very fast in Python. You can speed up the code somewhat by caching variables, like so:
for sublist1 in itemswithscore:
    a = sublist1[0]  # save to a variable to avoid repeated list lookups
    for sublist2 in itemswithlicense:
        if a == sublist2[0]:
            if sublist2[1] == 'THIS':
                sublist1[1] += 50
Also, if you happen to know that 'THIS' does not occur in itemswithlicense more than once, you should insert a break after you update sublist1[1].
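For example (a sketch assuming each code has at most one 'THIS' match):

for sublist1 in itemswithscore:
    a = sublist1[0]
    for sublist2 in itemswithlicense:
        if a == sublist2[0] and sublist2[1] == 'THIS':
            sublist1[1] += 50
            break  # no second match possible, stop scanning the inner list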
Let me know how much of a difference this makes.

Related

Improving loop in loops with Numpy

I am using numpy arrays aside from pandas for speed purposes. However, I am unable to advance my code using broadcasting, indexing, etc. Instead, I am using loops within loops, as below. It works, but it seems so ugly and inefficient to me.
Basically, what I am doing is trying to imitate pandas' groupby at the step mydata[mydata[:,1]==i]. You may consider it a firm ID number. Then, with respect to the lookup data, I check whether it is inside the selected firm at the step all(np.isin(lookup[u], d[:,3])). But as I noted at the beginning, I feel uncomfortable about this.
out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]
    for u in range(0, len(lookup)):
        control = all(np.isin(lookup[u], d[:, 3]))
        if control:
            out.append(d[np.isin(d[:, 3], lookup[u])])
It takes about 0.27 seconds, but there must be some cleverer alternative.
I also tried Numba's jit(), but it does not work.
Could anyone help me about that?
Thanks in advance!
Fake data:
import numpy as np

a = np.repeat(np.arange(100) + 5000, np.random.randint(50, 100, 100))
b = np.random.randint(100, 200, len(a))
c = np.random.randint(10, 70, len(a))
index = np.arange(len(a))
mydata = np.vstack((index, a, b, c)).T

lookup = []
for i in range(0, 60):
    lookup.append(np.random.randint(10, 70, np.random.randint(3, 6, 1)))
I had some problems understanding the goal of your program, but I got a decent performance improvement by refactoring your second for loop. I was able to compress your code to 3 or 4 lines.
f = (
    lambda lkp: out.append(d[np.isin(d[:, 3], lkp)])
    if all(np.isin(lkp, d[:, 3]))
    else None
)

out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]
    list(map(f, lookup))
This produces the same output list you received previously, and the code runs almost twice as fast (at least on my machine).
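Another thought (my own assumption, not benchmarked against the version above): plain Python set membership short-circuits on the first missing value, which can beat the all(np.isin(...)) test when lookups often fail early:

out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]
    present = set(d[:, 3])  # values available in this firm
    for u in range(len(lookup)):
        if all(v in present for v in lookup[u]):
            out.append(d[np.isin(d[:, 3], lookup[u])])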

Getting wrong results with np.argpartition, while selecting maximum n values from an array

So I was using this answer to the question 'How do I get indices of the N maximum values in a NumPy array?'. I used it in my ML model, which outputs LogSoftmax layer values, and I wanted to get the top 4 classes for each output. In most cases it sorted and gave the values correctly, but in a very few cases I see partially unsorted results like this:
arr = np.array([-3.0302, -2.7103, -7.4844, -3.4761, -5.3009, -5.2121, -3.7549, -4.7834,
-5.8870, -3.4839, -5.0104, -3.0992, -4.8823, -0.3319, -6.8084])
ind = np.argpartition(arr, -4)[-4:]
print(arr[ind])
and the output is
[-3.0992 -3.0302 -0.3319 -2.7103]
which is unsorted; it should put the maximum values last, but that is not the case here. I checked other examples and it does fine there, like:
arr = np.array([45, 35, 67.345, -34.5555, 66, -0.23655, 11.0001, 0.234444444])
ind = np.argpartition(arr, -4)[-4:]
print(arr[ind])
output
[35. 45. 66. 67.345]
What could be the reason? Did I miss anything?
If you're not planning on actually utilizing the sorted indices, why not just use np.sort? (For the record, np.argpartition only guarantees that the kth element lands in its final sorted position; the other elements end up in arbitrary order, which is why your top 4 can come out unsorted.)
>>> arr = np.array([-3.0302, -2.7103, -7.4844, -3.4761, -5.3009, -5.2121, -3.7549,
-4.7834, -5.8870, -3.4839, -5.0104, -3.0992, -4.8823, -0.3319, -6.8084])
>>> np.sort(arr)[-4:]
array([-3.0992, -3.0302, -2.7103, -0.3319])
Alternatively, as read here, you could use a range for the kth option of np.argpartition, which puts each of those last positions into sorted order:
arr[np.argpartition(arr, range(-4, 0))[-4:]]
array([-3.0992, -3.0302, -2.7103, -0.3319])
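If you want to keep argpartition's O(n) advantage on large arrays, another common follow-up (a sketch using the same arr as above) is to partition first and then sort only the four selected entries:

ind = np.argpartition(arr, -4)[-4:]  # indices of the top 4, in arbitrary order
ind = ind[np.argsort(arr[ind])]      # sort just those 4 entries
print(arr[ind])                      # the maximum value now comes last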

Sum of elements of numpy array not same as total

I'm trying to count the number of pairs and save them in two different histograms; one saves the pair in an array where the parent objects are split, and the other just saves the total. That means I have a loop that looks like this:
for k in range(N_parents):
    pair_hist[k, bin] += 1
    total_pair_hist[bin] += 1
where both pair_hist and total_pair_hist are defined as
pair_hist = np.zeros((N_parents, bins.shape[0]), dtype = np.uint64)
total_pair_hist = np.zeros(bins.shape[0], dtype = np.uint64)
I'd expect that summing the elements of pair_hist across all parents (axis=0) would give the total histogram. The funny thing is, if I take the sum of pair_hist:
onehalo_sum_ind = np.sum(pair_hist, axis = 0)
I don't get exactly total_pair_hist, but something slightly different:
total_pair_hist = [ 287248245 448773033 695820015 1070797576 1634146741 2466680801
3667159080 5334307986 7524739978 10206208064 13237161068 16466436715
19231751113 20949333183 21254336387 19497450101 16459529579 13038604111
9783826702 7006904025 4813946458 3207605915 2097437543 1355158303
869077173 555036759 353732683 225171870 143179912 0]
pair_hist = [ 287267022 448887401 696415932 1073435699 1644677789 2503693266
3784008845 5665555755 8380564635 12201977310 17382403650 23929909625
31103373709 36859534246 38146287402 33454446858 25689430007 18142721164
12224099624 8035266046 5211441720 3353187036 2147027818 1370663213
873519714 556182465 353995293 225224668 143189173 0]
Any idea of what's going on? Thank you in advance :)
Sorry for the late reply, but I didn't have time to work on it before. The problem was caused by Numba: I was using it with the parallel=True flag to parallelise one of the loops, and that caused the error.
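For anyone hitting the same thing, here is a minimal sketch (my illustration, not the OP's actual code) of why parallel=True can corrupt counts: two threads can read the same old value of a bin and both write back old + 1, silently losing an increment, which would explain totals that come out too small:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def racy_hist(bin_idx, n_bins):
    hist = np.zeros(n_bins, dtype=np.uint64)
    for i in prange(len(bin_idx)):
        hist[bin_idx[i]] += 1  # unsynchronized read-modify-write: a data race
    return hist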

How to get match results for a given range using a regular expression?

I'm stuck with my code trying to get all matches within a given range. My data sample is:
comment
0 [intj74, you're, whipping, people, is, a, grea...
1 [home, near, kcil2, meniaga, who, intj47, a, l...
2 [thematic, budget, kasi, smooth, sweep]
3 [budget, 2, intj69, most, people, think, of, e...
I want to get this result (where the given range is intj1 to intj75):
comment
0 [intj74]
1 [intj47]
2 [nan]
3 [intj69]
My code is:
df.comment = df.comment.apply(lambda x: [t for t in x if t=='intj74'])
df.ix[df.comment.apply(len) == 0, 'comment'] = [[np.nan]]
I'm not sure how to use a regular expression to match the whole range instead of the single value t == 'intj74'. Or is there any other way to do this?
Thanks in advance,
Pandas Python Newbie
You could replace [t for t in x if t=='intj74'] with, e.g.,
[t for t in x if re.match('intj[0-9]+$', t)]
or even
[t for t in x if re.match('intj[0-9]+$', t)] or [np.nan]
which would also handle the case where there are no matches (so that one wouldn't need to check for that explicitly using df.ix[df.comment.apply(len) == 0, 'comment'] = [[np.nan]]). The "trick" here is that an empty list evaluates to False, so the or in that case returns its right operand.
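Applied to the DataFrame from the question, that would look something like this (assuming comment holds lists of strings, as in the original code):

import re
import numpy as np

df.comment = df.comment.apply(
    lambda x: [t for t in x if re.match(r'intj[0-9]+$', t)] or [np.nan]
)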
I am new to pandas as well. You might have initialized your DataFrame differently. Anyway, this is what I have:
import pandas as pd

data = {
    'comment': [
        "intj74, you're, whipping, people, is, a",
        "home, near, kcil2, meniaga, who, intj47, a",
        "thematic, budget, kasi, smooth, sweep",
        "budget, 2, intj69, most, people, think, of"
    ]
}
df = pd.DataFrame(data)

print(df.comment.str.extract(r'(intj\d+)'))

Format a python list and search for patterns

I am getting rows from a spreadsheet with mixtures of numbers, text and dates. I want to find elements within the list, some numbers and some text. For example:
sg = [500782, u'BMOU9015488', u'SD4', u'CLOSED', -1, '', '', -1]
sg = map(str, sg)
# sg = map(unicode, sg)  # option?

if any("-1" in s for s in sg):
    pass  # do something if matched
I don't feel this is the correct way to do this, and I am also trying to match things like -1.5 and -1.5C, as well as other unexpected values like OPEN15 compared to 15.
I have also looked at
sg.index("-1")
which returns an index if there is an exact match (so it is only good for direct matches).
Some help would be appreciated.
If you want to call a function for each case, I would do it this way:
def stub1(elem):
    # do something for a match of type '-1'
    return

def stub2(elem):
    # do something for a match of type 'SD4'
    return

def stub3(elem):
    # do something for a match of type 'OPEN15'
    return

sg = [500782, u'BMOU9015488', u'SD4', u'CLOSED', -1, '', '', -1]
sg = map(unicode, sg)

patterns = {u"-1": stub1, u"SD4": stub2, u"OPEN15": stub3}  # add more if you want

for elem in sg:
    for k, stub in patterns.iteritems():
        if k in elem:
            stub(elem)
            break
where stub1, stub2, ... are the functions that contain the code for each case.
Each will be called (at most once per string) if the string contains a matching substring.
What do you mean by "I don't feel this is the correct way to do this"? Are you not getting the result you expect? Is it too slow?
Maybe you can organize your data by columns instead of rows and use more specific filters. If you are looking for speed, I'd suggest the numpy module, which has a very interesting function called select().
Scipy select example
By transforming all your rows into a numpy array, you can test several columns in one pass. This function is amazingly efficient and powerful! Basically it's used like this:
import numpy as np

a = np.array(...)  # your data
conds = [a < 10, a % 3 == 0, a > 25]
actions = [a + 100, a / 3, a * 10]
result = np.select(conds, actions, default=0)
All values in a will be transformed as follows:
A value 100 will be added to any value of a which is smaller than 10
Any value in a which is a multiple of 3, will be divided by 3
Any value above 25 will be multiplied by 10
Any other value, not matching the previous conditions, will be set to 0
Both conds and actions are lists and must have the same number of elements; the first element of conds is paired with the first element of actions, and where several conditions match, the first one in the list wins.
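A concrete run makes the order-of-conditions behaviour visible (the input values here are my own):

import numpy as np

a = np.array([2, 9, 12, 26, 30, 14])
conds = [a < 10, a % 3 == 0, a > 25]
actions = [a + 100, a / 3, a * 10]
print(np.select(conds, actions, default=0))
# [102. 109.   4. 260.  10.   0.]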
It could also be used to determine the index of a particular value in a vector (even though this should be done using numpy's nonzero() function):
a = np.array(....)
conds = [a <= target, a > target]
actions = [1, 0]
index = np.select(conds, actions).sum()
This is probably a stupid way of getting an index, but it demonstrates how we can use select()... and it works :-)
