I have a huge table of data (a record array) with elements:
tbdata[i]['a'], tbdata[i]['b'], tbdata[i]['c']
which are all integers, where i is an index between 0 and 1 million (the size of the table).
I also have a list called Name whose elements are file names (900 in total), such as '/Users/Desktop/Data/spe-3588-55184-0228.jpg' (modified), each containing three numbers.
Now I want to select the rows of tbdata whose three elements above all match the three numbers in one of the names in Name. Here's the code I originally wrote:
Data = []
for k in range(0, len(tbdata)):
    for i in range(0, len(NameA5)):
        if Name[i][43:47] == str(tbdata[k]['a']) and\
           Name[i][48:53] == str(tbdata[k]['b']) and\
           Name[i][55:58] == str(tbdata[k]['c']):
            Data.append(tbdata[k])
Python ran for the whole night and still hasn't finished, since either the data set is huge or my algorithm is too slow... I'm wondering what's the fastest way to complete such a task? Thanks!
You can construct a lookup tree like this:
a2b2c2name = {}
for name in NameA5:
    a = int(name[43:47])
    b = int(name[48:53])
    c = int(name[55:58])
    if a not in a2b2c2name:
        a2b2c2name[a] = {}
    if b not in a2b2c2name[a]:
        a2b2c2name[a][b] = {}
    a2b2c2name[a][b][c] = True
for k in range(len(tbdata)):
    a = tbdata[k]['a']
    b = tbdata[k]['b']
    c = tbdata[k]['c']
    if a in a2b2c2name and b in a2b2c2name[a] and c in a2b2c2name[a][b]:
        Data.append(tbdata[k])
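If you don't need the intermediate levels of the tree, a flat set of (a, b, c) tuples gives the same constant-time membership test; a minimal sketch of that variant (my addition, not part of the answer above):

triplets = set()
for name in NameA5:
    triplets.add((int(name[43:47]), int(name[48:53]), int(name[55:58])))

Data = [row for row in tbdata
        if (row['a'], row['b'], row['c']) in triplets]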
I have a small problem in Python. I have a two-dimensional dictionary; let's call it dict[x,y] for now, where x and y are integers. I am trying to select only the key/value pairs that lie between 4 points. The function should look like this:
def search(topleft_x, topleft_y, bottomright_x, bottomright_y):
For example: search(20, 40, 200000000, 300000000)
All dictionary items should then be returned that satisfy:
20 < x < 20000000000
AND 40 < y < 30000000000
Most of the key/value pairs in this huge matrix are not set (see picture), which is why I can't simply iterate over everything.
The function should return a smaller, filtered dictionary. In the example shown in the picture, it would be a new dictionary with the 3 green-circled values. Is there any simple way to do this?
So far I have used two nested for-loops. In this example they would look like this:
def search():
    for x in range(20, 2000000000):
        for y in range(40, 3000000000):
            try:
                pass  # Do something
            except:
                pass  # Well, the item just doesn't exist
Of course this is highly inefficient. So my question is: how do I speed up something like this in Python? In C# I used LINQ for this kind of thing... what should I use in Python?
Thanks for the help!
Example Picture
You don't iterate over huge ranges of numbers and ask 4 million times for forgiveness; instead, use two ranges to describe your "filters" and iterate only over the keys that actually exist in the dictionary and fall into those ranges:
# get fancy on filtering if you like; I used explicit conditions and continues for clarity
def search(d: dict, r1: range, r2: range) -> dict:
    d2 = {}
    for x in d:  # only use existing keys in d - not the 20k that might be in the range
        if x not in r1:  # skip it - not in range r1
            continue
        d2[x] = {}
        for y in d[x]:  # only use existing keys in d[x] - not the 20k that might be in the range
            if y not in r2:  # skip it - not in range r2
                continue
            d2[x][y] = "found: " + d[x][y][:]  # take it, it's in both ranges
    return d2
d = {}
d[20] = {99: "20",999: "200",9999: "2000",99999: "20000",}
d[9999] = { 70:"70",700:"700",7000:"7000",70000:"70000"}
print(search(d,range(10,30), range(40,9000)))
Output:
{20: {99: 'found: 20', 999: 'found: 200'}}
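The same filter can also be written as a nested dict comprehension; a compact sketch (my addition), equivalent to the loop above, including the way it keeps an empty inner dict for an x in r1 with no matching y:

def search_comp(d: dict, r1: range, r2: range) -> dict:
    # iterate only over keys that exist in d, filtering both levels by range membership
    return {x: {y: "found: " + d[x][y] for y in d[x] if y in r2}
            for x in d if x in r1}

print(search_comp(d, range(10, 30), range(40, 9000)))  # same output as above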
It might be useful to take a look at modules providing sparse matrices.
Hoping someone can help me here. I have two BigQuery tables that I read into two different PCollections, p1 and p2. I essentially want to update product based on a type II transformation that keeps track of history (previous values in the nested column of product) and appends new values from dwsku.
The idea is to check every row in each collection. If there is a match on certain table values between p1 and p2, check product's nested data to see if it already contains all values from p1 (based on its sku number and brand). If it does not contain the most recent data from p2, take a copy of the current nested record's format in product, fit the new data into it, and append this nested record to the existing nested products in product.
def process_changes(element, productdata):
    for data in productdata:
        if element['sku_number'] == data['sku_number'] and element['brand'] == data['brand']:
            logging.info('Processing Product: ' + str(element['sku_number']) + ' brand:' + str(element['brand']))
            datatoappend = []
            for nestline in data['product']:
                logging.info('Nested Data: ' + nestline['product'])
                if nestline['in_use'] == 'Y' and (nestline['sku_description'] != element['sku_description'] or nestline['department_id'] != element['department_id'] or nestline['department_description'] != element['department_description']
                                                  or nestline['class_id'] != element['class_id'] or nestline['class_description'] != element['class_description'] or nestline['sub_class_id'] != element['sub_class_id'] or nestline['sub_class_description'] != element['sub_class_description']):
                    logging.info('we found a sku we need to update')
                    logging.info('sku is ' + data['sku_number'])
                    newline = nestline.copy()
                    logging.info('most recent nested product element turned off...')
                    nestline['in_use'] = 'N'
                    nestline['expiration_date'] = "%s-%s-%s" % (curdate.year, curdate.month, curdate.day)  # CURRENT DATE
                    logging.info(nestline)
                    logging.info('inserting most recent change in dwsku inside nest')
                    newline['sku_description'] = element['sku_description']
                    newline['department_id'] = element['department_id']
                    newline['department_description'] = element['department_description']
                    newline['class_id'] = element['class_id']
                    newline['class_description'] = element['class_description']
                    newline['sub_class_id'] = element['sub_class_id']
                    newline['sub_class_description'] = element['sub_class_description']
                    newline['in_use'] = 'Y'
                    newline['effective_date'] = "%s-%s-%s" % (curdate.year, curdate.month, curdate.day)  # CURRENT DATE
                    newline['modified_date'] = "%s-%s-%s" % (curdate.year, curdate.month, curdate.day)  # CURRENT DATE
                    newline['modified_time'] = "%s:%s:%s" % (curdate.hour, curdate.minute, curdate.second)
                    nestline['expiration_date'] = "9999-01-01"
                    datatoappend.append(newline)
                else:
                    logging.info('Nothing changed for sku ' + str(data['sku_number']))
            for dt in datatoappend:
                logging.info('processed sku ' + str(element['sku_number']))
                logging.info('adding the changes (if any)')
                data['product'].append(dt)
    return data
changed_product = p1 | beam.FlatMap(process_changes, AsIter(p2))
Afterwards I want to add all values from p1 that are not in p2, in the same nested format as seen in nestline.
Any help would be appreciated, as I'm wondering why my job runs for hours with nothing to show. Even the output logs in the Dataflow UI don't show anything.
Thanks in advance!
This can be quite expensive if the side input PCollection p2 is large. From your code snippets it's not clear how PCollection p2 is constructed, but if it is, for example, read from a text file of size 62.7MB, processing it per element can be pretty expensive. Can you consider using CoGroupByKey: https://beam.apache.org/documentation/programming-guide/#cogroupbykey
Also note that from a FlatMap you are supposed to return an iterator of elements from the processing method. It seems like you are returning a dictionary ('data'), which is probably incorrect.
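A rough sketch of what the CoGroupByKey shape could look like, building on your existing p1 and p2 (the helper names key_by_sku_brand and merge_history are placeholders, and the merge body is left for your type II logic):

import apache_beam as beam

def key_by_sku_brand(row):
    # key every row by (sku_number, brand) so matching rows meet after the shuffle
    return ((row['sku_number'], row['brand']), row)

def merge_history(keyed_element):
    (sku_brand, grouped) = keyed_element
    for data in grouped['product']:
        for element in grouped['dwsku']:
            # apply the same type II update logic here, once per key,
            # instead of scanning the whole side input for every element
            pass
        yield data  # yield each updated row from the FlatMap instead of returning a dict

changed_product = (
    {'dwsku': p1 | 'KeyDwsku' >> beam.Map(key_by_sku_brand),
     'product': p2 | 'KeyProduct' >> beam.Map(key_by_sku_brand)}
    | beam.CoGroupByKey()
    | beam.FlatMap(merge_history))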
I want to write code to get all combinations from 5 user-input sets, where each output subset matches <= 3 elements from any of the input sets.
Example:
userInput1=(a,b,c,d,e)
userInput2=(c,d,e,f,g)
userInput3=(f,g,h,i,j)
userInput4=(g,h,i,j,k)
userInput5=(k,l,m,n,o)
# Turn 5 lists into 1 large list with no duplicates
allEntries = list(set(userInput1 + userInput2 + userInput3 + userInput4 + userInput5 ))
# Generate all possible list combinations
allCombinations = list(itertools.combinations( allEntries,5))
print "All combinations:"
for subset in allCombinations:
    ?????????
    print subset
How do I do this check to limit the overlap? For instance, (g,i,j,k,o) fails because it shares 4 elements with userInput4.
E.g. valid combinations:
(a,c,j,l,o)
(k,b,a,m,n)
There isn't a single simple itertools call for this. However, you do have the correct start. Now, check each combination as you produce it:
check_set = [
    set(userInput1),
    set(userInput2),
    set(userInput3),
    set(userInput4),
    set(userInput5)
]
for five in itertools.combinations(allEntries, 5):
    five_set = set(five)
    # If there are no overlaps of more than 3 elements,
    # accept the solution.
    if not any(len(five_set.intersection(user_set)) > 3
               for user_set in check_set):
        print five
        # ... or whatever you do to save the good combination.
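As a quick illustration of the rejection rule from the question (assuming the inputs are tuples of one-character strings), (g,i,j,k,o) shares 4 elements with userInput4, so the any(...) test is true and the combination is skipped:

userInput4 = ('g', 'h', 'i', 'j', 'k')
candidate = ('g', 'i', 'j', 'k', 'o')
overlap = len(set(candidate) & set(userInput4))
print overlap  # 4 -> more than 3 shared elements, so this combination is rejected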
I want to import into Python some ASCII files (from Tecplot, software for CFD post-processing).
The rules for those files are (at least, for those that I need to import):
The file is divided into several sections.
Each section has a two-line header like:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
Each section has the set of variables given by the first header line. When a section ends, a new section starts with two similar lines.
For each variable there are I*J*K values.
Each variable is a contiguous block of values.
There is a fixed number of values per row (6).
When a variable ends, the next one starts on a new line.
Variables are "IJK ordered data": the I-index varies the fastest, the J-index the next fastest, and the K-index the slowest. The I-index should be the inner loop, the K-index the outer loop, and the J-index the loop in between (illustrated in the sketch below).
Here is an example of data:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
-3.9999999E+00 -3.3327306E+00 -2.7760824E+00 -2.3117116E+00 -1.9243209E+00 -1.6011492E+00
[...]
0.0000000E+00 #end of first variable
-4.3532482E-02 -4.3584235E-02 -4.3627592E-02 -4.3663762E-02 -4.3693815E-02 -4.3718831E-02 #second variable, 'y'
[...]
1.0738781E-01 #end of second variable
[...]
[...]
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen" #next zone
ZONE T="Window(s) : E_W_Block0003_ALL", I=17, J=17, K=25, F=BLOCK
I am quite new to Python, and I have written code that imports the data into a dictionary, storing each variable as a 3D numpy.array. These files can be very big (up to GBs). How can I make this code faster? (Or, more generally, how can I import such files as fast as possible?)
import re
from numpy import zeros, array, prod

def vectorr(I, J, K):
    """function"""
    vect = []
    for k in range(0, K):
        for j in range(0, J):
            for i in range(0, I):
                vect.append([i, j, k])
    return vect

a = open('E:\u.dat')
filelist = a.readlines()
NumberCol = 6
count = 0
data = dict()
leng = len(filelist)
countzone = 0
while count < leng:
    strVARIABLES = re.findall('VARIABLES', filelist[count])
    variables = re.findall(r'"(.*?)"', filelist[count])
    countzone = countzone + 1
    data[countzone] = {key: [] for key in variables}
    count = count + 1
    strI = re.findall('I=....', filelist[count])
    strI = re.findall('\d+', strI[0])
    I = int(strI[0])
    ##
    strJ = re.findall('J=....', filelist[count])
    strJ = re.findall('\d+', strJ[0])
    J = int(strJ[0])
    ##
    strK = re.findall('K=....', filelist[count])
    strK = re.findall('\d+', strK[0])
    K = int(strK[0])
    data[countzone]['indmax'] = array([I, J, K])
    pr = prod(data[countzone]['indmax'])
    lin = pr // NumberCol
    if pr % NumberCol != 0:
        lin = lin + 1
    vect = vectorr(I, J, K)
    for key in variables:
        init = zeros((I, J, K))
        for ii in range(0, lin):
            count = count + 1
            temp = map(float, filelist[count].split())
            for iii in range(0, len(temp)):
                init.itemset(tuple(vect[ii*6+iii]), temp[iii])
        data[countzone][key] = init
    count = count + 1
P.S. In Python only, please: no Cython or other languages.
Converting a large bunch of strings to numbers is always going to be a little slow, but assuming the triple-nested for-loop is the bottleneck here, changing it to the following may give you a sufficient speedup:
# add this line to your imports
from numpy import fromstring

# replace the nested for-loop with:
count += 1
for key in variables:
    str_vector = ' '.join(filelist[count:count+lin])
    ar = fromstring(str_vector, sep=' ')
    ar = ar.reshape((I, J, K), order='F')
    data[countzone][key] = ar
    count += lin
Unfortunately at the moment I only have access to my smartphone (no PC), so I can't test how fast this is, or even whether it works correctly at all!
Update
Finally I got around to doing some testing:
My code contained a small error, but it does seem to work correctly now.
The code with the proposed changes runs about 4 times faster than the original.
Your code spends most of its time in ndarray.itemset, and probably on loop overhead and float conversion. Unfortunately cProfile doesn't show this in much detail.
The improved code spends about 70% of its time in numpy.fromstring, which, in my view, indicates that this method is reasonably fast for what you can achieve with Python/NumPy.
Update 2
Of course it would be even better to iterate over the file instead of loading everything at once. In this case that is slightly faster (I tried it) and it significantly reduces memory use. You could also try to use multiple CPU cores to do the loading and conversion to floats, but then it becomes difficult to have all the data under one variable. Finally, a word of warning: the fromstring method that I used scales rather badly with the length of the string; e.g. from a certain string length onward it becomes more efficient to use something like np.fromiter(itertools.imap(float, str_vector.split()), dtype=float).
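A sketch of what the streaming variant could look like (my assumption, reusing fromstring and the Fortran-order reshape from above), reading each variable's block straight from an open file handle instead of a pre-loaded list of lines:

import itertools
from numpy import fromstring

def read_block(fh, I, J, K, numbercol=6):
    """Read one variable's I*J*K values from an already-open file, line by line."""
    lin = (I * J * K + numbercol - 1) // numbercol  # number of data lines for one variable
    block = ' '.join(itertools.islice(fh, lin))     # consume exactly lin lines
    return fromstring(block, sep=' ').reshape((I, J, K), order='F')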
If you use regular expressions here, there are two things that I would change:
Compile the REs which are used more often (which applies to all REs in your example, I guess). Do regex = re.compile("<pattern>") on them, and use the resulting object with match = regex.match(), as described in the Python documentation.
For the I, J, K REs, consider reducing the two REs to one, using the grouping feature (also described above): search for a pattern of the form "I=(\d+)" and grab the part matched inside the parentheses via regex.group(1). Taking this further, you can define a single regex to capture all three variables in one step (see the sketch after this list).
At least for finding the start of a section, REs seem a bit of an overkill: there's no variation in the string you need to look for, and string.find() is sufficient and probably faster in that case.
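For example, a single compiled pattern along these lines could pull I, J and K out of a ZONE header in one pass (a sketch based on the header format shown in the question; zone_re and parse_zone are names I made up):

import re

zone_re = re.compile(r'I=\s*(\d+),\s*J=\s*(\d+),\s*K=\s*(\d+)')

def parse_zone(line):
    m = zone_re.search(line)
    return None if m is None else tuple(int(g) for g in m.groups())

print(parse_zone('ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK'))
# -> (29, 17, 25)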
EDIT: I just saw you use grouping already for the variables...
I am reading a file that contains values like this:
-0.68285 -6.919616
-0.7876 -14.521115
-0.64072 -43.428411
-0.05368 -11.561341
-0.43144 -34.768892
-0.23268 -10.793603
-0.22216 -50.341101
-0.41152 -90.083377
-0.01288 -84.265557
-0.3524 -24.253145
How do I split this into individual arrays based on the value in column 1, with a bin width of 0.1?
I want my output to look something like this:
array1=[[-0.05368, -11.561341],[-0.01288, -84.265557]]
array2=[[-0.23268, -10.79360] ,[-0.22216, -50.341101]]
array3=[[-0.3524, -24.253145]]
array4=[[-0.43144, -34.768892], [-0.41152, -90.083377]]
array5=[[-0.68285, -6.919616],[-0.64072, -43.428411]]
array6=[[-0.7876, -14.521115]]
Here's a simple solution using Python's round function and dictionary class:
lines = open('some_file.txt').readlines()
dictionary = {}
for line in lines:
    nums = line[:-1].split(' ')  # remove the newline and split the columns
    k = round(float(nums[0]), 1)  # round the first column to get the bucket
    if k not in dictionary:
        dictionary[k] = []  # add an empty bucket
    dictionary[k].append([float(nums[0]), float(nums[1])])
    # add the numbers to the bucket
print dictionary
To get a particular bucket (like -0.3), just do:
x = dictionary[-0.3]
or
x = dictionary.get(-0.3, [])
if you just want an empty list returned for empty buckets.
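If you want the buckets aligned to fixed edges [-0.1, 0.0), [-0.2, -0.1), ... as in the expected output, a small variation (my sketch, not part of the answer above) is to key by the lower bin edge instead of rounding; round() would put -0.05368 and -0.01288 into different buckets (-0.1 and -0.0), whereas the example output groups them together:

import math

def bin_key(x, width=0.1):
    # key by the lower edge of the bin containing x
    return math.floor(x / width) * width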