Loading Large File in Python

I'm using Python 2.6.2 [GCC 4.3.3] running on Ubuntu 9.04. I need to read a big datafile (~1GB, >3 million lines), line by line, using a Python script.
I tried the methods below, and I find they use a very large amount of memory (~3GB):
for line in open('datafile','r').readlines():
    process(line)
or,
for line in file(datafile):
    process(line)
Is there a better way to load a large file line by line, say
a) by explicitly limiting the maximum number of lines held in memory at any one time? Or
b) by loading it in chunks of, say, 1024 bytes, provided the last line of each chunk is loaded completely, without being truncated (sketched below)?
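To make (b) concrete, something along these lines is roughly what I have in mind (just a sketch; process() is my per-line function from above):
def process_in_chunks(path, chunk_size=1024):
    # read fixed-size chunks but only hand complete lines to process();
    # the trailing partial line is carried over into the next chunk
    leftover = ''
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            lines = chunk.split('\n')
            leftover = lines.pop()      # possibly incomplete last line
            for line in lines:
                process(line)
    if leftover:
        process(leftover)               # flush the final line, if any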
Several suggestions give the methods I mentioned above and have already tried; I'm trying to see if there is a better way to handle this. My search has not been fruitful so far. I appreciate your help.
P.S. I have done some memory profiling using Heapy and found no memory leaks in the Python code I am using.
Update 20 August 2012, 16:41 (GMT+1)
Tried both approaches as suggested by J.F. Sebastian, mgilson and IamChuckB (datafile is a variable):
with open(datafile) as f:
    for line in f:
        process(line)
Also,
import fileinput
for line in fileinput.input([datafile]):
    process(line)
Strangely, both of them use ~3GB of memory; my datafile in this test is 765.2MB and consists of 21,181,079 lines. I see the memory usage increase over time (in roughly 40-80MB steps) before stabilizing at 3GB.
An elementary doubt:
Is it necessary to flush each line after use?
I did memory profiling using Heapy to understand this better.
Level 1 Profiling
Partition of a set of 36043 objects. Total size = 5307704 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 15934 44 1301016 25 1301016 25 str
1 50 0 628400 12 1929416 36 dict of __main__.NodeStatistics
2 7584 21 620936 12 2550352 48 tuple
3 781 2 590776 11 3141128 59 dict (no owner)
4 90 0 278640 5 3419768 64 dict of module
5 2132 6 255840 5 3675608 69 types.CodeType
6 2059 6 247080 5 3922688 74 function
7 1716 5 245408 5 4168096 79 list
8 244 1 218512 4 4386608 83 type
9 224 1 213632 4 4600240 87 dict of type
<104 more rows. Type e.g. '_.more' to view.>
============================================================
Level 2 Profiling for Level 1-Index 0
Partition of a set of 15934 objects. Total size = 1301016 bytes.
Index Count % Size % Cumulative % Referred Via:
0 2132 13 274232 21 274232 21 '.co_code'
1 2132 13 189832 15 464064 36 '.co_filename'
2 2024 13 114120 9 578184 44 '.co_lnotab'
3 247 2 110672 9 688856 53 "['__doc__']"
4 347 2 92456 7 781312 60 '.func_doc', '[0]'
5 448 3 27152 2 808464 62 '[1]'
6 260 2 15040 1 823504 63 '[2]'
7 201 1 11696 1 835200 64 '[3]'
8 188 1 11080 1 846280 65 '[0]'
9 157 1 8904 1 855184 66 '[4]'
<4717 more rows. Type e.g. '_.more' to view.>
Level 2 Profiling for Level 1-Index 1
Partition of a set of 50 objects. Total size = 628400 bytes.
Index Count % Size % Cumulative % Referred Via:
0 50 100 628400 100 628400 100 '.__dict__'
Level 2 Profiling for Level 1-Index 2
Partition of a set of 7584 objects. Total size = 620936 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1995 26 188160 30 188160 30 '.co_names'
1 2096 28 171072 28 359232 58 '.co_varnames'
2 2078 27 157608 25 516840 83 '.co_consts'
3 261 3 21616 3 538456 87 '.__mro__'
4 331 4 21488 3 559944 90 '.__bases__'
5 296 4 20216 3 580160 93 '.func_defaults'
6 55 1 3952 1 584112 94 '.co_freevars'
7 47 1 3456 1 587568 95 '.co_cellvars'
8 35 0 2560 0 590128 95 '[0]'
9 27 0 1952 0 592080 95 '.keys()[0]'
<189 more rows. Type e.g. '_.more' to view.>
Level 2 Profiling for Level 1-Index 3
Partition of a set of 781 objects. Total size = 590776 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1 0 98584 17 98584 17 "['locale_alias']"
1 29 4 35768 6 134352 23 '[180]'
2 28 4 34720 6 169072 29 '[90]'
3 30 4 34512 6 203584 34 '[270]'
4 27 3 33672 6 237256 40 '[0]'
5 25 3 26968 5 264224 45 "['data']"
6 1 0 24856 4 289080 49 "['windows_locale']"
7 64 8 20224 3 309304 52 "['inters']"
8 64 8 17920 3 327224 55 "['galog']"
9 64 8 17920 3 345144 58 "['salog']"
<84 more rows. Type e.g. '_.more' to view.>
============================================================
Level 3 Profiling for Level 2-Index 0, Level 1-Index 0
Partition of a set of 2132 objects. Total size = 274232 bytes.
Index Count % Size % Cumulative % Referred Via:
0 2132 100 274232 100 274232 100 '.co_code'
Level 3 Profiling for Level 2-Index 0, Level 1-Index 1
Partition of a set of 50 objects. Total size = 628400 bytes.
Index Count % Size % Cumulative % Referred Via:
0 50 100 628400 100 628400 100 '.__dict__'
Level 3 Profiling for Level 2-Index 0, Level 1-Index 2
Partition of a set of 1995 objects. Total size = 188160 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1995 100 188160 100 188160 100 '.co_names'
Level 3 Profiling for Level 2-Index 0, Level 1-Index 3
Partition of a set of 1 object. Total size = 98584 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1 100 98584 100 98584 100 "['locale_alias']"
Still troubleshooting this.
Do share with me if you have faced this before.
Thanks for your help.
Update 21 August 2012, 01:55 (GMT+1)
mgilson, the process function is used to post-process a Network Simulator 2 (NS2) tracefile. Some of the lines in the tracefile are shared below. I am using numerous objects, counters, tuples, and dictionaries in the Python script to learn how a wireless network performs.
s 1.231932886 _25_ AGT --- 0 exp 10 [0 0 0 0 Y Y] ------- [25:0 0:0 32 0 0]
s 1.232087886 _25_ MAC --- 0 ARP 86 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776108 _42_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776625 _34_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776633 _9_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776658 _0_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232856942 _35_ MAC --- 0 ARP 28 [0 ffffffff 64 806 Y Y] ------- [REQUEST 100/25 0/0]
s 1.232871658 _0_ MAC --- 0 ARP 86 [13a 67 1 806 Y Y] ------- [REPLY 1/0 103/25]
r 1.233096712 _29_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097047 _4_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097050 _26_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097051 _1_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233109522 _25_ MAC --- 0 ARP 28 [13a 67 1 806 Y Y] ------- [REPLY 1/0 103/25]
s 1.233119522 _25_ MAC --- 0 ACK 38 [0 1 67 0 Y Y]
r 1.233236204 _17_ MAC --- 0 ARP 28 [0 ffffffff 65 806 Y Y] ------- [REQUEST 101/25 0/0]
r 1.233236463 _20_ MAC --- 0 ARP 28 [0 ffffffff 65 806 Y Y] ------- [REQUEST 101/25 0/0]
D 1.233236694 _18_ MAC COL 0 ARP 86 [0 ffffffff 65 806 67 1] ------- [REQUEST 101/25 0/0]
The aim of doing 3-level profiling with Heapy is to help me narrow down which object(s) are eating up most of the memory. As you can see, unfortunately I could not tell which one specifically needs tweaking, as the output is too generic. For example, although I know "dict of __main__.NodeStatistics" has only 50 objects out of 36,043 (0.1%), yet takes up 12% of the total memory used to run the script, I am unable to find which specific dictionary I need to look into.
I tried implementing David Eyk's suggestion as below (snippet), trying to manually garbage collect every 500,000 lines,
import gc
for i, line in enumerate(file(datafile)):
    if (i % 500000 == 0):
        print '-----------This is line number', i
        collected = gc.collect()
        print "Garbage collector: collected %d objects." % (collected)
Unfortunately, the memory usage is still at 3GB and the output (snippet) is as below,
-----------This is line number 0
Garbage collector: collected 0 objects.
-----------This is line number 500000
Garbage collector: collected 0 objects.
Having implemented martineau's suggestion, I see the memory usage is now 22MB, down from the earlier 3GB! Something I had been looking forward to achieving. The strange thing is shown below.
I did the same memory profiling as before,
Level 1 Profiling
Partition of a set of 35474 objects. Total size = 5273376 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 15889 45 1283640 24 1283640 24 str
1 50 0 628400 12 1912040 36 dict of __main__.NodeStatistics
2 7559 21 617496 12 2529536 48 tuple
3 781 2 589240 11 3118776 59 dict (no owner)
4 90 0 278640 5 3397416 64 dict of module
5 2132 6 255840 5 3653256 69 types.CodeType
6 2059 6 247080 5 3900336 74 function
7 1716 5 245408 5 4145744 79 list
8 244 1 218512 4 4364256 83 type
9 224 1 213632 4 4577888 87 dict of type
<104 more rows. Type e.g. '_.more' to view.>
Comparing the previous memory profiling output with the above, str is down by 45 objects (17,376 bytes), tuple is down by 25 objects (3,440 bytes), and dict (no owner), though unchanged in object count, is down by 1,536 bytes. All other objects are the same, including dict of __main__.NodeStatistics. The total number of objects is now 35,474. The small reduction in objects (0.2%) produced 99.3% of the memory saving (22MB from 3GB). Very strange.
As you can see, though I know where the memory starvation is occurring, I am not yet able to narrow down what is causing the bleed.
Will continue to investigate this.
Thanks for all the pointers; I'm using this opportunity to learn a lot about Python, as I am not an expert. I appreciate the time you have taken to assist me.
Update 23 August 2012, 00:01 (GMT+1) -- SOLVED
I continued debugging using the minimalistic code per martineau's suggestion. I began adding code to the process function and observing the memory bleed.
I find the memory starts to bleed when I add a class as below,
class PacketStatistics(object):
    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0
        ...
I am using 3 classes with 136 counters.
I discussed this issue with my friend Gustavo Carneiro; he suggested using __slots__ to replace the per-instance dict.
I converted the class as below,
class PacketStatistics(object):
    __slots__ = ('event_id', 'event_source', 'event_dest',...)

    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0
        ...
When I converted all 3 classes, the memory usage dropped from the earlier 3GB to 504MB. A whopping saving of over 80% in memory usage!!
Below is the memory profiling after the dict-to-__slots__ conversion.
Partition of a set of 36157 objects. Total size = 4758960 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 15966 44 1304424 27 1304424 27 str
1 7592 21 624776 13 1929200 41 tuple
2 780 2 587424 12 2516624 53 dict (no owner)
3 90 0 278640 6 2795264 59 dict of module
4 2132 6 255840 5 3051104 64 types.CodeType
5 2059 6 247080 5 3298184 69 function
6 1715 5 245336 5 3543520 74 list
7 225 1 232344 5 3775864 79 dict of type
8 244 1 223952 5 3999816 84 type
9 166 0 190096 4 4189912 88 dict of class
<101 more rows. Type e.g. '_.more' to view.>
The dict of __main__.NodeStatistics is not in the top 10 anymore.
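(As a side note, the tiny sketch below, separate from my actual script, shows what the conversion changes: the __slots__ version no longer carries a per-instance __dict__, which is where the per-object overhead was going.)
class WithDict(object):
    def __init__(self):
        self.event_id = 0

class WithSlots(object):
    __slots__ = ('event_id',)
    def __init__(self):
        self.event_id = 0

print(hasattr(WithDict(), '__dict__'))    # True  -> one dict per instance
print(hasattr(WithSlots(), '__dict__'))   # False -> attribute lives in the slot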
I am happy with the result and glad to close this issue.
Thanks for all your guidance. Truly appreciate it.
rgds
Saravanan K

with open('datafile') as f:
    for line in f:
        process(line)
This works because files are iterators yielding 1 line at a time until there are no more lines to yield.
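If you also want to cap how many lines sit in memory at once (the asker's option (a)), one possible sketch using itertools.islice, with process() standing in for the per-line work from the question:
from itertools import islice

def process_in_batches(path, n=100000):
    # hold at most n lines in memory at any one time
    with open(path) as f:
        while True:
            batch = list(islice(f, n))
            if not batch:
                break
            for line in batch:
                process(line)    # per-line work as in the question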

The fileinput module will let you read it line by line without loading the entire file into memory (see the pydocs).
import fileinput
for line in fileinput.input(['myfile']):
    do_something(line)
Code example taken from yak.net

mgilson's answer is correct. The simple solution bears official mention though (HerrKaputt mentioned this in a comment):
file = open('datafile')
for line in file:
    process(line)
file.close()
This is simple, pythonic, and understandable. If you don't understand how with works, just use this.
As the other poster mentioned, this does not create a large list like file.readlines(). Rather, it pulls off one line at a time, in the way that is traditional for unix files/pipes.
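And if you want the file to be closed even when process(line) raises, the same guarantee that with provides can be spelled out with try/finally, roughly like this sketch:
f = open('datafile')
try:
    for line in f:
        process(line)
finally:
    f.close()    # always runs, even if process() raises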

If the file is JSON, XML, CSV, genomics or any other well-known format, there are specialized readers which use C code directly and are far better optimized for both speed and memory than parsing in native Python; avoid parsing it natively whenever possible.
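For example, for CSV the standard library's csv module streams rows one at a time; a rough sketch (process_row here is just a stand-in for whatever per-row work you need):
import csv

with open('datafile') as f:
    for row in csv.reader(f):    # one row at a time, never the whole file
        process_row(row)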
But in general, tips from my experience:
Python's multiprocessing package is fantastic for managing subprocesses; all memory leaks go away when the subprocess ends.
run the reader subprocess as a multiprocessing.Process and use a multiprocessing.Pipe(duplex=True) to communicate (send the filename and any other args, then read its stdout); a sketch follows this list
read in small (but not tiny) chunks, say 64KB-1MB. Better for memory usage, and also for responsiveness with respect to other running processes/subprocesses
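A minimal sketch of that reader-subprocess idea (the names and the line-count placeholder are illustrative, and here the result comes back over the Pipe rather than stdout):
import multiprocessing

def reader(conn):
    # child process: receive a filename, stream the file line by line,
    # send a result back, then exit so all of its memory is released
    filename = conn.recv()
    count = 0
    with open(filename) as f:
        for line in f:          # one line at a time, no readlines()
            count += 1          # placeholder for real per-line processing
    conn.send(count)
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = multiprocessing.Pipe(duplex=True)
    p = multiprocessing.Process(target=reader, args=(child_conn,))
    p.start()
    parent_conn.send('datafile')    # filename as used in the question
    print(parent_conn.recv())       # collect the child's result
    p.join()                        # child exits; its memory goes back to the OS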

Related

Pytorch memory leak when threading

I get a significant memory leak when running a PyTorch model to evaluate images from a dataset.
Every new image evaluation is started in a new thread.
It doesn't matter whether the code waits for the thread to finish or not. When threads are not used (the evaluate function is just called directly), there is no leak. I've tried deleting the thread variable every iteration, but that doesn't help.
Here is the code:
hidden_sizes = [6336, 1000]

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = torch.nn.Linear(hidden_sizes[0], hidden_sizes[1])

    def forward(self, x):
        x = x.view(-1, hidden_sizes[0])
        x = torch.nn.functional.log_softmax(self.fc1(x), dim=1)
        return x

#---------------------------------------------------------------
def newThread(i):
    image = cv2.imread(pathx + filenames[i], cv2.IMREAD_GRAYSCALE)
    images = tran(image)
    images = tran1(images)
    images = images.unsqueeze(0)

    # Run model
    with torch.no_grad():
        logps = model(images)
    ps = torch.exp(logps)
    probab = list(ps.numpy()[0])
    pred_label = probab.index(max(probab))

#---------------------------------------------------------------
model = Net()
model.load_state_dict(torch.load("test_memory_leak.pt"))

# normalize image
tran = transforms.ToTensor()
tran1 = transforms.Normalize((0.5,), (0.5,))

pathx = "images\\"
filenames = os.listdir(pathx)

for i in range(len(filenames)):
    thread1 = threading.Thread(target=newThread, args=(i,))
    thread1.start()
    thread1.join()
What could be the reason for that?
UPD: I tried to detect memory leaks with guppy, but the reason still isn't clear. Here are some stats: the first table is the program's memory usage at the beginning, the second one is after the memory usage increased 2.4-fold (up to 480MB) after analyzing 1000 images:
Partition of a set of 260529 objects. Total size = 33587422 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 72188 28 9511536 28 9511536 28 str
1 70921 27 5418336 16 14929872 44 tuple
2 32926 13 2526536 8 17456408 52 bytes
3 16843 6 2434008 7 19890416 59 types.CodeType
4 2384 1 2199952 7 22090368 66 type
5 14785 6 2010760 6 24101128 72 function
6 4227 2 1631384 5 25732512 77 dict (no owner)
7 794 0 1399928 4 27132440 81 dict of module
8 2384 1 1213816 4 28346256 84 dict of type
9 38 0 704064 2 29050320 86 dict of torch.tensortype
<575 more rows. Type e.g. '_.more' to view.>
-------------------------------------------
Partition of a set of 265841 objects. Total size = 34345930 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 72203 27 9523346 28 9523346 28 str
1 70924 27 5418584 16 14941930 44 tuple
2 32928 12 2530722 7 17472652 51 bytes
3 16844 6 2434152 7 19906804 58 types.CodeType
4 2384 1 2200488 6 22107292 64 type
5 14786 6 2010896 6 24118188 70 function
6 4232 2 1637736 5 25755924 75 dict (no owner)
7 794 0 1399928 4 27155852 79 dict of module
8 2384 1 1213816 4 28369668 83 dict of type
9 265 0 840672 2 29210340 85 set
<577 more rows. Type e.g. '_.more' to view.>

Python assign some part of the list to another list

I have a dataset like the one below.
In this dataset the first column represents the id of a person, the last column is the label of this person, and the rest of the columns are the features of the person.
101 166 633.0999756 557.5 71.80000305 60.40000153 2.799999952 1 1 -1
101 133 636.2000122 504.3999939 71 56.5 2.799999952 1 2 -1
105 465 663.5 493.7000122 82.80000305 66.40000153 3.299999952 10 3 -1
105 133 635.5999756 495.6000061 89 72 3.599999905 9 6 -1
105 266 633.9000244 582.2000122 93.59999847 81 3.700000048 2 2 -1
105 299 618.4000244 552.4000244 80.19999695 66.59999847 3.200000048 3 64 -1
105 99 615.5999756 575.7000122 80 67 3.200000048 0 0 -1
120 399 617.7000122 583.5 95.80000305 82.40000153 3.799999952 8 10 1
120 266 633.9000244 582.2000122 93.59999847 81 3.700000048 2 2 1
120 299 618.4000244 552.4000244 80.19999695 66.59999847 3.200000048 3 64 1
120 99 615.5999756 575.7000122 80 67 3.200000048 0 0 1
My aim is to classify these people, and I want to use the leave-one-person-out method as the split method. So I need to choose one person and all of his data as the test data, and the rest of the data for training. But when I try to select the test data with a list assignment operation, it gives an error. This is my code:
import numpy as np
datasets = ["raw_fixationData.txt"]
file_name_array = [101, 105, 120]
for data in datasets:
    data = np.genfromtxt(data, delimiter="\t")
    data = data[1:, :]
    num_line = len(data[:, 1]) - 1
    num_feat = len(data[1, :]) - 2
    label = num_feat + 1
    X = data[0:num_line+1, 1:label]
    y = data[0:num_line+1, label]
    test_prtcpnt = []; test_prtcpnt_label = []; train_prtcpnt = []; train_prtcpnt_label = []
    for i in range(len(file_name_array)):
        m = 0  # test index
        n = 0  # train index
        for j in range(num_line):
            if X[j, 0] == file_name_array[i]:
                test_prtcpnt[m, 0:10] = X[j, 0:10]
                test_prtcpnt_label[m] = y[j]
                m = m + 1
            else:
                train_prtcpnt[n, 0:10] = X[j, 0:10]
                train_prtcpnt_label[n] = y[j]
                n = n + 1
This code gives me this error: test_prtcpnt[m,0:10]=X[j,0:10]; TypeError: list indices must be integers or slices, not tuple
How could I solve this problem?
I think that you are misusing Python's slice notation. Please refer to the following stack overflow post on slicing:
Explain Python's slice notation
In this case, the Python interpreter is treating the index m,0:10 in test_prtcpnt[m,0:10] as a tuple, and a plain Python list does not accept tuple indices (a NumPy array would). Is it possible that you meant to say the following:
test_prtcpnt[0:10]=X[0:10]
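If the end goal is the leave-one-person-out split itself, a rough NumPy-based sketch (the function name is illustrative; it assumes data is the array loaded with np.genfromtxt as in your code, with the person id in column 0 and the label in the last column) avoids per-element list assignment entirely:
import numpy as np

def leave_one_person_out(data, person_id):
    # boolean mask of the rows belonging to the held-out person
    mask = data[:, 0] == person_id
    X, y = data[:, 1:-1], data[:, -1]
    # train_X, train_y, test_X, test_y
    return X[~mask], y[~mask], X[mask], y[mask]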

Number of values lying in a specified range

I have a data frame like the one below:
NC_011163.1:1
NC_011163.1:22
NC_011163.1:44
NC_011163.1:65
NC_011163.1:73
NC_011163.1:87
NC_011163.1:104
NC_011163.1:130
NC_011163.1:151
NC_011163.1:172
NC_011163.1:194
NC_011163.1:210
NC_011163.1:235
NC_011163.1:255
NC_011163.1:295
NC_011163.1:320
NC_011163.1:445
NC_011163.1:520
I would like to scan the data frame using a window of 210 and count the number of values lying in each 210-position window.
Desired output:
output: Values
NC_011163.1:1-210 12
NC_011163.1:211-420 4
NC_011163.1:421-630 2
I'd greatly appreciate your inputs to solve this problem.
Thanks
V
If you use Python and Pandas, you can do the following,
with your data in a dataframe df:
NC N
0 NC_011163.1 1
1 NC_011163.1 22
2 NC_011163.1 44
3 NC_011163.1 65
4 NC_011163.1 73
5 NC_011163.1 87
6 NC_011163.1 104
7 NC_011163.1 130
8 NC_011163.1 151
9 NC_011163.1 172
10 NC_011163.1 194
11 NC_011163.1 210
12 NC_011163.1 235
13 NC_011163.1 255
14 NC_011163.1 295
15 NC_011163.1 320
16 NC_011163.1 445
17 NC_011163.1 520
df.groupby([df.NC, pd.cut(df.N, range(0,631,210))]).count()
N
NC N
NC_011163.1 (0, 210] 12
(210, 420] 4
(420, 630] 2
Where:
pd.cut(df.N, range(0, 631, 210)) returns the bin each value in column N falls into; the bins are defined by the range, which creates 3 bins with edges [0, 210, 420, 630].
Then you groupby on:
the NC name (so it matches your output; here it is redundant as there is only one group, but I guess you'll have other chromosomes, so the operation will be performed per chromosome)
the bins you've just made
and count the number of elements in each group.
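In case the data is still in the raw NC_011163.1:22 form shown in the question, one way to build the NC/N dataframe above could be the sketch below (the file name positions.txt and the column name raw are assumptions):
import pandas as pd

# split the "name:position" strings into the NC / N columns used above
raw = pd.read_csv('positions.txt', header=None, names=['raw'])
df = raw['raw'].str.split(':', expand=True)
df.columns = ['NC', 'N']
df['N'] = df['N'].astype(int)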
$ cat tst.awk
BEGIN { FS=":"; OFS="\t"; endOfRange=210 }
{
key = $1
bucket = int((($2-1)/endOfRange)+1)
cnt[bucket]++
maxBucket = (bucket > maxBucket ? bucket : maxBucket)
}
END {
for (bucket=1; bucket<=maxBucket; bucket++) {
print key ":" ((bucket-1)*endOfRange)+1 "-" bucket*endOfRange, cnt[bucket]+0
}
}
$ awk -f tst.awk file
NC_011163.1:1-210 12
NC_011163.1:211-420 4
NC_011163.1:421-630 2
Note that this will work even if you have some ranges with no values in your input data (it will print the range with a count of zero) and it will always print the ranges in numerical order (output order when using the in operator is "random"):
$ cat file
NC_011163.1:1
NC_011163.1:22
NC_011163.1:520
$ awk -f tst.awk file
NC_011163.1:1-210 2
NC_011163.1:211-420 0
NC_011163.1:421-630 1
awk -v t=210 'BEGIN{FS=":";t++}{++a[int($2/t)]}
END{for(x in a){printf "%s:%s\t%d\n",$1,t*x"-"(x+1)*t,a[x]}}' file
will give this output:
NC_011163.1:0-211 12
NC_011163.1:211-422 4
NC_011163.1:422-633 2
You don't need to find out what the max value is or how many sections/ranges you will end up with; this command does it for you.
The code is easy to understand too, I think; most of it is there for the output format.

Filtering records in Pandas python - syntax error

I have a pandas data frame that looks like this:
duration distance speed hincome fi_cost type
0 359 1601 4 3 40.00 cycling
1 625 3440 6 3 86.00 cycling
2 827 4096 5 3 102.00 cycling
3 1144 5704 5 2 143.00 cycling
If I use the following, I export a new csv that pulls only those records with a distance less than 5000.
distance_1 = all_results[all_results.distance < 5000]
distance_1.to_csv('./distance_1.csv',",")
Now, I wish to export a csv with values from 5001 to 10000. I can't seem to get the syntax right...
distance_2 = all_results[10000 > all_results.distance < 5001]
distance_2.to_csv('./distance_2.csv',",")
Unfortunately, because of how Python chained comparisons work, we can't use the 50 < x < 100 syntax when x is a vector-like quantity such as a Series. You have several options.
You could create two boolean Series and use & to combine them:
>>> all_results[(all_results.distance > 3000) & (all_results.distance < 5000)]
duration distance speed hincome fi_cost type
1 625 3440 6 3 86 cycling
2 827 4096 5 3 102 cycling
Use between to create a boolean Series and then use that to index (note that it's inclusive by default, though):
>>> all_results[all_results.distance.between(3000, 5000)] # inclusive by default
duration distance speed hincome fi_cost type
1 625 3440 6 3 86 cycling
2 827 4096 5 3 102 cycling
Or finally you could use .query:
>>> all_results.query("3000 < distance < 5000")
duration distance speed hincome fi_cost type
1 625 3440 6 3 86 cycling
2 827 4096 5 3 102 cycling
5001 < all_results.distance < 10000
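For the asker's exact 5001-10000 band, a version based on the boolean-indexing approach above might look like this (a sketch, assuming all_results is already loaded as in the question):
# keep distances from 5001 to 10000 inclusive, then export as in the question
distance_2 = all_results[(all_results.distance >= 5001) & (all_results.distance <= 10000)]
distance_2.to_csv('./distance_2.csv', sep=',')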

seek a better design suggestion for a trial-and-error mechanism in python?

See the data matrix below, obtained from sensors; just INT numbers, nothing special.
A B C D E F G H I J K
1 25 0 25 66 41 47 40 12 69 76 1
2 17 23 73 97 99 39 84 26 0 44 45
3 34 15 55 4 77 2 96 92 22 18 71
4 85 4 71 99 66 42 28 41 27 39 75
5 65 27 28 95 82 56 23 44 97 42 38
…
10 95 13 4 10 50 78 4 52 51 86 20
11 71 12 32 9 2 41 41 23 31 70
12 54 31 68 78 55 19 56 99 67 34 94
13 47 68 79 66 10 23 67 42 16 11 96
14 25 12 88 45 71 87 53 21 96 34 41
The horizontal A to K are the sensor names, and the vertical rows are the data from the sensors, ordered by time.
Now I want to analyse this data with trial-and-error methods. I defined some concepts to explain what I want:
o source
source is all the raw data I get
o entry
an entry is one full set of A to K sensor values; take the first row for example, the entry is
25 0 25 66 41 47 40 12 69 76 1
o rules
a rule is a "suppose" function that returns an assertion value, so far just "true" or "false".
For example, I suppose the sensor A, E and F values will never be the same in one entry; if an entry has A=E=F, it will trigger a violation and this rule function will return false.
o range:
a range is a function for selecting vertical entries, for example, the first 5 entries
Then, the basic idea is:
o source + range = subsource(s)
o subsource + rules = violation(s)
Finally, I want to get a list that may look like this:
rangeID ruleID violation
1 1 Y
2 1 N
3 1 Y
1 2 N
2 2 N
3 2 Y
1 3 N
2 3 Y
3 3 Y
But the problem is that the rules and ranges I defined here will get very complicated very quickly if you look deeper; there are too many possible combinations. Take "A=E=F" for example: one can also define "B=E=F", "C=E=F", "C>F" ......
So soon I will need a rule/range generator that can accept those "core parameters" such as "A=E=F" as input, perhaps even as a regex string later. That has become complicated enough to defeat me, let alone the need to persist unique rule IDs, the data storage problem, the problem of rules nesting and combining with each other ......
So my questions are:
Does anyone know of a module/software suited to this kind of trial-and-error calculation or the rule definition I want?
Can anyone share a better rules/range design than the one I described?
Thanks for any hints.
Rgs,
KC
If I understand what you're asking correctly, I probably wouldn't even venture down the NumPy path, as I don't think, given your description, that it's really required. Here's a sample implementation of how I might go about solving the specific issue that you presented:
l = [
    {'a':25, 'b':0, 'c':25, 'd':66, 'e':41, 'f':47, 'g':40, 'h':12, 'i':69, 'j':76, 'k':1},
    {'a':25, 'b':0, 'c':25, 'd':66, 'e':41, 'f':47, 'g':40, 'h':12, 'i':69, 'j':76, 'k':1}
]
r = ['a=g=i', 'a=b', 'a=c']
res = []
# test all given rules
for n in range(0, len(r)):
    # i'm assuming equality here - you'd have to change this to accept other operators if needed
    c = r[n].split('=')
    vals = []
    # build up a list of values given our current rule
    for e in c:
        vals.append(l[0][e])
    # using len(set(vals)) gives us the number of distinct values
    res.append({'rangeID': 0, 'ruleID': n, 'violation': 'Y' if len(set(vals)) == 1 else 'N'})
print res
Output:
[{'violation': 'N', 'ruleID': 0, 'rangeID': 0}, {'violation': 'N', 'ruleID': 1, 'rangeID': 0}, {'violation': 'Y', 'ruleID': 2, 'rangeID': 0}]
http://ideone.com/zbTZr
There are a few assumptions made here (such as equality being the only operator in use in your rules) and some functionality left out (such as parsing your input into the list of dicts I used), but I'm hopeful that you can figure that out on your own.
Of course, there could be a Numpy-based solution that's simpler than this that I'm just not thinking of at the moment (it's late and I'm going to bed now ;)), but hopefully this helps you out anyway.
Edit:
Whoops, I missed something else (forgot to add it prior to posting): I only test the first element in l (the given range). You'd just want to stick that in another for loop rather than using that hard-coded 0 index.
You want to look at NumPy's matrix type for data structures like matrices; it exposes a set of functions for matrix manipulation.
As for the rule/range generator, I am afraid you will have to build your own domain-specific language to achieve that.
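One lightweight alternative, sketched below under the question's own conventions (rules return False on a violation, entries are dicts keyed 'A' to 'K'): represent each rule as a plain Python callable plus a label and each range as a slice, which postpones the need for string parsing or a full DSL.
# rules: id -> (label, check function); check returns False when the entry violates the rule
rules = {
    1: ('A=E=F', lambda e: not (e['A'] == e['E'] == e['F'])),
    2: ('C>F',   lambda e: not (e['C'] > e['F'])),
}
# ranges: id -> slice over the list of entries
ranges = {
    1: slice(0, 5),     # the first 5 entries
    2: slice(5, 10),
}

def evaluate(source, ranges, rules):
    # source is a list of dicts keyed 'A'..'K', one dict per entry (row)
    results = []
    for range_id in sorted(ranges):
        subsource = source[ranges[range_id]]
        for rule_id in sorted(rules):
            label, check = rules[rule_id]
            violated = any(not check(entry) for entry in subsource)
            results.append({'rangeID': range_id, 'ruleID': rule_id,
                            'violation': 'Y' if violated else 'N'})
    return results
New rules and ranges are then just new entries in the two dicts, and they combine freely without an ID or parsing scheme until the combinations really demand one.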
