Number of values lying in a specified range - python

I have a data frame like the one below:
NC_011163.1:1
NC_011163.1:22
NC_011163.1:44
NC_011163.1:65
NC_011163.1:73
NC_011163.1:87
NC_011163.1:104
NC_011163.1:130
NC_011163.1:151
NC_011163.1:172
NC_011163.1:194
NC_011163.1:210
NC_011163.1:235
NC_011163.1:255
NC_011163.1:295
NC_011163.1:320
NC_011163.1:445
NC_011163.1:520
I would like to scan the data frame using a window of 210 and count the number of values lying in each 210-wide window.
Desired output:
output: Values
NC_011163.1:1-210 12
NC_011163.1:211-420 4
NC_011163.1:421-630 2
I'd greatly appreciate your inputs to solve this problem.
Thanks
V

If you use Python and pandas, you can do the following, with your data in a DataFrame df:
NC N
0 NC_011163.1 1
1 NC_011163.1 22
2 NC_011163.1 44
3 NC_011163.1 65
4 NC_011163.1 73
5 NC_011163.1 87
6 NC_011163.1 104
7 NC_011163.1 130
8 NC_011163.1 151
9 NC_011163.1 172
10 NC_011163.1 194
11 NC_011163.1 210
12 NC_011163.1 235
13 NC_011163.1 255
14 NC_011163.1 295
15 NC_011163.1 320
16 NC_011163.1 445
17 NC_011163.1 520
df.groupby([df.NC, pd.cut(df.N, range(0,631,210))]).count()
N
NC N
NC_011163.1 (0, 210] 12
(210, 420] 4
(420, 630] 2
Where:
pd.cut(df.N, range(0, 631, 210)) returns the bin that each value in column N falls into. The bins are defined by the range, whose edges [0, 210, 420, 630] create 3 bins.
Then you group by:
- the NC identifier (so it matches your output; here it is redundant since there is only one group, but I guess you'll have other chromosomes, so the operation will be performed per chromosome)
- the bins you've just made
and count the number of elements in each group (a complete sketch follows below).
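Putting it together as a runnable sketch (the input file name positions.txt and the read/split step are my assumptions, since the question's data is a single chrom:pos column):
import pandas as pd

# Read the "NC_011163.1:22" style lines and split them into NC and N columns
raw = pd.read_csv("positions.txt", header=None, names=["raw"])
df = raw["raw"].str.split(":", expand=True)
df.columns = ["NC", "N"]
df["N"] = df["N"].astype(int)

# Bin the positions into 210-wide windows and count rows per chromosome and window
edges = range(0, df["N"].max() + 210, 210)
counts = df.groupby(["NC", pd.cut(df["N"], edges)]).size()
print(counts)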

$ cat tst.awk
BEGIN { FS=":"; OFS="\t"; endOfRange=210 }
{
    key = $1
    bucket = int((($2-1)/endOfRange)+1)
    cnt[bucket]++
    maxBucket = (bucket > maxBucket ? bucket : maxBucket)
}
END {
    for (bucket=1; bucket<=maxBucket; bucket++) {
        print key ":" ((bucket-1)*endOfRange)+1 "-" bucket*endOfRange, cnt[bucket]+0
    }
}
$ awk -f tst.awk file
NC_011163.1:1-210 12
NC_011163.1:211-420 4
NC_011163.1:421-630 2
Note that this will work even if you have some ranges with no values in your input data (it will print the range with a count of zero) and it will always print the ranges in numerical order (output order when using the in operator is "random"):
$ cat file
NC_011163.1:1
NC_011163.1:22
NC_011163.1:520
$ awk -f tst.awk file
NC_011163.1:1-210 2
NC_011163.1:211-420 0
NC_011163.1:421-630 1
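For comparison, a rough Python equivalent of the same bucketing idea (my own sketch, not part of the awk answer; the input file name is taken from the example above):
from collections import Counter

window = 210
counts = Counter()
key = None
with open("file") as fh:
    for line in fh:
        key, pos = line.rsplit(":", 1)
        bucket = (int(pos) - 1) // window + 1   # 1-based window index, same as the awk script
        counts[bucket] += 1

# Print every window up to the last occupied one, including empty ones, in order
for bucket in range(1, max(counts) + 1):
    start = (bucket - 1) * window + 1
    print("%s:%d-%d\t%d" % (key, start, bucket * window, counts[bucket]))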

awk -v t=210 'BEGIN{FS=":";t++}{++a[int($2/t)]}
END{for(x in a){printf "%s:%s\t%d\n",$1,t*x"-"(x+1)*t,a[x]}}' file
will give this output:
NC_011163.1:0-211 12
NC_011163.1:211-422 4
NC_011163.1:422-633 2
You don't need to work out the maximum value or how many sections/ranges the result will have; this command does that for you.
I think the code is easy to understand too; most of it is for the output format.

Related

How to replace a classic for loop with df.iterrows()?

I have a huge data frame.
I am using a for loop in the below sample code:
for i in range(1, len(df_A2C), 1):
    A2C_TT = df_A2C.loc[(df_A2C['TO_ID'] == i)].sort_values('DURATION_H').head(1)
    if A2C_TT.size > 0:
        print(A2C_TT)
This works fine, but I want to use df.iterrows() since it will help me automatically avoid empty-frame issues.
I want to iterate through TO_ID and look for the minimum values accordingly.
How should I replace my classic i loop counter with df.iterrows()?
Sample Data:
FROM_ID TO_ID DURATION_H DIST_KM
1 7 0.528555556 38.4398
2 26 0.512511111 37.38515
3 71 0.432452778 32.57571
4 83 0.599486111 39.26188
5 98 0.590516667 35.53107
6 108 1.077794444 76.79874
7 139 0.838972222 58.86963
8 146 1.185088889 76.39174
9 158 0.625872222 45.6373
10 208 0.500122222 31.85239
11 209 0.530916667 29.50249
12 221 0.945444444 62.69099
13 224 1.080883333 66.06291
14 240 0.734269444 48.1778
15 272 0.822875 57.5008
16 349 1.171163889 76.43536
17 350 1.080097222 71.16137
18 412 0.503583333 38.19685
19 416 1.144961111 74.35502
As far as I understand your question, you want to group your data by TO_ID and select the row where DURATION_H is the smallest? Is that right?
df.loc[df.groupby('TO_ID').DURATION_H.idxmin()]
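For example, on a small made-up frame (just for illustration, not the question's data), this keeps one row per TO_ID, the one with the smallest DURATION_H:
import pandas as pd

df = pd.DataFrame({
    "FROM_ID":    [1, 2, 3, 4],
    "TO_ID":      [7, 7, 26, 26],
    "DURATION_H": [0.53, 0.43, 0.51, 0.61],
    "DIST_KM":    [38.4, 25.5, 37.4, 40.1],
})

# idxmin returns, for each TO_ID group, the index label of the row with the minimum DURATION_H
fastest = df.loc[df.groupby("TO_ID").DURATION_H.idxmin()]
print(fastest)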
Here is one way about it:
import numpy as np

# run the loop over the unique TO_ID values,
# instead of iterrows, which runs over every row of the DataFrame
for idx in np.unique(df['TO_ID']):
    A2C_TT = df.loc[df['TO_ID'] == idx].sort_values('DURATION_H').head(1)
    print(A2C_TT)
        FROM_ID  TO_ID  DURATION_H   DIST_KM
498660       39      7    0.434833  25.53808
Here is another way about it:
df.loc[df['DURATION_H'].eq(df.groupby('TO_ID')['DURATION_H'].transform(min))]
        FROM_ID  TO_ID  DURATION_H   DIST_KM
498660       39      7    0.434833  25.53808

select specific lines with specific condition from a large text file with python

I have a text file which has 5 columns and more than 2 million lines as follows,
134 1 2 6 3.45388e-06
135 1 2 7 3.04626e-06
136 1 2 8 4.69364e-06
137 1 2 9 4.21627e-06
138 1 2 10 2.38822e-06
139 1 2 11 1.91216e-06
...
140 1 3 2 5.23876e-06
141 1 3 3 2.83415e-06
142 1 3 7 2.32097e-06
143 1 3 9 6.26283e-06
144 1 3 16 4.22556e-06
...
145 2 1 2 3.67182e-06
146 2 1 4 4.61481e-06
147 2 1 6 1.1703e-06
...
148 2 2 7 4.61242e-06
149 2 2 21 1.84259e-06
150 2 2 22 4.31435e-06
...
151 2 3 23 4.34518e-06
152 2 3 24 3.76576e-06
153 2 3 25 2.61061e-06
...
154 3 1 2 4.07107e-06
155 3 1 7 4.83971e-06
156 3 1 8 3.43919e-06
...
157 3 2 29 6.27991e-06
158 3 2 30 7.44213e-06
159 3 2 31 9.56985e-06
...
160 3 3 32 1.38377e-05
161 3 3 33 1.62724e-05
162 3 3 34 9.85653e-06
...
The second column shows the number of each layer, and within each layer I have a matrix: columns 3 and 4 are the row and column numbers in that layer, and the last column is my data.
First I want to apply a condition: if the second column is equal to a number (for example 3), write all the matching lines to another file. I am doing that with this code:
from itertools import islice
import numpy as np

with open('C:\ Python \ projects\A.txt') as f, open('D:\ Python \New_ projects\NEW.txt', 'w') as f2:
    # read the data from layer 120 and write these data into the NEW.txt file
    for line in islice(f, 393084, 409467):
        f2.write(line)
a = np.loadtxt('D:\ Python \New_ projects\NEW.txt')
The problem is that for each layer I have to go into the file, find its first and last line numbers, and put them into islice, which takes very long because the file is so big.
What can I do so that I can simply say: for column[2] == 4, save the lines to a new text file?
After that I need another condition: for column[2] == 4, if 20 <= column[3] <= 50 and
80 <= column[4] <= 120, save these lines to another file.
Use grep to create another file with the filtered data; grep is fast enough to work on large files.
As mentioned, grep can be a fast non-Python solution, using a regex, for example:
grep -E '^[0-9]+\s+[0-9]+\s+2\s' testgrep.txt > output.txt
which saves to the file output.txt all the lines with a 2 in the third column. See https://regex101.com/r/j5EfEE/1 for details about the pattern.
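If you prefer to stay in Python, a minimal sketch that streams the file once and applies both of the question's conditions (the file names and the layer value 4 are my placeholders) could look like this:
layer = 4

with open("A.txt") as src, open("NEW.txt", "w") as layer_out, open("NEW_window.txt", "w") as window_out:
    for line in src:
        cols = line.split()
        if len(cols) < 5 or int(cols[1]) != layer:
            continue
        layer_out.write(line)    # condition 1: second column equals the chosen layer
        # condition 2: row/column window inside that layer
        if 20 <= int(cols[2]) <= 50 and 80 <= int(cols[3]) <= 120:
            window_out.write(line)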

Python assign some part of the list to another list

I have a dataset like below:
In this dataset the first column represents the id of a person, the last column is the label of this person, and the rest of the columns are the person's features.
101 166 633.0999756 557.5 71.80000305 60.40000153 2.799999952 1 1 -1
101 133 636.2000122 504.3999939 71 56.5 2.799999952 1 2 -1
105 465 663.5 493.7000122 82.80000305 66.40000153 3.299999952 10 3 -1
105 133 635.5999756 495.6000061 89 72 3.599999905 9 6 -1
105 266 633.9000244 582.2000122 93.59999847 81 3.700000048 2 2 -1
105 299 618.4000244 552.4000244 80.19999695 66.59999847 3.200000048 3 64 -1
105 99 615.5999756 575.7000122 80 67 3.200000048 0 0 -1
120 399 617.7000122 583.5 95.80000305 82.40000153 3.799999952 8 10 1
120 266 633.9000244 582.2000122 93.59999847 81 3.700000048 2 2 1
120 299 618.4000244 552.4000244 80.19999695 66.59999847 3.200000048 3 64 1
120 99 615.5999756 575.7000122 80 67 3.200000048 0 0 1
My aim is to classify these people, and I want to use a leave-one-person-out split. So I need to choose one person and all of his data as the test data, and the rest of the data for training. But when I try to choose the test data with a list assignment operation, it gives an error. This is my code:
import numpy as np

datasets = ["raw_fixationData.txt"]
file_name_array = [101, 105, 120]
for data in datasets:
    data = np.genfromtxt(data, delimiter="\t")
    data = data[1:, :]
    num_line = len(data[:, 1]) - 1
    num_feat = len(data[1, :]) - 2
    label = num_feat + 1
    X = data[0:num_line + 1, 1:label]
    y = data[0:num_line + 1, label]
    test_prtcpnt = []; test_prtcpnt_label = []; train_prtcpnt = []; train_prtcpnt_label = []
    for i in range(len(file_name_array)):
        m = 0  # test index
        n = 0  # train index
        for j in range(num_line):
            if X[j, 0] == file_name_array[i]:
                test_prtcpnt[m, 0:10] = X[j, 0:10]
                test_prtcpnt_label[m] = y[j]
                m = m + 1
            else:
                train_prtcpnt[n, 0:10] = X[j, 0:10]
                train_prtcpnt_label[n] = y[j]
                n = n + 1
This code gives me this error: test_prtcpnt[m,0:10]=X[j,0:10]; TypeError: list indices must be integers or slices, not tuple
How could I solve this problem?
I think that you are misusing Python's slice notation. Please refer to the following stack overflow post on slicing:
Explain Python's slice notation
In this case, the Python interpreter is interpreting the index in test_prtcpnt[m, 0:10] as the tuple (m, slice(0, 10)), which plain Python lists do not support. Is it possible that you meant to say the following:
test_prtcpnt[0:10]=X[0:10]
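Since X and y in the question are NumPy arrays, another option (a sketch of boolean masking, not what the quoted code does) avoids pre-sized lists entirely; a tiny stand-in array is used here for illustration:
import numpy as np

# Tiny stand-in for the question's data: column 0 is the person id, y holds the labels
X = np.array([[101, 166, 633.1],
              [101, 133, 636.2],
              [105, 465, 663.5],
              [120, 399, 617.7]])
y = np.array([-1, -1, -1, 1])

for person in (101, 105, 120):
    mask = X[:, 0] == person
    test_X, test_y = X[mask], y[mask]        # all rows of the held-out person
    train_X, train_y = X[~mask], y[~mask]    # everyone else
    print(person, test_X.shape, train_X.shape)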

Filtering records in Pandas python - syntax error

I have a pandas data frame that looks like this:
duration distance speed hincome fi_cost type
0 359 1601 4 3 40.00 cycling
1 625 3440 6 3 86.00 cycling
2 827 4096 5 3 102.00 cycling
3 1144 5704 5 2 143.00 cycling
If I use the following I export a new csv that pulls only those records with a distance less than 5000.
distance_1 = all_results[all_results.distance < 5000]
distance_1.to_csv('./distance_1.csv',",")
Now, I wish to export a csv with values from 5001 to 10000. I can't seem to get the syntax right...
distance_2 = all_results[10000 > all_results.distance < 5001]
distance_2.to_csv('./distance_2.csv',",")
Unfortunately, because of how Python chained comparisons work, we can't use the 50 < x < 100 syntax when x is a vector-like quantity. You have several options.
You could create two boolean Series and use & to combine them:
>>> all_results[(all_results.distance > 3000) & (all_results.distance < 5000)]
duration distance speed hincome fi_cost type
1 625 3440 6 3 86 cycling
2 827 4096 5 3 102 cycling
Use between to create a boolean Series and then use that to index (note that it's inclusive by default, though):
>>> all_results[all_results.distance.between(3000, 5000)] # inclusive by default
duration distance speed hincome fi_cost type
1 625 3440 6 3 86 cycling
2 827 4096 5 3 102 cycling
Or finally you could use .query:
>>> all_results.query("3000 < distance < 5000")
duration distance speed hincome fi_cost type
1 625 3440 6 3 86 cycling
2 827 4096 5 3 102 cycling
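For the exact 5001 to 10000 range asked about in the question, the same pattern works with inclusive bounds (a small sketch reusing all_results from above):
>>> distance_2 = all_results[(all_results.distance >= 5001) & (all_results.distance <= 10000)]
>>> distance_2.to_csv('./distance_2.csv')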
5001 < all_results.distance < 10000

Loading Large File in Python

I'm using Python 2.6.2 [GCC 4.3.3] running on Ubuntu 9.04. I need to read a big datafile (~1GB, >3 million lines), line by line, using a Python script.
I tried the methods below; I find they use a very large amount of memory (~3GB):
for line in open('datafile','r').readlines():
    process(line)
or,
for line in file(datafile):
    process(line)
Is there a better way to load a large file line by line, say
a) by explicitly limiting the maximum number of lines the file can hold in memory at any one time? Or
b) by loading it in chunks of some size, say 1024 bytes, provided the last line of each chunk loads completely without being truncated?
Several suggestions point to the methods I mentioned above and have already tried; I'm trying to see if there is a better way to handle this. My search has not been fruitful so far. I appreciate your help.
p/s I have done some memory profiling using Heapy and found no memory leaks in the Python code I am using.
Update 20 August 2012, 16:41 (GMT+1)
Tried both approaches as suggested by J.F. Sebastian, mgilson and IamChuckB (datafile is a variable):
with open(datafile) as f:
    for line in f:
        process(line)
Also,
import fileinput
for line in fileinput.input([datafile]):
    process(line)
Strangely, both of them use ~3GB of memory; my datafile in this test is 765.2MB and consists of 21,181,079 lines. I see the memory grow over time (in roughly 40-80MB steps) before stabilizing at 3GB.
An elementary doubt: is it necessary to flush each line after use?
I did memory profiling using Heapy to understand this better.
Level 1 Profiling
Partition of a set of 36043 objects. Total size = 5307704 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 15934 44 1301016 25 1301016 25 str
1 50 0 628400 12 1929416 36 dict of __main__.NodeStatistics
2 7584 21 620936 12 2550352 48 tuple
3 781 2 590776 11 3141128 59 dict (no owner)
4 90 0 278640 5 3419768 64 dict of module
5 2132 6 255840 5 3675608 69 types.CodeType
6 2059 6 247080 5 3922688 74 function
7 1716 5 245408 5 4168096 79 list
8 244 1 218512 4 4386608 83 type
9 224 1 213632 4 4600240 87 dict of type
<104 more rows. Type e.g. '_.more' to view.>
============================================================
Level 2 Profiling for Level 1-Index 0
Partition of a set of 15934 objects. Total size = 1301016 bytes.
Index Count % Size % Cumulative % Referred Via:
0 2132 13 274232 21 274232 21 '.co_code'
1 2132 13 189832 15 464064 36 '.co_filename'
2 2024 13 114120 9 578184 44 '.co_lnotab'
3 247 2 110672 9 688856 53 "['__doc__']"
4 347 2 92456 7 781312 60 '.func_doc', '[0]'
5 448 3 27152 2 808464 62 '[1]'
6 260 2 15040 1 823504 63 '[2]'
7 201 1 11696 1 835200 64 '[3]'
8 188 1 11080 1 846280 65 '[0]'
9 157 1 8904 1 855184 66 '[4]'
<4717 more rows. Type e.g. '_.more' to view.>
Level 2 Profiling for Level 1-Index 1
Partition of a set of 50 objects. Total size = 628400 bytes.
Index Count % Size % Cumulative % Referred Via:
0 50 100 628400 100 628400 100 '.__dict__'
Level 2 Profiling for Level 1-Index 2
Partition of a set of 7584 objects. Total size = 620936 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1995 26 188160 30 188160 30 '.co_names'
1 2096 28 171072 28 359232 58 '.co_varnames'
2 2078 27 157608 25 516840 83 '.co_consts'
3 261 3 21616 3 538456 87 '.__mro__'
4 331 4 21488 3 559944 90 '.__bases__'
5 296 4 20216 3 580160 93 '.func_defaults'
6 55 1 3952 1 584112 94 '.co_freevars'
7 47 1 3456 1 587568 95 '.co_cellvars'
8 35 0 2560 0 590128 95 '[0]'
9 27 0 1952 0 592080 95 '.keys()[0]'
<189 more rows. Type e.g. '_.more' to view.>
Level 2 Profiling for Level 1-Index 3
Partition of a set of 781 objects. Total size = 590776 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1 0 98584 17 98584 17 "['locale_alias']"
1 29 4 35768 6 134352 23 '[180]'
2 28 4 34720 6 169072 29 '[90]'
3 30 4 34512 6 203584 34 '[270]'
4 27 3 33672 6 237256 40 '[0]'
5 25 3 26968 5 264224 45 "['data']"
6 1 0 24856 4 289080 49 "['windows_locale']"
7 64 8 20224 3 309304 52 "['inters']"
8 64 8 17920 3 327224 55 "['galog']"
9 64 8 17920 3 345144 58 "['salog']"
<84 more rows. Type e.g. '_.more' to view.>
============================================================
Level 3 Profiling for Level 2-Index 0, Level 1-Index 0
Partition of a set of 2132 objects. Total size = 274232 bytes.
Index Count % Size % Cumulative % Referred Via:
0 2132 100 274232 100 274232 100 '.co_code'
Level 3 Profiling for Level 2-Index 0, Level 1-Index 1
Partition of a set of 50 objects. Total size = 628400 bytes.
Index Count % Size % Cumulative % Referred Via:
0 50 100 628400 100 628400 100 '.__dict__'
Level 3 Profiling for Level 2-Index 0, Level 1-Index 2
Partition of a set of 1995 objects. Total size = 188160 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1995 100 188160 100 188160 100 '.co_names'
Level 3 Profiling for Level 2-Index 0, Level 1-Index 3
Partition of a set of 1 object. Total size = 98584 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1 100 98584 100 98584 100 "['locale_alias']"
Still troubleshooting this.
Do share with me if you have faced this before.
Thanks for your help.
Update 21 August 2012, 01:55 (GMT+1)
mgilson, the process function is used to post-process a Network Simulator 2 (NS2) tracefile. Some of the lines in the tracefile are shared below. I am using numerous objects, counters, tuples, and dictionaries in the Python script to learn how a wireless network performs.
s 1.231932886 _25_ AGT --- 0 exp 10 [0 0 0 0 Y Y] ------- [25:0 0:0 32 0 0]
s 1.232087886 _25_ MAC --- 0 ARP 86 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776108 _42_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776625 _34_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776633 _9_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776658 _0_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232856942 _35_ MAC --- 0 ARP 28 [0 ffffffff 64 806 Y Y] ------- [REQUEST 100/25 0/0]
s 1.232871658 _0_ MAC --- 0 ARP 86 [13a 67 1 806 Y Y] ------- [REPLY 1/0 103/25]
r 1.233096712 _29_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097047 _4_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097050 _26_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097051 _1_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233109522 _25_ MAC --- 0 ARP 28 [13a 67 1 806 Y Y] ------- [REPLY 1/0 103/25]
s 1.233119522 _25_ MAC --- 0 ACK 38 [0 1 67 0 Y Y]
r 1.233236204 _17_ MAC --- 0 ARP 28 [0 ffffffff 65 806 Y Y] ------- [REQUEST 101/25 0/0]
r 1.233236463 _20_ MAC --- 0 ARP 28 [0 ffffffff 65 806 Y Y] ------- [REQUEST 101/25 0/0]
D 1.233236694 _18_ MAC COL 0 ARP 86 [0 ffffffff 65 806 67 1] ------- [REQUEST 101/25 0/0]
The aim of doing 3-level profiling with Heapy is to help me narrow down which object(s) are eating up most of the memory. As you can see, unfortunately I could not tell which one specifically needs tweaking, as the output is too generic. For example, I know that "dict of __main__.NodeStatistics" has only 50 objects out of 36043 (0.1%), yet it takes up 12% of the total memory used to run the script; still, I am unable to find which specific dictionary I would need to look into.
I tried implementing David Eyk's suggestion below (snippet), trying to manually garbage-collect every 500,000 lines:
import gc
for i, line in enumerate(file(datafile)):
    if (i % 500000 == 0):
        print '-----------This is line number', i
        collected = gc.collect()
        print "Garbage collector: collected %d objects." % (collected)
Unfortunately, the memory usage is still at 3GB and the output (snippet) is as below,
-----------This is line number 0
Garbage collector: collected 0 objects.
-----------This is line number 500000
Garbage collector: collected 0 objects.
Having implemented martineau's suggestion, I see the memory usage is now 22MB, down from the earlier 3GB! That is something I had been looking forward to achieving. The strange thing is the following.
I did the same memory profiling as before:
Level 1 Profiling
Partition of a set of 35474 objects. Total size = 5273376 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 15889 45 1283640 24 1283640 24 str
1 50 0 628400 12 1912040 36 dict of __main__.NodeStatistics
2 7559 21 617496 12 2529536 48 tuple
3 781 2 589240 11 3118776 59 dict (no owner)
4 90 0 278640 5 3397416 64 dict of module
5 2132 6 255840 5 3653256 69 types.CodeType
6 2059 6 247080 5 3900336 74 function
7 1716 5 245408 5 4145744 79 list
8 244 1 218512 4 4364256 83 type
9 224 1 213632 4 4577888 87 dict of type
<104 more rows. Type e.g. '_.more' to view.>
Comparing the previous memory profiling output with the above: str is down by 45 objects (17376 bytes), tuple is down by 25 objects (3440 bytes), and dict (no owner), though unchanged in object count, has shrunk by 1536 bytes. All other objects are the same, including dict of __main__.NodeStatistics. The total number of objects is 35474. The small reduction in objects (0.2%) produced a 99.3% memory saving (22MB, down from 3GB). Very strange.
As you can see, although I know where the memory starvation is occurring, I am not yet able to narrow down what is causing the bleed.
Will continue to investigate this.
Thanks for all the pointers; I am using this opportunity to learn a lot about Python, as I am not an expert. I appreciate the time you have taken to assist me.
Update 23 August 2012, 00:01 (GMT+1) -- SOLVED
I continued debugging using the minimalistic code per martineau's suggestion. I began to add code to the process function and observe the memory bleeding.
I found that the memory starts to bleed when I add a class like the one below:
class PacketStatistics(object):
    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0
        ...
I am using 3 classes with 136 counters.
I discussed this issue with my friend Gustavo Carneiro; he suggested using __slots__ in place of __dict__.
I converted the class as below:
class PacketStatistics(object):
    __slots__ = ('event_id', 'event_source', 'event_dest', ...)
    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0
        ...
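As a quick illustration of why this helps (a minimal sketch of my own, not from the original post): a __slots__ class has no per-instance __dict__, so each attribute is stored in a fixed slot rather than a hash-table entry:
import sys

class WithDict(object):
    def __init__(self):
        self.a = 0

class WithSlots(object):
    __slots__ = ('a',)
    def __init__(self):
        self.a = 0

d = WithDict()
s = WithSlots()
print(hasattr(d, '__dict__'))       # True: every instance carries its own dict
print(hasattr(s, '__dict__'))       # False: attributes live in fixed slots
print(sys.getsizeof(d.__dict__))    # per-instance overhead that __slots__ removes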
When I converted all 3 classes, the memory usage dropped from 3GB to 504MB. A whopping 80% saving in memory usage!!
Below is the memory profiling after the dict to __slots__ conversion.
Partition of a set of 36157 objects. Total size = 4758960 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 15966 44 1304424 27 1304424 27 str
1 7592 21 624776 13 1929200 41 tuple
2 780 2 587424 12 2516624 53 dict (no owner)
3 90 0 278640 6 2795264 59 dict of module
4 2132 6 255840 5 3051104 64 types.CodeType
5 2059 6 247080 5 3298184 69 function
6 1715 5 245336 5 3543520 74 list
7 225 1 232344 5 3775864 79 dict of type
8 244 1 223952 5 3999816 84 type
9 166 0 190096 4 4189912 88 dict of class
<101 more rows. Type e.g. '_.more' to view.>
The dict of __main__.NodeStatistics is not in the top 10 anymore.
I am happy with the result and glad to close this issue.
Thanks for all your guidance. Truly appreciate it.
rgds
Saravanan K
with open('datafile') as f:
    for line in f:
        process(line)
This works because files are iterators yielding 1 line at a time until there are no more lines to yield.
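If you do want the explicit chunked reading from option (b) of the question, a minimal sketch (my own illustration, assuming newline-terminated text; process is the question's function) is to read fixed-size blocks and carry the trailing partial line over to the next block:
def iter_lines_in_chunks(path, chunk_size=64 * 1024):
    leftover = ''
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines = (leftover + chunk).split('\n')
            leftover = lines.pop()      # possibly incomplete last line of this chunk
            for line in lines:
                yield line
    if leftover:
        yield leftover                  # final line if the file has no trailing newline

for line in iter_lines_in_chunks('datafile'):
    process(line)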
The fileinput module will let you read it line by line without loading the entire file into memory (see the pydocs).
import fileinput
for line in fileinput.input(['myfile']):
    do_something(line)
Code example taken from yak.net
mgilson's answer is correct. The simple solution bears official mention, though (HerrKaputt mentioned this in a comment):
file = open('datafile')
for line in file:
    process(line)
file.close()
This is simple, Pythonic, and understandable. If you don't understand how with works, just use this.
As the other poster mentioned, this does not create a large list the way file.readlines() does. Rather, it pulls one line at a time, in the way that is traditional for Unix files/pipes.
If the file is JSON, XML, CSV, genomics or any other well-known format, there are specialized readers which use C code directly and are far more optimized for both speed and memory than parsing in native Python - avoid parsing it natively whenever possible.
But in general, tips from my experience:
- Python's multiprocessing package is fantastic for managing subprocesses; all memory leaks go away when the subprocess ends.
- Run the reader subprocess as a multiprocessing.Process and use a multiprocessing.Pipe(duplex=True) to communicate (send the filename and any other args, then read its stdout). A sketch of this pattern is shown below.
- Read in small (but not tiny) chunks, say 64KB-1MB. This is better for memory usage, and also for responsiveness with respect to other running processes/subprocesses.
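A minimal sketch of that subprocess pattern (the reader function, the line count it returns, and the file name are placeholders of my own, not from the original answer):
import multiprocessing

def reader(conn, filename):
    # Child process: parse the file, send a small result back, then exit.
    # Whatever memory it used is released when the process ends.
    count = 0
    with open(filename) as f:
        for line in f:
            count += 1
    conn.send(count)
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = multiprocessing.Pipe(duplex=True)
    p = multiprocessing.Process(target=reader, args=(child_conn, 'datafile'))
    p.start()
    print(parent_conn.recv())   # result produced in the subprocess
    p.join()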
