Optimise code to improve performance and reduce execution time - Python

I have perfectly working code, but when I run it on a large CSV file (around 2 GB) it takes about 15-20 minutes to complete. Is there a way I could optimise my code below so that it takes less time to finish execution and thus improves performance?
from csv import reader, writer
import pandas as pd

path = r"data.csv"
data = pd.read_csv(path, header=None)
last_column = data.iloc[:, -1]
arr = [i+1 for i in range(len(last_column)-1) if (last_column[i] == 1 and last_column[i+1] == 0)]

ch_0_6 = []
ch_7_14 = []
ch_16_22 = []

with open(path, 'r') as read_obj:
    csv_reader = reader(read_obj)
    rows = list(csv_reader)
    for j in arr:
        # Channel 1-7
        ch_0_6_init = [int(rows[j][k]) for k in range(1, 8)]
        bin_num = ''.join([str(x) for x in ch_0_6_init])
        dec_num = int(f'{bin_num}', 2)
        ch_0_6.append(dec_num)
        ch_0_6_init = []
        # Channel 8-15
        ch_7_14_init = [int(rows[j][k]) for k in range(8, 16)]
        bin_num = ''.join([str(x) for x in ch_7_14_init])
        dec_num = int(f'{bin_num}', 2)
        ch_7_14.append(dec_num)
        ch_7_14_init = []
        # Channel 16-22
        ch_16_22_init = [int(rows[j][k]) for k in range(16, 23)]
        bin_num = ''.join([str(x) for x in ch_16_22_init])
        dec_num = int(f'{bin_num}', 2)
        ch_16_22.append(dec_num)
        ch_16_22_init = []
Sample Data:
0.0114,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,0,0,0,1
0.0112,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,0,0,0,0
0.0115,0,1,0,1,1,1,0,1,0,0,1,0,0,0,1,1,1,0,1,0,0,0,1
0.0117,0,1,0,1,1,1,0,1,0,0,1,0,0,0,1,1,1,0,1,0,0,0,0
0.0118,0,1,0,0,1,1,0,0,0,1,0,1,0,0,1,1,1,0,1,0,0,0,1
The binary digits are joined to form a decimal number depending upon the channels chosen. For example, the second sample row follows a 1-to-0 transition in the last column, and its channels 1-7 are 0,1,0,0,0,0,0; joined they give the binary string '0100000', i.e. decimal 32.

Using just the csv module, you could try the following type of approach:
from csv import reader

ch_0_6 = []
ch_7_14 = []
ch_16_22 = []

with open('data.csv', 'r') as f_input:
    csv_input = reader(f_input)
    last_row = ['0']
    for row in csv_input:
        if last_row[-1] == '1' and row[-1] == '0':
            ch_0_6.append(int(''.join(row[1:8]), 2))
            ch_7_14.append(int(''.join(row[8:16]), 2))
            ch_16_22.append(int(''.join(row[16:23]), 2))
        last_row = row

print(ch_0_6)
print(ch_7_14)
print(ch_16_22)
For your example data this would display:
[32, 46]
[1, 145]
[104, 104]
As noted, your original approach reads the whole file into memory twice: once with pandas and once with the csv module, with the first pass used only to determine which rows to parse. That can be done while reading, by keeping track of the previous row in the loop; this alone should give a significant speed-up.
The conversion from binary list elements into decimal values is also a bit more efficient.
This approach will also work on much larger files.
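If pandas stays in the picture, both the row selection and the binary-to-decimal conversion can be vectorised as well. This is only a minimal sketch, untested on a real 2 GB file, and it assumes the data columns hold plain 0/1 integers and that the whole file fits in memory:

import numpy as np
import pandas as pd

df = pd.read_csv('data.csv', header=None)

last = df.iloc[:, -1].to_numpy()
# rows immediately after a 1 -> 0 transition in the last column
idx = np.where((last[:-1] == 1) & (last[1:] == 0))[0] + 1
sel = df.iloc[idx]

def bits_to_dec(frame, cols):
    # treat the selected 0/1 columns as one binary number per row
    bits = frame.iloc[:, cols].to_numpy()
    weights = 2 ** np.arange(bits.shape[1] - 1, -1, -1)
    return bits @ weights

ch_0_6 = bits_to_dec(sel, list(range(1, 8)))
ch_7_14 = bits_to_dec(sel, list(range(8, 16)))
ch_16_22 = bits_to_dec(sel, list(range(16, 23)))

On the sample data this reproduces the same three lists as above.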

Related

Matching multiple array value to row in csv file slow

I have a numpy array consisting of about 1200 arrays containing 10 values each, i.e. np.shape gives (1200, 10). Each element has a value between 0 and 5.7 million.
Next I have a .csv file with 3800 lines. Every line contains 2 values: the first value indicates a range, the second value is an identifier. The first and last 5 rows of the .csv file:
509,47222
1425,47220
2404,47219
4033,47218
6897,47202
...,...
...,...
...,...
5793850,211
5794901,186
5795820,181
5796176,43
5796467,33
The first column goes up until it reaches 5.7 million. For each value in the numpy array I want to check the first column of the .csv file. If I have, for example, the value 3333, the identifier belonging to 3333 is 47218: each row means that from the first column of the previous row up to the first column of this row, e.g. 2404 - 4033, the identifier is 47218.
Now I want to get the identifier for each value in the numpy array, then save each identifier together with the frequency with which it is found in the numpy array. That means looping over a csv file of 3800 lines once for each of the 12000 values, and incrementing an integer each time. This process takes about 30 seconds, which is way too long.
This is the code I am currently using:
import csv
import numpy as np

numpy_file = np.fromfile(filename, dtype=np.int32)
# some code to format numpy_file correctly

with open('/identifer_file.csv') as read_file:
    csv_reader = csv.reader(read_file, delimiter=',')
    csv_reader = list(csv_reader)

identifier_dict = {}
for numpy_array in numpy_file:
    for numpy_value in numpy_array:
        # there are 12000 numpy_values in numpy_file
        for row in csv_reader:
            last_identifier = 0
            if numpy_value <= int(row[0]):
                last_identifier = int(row[1])
                # adding the frequency of the identifier in numpy_file to a dict
                if last_identifier in identifier_dict:
                    identifier_dict[last_identifier] += 1
                else:
                    identifier_dict[last_identifier] = 1
            else:
                continue
            break

for x, y in identifier_dict.items():
    if y > 40:
        print("identifier: {} amount of times found: {}".format(x, y))
What algorithm should I implement to speed up this process?
Edit
I have tried folding the numpy array to a 1D array, so it has 12000 values. This has no real effect on the speed; the latest test took 33 seconds.
Setup:
import io
import csv
import collections
import numpy as np

np.random.seed(100)
numpy_file = np.random.randint(0, 5700000, (1200, 10))

# columns: range, identifier
read_file = io.StringIO('''509,47222
1425,47220
2404,47219
4033,47218
6897,47202
5793850,211
5794901,186
5795820,181
5796176,43
5796467,33''')
csv_reader = csv.reader(read_file, delimiter=',')
csv_reader = list(csv_reader)
# your example code, put in a function and adapted for the setup above
def original(numpy_file, csv_reader):
    identifier_dict = {}
    for numpy_array in numpy_file:
        for numpy_value in numpy_array:
            # there are 12000 numpy_values in numpy_file
            for row in csv_reader:
                last_identifier = 0
                if numpy_value <= int(row[0]):
                    last_identifier = int(row[1])
                    # adding the frequency of the identifier in numpy_file to a dict
                    if last_identifier in identifier_dict:
                        identifier_dict[last_identifier] += 1
                    else:
                        identifier_dict[last_identifier] = 1
                else:
                    continue
                break
    # for x, y in identifier_dict.items():
    #     if y > 40:
    #         print("identifier: {} amount of times found: {}".format(x, y))
    return identifier_dict
Three solutions, each vectorizing some of the operations. The first function consumes the least memory; the last consumes the most.
def first(numpy_file, r):
    '''compare each value in the array to the entire first column of the csv'''
    alternate = collections.defaultdict(int)
    for value in np.nditer(numpy_file):
        comparison = value < r[:, 0]
        identifier = r[:, 1][comparison.argmax()]
        alternate[identifier] += 1
    return alternate

def second(numpy_file, r):
    '''compare each row of the array to the first column of the csv'''
    alternate = collections.defaultdict(int)
    for row in numpy_file:
        comparison = row[..., None] < r[:, 0]
        indices = comparison.argmax(-1)
        id_s = r[:, 1][indices]
        for thing in id_s:
            # adding the frequency of the identifier in numpy_file to a dict
            alternate[thing] += 1
    return alternate

def third(numpy_file, r):
    '''compare the whole array to the first column of the csv'''
    comparison = numpy_file[..., None] < r[:, 0]
    indices = comparison.argmax(-1)
    id_s = r[:, 1][indices]
    return collections.Counter(map(int, np.nditer(id_s)))
The functions require the csv file to be read into a numpy array:

zero = original(numpy_file, csv_reader)  # baseline from the question's code

read_file.seek(0)  # io.StringIO object from the setup
csv_reader = csv.reader(read_file, delimiter=',')
r = np.array([list(map(int, thing)) for thing in csv_reader])

one = first(numpy_file, r)
two = second(numpy_file, r)
three = third(numpy_file, r)

assert zero == one
assert zero == two
assert zero == three
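Since the first column of the csv is sorted, a fourth option (not part of the answer above, but worth sketching) is to replace the comparisons entirely with a binary search via np.searchsorted. Here side='right' matches the strict value < boundary comparison used in first, second and third; values beyond the last boundary would need clipping first:

def fourth(numpy_file, r):
    '''vectorised range lookup: binary-search every value against the sorted boundaries'''
    boundaries = r[:, 0]
    identifiers = r[:, 1]
    # index of the first boundary strictly greater than each value
    indices = np.searchsorted(boundaries, numpy_file.ravel(), side='right')
    return collections.Counter(map(int, identifiers[indices]))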

python: multiple .dat's in multiple arrays

I'm trying to sort some data into (np.)arrays and I'm stuck on a problem.
I have 1000 .dat files and I need to put the data from them into 1000 different arrays. Further, every array should contain data that depends on the coordinates [i][j][k] (this part I've done already, and the code looks like this - a kind of "short" version):
with open('177500.dat', newline='') as csvfile:
    f = csv.reader(csvfile, delimiter=' ')
    for row in f:
        # <some code which works pretty well>

cV = [[[[] for k in range(kMax)] for j in range(jMax)] for i in range(iMax)]

with open('177500.dat', newline='') as csvfile:
    f = csv.reader(csvfile, delimiter=' ')
    # <some code which also works well>
    values = np.array([np.float64(row[i]) for i in range(3, rowLen)])
    cV[int(row[0])][int(row[1])][int(row[2])] = values
After this, I can print cV[i][j][k] and I get all the data contained in one .dat file at the coordinates [i][j][k].
And now I need to create cV[i][j][k][n] to get the data from the specific file number n at the coordinates [i][j][k]. And I absolutely don't know how I can tell Python to put the data into the "right" place.
I tried some things like this:
for m in range(160000, 182501, 2500):
    with open('output/%d.dat' % m, newline='') as csvfile:
        # <bla bla code>
        cV = [[[[[] for k in range(kMax)] for j in range(jMax)] for i in range(iMax)] for n in range(tMax)]
        if len(row) == rowLen:
            values = [np.array([np.float64(row[i]) for i in range(3, rowLen)]) for n in range(tMax)]
            for n in range(tMax):
                cV[int(row[0])][int(row[1])][int(row[2])][int(n)] = values[n]
But this surely didn't work, because Python doesn't know what this [n] after the values is supposed to be.
So, how can I tell Python to put the [i][j][k] data from file nr. n into the array cV[i][j][k][n]?
Thanks in advance
C.
P.S. I didn't post the whole code because I don't think it is necessary. All the arrays are created properly, but what isn't working is the data in them.
I think building arrays like this is going to make things more complicated for you. It would be easier to build a dictionary using tuples as keys. In the example file you sent me, each (x, y, z) pair was repeated twice, making me think that each file contains data on two iterations of a total solution of 2000 iterations. Dictionaries must have unique keys, so for each file I have implemented another counter, timestep, that can increment when collating data from a single file.
Now, if I wanted coords (1, 2, 3) on the 3rd timestep, I could do simulation[(1, 2, 3, 3)].
import csv
import numpy as np

'''
Made the assumptions that:
- Each file contains two iterations from a simulation of 2000 iterations.
- Each file is numbered sequentially; each time the same (x, y, z) coords are
  discovered, it represents the next timestep in the simulation.
Accessing data is via a tuple key (x, y, z, n), with n being the timestep.
'''

simulation = {}
file_count = 1
timestep = 1
num_files = 2

for x in range(1, num_files + 1):
    with open('sim_file_{}.dat'.format(file_count), 'r') as infile:
        second_read = False
        reader = csv.reader(infile, delimiter=' ')
        for row in reader:
            item = [float(x) for x in row]
            if row:
                if (not second_read and not
                        any(simulation.get((item[0], item[1], item[2], timestep), []))):
                    timestep += 1
                    second_read = True
                simulation[(item[0], item[1], item[2], timestep)] = np.array(item[3:])
    file_count += 1
    timestep += 1
    second_read = False
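As a usage sketch (with hypothetical coordinates; any key built from your data works the same way), the whole time series for one point can then be pulled out with a single comprehension:

# hypothetical query: every timestep stored for coords (1.0, 2.0, 3.0)
coords = (1.0, 2.0, 3.0)
series = [simulation[key] for key in sorted(simulation) if key[:3] == coords]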

Increase Python code speed

Does anyone have a clue how to increase the speed of this part of my Python code?
It was designed to deal with small files (with just a few lines, for which it is very fast), but I want to run it on big files (~50 GB, millions of lines).
The main goal of this code is to read strings from a file (.txt), search for them in an input file, and print the number of times each one occurred to the output file.
Here is the code (infile, seqList and out are determined by optparse as options at the beginning of the code, not shown):
def novo(infile, seqList, out):
    uDic = dict()
    rDic = dict()
    nmDic = dict()
    with open(infile, 'r') as infile, open(seqList, 'r') as RADlist:
        samples = [line.strip() for line in RADlist]
        lines = [line.strip() for line in infile]
    # Create dictionaries with all the samples
    for i in samples:
        uDic[i.replace(" ", "")] = 0
        rDic[i.replace(" ", "")] = 0
        nmDic[i.replace(" ", "")] = 0
    for k in lines:
        l1 = k.split("\t")
        l2 = l1[0].split(";")
        l3 = l2[0].replace(">", "")
        if len(l1) < 2:
            continue
        if l1[4] == "U":
            for k in uDic.keys():
                if k == l3:
                    uDic[k] += 1
        if l1[4] == "R":
            for j in rDic.keys():
                if j == l3:
                    rDic[j] += 1
        if l1[4] == "NM":
            for h in nmDic.keys():
                if h == l3:
                    nmDic[h] += 1
    f = open(out, "w")
    f.write("Sample"+"\t"+"R"+"\t"+"U"+"\t"+"NM"+"\t"+"TOTAL"+"\t"+"%R"+"\t"+"%U"+"\t"+"%NM"+"\n")
    for i in samples:
        U = int()
        R = int()
        NM = int()
        for k, j in uDic.items():
            if k == i:
                U = j
        for o, p in rDic.items():
            if o == i:
                R = p
        for y, u in nmDic.items():
            if y == i:
                NM = u
        TOTAL = int(U + R + NM)
        try:
            f.write(i+"\t"+str(R)+"\t"+str(U)+"\t"+str(NM)+"\t"+str(TOTAL)+"\t"+str(float(R) / TOTAL)+"\t"+str(float(U) / TOTAL)+"\t"+str(float(NM) / TOTAL)+"\n")
        except:
            continue
    f.close()
With 50 GB files, the question is not how to make it faster, but how to make it runnable at all.
The main problem is that you will run out of memory: you should modify the code to process the files without holding them fully in memory, keeping in memory only the single line that is currently needed.
The following code from your question reads all the lines from two files:
with open(infile, 'r') as infile, open(seqList, 'r') as RADlist:
    samples = [line.strip() for line in RADlist]
    lines = [line.strip() for line in infile]
    # at this moment you are likely to have run out of memory already

# Create dictionaries with all the samples
for i in samples:
    uDic[i.replace(" ", "")] = 0
    rDic[i.replace(" ", "")] = 0
    nmDic[i.replace(" ", "")] = 0

# a similar loop over `lines` comes later on
You should defer reading the lines until the latest possible moment, like this:

# Create dictionaries with all the samples
with open(seqList, 'r') as RADlist:
    for sampleline in RADlist:
        sample = sampleline.strip().replace(" ", "")
        uDic[sample] = 0
        rDic[sample] = 0
        nmDic[sample] = 0
Note: did you want to use line.strip() or line.split()?
This way, you do not have to keep all the content in memory.
There are many more options for optimization, but this one will let you take off and run.
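The same deferral applies to the big input file. As a minimal sketch (same parsing logic as in the question, but streaming one line at a time and using direct membership tests instead of scanning every key), the counting loop could become:

with open(infile, 'r') as f:
    for line in f:
        l1 = line.strip().split("\t")
        if len(l1) < 2:
            continue
        l3 = l1[0].split(";")[0].replace(">", "")
        # only the current line is held in memory
        if l1[4] == "U" and l3 in uDic:
            uDic[l3] += 1
        elif l1[4] == "R" and l3 in rDic:
            rDic[l3] += 1
        elif l1[4] == "NM" and l3 in nmDic:
            nmDic[l3] += 1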
It would make things much easier if you provided some sample input. Because you haven't, I haven't tested this, but the idea is simple: iterate through each file only once, using iterators rather than reading the whole file into memory. Use the efficient collections.Counter object to handle the counting and minimise inner looping:
def novo(infile, seqList, out):
    from collections import Counter
    import csv
    # Count
    counts = Counter()
    with open(infile, 'r') as infile:
        for line in infile:
            l1 = line.strip().split("\t")
            if len(l1) < 2:
                continue
            l3 = l1[0].split(";")[0].replace(">", "")
            counts[(l1[4], l3)] += 1
    # Produce output
    types = ['R', 'U', 'NM']
    with open(seqList, 'r') as RADlist, open(out, 'w') as outfile:
        f = csv.writer(outfile, delimiter='\t')
        f.writerow(['Sample'] + types + ['TOTAL'] + ['%' + t for t in types])
        for sample in RADlist:
            sample = sample.strip()
            countrow = [counts[(t, sample)] for t in types]
            total = sum(countrow)
            if total:  # avoid dividing by zero for samples that never appear
                f.writerow([sample] + countrow + [total] +
                           [float(c) / total for c in countrow])
If you convert your script into functions (it makes profiling easier), you can then see where it spends its time when you profile it. I suggest using runsnake: runsnakerun.
I would try replacing your loops with list and dictionary comprehensions. For example, instead of:

for i in samples:
    uDic[i.replace(" ", "")] = 0

try:

uDic = {i.replace(" ", ""): 0 for i in samples}

and similarly for the other dicts.
I don't really follow what's going on in your "for k in lines" loop, but you only need l3 (and l2) when l1[4] has certain values. Why not check for those values before splitting and replacing?
Lastly, instead of looping through all the keys of a dict to see if a given element is in that dict, try:
if x in myDict:
    myDict[x] = ...

for example:

for k in uDic.keys():
    if k == l3:
        uDic[k] += 1

can be replaced with:

if l3 in uDic:
    uDic[l3] += 1
Other than that, try profiling.
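A minimal profiling invocation with the standard library, assuming the function is called novo as in the question and the file names here are placeholders:

import cProfile
import pstats

cProfile.run("novo('input.txt', 'seqlist.txt', 'out.txt')", 'novo.prof')
pstats.Stats('novo.prof').sort_stats('cumulative').print_stats(10)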
1) Look into profilers and adjust the code that is taking the most time.
2) You could try optimizing some methods with Cython - use data from the profiler to modify the correct thing.
3) It looks like you can use a Counter instead of a dict for the output file, and a set for the input file - look into them:

seen = set()  # renamed: assigning to the name set would shadow the built-in
from collections import Counter
counter = Counter()  # essentially a modified dict that is optimized for counting...
                     # like counting occurrences of strings in a text file
4) If you are reading a 50 GB file, you won't be able to store it all in RAM (I'm assuming, since who knows what kind of computer you have), so generators should save your memory and time:

# change the list comprehensions to generator expressions
samples = (line.strip() for line in RADlist)
lines = (line.strip() for line in infile)

Printing to a .csv file from a Random List

When I create a random list of numbers like so:

import random

columns = 10
rows = 10
for x in range(rows):
    a_list = []
    for i in range(columns):
        a_list.append(str(random.randint(1000000, 99999999)))
    values = ",".join(str(i) for i in a_list)
    print values
then all is well.
But when I attempt to send the output to a file, like so:
import sys

sys.stdout = open('random_num.csv', 'w')
for i in a_list:
    print ", ".join(map(str, a_list))

then it is only the last row that is output, 10 times. How do I write the entire list to a .csv file?
In your first example, you're creating a new list for every row. (By the way, you don't need to convert the numbers to str twice.)
In your second example, you print only the last list you created. Move the output into the first loop:
columns = 10
rows = 10
with open("random_num.csv", "w") as outfile:
    for x in range(rows):
        a_list = [random.randint(1000000, 99999999) for i in range(columns)]
        values = ",".join(str(i) for i in a_list)
        print values
        outfile.write(values + "\n")
Tim's answer works well, but I think you are trying to print to the terminal and to the file in different places.
So, with minimal modifications to your code, you can use a new variable all_list:
import random
import sys

all_list = []
columns = 10
rows = 10
for x in range(rows):
    a_list = []
    for i in range(columns):
        a_list.append(str(random.randint(1000000, 99999999)))
    values = ",".join(str(i) for i in a_list)
    print values
    all_list.append(a_list)

sys.stdout = open('random_num.csv', 'w')
for a_list in all_list:
    print ", ".join(map(str, a_list))
The csv module takes care of a lot of the fiddly details of dealing with csv files.
As you can see below, you don't need to worry about conversion to strings or adding line endings.
import csv
import random

columns = 10
rows = 10
with open("random_num.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    for x in range(rows):
        a_list = [random.randint(1000000, 99999999) for i in range(columns)]
        writer.writerow(a_list)
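For what it's worth, the answers above are Python 2 (print statements, 'wb' mode for csv). A minimal Python 3 equivalent of the csv-module version, assuming nothing else in the script requires Python 2, would open the file in text mode with newline='':

import csv
import random

columns = 10
rows = 10
# in Python 3, csv files are opened in text mode with newline=''
with open("random_num.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for x in range(rows):
        writer.writerow([random.randint(1000000, 99999999) for i in range(columns)])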

nested for loop in python not working

We basically have a large Excel file, and what I'm trying to do is create a list that has the maximum and minimum values of each column. There are 13 columns, which is why the while loop should stop once it hits 14. The problem is that once the counter is increased, the code does not seem to iterate through the for loop again: the while loop only goes through the for loop once, yet it does loop in the sense that it increases the counter by 1 and stops at 14. It should be noted that the rows in the input file are strings of numbers, which is why I convert them to tuples and then check whether the value at the given position is greater than column_max or smaller than column_min; if so, I reassign column_max or column_min. Once this is completed, column_max and column_min are appended to a list (l) and the counter (position) is increased to repeat for the next column. Any help will be appreciated.
input_file = open('names.csv', 'r')
l = []
column_max = 0
column_min = 0
counter = 0
while counter < 14:
    for row in input_file:
        row = row.strip()
        row = row.split(',')
        row = tuple(row)
        if float(row[counter]) > column_max:
            column_max = float(row[counter])
        elif float(row[counter]) < column_min:
            column_min = float(row[counter])
        else:
            column_min = column_min
            column_max = column_max
    l.append((column_max, column_min))
    counter = counter + 1
I think you want to switch the order of your for and while loops.
Note that there is a slightly better way to do this:
with open('yourfile') as infile:
    # read the first row; set column min and max to the values in the first row
    data = [float(x) for x in infile.readline().split(',')]
    column_maxs = data[:]
    column_mins = data[:]
    # read subsequent rows, updating the min/max
    for line in infile:
        data = [float(x) for x in line.split(',')]
        for i, d in enumerate(data):
            column_maxs[i] = max(d, column_maxs[i])
            column_mins[i] = min(d, column_mins[i])
If you have enough memory to hold the file in memory at once, this becomes even easier:
with open('yourfile') as infile:
    data = [list(map(float, line.split(','))) for line in infile]
data_transpose = list(zip(*data))
col_mins = [min(x) for x in data_transpose]
col_maxs = [max(x) for x in data_transpose]
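If numpy happens to be available, the same result takes a couple of calls; a sketch, assuming the csv is purely numeric:

import numpy as np

data = np.loadtxt('yourfile', delimiter=',')
col_mins = data.min(axis=0)
col_maxs = data.max(axis=0)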
Once you have consumed the file, it has been consumed. Thus iterating over it again won't produce anything.
>>> for row in input_file:
...     print row
1,2,3,...
4,5,6,...
etc.
>>> for row in input_file:
...     print row
>>> # nothing gets printed; the file is already consumed
That is the reason why your code is not working.
You then have three main approaches:
1. Read the file each time (inefficient in I/O operations);
2. Load it into a list (inefficient for large files, as it stores the whole file in memory);
3. Rework the logic to operate line by line (quite feasible and efficient, though not as brief in code as loading everything into a two-dimensional structure, transposing it, and using min and max).
Here is my technique for the third approach:
maxima = [float('-inf')] * 13
minima = [float('inf')] * 13
with open('names.csv') as input_file:
    for row in input_file:
        for col, value in enumerate(row.split(',')):
            value = float(value)
            maxima[col] = max(maxima[col], value)
            minima[col] = min(minima[col], value)

# This gets the value you called ``l``
combined_max_and_min = list(zip(maxima, minima))
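As a quick usage sketch (assuming names.csv has the 13 numeric columns described), the per-column results can then be printed directly:

for col, (col_max, col_min) in enumerate(combined_max_and_min):
    print("column {}: max={}, min={}".format(col, col_max, col_min))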
