Matching multiple array values to rows in a CSV file is slow - Python

I have a numpy array consisting of about 1200 arrays containing 10 values each (np.shape = (1200, 10)). Each element has a value between 0 and 5.7 million.
Next I have a .csv file with 3800 lines. Every line contains 2 values: the first value indicates the upper bound of a range, the second value is an identifier. The first and last 5 rows of the .csv file:
509,47222
1425,47220
2404,47219
4033,47218
6897,47202
...,...
...,...
...,...
5793850,211
5794901,186
5795820,181
5796176,43
5796467,33
The first column goes up until it reaches 5.7 million. For each value in the numpy array I want to check the first column of the .csv file. For example, for the value 3333 the identifier is 47218: each row covers the range from the first column of the previous row up to the first column of that row (e.g. 2404 - 4033 maps to identifier 47218).
Now I want to get the identifier for each value in the numpy array, and then record each identifier together with the frequency with which it is found in the numpy array. That means looping over the 3800-line csv file once for each of the 12000 values in the numpy array and incrementing an integer each time. This process takes about 30 seconds, which is way too long.
This is the code I am currently using:
numpy_file = np.fromfile(filename, dtype=np.int32)
#some code to format numpy_file correctly

with open('/identifer_file.csv') as read_file:
    csv_reader = csv.reader(read_file, delimiter=',')
    csv_reader = list(csv_reader)

identifier_dict = {}
for numpy_array in numpy_file:
    for numpy_value in numpy_array:
        #there are 12000 numpy_value in numpy_file
        for row in csv_reader:
            last_identifier = 0
            if numpy_value <= int(row[0]):
                last_identifier = int(row[1])
                #adding the frequency of the identifier in numpy_file to a dict
                if last_identifier in identifier_dict:
                    identifier_dict[last_identifier] += 1
                else:
                    identifier_dict[last_identifier] = 1
            else:
                continue
            break

for x, y in identifier_dict.items():
    if(y > 40):
        print("identifier: {} amount of times found: {}".format(x, y))
What algorithm should I implement to speed up this process?
Edit
I have tried flattening the numpy array to a 1D array, so it has 12000 values. This has no real effect on the speed; the latest test took 33 seconds.

Setup:
import io
import csv
import collections
import numpy as np

np.random.seed(100)
numpy_file = np.random.randint(0, 5700000, (1200,10))
# range, identifier
read_file = io.StringIO('''509,47222
1425,47220
2404,47219
4033,47218
6897,47202
5793850,211
5794901,186
5795820,181
5796176,43
5796467,33''')
csv_reader = csv.reader(read_file, delimiter=',')
csv_reader = list(csv_reader)
# your example code put in a function and adapted for the setup above
def original(numpy_file, csv_reader):
    identifier_dict = {}
    for numpy_array in numpy_file:
        for numpy_value in numpy_array:
            #there are 12000 numpy_value in numpy_file
            for row in csv_reader:
                last_identifier = 0
                if numpy_value <= int(row[0]):
                    last_identifier = int(row[1])
                    #adding the frequency of the identifier in numpy_file to a dict
                    if last_identifier in identifier_dict:
                        identifier_dict[last_identifier] += 1
                    else:
                        identifier_dict[last_identifier] = 1
                else:
                    continue
                break
    # for x, y in identifier_dict.items():
    #     if(y > 40):
    #         print("identifier: {} amount of times found: {}".format(x, y))
    return identifier_dict
Three solutions, each vectorizing some of the operations. The first function consumes the least memory, the last consumes the most.
def first(numpy_file, r):
    '''compare each value in the array to the entire first column of the csv'''
    alternate = collections.defaultdict(int)
    for value in np.nditer(numpy_file):
        comparison = value < r[:,0]
        identifier = r[:,1][comparison.argmax()]
        alternate[identifier] += 1
    return alternate
def second(numpy_file, r):
    '''compare each row of the array to the first column of csv'''
    alternate = collections.defaultdict(int)
    for row in numpy_file:
        comparison = row[...,None] < r[:,0]
        indices = comparison.argmax(-1)
        id_s = r[:,1][indices]
        for thing in id_s:
            #adding the frequency of the identifier in numpy_file to a dict
            alternate[thing] += 1
    return alternate
def third(numpy_file, r):
    '''compare the whole array to the first column of csv'''
    comparison = numpy_file[...,None] < r[:,0]
    indices = comparison.argmax(-1)
    id_s = r[:,1][indices]
    return collections.Counter(map(int, np.nditer(id_s)))
The functions require the csv file to be read into a numpy array:
read_file.seek(0)  # io.StringIO object from setup
r = np.array([list(map(int, row)) for row in csv.reader(read_file, delimiter=',')])

zero = original(numpy_file, csv_reader)  # baseline; csv_reader is still the list from the setup
one = first(numpy_file, r)
two = second(numpy_file, r)
three = third(numpy_file, r)
assert zero == one
assert zero == two
assert zero == three
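A fourth option, a minimal sketch along the same lines: the breakpoint column is sorted, so np.searchsorted can binary-search all 12000 values at once instead of scanning the csv rows linearly (this assumes, as in the sample data, that no value exceeds the last breakpoint):

def fourth(numpy_file, r):
    '''binary-search all values against the sorted breakpoint column at once'''
    # side='left' returns the first index whose breakpoint is >= value,
    # matching the original `numpy_value <= int(row[0])` test;
    # a value above the last breakpoint would index out of range
    indices = np.searchsorted(r[:,0], numpy_file.ravel(), side='left')
    return collections.Counter(map(int, r[:,1][indices]))

four = fourth(numpy_file, r)
assert zero == four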

Related

How to create a 2d nested list from a text file using python?

I'm a beginner programmer, and I'm trying to figure out how to create a 2d nested list (grid) from a particular text file. For example, the text file would look like this:
3
3
150
109
80
892
123
982
0
98
23
The first two lines in the text file would be used to create the grid, meaning that it is 3x3. The next 9 lines would be used to populate the grid, with the first 3 making up the first row, the next 3 making up the middle row, and the final 3 making up the last row. So the nested list would look like this:
[[150, 109, 80], [892, 123, 982], [0, 98, 23]]
How do I go about doing this? I was able to make a list of all of the contents, but I can't figure out how to use the first 2 lines to define the size of the inner lists within the outer list:
lineContent = []
innerList = ?
for lines in open('document.txt','r'):
    value = int(lines)
    lineContent.append(value)
From here, where do I go to turn it into a nested list using the given values on the first 2 lines?
Thanks in advance.
You can make this quite neat using a list comprehension.
def txt_grid(your_txt):
    with open(your_txt, 'r') as f:
        # Find columns and rows
        columns = int(f.readline())
        rows = int(f.readline())
        your_list = [[f.readline().strip() for i in range(rows)] for j in range(columns)]
    return your_list

print(txt_grid('document.txt'))
strip() just clears the newline characters (\n) from each line before storing them in the list.
Edit: a modified version with logic for the case where your txt file doesn't have enough rows for the defined dimensions.
def txt_grid(your_txt):
    with open(your_txt, 'r') as f:
        # Find columns and rows
        columns = int(f.readline())
        rows = int(f.readline())
        dimensions = columns * rows
        # Count the remaining non-empty lines (the first two have already been read)
        nonempty_lines = len([line for line in f if line.strip()])
        # Test to see if there are enough rows, creating the grid if there are
        if nonempty_lines < dimensions:
            # Either raise an error
            # raise ValueError("Insufficient non-empty rows in text file for given dimensions")
            # Or return something that's not a list
            your_list = None
        else:
            # Counting consumed the file, so rewind and skip the two header lines again
            f.seek(0)
            f.readline()
            f.readline()
            your_list = [[f.readline().strip() for i in range(rows)] for j in range(columns)]
    return your_list

print(txt_grid('document.txt'))
def parse_txt(filepath):
    lineContent = []
    with open(filepath, 'r') as txt:  # The with statement closes the txt file after it's been used
        nrows = int(txt.readline())
        ncols = int(txt.readline())
        for i in range(nrows):  # For each row
            row = []
            for j in range(ncols):  # Grab each value in the row
                row.append(int(txt.readline()))
            lineContent.append(row)
    return lineContent

grid_2d = parse_txt('document.txt')
lineContent = []
for lines in open('testQuestion.txt', 'r'):
    value = int(lines)
    lineContent.append(value)

rowSz = lineContent[0]  # row size
colSz = lineContent[1]  # column size
del lineContent[0], lineContent[0]  # leaves just the matrix values; index 0 is deleted twice because the list shifts after the first delete (could also just start currentLine at 2)
assert rowSz * colSz == len(lineContent), 'not enough values for array'  # ensure there are enough entries for a rowSz * colSz array

arr = []
currentLine = 0
for x in range(rowSz):
    arr.append([])
    for y in range(colSz):
        arr[x].append(lineContent[currentLine])
        currentLine += 1
print(arr)
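If numpy is an option, a shorter sketch of the same job (assuming the file layout shown above: one integer per line, with the two dimensions first):

import numpy as np

def txt_grid_np(path):
    values = np.loadtxt(path, dtype=int)   # reads the file as one flat column of integers
    rows, cols = values[0], values[1]      # the first two entries are the dimensions
    return values[2:].reshape(rows, cols).tolist()

print(txt_grid_np('document.txt'))  # [[150, 109, 80], [892, 123, 982], [0, 98, 23]]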

Processing data in text files

I have multiple text files in a directory. Each of these files contains 7 columns and 20 rows; the last column initially has 0 in every row.
What I want to do is use the first three columns of each txt file (line by line) to make some calculations and store the result in the 7th column of the corresponding line.
To clarify the structure of one txt file:
642.29 710.87 154.24 -0.50384 -0.17085 0.067804 0
641.57 711.98 154.42 -0.50681 -0.16978 0.06784 0
640.82 713.14 154.58 -0.50944 -0.1711 0.068266 0
639.72 714.53 154.59 -0.50496 -0.19229 0.057764 0
638.99 715.79 154.75 -0.50728 -0.18873 0.057795 0
638.18 717.13 154.96 -0.51024 -0.18653 0.057893 0
After the calculations are done, the last column holds the new values as follows, and the txt file should be saved with them:
642.29 710.87 154.24 -0.50384 -0.17085 0.067804 0
641.57 711.98 154.42 -0.50681 -0.16978 0.06784 1.3352527850560352
640.82 713.14 154.58 -0.50944 -0.1711 0.068266 2.725828205520504
639.72 714.53 154.59 -0.50496 -0.19229 0.057764 3.1632005923493804
638.99 715.79 154.75 -0.50728 -0.18873 0.057795 3.237582509147674
638.18 717.13 154.96 -0.51024 -0.18653 0.057893 3.044767452434894
I did the process for one file, but how can I do it for multiple files: open each file automatically, do the calculations on that file, and store the result?
Thanks
My code for one file:
import numpy as np
import os

Capture_period = 10
Marker_frames = 2000
Sampling_time = Capture_period/Marker_frames
coords = []
vel_list = [0]
ins_vel_list = [0]

# Define a function to calculate the euclidean distance
def Euclidean_Distance(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.linalg.norm(a-b)

def process(contents):
    contents = first_source_data.tolist()
    # Extract the xyz coordinates
    for i, item in enumerate(contents):
        coords.append([[float(x) for x in item[0:3]], i+1])
    print(coords)
    rang = range(len(coords))
    for i in rang:
        if i != rang[-1]:
            Eucl_distance = Euclidean_Distance(coords[i][0], coords[i+1][0])
            vel = Eucl_distance / (Sampling_time*100)  # + " cm/sec"
            vel_list.append(vel)
            ins_vel = (vel_list[i]+vel_list[i+1])/2
            ins_vel_list.append(ins_vel)
            continue
    #del ins_vel_list[:]
    #print(ins_vel_list)

from glob import glob
filepaths = glob("/home/experiment/*.txt")
for path in filepaths:
    print(path)
    process(path)
Problems:
The first 4 lines in each file are not read!
The appended lists must be reset before processing a new file.
You can create three text files with the 7 columns and whatever rows to test it.
Each file consists of motion coordinates (xyz) and (theta_x, theta_y, theta_z), and the last column is the instantaneous velocity, which is the average of consecutive average velocities.
The first entry of the last column should be zero in all files (because at the starting time the velocity is zero).
Any help or solutions are appreciated!
Put your code in a function and make the function accept the path as an argument, then call the function in a for loop iterating over the list of files.
E.g.:
from glob import glob
import numpy as np

CAPTURE_PERIOD = 10
MARKER_FRAMES = 2000
SAMPLING_TIME = CAPTURE_PERIOD/MARKER_FRAMES

def get_euclidean_distance(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.linalg.norm(a - b)

def make_chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i : i+n]

def write_in_chunks(f, lst, n):
    for chunk in make_chunks(lst, n):
        f.write(" ".join(str(val) for val in chunk) + "\n")

def process_file(filepath):
    """Process a single file.

    :param filepath: path to the file to process
    """
    # Load data from file
    with open(filepath) as datafile:
        contents = [line.split() for line in datafile]
    # Define an empty list for coordinates
    coords = []
    # Set the first component in the velocity vector to 0 in velocity_list
    vel_list = [0]
    inst_vel_list = [0]
    # Extract the xyz coordinates
    for i, item in enumerate(contents):
        coords.append([[float(x) for x in item[0:3]], i+1])
    # Calculate the euclidean distance and the speed using the previous coordinates
    rang = range(len(coords))
    for i in rang:
        if i != rang[-1]:
            eucl_distance = get_euclidean_distance(coords[i][0], coords[i+1][0])
            vel = eucl_distance / (SAMPLING_TIME*100)  # cm/sec
            vel_list.append(vel)
            inst_vel = (vel_list[i]+vel_list[i+1])/2
            inst_vel_list.append(inst_vel)
    # Write the velocities into the 7th column and save the file
    for i, item in enumerate(contents):
        item[-1] = inst_vel_list[i]
    contents = np.ravel(contents)
    with open(filepath, "w") as f:
        write_in_chunks(f, contents, 7)

if __name__ == "__main__":
    filepaths = glob("/home/experiment/*.txt")
    for path in filepaths:
        process_file(path)
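The inner velocity loop can also be vectorized with numpy. A hedged sketch of just that computation (instantaneous_velocities is a hypothetical helper, using the same SAMPLING_TIME and the same consecutive-averaging as the loop above):

def instantaneous_velocities(xyz):
    """xyz: (n, 3) array of coordinates; returns the n instantaneous velocities."""
    # distance between consecutive points, converted to cm/sec as above
    seg_vel = np.linalg.norm(np.diff(xyz, axis=0), axis=1) / (SAMPLING_TIME * 100)
    vel = np.concatenate(([0.0], seg_vel))  # the velocity starts at zero
    # average of consecutive velocities, again with a leading zero
    return np.concatenate(([0.0], (vel[:-1] + vel[1:]) / 2))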

Find max and extract data from a list

I have a text file with car prices and their serial numbers; there are 50 lines in this file. I would like to find the max car price and its serial number for every 10 lines.
priceandserial.txt
102030 4000.30
102040 5000.40
102080 5500.40
102130 4000.30
102140 5000.50
102180 6000.50
102230 2000.60
102240 4000.30
102280 6000.30
102330 9000.70
102340 1000.30
102380 3000.30
102430 4000.80
102440 5000.30
102480 7000.30
When I tried numpy's max function I got 102480 as the max value.
x = np.loadtxt('carserial.txt', unpack=True)
print('Max:', np.max(x))
Desired result:
102330 9000.70
102480 7000.30
There are 50 lines in the file, so I should get a 5-line result with the serial and max price of each block of 10 lines.
Respectfully, I think the first solution is over-engineered. You don't need numpy or math for this task, just a dictionary. As you loop through, you update the dictionary if the latest value is greater than the current value, and do nothing if it isn't. Every 10th item, you append the values from the dictionary to an output list and reset the buffer.
with open('filename.txt', 'r') as opened_file:
    data = opened_file.read()

rowsplitdata = [row for row in data.split('\n') if row]  # drop empty lines
colsplitdata = [u.split(' ') for u in rowsplitdata]
x = [[int(j[0]), float(j[1])] for j in colsplitdata]

output = []
buffer = {"max": 0, "index": 0}
count = 0
#this assumes x is a list of lists, not a numpy array
for u in x:
    count += 1
    if u[1] > buffer["max"]:
        buffer["max"] = u[1]
        buffer["index"] = u[0]
    if count == 10:
        output.append([buffer["index"], buffer["max"]])
        buffer = {"max": 0, "index": 0}
        count = 0
#append the remainder of the buffer in case you didn't get to ten in the final pass
if count:
    output.append([buffer["index"], buffer["max"]])

output
[[102330, 9000.7], [102480, 7000.3]]
You should iterate over it and extract the maximum for every 10 lines:
# New empty list for collecting the results
max_list = []
# iterate through x in steps of 10
for i in range(0, len(x), 10):
    # append the max of the next 10 elements; the slice simply ends early
    # on the final chunk, so the remaining elements are handled too
    max_list.append(np.max(x[i:i+10]))
This should do your job.
number_list = [[], []]
with open('filename.txt', 'r') as opened_file:
    for line in opened_file:
        if len(line.split()) == 0:
            continue
        else:
            a, b = line.split(" ")
            # convert to numbers so max compares numerically, not lexicographically
            number_list[0].append(int(a))
            number_list[1].append(float(b))

col1_max, col2_max = max(number_list[0]), max(number_list[1])
col1_max, col2_max
Just change the filename. col1_max, col2_max have the respective column's max value. You can edit the code to accommodate more columns.
You can transpose your input first, then use np.split and calculate the max of each submatrix:
x = np.genfromtxt('carserial.txt', unpack=True).T
print(x)

for submatrix in np.split(x, len(x)//10):
    print(max(submatrix, key=lambda l: l[1]))
working example
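If the row count is an exact multiple of 10, the same result also follows from a reshape and a per-block argmax, with no Python-level loop; a sketch under that assumption:

x = np.loadtxt('carserial.txt')   # shape (50, 2): serial, price
blocks = x.reshape(-1, 10, 2)     # five blocks of ten rows each
# pick, in each block, the row whose price (column 1) is largest
best = blocks[np.arange(len(blocks)), blocks[:, :, 1].argmax(axis=1)]
print(best)                       # one (serial, max price) row per block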

Unable to pull data from a file and place into two arrays

The code uses the matrix and arrpow functions to calculate the Fibonacci numbers for the elements in my list, num. Oddly, right after a.append(float(row[0])) completes, the error I get is
IndexError: list index out of range
This is obviously coming from b.append.
Here's the file I want to pull from
import time
import math
import csv
import matplotlib.pyplot as plt

def arrpow(arr, n):
    yarr = arr
    if n < 1:
        pass
    if n == 1:
        return arr
    yarr = arrpow(arr, n//2)
    yarr = [[yarr[0][0]*yarr[0][0]+yarr[0][1]*yarr[1][0], yarr[0][0]*yarr[0][1]+yarr[0][1]*yarr[1][1]],
            [yarr[1][0]*yarr[0][0]+yarr[1][1]*yarr[1][0], yarr[1][0]*yarr[0][1]+yarr[1][1]*yarr[1][1]]]
    if n % 2:
        yarr = [[yarr[0][0]*arr[0][0]+yarr[0][1]*arr[1][0], yarr[0][0]*arr[0][1]+yarr[0][1]*arr[1][1]],
                [yarr[1][0]*arr[0][0]+yarr[1][1]*arr[1][0], yarr[1][0]*arr[0][1]+yarr[1][1]*arr[1][1]]]
    return yarr

def matrix(n):
    arr = [[1, 1], [1, 0]]
    f = arrpow(arr, n-1)[0][0]
    return f

num = [10, 100, 1000, 10000, 100000, 1000000]
with open('matrix.dat', 'w') as h:
    for i in num:
        start_time = 0
        start_time = time.time()
        run = matrix(i)
        h.write(str(math.log10(i)))
        h.write('\n')
        h.write(str(math.log10(time.time()-start_time)))
        h.write('\n')

a = []
b = []
with open('matrix.dat', 'r+') as csvfile:
    plots = csv.reader(csvfile, delimiter=',')
    for row in plots:
        a.append(float(row[0]))
        b.append(float(row[1]))

plt.plot(a, b, label=" ")
row = ['1.0']
So row is a list with one value; row[1] tries to access the second element of that one-element list. That is why you are getting the error.
When you are constructing matrix.dat, you do not add a comma for the CSV reader to separate the data. So when it tries to read the file, the whole thing is converted into a 1-element array. Attempting to access the second element throws an error because it doesn't exist.
Solution: replace the first h.write('\n') inside the writing loop with h.write(',').
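A sketch of the corrected writing loop (everything else as posted):

with open('matrix.dat', 'w') as h:
    for i in num:
        start_time = time.time()
        run = matrix(i)
        h.write(str(math.log10(i)))
        h.write(',')  # comma instead of newline, so csv.reader sees two columns per row
        h.write(str(math.log10(time.time() - start_time)))
        h.write('\n')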

nested for loop in python not working

We basically have a large Excel file, and what I'm trying to do is create a list that has the maximum and minimum values of each column. There are 13 columns, which is why the while loop should stop once it hits 14. The problem is that after the counter is increased, the for loop does not iterate again: the while loop goes through the for loop only once, yet it does keep looping in the sense that it increases the counter by 1 and stops at 14. It should be noted that the rows in the input file are strings of numbers, which is why I convert them to tuples and then check whether the value at the given position is greater than column_max or smaller than column_min; if so, I reassign column_max or column_min. Once this is completed, column_max and column_min are appended to a list (l) and the counter (the position) is increased to repeat this for the next column. Any help will be appreciated.
input_file = open('names.csv','r')
l = []
column_max = 0
column_min = 0
counter = 0
while counter < 14:
    for row in input_file:
        row = row.strip()
        row = row.split(',')
        row = tuple(row)
        if (float(row[counter])) > column_max:
            column_max = float(row[counter])
        elif (float(row[counter])) < column_min:
            column_min = float(row[counter])
        else:
            column_min = column_min
            column_max = column_max
    l.append((column_max, column_min))
    counter = counter + 1
I think you want to switch the order of your for and while loops.
Note that there is a slightly better way to do this:
with open('yourfile') as infile:
    #read first row. Set column min and max to values in first row
    data = [float(x) for x in infile.readline().split(',')]
    column_maxs = data[:]
    column_mins = data[:]
    #read subsequent rows getting new min/max
    for line in infile:
        data = [float(x) for x in line.split(',')]
        for i, d in enumerate(data):
            column_maxs[i] = max(d, column_maxs[i])
            column_mins[i] = min(d, column_mins[i])
If you have enough memory to hold the file in memory at once, this becomes even easier:
with open('yourfile') as infile:
    data = [list(map(float, line.split(','))) for line in infile]

# zip returns a one-shot iterator in Python 3, so materialize it
# before traversing it twice
data_transpose = list(zip(*data))
col_mins = [min(x) for x in data_transpose]
col_maxs = [max(x) for x in data_transpose]
Once you have iterated over the file, it has been consumed, so iterating over it again won't produce anything.
>>> for row in input_file:
...     print(row)
1,2,3,...
4,5,6,...
etc.
>>> for row in input_file:
...     print(row)
>>> # Nothing gets printed, the file is consumed
That is the reason why your code is not working.
You then have three main approaches:
Read the file each time (inefficient in I/O operations);
Load it into a list (inefficient for large files, as it stores the whole file in memory);
Rework the logic to operate line by line (quite feasible and efficient, though not as brief as loading everything into a two-dimensional structure, transposing it, and using min and max).
Here is my technique for the third approach:
maxima = [float('-inf')] * 13
minima = [float('inf')] * 13
with open('names.csv') as input_file:
    for row in input_file:
        for col, value in enumerate(row.split(',')):
            value = float(value)
            maxima[col] = max(maxima[col], value)
            minima[col] = min(minima[col], value)

# This gets the value you called ``l``
combined_max_and_min = zip(maxima, minima)
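For comparison, if numpy is acceptable the whole task reduces to two axis-0 reductions; a minimal sketch (assuming names.csv is entirely numeric):

import numpy as np

data = np.loadtxt('names.csv', delimiter=',')  # shape (n_rows, 13)
combined_max_and_min = list(zip(data.max(axis=0), data.min(axis=0)))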
