Store data in an array from a loop - python

I have two sets of data which I would like to multiply element by element, storing each result in an array.
For now I have this:
import csv
from mpdaf.obj import Spectrum, WaveCoord
import matplotlib.pyplot as plt
import pandas as pd
from csv import reader
file_path = input("Enter full transmission curve path : ")
with open(file_path, 'r') as f:  # 'rw' is not a valid mode; read-only is enough here
    data = list(reader(f, delimiter=","))
wavelength = [i[0] for i in data]
percentage = [float(str(i[1]).replace(',', '.')) for i in data]
spectrum = input("Full spectrum path : ")
spe = Spectrum(filename=spectrum, ext=0)
data_flux = spe.data
flux_array = []
for i in percentage:
    for j in data_flux:
        flux = i*j
        flux_array.append(flux)
print(flux_array)
Written like this, it takes the first i, multiplies it by every j, then moves on to the next i, and so on.
I would like to multiply the first i by the first j and store the value, then multiply the second i by the second j and store that value, and so on.

It is as the error message says: your indices i and j are floats, not integers. When you write for i in percentage:, i takes on every value in the percentage list, not every index. If you want indices, iterate over a range instead. Here's an example to illustrate the difference:
percentage = [50.0, 60.0, 70.0]
for i in percentage:
    print(i)
# 50.0
# 60.0
# 70.0
for i in range(len(percentage)):
    print(i)
# 0
# 1
# 2
To multiply element by element, iterate over a single range of indices (the two lists should be the same length):
for i in range(len(percentage)):
    flux = percentage[i]*data_flux[i]
    flux_array.append(flux)
This iterates over the integer indices, starting at 0 and ending at the last index, pairing percentage[i] with data_flux[i].
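Alternatively, zip pairs the elements directly with no index bookkeeping; a minimal sketch of the same loop:
# zip stops at the shorter of the two lists, yielding positional pairs
flux_array = [i * j for i, j in zip(percentage, data_flux)]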


Processing data in text files

I have multiple text files in a directory. Each of these files contains 7 columns and 20 rows. The last column has 0 values in all rows at the beginning.
What I want to do is: use the first three columns of each txt file (line by line) to make some calculation and store the result in the 7th column, line by line.
To clarify the structure of one txt file:
642.29 710.87 154.24 -0.50384 -0.17085 0.067804 0
641.57 711.98 154.42 -0.50681 -0.16978 0.06784 0
640.82 713.14 154.58 -0.50944 -0.1711 0.068266 0
639.72 714.53 154.59 -0.50496 -0.19229 0.057764 0
638.99 715.79 154.75 -0.50728 -0.18873 0.057795 0
638.18 717.13 154.96 -0.51024 -0.18653 0.057893 0
After the calculations are done the last column becomes with the new values as following and the txt file should be stored with the new values:
642.29 710.87 154.24 -0.50384 -0.17085 0.067804 0
641.57 711.98 154.42 -0.50681 -0.16978 0.06784 1.3352527850560352
640.82 713.14 154.58 -0.50944 -0.1711 0.068266 2.725828205520504
639.72 714.53 154.59 -0.50496 -0.19229 0.057764 3.1632005923493804
638.99 715.79 154.75 -0.50728 -0.18873 0.057795 3.237582509147674
638.18 717.13 154.96 -0.51024 -0.18653 0.057893 3.044767452434894
I did the process for one file. But how can I do it for multiple files, i.e. open each file automatically, do the calculations on it, and store the result?
Thanks
My code for one file:
import numpy as np
import os
Capture_period = 10
Marker_frames = 2000
Sampling_time = Capture_period/Marker_frames
coords = []
vel_list = [0]
ins_vel_list = [0]
# Define a function to calculate the euclidean distance
def Euclidean_Distance(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.linalg.norm(a-b)
def process(contents):
    contents = first_source_data.tolist()
    # Extract the xyz coordinates
    for i, item in enumerate(contents):
        coords.append([[float(x) for x in item[0:3]], i+1])
    print(coords)
    rang = range(len(coords))
    for i in rang:
        if i != rang[-1]:
            Eucl_distance = Euclidean_Distance(coords[i][0], coords[i+1][0])
            vel = ((Eucl_distance / (Sampling_time*100)))  # + " cm/sec"
            vel_list.append(vel)
            ins_vel = (vel_list[i]+vel_list[i+1])/2
            ins_vel_list.append(ins_vel)
            continue
    #del ins_vel_list[:]
    #print(ins_vel_list)
from glob import glob
filepaths = glob("/home/experiment/*.txt")
for path in filepaths:
    print(path)
    process(path)
Problems:
The first 4 lines in each file are not read!
The appended lists must be reset before each new file.
You can create three text files with the 7 columns and however many rows to test it.
Each file consists of motion coordinates (x, y, z) and (theta_x, theta_y, theta_z), and the last column is the instantaneous velocity, which is the average of the average velocities.
The first entry of the last column should be zero in all files (because at the starting time the velocity is zero).
Any help or solution is appreciated!
Put your code in a function and make the function accept the path as argument, then call the function in a for loop iterating over the list of files.
E.g.:
from glob import glob
import numpy as np

CAPTURE_PERIOD = 10
MARKER_FRAMES = 2000
SAMPLING_TIME = CAPTURE_PERIOD/MARKER_FRAMES

def get_euclidean_distance(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.linalg.norm(a - b)

def make_chunks(lst, n):
    # yield successive n-sized chunks from lst
    for i in range(0, len(lst), n):
        yield lst[i : i+n]

def write_in_chunks(f, lst, n):
    for chunk in make_chunks(lst, n):
        f.write(" ".join(str(val) for val in chunk) + "\n")

def process_file(filepath):
    """Process a single file.

    :param filepath: path to the file to process
    """
    # Load data from file
    with open(filepath) as datafile:
        contents = [line.split() for line in datafile]
    # Define an empty list for coordinates
    coords = []
    # Set the first component in the velocity vector to 0
    vel_list = [0]
    inst_vel_list = [0]
    # Extract the xyz coordinates
    for i, item in enumerate(contents):
        coords.append([[float(x) for x in item[0:3]], i+1])
    # Calculate the euclidean distance and the speed using the previous coordinates
    rang = range(len(coords))
    for i in rang:
        if i != rang[-1]:
            eucl_distance = get_euclidean_distance(coords[i][0], coords[i+1][0])
            vel = eucl_distance / (SAMPLING_TIME*100)  # cm/sec
            vel_list.append(vel)
            inst_vel = (vel_list[i]+vel_list[i+1])/2
            inst_vel_list.append(inst_vel)
    # Write the instantaneous velocities back into the last column
    for i, item in enumerate(contents):
        item[-1] = inst_vel_list[i]
    contents = np.ravel(contents)
    with open(filepath, "w") as f:
        write_in_chunks(f, contents, 7)

if __name__ == "__main__":
    filepaths = glob("/home/experiment/*.txt")
    for path in filepaths:
        process_file(path)  # note: the function is named process_file, not process
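To try it out (the question suggests creating a few sample files), here is a minimal sketch that generates random test files; the directory name and value ranges are my assumptions, not from the question:
import os
import random

os.makedirs("experiment", exist_ok=True)
for n in range(3):
    with open("experiment/test_{}.txt".format(n), "w") as f:
        for _ in range(20):
            # columns: x y z theta_x theta_y theta_z velocity
            row = [random.uniform(600, 720) for _ in range(3)]
            row += [random.uniform(-0.6, 0.1) for _ in range(3)]
            row.append(0)  # the last column starts as 0 everywhere
            f.write(" ".join(str(v) for v in row) + "\n")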

Python, Automated Array Formation Data Retrieval Error

I am attempting to create a 2d array, and then subsequently pull data from the array and insert data at a specific point in the array. Below is some code that I wrote for creating the 2D Array:
from array import *
import math, random

TDArrayBuilder = []
TDArray = []
for yrunner in range(3):
    for xrunner in range(3):
        TDArrayBuilder.append(random.randint(0,1))
    TDArray.insert(yrunner, [TDArrayBuilder])
    TDArrayBuilder = []
print(TDArray[0][2])
The Error that this is spitting out is as follows:
Traceback (most recent call last):
  File "C:/TestFile.py", line 13, in <module>
    print(TDArray[0][2])
IndexError: list index out of range
I also wrote some code previous to this regarding finding and printing the minimum values and maximum values in a 2D array, it was easily able to print the value at the specified location. I'm pretty sure this is just because I used numpy, but I would still like to do this without numpy.
Example code:
import numpy as np  # required import
import math

# preset matrix data
location = []  # used for locations in searching
arr = np.array([[11, 12, 13], [14, 15, 16], [17, 15, 11], [12, 14, 15]])  # data matrix
result = np.where(arr == (np.amax(arr)))  # find the position(s) of the highest value; change np.amax to np.amin for the lowest
listofCoordinates = list(zip(result[0], result[1]))  # pairs up the row and column indices
for cord in listofCoordinates:  # takes the coordinate data out, individually
    for char in cord:  # loop used to separate the components of the coordinate
        location.append(char)  # stores these components in a locator list
length = len(location)  # takes the length of location and stores it
length = int(math.floor(length / 2))  # floors length / 2 and converts it to an int
for printer in range(length):  # for loop to iterate over the location list
    ycoord = location[printer*2]  # the row, or y coord, of the value
    xcoord = location[(printer*2)+1]  # the column, or x coord, of the value
    print(arr[ycoord][xcoord])  # prints the data at that location
Summary:
I would like to be able to retrieve data from a 2D array, and I don't know how to do that (regarding the first code). The numpy version works, but I would prefer not to use numpy for this operation for now. Anything would help.
from random import randint

TDArray = list()
for yrunner in range(3):
    TDArrayBuilder = list()
    for xrunner in range(3):
        TDArrayBuilder.append(randint(0, 1))
    TDArray.insert(yrunner, TDArrayBuilder)
print(TDArray)
print(TDArray[0][2])
or
TDArray = [[randint(0, 1) for _ in range(3)] for _ in range(3)]
print(TDArray)
print(TDArray[0][2])
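The crucial change is inserting TDArrayBuilder itself rather than [TDArrayBuilder]: the extra brackets nest each row one level too deep, which is what caused the IndexError. A minimal illustration:
row = [1, 0, 1]
flat = [row]       # what insert(yrunner, TDArrayBuilder) builds
print(flat[0][2])  # 1 -- indexes as expected
nested = [[row]]   # what insert(yrunner, [TDArrayBuilder]) builds
# nested[0] is [[1, 0, 1]], a one-element list, so nested[0][2]
# raises IndexError: list index out of range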

Find max and extract data from a list

I have a text file with car prices and their serial numbers; there are 50 lines in this file. I would like to find the max car price and its serial number for every 10 lines.
priceandserial.txt
102030 4000.30
102040 5000.40
102080 5500.40
102130 4000.30
102140 5000.50
102180 6000.50
102230 2000.60
102240 4000.30
102280 6000.30
102330 9000.70
102340 1000.30
102380 3000.30
102430 4000.80
102440 5000.30
102480 7000.30
When I tried NumPy's max function, I got 102480 (a serial number) as the max value:
x = np.loadtxt('carserial.txt', unpack=True)
print('Max:', np.max(x))
Desired result:
102330 9000.70
102480 7000.30
There are 50 lines in the file, therefore I should get a 5-line result with the serial and max price of each group of 10 lines.
Respectfully, I think the first solution is over-engineered. You don't need numpy or math for this task, just a dictionary. As you loop through, you update the dictionary if the latest value is greater than the current max, and do nothing if it isn't. Every 10th item, you append the values from the dictionary to an output list and reset the buffer.
with open('filename.txt', 'r') as opened_file:
    data = opened_file.read()
rowsplitdata = data.split('\n')
# skip empty lines (e.g. a trailing newline)
colsplitdata = [u.split(' ') for u in rowsplitdata if u]
x = [[int(j[0]), float(j[1])] for j in colsplitdata]
output = []
buffer = {"max": 0, "index": 0}
count = 0
# this assumes x is a list of lists, not a numpy array
for u in x:
    count += 1
    if u[1] > buffer["max"]:
        buffer["max"] = u[1]
        buffer["index"] = u[0]
    if count == 10:
        output.append([buffer["index"], buffer["max"]])
        buffer = {"max": 0, "index": 0}
        count = 0
# append the remainder of the buffer in case the final pass didn't reach ten
if count:
    output.append([buffer["index"], buffer["max"]])
output
[[102330, 9000.7], [102480, 7000.3]]
You should iterate over it in chunks of 10 lines and extract the maximum of each chunk:
# new empty list for collecting the results
max_list = []
# iterate through x in steps of 10
for i in range(0, len(x), 10):
    # key on the price (second column); slicing past the end of the
    # list is safe, so the final partial chunk needs no special case
    max_list.append(max(x[i:i+10], key=lambda pair: pair[1]))
This should do your job.
number_list = [[], []]
with open('filename.txt', 'r') as opened_file:
    for line in opened_file:
        if len(line.split()) == 0:
            continue
        a, b = line.split()
        number_list[0].append(int(a))    # serial as int
        number_list[1].append(float(b))  # price as float
col1_max, col2_max = max(number_list[0]), max(number_list[1])
col1_max, col2_max
Just change the filename. col1_max and col2_max hold the respective column's max value (note the values are converted to numbers, since max on strings compares lexicographically). You can edit the code to accommodate more columns.
You can transpose your input first, then use np.split, and for each submatrix calculate its max:
x = np.genfromtxt('carserial.txt', unpack=True).T
print(x)
for submatrix in np.split(x, len(x)//10):
    print(max(submatrix, key=lambda l: l[1]))
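A similar vectorized sketch (my addition; it assumes exactly 50 rows in priceandserial.txt) that keeps serial and price together via reshape and argmax:
import numpy as np

x = np.loadtxt('priceandserial.txt')   # shape (50, 2)
chunks = x.reshape(-1, 10, 2)          # 5 chunks of 10 rows each
best = chunks[:, :, 1].argmax(axis=1)  # index of the max price per chunk
for chunk, i in zip(chunks, best):
    print(int(chunk[i, 0]), chunk[i, 1])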

Matching multiple array value to row in csv file slow

I have a numpy array consisting of about 1200 arrays containing 10 values each (np.shape = (1200, 10)). Each element has a value between 0 and 5.7 million.
Next I have a .csv file with 3800 lines. Every line contains 2 values. The first value indicates a range the second value is an identifier. The first and last 5 rows of the .csv file:
509,47222
1425,47220
2404,47219
4033,47218
6897,47202
...,...
...,...
...,...
5793850,211
5794901,186
5795820,181
5796176,43
5796467,33
The first column goes up until it reaches 5.7 million. For each value in the numpy array I want to check the first column of the .csv file. For example, for the value 3333 the identifier is 47218, because each row covers the span from the first column of the previous row up to the first column of that row (e.g. 2404 - 4033 maps to 47218).
Now I want to get the identifier for each value in the numpy array, and then save each identifier together with the frequency with which it occurs in the numpy array. This means scanning the 3800-line CSV file once for each of the 12,000 values and incrementing a counter each time. The process takes about 30 seconds, which is way too long.
This is the code I am currently using:
numpy_file = np.fromfile(filename, dtype=np.int32)
#some code to format numpy_file correctly

with open('/identifer_file.csv') as read_file:
    csv_reader = csv.reader(read_file, delimiter=',')
    csv_reader = list(csv_reader)

identifier_dict = {}
for numpy_array in numpy_file:
    for numpy_value in numpy_array:
        #there are 12000 numpy_value in numpy_file
        for row in csv_reader:
            last_identifier = 0
            if numpy_value <= int(row[0]):
                last_identifier = int(row[1])
                #adding the frequency of the identifier in numpy_file to a dict
                if last_identifier in identifier_dict:
                    identifier_dict[last_identifier] += 1
                else:
                    identifier_dict[last_identifier] = 1
            else:
                continue
            break

for x, y in identifier_dict.items():
    if y > 40:
        print("identifier: {} amount of times found: {}".format(x, y))
What algorithm should I implement to speed up this process?
Edit
I have tried flattening the numpy array to a 1D array, so it has 12,000 values. This had no real effect on the speed; the latest test was 33 seconds.
Setup:
import io
import csv
import collections
import numpy as np

np.random.seed(100)
numpy_file = np.random.randint(0, 5700000, (1200, 10))

# '''range, identifier'''
read_file = io.StringIO('''509,47222
1425,47220
2404,47219
4033,47218
6897,47202
5793850,211
5794901,186
5795820,181
5796176,43
5796467,33''')
csv_reader = csv.reader(read_file, delimiter=',')
csv_reader = list(csv_reader)
# your example code put in a function and adapted for the setup above
def original(numpy_file, csv_reader):
    identifier_dict = {}
    for numpy_array in numpy_file:
        for numpy_value in numpy_array:
            # there are 12000 numpy_value in numpy_file
            for row in csv_reader:
                last_identifier = 0
                if numpy_value <= int(row[0]):
                    last_identifier = int(row[1])
                    # adding the frequency of the identifier in numpy_file to a dict
                    if last_identifier in identifier_dict:
                        identifier_dict[last_identifier] += 1
                    else:
                        identifier_dict[last_identifier] = 1
                else:
                    continue
                break
    # for x, y in identifier_dict.items():
    #     if y > 40:
    #         print("identifier: {} amount of times found: {}".format(x, y))
    return identifier_dict
Three solutions, each vectorizing some of the operations. The first function consumes the least memory, the last consumes the most.
def first(numpy_file, r):
    '''compare each value in the array to the entire first column of the csv'''
    alternate = collections.defaultdict(int)
    for value in np.nditer(numpy_file):
        comparison = value < r[:, 0]
        identifier = r[:, 1][comparison.argmax()]
        alternate[identifier] += 1
    return alternate

def second(numpy_file, r):
    '''compare each row of the array to the first column of the csv'''
    alternate = collections.defaultdict(int)
    for row in numpy_file:
        comparison = row[..., None] < r[:, 0]
        indices = comparison.argmax(-1)
        id_s = r[:, 1][indices]
        for thing in id_s:
            # adding the frequency of the identifier in numpy_file to a dict
            alternate[thing] += 1
    return alternate

def third(numpy_file, r):
    '''compare the whole array to the first column of the csv'''
    comparison = numpy_file[..., None] < r[:, 0]
    indices = comparison.argmax(-1)
    id_s = r[:, 1][indices]
    return collections.Counter(map(int, np.nditer(id_s)))
The functions require the csv file to be read into a numpy array:
read_file.seek(0)  # io.StringIO object from setup
csv_reader = csv.reader(read_file, delimiter=',')
r = np.array([list(map(int, thing)) for thing in csv_reader])

zero = original(numpy_file, csv_reader)  # baseline result for comparison
one = first(numpy_file, r)
two = second(numpy_file, r)
three = third(numpy_file, r)
assert zero == one
assert zero == two
assert zero == three
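Since the first CSV column is sorted ascending, a binary search is another option. This sketch is my addition, not part of the answer above; np.searchsorted finds each value's bucket in logarithmic time instead of scanning:
def fourth(numpy_file, r):
    '''binary-search each value into the sorted first column of the csv'''
    # side='left' returns the first index i with value <= r[i, 0],
    # matching the original `numpy_value <= int(row[0])` test; values
    # beyond the last boundary would index out of range here
    indices = np.searchsorted(r[:, 0], numpy_file.ravel(), side='left')
    id_s = r[:, 1][indices]
    return collections.Counter(map(int, id_s))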

Unable to pull data from a file and place into two arrays

The code uses the matrix and arrpow functions to calculate the Fibonacci numbers for the elements in my list, num. Oddly, right after a.append(float(row[0])) completes, the error I get is
IndexError: list index out of range
Which is obviously coming from b.append.
Here's the file I want to pull from
import time
import math
import csv
import matplotlib.pyplot as plt

def arrpow(arr, n):
    # fast 2x2 matrix exponentiation by squaring
    yarr = arr
    if n < 1:
        pass
    if n == 1:
        return arr
    yarr = arrpow(arr, n//2)
    yarr = [[yarr[0][0]*yarr[0][0]+yarr[0][1]*yarr[1][0], yarr[0][0]*yarr[0][1]+yarr[0][1]*yarr[1][1]],
            [yarr[1][0]*yarr[0][0]+yarr[1][1]*yarr[1][0], yarr[1][0]*yarr[0][1]+yarr[1][1]*yarr[1][1]]]
    if n % 2:
        yarr = [[yarr[0][0]*arr[0][0]+yarr[0][1]*arr[1][0], yarr[0][0]*arr[0][1]+yarr[0][1]*arr[1][1]],
                [yarr[1][0]*arr[0][0]+yarr[1][1]*arr[1][0], yarr[1][0]*arr[0][1]+yarr[1][1]*arr[1][1]]]
    return yarr

def matrix(n):
    arr = [[1, 1], [1, 0]]
    f = arrpow(arr, n-1)[0][0]
    return f

num = [10, 100, 1000, 10000, 100000, 1000000]
with open('matrix.dat', 'w') as h:
    for i in num:
        start_time = time.time()
        run = matrix(i)
        h.write(str(math.log10(i)))
        h.write('\n')
        h.write(str(math.log10(time.time()-start_time)))
        h.write('\n')

a = []
b = []
with open('matrix.dat', 'r+') as csvfile:
    plots = csv.reader(csvfile, delimiter=',')
    for row in plots:
        a.append(float(row[0]))
        b.append(float(row[1]))
plt.plot(a, b, label=" ")
row = ['1.0']
So row is a list with one value: row[1] is trying to access the second element of a one-element list. That is why you are getting an error.
When you are constructing matrix.dat, you never write a comma for the CSV reader to use as a separator, so each value ends up on its own line and every parsed row has a single element. Attempting to access the second element throws an error because it doesn't exist.
Solution: replace the first h.write('\n') (the newline written between the two values of a pair) with a comma (,).
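A minimal sketch of the fixed write loop, reusing the question's names:
with open('matrix.dat', 'w') as h:
    for i in num:
        start_time = time.time()
        matrix(i)
        elapsed = time.time() - start_time
        # comma between the two values, newline after the pair,
        # so csv.reader yields two fields per row
        h.write(str(math.log10(i)) + ',' + str(math.log10(elapsed)) + '\n')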
