I'm trying to sort some data into (numpy) arrays and got stuck on a problem.
I have 1000 .dat files and I need to put the data from them into 1000 different arrays. Further, every array should contain data depending on the coordinates [i][j][k]. This part I've done already, and the code looks like this (a kind of "short" version):
with open('177500.dat', newline='') as csvfile:
    f = csv.reader(csvfile, delimiter=' ')
    for row in f:
        <some code which works pretty good>

cV = [[[[] for k in range(kMax)] for j in range(jMax)] for i in range(iMax)]

with open('177500.dat', newline='') as csvfile:
    f = csv.reader(csvfile, delimiter=' ')
    <some code which works also good>
    values = np.array([np.float64(row[i]) for i in range(3, rowLen)])
    cV[int(row[0])][int(row[1])][int(row[2])] = values
After this, I can print cV[i][j][k] and get all the data contained in one .dat file at the coordinates [i][j][k].
And now I need to create cV[i][j][k][n] to get the data from the specific file number n at the coordinates [i][j][k]. And I absolutely don't know how to tell Python to put the data into the "right" place.
I tried some things like this:
for m in range(160000, 182501, 2500):
    with open('output/%d.dat' % m, newline='') as csvfile:
        <bla bla code>
        cV = [[[[[] for k in range(kMax)] for j in range(jMax)] for i in range(iMax)] for n in range(tMax)]
        if len(row) == rowLen:
            values = [np.array([np.float64(row[i]) for i in range(3, rowLen)]) for n in range(tMax)]
            for n in range(tMax):
                cV[int(row[0])][int(row[1])][int(row[2])][int(n)] = values[n]
But this surely didn't work, because Python doesn't know what this [n] after the values is supposed to be.
So, how can I tell Python to put the [i][j][k] data from file no. n into the array cV[i][j][k][n]?
Thanks in advance
C.
P.S. I didn't post the whole code because I don't think it is necessary. All arrays are created properly; the thing that isn't working is getting the data into them.
I think building arrays like this is going to make things more complicated for you. It would be easier to build a dictionary using tuples as keys. In the example file you sent me, each (x, y, z) coordinate set was repeated twice, making me think that each file contains data on two iterations of a total solution of 2000 iterations. Dictionaries must have unique keys, so for each file I have implemented another counter, timestep, that can increment when collating data from a single file.
Now, if I wanted coords (1, 2, 3) on the 3rd timestep, I could do simulation[(1, 2, 3, 3)].
import csv
import numpy as np

'''
Made the assumptions that:
- Each file contains two iterations from a simulation of 2000 iterations.
- Each file is numbered sequentially. Each time the same (x, y, z) coords are
  discovered, it represents the next timestep in the simulation.
Accessing data is via a tuple key (x, y, z, n), with n being the timestep.
'''

simulation = {}
file_count = 1
timestep = 1
num_files = 2

for _ in range(1, num_files + 1):
    with open('sim_file_{}.dat'.format(file_count), 'r') as infile:
        second_read = False
        reader = csv.reader(infile, delimiter=' ')
        for row in reader:
            if not row:
                continue
            item = [float(value) for value in row]
            if (not second_read and not
                    any(simulation.get((item[0], item[1], item[2], timestep), []))):
                timestep += 1
                second_read = True
            simulation[(item[0], item[1], item[2], timestep)] = np.array(item[3:])
    file_count += 1
    timestep += 1
    second_read = False
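A small usage example of the resulting dictionary (an addition, assuming the loop above has already populated simulation and that the coordinate parts of the keys are floats):

# collect every (x, y, z) entry recorded at timestep 2
at_t2 = {key[:3]: data for key, data in simulation.items() if key[3] == 2}
print(at_t2)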
I am new to Python and I have a text file that looks like this:
0 10
0.01 10.0000001
0.02 10.00000113
0.03 10.00000468
0.04 10.0000128
which are the first few values for time and velocity respectively.
I want to read that text file into Python and use those values to create variables for time and velocity in order to find the acceleration.
So far I have:
t = []
v = []
with open('data.txt', 'r') as f:
    for line in f:
        first, second = line.split()
        t.append(first)
        v.append(second)
print(t)
print(v)
And now I am unsure where to go next.
Ideally I want to calculate the acceleration at the corresponding times and then write that into a new text file that has [time, acceleration], looking something like:
0 acceleration_value1
0.01 acceleration_value2
0.02 acceleration_value3
0.03 acceleration_value4
0.04 acceleration_value5
Most of your code is already there; however, it is missing the conversion from string (as read from the file) to float. Besides that, a simple loop should do the job.
t = []
v = []

with open('data.txt', 'r') as f:
    for line in f:
        first, second = line.split()
        t.append(float(first))    # note the 'float' conversion here
        v.append(float(second))

# Now, we will use a list to store the acceleration values.
# The first element in this list will be zero, so:
a = [0]

# To calculate the acceleration, we will do delta(v)/delta(t).
# To calculate any delta, we iterate over each list and pick
# the "current" and the "previous" element. For this reason, our index
# starts at 1, skipping the first (otherwise we would go to -1).
# These steps could be combined, but I feel it helps beginners to see
# the process.
for i in range(1, len(v)):    # since v and t have the same length, just pick either
    delta_v = v[i] - v[i-1]
    delta_t = t[i] - t[i-1]
    acc = delta_v / delta_t
    a.append(acc)

# Now we can print.
# 'zip' combines the two lists as columns of a matrix,
# so "x" picks a value from "t" and "y" from "a" as we iterate.
for x, y in zip(t, a):
    print("%s %s" % (x, y))

# Or save to file:
with open("acceleration.txt", 'w') as f:
    for x, y in zip(t, a):
        f.write("%s %s\n" % (x, y))
This should give a list of the acceleration for each timestamp:
t = []
v = []
with open('data.txt', 'r') as f:
    for line in f:
        first, second = line.split()
        t.append(first)
        v.append(second)

acc = []
for i in range(len(t)):
    if i == 0:
        acc.append(0)
    else:
        acc.append((float(v[i]) - float(v[i-1])) / (float(t[i]) - float(t[i-1])))
print(acc)
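For comparison, here is a vectorized sketch using numpy (an addition, not part of the answers above; it assumes the same two-column data.txt with time and velocity):

import numpy as np

# read the two whitespace-separated columns: time and velocity
t, v = np.loadtxt('data.txt', unpack=True)

# forward differences; prepend 0 so the result lines up with t
a = np.concatenate(([0.0], np.diff(v) / np.diff(t)))

# write [time, acceleration] pairs to a new file
np.savetxt('acceleration.txt', np.column_stack((t, a)), fmt='%g %g')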
I have a text file with car prices and their serial numbers; there are 50 lines in this file. I would like to find the max car price and its serial number for every 10 lines.
priceandserial.txt
102030 4000.30
102040 5000.40
102080 5500.40
102130 4000.30
102140 5000.50
102180 6000.50
102230 2000.60
102240 4000.30
102280 6000.30
102330 9000.70
102340 1000.30
102380 3000.30
102430 4000.80
102440 5000.30
102480 7000.30
When I tried Python's built-in max function, I got 102480 as the max value.
x = np.loadtxt('carserial.txt', unpack=True)
print('Max:', np.max(x))
Desired result:
102330 9000.70
102480 7000.30
There are 50 lines in the file, so I should get a 5-line result with the serial number and max price of each group of 10 lines.
Respectfully, I think the first solution is over-engineered. You don't need numpy or math for this task, just a dictionary. As you loop through, you update the dictionary if the latest value is greater than the current value, and do nothing if it isn't. Every 10th item, you append the values from the dictionary to an output list and reset the buffer.
with open('filename.txt', 'r') as opened_file:
    data = opened_file.read()

rowsplitdata = data.split('\n')
colsplitdata = [u.split(' ') for u in rowsplitdata if u]    # skip empty lines
x = [[int(j[0]), float(j[1])] for j in colsplitdata]

output = []
buffer = {"max": 0, "index": 0}
count = 0

# this assumes x is a list of lists, not a numpy array
for u in x:
    count += 1
    if u[1] > buffer["max"]:
        buffer["max"] = u[1]
        buffer["index"] = u[0]
    if count == 10:
        output.append([buffer["index"], buffer["max"]])
        buffer = {"max": 0, "index": 0}
        count = 0

# append the remainder of the buffer in case you didn't get to ten in the final pass
output.append([buffer["index"], buffer["max"]])
output
[[102330, 9000.7], [102480, 7000.3]]
You should iterate over it and, for every 10 lines, extract the maximum:
import math

# new empty list for collecting the results
max_list = []

# iterate through x in chunks of 10 rows
for i in range(math.ceil(len(x) / 10)):
    start = i * 10
    # append the maximum of the next 10 elements if start+10 does not exceed the length of the array
    if start + 10 < len(x):
        max_list.append(np.max(x[start:start + 10]))
    # if it does exceed it, append the maximum of all the remaining elements
    else:
        max_list.append(np.max(x[start:]))
This should do your job.
number_list = [[], []]
with open('filename.txt', 'r') as opened_file:
    for line in opened_file:
        if len(line.split()) == 0:
            continue
        else:
            a, b = line.split()
            number_list[0].append(int(a))      # serial number
            number_list[1].append(float(b))    # price
col1_max, col2_max = max(number_list[0]), max(number_list[1])
col1_max, col2_max
Just change the filename; col1_max and col2_max hold the respective column's max value. You can edit the code to accommodate more columns, as sketched below.
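For instance, here is a column-agnostic sketch (an assumption: every field in the file is numeric) that uses zip to transpose the rows into columns:

number_list = []
with open('filename.txt', 'r') as opened_file:
    for line in opened_file:
        fields = line.split()
        if fields:
            number_list.append([float(f) for f in fields])

# zip(*rows) transposes the rows into columns; take the max of each column
column_maxes = [max(col) for col in zip(*number_list)]
print(column_maxes)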
You can transpose your input first, then use np.split and, for each submatrix, calculate its max:
x = np.genfromtxt('carserial.txt', unpack=True).T
print(x)

for submatrix in np.split(x, len(x) // 10):
    print(max(submatrix, key=lambda l: l[1]))
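One caveat: np.split raises an error when the row count is not an exact multiple of the number of chunks. A plain slicing loop (an alternative sketch, not from the original answer) avoids that:

import numpy as np

# same two-column file as in the question
x = np.genfromtxt('carserial.txt', unpack=True).T

# slicing handles a row count that is not a multiple of 10
for start in range(0, len(x), 10):
    submatrix = x[start:start + 10]
    print(max(submatrix, key=lambda l: l[1]))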
I have a problem with this code:
import csv

with open('gios-pjp-data.csv', 'r') as data:
    l = []
    reader = csv.reader(data, delimiter=';')
    next(reader)
    next(reader)    # I need to skip 2 lines here and don't know any other way to do it
    l.append(# here is my problem that I will describe below)
The file contains about 350 lines with 4 columns, and each line is built like this:
Date ; float number ; float number ; float number
Something like this:
2017-01-01;56.7;167.2;236.9
Now, I don't know how to build a function that would append the first and the third float number to the list, on the condition that its value is > 200.
Do you have any suggestions?
A list comprehension will do if you don't have too many items in the file.
l = [(float(x[1]), float(x[3])) for x in reader if float(x[1]) > 200]
Or a similar function that would yield each line, if you have a huge number of entries.
def getitems():
    for x in reader:
        if float(x[1]) > 200:
            yield float(x[1]), float(x[3])

l = getitems()  # this is now an iterator, more memory efficient
l = list(l)     # now it's a list
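Putting it together with the file handling from the question (a sketch, assuming the two header lines and the date;float;float;float layout described above):

import csv

with open('gios-pjp-data.csv', 'r', newline='') as data:
    reader = csv.reader(data, delimiter=';')
    next(reader)
    next(reader)  # skip the two header lines
    l = [(float(x[1]), float(x[3])) for x in reader if float(x[1]) > 200]

print(l)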
I am new to Python, and even newer to SO, so I hope I can get this question right.
I have a list of lists (3D) in which every element is a path to a text file containing data, like so (this is just an example):
lists = [[['/home/text01.txt', '/home/text02.txt', '/home/text03.txt'],
['/home/text04.txt', '/home/text05.txt', '/home/text06.txt'],
['/home/text07.txt', '/home/text08.txt', '/home/text09.txt']],
[['/home/text10.txt', '/home/text11.txt', '/home/text12.txt'],
['/home/text13.txt', '/home/text14.txt', '/home/text15.txt'],
['/home/text16.txt', '/home/text17.txt', '/home/text18.txt']],
[['/home/text19.txt', '/home/text20.txt', '/home/text21.txt'],
['/home/text22.txt', '/home/text23.txt', '/home/text24.txt'],
['/home/text25.txt', '/home/text26.txt', '/home/text27.txt']]]
I need to sum the values from these files, based on their index, so I would get:
result1 = text01 + text10 + text19
result2 = text02 + text11 + text20
result3 = ...
So each element of a sublist is summed with the corresponding ones in all sublists.
So far I am able to get the right results if I write one if block for each j:
n = 0
for i, j, k in itertools.product(range(len(lists[0])), range(len(lists[0])), range(len(lists[0][0]))):
    if i == 0 and j == 0:
        f1 = np.loadtxt(lists[i][j][k], comments='#', skiprows=1, usecols=(2, 3, 5, 6))
        f2 = np.loadtxt(lists[i+1][j][k], comments='#', skiprows=1, usecols=(2, 3, 5, 6))
        f3 = np.loadtxt(lists[i+2][j][k], comments='#', skiprows=1, usecols=(2, 3, 5, 6))
        f = f1 + f2 + f3
        np.savetxt(path0 + 'result%i' % n, f, fmt='%f %f %f %f %f')
        n = n + 1
then would follow
if i == 0 and j == 1:
    ...
if i == 0 and j == 2:
    ...
which works fine if I only have a few files.
But I would like to generalize this for a large number of files, and I don't know how to open and sum the files in one loop based on indices so that I get the right result.
I feel like even this version of the code is crude and non-Pythonic, but it worked, and that's all I've got.
Any suggestions on improvements and on how to proceed from here are highly appreciated!
I don't think you can avoid using for loops, since you need to read each .txt file; other than that, something like this should work as expected:
n = 0
# Thinking of your lists as a 3D matrix, start by cycling over the rows
for j in range(len(lists[0])):
    # then over the columns
    for i in range(len(lists[0][0])):
        # initialize f here
        f = 0
        # finally over the depth
        for k in range(len(lists)):
            # update f
            f += np.loadtxt(lists[k][j][i], comments='#', skiprows=1, usecols=(2, 3, 5, 6))
        # save f
        np.savetxt(path0 + 'result%i' % n, f, fmt='%f %f %f %f %f')
        # update n
        n += 1
As a side note, you could concatenate your lists and save one for loop, as sketched below.
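A minimal sketch of that idea (an addition; it flattens each top-level block into one flat list of paths and assumes path0 and lists are defined as in the question):

import numpy as np
from itertools import chain

# each block of paths -> one flat, row-major list of paths
flat_blocks = [list(chain.from_iterable(block)) for block in lists]

# zip pairs up the corresponding files across the blocks
for n, paths in enumerate(zip(*flat_blocks)):
    f = sum(np.loadtxt(p, comments='#', skiprows=1, usecols=(2, 3, 5, 6)) for p in paths)
    np.savetxt(path0 + 'result%i' % n, f, fmt='%f')  # a single '%f' is reused for every column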
Does anyone have a clue how to increase the speed of this part of my Python code?
It was designed to deal with small files (with just a few lines, and for those it is very fast), but I want to run it on big files (~50 GB, with millions of lines).
The main goal of this code is to get strings from a file (.txt), search for them in an input file, and print the number of times they occur to the output file.
Here is the code; infile, seqList and out are set as optparse options at the beginning of the code (not shown):
def novo(infile, seqList, out):
    uDic = dict()
    rDic = dict()
    nmDic = dict()
    with open(infile, 'r') as infile, open(seqList, 'r') as RADlist:
        samples = [line.strip() for line in RADlist]
        lines = [line.strip() for line in infile]
    # Create dictionaries with all the samples
    for i in samples:
        uDic[i.replace(" ", "")] = 0
        rDic[i.replace(" ", "")] = 0
        nmDic[i.replace(" ", "")] = 0
    for k in lines:
        l1 = k.split("\t")
        l2 = l1[0].split(";")
        l3 = l2[0].replace(">", "")
        if len(l1) < 2:
            continue
        if l1[4] == "U":
            for k in uDic.keys():
                if k == l3:
                    uDic[k] += 1
        if l1[4] == "R":
            for j in rDic.keys():
                if j == l3:
                    rDic[j] += 1
        if l1[4] == "NM":
            for h in nmDic.keys():
                if h == l3:
                    nmDic[h] += 1
    f = open(out, "w")
    f.write("Sample"+"\t"+"R"+"\t"+"U"+"\t"+"NM"+"\t"+"TOTAL"+"\t"+"%R"+"\t"+"%U"+"\t"+"%NM"+"\n")
    for i in samples:
        U = int()
        R = int()
        NM = int()
        for k, j in uDic.items():
            if k == i:
                U = j
        for o, p in rDic.items():
            if o == i:
                R = p
        for y, u in nmDic.items():
            if y == i:
                NM = u
        TOTAL = int(U + R + NM)
        try:
            f.write(i+"\t"+str(R)+"\t"+str(U)+"\t"+str(NM)+"\t"+str(TOTAL)+"\t"+str(float(R) / TOTAL)+"\t"+str(float(U) / TOTAL)+"\t"+str(float(NM) / TOTAL)+"\n")
        except:
            continue
    f.close()
With 50 GB files, the question is not how to make it faster, but how to make it runnable at all.
The main problem is that you will run out of memory; you should modify the code to process the files without holding all of their content in memory, keeping only the single line that is currently needed.
The following code from your question reads all the lines from both files:
with open(infile, 'r') as infile, open(seqList, 'r') as RADlist:
    samples = [line.strip() for line in RADlist]
    lines = [line.strip() for line in infile]
    # at this moment you are likely to run out of memory already

# Create dictionaries with all the samples
for i in samples:
    uDic[i.replace(" ", "")] = 0
    rDic[i.replace(" ", "")] = 0
    nmDic[i.replace(" ", "")] = 0

# a similar loop over `lines` comes later on
You should defer reading the lines until the latest possible moment, like this:
# Create dictionaries with all the samples, one line at a time
with open(seqList, 'r') as RADlist:
    for sampleline in RADlist:
        sample = sampleline.strip().replace(" ", "")
        uDic[sample] = 0
        rDic[sample] = 0
        nmDic[sample] = 0
Note: did you want to use line.strip() or line.split()?
This way, you do not have to keep all the content in memory.
There are many more options for optimization, but this one will let you take off and run.
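In the same spirit, here is a sketch of streaming the main input file line by line instead of building the lines list (it assumes the tab-separated layout and the uDic/rDic/nmDic dictionaries from the question):

with open(infile, 'r') as inp:
    for rawline in inp:
        l1 = rawline.strip().split("\t")
        if len(l1) < 5:  # guard before accessing l1[4]
            continue
        l3 = l1[0].split(";")[0].replace(">", "")
        # direct dictionary lookups instead of scanning all keys
        if l1[4] == "U" and l3 in uDic:
            uDic[l3] += 1
        elif l1[4] == "R" and l3 in rDic:
            rDic[l3] += 1
        elif l1[4] == "NM" and l3 in nmDic:
            nmDic[l3] += 1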
It would be much easier if you provided some sample inputs. Because you haven't, I haven't tested this, but the idea is simple: iterate through each file only once, using iterators rather than reading the whole file into memory. Use the efficient collections.Counter object to handle the counting and minimise the inner looping:
def novo(infile, seqList, out):
    from collections import Counter
    import csv

    # Count
    counts = Counter()
    with open(infile, 'r') as infile:
        for line in infile:
            l1 = line.strip().split("\t")
            l2 = l1[0].split(";")
            l3 = l2[0].replace(">", "")
            if len(l1) < 2:
                continue
            counts[(l1[4], l3)] += 1

    # Produce output
    types = ['R', 'U', 'NM']
    with open(seqList, 'r') as RADlist, open(out, 'w') as outfile:
        f = csv.writer(outfile, delimiter='\t')
        f.writerow(['Sample'] + types + ['TOTAL'] + ['%' + t for t in types])
        for sample in RADlist:
            sample = sample.strip()
            countrow = [counts[(t, sample)] for t in types]
            total = sum(countrow)
            percents = [c / total if total else 0 for c in countrow]
            f.writerow([sample] + countrow + [total] + percents)
If you convert your script into functions (it makes profiling easier), you can then see what it does when you profile it. For the profiling itself, I suggest using runsnake (runsnakerun).
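For example, a minimal sketch of profiling the function with the standard library's cProfile (the filenames are placeholders, and novo is assumed to be defined in the running script):

import cProfile
import pstats

# profile one call and dump the statistics to a file
cProfile.run("novo('input.txt', 'samples.txt', 'out.txt')", 'novo.prof')

# show the 10 most time-consuming calls
stats = pstats.Stats('novo.prof')
stats.sort_stats('cumulative').print_stats(10)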
I would try replacing your loops with list & dictionary comprehensions:
For example, instead of
for i in samples:
    uDic[i.replace(" ", "")] = 0

Try:

uDic = {i.replace(" ", ""): 0 for i in samples}
and similarly for the other dicts
I don't really follow what's going on in your "for k in lines" loop, but you only need l3 (and l2) when l1[4] has certain values. Why not check for those values before splitting and replacing? (A sketch of that reordering follows.)
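A sketch of that reordering (assuming lines and the three dictionaries from the question):

counters = {"U": uDic, "R": rDic, "NM": nmDic}
for k in lines:
    l1 = k.split("\t")
    if len(l1) < 5 or l1[4] not in counters:
        continue
    # only rows we care about get the extra splitting and replacing
    l3 = l1[0].split(";")[0].replace(">", "")
    if l3 in counters[l1[4]]:
        counters[l1[4]][l3] += 1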
Lastly, instead of looping through all the keys of a dict to see if a given element is in that dict, try:
if x in myDict:
    myDict[x] = ...
for example:
for k in uDic.keys():
    if k == l3:
        uDic[k] += 1
can be replaced with:
if l3 in uDic:
    uDic[l3] += 1
Other than that, try profiling.
1) Look into profilers and adjust the code that is taking the most time.
2) You could try optimizing some methods with Cython; use the data from the profiler to decide what to modify.
3) It looks like you can use a Counter instead of a dict for the output file, and a set for the input file; look into them.
samples_seen = set()    # don't shadow the built-in name 'set'
from collections import Counter
counter = Counter()  # essentially a modified dict that is optimized for counting,
                     # like counting occurrences of strings in a text file
4) If you are reading 50 GB of data you won't be able to store it all in RAM (I'm assuming, since who knows what kind of computer you have), so generators should save you memory and time.
# change the list comprehensions to generator expressions
samples = (line.strip() for line in RADlist)
lines = (line.strip() for line in infile)