I need to multiply two huge vectors, of shapes (30720, 1) and (1, 30720), which gives a 30720 × 30720 matrix. I am using numpy.dot to multiply them, but it is taking a very long time.
With float64 data, the result is about 7 GB (30720² × 8 bytes), so it does not fit in the RAM of many PCs. But you only have 30720² ≈ 1e9 multiplications to do, which takes a few seconds.
A way to avoid the memory issue is to cut the result into reasonable chunks, each under 1 GB, and save the partial results to files in a binary format for speed, with some timing added to see where the time goes:
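import pickle
import time
from numpy import dot
from numpy.random import rand

n = 3
div = 10240
a = rand(n*div, 1)
b = rand(1, n*div)

def calculate(i, j):
    # Compute one div x div block of the product.
    u = dot(a[i*div:(i+1)*div, :], b[:, j*div:(j+1)*div])
    return u

def save(i, j, u):
    # Write block (i, j) to its own binary file.
    with open('data'+str(i)+str(j)+'.pk', 'wb') as f:
        pickle.dump(u, f)

def timecount(f, args):
    # Run f(*args) and return its result with the elapsed time.
    t0 = time.time()
    res = f(*args)
    return res, time.time()-t0

def multidot():
    tcalc, tsave = 0, 0
    for i in range(n):
        for j in range(n):
            print(i, j)
            u, dt = timecount(calculate, (i, j))
            tcalc += dt
            _, dt = timecount(save, (i, j, u))
            tsave += dt
    print('dot time', tcalc)
    print('save time', tsave)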
Then the run:
In [64]: multidot()
0 0
0 1
0 2
1 0
1 1
1 2
2 0
2 1
2 2
dot time 4.697121858596802
save time 29.11250686645508
So you have no problem with dot, only with memory.
To read your data back, load it chunk by chunk, like this:
with open('data00.pk', 'rb') as f:
    u = pickle.load(f)
Don't forget to delete the data*.pk files after this run; they take about 7 GB on your disk ;)
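As a variation on the pickle files above, the blocks can also be written into a single .npy file through a memory map, so the result can later be opened without loading it all into RAM. This is only a sketch: the function name outer_to_npy, the path argument and the default block size are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from numpy.lib.format import open_memmap

def outer_to_npy(a, b, path, block=1024):
    """Write the product of column a and row b to a .npy file, block by block.

    outer_to_npy, path and block are hypothetical names chosen for this sketch.
    """
    n, m = a.shape[0], b.shape[1]
    # open_memmap creates the .npy file on disk and maps it into memory.
    out = open_memmap(path, mode='w+', dtype=a.dtype, shape=(n, m))
    for i in range(0, n, block):
        for j in range(0, m, block):
            # Only one block of the result is computed at a time.
            out[i:i+block, j:j+block] = np.dot(a[i:i+block, :], b[:, j:j+block])
    out.flush()
    del out  # release the memory map
    return path
```

The file can then be opened lazily with np.load(path, mmap_mode='r') and sliced like a normal array.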
I am trying to run this code on a big file, "mydata.dat", which is 3 GB and has 161991000 lines. The code calculates the distance between two points using DensityPeakCluster. The number of points is 18000.
A sample of the file looks like this:
1 2 26.23
1 3 44.49
1 4 47.17
and so on until
1 18000 23.5
then
2 3 25.2
2 4 15.2
until 2 18000 0.25 and so on until 17999 18000 0.25
The first block of the code is:
from collections import defaultdict

class Graph(defaultdict):
    def __init__(self, input_file, sep=" ", header=False, undirect=True):
        super(Graph, self).__init__(dict)
        self.edges_num = 0
        with open(input_file) as f:
            if header:
                f.readline()
            for line in f:
                line = line.strip().split(sep)
                self[line[0]][line[1]] = float(line[2])
                self.edges_num += 1
                if undirect:
                    self[line[1]][line[0]] = float(line[2])
                    self.edges_num += 1

    def edges(self):
        edges_list = []
        for node1 in self:
            for node2 in self[node1]:
                edges_list.append((node1, node2))
        return edges_list
The second block (the code is too long to post in full):
    def edges_weight(self):
        weight_list = []
        for edge in self.edges():
            node1, node2 = edge
            weight_list.append([node1, node2, self[node1][node2]])
        weight_list = sorted(weight_list, key=lambda x: x[2])
        return weight_list

    def get_weight(self, node1, node2):
        return self[node1][node2]

    def get_weights(self):
        weights = []
        for edge in self.edges():
            weights.append(self.get_weight(edge[0], edge[1]))
        return weights

if __name__ == "__main__":
    input_file = "./data/mydata.dat"
    percent = 2.0
    output_file = "./data/results"
    G = Graph(input_file)
    position = round(G.number_of_edges()*percent/100)
    dc = G.edges_weight()[position][2]
    print("average percentage of neighbours (hard coded): {}".format(percent))
    print("Computing Rho with gaussian kernel of radius: {}".format(dc))
    nodes = G.nodes()
    for i in range(G.number_of_nodes()-1):
        for j in range(i+1, G.number_of_nodes()):
            node_i = nodes[i]
            node_j = nodes[j]
            dist_ij = G.get_weight(node_i, node_j)
What happened:
1- The process got killed, so I tried to change the file reading to:
bigfile = open(input_file, 'r')
tmp_lines = bigfile.readlines(1024*1024)
for line in tmp_lines:
    line = line.strip().split(sep)
    self[line[0]][line[1]] = float(line[2])
    self.edges_num += 1
    if undirect:
        self[line[1]][line[0]] = float(line[2])
        self.edges_num += 1
2- But then I got:
dist_ij = G.get_weight(node_i, node_j) in get_weight
return self[node1][node2]
KeyError: '6336'
3- I tried Google Colab, but it didn't work because its 12 GB of RAM was not enough. I asked about buying more RAM, but the real problem is that I can't manage the code well enough to make it use less RAM. I'm stuck on this problem and don't know what to do.
**1- How should I deal with a file this big? What approach should I use to handle this size?
2- If I use NumPy to load the file, can this decrease memory usage?**
The most straightforward answer is to not load the whole file at once. This can even be done one line at a time. For example, suppose you wanted the sum of the third column:
filename = 'file.dat'
lines = (float(line.split(' ')[2]) for line in open(filename))
print(sum(lines))
Here we did not load all the lines into memory. Instead, we opened a file handle and created a Python generator. The generator holds the expression "float(line.split(' ')[2])" and only evaluates it when the next line is requested. The requests come from sum(), which pulls one line at a time as needed and keeps a running total, so never more than one line is in memory at a time. The point is that the code uses almost no RAM (aside from interpreter and file-buffer overhead), whatever the size of the file.
The same can be done for a subset of the lines. For example, to sum only the lines where either index is 0:
filename = 'file.dat'
lines = (line.split(' ') for line in open(filename))
zeros = (line for line in lines if line[0]=='0' or line[1]=='0')
print(sum(float(c) for a,b,c in zeros))
This can of course be slower than loading some or all of the file into memory. Moreover, you have to consider how many times you want to iterate over the file like this. It is best to iterate over the lines only a few times, gathering all the calculations you want in each pass. You then probably want to save those answers, because iterating over the file again takes more time.
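To illustrate gathering several results in one pass, here is a minimal sketch that computes the count, sum, minimum and maximum of one column while reading each line only once. The function name column_stats and its parameters are hypothetical, and it assumes the same space-separated layout as the sample lines.

```python
def column_stats(lines, col=2, sep=' '):
    """Return (count, total, minimum, maximum) of one column in a single pass.

    column_stats is a hypothetical helper written for this illustration.
    """
    count, total = 0, 0.0
    lo, hi = float('inf'), float('-inf')
    for line in lines:
        value = float(line.split(sep)[col])
        count += 1
        total += value
        lo = min(lo, value)
        hi = max(hi, value)
    return count, total, lo, hi

# Usage, assuming a file named file.dat as in the snippets above:
# with open('file.dat') as f:
#     count, total, lo, hi = column_stats(f)
```

Because the function accepts any iterable of lines, it works on an open file handle without ever holding more than one line in memory.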
In considering loading the file into memory, you need to double check what exactly you want to load and how. For example, do you want to load the values 1 2 in the line 1 2 26.23? If not, then strip those out to take up less memory. For example
import numpy as np
filename = 'file.dat'
values = (float(line.split(' ')[2]) for line in open(filename))
X = np.fromiter(values,dtype='float32',count=161991000)
By specifying the count, we told NumPy exactly how much memory to allocate in advance (instead of having it regrow the array every time it needs more room). With a count of that size and a dtype of float32, we know that this data will take up about 648 MB of RAM (161991000 × 4 bytes). So be careful not to write any operations that duplicate this data. If you write something that makes 5 copies of it, that will eat up RAM quickly.
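For the 18000-point distance file in the question, one option in the same spirit is to store the distances in a dense float32 matrix instead of a nested dictionary; 18000 × 18000 × 4 bytes is about 1.3 GB. The sketch below is an assumption-laden illustration: load_distance_matrix is a hypothetical name, and it assumes the point labels are the integers 1 to n_points, as in the sample lines.

```python
import numpy as np

def load_distance_matrix(lines, n_points):
    """Fill a symmetric (n_points x n_points) float32 matrix from 'i j dist' lines.

    load_distance_matrix is a hypothetical helper; labels are assumed 1-based.
    """
    dist = np.zeros((n_points, n_points), dtype=np.float32)
    for line in lines:
        i, j, d = line.split()
        i, j = int(i) - 1, int(j) - 1  # convert 1-based labels to 0-based indices
        dist[i, j] = dist[j, i] = float(d)
    return dist

# Usage, assuming the file from the question:
# with open('./data/mydata.dat') as f:
#     dist = load_distance_matrix(f, 18000)
```

A lookup then becomes dist[i, j] with integer indices, and the file is still read one line at a time.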
I think this gives you an idea of how to manage memory. :-)
The Problem:
I need a generic approach for the following problem. For one of many files, I have been able to grab a large block of text which takes the form:
Index
1 2 3 4 5 6
eigenvalues: -15.439 -1.127 -0.616 -0.616 -0.397 0.272
1 H 1 s 0.00077 -0.03644 0.03644 0.08129 -0.00540 0.00971
2 H 1 s 0.00894 -0.06056 0.06056 0.06085 0.04012 0.03791
3 N s 0.98804 -0.11806 0.11806 -0.11806 0.15166 0.03098
4 N s 0.09555 0.16636 -0.16636 0.16636 -0.30582 -0.67869
5 N px 0.00318 -0.21790 -0.50442 0.02287 0.27385 0.37400
7 8 9 10 11 12
eigenvalues: 0.373 0.373 1.168 1.168 1.321 1.415
1 H 1 s -0.77268 0.00312 -0.00312 -0.06776 0.06776 0.69619
2 H 1 s -0.52651 -0.03358 0.03358 0.02777 -0.02777 0.78110
3 N s -0.06684 0.06684 -0.06684 -0.01918 0.01918 0.01918
4 N s 0.23960 -0.23960 0.23961 -0.87672 0.87672 0.87672
5 N px 0.01104 -0.52127 -0.24407 -0.67837 -0.35571 -0.01102
13 14 15
eigenvalues: 1.592 1.592 2.588
1 H 1 s 0.01433 0.01433 -0.94568
2 H 1 s -0.18881 -0.18881 1.84419
3 N s 0.00813 0.00813 0.00813
4 N s 0.23298 0.23298 0.23299
5 N px -0.08906 0.12679 -0.01711
The problem is that I need to extract only the coefficients, and I need to be able to reformat the table so that the coefficients can be read in rows, not columns. The resulting array would have the form:
[[0.00077, 0.00894, 0.98804, 0.09555, 0.00318]
[-0.03644, -0.06056, -0.11806, 0.16636, -0.21790]
[0.03644, 0.06056, 0.11806, -0.16636, -0.50442]
[-0.00540, 0.04012, 0.15166, -0.30582, 0.27385]
[0.00971, 0.03791, 0.03098, -0.67869, 0.37400]
[-0.77268, -0.52651, -0.06684, 0.23960, 0.01104]
[0.00312, -0.03358, 0.06684, -0.23960, -0.52127]
...
[0.01433, -0.18881, 0.00813, 0.23298, 0.12679]
[-0.94568, 1.84419, 0.00813, 0.23299, -0.01711]]
This would be manageable for me if it wasn't for the fact that the number of columns changes with different files.
What I have tried:
I had earlier managed to get the eigenvalues by:
eigenvalues = []
with open('text', 'r+') as f:
    for n, line in enumerate(f):
        if (n >= start_section) and (n <= end_section):
            if 'eigenvalues' in line:
                eigenvalues.append(line.split()[1:])
flatten = [item for sublist in eigenvalues for item in sublist]
$ ['-15.439', '-1.127', '-0.616', '-0.616', '-0.397', '0.272', '0.373', '0.373', '1.168', '1.168', '1.321', '1.415', '1.592', '1.592', '2.588']
I attempted several variants of this; in the most recent approach I tried:
dir = {}
with open('text', 'r+') as f:
    for n, line in enumerate(f):
        if (n >= start_section) and (n <= end_section):
            for i in range(1, number_of_coefficients+1):
                if str(i) in line.split()[0]:
                    if line.split()[1].isdigit() == False:
                        if line.split()[3] in ['s', 'px', 'py', 'pz']:
                            dir[str(i)].append(line.split()[4:])
                        else:
                            dir[str(i)].append(line.split()[3:])
This seemed to get me close; however, I got a strange duplication of numbers in random orders.
The idea was that I would then be able to convert the dictionary into the array.
Please HELP!!
EDIT:
The letters in the 3rd and sometimes 4th column are also variable (changing from, s, px, py, pz).
Here's one way to do it. This approach has a few noteworthy aspects.
First -- and this is key -- it processes the data section-by-section rather than line by line. To do that, you have to write some code to read the input lines and then yield them to the rest of the program in meaningful sections. Quite often, this preliminary step will radically simplify a parsing problem.
Second, once we have a section's worth of "rows" of coefficients, the other challenge is to reorient the data -- specifically to transpose it. I figured that someone smarter than I had already figured out a slick way to do this in Python, and StackOverflow did not disappoint.
Third, there are various ways to grab the coefficients from a section of input lines, but this type of fixed-width, report-style data output has a useful characteristic that can help with parsing: everything is vertically aligned. So rather than thinking of a clever way to grab the coefficients, we just grab the columns of interest -- line[20:].
import sys

def get_section(fh):
    # Takes an open file handle.
    # Yields each section of lines having coefficients.
    lines = []
    start = False
    for line in fh:
        if 'eigenvalues' in line:
            start = True
            if lines:
                yield lines
                lines = []
        elif start:
            lines.append(line)
            if 'px' in line:
                start = False
    if lines:
        yield lines

def main():
    coeffs = []
    with open(sys.argv[1]) as fh:
        for sect in get_section(fh):
            # Grab the rows from a section.
            rows = [
                [float(c) for c in line[20:].split()]
                for line in sect
            ]
            # Transpose them. See https://stackoverflow.com/questions/6473679
            transposed = list(map(list, zip(*rows)))
            # Add to the list-of-lists of coefficients.
            coeffs.extend(transposed)
    # Check.
    for cs in coeffs:
        print(cs)

main()
Output:
[0.00077, 0.00894, 0.98804, 0.09555, 0.00318]
[-0.03644, -0.06056, -0.11806, 0.16636, -0.2179]
[0.03644, 0.06056, 0.11806, -0.16636, -0.50442]
[0.08129, 0.06085, -0.11806, 0.16636, 0.02287]
[-0.0054, 0.04012, 0.15166, -0.30582, 0.27385]
[0.00971, 0.03791, 0.03098, -0.67869, 0.374]
[-0.77268, -0.52651, -0.06684, 0.2396, 0.01104]
[0.00312, -0.03358, 0.06684, -0.2396, -0.52127]
[-0.00312, 0.03358, -0.06684, 0.23961, -0.24407]
[-0.06776, 0.02777, -0.01918, -0.87672, -0.67837]
[0.06776, -0.02777, 0.01918, 0.87672, -0.35571]
[0.69619, 0.7811, 0.01918, 0.87672, -0.01102]
[0.01433, -0.18881, 0.00813, 0.23298, -0.08906]
[0.01433, -0.18881, 0.00813, 0.23298, 0.12679]
[-0.94568, 1.84419, 0.00813, 0.23299, -0.01711]
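If the column alignment cannot be relied on (for example after the text has been copied and its spacing collapsed), a token-based variant of the same section-by-section idea may help. This is a sketch under stated assumptions, not the answer's method: parse_coefficients is a hypothetical name, the number of columns is taken from each "eigenvalues:" line, and a coefficient line is recognised by an alphabetic element symbol as its second token.

```python
def parse_coefficients(lines):
    """Return one row of coefficients per eigenvalue column.

    parse_coefficients is a hypothetical helper. It assumes each section
    starts with an 'eigenvalues:' line and that coefficient lines have an
    element symbol (letters only) as their second token.
    """
    coeffs, rows, width = [], [], 0
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] == 'eigenvalues:':
            # New section: transpose and store the previous one.
            coeffs.extend(list(col) for col in zip(*rows))
            rows, width = [], len(tokens) - 1
        elif width and len(tokens) > width and tokens[1].isalpha():
            # The last `width` tokens are the coefficients of this line.
            rows.append([float(t) for t in tokens[-width:]])
    # Store the final section.
    coeffs.extend(list(col) for col in zip(*rows))
    return coeffs
```

Taking the last `width` tokens sidesteps the variable number of label columns (with or without the atom number), and zip(*rows) performs the transpose.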
I am writing a program for which I need up to 100,000 lines of integer pairs from sys.stdin on which to do calculations. My whole program, consisting of reading this input and performing calculations on the integers of each input line has to take a maximum of 1 second. The problem is that, just going through all the lines of input takes way more than 1 second! In the case of 100,000 lines, it takes roughly 10 seconds.
My question is, is this performance to be expected for this amount of lines?
The input is in the format:
100000 5 100000
72324 563
56487 2252
866 19750
65532 69349
96171 56840
70287 14094
76381 14722
48359 38831
74431 12611
29994 66230
92169 20726
39565 38429
59416 2360
45470 40781
...
Where the rightmost integer on the first line indicates the number of lines to come.
To read this input, I'm using the following code:
import time
from sys import stdin, stderr

def read():
    row = stdin.readline().split()
    n, k, q = int(row[0]), int(row[1]), int(row[2])
    start = time.perf_counter()
    for i in range(q):
        line = stdin.readline().split()
        # Do some calculation on the integers of this line...
    end = time.perf_counter()
    print("Reading time: " + str(end-start))

read()
Am I missing something here? The limit of 1 second is due to this being a school project of calculating Q number of distances between two nodes in a K-ary tree.
Thanks in advance.
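One way to rule out per-line overhead when reading this kind of input is to read all of stdin in a single call and split it once. The sketch below is only an illustration of that idea: read_pairs is a hypothetical name, and it assumes the format described above (a header line "n k q" followed by q lines of two integers).

```python
import sys

def read_pairs(stream=None):
    """Read the header and q integer pairs from a binary stream in one call.

    read_pairs is a hypothetical helper; stream defaults to sys.stdin.buffer.
    """
    if stream is None:
        stream = sys.stdin.buffer
    # One read and one split for the whole input.
    tokens = stream.read().split()
    n, k, q = int(tokens[0]), int(tokens[1]), int(tokens[2])
    values = list(map(int, tokens[3:3 + 2 * q]))
    # Pair up consecutive integers: (first, second) of each line.
    pairs = list(zip(values[0::2], values[1::2]))
    return n, k, q, pairs
```

Timing this function separately from the calculations shows whether the time is really spent on reading or on the work done per line.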
"""Some simulations to predict the future portfolio value based on past distribution. x is
a numpy array that contains past returns. The interpolated_returns are the returns
generated from the cdf of the past returns to simulate future returns. The portfolio
starts with a value of 100. portfolio_value is filled up progressively as
the program goes through every loop. The value is multiplied by the returns in that
period and a dollar is removed."""
portfolio_final = []
for i in range(10000):
    portfolio_value = [100]
    rand_values = np.random.rand(600)
    interpolated_returns = np.interp(rand_values, cdf_values, x)
    interpolated_returns = np.add(interpolated_returns, 1)
    for j in range(1, len(interpolated_returns)+1):
        portfolio_value.append(interpolated_returns[j-1]*portfolio_value[j-1])
        portfolio_value[j] = portfolio_value[j]-1
    portfolio_final.append(portfolio_value[-1])
print(np.mean(portfolio_final))
I couldn't find a way to write this code using numpy. I had a look at iterating with nditer, but I was unable to get anywhere with that.
I guess the easiest way to figure out how to vectorize this is to look at the equations that govern the evolution of your portfolio and find patterns in them that can be vectorized, instead of trying to vectorize the code you already have. You would then notice that a cumulative product (cumprod) appears throughout your iterations.
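Written out, the recurrence behind the loop and the closed form it leads to look like this. The symbols are notation introduced here for illustration: V_j is the portfolio value after step j, r_j is one plus the interpolated return of step j, and P_j is the running product of the r_k (with P_0 = 1). The derivation assumes every r_j is nonzero.

```latex
\begin{aligned}
V_0 &= 100, \qquad V_j = r_j V_{j-1} - 1, \qquad P_j = \prod_{k=1}^{j} r_k \\
\frac{V_j}{P_j} &= \frac{V_{j-1}}{P_{j-1}} - \frac{1}{P_j} \\
V_n &= P_n \left( V_0 - \sum_{j=1}^{n} \frac{1}{P_j} \right)
\end{aligned}
```

The second line follows from dividing the recurrence by P_j, and the third from summing it over j = 1, ..., n. P_j is a cumulative product and the sum is a cumulative sum of its reciprocal.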
Nevertheless, you can find the semi-vectorized code below. I included your code as well so that you can compare the results. I also included a simple loop version of your code, which is much easier to read and to translate into mathematical equations. So if you share this code with somebody else, I would definitely use the simple loop option. If you want some fancy-pants vectorizing, you can use the vector version. In case you need to keep track of the individual steps, you can also add an array to the simple loop option and append pv at every step.
Hope that helps.
Edit: I have not tested anything for speed. That's something you can easily do yourself with timeit.
import numpy as np
from scipy.special import erf

# Prepare simple return model - Normal distributed with mu & sigma = 0.01
x = np.linspace(-10, 10, 100)
cdf_values = 0.5*(1+erf((x-0.01)/(0.01*np.sqrt(2))))

# Prepare setup such that every code snippet uses the same number of steps
# and the same random numbers
nSteps = 600
nIterations = 1
rnd = np.random.rand(nSteps)

# Your code - Gives the (supposedly) correct results
portfolio_final = []
for i in range(nIterations):
    portfolio_value = [100]
    rand_values = rnd
    interpolated_returns = np.interp(rand_values, cdf_values, x)
    interpolated_returns = np.add(interpolated_returns, 1)
    for j in range(1, len(interpolated_returns)+1):
        portfolio_value.append(interpolated_returns[j-1]*portfolio_value[j-1])
        portfolio_value[j] = portfolio_value[j]-1
    portfolio_final.append(portfolio_value[-1])
print(np.mean(portfolio_final))

# Using vectors
portfolio_final = []
for i in range(nIterations):
    portfolio_values = np.ones(nSteps)*100.0
    rcp = np.cumprod(np.interp(rnd, cdf_values, x) + 1)
    portfolio_values = rcp * (portfolio_values - np.cumsum(1.0/rcp))
    portfolio_final.append(portfolio_values[-1])
print(np.mean(portfolio_final))

# Simple loop
portfolio_final = []
for i in range(nIterations):
    pv = 100
    rets = np.interp(rnd, cdf_values, x) + 1
    for j in range(nSteps):
        pv = pv * rets[j] - 1
    portfolio_final.append(pv)
print(np.mean(portfolio_final))
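The outer loop over simulations can be removed in the same way by drawing all random numbers as one 2-D array and taking the cumulative product along each row. This is a hedged sketch rather than a tested replacement: simulate_final_values and its parameters are hypothetical names, it uses a seeded np.random.default_rng generator instead of np.random.rand, and it assumes no gross return is exactly zero.

```python
import numpy as np

def simulate_final_values(x, cdf_values, n_sims=10000, n_steps=600,
                          start=100.0, seed=0):
    """Final portfolio value of each simulation, without Python loops.

    simulate_final_values is a hypothetical helper written for this sketch.
    """
    rng = np.random.default_rng(seed)
    u = rng.random((n_sims, n_steps))
    # One row of gross returns (1 + return) per simulation.
    rets = np.interp(u.ravel(), cdf_values, x).reshape(u.shape) + 1.0
    rcp = np.cumprod(rets, axis=1)
    # Closed form of applying pv = pv * ret - 1 for n_steps steps.
    return rcp[:, -1] * (start - np.sum(1.0 / rcp, axis=1))

# Usage: np.mean(simulate_final_values(x, cdf_values)) plays the role of
# np.mean(portfolio_final) in the loops above.
```

With the default sizes the intermediate arrays hold 10000 × 600 float64 values, about 48 MB each, so memory is not a concern here.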
Forget about np.nditer. It does not improve the speed of iterations. Only use it if you intend to go on and use the C version (via Cython).
I'm puzzled about that inner loop. What is it supposed to be doing that is special? Why the loop?
In tests with simulated values, these two blocks of code produce the same thing:
interpolated_returns = np.add(interpolated_returns, 1)
for j in range(1, len(interpolated_returns)+1):
    portfolio_value.append(interpolated_returns[j-1]*portfolio[j-1])
    portfolio_value[j] = portfolio_value[j]-1

interpolated_returns = (interpolated_returns+1)*portfolio - 1
portfolio_value = portfolio_value + interpolated_returns.tolist()
I am assuming that interpolated_returns and portfolio are 1d arrays of the same length.
Dear all,
I am a beginner in Python. I am looking for the best way to do the following in Python: let's assume I have three text files, named A, B, and C, each with m rows and n columns of numbers. For the following, the contents can be indexed as A[i][j], or B[k][l], and so on. I need to compute the average of A[0][0], B[0][0], C[0][0], and write it to file D at D[0][0], and the same for the remaining entries. For instance, let's assume that:
A:
1 2 3
4 5 6
B:
0 1 3
2 4 5
C:
2 5 6
1 1 1
Therefore, file D should be
D:
1 2.67 4
2.33 3.33 4
My actual files are of course larger than these, on the order of a few MB. I am unsure about the best solution: reading all the file contents into a nested structure indexed by filename, or reading each file line by line and computing the mean as I go. After reading the manual, I think the fileinput module is not useful in this case, because it does not read the lines "in parallel", as I need here, but "serially". Any guidance or advice is highly appreciated.
Have a look at numpy. It can read the three files into three arrays (using loadtxt), calculate the average and export it to a text file (using savetxt).
import numpy as np
# loadtxt parses whitespace-separated text into a 2-D float array.
a = np.loadtxt('A.csv')
b = np.loadtxt('B.csv')
c = np.loadtxt('C.csv')
d = (a + b + c) / 3.0
# Write the averages with two decimals, one row per line.
np.savetxt('D.csv', d, fmt='%.2f')
Size of "some MB" should not be a problem.
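The same idea extends to any number of files. The sketch below uses np.loadtxt and np.savetxt, which parse and write whitespace-separated text; average_files and its parameters are hypothetical names, and it assumes all files have the same number of rows and columns.

```python
import numpy as np

def average_files(in_paths, out_path, fmt='%.2f'):
    """Write the element-wise mean of several same-shaped text tables.

    average_files is a hypothetical helper written for this sketch.
    """
    # ndmin=2 keeps single-row files two-dimensional.
    tables = [np.loadtxt(p, ndmin=2) for p in in_paths]
    # Stack into shape (n_files, m, n) and average over the file axis.
    mean = np.mean(np.stack(tables), axis=0)
    np.savetxt(out_path, mean, fmt=fmt)
    return mean

# Usage, with the file names from the question:
# average_files(['A.dat', 'B.dat', 'C.dat'], 'D.dat')
```

Stacking fails with an error if the files differ in shape, which is a useful check that the inputs really line up.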
In case of text files, try this:
def readdat(data, sep=','):
    # Parse text into a list of rows, each row a list of floats.
    rows = []
    for line in data.split('\n'):
        if line.strip():
            rows.append([float(item) for item in line.split(sep)])
    return rows

def formatdat(data, sep=','):
    # Format a list of rows back into text, one row per line.
    lines = []
    for row in data:
        lines.append(sep.join(str(item) for item in row))
    return '\n'.join(lines)
Then use these functions to convert between the text and lists of numbers.
Just for reference, here's how you'd do the same sort of thing without numpy (less elegant, but more flexible):
files = zip(open("A.dat"), open("B.dat"), open("C.dat"))
outfile = open("D.dat", "w")
for rowgrp in files:  # e.g. ("1 2 3\n", "0 1 3\n", "2 5 6\n")
    intsbyfile = [[int(a) for a in row.strip().split()] for row in rowgrp]
    # [[1,2,3], [0,1,3], [2,5,6]]
    intgrps = zip(*intsbyfile)  # [(1,0,2), (2,1,5), (3,3,6)]
    # use float() to ensure we get true division in Python 2.
    averages = [float(sum(intgrp))/len(intgrp) for intgrp in intgrps]
    outfile.write(" ".join(str(a) for a in averages) + "\n")
outfile.close()
In Python 3, zip only reads the files as the lines are needed. In Python 2, if the files are too big to load into memory, use itertools.izip instead.