I'm new to Python and Stack Overflow and I'm working on a project that deals with manually created arrays of different lengths.
import glob
import numpy as np
import pandas as pd

path = '/home/Documents/Noise'
files = glob.glob(path + '/*.txt')
data_noise = []
for file in files:
    df = pd.read_csv(file, delimiter=',', header=None)
    df = df.values
    m, n = df.shape
    df = np.reshape(df, m)
    data_noise.append(df)
I create a list data_noise to store NumPy arrays, and each array has a different length m. I want to select a subarray from each array so that they all have the same length, say, 100. But instead of selecting the first 100 or the last 100 elements of each array, I want to select 100 evenly spaced elements from each array.
For example, for an array of length 300 I need the elements indexed 0, 3, 6, 9, ..., and for an array of length 500 I need the elements indexed 0, 5, 10, 15, ...
How do I modify my code to do that?
It's not NumPy-specific, but in general Python this should work:
distance = m // 100
sub_list = df[0::distance]
Provided you add some checks and possibly rounding.
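If the array lengths are not exact multiples of 100 and you want exactly 100 elements from each array, here is a small NumPy sketch that computes evenly spaced indices with np.linspace (the helper name evenly_spaced and the toy arrays are just illustrative):
import numpy as np

def evenly_spaced(arr, k=100):
    # pick k approximately evenly spaced indices between 0 and len(arr) - 1
    idx = np.linspace(0, len(arr) - 1, k).astype(int)
    return arr[idx]

# toy stand-ins for the arrays collected in data_noise
data_noise = [np.arange(300), np.arange(500)]
subsampled = [evenly_spaced(arr) for arr in data_noise]
print([s.shape for s in subsampled])  # [(100,), (100,)]
For a length-300 array this picks indices close to 0, 3, 6, ... and for a length-500 array close to 0, 5, 10, ..., matching the pattern in the question.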
I currently have an edge array of dimension (n_edges, 2) containing node pairs described as [NodeID1, NodeID2], where both IDs are integers. I need to efficiently enumerate these NodeIDs so that I can represent them as indices in an adjacency matrix. My current approach is to extract the sorted set of unique NodeIDs, map them to the integers 0 through the number of distinct nodes minus one, and then replace the entries using pandas.DataFrame.replace(mapping). Here is an example of what I am doing:
import numpy as np
import pandas as pd
a = np.random.randint(0, 100000000, (40000000, 2))
df = pd.DataFrame(a)
unique_values = np.unique(a)
mapping = dict(zip(unique_values, np.arange(len(unique_values))))
df.replace(mapping)
I have also tried defining a function which applies this map and vectorizing it with NumPy, but it is still quite slow. Any ideas as to how I can implement this more efficiently?
Turns out np.unique has an option to return the indices of the original numbers in the unique array; you just need to reshape the result.
u, indices = np.unique(a, return_inverse=True)
b = indices.reshape(a.shape)
This runs in about 20 seconds on your example.
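As a quick sanity check (a minimal sketch on a smaller toy array), the remapped indices reproduce the original NodeIDs when used to index back into the unique values:
import numpy as np

a = np.random.randint(0, 100000000, (1000, 2))  # smaller toy example
u, indices = np.unique(a, return_inverse=True)
b = indices.reshape(a.shape)

# u[b] maps the new 0..n-1 labels back to the original NodeIDs
assert np.array_equal(u[b], a)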
I am trying to get better at using NumPy functions and methods to make my Python programs run faster.
I want to do the following:
I create an array 'a' as:
a=np.random.randint(-10,11,10000).reshape(-1,10)
a.shape: (1000,10)
I create another array which takes only the first two columns in array a
b=a[:,0:2]
b.shape: (1000,2)
Now I want to create an array 'c' which has 990 rows, each containing a flattened slice of 10 consecutive rows of array 'b'. So the first row of array 'c' will have 20 columns, which is the flattened slice of rows 0:10 of array 'b'. The next row of array 'c' will have the 20 columns of flattened rows 1:11 of array 'b', and so on.
I can do this with a for loop, but I want to know if there is a much faster way to do it using NumPy functions and methods like strides or something else.
Thanks for your time and your help.
This loops over shifts rather than rows (loop of size 10):
N = 10
c = np.hstack([b[i:i-N] for i in range(N)])
Explanation: b[i:i-N] selects b's rows from i to m-(N-i) (excluding m-(N-i) itself), where m is the number of rows in b. Then np.hstack stacks those selected sub-arrays horizontally (it stacks b[0:m-10], b[1:m-9], b[2:m-8], ..., b[9:m-1]), as the question describes.
c.shape: (990, 20)
Also I think you may be looking for a shape of (991, 20) if you want to include all windows.
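In case you do want all 991 windows with the same hstack idea, here is a minimal sketch (rebuilding the question's arrays so it runs standalone):
import numpy as np

a = np.random.randint(-10, 11, 10000).reshape(-1, 10)
b = a[:, 0:2]

N = 10
m = len(b)
# each slice now has m - N + 1 = 991 rows, so the last window is included
c_full = np.hstack([b[i:m - N + 1 + i] for i in range(N)])
print(c_full.shape)  # (991, 20)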
You can also use strides, but if you want to do operations on the result I would advise against that, since memory handling is tricky with strided views. Here is a strides-based solution if you insist:
from skimage.util.shape import view_as_windows
c = view_as_windows(b, (10,2)).reshape(-1, 20)
c.shape: (991, 20)
If you don't want the last row, simply remove it by calling c[:-1].
A similar solution is possible with NumPy's as_strided function (the two operate in basically the same way; I'm not sure of their internals).
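If you are on NumPy 1.20 or newer, here is a sketch of the same windowing with the built-in np.lib.stride_tricks.sliding_window_view, which wraps as_strided in a safer, read-only view:
import numpy as np

a = np.random.randint(-10, 11, 10000).reshape(-1, 10)
b = a[:, 0:2]

# returns a read-only view of shape (991, 1, 10, 2)
windows = np.lib.stride_tricks.sliding_window_view(b, (10, 2))
c = windows.reshape(-1, 20)  # reshaping copies the data into a (991, 20) array
print(c.shape)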
UPDATE: if you want to find unique values and their frequencies in each row of c you can do:
unique_values = []
unique_counts = []
for row in c:
    unique, unique_c = np.unique(row, return_counts=True)
    unique_values.append(unique)
    unique_counts.append(unique_c)
Note that NumPy arrays have to be rectangular, meaning the number of elements in each row (along each dimension) must be the same. Since different rows of c can have different numbers of unique values, you cannot create a single NumPy array for the unique values of each row (an alternative would be a structured NumPy array). Therefore, the solution is to make a list of arrays, each containing the unique values of one row of c. unique_values is a list of arrays of unique values, and unique_counts holds their frequencies in the same order.
I have a 1d numpy array of 530 numbers, which I created like so: np.array([i for i in range(530)]). So the shape of this 1d array is (530,). I also have a 2d array which is an array of 530 lists, where each list contains 100 elements. To be clear, the shape of this 2d array is (530, 100).
>>>indices = np.array([i for i in range(530)])
>>>print(test_data.shape)
(530,100)
Using these two arrays, indices and test_data, what I want to do is create a pandas DataFrame with only 2 columns, where the first column is indices (1 integer per row) and the second column is a single list (length 100) from test_data. The sequential nature of each array should be maintained, so the first int in indices corresponds to the first length-100 list in test_data.
I tried using zip with these two arrays, and then creating a dataframe but it doesn't work.
Setup
i = np.arange(530) # first column
j = np.random.randn(530, 100).tolist() # second column
Option 1
Initialise a DataFrame
df = pd.DataFrame([i, j]).T
Option 2
Initialise a Series (you don't even need i for this)
df = pd.Series(j).reset_index()
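A minimal alternative sketch that builds the two-column frame directly from a dict (the column names 'index' and 'data' are just placeholders):
import numpy as np
import pandas as pd

i = np.arange(530)                      # first column: one integer per row
j = np.random.randn(530, 100).tolist()  # second column: one length-100 list per row

df = pd.DataFrame({'index': i, 'data': j})
print(df.shape)  # (530, 2)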
I have a 2d array with dimensions array[x][9] - x because it's read from a file of varying length. I want to find the sum of each column of the array, but for 24 rows at a time, and put the results into a new array; the equivalent of sum(array2[0:24]) but for a 2d array. Is there a special syntax I just don't know about, or do I have to do it manually? I know if it was a 1d array I could iterate through it by doing
for x in range(len(array2) // 24):
    total.append(sum(array2[x * 24:(x + 1) * 24]))  # so I get an array of the sums
What is the equivalent for a 2d array, doing it column by column? I can imagine doing it by storing each column in its own separate 1d array and then finding the sums, or with a mess of for and while loops, neither of which sounds even slightly elegant.
It sounds like you perhaps are working with time series data, with a file containing hourly values and you want a daily sum (hence the 24). The pandas library will do this really nicely:
Suppose you have your data in data.csv:
import pandas
df = pandas.read_csv('data.csv')
If one of your columns was a timestamp, you could use that, but if you only have raw data, you can create a time index:
df.index = pandas.date_range(pandas.Timestamp.today().date(),
                             periods=df.shape[0], freq='H')
Now the summing of all columns on daily basis is very easy:
daily = df.resample('D').apply(sum)
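A self-contained toy run of the same approach, assuming synthetic hourly data (48 rows of 9 columns, so two daily sums):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(48, 9))
df.index = pd.date_range('2024-01-01', periods=len(df), freq='H')
daily = df.resample('D').sum()
print(daily.shape)  # (2, 9)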
You can use zip to transpose your array and use a comprehension to sum each column separately:
>>> array = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
>>> [sum(a) for a in zip(*array)]
[111, 222, 333]
Please try this:
import math

x = len(a)  # x is the number of rows in a
step = 24
# get the number of chunks you need
n = int(math.ceil(float(x) / step))
# transpose each chunk with zip(*...) and sum each column
new_a = [[sum(col) for col in zip(*a[i * step:(i + 1) * step])]
         for i in range(n)]
If x is not a multiple of 24, then the last row in new_a will hold the sum of the remaining rows (of which there will be fewer than 24).
This also assumes that the values in a are numbers, so I have not done any conversions.
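If the data is already in a NumPy array and you are happy to drop any leftover rows, here is a minimal reshape-based sketch of the same chunked column sum (arr is a toy stand-in):
import numpy as np

arr = np.random.rand(50, 9)  # toy data: 50 rows, 9 columns
n = len(arr) // 24
# reshape the first n*24 rows to (n, 24, 9) and sum over the 24-row axis
chunk_sums = arr[:n * 24].reshape(n, 24, -1).sum(axis=1)
print(chunk_sums.shape)  # (2, 9)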
I have a text file with 93 columns and 1699 rows that I have imported into Python. The first three columns do not contain data that is necessary for what I'm currently trying to do. Within each column, I need to divide each element (i.e., each row) of the column by all of the other elements (rows) in that same column. The result I want is an array of 90 elements, where each element contains 1699 sub-arrays of 1699 values each.
A more detailed description of what I'm attempting: I begin with Column 3. At Column 3, Row 1 is divided by all the rows (including the value in Row 1) within Column 3. That gives Row 1 1699 results. The same is then done for Row 2 and so on until Row 1699, which gives Column 3 1699x1699 results. When the calculations for all the rows in Column 3 have completed, the program moves on to do the same thing for Column 4, and so on. This is done for all 90 columns, which means that in the end I should have 90x1699x1699 results.
My code currently is:
import numpy as np
from glob import glob

fnames = glob("NIR_data.txt")
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
NIR_band = NIR_values.T
C_values = []
for i in range(3, len(NIR_band)):
    for j in range(0, len(NIR_band[3])):
        loop_list = NIR_band[i][j] / NIR_band[i, :]
        C_values.append(loop_list)
What it produces is an array of dimension 1699x1699, where each individual array holds the results of one row's calculations. Another complaint is that the code takes ages to run. So I have two questions: is it possible to create the type of array I'd like to work with? And is there a faster way of coding this calculation?
Dividing each of the numbers in a given column by each of the other values in the same column can be accomplished in one operation as follows.
result = a[:, numpy.newaxis, :] / a[numpy.newaxis, :, :]
Because looping over the elements happens in the optimized binary depths of numpy, this is as fast as Python is ever going to get for this operation.
If a.shape was [1699,90] to begin with, then the result will have shape [1699,1699,90]. Assuming dtype=float64, that means you will need nearly 2 GB of memory available to store the result.
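A minimal end-to-end sketch with toy data (a stand-in of shape (50, 93) instead of the real (1699, 93), so it runs quickly), dropping the first three columns first:
import numpy as np

data = np.random.rand(50, 93)  # stand-in for the loaded file
a = data[:, 3:]                # drop the first three columns -> (50, 90)

# broadcasting: result[i, j, k] = a[i, k] / a[j, k]
result = a[:, np.newaxis, :] / a[np.newaxis, :, :]
print(result.shape)            # (50, 50, 90); (1699, 1699, 90) for the real data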
First let's focus on the load:
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
Your text talks about loading a file and manipulating it, but this clip loads multiple files and joins them.
My first change is to collect the arrays in a list, not another array:
alist = [np.loadtxt(f, skiprows=1) for f in fnames]
If you want to skip some columns, look at using the usecols parameter. That may save you work later.
The elements of alist will now be 2d arrays (of floats). If they are matching sizes (N,M), they can be joined in various ways. If there are n files, then
arrays = np.array(alist) # (n,N,M) array
arrays = np.concatenate(alist, axis=0) # (n*N, M) array
# similarly for axis=1
Your code does the same, but potentially confuses steps:
In [566]: arrays = np.array([np.ones((3,4)) for i in range(5)])
In [567]: arrays.shape
Out[567]: (5, 3, 4) # (n,N,M) array
In [568]: NIR_values = np.concatenate(arrays)
In [569]: NIR_values.shape
Out[569]: (15, 4) # (n*N, M) array
NIR_band is now (4,15), and its len() is .shape[0], the size of the 1st dimension. len(NIR_band[3]) is .shape[1], the size of the 2nd dimension.
You could skip the columns of NIR_values with NIR_values[:,3:].
I get lost in the rest of text description.
The NIR_band[i][j]/NIR_band[i,:], I would rewrite as NIR_band[i,j]/NIR_band[i,:]. What's the purpose of that?
As for your subject line, Storing multiple arrays within multiple arrays within an array - that sounds like making a 3d or 4d array. arrays is 3d, NIR_values is 2d.
Creating a (90,1699,1699) from a (93,1699) will probably involve (without iteration) a calculation analogous to:
In [574]: X = np.arange(13*4).reshape(13,4)
In [575]: X.shape
Out[575]: (13, 4)
In [576]: (X[3:,:,None]+X[3:,None,:]).shape
Out[576]: (10, 4, 4)
The last dimension is expanded with None (np.newaxis), and the 2 versions are broadcast against each other. np.outer does the multiplication version of this calculation.
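The division analogue of that broadcast, on the same toy shapes and oriented so the column axis comes first (as in the (90, 1699, 1699) layout the question asks for), would look roughly like this:
import numpy as np

# toy analogue of NIR_band: 13 "columns" by 4 samples, nonzero to keep the division safe
X = np.arange(1, 13 * 4 + 1).reshape(13, 4).astype(float)

# result[c, i, j] = X[3 + c, i] / X[3 + c, j]
result = X[3:, :, None] / X[3:, None, :]
print(result.shape)  # (10, 4, 4), analogous to (90, 1699, 1699)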