Reading multidimensional array data into Python

I have data in the format of a 10000x500 matrix stored in a .txt file. Within each row, data points are separated by a single whitespace, and each row ends with a newline.
Normally I was able to read this kind of multidimensional array data into Python by using the following snippet of code:
import numpy as np

with open("position.txt") as f:
    data = [line.split() for line in f]

# Get the data and convert to floats
ytemp = np.array(data)
y = ytemp.astype(float)
This code worked until now. When I try to use the exact same code with another set of data formatted in the same way, I get the following error:
setting an array element with a sequence.
When I try to get the 'shape' of ytemp, it gives me the following:
(10001,)
So it reads the rows into a one-dimensional array, but does not split them into columns.
I thought about what other information to include, but nothing came to mind. Basically, I'm trying to convert my data from a .txt file into a multidimensional array in Python. The code worked before, but now, for some reason that is unclear to me, it doesn't. I tried to compare the two data sets; of course they are huge, but everything seems quite similar between the data that works and the data that doesn't.
I would be more than happy to provide any other information you may need. Thanks in advance.

Use NumPy's built-in function:
import numpy

data = numpy.loadtxt('position.txt')
Check out the documentation to explore other available options.
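If loadtxt fails too, the shape (10001,) is a hint: there is one more row than the expected 10000, and at least one row has a different number of fields, which is what forces NumPy to fall back to a 1-D object array. A quick diagnostic sketch (the file name is from the question; the check itself is my suggestion):

# Print the 1-based numbers of all lines whose field count
# differs from that of the first line.
with open("position.txt") as f:
    widths = [len(line.split()) for line in f]
print({i: w for i, w in enumerate(widths, start=1) if w != widths[0]})

np.loadtxt will likewise point at the offending line in its error message.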

Related

Convert .xsf (or .dat) file to array with np.loadtxt Python

I have been checking some guides and also some of the questions posted here, but nothing has worked for me so far. I have an .xsf file whose first 57 lines are general instructions, followed by ca. 3*10^6 numbers. I want to load those numbers into a np.array, and figured that the command
data = np.loadtxt('filename.xsf', skiprows = 57)
would do the trick.
That does not actually work, because the data between lines 58 and 531509 are organised as follows:
0.362077E+02 0.960500E+00 0.600950E+00 0.901461E-01 0.478295E-02 0.710280E-01
whereas the last line contains only one element. The error I get is:
ValueError: Wrong number of columns at line 531510
I then figured I would specify a delimiter (the double space):
data = np.loadtxt('filename.xsf', delimiter =' ' ,skiprows = 57)
but this results in the file not being read at all.
From my understanding, my first attempt produces something which is not an array of floats, but rather an array where every element is a list of floats (taken from each line as a whole). Since the last line of the file is a single number, it does not match the format of the rest of the array. In the second scenario, I am struggling with the definition of the delimiter.
I know this is an often asked and answered question, but none of the methods I tried has worked. I am hence trying to provide as much context as possible about my problem. Thanks to everyone who is willing to contribute.
It took me some time, but I seem to have found an answer to my question, which I am posting in order to get possible corrections:
1- I have converted my file to a csv (probably not necessary)
2-
import itertools

data = []
with open('filename.csv') as f:
    for line in f:
        data.append(line.strip().split(','))
        # this returns a list of lists, each of which is a line from the file
data = list(itertools.chain.from_iterable(data))
# this merges the list of lists into a single list
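A hedged alternative sketch that avoids the CSV conversion entirely: since only the short last line breaks the row structure, one can skip the 57 header lines and read everything that follows as a single flat stream of numbers (the file name and header count are from the question; the rest is my assumption):

import numpy as np

with open('filename.xsf') as f:
    for _ in range(57):  # skip the header lines
        next(f)
    # splitting on any whitespace makes the short last line harmless
    data = np.array(f.read().split(), dtype=float)

Note that the self-answer above still leaves the values as strings; a final conversion such as np.array(data, dtype=float) would be needed to get actual floats.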

Python pd.read_csv: How to read files through a loop?

I would like to read data using pd.read_csv and store it in NumPy ndarrays. I have a set of data files named xN1N1, xN1N2, ..., xN1N50 (the general name format is xN1Ny, for y in range(2,51)). For each of them, I basically would like to run the following operation:
xN1N1 = pd.read_csv("xN1N1.csv")
xN1N1 = xN1N1.to_numpy()
To do this with a for loop (I would like to read and save all the elements in one go), I attempted to define a function that would help, as follows:
def data(id_number):
    x1 = pd.read_csv("'xN1N%d' % id_number.csv")
    return x1
Executing this for y in range(2,51) gives me nothing. I am aware that the syntax is extremely defective, but I cannot correct it.
I would appreciate any help on this.
You can use Python string formatting to solve your problem:
return pd.read_csv("xN1N{}.csv".format(id_number))
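A minimal sketch of the full loop, building on that answer (the dict name arrays and the use of to_numpy() are my choices, not part of the original answer):

import pandas as pd

def data(id_number):
    return pd.read_csv("xN1N{}.csv".format(id_number))

# Read every file and keep the resulting NumPy arrays keyed by y
arrays = {y: data(y).to_numpy() for y in range(2, 51)}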

Arrays of single-element arrays in Python: why?

I came across an oddity when loading a .mat-file created in Matlab into Python with scipy.io.loadmat. I found similar 'array structures' being alluded to in other posts, but none explaining them. Also, I found ways to work around this oddity, but I would like to understand why Python (or scipy.io.loadmat) handles files this way.
Let's say I create a cell in Matlab and save it:
my_data = cell(dim1, dim2);
% Fill my_data with strings and floats...
save('my_data.mat','my_data')
Now I load it into Python:
import scipy.io as sio
data = sio.loadmat('my_data.mat')['my_data']
Now data has type numpy.ndarray and dtype object. When I look at a slice, it might look something like this:
data[0]
>>> array([array(['Some descriptive string'], dtype='<U13'),
array([[3.141592]]), array([[2.71828]]), array([[4.66920]]), etc.
], dtype=object).
Why is this happening? Why does Python/sio.loadmat create an array of single-element arrays, rather than an array of floats (assuming I remove the first column, which contain strings)?
I'm sorry if my question is basic, but I'd really like to understand what seems like an unnecessary complication.
As said in the comments:
This behaviour arises because you are saving a cell, an "array" that can contain anything inside, and you fill it with matrices of size 1x1 (floats).
That is what Python is giving you: an ndarray of dtype=object that contains 1x1 arrays.
Python is doing exactly what MATLAB was doing. For this example, you should simply avoid using cells in MATLAB.
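If you do want plain floats, one option is to unwrap each 1x1 array with .item(). A sketch, assuming the first column holds the strings as in the printed slice above:

import numpy as np

# Drop the string column, then pull the scalar out of each 1x1 array.
numeric = np.array([[cell.item() for cell in row[1:]] for row in data])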

TypeError: list indices must be integers or slices, not tuple

fvecs = []
labels = []
for line in open(filename):
    stats = line.split(',')
    labels.append(int(stats[0]))
    fvecs.append([float(x) for x in stats[5,6,12,27,29,37,39,41]])
I have a big .csv file that I am using as a dataset, containing 43 columns and hundreds of rows, and I am attempting to extract specific columns to be used as individual records, but I can't seem to work this out. The error is caused by the final line of code and produces the error message in the title; the code works perfectly when a slice such as stats[30:38] is used instead.
I have tried storing the required columns in a separate list and indexing with it, like stats[requiredcolumns], but this produces the same error.
I have considered using pandas but this is just a small snippet of code from a much larger program, which all functions correctly, and the implementation of pandas would require a complete overhaul of the full program which is not possible due to time constraints.
Any help would be greatly appreciated
If you have few columns, you can try this:
for line in open(filename):
    stats = line.split(',')
    labels.append(int(stats[0]))
    fvecs.append([float(x) for x in (stats[5], stats[6], stats[12], stats[27], stats[29], stats[37], stats[39], stats[41])])
This code will return a list of lists; otherwise, the first comment is right about indexing and NumPy.
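The same idea can be written with operator.itemgetter, which grabs all the wanted fields in one call (my choice of helper, not part of the answer above):

from operator import itemgetter

# Bundle the wanted column indices into a single callable.
wanted = itemgetter(5, 6, 12, 27, 29, 37, 39, 41)
fvecs.append([float(x) for x in wanted(stats)])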

Python quick data read-in and slice

I've got the following code in Python, and I think I need some help optimizing it.
I'm reading in a few million lines of data, but then throwing most of them away whenever one coordinate per line does not fit my criterion.
The code is as following:
import numpy as np

def loadFargoData(dataname, thlimit):
    temp = np.loadtxt(dataname)
    return temp[np.abs(temp[:, 1]) < thlimit]
I've coded it as if it were C-type code, and of course in Python this is crazy slow.
Can I throw out my temp object somehow? Or what other optimization can the Pythonian population help me with?
The data reader included in pandas might speed up your script: it reads faster than NumPy's. pandas produces a DataFrame object, which is easy to view as a NumPy array (and easy to convert if preferred), so you can apply your condition in NumPy (which already looks efficient enough in your question).
import pandas as pd
import numpy as np

def loadFargoData(dataname, thlimit):
    temp = pd.read_csv(dataname)  # returns a DataFrame
    temp = temp.values            # returns a NumPy array
    # the 2 lines above can be replaced by: temp = pd.read_csv(dataname).values
    return temp[np.abs(temp[:, 1]) < thlimit]
You might want to check pandas' documentation to learn the function arguments you may need to read the file correctly (header, separator, etc.).
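For example, if the file is whitespace-separated and has no header row (both assumptions about the data, not facts from the question), the reader could look like this:

import pandas as pd
import numpy as np

def loadFargoData(dataname, thlimit):
    # sep=r'\s+' splits on any run of whitespace; header=None keeps
    # the first data row from being consumed as column names
    temp = pd.read_csv(dataname, sep=r'\s+', header=None).to_numpy()
    return temp[np.abs(temp[:, 1]) < thlimit]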
