numpy loadtxt not resulting in array - python

I seem to be hitting a wall with a simple problem. I'm trying to read an array from a file. The columns are a mix of integers and strings; I'm only interested in columns 0, 2, and 3.
import numpy as np
network = np.loadtxt('temp.biflows', skiprows=1, usecols=(0, 2, 3), delimiter='\t', dtype=[('ts', 'i8'), ('sndr', '|S15'), ('recr', '|S15')])
print(network.shape)
A sample of the input file; columns are separated by tabs (\t):
1441087368 1441087365 186.251.68.208 186.251.68.145 17 137 137 3 0 150 0
1441087342 1441087341 125.144.214.126 125.144.195.105 17 137 137 2 0 100 0
1441087370 1441087370 186.251.139.178 170.85.175.203 17 35905 161 2 2 760 850
There are actually 30104 lines. The resulting shape of network is (30104,). What I'm looking for is for network to be an array with shape (30104, 3).
FWIW my goal is to sort the lines based on the first column (a timestamp).
Any suggestions as to what I might be doing wrong would be greatly appreciated (as well as suggestions for how to do the sort).

You can't create a NumPy array with shape (n, 3) where each column has a different type. What you can create (and what you did when you used loadtxt with dtype=[('ts','i10'), ('sndr','|S14'), ('recr', '|S14')]) is a structured array, where each element of the array is a structure composed of several fields. In your case, you have three fields: one is an integer and two are strings. The array created by loadtxt is one-dimensional; each of its elements is a structure with three fields. You can access the fields (which you can interpret as "columns") as network['ts'], network['sndr'] and network['recr'].
See http://docs.scipy.org/doc/numpy/user/basics.rec.html for more information. There is probably a lot of related information here on SO, too. For example, Access Columns of Numpy Array? Errors Trying to Do by Transpose or by Column Access
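Here's a minimal, self-contained sketch of that pattern (the sample rows and field widths below are assumptions, and 'i8' replaces the invalid 'i10'): load the three columns into a structured array, then sort on the timestamp field with np.sort and its order argument.

```python
import io
import numpy as np

# Two sample rows standing in for temp.biflows (header line + tab-separated data)
sample = (
    "ts_a\tts_b\tsndr\trecr\tproto\n"
    "1441087368\t1441087365\t186.251.68.208\t186.251.68.145\t17\n"
    "1441087342\t1441087341\t125.144.214.126\t125.144.195.105\t17\n"
)

# 'i8' is a valid integer dtype ('i10' is not); 'U15' fits a full IPv4 string
dt = [('ts', 'i8'), ('sndr', 'U15'), ('recr', 'U15')]
network = np.loadtxt(io.StringIO(sample), skiprows=1, usecols=(0, 2, 3),
                     delimiter='\t', dtype=dt)

print(network.shape)         # 1-D structured array: (2,)
print(network['sndr'])       # a "column", accessed by field name

# Sorting the whole array by the timestamp field
network_sorted = np.sort(network, order='ts')
print(network_sorted['ts'])  # [1441087342 1441087368]
```

The same np.sort(network, order='ts') call answers the sorting part of the question directly, keeping each row's fields together.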

Related

Printing a Python Array

I have the array below, which represents a matrix of 20 columns x 10 rows.
What I am trying to do is get the value in the third position after I provide the column and row values. For example, if I type in the values 3 and 0, I expect to get 183 as the answer. I used the print command as follows: print(matrix[3][0][I don't know]), but I either get an out-of-range error or undesirable results.
I also organized the data as matrix[[[0],[0],[180]], [[1],[0],[181]], [[2],[0],[182]],... without much success.
I have the matrix data in a CSV file, so I can format it accordingly if the problem is the way I am presenting the data.
Can someone please take a look at this code and point me in the right direction? Thanks
matrix = [
['0','0','180'],
['1','0','181'],
['2','0','182'],
['3','0','183'],
['4','0','184'],
['5','0','185'],
['6','0','186'],
['7','0','187'],
['18','0','198']]
print(matrix[?][?][value])
Your matrix here is 9 x 3.
If you want the 185, it's in the 6th row, 3rd column, so the indexes are 5 and 2 respectively.
matrix[5][2] will give you the result; there's no need for a third bracket.
Basically, to access an element you write matrix[rowNumber][colNumber]. The first bracket gives you whatever is at that position of the outer array (a 2-D array is just an array of arrays), so you get a 1-D array with 3 elements; the second bracket then indexes into that 1-D array.
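If the real goal is "given a column value and a row value, return the stored number", a plain linear search over the triples does it. Here's a sketch with a hypothetical lookup helper, assuming each entry is [col, row, value] stored as strings, as in the question:

```python
matrix = [
    ['0', '0', '180'],
    ['1', '0', '181'],
    ['2', '0', '182'],
    ['3', '0', '183'],
    ['18', '0', '198'],
]

def lookup(matrix, col, row):
    """Return the third value of the entry whose first two values match."""
    for entry in matrix:  # entry is ['col', 'row', 'value']
        if int(entry[0]) == col and int(entry[1]) == row:
            return int(entry[2])
    raise KeyError((col, row))

print(lookup(matrix, 3, 0))  # 183
```

This searches by the stored coordinate values rather than by list position, so it works even when rows are missing or out of order.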

Given a sparse CSR matrix MS, how do I iterate through each row of MS?

I have a sparse matrix called MS, derived from a dense 5000x5000 matrix, and I want to iterate through each of its rows (as a list or something else) so that I can perform other steps on it, e.g. finding the total count of each row.
I have tried looking at several APIs online, but I'm still new to this and have trouble interpreting/understanding them.
Is there any way I can iterate through each row of my sparse CSR matrix MS in Python? Thank you in advance for any help/input.
You can convert it back to a dense NumPy array first:
import numpy
new_MS = MS.toarray()
This gives a 2-D array with the same rows and columns as the original matrix. (Note that numpy.asarray(MS) does not work here: it wraps the sparse matrix object itself in a 0-d array instead of expanding it.)
Then you can iterate through the rows as you would with any 2-D array:
for row in new_MS:
    print(row.sum())
A 5000x5000 dense copy is still manageable, but if memory is a concern you can iterate without densifying: slice the sparse matrix one row at a time (MS[i]), or use the raw MS.indptr and MS.data buffers directly.
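As a sketch using scipy.sparse (which the question implies): row slicing is the simplest route, and for speed you can read the CSR buffers directly, since MS.indptr marks where each row's stored values begin and end inside MS.data.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[1, 0, 2],
                  [0, 0, 3],
                  [4, 5, 0]])
MS = csr_matrix(dense)

# Option 1: slice one sparse row at a time
row_sums = [MS[i].sum() for i in range(MS.shape[0])]

# Option 2: walk the raw CSR buffers -- data[indptr[i]:indptr[i+1]]
# holds exactly row i's stored (nonzero) values, no slicing overhead
row_sums_fast = [MS.data[MS.indptr[i]:MS.indptr[i + 1]].sum()
                 for i in range(MS.shape[0])]

print(row_sums)  # [3, 3, 9]
```

Option 2 only visits stored values, so for a 5000x5000 matrix that is mostly zeros it avoids both the dense copy and per-row object creation.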

Convert genfromtxt array to regular numpy array

I can't post the data being imported, because it's too much. But it has both number and string fields and is 5543 rows by 137 columns. I import the data with this code (ndnames and ndtypes hold the column names and column datatypes):
npArray2 = np.genfromtxt(fileName,
delimiter="|",
skip_header=1,
dtype=(ndtypes),
names=ndnames,
usecols=np.arange(0,137)
)
This works and the resulting variable type is "void7520" with size (5543,). But this is really a 1D array of 5543 rows, where each element holds a sub-array that has 137 elements. I want to convert this into a normal numpy array of 5543 rows and 137 columns. How can this be done?
I have tried the following (using Pandas):
pdArray = pd.read_csv(fileName,
sep=ndelimiter,
index_col=False,
skiprows=1,
names=ndnames
)
npArray = pd.DataFrame.as_matrix(pdArray)
But, the resulting npArray is type Object with size (5543,137) which, at first, looks promising. But, because it's type Object, there are other functions that can't be performed on it. Can this Object array be converted into a normal numpy array?
Edit:
ndtypes look like...
[int,int,...,int,'|U50',int,...,int,'|U50',int,...,int]
That is, 135 number fields with two string-type fields in the middle somewhere.
npArray2 is a 1d structured array, 5543 elements and 137 fields.
What does npArray2.dtype look like, or equivalently what is ndtypes, because the dtype is built from the types and names that you provided. "void7520" is a way of identifying a record of this array, but tells us little except the size (in bytes?).
If all fields of the dtype are numeric, even better yet if they are all the same numeric dtype (int, float), then it is fairly easy to convert it to a 2d array with 137 columns (2nd dim). astype and view can be used.
(edit - it has both number and string fields - you can't convert it to a 2d array of numbers; it could be an array of strings, but you can't do numeric math on strings.)
But if the dtypes are mixed then you can't convert it. All elements of a 2-D array have to be the same dtype. You have to use the structured array approach if you want mixed types. (Well, there is dtype=object, but let's not go there.)
Actually, pandas is going the object route. Evidently it thinks the only way to make an array from this data is to let each element be its own type. And the math of object arrays is severely limited. They are, in effect, a glorified (or debased) list.
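To make the homogeneous case from the answer concrete, here's a short sketch (the field names are made up): when every field shares one numeric dtype, view reinterprets the same memory as a plain 2-D array.

```python
import numpy as np

# A structured array whose fields are all 'f8' -- the easy case
rec = np.array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
               dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])

# Same bytes, reinterpreted as plain floats, then reshaped to (rows, fields)
flat = rec.view('f8').reshape(rec.shape[0], -1)
print(flat.shape)  # (2, 3)
```

With the question's mixed int/string fields this view would raise an error, which is exactly the answer's point: the conversion only exists when the fields agree.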

How to convert a 2D array to a structured array using view (numpy)?

I am having some problems assigning fields to an array using the view method. Apparently, there doesn't seem to be a control of how you want to assign the field.
a=array([[1,2],[1,2],[1,2]]) # 3x2 matrix
#array([[1, 2],
# [1, 2],
# [1, 2]])
aa=a.transpose() # 2x3 matrix
#array([[1, 1, 1],
# [2, 2, 2]])
a.view(dtype='i8,i8') # This works
a.view(dtype='i8,i8,i8') # This returns error ValueError: new type not compatible with array.
aa.view(dtype='i8,i8') # This works
aa.view(dtype='i8,i8,i8') # This returns error ValueError: new type not compatible with array.
In fact, if I create aa from scratch instead of using transpose of a,
b=array([[1,1,1],[2,2,2]])
b.view(dtype='i8,i8') # This returns ValueError again.
b.view(dtype='i8,i8,i8') # This works
Why does this happen? Is there any way I can set the fields to represent rows or columns?
When you create a standard array in NumPy, some contiguous blocks of memory are occupied by the data. The size of each block depends on the dtype, the number and organization of these blocks by the shape of your array. Structured arrays follow the same pattern, except that each block is now composed of several sub-blocks, each sub-block occupying some space as defined by the corresponding dtype of the field.
In your example, you define a (3, 2) array of ints (a). That's 2 int blocks for the first row, followed by 2 more for the second and 2 last blocks for the third. If you want to transform it into a structured array, you can either keep the original layout (each block becomes its own single-int field: a.view(dtype=[('f0', int)])), or group each 2-block row into one larger block consisting of 2 int-sized sub-blocks.
That's what happens when you do a.view(dtype=[('f0', int), ('f1', int)]).
You can't make larger blocks (i.e. dtype='i8,i8,i8'), as the corresponding information would be spread across different rows.
Now, you can display your array in a different way, for example column by column: that's what happens when you take a .transpose() of your array. It's only a different view, though ('views' in NumPy lingo); it doesn't change the original memory layout. So in your aa example, the original layout is still "3 rows of 2 integers", which you can represent as "3 rows of one block of 2 integers".
In your second example, b = array([[1,1,1],[2,2,2]]), you have a different layout: 2 rows of 3 int blocks. You can group the 3 int blocks of each row into one larger block (dtype='i8,i8,i8') because you're not crossing a row boundary. You can't group them two by two, because that would leave an extra block on each row.
In short, you can transform an (N, M) standard array only into (1) an N-element structured array of M fields or (2) an NxM structured array of 1 field, and that's it. The (N, M) shape is the one given to the array at its creation. You can display the array as (M, N) via a transposition, but that doesn't modify the original memory layout.
When you specify the view as b.view(dtype='i8,i8'), you are asking NumPy to reinterpret the values as a set of two-value tuples, but each row holds 3 values, which isn't a multiple of two. It's like a reshape that would produce a matrix of a different total size; NumPy doesn't allow such things.
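A short demonstration of the layout rules above (the array is built with dtype='i8' explicitly so the shapes hold on any platform, since the default int is narrower on some systems):

```python
import numpy as np

a = np.array([[1, 2], [1, 2], [1, 2]], dtype='i8')  # (3, 2), C-contiguous

# Each contiguous 2-int row becomes one 2-field record:
rec = a.view(dtype='i8,i8')
print(rec.shape)             # (3, 1)

# A transpose is only another view of the same memory, so the 3-int
# "rows" of aa are not contiguous and the same trick raises ValueError:
aa = a.transpose()
# aa.view(dtype='i8,i8,i8')  # ValueError: new type not compatible

# Copying into row-major order first makes the layout match the dtype:
rec2 = np.ascontiguousarray(aa).view(dtype='i8,i8,i8')
print(rec2.shape)            # (2, 1)
```

The np.ascontiguousarray copy is the price of regrouping across the transpose: the view itself never moves bytes, so the bytes have to be rearranged first.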

Programmatically add column names to numpy ndarray

I'm trying to add column names to a numpy ndarray, then select columns by their names. But it doesn't work. I can't tell if the problem occurs when I add the names, or later when I try to call them.
Here's my code.
data = np.genfromtxt(csv_file, delimiter=',', dtype=np.float, skip_header=1)
#Add headers
csv_names = [s.strip('"') for s in open(csv_file, 'r').readline().strip().split(',')]
data = data.astype(np.dtype( [(n, 'float64') for n in csv_names] ))
Dimension-based diagnostics match what I expect:
print len(csv_names)
>> 108
print data.shape
>> (1652, 108)
"print data.dtype.names" also returns the expected output.
But when I start calling columns by their field names, screwy things happen. The "column" is still an array with 108 columns...
print data["EDUC"].shape
>> (1652, 108)
... and it appears to contain more missing values than there are rows in the data set.
print np.sum(np.isnan(data["EDUC"]))
>> 27976
Any idea what's going wrong here? Adding headers should be a trivial operation, but I've been fighting this bug for hours. Help!
The problem is that you are thinking in terms of spreadsheet-like arrays, whereas NumPy does use different concepts.
Here is what you must know about NumPy:
NumPy arrays only contain elements of a single type.
If you need spreadsheet-like "columns", this type must be some tuple-like type. Such arrays are called Structured Arrays, because their elements are structures (i.e. tuples).
In your case, NumPy would thus take your 2-dimensional regular array and produce a one-dimensional array whose type is a 108-element tuple (the spreadsheet array that you are thinking of is 2-dimensional).
These choices were probably made for efficiency reasons: all the elements of an array have the same type and therefore have the same size: they can be accessed, at a low-level, very simply and quickly.
Now, as user545424 showed, there is a simple NumPy answer to what you want to do (genfromtxt() accepts a names argument with column names).
If you want to convert your array from a regular NumPy ndarray to a structured array, you can do:
data.view(dtype=[(n, 'float64') for n in csv_names]).reshape(len(data))
(you were close: you used astype() instead of view()).
You can also check the answers to quite a few Stackoverflow questions, including Converting a 2D numpy array to a structured array and how to convert regular numpy array to record array?.
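A minimal sketch of that view + reshape route (the column names here are made up):

```python
import numpy as np

csv_names = ['age', 'height', 'weight']  # hypothetical headers
data = np.arange(12, dtype='float64').reshape(4, 3)

# Reinterpret each 3-float row as one named record, then flatten to 1-D
named = data.view(dtype=[(n, 'float64') for n in csv_names]).reshape(len(data))

print(named.shape)      # (4,)
print(named['height'])  # the second column: [ 1.  4.  7. 10.]
```

Because view reuses the same memory, this is free: no data is copied, and named['height'] is just a strided window onto the original columns.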
Unfortunately, I don't know what is going on when you try to add the field names, but I do know that you can build the array you want directly from the file via
data = np.genfromtxt(csv_file, delimiter=',', names=True)
EDIT:
It seems like adding field names only works when the input is a list of tuples:
data = np.array([tuple(row) for row in data], dtype=[(n, 'float64') for n in csv_names])
