According to this post, I should be able to access the names of columns in an ndarray as a.dtype.names
However, if I convert a pandas DataFrame to an ndarray with df.as_matrix() or df.values, then the dtype.names field is None. Additionally, if I try to assign column names to the ndarray
X = pd.DataFrame(dict(age=[40., 50., 60.], sys_blood_pressure=[140.,150.,160.]))
print X
print type(X.as_matrix()) # <type 'numpy.ndarray'>
print type(X.as_matrix()[0]) # <type 'numpy.ndarray'>
m = X.as_matrix()
m.dtype.names = list(X.columns)
I get
ValueError: there are no fields defined
UPDATE:
I'm particularly interested in the cases where the matrix only needs to hold a single type (it is an ndarray of a specific numeric type), since I'd also like to use Cython for optimization. (I suspect numpy record arrays and structured arrays are more difficult to deal with since they're more freely typed.)
Really, I'd just like to maintain the column-name metadata for arrays passed through a deep tree of scikit-learn predictors. Their .fit(X, y) and .predict(X) interface doesn't permit passing additional metadata about the column labels outside of the X and y objects.
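To illustrate the underlying issue (a minimal sketch, separate from my actual code): dtype.names exists only on structured dtypes, so a homogeneous float matrix has nothing to assign them to, hence the ValueError above.
import numpy as np

plain = np.zeros((3, 2), dtype=np.float64)
plain.dtype.names        # None -- a homogeneous array has no fields to name

structured = np.zeros(3, dtype=[('age', 'f8'), ('sys_blood_pressure', 'f8')])
structured.dtype.names   # ('age', 'sys_blood_pressure')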
Consider a DF as shown below:
X = pd.DataFrame(dict(one=['Strawberry', 'Fields', 'Forever'], two=[1,2,3]))
X
Provide a list of tuples as data input to the structured array:
arr_ip = [tuple(i) for i in X.as_matrix()]
Ordered list of field names:
dtyp = np.dtype(list(zip(X.dtypes.index, X.dtypes)))
Here, X.dtypes.index gives you the column names and X.dtypes their corresponding dtypes; these are zipped into a list of tuples and fed to np.dtype to construct the structured dtype.
arr = np.array(arr_ip, dtype=dtyp)
gives:
arr
# array([('Strawberry', 1), ('Fields', 2), ('Forever', 3)],
# dtype=[('one', 'O'), ('two', '<i8')])
and
arr.dtype.names
# ('one', 'two')
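Should you need to round-trip later, pd.DataFrame accepts a structured array directly; a quick sketch (df_back is just an illustrative name):
import pandas as pd

df_back = pd.DataFrame(arr)    # recovers the column names from the structured dtype
df_back.columns.tolist()       # ['one', 'two']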
Pandas DataFrames also have a handy to_records method. Demo:
X = pd.DataFrame(dict(age=[40., 50., 60.],
sys_blood_pressure=[140.,150.,160.]))
m = X.to_records(index=False)
print repr(m)
Returns:
rec.array([(40.0, 140.0), (50.0, 150.0), (60.0, 160.0)],
dtype=[('age', '<f8'), ('sys_blood_pressure', '<f8')])
This is a "record array", which is an ndarray subclass that allows field access using attributes, e.g. m.age in addition to m['age'].
You can pass this to a cython function as a regular float array by constructing a view:
m_float = m.view(float).reshape(m.shape + (-1,))
print repr(m_float)
Which gives:
rec.array([[ 40., 140.],
[ 50., 150.],
[ 60., 160.]],
dtype=float64)
Note that in order for this to work, the original DataFrame must have a float dtype for every column. To make sure of that, use m = X.astype(float, copy=False).to_records(index=False).
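One detail worth knowing (a small sketch of my own, under the same all-float assumption): because .view(float) returns a view and the records are contiguous, the reshaped float matrix shares its buffer with the record array, so writes through either are visible in both.
m_float[0, 0] = 41.0
m['age'][0]   # 41.0 -- the view shares the record array's buffer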
Yet more methods of converting a pandas.DataFrame to numpy.array while preserving label/column names
This is mainly for demonstrating how to set dtype/column_dtypes, because sometimes a data source iterator's output will need some pre-normalization.
Method one inserts by column into a zeroed array of predefined height and is loosely based on a "Creating Structured Arrays" guide that a bit of web-crawling turned up:
import numpy
def to_tensor(dataframe, columns = [], dtypes = {}):
    # Use all columns from data frame if none were listed when called
if len(columns) <= 0:
        columns = list(dataframe.columns)  # numpy's dtype dict wants a plain list of names
# Build list of dtypes to use, updating from any `dtypes` passed when called
dtype_list = []
for column in columns:
if column not in dtypes.keys():
dtype_list.append(dataframe[column].dtype)
else:
dtype_list.append(dtypes[column])
# Build dictionary with lists of column names and formatting in the same order
dtype_dict = {
'names': columns,
'formats': dtype_list
}
    # Initialize _mostly_ empty numpy array with column names and formatting
numpy_buffer = numpy.zeros(
shape = len(dataframe),
dtype = dtype_dict)
    # Insert values from dataframe columns into the numpy fields
for column in columns:
numpy_buffer[column] = dataframe[column].to_numpy()
# Return results of conversion
return numpy_buffer
Method two is based on user7138814's answer and will likely be more efficient, as it is basically a wrapper around the built-in to_records method available to pandas.DataFrames:
def to_tensor(dataframe, columns = [], dtypes = {}, index = False):
to_records_kwargs = {'index': index}
if not columns: # Default to all `dataframe.columns`
columns = dataframe.columns
if dtypes: # Pull in modifications only for dtypes listed in `columns`
to_records_kwargs['column_dtypes'] = {}
for column in dtypes.keys():
if column in columns:
to_records_kwargs['column_dtypes'].update({column: dtypes.get(column)})
return dataframe[columns].to_records(**to_records_kwargs)
With either of the above one could do...
X = pandas.DataFrame(dict(age = [40., 50., 60.], sys_blood_pressure = [140., 150., 160.]))
# Example of overwriting dtype for a column
X_tensor = to_tensor(X, dtypes = {'age': 'int32'})
print("Ages -> {0}".format(X_tensor['age']))
print("SBPs -> {0}".format(X_tensor['sys_blood_pressure']))
... which should output...
Ages -> [40 50 60]
SBPs -> [140. 150. 160.]
... and a full dump of X_tensor should look like the following.
rec.array([(40, 140.), (50, 150.), (60, 160.)],
dtype=[('age', '<i4'), ('sys_blood_pressure', '<f8')])
Some thoughts
While method two will likely be more efficient than the first, method one (with some modifications) may be more useful for merging two or more pandas.DataFrames into one numpy.array.
Additionally (after swinging back through to review), method two as written will likely face-plant with errors about to_records_kwargs if dtypes is not defined; next time I'm feeling Pythonic I may resolve that with an else condition.
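To make the merging thought above concrete, a rough sketch (the left/right frames are hypothetical; both must have the same number of rows and distinct column names):
import numpy
import pandas

left = pandas.DataFrame(dict(age=[40., 50., 60.]))
right = pandas.DataFrame(dict(sys_blood_pressure=[140., 150., 160.]))

# One zeroed structured buffer spanning the columns of both frames
names = list(left.columns) + list(right.columns)
formats = [left[c].dtype for c in left.columns] + [right[c].dtype for c in right.columns]
merged = numpy.zeros(len(left), dtype={'names': names, 'formats': formats})
for frame in (left, right):
    for column in frame.columns:
        merged[column] = frame[column].to_numpy()

merged.dtype.names   # ('age', 'sys_blood_pressure')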
Create an example:
import pandas
import numpy
PandasTable = pandas.DataFrame( {
"AAA": [4, 5, 6, 7],
"BBB": [10, 20, 30, 40],
"CCC": [100, 50, -30, -50],
"DDD": ['asdf1', 'asdf2', 'asdf3', 'asdf4'] } )
Solve the problem noting that we are creating something called a "structured numpy array":
NumpyDtypes = list( PandasTable.dtypes.items() )
NumpyTable = PandasTable.to_numpy(copy=True)
NumpyTableRows = [ tuple(Row) for Row in NumpyTable]
NumpyTableWithHeaders = numpy.array( NumpyTableRows, dtype=NumpyDtypes )
Rewrite the solution in 1 line of code:
NumpyTableWithHeaders2 = numpy.array( [ tuple(Row) for Row in PandasTable.to_numpy(copy=True)], dtype=list( PandasTable.dtypes.items() ) )
Print out the solution results:
print ('NumpyTableWithHeaders', NumpyTableWithHeaders)
print ('NumpyTableWithHeaders.dtype', NumpyTableWithHeaders.dtype)
print ('NumpyTableWithHeaders2', NumpyTableWithHeaders2)
print ('NumpyTableWithHeaders2.dtype', NumpyTableWithHeaders2.dtype)
NumpyTableWithHeaders [(4, 10, 100, 'asdf1') (5, 20, 50, 'asdf2') (6, 30, -30, 'asdf3')
(7, 40, -50, 'asdf4')]
NumpyTableWithHeaders.dtype [('AAA', '<i8'), ('BBB', '<i8'), ('CCC', '<i8'), ('DDD', 'O')]
NumpyTableWithHeaders2 [(4, 10, 100, 'asdf1') (5, 20, 50, 'asdf2') (6, 30, -30, 'asdf3')
(7, 40, -50, 'asdf4')]
NumpyTableWithHeaders2.dtype [('AAA', '<i8'), ('BBB', '<i8'), ('CCC', '<i8'), ('DDD', 'O')]
Documentation I had to read
Adding row/column headers to NumPy arrays
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html
How to keep column names when converting from pandas to numpy
https://numpy.org/doc/stable/user/basics.creation.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html
https://docs.scipy.org/doc/numpy-1.10.1/user/basics.rec.html
Notes and thoughts:
Pandas should add a flag in their 'to_numpy' function which does this.
The documentation for recent NumPy versions should be updated to cover structured arrays, which behave differently from regular ones.
OK, here's where I'm leaning:
class NDArrayWithColumns(np.ndarray):
def __new__(cls, obj, columns=None):
obj = obj.view(cls)
obj.columns = columns
return obj
def __array_finalize__(self, obj):
if obj is None: return
self.columns = getattr(obj, 'columns', None)
    @staticmethod
def from_dataframe(df):
cols = tuple(df.columns)
arr = df.as_matrix(cols)
return NDArrayWithColumns.from_array(arr,cols)
    @staticmethod
def from_array(array,columns):
if isinstance(array,NDArrayWithColumns):
return array
return NDArrayWithColumns(array,tuple(columns))
def __str__(self):
sup = np.ndarray.__str__(self)
if self.columns:
header = ", ".join(self.columns)
header = "# " + header + "\n"
return header+sup
return sup
NAN = float("nan")
X = pd.DataFrame(dict(age=[40., NAN, 60.], sys_blood_pressure=[140.,150.,160.]))
arr = NDArrayWithColumns.from_dataframe(X)
print arr
print arr.columns
print arr.dtype
Gives:
# age, sys_blood_pressure
[[ 40. 140.]
[ nan 150.]
[ 60. 160.]]
('age', 'sys_blood_pressure')
float64
and can also be passed to a typed Cython function expecting an ndarray[2, double_t].
UPDATE: this works pretty well, except for some oddness when passing the type to ufuncs.
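For the record, a small sketch of the kind of ufunc oddness meant here (my illustration, not part of the original code): __array_finalize__ copies columns onto every derived array, even when the result's axes no longer line up with the original columns.
doubled = arr * 2                   # ufunc output comes back as NDArrayWithColumns
type(doubled).__name__              # NDArrayWithColumns
doubled.columns                     # ('age', 'sys_blood_pressure') -- still accurate

row_means = arr.mean(axis=1)        # one value per row; column labels no longer apply
getattr(row_means, 'columns', None) # may still report ('age', 'sys_blood_pressure')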
Related
I am trying to do the following:
I downloaded a csv file containing my banking transactions of the last 180 days.
I want to read in this csv file and then do some plots with the data.
For that I set up a program that reads the csv file and makes the data available through keywords.
e.g. in the csv file there is a column "Buchungstag".
I replace that with the date keyword etc.
import numpy as np
import matplotlib.pylab as mpl
import csv
class finanz():
def __init__(self):
path = "/home/***/"
self.dataFileName = path + "test.csv"
self.data_read = open(self.dataFileName, 'r')
self._columns = {}
self._columns[0] = ["date", "Buchungstag", "", "S15"]
self._columns[1] = ["value", "Umsatz", "Euro", "f8"]
self._ident = {"Buchungstag":"date", "Umsatz in {0}":"value"}
self.base = 1205.30
self._readData()
def _readData(self):
r = csv.DictReader(self.data_read, delimiter=';')
dtype = map(lambda x: (self._columns[x][0],self._columns[x][3]),range(len(self._columns)))
self.data = np.recarray((2), dtype=dtype)
desiredKeys = map(lambda x:x, self._ident.iterkeys())
for i, x in enumerate(r):
for k in desiredKeys:
if k == "Umsatz in {0}":
v = np.float(x[k].replace(",", "."))+self.base
else:
v = x[k]
self.data[self._ident[k]][i] = v
def getAllData(self):
return self.data.copy()
a = finanz()
b = a.getAllData()
print type(b)
print type(b['value']),type(b['date'])
Sample data
"Buchungstag";"Wertstellung (Valuta)";"Vorgang";"Buchungstext";"Umsatz in {0}";
"02.06.2015";"02.06.2015";"Lastschrift/Belast.";"Auftraggeber: abc";"-3,75";
My question now is why type(b['date']) is <class 'numpy.core.records.recarray'> while type(b['value']) is <type 'numpy.ndarray'>?
And my second question would be how to "save" the date in a format that I can use with matplotlib?
The third and final question is how I can check how many rows the csv file has (needed to create the empty self.data array).
Thx!
Repeating your array generation without the extra code:
In [230]: dt=np.dtype([('date', 'S15'), ('value', '<f8')])
In [231]: data=np.recarray((2,),dtype=dt)
In [232]: type(data['date'])
Out[232]: numpy.core.records.recarray
In [233]: type(data['value'])
Out[233]: numpy.ndarray
The fact that one field is returned as ndarray and the other as recarray isn't significant; it's just how the recarray class is set up.
Now we mostly use 'structured arrays', created for example with
data1=np.empty((2,),dtype=dt)
or filled with '0's:
data1 = np.zeros((2,), dtype=dt)
# array([('', 0.0), ('', 0.0)],
#       dtype=[('date', 'S15'), ('value', '<f8')])
With this, both data1['date'] and data1['value'] are ndarray. recarray is the old version, and still compatible, but structured arrays are more consistent in their syntax and behavior. There are lots of SO questions about structured arrays, many produced by np.genfromtxt applied to csv files like yours.
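For reference, a genfromtxt sketch on a simplified semicolon file (my simplification: no quoting and decimal points instead of commas, since genfromtxt has no built-in handling for quoted decimal commas; assumes a modern numpy with the encoding parameter):
from io import StringIO
import numpy as np

txt = StringIO("date;value\n2015-06-02;-3.75\n2015-06-03;10.00\n")
arr = np.genfromtxt(txt, delimiter=';', names=True, dtype=None, encoding=None)
arr.dtype.names     # ('date', 'value')
type(arr['date'])   # <class 'numpy.ndarray'> for every field of a structured array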
I could combine this idea, plus my comment (about list appends):
def _readData(self):
r = csv.DictReader(self.data_read, delimiter=';')
    if self._columns[0][1].endswith('tag'):
        self._columns[0][3] = 'datetime64[D]'  # index 3 holds the dtype string
dtype = map(lambda x: (self._columns[x][0],self._columns[x][3]),range(len(self._columns)))
desiredKeys = map(lambda x:x, self._ident.iterkeys())
data = []
for x in r:
aline = np.zeros((1,), dtype=dtype)
for k in desiredKeys:
            if k == "Umsatz in {0}":
                v = np.float(x[k].replace(",", ".")) + self.base
            else:
                v = x[k]
                v1 = v.split('.')
                if len(v1) == 3:  # convert dd.mm.yyyy date to yyyy-mm-dd format
                    v = '%s-%s-%s' % (v1[2], v1[1], v1[0])
            aline[self._ident[k]] = v
data.append(aline)
self.data = np.concatenate(data)
producing a b like:
array([(datetime.date(2015, 6, 2), 1201.55),
(datetime.date(2015, 6, 2), 1201.55),
(datetime.date(2015, 6, 2), 1201.55)],
dtype=[('date', '<M8[D]'), ('value', '<f8')])
I believe genfromtxt collects each row as a tuple, and creates the array at the end. The docs for structured arrays show that they can be constructed from
np.array([(item1, item2), (item3, item4),...], dtype=dtype)
I chose to construct an array for each line, and concatenate them at the end because that required fewer changes to your code.
I also changed that function so it converts the 'tag' column to np.datetime64 dtype. There are a number of SO questions about using that dtype. I believe it can be used in matplotlib, though I don't have experience with that.
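As a rough illustration of the matplotlib point (my assumption: a reasonably recent matplotlib, which accepts datetime64 x-values directly):
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(b['date'], b['value'])   # b is the structured array built above
fig.autofmt_xdate()              # slant the date labels so they stay readable
plt.show()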
Setup:
pdf = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
pdf['a'][2:]=pdf['a'][0]
pdf['a'][:2]=pdf['a'][1]
pdf.set_index(['a','b'])
output:
                          c         d         e
a        b
0.439502 0.115087  0.832546  0.760513  0.776555
         0.609107  0.247642  0.031650  0.727773
0.995370 0.299640  0.053523  0.565753  0.857235
         0.392132  0.832560  0.774653  0.213692
Each data series is grouped by the index ID a, and b represents a time index for the other features of a. Is there a way to get pandas to produce a numpy 3d array that reflects the a groupings? Currently it reads the data as two-dimensional, so pdf.shape outputs (4, 5). What I would like is for the array to be of the variable form:
array([[[-1.38655912, -0.90145951, -0.95106951, 0.76570984],
[-0.21004144, -2.66498267, -0.29255182, 1.43411576],
[-0.21004144, -2.66498267, -0.29255182, 1.43411576]],
[[ 0.0768149 , -0.7566995 , -2.57770951, 0.70834656],
[-0.99097395, -0.81592084, -1.21075386, 0.12361382]]])
Is there a native Pandas way to do this? Note that the number of rows per a grouping in the actual data is variable, so I cannot simply transpose or reshape pdf.values. If there isn't a native way, what's the best method for iteratively constructing the arrays from hundreds of thousands of rows and hundreds of columns?
I just had an extremely similar problem and solved it like this:
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)))
output:
array([[[ 0.47780308, 0.93422319, 0.00526572, 0.41645868, 0.82089215],
[ 0.47780308, 0.15372096, 0.20948369, 0.76354447, 0.27743855]],
[[ 0.75146799, 0.39133973, 0.25182206, 0.78088926, 0.30276705],
[ 0.75146799, 0.42182369, 0.01166461, 0.00936464, 0.53208731]]])
verifying it is 3d, a3d.shape gives (2, 2, 5).
Lastly, to make the newly created dimension the last dimension (instead of the first), use:
a3d = np.dstack(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)))
which has a shape of (2, 5, 2)
For cases where the data is ragged (as brought up by CharlesG in the comments) you can use something like the following if you want to stick to a numpy solution. But be aware that the best strategy to deal with missing data varies from case to case. In this example we simply add zeros for the missing rows.
Example setup with ragged shape:
pdf = pd.DataFrame(np.random.rand(5,5), columns = list('abcde'))
pdf['a'][2:]=pdf['a'][0]
pdf['a'][:2]=pdf['a'][1]
pdf.set_index(['a','b'])
dataframe:
                          c         d         e
a        b
0.460013 0.577535  0.299304  0.617103  0.378887
         0.167907  0.244972  0.615077  0.311497
0.318823 0.640575  0.768187  0.652760  0.822311
         0.424744  0.958405  0.659617  0.998765
         0.077048  0.407182  0.758903  0.273737
One possible solution:
n_max = pdf.groupby('a').size().max()
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)
.apply(lambda x: np.pad(x, ((0, n_max-len(x)), (0, 0)), 'constant'))))
a3d.shape gives (2, 3, 5)
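If padded zeros could be mistaken for real measurements, the same np.pad call can inject NaN instead; a variant sketch of the solution above (the same deprecated as_matrix caveat applies):
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)
    .apply(lambda x: np.pad(x, ((0, n_max-len(x)), (0, 0)),
                            'constant', constant_values=np.nan))))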
as_matrix is deprecated. Here we assume the first key is a, and groups in a may have different lengths; this method solves both problems:
import pandas as pd
import numpy as np
from typing import List
def make_cube(df: pd.DataFrame, idx_cols: List[str]) -> np.ndarray:
"""Make an array cube from a Dataframe
Args:
df: Dataframe
idx_cols: columns defining the dimensions of the cube
Returns:
multi-dimensional array
"""
assert len(set(idx_cols) & set(df.columns)) == len(idx_cols), 'idx_cols must be subset of columns'
df = df.set_index(keys=idx_cols) # don't overwrite a parameter, thus copy!
    idx_dims = [max(level) + 1 for level in df.index.levels]  # levels hold 0-based integer indices
idx_dims.append(len(df.columns))
cube = np.empty(idx_dims)
cube.fill(np.nan)
cube[tuple(np.array(df.index.to_list()).T)] = df.values
return cube
Test:
pdf = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
pdf['a'][2:]=pdf['a'][0]
pdf['a'][:2]=pdf['a'][1]
# a, b must be integer
pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
.assign(b=lambda df: df.groupby(['a'])['b'].cumcount())
)
make_cube(pdf1, ['a', 'b']).shape
gives: (2, 2, 3)
pdf = pd.DataFrame(np.random.rand(5,5), columns = list('abcde'))
pdf['a'][2:]=pdf['a'][0]
pdf['a'][:2]=pdf['a'][1]
pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
.assign(b=lambda df: df.groupby(['a'])['b'].cumcount())
)
make_cube(pdf1, ['a', 'b']).shape
gives (2, 3, 3).
panel.values
will return a numpy array directly. This will by necessity be the highest acceptable dtype, as everything is smushed into a single 3-d numpy array. It will be a new array, not a view of the pandas data (no matter the dtype).
Instead of the deprecated .as_matrix (or alternatively .values), the pandas documentation recommends using .to_numpy():
'Warning: We recommend using DataFrame.to_numpy() instead.'
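Note, though, that to_numpy() alone drops the labels; a quick sketch contrasting it with to_records:
X = pd.DataFrame(dict(age=[40., 50., 60.], sys_blood_pressure=[140., 150., 160.]))
X.to_numpy().dtype.names                # None -- labels gone
X.to_records(index=False).dtype.names   # ('age', 'sys_blood_pressure')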
I have a nested list with different list sizes and types.
def read(f,tree,objects):
Event=[]
for o in objects:
#find different features of one class
temp=[i.GetName() for i in tree.GetListOfBranches() if i.GetName().startswith(o)]
tempList=[] #contains one class of objects
for t in temp:
#print t
tempList.append(t)
comp=np.asarray(getattr(tree,t))
tempList.append(comp)
Event.append(tempList)
return Event
def main():
path="path/to/file"
objects= ['TauJet', 'Jet', 'Electron', 'Muon', 'Photon', 'Tracks', 'ETmis', 'CaloTower']
f=ROOT.TFile(path)
tree=f.Get("RecoTree")
tree.GetEntry(100)
event=read(f,tree,objects)
for example result of event[0] is
['TauJet', array(1), 'TauJet_E', array([ 31.24074173]), 'TauJet_Px', array([-28.27997971]), 'TauJet_Py', array([-13.18042469]), 'TauJet_Pz', array([-1.08304048]), 'TauJet_Eta', array([-0.03470514]), 'TauJet_Phi', array([-2.70545626]), 'TauJet_PT', array([ 31.20065498]), 'TauJet_Charge', array([ 1.]), 'TauJet_NTracks', array([3]), 'TauJet_EHoverEE', array([ 1745.89221191]), 'TauJet_size', array(1)]
how can I convert it into numpy array?
NOTE 1: np.asarray(event, "object") is slow. I am looking for a better way. Also, np.fromiter() is not applicable since I don't have a fixed type.
NOTE 2: I don't know the length of my Events.
NOTE 3: I can also get rid of the names if it makes things easier.
You could try something like this; I'm not sure how fast it's going to be though. This creates a numpy record array for the first row.
data = event[0]
keys = data[0::2]
vals = data[1::2]
#there are some zero-rank arrays in there, so need to check for those,
#but I think just recasting them to a np.float should work.
temp = [np.float(v) for v in vals]
#you could also just create a np array from the line above with np.array(temp)
dtype={"names":keys, "formats":("f4")*len(vals)}
myArr = np.rec.fromarrays(temp, dtype=dtype)
#test it out
In [53]: data["TauJet_Pz"]
Out[53]: array(-1.0830404758453369, dtype=float32)
#alternatively, you could try something like this, which just creates a 2d numpy array
vals = np.array([[np.float(v) for v in row[1::2]] for row in event])
#now create a nice record array from that using the dtypes above
myRecordArray = np.rec.fromarrays(vals, dtype=dtype)
Let's say I have a dumb text file with the contents:
Year Recon Observed
1505 162.38 23
1506 46.14 -9999
1507 147.49 -9999
-9999 is used to denote a missing value (don't ask).
So, I should be able to read this into a Numpy array with:
import numpy as np
x = np.genfromtxt("file.txt", dtype = None, names = True, missing_values = -9999)
And have all my little -9999s turn into numpy.nan. But, I get:
>>> x
array([(1505, 162.38, 23), (1506, 46.14, -9999), (1507, 147.49, -9999)],
dtype=[('Year', '<i8'), ('Recon', '<f8'), ('Observed', '<i8')])
... That's not right...
Am I missing something?
Nope, you're not doing anything wrong. Using the missing_values argument indeed tells np.genfromtxt that the corresponding values should be flagged as "missing/invalid". The problem is that dealing with missing values is only supported if you use the usemask=True argument (I probably should have made that clearer in the documentation, my bad).
With usemask=True, the output is a masked array. You can transform it into a regular ndarray with the missing values replaced by np.nan with the method .filled(np.nan).
Be careful, though: if you have a column that was detected as having an int dtype and you try to fill its missing values with np.nan, you won't get what you expect (np.nan is only supported for float columns).
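Putting that together, a sketch (the Observed column is upcast to float first so NaN is representable; astype preserves the mask):
x = np.genfromtxt("file.txt", dtype=None, names=True,
                  missing_values=-9999, usemask=True)
observed = x['Observed'].astype(float).filled(np.nan)
# array([ 23., nan, nan])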
Trying:
>>> x = np.genfromtxt("file.txt",names = True, missing_values = "-9999", dtype=None)
>>> x
array([(1505, 162.38, 23), (1506, 46.14, -9999), (1507, 147.49, -9999)],
dtype=[('Year', '<i8'), ('Recon', '<f8'), ('Observed', '<i8')])
does not give the correct answer. So just making it a string doesn't help. However, if an additional flag, usemask=True, is added, you get:
>>> x = np.genfromtxt("file.txt",names = True, missing_values = -9999, dtype=None, usemask=True)
>>> x
masked_array(data = [(1505, 162.38, 23) (1506, 46.14, --) (1507, 147.49, --)],
mask = [(False, False, False) (False, False, True) (False, False, True)],
fill_value = (999999, 1e+20, 999999),
dtype = [('Year', '<i8'), ('Recon', '<f8'), ('Observed', '<i8')])
which gives what you want in a MaskedArray, which may be usable for you anyway.
The numpy documentation at SciPy suggests that missing_values should be a string to work the way you want. A straight numeric value seems to be interpreted as a column index.
I usually don't post questions on these forums, but I've searched all over the place and I haven't found anything about this issue.
I am working with structured arrays to store experimental data. I'm using titles to store information about my fields, in this case the units of measure. When I call numpy.lib.io.flatten_dtype() on my dtype, I get:
File "c:\Python25\Lib\site-packages\numpy\lib\_iotools.py", line 78, in flatten_dtype
    (typ, _) = ndtype.fields[field]
ValueError: too many values to unpack
I wouldn't really care, except that numpy.genfromtxt() calls numpy.lib.io.flatten_dtype(), and I need to be able to import my data from text files.
I'm wondering if I've done something wrong. Is flatten_dtype() not meant to support titles? Is there a work-around for genfromtxt()?
Here's a snippet of my code:
import numpy
fname = "C:\\Somefile.txt"
dtype = numpy.dtype([(("Amps","Current"),"f8"),(("Volts","Voltage"),"f8")])
myarray = numpy.genfromtxt(fname,dtype)
Here is a possible workaround:
Since your custom dtype causes a problem, supply a flattened dtype instead:
In [77]: arr=np.genfromtxt('a',dtype='f8,f8')
In [78]: arr
Out[78]:
array([(1.0, 2.0), (3.0, 4.0)],
dtype=[('f0', '<f8'), ('f1', '<f8')])
Then use astype to convert to your desired dtype:
In [79]: arr=np.genfromtxt('a',dtype='f8,f8').astype(dtype)
In [80]: arr
Out[80]:
array([(1.0, 2.0), (3.0, 4.0)],
dtype=[(('Amps', 'Current'), '<f8'), (('Volts', 'Voltage'), '<f8')])
Edit: Another alternative is to monkey-patch numpy.lib.io.flatten_dtype:
import numpy
import numpy.lib.io
def flatten_dtype(ndtype):
"""
Unpack a structured data-type.
"""
names = ndtype.names
if names is None:
return [ndtype]
else:
types = []
for field in names:
typ_fields = ndtype.fields[field]
flat_dt = flatten_dtype(typ_fields[0])
types.extend(flat_dt)
return types
numpy.lib.io.flatten_dtype=flatten_dtype
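With the patch in place, the original snippet should run unchanged; a sketch reusing fname and the titled dtype from the question (untested against every numpy version of that era):
dtype = numpy.dtype([(("Amps","Current"),"f8"),(("Volts","Voltage"),"f8")])
myarray = numpy.genfromtxt(fname, dtype)   # no longer trips over the titles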