-9999 as missing value with numpy.genfromtxt() - python

Let's say I have a dumb text file with the contents:
Year Recon Observed
1505 162.38 23
1506 46.14 -9999
1507 147.49 -9999
-9999 is used to denote a missing value (don't ask).
So, I should be able to read this into a Numpy array with:
import numpy as np
x = np.genfromtxt("file.txt", dtype = None, names = True, missing_values = -9999)
And have all my little -9999s turn into numpy.nan. But, I get:
>>> x
array([(1505, 162.38, 23), (1506, 46.14, -9999), (1507, 147.49, -9999)],
dtype=[('Year', '<i8'), ('Recon', '<f8'), ('Observed', '<i8')])
... That's not right...
Am I missing something?

Nope, you're not doing anything wrong. Using the missing_values argument indeed tells np.genfromtxt that the corresponding values should be flagged as "missing/invalid". The problem is that dealing with missing values is only supported if you use the usemask=True argument (I probably should have made that clearer in the documentation, my bad).
With usemask=True, the output is a masked array. You can transform it into a regular ndarray with the missing values replaced by np.nan with the method .filled(np.nan).
Be careful, though: if you have a column that was detected as having an int dtype and you try to fill its missing values with np.nan, you won't get what you expect (np.nan is only supported for float columns).
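As a minimal sketch of that workflow (reusing the column names from the sample file above), you could pull out the masked column, cast it to float, and only then fill:
import numpy as np

x = np.genfromtxt("file.txt", dtype=None, names=True,
                  missing_values=-9999, usemask=True)
# 'Observed' is detected as int, so cast to float before filling with NaN
observed = x["Observed"].astype(float).filled(np.nan)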

Trying:
>>> x = np.genfromtxt("file.txt",names = True, missing_values = "-9999", dtype=None)
>>> x
array([(1505, 162.38, 23), (1506, 46.14, -9999), (1507, 147.49, -9999)],
dtype=[('Year', '<i8'), ('Recon', '<f8'), ('Observed', '<i8')])
does not give the correct answer, so just making it a string doesn't help. However, if the additional flag usemask=True is added, you get:
>>> x = np.genfromtxt("file.txt",names = True, missing_values = -9999, dtype=None, usemask=True)
>>> x
masked_array(data = [(1505, 162.38, 23) (1506, 46.14, --) (1507, 147.49, --)],
mask = [(False, False, False) (False, False, True) (False, False, True)],
fill_value = (999999, 1e+20, 999999),
dtype = [('Year', '<i8'), ('Recon', '<f8'), ('Observed', '<i8')])
which gives what you want as a MaskedArray, which may be usable for you anyway.

The numpy documentation at SciPy suggests that missing_values should be a string to work the way you want. A bare numeric value seems to be interpreted as a column index.
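For what it's worth, missing_values can also be given per column as a dict of strings; a small sketch against the sample file above (combined with usemask=True so the flags take effect):
x = np.genfromtxt("file.txt", dtype=None, names=True,
                  missing_values={"Observed": "-9999"}, usemask=True)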

How to keep column names when converting from pandas to numpy

According to this post, I should be able to access the names of columns in an ndarray as a.dtype.names
However, if I convert a pandas DataFrame to an ndarray with df.as_matrix() or df.values, then the dtype.names field is None. Additionally, if I try to assign column names to the ndarray
X = pd.DataFrame(dict(age=[40., 50., 60.], sys_blood_pressure=[140.,150.,160.]))
print X
print type(X.as_matrix())# <type 'numpy.ndarray'>
print type(X.as_matrix()[0]) # <type 'numpy.ndarray'>
m = X.as_matrix()
m.dtype.names = list(X.columns)
I get
ValueError: there are no fields defined
UPDATE:
I'm particularly interested in the cases where the matrix only needs to hold a single type (it is an ndarray of a specific numeric type), since I'd also like to use cython for optimization. (I suspect numpy records and structured arrays are more difficult to deal with since they're more freely typed.)
Really, I'd just like to maintain the column-name metadata for arrays passed through a deep tree of scikit-learn predictors. Its .fit(X, y) and .predict(X) API doesn't permit passing additional metadata about the column labels outside of the X and y objects.
Consider a DF as shown below:
X = pd.DataFrame(dict(one=['Strawberry', 'Fields', 'Forever'], two=[1,2,3]))
X
Provide a list of tuples as data input to the structured array:
arr_ip = [tuple(i) for i in X.as_matrix()]
Ordered list of field names:
dtyp = np.dtype(list(zip(X.dtypes.index, X.dtypes)))
Here, X.dtypes.index gives you the column names and X.dtypes their corresponding dtypes; these are zipped into a list of tuples and fed to np.dtype to construct the structured dtype.
arr = np.array(arr_ip, dtype=dtyp)
gives:
arr
# array([('Strawberry', 1), ('Fields', 2), ('Forever', 3)],
# dtype=[('one', 'O'), ('two', '<i8')])
and
arr.dtype.names
# ('one', 'two')
A pandas DataFrame also has a handy to_records method. Demo:
X = pd.DataFrame(dict(age=[40., 50., 60.],
sys_blood_pressure=[140.,150.,160.]))
m = X.to_records(index=False)
print repr(m)
Returns:
rec.array([(40.0, 140.0), (50.0, 150.0), (60.0, 160.0)],
dtype=[('age', '<f8'), ('sys_blood_pressure', '<f8')])
This is a "record array", which is an ndarray subclass that allows field access using attributes, e.g. m.age in addition to m['age'].
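For instance, continuing the demo above, both spellings return the same column:
print m.age     # attribute access -> [ 40.  50.  60.]
print m['age']  # field access, same data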
You can pass this to a cython function as a regular float array by constructing a view:
m_float = m.view(float).reshape(m.shape + (-1,))
print repr(m_float)
Which gives:
rec.array([[ 40., 140.],
[ 50., 150.],
[ 60., 160.]],
dtype=float64)
Note that in order for this to work, the original DataFrame must have a float dtype for every column. To make sure, use m = X.astype(float, copy=False).to_records(index=False).
Yet more methods of converting a pandas.DataFrame to numpy.array while preserving label/column names
This is mainly for demonstrating how to set dtype/column_dtypes, because sometimes a data source iterator's output will need some pre-normalization.
Method one inserts column by column into a zeroed array of predefined height, and is loosely based on a Creating Structured Arrays guide that a bit of web-crawling turned up.
import numpy

def to_tensor(dataframe, columns = [], dtypes = {}):
    # Use all columns from the data frame if none were listed when called
    if len(columns) <= 0:
        columns = dataframe.columns
    # Build list of dtypes to use, updating from any `dtypes` passed when called
    dtype_list = []
    for column in columns:
        if column not in dtypes.keys():
            dtype_list.append(dataframe[column].dtype)
        else:
            dtype_list.append(dtypes[column])
    # Build dictionary with lists of column names and formats in the same order
    dtype_dict = {
        'names': columns,
        'formats': dtype_list
    }
    # Initialize a _mostly_ empty numpy array with column names and formats
    numpy_buffer = numpy.zeros(
        shape = len(dataframe),
        dtype = dtype_dict)
    # Insert values from dataframe columns into the named numpy fields
    for column in columns:
        numpy_buffer[column] = dataframe[column].to_numpy()
    # Return results of conversion
    return numpy_buffer
Method two is based on user7138814's answer and will likely be more efficient, as it is basically a wrapper for the built-in to_records method available to pandas.DataFrames.
def to_tensor(dataframe, columns = [], dtypes = {}, index = False):
    to_records_kwargs = {'index': index}
    if not columns:  # Default to all `dataframe.columns`
        columns = dataframe.columns
    if dtypes:  # Pull in modifications only for dtypes listed in `columns`
        to_records_kwargs['column_dtypes'] = {}
        for column in dtypes.keys():
            if column in columns:
                to_records_kwargs['column_dtypes'].update({column: dtypes.get(column)})
    return dataframe[columns].to_records(**to_records_kwargs)
With either of the above one could do...
X = pandas.DataFrame(dict(age = [40., 50., 60.], sys_blood_pressure = [140., 150., 160.]))
# Example of overwriting dtype for a column
X_tensor = to_tensor(X, dtypes = {'age': 'int32'})
print("Ages -> {0}".format(X_tensor['age']))
print("SBPs -> {0}".format(X_tensor['sys_blood_pressure']))
... which should output...
Ages -> [40 50 60]
SBPs -> [140. 150. 160.]
... and a full dump of X_tensor should look like the following.
array([(40, 140.), (50, 150.), (60, 160.)],
dtype=[('age', '<i4'), ('sys_blood_pressure', '<f8')])
Some thoughts
While method two will likely be more efficient than the first, method one (with some modifications) may be more useful for merging two or more pandas.DataFrames into one numpy.array, as sketched below.
Additionally (after swinging back through to review), method two as written will likely face-plant with errors about to_records_kwargs not being a mapping if dtypes is not defined; next time I'm feeling Pythonic I may resolve that with an else condition.
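For illustration, here is a rough sketch of that merging idea, using hypothetical left/right frames and assuming equal row counts and non-overlapping column names:
import numpy as np
import pandas as pd

left = pd.DataFrame(dict(age=[40., 50., 60.]))
right = pd.DataFrame(dict(sys_blood_pressure=[140., 150., 160.]))

# Combined structured dtype built from both frames' column names and dtypes
combined_dtype = np.dtype(
    list(zip(left.columns, left.dtypes)) + list(zip(right.columns, right.dtypes)))
merged = np.zeros(len(left), dtype=combined_dtype)
# Fill the zeroed buffer column by column, as in method one
for frame in (left, right):
    for column in frame.columns:
        merged[column] = frame[column].to_numpy()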
Create an example:
import pandas
import numpy
PandasTable = pandas.DataFrame( {
    "AAA": [4, 5, 6, 7],
    "BBB": [10, 20, 30, 40],
    "CCC": [100, 50, -30, -50],
    "DDD": ['asdf1', 'asdf2', 'asdf3', 'asdf4'] } )
Solve the problem noting that we are creating something called a "structured numpy array":
NumpyDtypes = list( PandasTable.dtypes.items() )
NumpyTable = PandasTable.to_numpy(copy=True)
NumpyTableRows = [ tuple(Row) for Row in NumpyTable]
NumpyTableWithHeaders = numpy.array( NumpyTableRows, dtype=NumpyDtypes )
Rewrite the solution in 1 line of code:
NumpyTableWithHeaders2 = numpy.array( [ tuple(Row) for Row in PandasTable.to_numpy(copy=True)], dtype=list( PandasTable.dtypes.items() ) )
Print out the solution results:
print ('NumpyTableWithHeaders', NumpyTableWithHeaders)
print ('NumpyTableWithHeaders.dtype', NumpyTableWithHeaders.dtype)
print ('NumpyTableWithHeaders2', NumpyTableWithHeaders2)
print ('NumpyTableWithHeaders2.dtype', NumpyTableWithHeaders2.dtype)
NumpyTableWithHeaders [(4, 10, 100, 'asdf1') (5, 20, 50, 'asdf2') (6, 30, -30, 'asdf3')
(7, 40, -50, 'asdf4')]
NumpyTableWithHeaders.dtype [('AAA', '<i8'), ('BBB', '<i8'), ('CCC', '<i8'), ('DDD', 'O')]
NumpyTableWithHeaders2 [(4, 10, 100, 'asdf1') (5, 20, 50, 'asdf2') (6, 30, -30, 'asdf3')
(7, 40, -50, 'asdf4')]
NumpyTableWithHeaders2.dtype [('AAA', '<i8'), ('BBB', '<i8'), ('CCC', '<i8'), ('DDD', 'O')]
Documentation I had to read
Adding row/column headers to NumPy arrays
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html
How to keep column names when converting from pandas to numpy
https://numpy.org/doc/stable/user/basics.creation.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html
https://docs.scipy.org/doc/numpy-1.10.1/user/basics.rec.html
Notes and thoughts:
Pandas should add a flag in their 'to_numpy' function which does this.
Recent versions of the Numpy documentation should be updated to include structured arrays, which behave differently than regular ones.
OK, here's where I'm leaning:
class NDArrayWithColumns(np.ndarray):
    def __new__(cls, obj, columns=None):
        obj = obj.view(cls)
        obj.columns = columns
        return obj

    def __array_finalize__(self, obj):
        if obj is None: return
        self.columns = getattr(obj, 'columns', None)

    @staticmethod
    def from_dataframe(df):
        cols = tuple(df.columns)
        arr = df.as_matrix(cols)
        return NDArrayWithColumns.from_array(arr, cols)

    @staticmethod
    def from_array(array, columns):
        if isinstance(array, NDArrayWithColumns):
            return array
        return NDArrayWithColumns(array, tuple(columns))

    def __str__(self):
        sup = np.ndarray.__str__(self)
        if self.columns:
            header = ", ".join(self.columns)
            header = "# " + header + "\n"
            return header + sup
        return sup

NAN = float("nan")
X = pd.DataFrame(dict(age=[40., NAN, 60.], sys_blood_pressure=[140.,150.,160.]))
arr = NDArrayWithColumns.from_dataframe(X)
print arr
print arr.columns
print arr.dtype
Gives:
# age, sys_blood_pressure
[[ 40. 140.]
[ nan 150.]
[ 60. 160.]]
('age', 'sys_blood_pressure')
float64
and can also be passed to a typed Cython function expecting an ndarray[2,double_t].
UPDATE: this works pretty well except for some oddness when passing the type to ufuncs.

Matlab to Python conversion: Read a text file into numpy records and search array for a string

I am just learning Python and I am not familiar with all the terminology yet. I have the following Matlab code that I would like to do in Python.
Read a text file into a structure (record/list?)
Search a field (array of strings) for a particular value.
Use that index in another field
sampleData.txt
name descript sr type scale offset
a Param_a 10 int8 1 0
b Param_b 20 unit 2 -10
c Param_c 30 int8 3 -20
d Param_d 40 int8 4 -30
e Param_e 50 uint 5 -40
Matlab Code:
>> [info.name info.descrip info.sr info.type info.scale info.offset] = textread('sampleData.txt','%s\t%s\t%f\t%s\t%f\t%f','headerlines',1);
info =
name: {5x1 cell}
descrip: {5x1 cell}
sr: [5x1 double]
type: {5x1 cell}
scale: [5x1 double]
offset: [5x1 double]
>> nameIdx = strcmp(info.name,'b') ;
>> matched_sr = info.sr(nameIdx)
matched_sr =
20
In Python I was able to read in the text file using numpy with:
info= recfromcsv('sampleData.txt', delimiter='\t')
Out:
rec.array([(b'a', b'Param_a', 10, b'int8', 1, 0),
(b'b', b'Param_b', 20, b'unit', 2, -10),
(b'c', b'Param_c', 30, b'int8', 3, -20),
(b'd', b'Param_d', 40, b'int8', 4, -30),
(b'e', b'Param_e', 50, b'uint', 5, -40)],
dtype=[('name', 'S1'), ('descript', 'S7'), ('sr', '<i4'), ('type', 'S4'), ('scale', '<i4'), ('offset', '<i4')])
I can do things like the following to get an array of logicals:
In [77]: info.sr == 20
Out[77]: array([False, True, False, False, False], dtype=bool)
But the same thing for info.name doesn't work.
In [78]: info.name == 'b'
Out[78]: False
So, how do I find a parameter by "name" like I did in Matlab with strcmp? Also, more generally, is there a better approach in Python/numpy to read in text files as arrays (records or lists?)? Sorry for any incorrect Python jargon as I am still very new.
Thanks,
Looks like you are using Python 3, which uses unicode strings by default. But the data file is ASCII, so the strings load as byte strings, which is why they all display with the b prefix.
So to do comparisons, you need to compare byte strings to byte strings.
Try:
info.name == b'b'
e.g.
In [21]: info.type==b'int8'
Out[21]: array([ True, False, True, True, False], dtype=bool)
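Putting it together, the Matlab strcmp-and-index pattern becomes a boolean mask plus indexing (a short sketch based on the info array loaded above):
name_idx = info.name == b'b'     # boolean mask, the strcmp equivalent
matched_sr = info.sr[name_idx]   # -> the sr value(s) where name == 'b', here 20
If your numpy version's genfromtxt/recfromcsv accepts an encoding argument, reading with encoding='utf-8' should give ordinary str fields instead of byte strings, so the b prefixes go away.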

Check if dataframe column is Categorical

I can't seem to get a simple dtype check working with Pandas' improved Categoricals in v0.15+. Basically I just want something like is_categorical(column) -> True/False.
import pandas as pd
import numpy as np
import random
df = pd.DataFrame({
    'x': np.linspace(0, 50, 6),
    'y': np.linspace(0, 20, 6),
    'cat_column': random.sample('abcdef', 6)
})
df['cat_column'] = pd.Categorical(df['cat_column'])
We can see that the dtype for the categorical column is 'category':
df.cat_column.dtype
Out[20]: category
And normally we can do a dtype check by just comparing to the name
of the dtype:
df.x.dtype == 'float64'
Out[21]: True
But this doesn't seem to work when trying to check if the x column
is categorical:
df.x.dtype == 'category'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-22-94d2608815c4> in <module>()
----> 1 df.x.dtype == 'category'
TypeError: data type "category" not understood
Is there any way to do these types of checks in pandas v0.15+?
Use the name property to do the comparison instead, it should always work because it's just a string:
>>> import numpy as np
>>> arr = np.array([1, 2, 3, 4])
>>> arr.dtype.name
'int64'
>>> import pandas as pd
>>> cat = pd.Categorical(['a', 'b', 'c'])
>>> cat.dtype.name
'category'
So, to sum up, you can end up with a simple, straightforward function:
def is_categorical(array_like):
    return array_like.dtype.name == 'category'
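For example, with the df from the question:
>>> is_categorical(df.cat_column)
True
>>> is_categorical(df.x)
False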
First, the string representation of the dtype is 'category' and not 'categorical', so this works:
In [41]: df.cat_column.dtype == 'category'
Out[41]: True
But indeed, as you noticed, this comparison gives a TypeError for other dtypes, so you would have to wrap it with a try .. except .. block.
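A small sketch of what such a wrapper could look like:
def is_categorical(column):
    # Fall back to False for dtypes where the string comparison raises TypeError
    try:
        return column.dtype == 'category'
    except TypeError:
        return False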
Other ways to check using pandas internals:
In [42]: isinstance(df.cat_column.dtype, pd.api.types.CategoricalDtype)
Out[42]: True
In [43]: pd.api.types.is_categorical_dtype(df.cat_column)
Out[43]: True
For non-categorical columns, those statements will return False instead of raising an error. For example:
In [44]: pd.api.types.is_categorical_dtype(df.x)
Out[44]: False
For much older versions of pandas, replace pd.api.types in the above snippet with pd.core.common.
Just putting this here because pandas.DataFrame.select_dtypes() is what I was actually looking for:
df['column'].name in df.select_dtypes(include='category').columns
Thanks to @Jeff.
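The same select_dtypes call also makes it easy to list every categorical column at once, for example:
df.select_dtypes(include='category').columns.tolist()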
In my pandas version (v1.0.3), a shorter version of joris' answer is available.
df = pd.DataFrame({'noncat': [1, 2, 3], 'categ': pd.Categorical(['A', 'B', 'C'])})
print(isinstance(df.noncat.dtype, pd.CategoricalDtype)) # False
print(isinstance(df.categ.dtype, pd.CategoricalDtype)) # True
print(pd.CategoricalDtype.is_dtype(df.noncat)) # False
print(pd.CategoricalDtype.is_dtype(df.categ)) # True
I ran into this thread looking for the exact same functionality, and also found out another option, right from the pandas documentation here.
It looks like the canonical way to check if a pandas dataframe column is a categorical Series should be the following:
hasattr(column_to_check, 'cat')
So, as per the example given in the initial question, this would be:
hasattr(df.cat_column, 'cat')  # True
Nowadays you can use:
pandas.api.types.is_categorical_dtype(series)
Docs here: https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_categorical_dtype.html
Available since at least pandas 1.0
Taking a look at @Jeff Tratner's answer: since the condition df.cat_column.dtype == 'category' does not need to be True for a column to be considered categorical,
I propose treating as categorical any dtype within the 'categorical_dtypes' list:
def is_cat(column):
    categorical_dtypes = ['object', 'category', 'bool']
    if column.dtype.name in categorical_dtypes:
        return True
    else:
        return False

numpy loadtxt returns 0-d array for single line files, annoying?

I have the problem that I wrote code which uses the following numpy calls:
columnNames = ['A','B','C'];
dt = [(s,np.float64) for s in columnNames];
# load structured unit
SimData = np.loadtxt( file ,dtype=dt, delimiter="\t", comments='#')
If my file contains only one line, then SimData['A'][0] does not exist because the command returns a 0-d array, which is somewhat annoying.
How can I make my indexing code also work for single-line files?
You can use the argument ndmin=1 to force loadtxt to return a result that is at least one-dimensional:
In [10]: !cat data.tsv
100 200 300
In [11]: a = np.loadtxt('data.tsv', dtype=dt, delimiter='\t', ndmin=1)
In [12]: a.shape
Out[12]: (1,)
In [13]: a
Out[13]:
array([(100.0, 200.0, 300.0)],
dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
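A quick check that ordinary indexing now works on the single-line result (continuing the session above):
In [14]: a['A'][0]
Out[14]: 100.0
In [15]: a[0]['B']
Out[15]: 200.0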

Using numpy's flatten_dtype with structured dtypes that have titles

I usually don't post questions on these forums, but I've searched all over the place and I haven't found anything about this issue.
I am working with structured arrays to store experimental data. I'm using titles to store information about my fields, in this case the units of measure. When I call numpy.lib.io.flatten_dtype() on my dtype, I get:
File "c:\Python25\Lib\site-packages\numpy\lib\_iotools.py", line 78, in flatten_dtype
    (typ, _) = ndtype.fields[field]
ValueError: too many values to unpack
I wouldn't really care, except that numpy.genfromtxt() calls numpy.lib.io.flatten_dtype(), and I need to be able to import my data from text files.
I'm wondering if I've done something wrong. Is flatten_dtype() not meant to support titles? Is there a work-around for genfromtxt()?
Here's a snippet of my code:
import numpy
fname = "C:\\Somefile.txt"
dtype = numpy.dtype([(("Amps","Current"),"f8"),(("Volts","Voltage"),"f8")])
myarray = numpy.genfromtxt(fname,dtype)
Here is a possible workaround:
Since your custom dtype causes a problem, supply a flattened dtype instead:
In [77]: arr=np.genfromtxt('a',dtype='f8,f8')
In [78]: arr
Out[78]:
array([(1.0, 2.0), (3.0, 4.0)],
dtype=[('f0', '<f8'), ('f1', '<f8')])
Then use astype to convert to your desired dtype:
In [79]: arr=np.genfromtxt('a',dtype='f8,f8').astype(dtype)
In [80]: arr
Out[80]:
array([(1.0, 2.0), (3.0, 4.0)],
dtype=[(('Amps', 'Current'), '<f8'), (('Volts', 'Voltage'), '<f8')])
Edit: Another alternative is to monkey-patch numpy.lib.io.flatten_dtype:
import numpy
import numpy.lib.io

def flatten_dtype(ndtype):
    """
    Unpack a structured data-type.
    """
    names = ndtype.names
    if names is None:
        return [ndtype]
    else:
        types = []
        for field in names:
            typ_fields = ndtype.fields[field]
            flat_dt = flatten_dtype(typ_fields[0])
            types.extend(flat_dt)
        return types

numpy.lib.io.flatten_dtype = flatten_dtype
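With the patch applied, the original call from the question should then run unchanged (a sketch, assuming the patched numpy.lib.io.flatten_dtype is the one genfromtxt actually picks up in your numpy version):
fname = "C:\\Somefile.txt"
dtype = numpy.dtype([(("Amps","Current"),"f8"),(("Volts","Voltage"),"f8")])
myarray = numpy.genfromtxt(fname, dtype)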
