pandas DataFrame to NumPy array using pandas.DataFrame.values - Python

I have a DataFrame as below. I want to make it a NumPy array.
When I use the df.values command it gives me a NumPy array, but all the columns are converted to float. I checked the df.values documentation, but it was not helpful. Can I keep the same dtypes the DataFrame has in the NumPy array?
Thanks in advance for your help
High Low ... Volume Adj Close
Date ...
2018-12-20 2509.629883 2441.179932 ... 5585780000 2467.419922
2018-12-21 2504.409912 2408.550049 ... 7609010000 2416.620117
2018-12-24 2410.340088 2351.100098 ... 2613930000 2351.100098
2018-12-26 2467.760010 2346.580078 ... 4233990000 2467.699951
2018-12-27 2489.100098 2397.939941 ... 4096610000 2488.830078
2018-12-28 2520.270020 2472.889893 ... 3702620000 2485.739990
2018-12-31 2509.239990 2482.820068 ... 3442870000 2506.850098
2019-01-02 2519.489990 2467.469971 ... 3733160000 2510.030029

Numpy arrays have a uniform data type, as you can see from the documentation:
numpy.ndarray class numpy.ndarray(shape, dtype=float, buffer=None,
offset=0, strides=None, order=None)[source] An array object represents
a multidimensional, homogeneous array of fixed-size items. An
associated data-type object describes the format of each element in
the array (its byte-order, how many bytes it occupies in memory,
whether it is an integer, a floating point number, or something else,
etc.)
When you use df.values, pandas casts all values to the most suitable common dtype to keep the array homogeneous.
The pandas.DataFrame.values documentation also mentions this:
Notes
The dtype will be a lower-common-denominator dtype (implicit
upcasting); that is to say if the dtypes (even of numeric types) are
mixed, the one that accommodates all will be chosen. Use this with
care if you are not dealing with the blocks.
e.g. If the dtypes are float16 and float32, dtype will be upcast to
float32. If dtypes are int32 and uint8, dtype will be upcast to int32.
By numpy.find_common_type() convention, mixing int64 and uint64 will
result in a float64 dtype.
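As a quick illustration of that upcasting, here is a minimal sketch using two of the columns from the DataFrame above (column names shortened):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Volume': np.array([5585780000, 7609010000], dtype='int64'),
                   'Close':  np.array([2467.419922, 2416.620117], dtype='float64')})

print(df.dtypes)        # Volume: int64, Close: float64
arr = df.values
print(arr.dtype)        # float64 -- the int64 column is upcast so the array stays homogeneous
print(arr)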

You can do it with NumPy structured arrays.
I will create a DataFrame with only 2 rows and 2 columns similar to yours to demonstrate how you can do it with any size of DataFrame.
import pandas as pd
import numpy as np

df = pd.DataFrame({'High': [2509.629883, 2504.409912],
                   'Volume': [5585780000, 7609010000]},
                  index=np.array(['2018-12-20', '2018-12-21'], dtype='datetime64'))
Then you create an empty NumPy structured array, defining what datatype each column must have. In my example I only have 2 rows, so the array will also have only 2 rows, as follows:
array = np.empty(2, dtype={'names': ('col1', 'col2', 'col3'),
                           'formats': ('datetime64[D]', 'f8', 'i8')})
array['col1'] = df.index
array['col2'] = df['High']
array['col3'] = df['Volume']
and the array will look like:
array([('2018-12-20', 2509.629883, 5585780000),
       ('2018-12-21', 2504.409912, 7609010000)],
      dtype=[('col1', '<M8[D]'), ('col2', '<f8'), ('col3', '<i8')])
You can also create a np.recarray using np.rec.array. It is almost identical to a structured array, with one extra feature: you can access fields as attributes, i.e. array.col1 instead of array['col1']. However, NumPy record arrays are apparently slower than structured arrays!
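Another option worth knowing about is DataFrame.to_records(), which builds a NumPy record array directly from the DataFrame and keeps one dtype per field, so you do not have to define the structured dtype by hand. A sketch on the same small example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'High': [2509.629883, 2504.409912],
                   'Volume': [5585780000, 7609010000]},
                  index=pd.DatetimeIndex(['2018-12-20', '2018-12-21'], name='Date'))

rec = df.to_records()   # includes the index as the first field by default
print(rec.dtype)        # e.g. [('Date', '<M8[ns]'), ('High', '<f8'), ('Volume', '<i8')]
print(rec['High'])      # field access like a structured array
print(rec.High)         # or attribute access, recarray-style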

Related


.sum() method in pandas gives inconsistent results

I have a large DataFrame (circa 4e+07 rows).
When summing it, I get two significantly different results depending on whether I do the sum before or after the column selection.
Also, the dtype changes from float32 to float64, even though the totals are all below 2**31.
df[[col1, col2, col3]].sum()
Out[1]:
col1 9.36e+07
col2 1.39e+09
col3 6.37e+08
dtype: float32
df.sum()[[col1, col2, col3]]
Out[2]:
col1 1.21e+08
col2 1.70e+09
col3 7.32e+08
dtype: float64
I am obviously missing something, has anybody had the same issue?
Thanks for your help.
To understand what's going on here, you need to understand what Pandas is doing under the hood. I'm going to simplify a bit, since there are lots of bells and whistles and special cases to consider, but roughly it looks like this:
Suppose you've got a Pandas DataFrame object df with various numeric columns (we'll ignore datetime columns, categorical columns, and the like). When you compute df.sum(), Pandas:
Extracts the values of the dataframe into a two-dimensional NumPy array.
Applies the NumPy sum function to that 2d array with axis=0 to compute the column sums.
It's the first step that's important here. The columns of a DataFrame might have different dtypes, but a 2d NumPy array can only have a single dtype. If df has a mixture of float32 and int32 columns (for example), Pandas has to choose a single dtype that's appropriate for both columns simultaneously, and in this case it chooses float64. So when the sum is computed, it's computed on double-precision values, using double-precision arithmetic. This is what's happening in your second example.
On the other hand, if you cut down to just the float32 columns in the first place, then Pandas can and will use the float32 dtype for the 2d NumPy array, and so the sum computation is performed in single precision. This is what's happening in your first example.
Here's a simple example showing this in action: we'll set up a DataFrame with 100 million rows and three columns, of dtypes float32, float32 and int32 respectively. All the values are ones:
>>> import numpy as np, pandas as pd
>>> s = np.ones(10**8, dtype=np.float32)
>>> t = np.ones(10**8, dtype=np.int32)
>>> df = pd.DataFrame(dict(A=s, B=s, C=t))
>>> df.head()
A B C
0 1.0 1.0 1
1 1.0 1.0 1
2 1.0 1.0 1
3 1.0 1.0 1
4 1.0 1.0 1
>>> df.dtypes
A float32
B float32
C int32
dtype: object
Now when we compute the sums directly, Pandas first turns everything into float64s. The computation is also done using the float64 type, for all three columns, and we get an accurate answer.
>>> df.sum()
A 100000000.0
B 100000000.0
C 100000000.0
dtype: float64
But if we first cut down our dataframe to just the float32 columns, then float32-arithmetic is used for the sum, and we get very poor answers.
>>> df[['A', 'B']].sum()
A 16777216.0
B 16777216.0
dtype: float32
The inaccuracy is of course due to using a dtype that doesn't have enough precision for the task in question: at some point in the summation, we end up repeatedly adding 1.0 to 16777216.0, and getting 16777216.0 back each time, thanks to the usual floating-point problems. The solution is to explicitly convert to float64 yourself before doing the computation.
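For example, a sketch of that explicit conversion, reusing the df from above:
>>> # cast up to float64 first, so the accumulation runs in double precision
>>> df[['A', 'B']].astype('float64').sum()
A    100000000.0
B    100000000.0
dtype: float64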
However, this isn't quite the end of the surprises that Pandas has in store for us. With the same dataframe as above, let's try just computing the sum for column "A":
>>> df[['A']].sum()
A 100000000.0
dtype: float32
Suddenly we're getting full accuracy again! So what's going on? This has little to do with dtypes: we're still using float32 to do the summation. It's now the second step (the NumPy summation) that's responsible for the difference. What's happening is that NumPy can, and sometimes does, use a more accurate summation algorithm, called pairwise summation, and with float32 dtype and the size arrays that we're using, that accuracy can make a hugely significant difference to the final result. However, it only uses that algorithm when summing along the fastest-varying axis of an array; see this NumPy issue for related discussion. In the case where we compute the sum of both column "A" and column "B", we end up with a values array of shape (100000000, 2). The fastest-varying axis is axis 1, and we're computing the sum along axis 0, so the naive summation algorithm is used and we get poor results. But if we only ask for the sum of column "A", we get the accurate sum result, computed using pairwise summation.
In sum, when working with DataFrames of this size, you want to be careful to (a) work with double precision rather than single precision whenever possible, and (b) be prepared for differences in output results due to NumPy making different algorithm choices.
You can lose precision with np.float32 relative to np.float64
np.finfo(np.float32)
finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)
And
np.finfo(np.float64)
finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)
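A one-liner showing that resolution limit in isolation (my own sketch, not part of the original answer):
import numpy as np

# at a magnitude of 2**24, consecutive float32 values are 2.0 apart,
# so adding 1 is lost to rounding; float64 has plenty of headroom here
print(np.float32(16777216) + np.float32(1))   # 16777216.0
print(np.float64(16777216) + np.float64(1))   # 16777217.0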
A contrived example
df = pd.DataFrame(dict(
    x=[-60499999.315, 60500002.685] * int(2e7),
    y=[-60499999.315, 60500002.685] * int(2e7),
    z=[-60499999.315, 60500002.685] * int(2e7),
)).astype(dict(x=np.float64, y=np.float32, z=np.float32))

print(df.sum()[['y', 'z']], df[['y', 'z']].sum(), sep='\n\n')
y    80000000.0
z    80000000.0
dtype: float64

y    67108864.0
z    67108864.0
dtype: float32

Convert genfromtxt array to regular numpy array

I can't post the data being imported, because it's too much. But, it has both number and string fields and is 5543 rows and 137 columns. I import data with this code (ndnames and ndtypes holds the column names and column datatypes):
npArray2 = np.genfromtxt(fileName,
                         delimiter="|",
                         skip_header=1,
                         dtype=(ndtypes),
                         names=ndnames,
                         usecols=np.arange(0, 137)
                         )
This works and the resulting variable type is "void7520" with size (5543,). But this is really a 1D array of 5543 rows, where each element holds a sub-array that has 137 elements. I want to convert this into a normal numpy array of 5543 rows and 137 columns. How can this be done?
I have tried the following (using Pandas):
pdArray = pd.read_csv(fileName,
                      sep=ndelimiter,
                      index_col=False,
                      skiprows=1,
                      names=ndnames
                      )
npArray = pd.DataFrame.as_matrix(pdArray)
But the resulting npArray is of type object with size (5543, 137), which at first looks promising. However, because it's an object array, there are functions that can't be performed on it. Can this object array be converted into a normal numpy array?
Edit:
ndtypes look like...
[int,int,...,int,'|U50',int,...,int,'|U50',int,...,int]
That is, 135 number fields with two string-type fields in the middle somewhere.
npArray2 is a 1d structured array, 5543 elements and 137 fields.
What does npArray2.dtype look like, or equivalently, what is ndtypes? The dtype is built from the types and names that you provided. "void7520" is a way of identifying a record of this array, but it tells us little except the size (in bytes?).
If all fields of the dtype are numeric, better yet if they are all the same numeric dtype (int or float), then it is fairly easy to convert it to a 2d array with 137 columns (2nd dim); astype and view can be used (see the sketch at the end of this answer).
(edit - it has both number and string fields - you can't convert it to a 2d array of numbers; it could be an array of strings, but you can't do numeric math on strings.)
But if the dtypes are mixed then you can't convert it. All elements of the 2d array have to be the same dtype. You have to use the structured array approach if you want mixed types. (Well, there is dtype=object, but let's not go there.)
Actually pandas is going the object route. Evidently it thinks the only way to make an array from this data is to let each element be its own type. And the math of object arrays is severely limited. They are, in effect a glorified, or debased, list.
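To make the "all fields numeric" case above concrete, here is a sketch with hypothetical field names, assuming every field shares one dtype (float64); newer NumPy versions also offer numpy.lib.recfunctions.structured_to_unstructured for this:
import numpy as np

# small structured array whose fields are all float64
rec = np.array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
               dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])

# reinterpret the underlying bytes as plain float64, then shape to (rows, n_fields)
arr2d = rec.view(np.float64).reshape(len(rec), -1)
print(arr2d.shape)   # (2, 3)
print(arr2d)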

Converting numpy string array to float: Bizarre?

So, this should be a really straightforward thing but for whatever reason, nothing I'm doing to convert an array of strings to an array of floats is working.
I have a two column array, like so:
Name Value
Bob 4.56
Sam 5.22
Amy 1.22
I try this:
for row in myarray[1:,]:
    row[1] = float(row[1])
And this:
for row in myarray[1:,]:
    row[1] = row[1].astype(1)
And this:
myarray[1:,1] = map(float, myarray[1:,1])
And they all seem to do something, but when I double check:
type(myarray[9,1])
I get
<type 'numpy.string_'>
NumPy arrays must have a single dtype unless they are structured. Since you have some strings in your array, all the elements must be strings.
If you wish to have a complex dtype, you may do so:
import numpy as np
a = np.array([('Bob','4.56'), ('Sam','5.22'),('Amy', '1.22')], dtype = [('name','S3'),('val',float)])
Note that a is now a 1d structured array, where each element is a tuple of type dtype.
You can access the values using their field name:
In [21]: a = np.array([('Bob','4.56'), ('Sam','5.22'),('Amy', '1.22')],
...: dtype = [('name','S3'),('val',float)])
In [22]: a
Out[22]:
array([('Bob', 4.56), ('Sam', 5.22), ('Amy', 1.22)],
dtype=[('name', 'S3'), ('val', '<f8')])
In [23]: a['val']
Out[23]: array([ 4.56, 5.22, 1.22])
In [24]: a['name']
Out[24]:
array(['Bob', 'Sam', 'Amy'],
dtype='|S3')
The type of the objects in a NumPy array is determined at the initialisation of that array. If you want to change it later, you must cast the array, not the objects within the array.
myNewArray = myArray.astype(float)
Note: upcasting is possible; for downcasting you need the astype method.
For further information see:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.chararray.astype.html
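If all you actually need is the numeric column as floats, a minimal sketch (using the sample data from the question) is to slice that column out and cast only the slice; the original array stays strings:
import numpy as np

myarray = np.array([['Name', 'Value'],
                    ['Bob', '4.56'],
                    ['Sam', '5.22'],
                    ['Amy', '1.22']])

# cast a copy of the second column; you cannot store floats back into a string array
values = myarray[1:, 1].astype(float)
print(values)          # [4.56 5.22 1.22]
print(values.dtype)    # float64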

NumPy or Pandas: Keeping array type as integer while having a NaN value

Is there a preferred way to keep the data type of a numpy array fixed as int (or int64 or whatever), while still having an element inside listed as numpy.NaN?
In particular, I am converting an in-house data structure to a Pandas DataFrame. In our structure, we have integer-type columns that still have NaNs (but the dtype of the column is int). Making this a DataFrame seems to recast everything as float, but we'd really like these columns to stay int.
Thoughts?
Things tried:
I tried using the from_records() function under pandas.DataFrame, with coerce_float=False and this did not help. I also tried using NumPy masked arrays, with NaN fill_value, which also did not work. All of these caused the column data type to become a float.
NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
(This feature has been added beginning with version 0.24 of pandas, but note it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lower case):
https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support
)
This capability has been added to pandas beginning with version 0.24.
At this point, it requires the use of extension dtype 'Int64' (capitalized), rather than the default dtype 'int64' (lowercase).
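A minimal sketch of the nullable Int64 dtype (pandas >= 0.24; the missing value prints as NaN in 0.24/0.25 and as <NA> from 1.0 on):
import numpy as np
import pandas as pd

# note the capital 'I': this is the extension dtype, not numpy's int64
s = pd.Series([1, 2, np.nan], dtype='Int64')
print(s.dtype)   # Int64
print(s)
# 0       1
# 1       2
# 2    <NA>
# dtype: Int64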
If performance is not the main issue, you can store strings instead.
df.col = df.col.dropna().apply(lambda x: str(int(x)))
Then you can mix them with NaN as much as you want. If you really want to have integers, depending on your application, you can use -1, or 0, or 1234567890, or some other dedicated value to represent NaN.
You can also temporarily duplicate the columns: one as you have it, with floats; the other one experimental, with ints or strings. Then insert asserts in every reasonable place checking that the two are in sync. After enough testing you can let go of the floats.
In case you are trying to convert a float vector (e.g. 1.143) to an integer one (1), and that vector has NAs, converting it to the new 'Int64' dtype will give you an error. To solve this, you have to round the numbers first and then do .astype('Int64'):
s1 = pd.Series([1.434, 2.343, np.nan])

# without round() the next line returns an error:
# cannot safely cast non-equivalent float64 to int64
s1.astype('Int64')

# with round() it works
s1.round().astype('Int64')
0      1
1      2
2    NaN
dtype: Int64
My use case is that I have a float series that I want to round to int, but .round() alone still returns floats, so you need to convert to an integer dtype to get rid of the decimals.
This is not a solution for all cases, but in mine (genomic coordinates) I've resorted to using 0 as NaN:
a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)
This at least allows the proper 'native' column type to be used, and operations like subtraction, comparison etc. work as expected.
Pandas v0.24+
Functionality to support NaN in integer series will be available in v0.24 upwards. There's information on this in the v0.24 "What's New" section, and more details under Nullable Integer Data Type.
Pandas v0.23 and earlier
In general, it's best to work with float series where possible, even when the series is upcast from int to float due to inclusion of NaN values. This enables vectorised NumPy-based calculations where, otherwise, Python-level loops would be processed.
The docs do suggest: "One possibility is to use dtype=object arrays instead." For example:
s = pd.Series([1, 2, 3, np.nan])
print(s.astype(object))
0 1
1 2
2 3
3 NaN
dtype: object
For cosmetic reasons, e.g. output to a file, this may be preferable.
Pandas v0.23 and earlier: background
NaN is considered a float. The docs currently (as of v0.23) explain why integer series are upcast to float:
In the absence of high performance NA support being built into NumPy
from the ground up, the primary casualty is the ability to represent
NAs in integer arrays.
This trade-off is made largely for memory and performance reasons, and
also so that the resulting Series continues to be “numeric”.
The docs also provide rules for upcasting due to NaN inclusion:
Typeclass    Promotion dtype for storing NAs
floating     no change
object       no change
integer      cast to float64
boolean      cast to object
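Those promotion rules are easy to see directly (a small sketch):
import numpy as np
import pandas as pd

print(pd.Series([1, 2, 3]).dtype)              # int64
print(pd.Series([1, 2, np.nan]).dtype)         # float64: integer upcast to float64 to hold the NaN
print(pd.Series([True, False, np.nan]).dtype)  # object:  boolean upcast to object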
New for Pandas v1.0+
With the nullable integer dtypes you no longer use numpy.nan as the missing value.
Now you have pandas.NA.
Please read: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
IntegerArray is currently experimental. Its API or implementation may
change without warning.
Changed in version 1.0.0: Now uses pandas.NA as the missing value
rather than numpy.nan.
In Working with missing data, we saw that pandas primarily uses NaN to
represent missing data. Because NaN is a float, this forces an array
of integers with any missing values to become floating point. In some
cases, this may not matter much. But if your integer column is, say,
an identifier, casting to float can be problematic. Some integers
cannot even be represented as floating point numbers.
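So with pandas >= 1.0, a missing entry in a nullable integer series comes back as pandas.NA rather than numpy.nan. A short sketch:
import pandas as pd

s = pd.Series([1, 2, None], dtype='Int64')
print(s[2] is pd.NA)   # True: the missing value is the pd.NA singleton
print(s.dtype)         # Int64, unchanged by the missing value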
If there are blanks in the text data, columns that would normally be integers will be cast to float64, because the int64 dtype cannot handle nulls. This can cause an inconsistent schema if you are loading multiple files, some with blanks (which will end up as float64) and others without (which will end up as int64).
This code will attempt to convert any numeric columns to Int64 (as opposed to int64), since Int64 can handle nulls:
import pandas as pd
import numpy as np

# show datatypes before transformation
mydf.dtypes

for c in mydf.select_dtypes(np.number).columns:
    try:
        mydf[c] = mydf[c].astype('Int64')
        print('casted {} as Int64'.format(c))
    except:
        print('could not cast {} to Int64'.format(c))

# show datatypes after transformation
mydf.dtypes
This is now possible, since pandas v0.24.0:
pandas 0.24.x release notes
Quote: "Pandas has gained the ability to hold integer dtypes with missing values."
I know that the OP asked for NumPy or Pandas only, but I think it is worth mentioning Polars as an alternative that supports the requested feature.
In Polars any missing values in an integer column are simply null values and the column remains an integer column.
See Polars - User Guide > Coming from Pandas for more info.
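For completeness, a hedged sketch of that behaviour in Polars (assuming Polars is installed):
import polars as pl

s = pl.Series("ids", [1, None, 3])
print(s.dtype)   # Int64: the null does not force a cast to float
print(s)         # the missing entry is shown as null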
