Convert dataframe to a rec array (and objects to strings) - python

I have a pandas dataframe with a mix of datatypes (dtypes) that I wish to convert to a numpy structured array (or record array, basically the same thing in this case). For purely numeric dataframes, this is easy to do with the to_records() method. I also need the dtypes of pandas columns to be converted to strings rather than objects so that I can use the numpy method tofile() which will output numbers and strings to a binary file, but will not output objects.
In a nutshell, I need to convert pandas columns with dtype=object to numpy structured arrays of string or unicode dtype.
Here's an example, with code that would be sufficient if all columns had a numerical (float or int) dtype.
import pandas as pd

df = pd.DataFrame({'f_num': [1., 2., 3.], 'i_num': [1, 2, 3],
                   'char': ['a', 'bb', 'ccc'], 'mixed': ['a', 'bb', 1]})
struct_arr = df.to_records(index=False)
print('struct_arr', struct_arr.dtype, '\n')
# struct_arr (numpy.record, [('f_num', '<f8'), ('i_num', '<i8'),
#             ('char', 'O'), ('mixed', 'O')])
But because I want to end up with string dtypes, I need to add this additional and somewhat involved code:
lst = []
for col in struct_arr.dtype.names:  # the only iterator I could find for the column labels
    dt = struct_arr[col].dtype
    if dt == 'O':  # 'O' means 'object'
        # an explicit string length is required, so calculate it
        # with pandas' len & max methods
        dt = 'U' + str(df[col].astype(str).str.len().max())
    lst.append((col, dt))
struct_arr = struct_arr.astype(lst)
print('struct_arr', struct_arr.dtype)
# struct_arr (numpy.record, [('f_num', '<f8'), ('i_num', '<i8'),
#             ('char', '<U3'), ('mixed', '<U2')])
See also: How to change the dtype of certain columns of a numpy recarray?
This seems to work, as the character and mixed dtypes are now <U3 and <U2 rather than 'O' or 'object'. I'm just checking if there is a simpler or more elegant approach. But since pandas does not have a native string type as numpy does, maybe there is not?

Combining suggestions from jpp (a list comprehension for conciseness) and hpaulj (cannibalizing to_records for speed), I came up with the following, which is cleaner code and also about 5x faster than my original (tested by expanding the sample dataframe above to 10,000 rows):
import numpy as np

names = df.columns
arrays = [df[col].get_values() for col in names]  # note: get_values() is removed in pandas >= 1.0; use to_numpy()
formats = [array.dtype if array.dtype != 'O'
           else f'{array.astype(str).dtype}' for array in arrays]
rec_array = np.rec.fromarrays(arrays, dtype={'names': names, 'formats': formats})
The above outputs unicode rather than byte strings, which is probably better in general. In my case, however, I need byte strings because I'm reading the binary file in Fortran, and fixed-length strings seem to read in more easily there. Hence, it may be better to replace the "formats" line above with this:
formats = [array.dtype if array.dtype != 'O'
           else array.astype(str).dtype.str.replace('<U', 'S') for array in arrays]
E.g. a dtype of <U4 becomes S4.
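As a quick sanity check (a minimal sketch; the file name is hypothetical), a record array built with 'S' fields can be written out with tofile(), which rejects object columns:
rec_array = np.rec.fromarrays(arrays, dtype={'names': names, 'formats': formats})
print(rec_array.dtype)
# e.g. (numpy.record, [('f_num', '<f8'), ('i_num', '<i8'), ('char', 'S3'), ('mixed', 'S2')])
rec_array.tofile('data.bin')  # writes the raw fixed-width records as binary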

As far as I am aware, there is no native functionality for this. For example, the maximum length of all values within a series is not stored anywhere.
However, you can implement your logic more efficiently via a list comprehension and f-strings:
data_types = [(col, arr[col].dtype if arr[col].dtype != 'O'
               else f'U{df[col].astype(str).str.len().max()}') for col in arr.dtype.names]
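Applying these dtypes then mirrors the astype() step from the question (here assuming arr is the record array returned by df.to_records(index=False)):
arr = arr.astype(data_types)
print(arr.dtype)
# (numpy.record, [('f_num', '<f8'), ('i_num', '<i8'), ('char', '<U3'), ('mixed', '<U2')])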

Related

How to preserve dtype int when reading integers with NaN in pandas

The question might sound silly but I am interested to know the following:
my_df = pd.read_sql_query(sql_script.read(), engine).astype(object)
works with pandas version 1.0.5 and prevents the NaNs in the integer columns from turning those columns into floats, whereas with pandas version 1.3.5 .astype(object) does absolutely nothing.
I am curious to know why this is, and what the best approach is for keeping the data obtained from SQL as is, without columns that contain NaNs being converted to floats (NaN itself is a float).
Thank you in advance!
Use dtype 'Int64' for NaN support
'Int64' (capital I) is a pandas nullable integer, so it can mix with NaNs.
Default numpy integer dtypes cannot hold NaN, so by default such a column is upcast to float64 (or ends up as object if you force it with astype(object)).
For example, say column Col D contains only integers and NaNs:
Either use the dtype param at load time (available in most read_* methods):
df = pd.read_sql_query(sql_script.read(), engine, dtype={'Col D': 'Int64'})
# ^^^^^
Or use astype after loading:
df = pd.read_sql_query(sql_script.read(), engine).astype({'Col D': 'Int64'})
# ^^^^^^
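A minimal illustration of the difference (assuming nothing about the SQL data itself):
import pandas as pd
import numpy as np

s = pd.Series([1, 2, np.nan])
print(s.dtype)    # float64 -- the NaN forces an upcast with the default numpy dtype

s_nullable = pd.Series([1, 2, np.nan], dtype='Int64')
print(s_nullable.dtype)    # Int64 -- nullable integer; the missing value is shown as <NA>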

How to change dtype of each dataframe column to float if it best fits?

I have an Excel file with columns containing values like 1.32131**, alongside columns of string values. As a result, the dtypes of these columns in the dataframe are object. I have cleaned the asterisks from the dataframe and now I need to convert the dtypes of these columns to float64. I am aware of ways to do it if I define the columns that need to be changed or the desired dtypes for each column (like the functions mentioned here), but I have too many columns to use such solutions. Thus I am looking for a more efficient and clean way.
For example, if I wanted to convert to int64 I would use convert_dtypes(), but it seems that it doesn't support floats and it returns these columns with object dtype. Quoting its documentation: "Then, if possible, convert to StringDtype, BooleanDtype or an appropriate integer extension type, otherwise leave as object."
Right now I am using the following script, which works, but I think it's too big for its purpose and a bit slow.
# Create df and clean it
# (the data normally comes from an Excel file; the dict is only for reproducibility)
data = {'Name': ['BPh1', 'BPh2', 'BPh3', 'BPh4', 'BPh5', 'BPh6', 'BPh7'],
        'BBB': ['2.00755**', '2.7766**', '0.490127**', '0.490127**', '0.87667**', '0.899189**', '3.084**'],
        'Buffer_solubility_mg_L': ['0.00112934**', '0.000798559**', '0.000218191**', '0.000122249**', '0.00382848**', '0.00109165**', '0.000665366**'],
        'CYP_2C19_inhibition': ['Inhibitor', 'Inhibitor', 'Non', 'Non', 'Inhibitor', 'Inhibitor', 'Inhibitor']}
ss = pd.DataFrame(data).replace(r"\*", '', regex=True)

# Convert dtype to float when possible
for col in ss.columns[1:]:
    print(col, '\n', ss[col].dtypes)
    try:
        ss[col] = pd.to_numeric(ss[col])
    except ValueError:
        pass
    print(ss[col].dtypes, '\n')
Is there a cleaner way to do this conversion?
I'd change/clean the values before creating the dataframe; that way you're not first creating one and then converting it to something else (which might save a little bit of time as well). The advantage is that you can do it in a single line. I don't think you can get much faster than this, given the input data that you have to work with.
import pandas

# Create df and clean it
data = {'Name': ['BPh1', 'BPh2', 'BPh3', 'BPh4', 'BPh5', 'BPh6', 'BPh7'],
        'BBB': ['2.00755**', '2.7766**', '0.490127**', '0.490127**', '0.87667**', '0.899189**', '3.084**'],
        'Buffer_solubility_mg_L': ['0.00112934**', '0.000798559**', '0.000218191**', '0.000122249**', '0.00382848**', '0.00109165**', '0.000665366**'],
        'CYP_2C19_inhibition': ['Inhibitor', 'Inhibitor', 'Non', 'Non', 'Inhibitor', 'Inhibitor', 'Inhibitor']}

# Perform the conversion on creation
df = pandas.DataFrame(
    {
        col: pandas.to_numeric([v.replace("*", "") for v in values], errors="ignore")
        for col, values in data.items()
    }
)
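For reference (a quick check of the result, assuming the sample data above), the numeric columns should now come out as float64:
print(df.dtypes)
# Name                       object
# BBB                       float64
# Buffer_solubility_mg_L    float64
# CYP_2C19_inhibition        object
# dtype: object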

Converting numpy ndarray with dtype <U30 into float

I'm reading a list from a pandas dataframe cell.
>>from pandas import DataFrame as table
>>x = table.loc[table['person'] == int(123), table.columns != 'xyz']['segment'][0]
>>print("X = ",x)
where 'person' and 'segment' are my column names and segment contains a list with floating values.
>>X = [[39.414, 39.498000000000005]]
Now, when I try to convert this into a numpy array,
>>x = numpy.asarray(x)
>>x=x.astype(float)
I get the following error
ValueError: could not convert string to float: '[[39.414, 39.498000000000005]]'
I have tried parsing the string and tried to remove any "\n" or " " or any unnecessary quotes, but it does not work. Then I tried to find the dtype
>>print("Dtype = ", x.dtype)
>>Dtype = <U30
I assume that we need to convert the U30 dtype into floats, but I am not sure how to do it. I am using numpy version 1.15.0.
All I want to do is, to parse the above list into a list with floating point values.
The datatype should have tipped you off. U30 here stands for a length-30 unicode string (which is what you'll see if you type len(x)).
What you have is the string representation of a list, not a list of strings/floats/etc.
You need to use the ast library here:
import ast
import numpy as np

x = '[[39.414, 39.498000000000005]]'
x = ast.literal_eval(x)   # parse the string into a nested Python list
np.array(x, dtype=float)
array([[39.414, 39.498]])
For the specific format you see, consider np.fromstring. With string slicing you can also remove the unused dimension:
x = '[[39.414, 39.498000000000005]]'
res = np.fromstring(x[2:-2], sep=',')
# array([ 39.414, 39.498])

Python structured numpy array multiple sort

Hello all, I have a list of delimiter-separated strings:
lists=['1|Abra|23|43|0','2|Cadabra|15|18|0','3|Grabra|4|421|0','4|Lol|1|15|0']
I need to convert it to a numpy array and then sort it just like Excel does: first by Column 3, then by Column 2, and finally by the last column.
I've tried this:
def man():
    a = np.array(lists[0].split('|'))
    for line in lists:
        temp = np.array(line.split('|'))
        a = np.concatenate((a, temp))
    a.sort(order=[0, 1])

man()
Of course no luck, because it is wrong! Unfortunately I'm not strong with numpy arrays. Can somebody help me please? :(
This works just perfectly for me, but here numpy builds the array from a file, so to make it work I've had to write my list of strings to a file, then read it back and convert it to an array:
import numpy as np

# let numpy guess the type with dtype=None
my_data = np.genfromtxt('Selector/tmp.txt', delimiter='|', dtype=None,
                        names=["Num", "Date", "Desc", "Rgh", "Prc", "Color", "Smb", "MType"])
my_data.sort(order=["Color", "Prc", "Rgh"])
print(my_data)
How can I keep everything as is but change the conversion step so that it builds the same array from the list instead of from a file?
There may be better solutions, but for a start I would sort the array once by each column in reverse order.
I assume you want to sort by column 3 and ties are resolved by column 2. Finally, remaining ties are resolved by the last column. Thus, you'd actually sort by the last column first, then by 2, then by 3.
Furthermore, you can easily convert the list to an array using a list comprehension.
import numpy as np

lists = ['1|Abra|23|43|0', '2|Cadabra|15|18|0', '3|Grabra|4|421|0', '4|Lol|1|15|0']

# convert to numpy array by splitting each row
a = np.array([l.split('|') for l in lists])

# specify columns to sort by, in order
sort_cols = [3, 2, -1]

# sort by the columns in reverse order;
# a stable sort is required so earlier keys break ties correctly
for sc in sort_cols[::-1]:
    order = np.argsort(a[:, sc], kind='stable')
    a = a[order]

print(a)
You can use a list comprehension to split your strings and convert the digits to int. Then create your numpy array with a proper structured dtype and use the np.sort() function, passing the expected order:
>>> delimit_strs = lists  # the '|'-delimited strings from the question
>>> dtype = [('1st', int), ('2nd', '|S7'), ('3rd', int), ('4th', int), ('5th', int)]
>>>
>>> a = np.array([tuple([int(i) if i.isdigit() else i for i in sub.split('|')]) for sub in delimit_strs], dtype=dtype)
>>> np.sort(a, axis=0, order=['3rd', '2nd', '5th'])
array([(4, 'Lol', 1, 15, 0), (3, 'Grabra', 4, 421, 0),
       (2, 'Cadabra', 15, 18, 0), (1, 'Abra', 23, 43, 0)],
      dtype=[('1st', '<i8'), ('2nd', 'S7'), ('3rd', '<i8'), ('4th', '<i8'), ('5th', '<i8')])
You can also do this in pure Python, which for shorter data sets is more efficient. Simply use the sorted() function, passing a proper key function:
from operator import itemgetter

sorted([[int(i) if i.isdigit() else i for i in sub.split('|')] for sub in delimit_strs],
       key=itemgetter(3, 2, 4))
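Regarding the follow-up about building the same array from the list instead of a file: np.genfromtxt also accepts any iterable of lines, so the list can be passed in directly. A sketch using the short sample list above (the field names here are made up, since the sample rows have only five fields):
import numpy as np

lists = ['1|Abra|23|43|0', '2|Cadabra|15|18|0', '3|Grabra|4|421|0', '4|Lol|1|15|0']

# genfromtxt treats the list as the lines of a "file"
my_data = np.genfromtxt(lists, delimiter='|', dtype=None, encoding=None,
                        names=["Num", "Name", "Col2", "Col3", "Col4"])
my_data.sort(order=["Col3", "Col2", "Col4"])
print(my_data)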

Why does function behavior used within pandas apply change?

I cannot figure out why a simple function:
def to_integer(value):
    if value == "":
        return None
    return int(value)
changes values from str to int only if there's no empty string "" in the dataframe, i.e. only if no value is to be returned as None.
If I go:
type(to_integer('1')) == int
returns True.
Now, using apply and to_integer with df1:
df1 = pd.DataFrame(['1', '2', '3'], columns=['integer'])
result = df1['integer'].apply(to_integer)
gives column of integers (np.int64).
But if I apply it to this df2:
df2 = pd.DataFrame(['1', '', '3'], columns=['integer'])
result = df2['integer'].apply(to_integer)
it returns a column of floats (np.float64).
Isn't it possible to have a dataframe with integers and None at the same time?
I use Python 3.3 and Pandas 0.12.
You are exactly right, it is not possible to have a series of ints and np.nan values.
NumPy represents missing values as np.nan, which is a float, so an integer series containing missing values is upcast to np.float64. See http://pandas.pydata.org/pandas-docs/dev/missing_data.html.
The relevant part of the documentation is as follows:
"While pandas supports storing arrays of integer and boolean type, these types are not capable of storing missing data. Until we can switch to using a native NA type in NumPy, we’ve established some “casting rules” when reindexing will cause missing data to be introduced into, say, a Series or DataFrame. Here they are:
data type    cast to
integer      float
boolean      object
float        no cast
object       no cast
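As an aside (this goes beyond the original answer): newer pandas versions provide the nullable 'Int64' extension dtype, which does allow integers and missing values to coexist. A minimal sketch based on the df2 example above:
import pandas as pd

df2 = pd.DataFrame(['1', '', '3'], columns=['integer'])

# coerce empty strings to missing values, then use the nullable integer dtype
result = pd.to_numeric(df2['integer'], errors='coerce').astype('Int64')
print(result)
# 0       1
# 1    <NA>
# 2       3
# Name: integer, dtype: Int64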
