Pandas DataFrame to Numpy Array ValueError - python

I am trying to convert a single column of a dataframe to a numpy array. Converting the entire dataframe has no issues.
df
viz a1_count a1_mean a1_std
0 0 3 2 0.816497
1 1 0 NaN NaN
2 0 2 51 50.000000
Both of these functions work fine:
X = df.as_matrix()
X = df.as_matrix(columns=df.columns[1:])
However, when I try:
y = df.as_matrix(columns=df.columns[0])
I get:
TypeError: Index(...) must be called with a collection of some kind, 'viz' was passed

The problem is that you are passing a single element, which in this case is just the string label of that column. If you wrap it in a list with a single element, it works:
In [97]:
y = df.as_matrix(columns=[df.columns[0]])
y
Out[97]:
array([[0],
[1],
[0]], dtype=int64)
Here is what you're passing:
In [101]:
df.columns[0]
Out[101]:
'viz'
So it's equivalent to this:
y = df.as_matrix(columns='viz')
which results in the same error
The docs show the expected params:
DataFrame.as_matrix(columns=None)
Convert the frame to its Numpy-array representation.
Parameters:
columns : list, optional, default: None
If None, return all columns, otherwise, returns specified columns

as_matrix expects a list for the columns keyword and df.columns[0] isn't a list. Try
df.as_matrix(columns=[df.columns[0]]) instead.

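Using the Index tolist method works as well. Note that df.columns[0] is a plain string, which has no tolist method, so take a one-element slice of the Index instead:
df.as_matrix(columns=df.columns[0:1].tolist())
When passing multiple columns, for example the first ten, the command
df.as_matrix(columns=[df.columns[0:10]])
does not work, because the slice returns an Index and the brackets wrap that Index in a list. However, using
df.as_matrix(columns=df.columns[0:10].tolist())
works well.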
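A note for readers on current pandas: as_matrix was deprecated and has since been removed (it is no longer available as of pandas 1.0), so the calls above only run on old versions. The following is a minimal sketch of the same selections using to_numpy. The frame is made-up data shaped like the one in the question, not the asker's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical data shaped like the frame in the question.
df = pd.DataFrame({
    "viz": [0, 1, 0],
    "a1_count": [3, 0, 2],
    "a1_mean": [2.0, np.nan, 51.0],
})

# A list of labels selects a DataFrame, so the result is 2-D.
y_2d = df[[df.columns[0]]].to_numpy()

# A single label selects a Series, so the result is 1-D.
y_1d = df[df.columns[0]].to_numpy()

# Every column except the first, like as_matrix(columns=df.columns[1:]).
X = df[df.columns[1:]].to_numpy()

print(y_2d.shape, y_1d.shape, X.shape)
```

Whether the 1-D or the 2-D form is wanted depends on the consumer; many estimators want a 1-D target and a 2-D feature matrix.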

Related

How to pass a series to call a user defined function?

I am trying to pass a series to a user defined function and getting this error:
Function:
from sklearn.preprocessing import StandardScaler

def scale(series):
    sc = StandardScaler()
    sc.fit_transform(series)
    print(series)
Code for calling:
df['Value'].apply(scale) # df['Value'] is a Series having float dtype.
Error:
ValueError: Expected 2D array, got scalar array instead:
array=28.69.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Can anyone help address this issue?
The method apply will apply a function to each element in the Series (or in case of a DataFrame either each row or each column depending on the chosen axis). Here you expect your function to process the entire Series and to output a new Series in its stead.
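To make the element-wise behaviour concrete, here is a small sketch that records what Series.apply actually hands to the function. The helper name spy and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([28.69, 30.10, 25.00], name="Value")

received = []  # every argument the function is called with


def spy(value):
    # Hypothetical helper: remember the argument, return it unchanged.
    received.append(value)
    return value


s.apply(spy)

# One call per element, and each argument is a scalar (0 dimensions),
# never the whole Series.
print(len(received), [np.ndim(v) for v in received])
```

This is why StandardScaler complains about a scalar array: inside apply it only ever sees one number at a time.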
You can therefore simply run:
StandardScaler().fit_transform(df['Value'].values.reshape(-1, 1))
StandardScaler expects a 2D array as input, where each row is a sample that consists of one or more features. Even if it is just a single feature (as seems to be the case in your example), the input has to have the right dimensions. Therefore, before handing your Series over to sklearn, I am accessing the values (the numpy representation) and reshaping them accordingly.
For more details on reshape(-1, ...) check this out: What does -1 mean in numpy reshape?
Now, the best bit. If your entire DataFrame consists of a single column you could simply do:
StandardScaler().fit_transform(df)
And even if it doesn't, you could still avoid the reshape:
StandardScaler().fit_transform(df[['Value']])
Note how in this case 'Value' is surrounded by two sets of square brackets, so this time it is not a Series but rather a DataFrame with a subset of the original columns (in case you do not want to scale all of them). Since a DataFrame is already 2-dimensional, you don't need to worry about reshaping.
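The difference between the three selections is easiest to see from their shapes. A minimal sketch with made-up data (the column name Other is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"Value": [28.69, 30.10, 25.00], "Other": [1, 2, 3]})

as_series = df["Value"]                       # single brackets: 1-D Series
reshaped = df["Value"].values.reshape(-1, 1)  # numpy array made 2-D by hand
as_frame = df[["Value"]]                      # double brackets: 2-D DataFrame

print(as_series.shape, reshaped.shape, as_frame.shape)
```

Only the last two have the (n_samples, n_features) layout that a scaler expecting 2D input can accept.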
Finally, if you want to scale just some of the columns and update your original DataFrame all you have to do is:
>>> df = pd.DataFrame({'A': [1,2,3], 'B': [0,5,6], 'C': [7, 8, 9]})
>>> columns_to_scale = ['A', 'B']
>>> df[columns_to_scale] = StandardScaler().fit_transform(df[columns_to_scale])
>>> df
A B C
0 -1.224745 -1.397001 7
1 0.000000 0.508001 8
2 1.224745 0.889001 9

ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series

I'm using Pandas 0.20.3 with Python 3.x. I want to add a column to a pandas data frame from another pandas data frame. Both data frames contain 51 rows, so I used the following code:
class_df['phone']=group['phone'].values
I got following error message:
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series
class_df.dtypes gives me:
Group_ID object
YEAR object
Terget object
phone object
age object
and type(group['phone']) returns pandas.core.series.Series
Can you suggest what changes I need to make to remove this error?
The first 5 rows of group['phone'] are given below:
0 [735015372, 72151508105, 7217511580, 721150431...
1 []
2 [735152771, 7351515043, 7115380870, 7115427...
3 [7111332015, 73140214, 737443075, 7110815115...
4 [718218718, 718221342, 73551401, 71811507...
Name: phone, dtype: object
In most cases, this error occurs when the dataframe you are working with is empty. The approach that worked best for me was to check whether the dataframe is empty before using apply():
if len(df) != 0:
    df['indicator'] = df.apply(assign_indicator, axis=1)
You have a column of ragged lists. Your only option is to assign a list of lists, and not an array of lists (which is what .values gives).
class_df['phone'] = group['phone'].tolist()
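A self-contained sketch of that assignment, using small made-up stand-ins for class_df and group (three rows instead of 51, invented phone numbers):

```python
import pandas as pd

# Hypothetical stand-ins for the frames in the question.
group = pd.DataFrame({"phone": [[735015372, 7217511580], [], [735152771]]})
class_df = pd.DataFrame({"Group_ID": ["g1", "g2", "g3"]})

# tolist() gives a plain Python list whose items are the per-row lists,
# so each row of the new column receives one (possibly empty) list.
class_df["phone"] = group["phone"].tolist()

print(class_df)
```

The resulting column has dtype object, with one list per row, and the lists may have different lengths.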
The error in the question title,
"ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series"
may also occur if, for whatever reason, the table does not have any rows.
Instead of using an if-statement, you can set the result_type argument of the apply() function to "reduce".
df['new_column'] = df.apply(func, axis=1, result_type='reduce')
The data assigned to a DataFrame column must be a one-dimensional array. For example, consider an array num_arr that is to be added to a DataFrame:
num_arr.shape
(1, 126)
Before num_arr can be added as a DataFrame column, it has to be reshaped:
num_arr = num_arr.reshape(-1, )
num_arr.shape
(126,)
Now the array can be set as a DataFrame column:
df = pd.DataFrame()
df['numbers'] = num_arr
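The same steps as a runnable sketch. The contents of num_arr are not given in the answer, so np.arange is used here purely as a placeholder with the same (1, 126) shape.

```python
import numpy as np
import pandas as pd

# Placeholder data with the shape described above.
num_arr = np.arange(126).reshape(1, 126)

# Collapse the leading axis of length 1; ravel() would also work here.
flat = num_arr.reshape(-1)

df = pd.DataFrame()
df["numbers"] = flat

print(num_arr.shape, flat.shape, len(df))
```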

Pandas: how to identify columns with dtype object but mixed-type items?

In a pandas dataframe, a column with dtype = object can, in fact, contain items of mixed types, eg integers and strings.
In this example, column a is dtype object, but the first item is string while all the others are int:
import numpy as np, pandas as pd
df=pd.DataFrame()
df['a']=np.arange(0,9)
df.iloc[0,0]='test'
print(df.dtypes)
print(type(df.iloc[0,0]))
print(type(df.iloc[1,0]))
My question is: is there a quick way to identify which columns with dtype=object contain, in fact, mixed types like above? Since pandas does not have a dtype = str, this is not immediately apparent.
However, I have had situations where, importing a large csv file into pandas, I would get a warning like:
sys:1: DtypeWarning: Columns (15,16) have mixed types. Specify dtype option on import or set low_memory=False
Is there an easy way to replicate that and explicitly list the columns with mixed types? Or do I manually have to go through them one by one, see if I can convert them to string, etc?
The background is that I am trying to export a dataframe to a Microsoft SQL Server using DataFrame.to_sql and SQLAlchemy. I get an
OverflowError: int too big to convert
but my dataframe does not contain columns with dtype int - only object and float64. I'm guessing this is because one of the object columns must have both strings and integers.
Setup
df = pd.DataFrame(np.ones((3, 3)), columns=list('WXY')).assign(Z='c')
df.iloc[0, 0] = 'a'
df.iloc[1, 2] = 'b'
df
W X Y Z
0 a 1.0 1 c
1 1 1.0 b c
2 1 1.0 1 c
Solution
Find all types and count how many unique ones per column.
df.loc[:, df.applymap(type).nunique().gt(1)]
W Y
0 a 1
1 1 b
2 1 1
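A variant of the same idea, sketched per column with Series.map, which may be useful because applymap has been deprecated in recent pandas releases in favour of DataFrame.map. The frame below is built directly from lists (rather than by assigning strings into float columns) and only mirrors the setup above.

```python
import pandas as pd

# Same layout as the setup above, constructed directly.
df = pd.DataFrame({
    "W": ["a", 1.0, 1.0],
    "X": [1.0, 1.0, 1.0],
    "Y": [1.0, "b", 1.0],
    "Z": ["c", "c", "c"],
})

# Number of distinct Python types found in each column.
type_counts = df.apply(lambda col: col.map(type).nunique())

# Columns holding more than one type are the mixed ones.
mixed_columns = type_counts[type_counts > 1].index.tolist()
print(mixed_columns)
```

Returning the column labels rather than the sub-frame makes it easy to loop over the offenders and inspect or convert them one by one.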

Select row from a DataFrame based on the type of the object(i.e. str)

So there's a DataFrame say:
>>> df = pd.DataFrame({
... 'A':[1,2,'Three',4],
... 'B':[1,'Two',3,4]})
>>> df
A B
0 1 1
1 2 Two
2 Three 3
3 4 4
I want to select the rows where the value in a particular column is of type str.
For example I want to select the row where type of data in the column A is a str.
so it should print something like:
A B
2 Three 3
The intuitive code would be something like:
df[type(df.A) == str]
which obviously doesn't work!
Thanks please help!
This works:
df[df['A'].apply(lambda x: isinstance(x, str))]
You can do something similar to what you're asking with
In [14]: df[pd.to_numeric(df.A, errors='coerce').isnull()]
Out[14]:
A B
2 Three 3
Why only similar? Because Pandas stores things in homogeneous columns (all entries in a column are of the same type). Even though you constructed the DataFrame from heterogeneous types, they are all made into columns each of the lowest common denominator:
In [16]: df.A.dtype
Out[16]: dtype('O')
Consequently, you can't ask which rows are of what type - they will all be of the same type. What you can do is to try to convert the entries to numbers, and check where the conversion failed (this is what the code above does).
It's generally a bad idea to use a series to hold mixed numeric and non-numeric types. This will cause your series to have dtype object, which is nothing more than a sequence of pointers, much like a list. Indeed, many operations on such series can be processed more efficiently with a list.
With this disclaimer, you can use Boolean indexing via a list comprehension:
res = df[[isinstance(value, str) for value in df['A']]]
print(res)
A B
2 Three 3
The equivalent is possible with pd.Series.apply, but this is no more than a thinly veiled loop and may be slower than the list comprehension:
res = df[df['A'].apply(lambda x: isinstance(x, str))]
If you are certain all non-numeric values must be strings, then you can convert to numeric and look for nulls, i.e. values that cannot be converted:
res = df[pd.to_numeric(df['A'], errors='coerce').isnull()]
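The caveat matters because the two approaches answer different questions. A small sketch with a made-up column that also contains a missing value and a numeric-looking string:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, "Three", None, "4"]})

# True where the stored object really is a str.
is_str = [isinstance(v, str) for v in df["A"]]

# True where the value cannot be read as a number (includes missing values).
not_numeric = pd.to_numeric(df["A"], errors="coerce").isnull()

print(df[is_str])
print(df[not_numeric])
```

Here the isinstance test selects "Three" and "4", while the to_numeric test selects "Three" and the missing value, since "4" parses as a number and None does not.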

Convert all elements in float Series to integer

I have a column of float values in a dataframe (so I am calling this column a float Series). I want to convert all the values to integers, or just round them up so that there are no decimals.
Let us say the dataframe is df and the column is a, I tried this :
df['a'] = round(df['a'])
I got an error saying this method can't be applied to a Series, only applicable to individual values.
Next I tried this :
for obj in df['a']:
    obj = int(round(obj))
After this I printed df but there was no change.
Where am I going wrong?
round won't work because it is being called on a pandas Series, which is array-like rather than a scalar value. There is the built-in method pd.Series.round to operate on the whole Series, after which you can change the dtype using astype:
In [43]:
df = pd.DataFrame({'a':np.random.randn(5)})
df['a'] = df['a'] * 100
df
Out[43]:
a
0 -4.489462
1 -133.556951
2 -136.397189
3 -106.993288
4 -89.820355
In [45]:
df['a'] = df['a'].round(0).astype(int)
df
Out[45]:
a
0 -4
1 -134
2 -136
3 -107
4 -90
Also, it's unnecessary to iterate over the rows when vectorised methods are available.
Also, this:
for obj in df['a']:
    obj = int(round(obj))
does not mutate the individual cell in the Series; it operates on a copy of the value, which is why the df is not mutated.
The code in your loop:
obj = int(round(obj))
Only changes which object the name obj refers to. It does not modify the data stored in the series. If you want to do this you need to know where in the series the data is stored and update it there.
E.g.
for i, num in enumerate(df['a']):
    # write back by position; the loop variable is num, not obj
    df.iloc[i, df.columns.get_loc('a')] = int(round(num))
When converting a float to an integer, I found out using df.dtypes that the column I was trying to round off was an object not a float. The round command won't work on objects so to do the conversion I did:
df['a'] = pd.to_numeric(df['a'])
df['a'] = df['a'].round(0).astype(int)
or as one line:
df['a'] = pd.to_numeric(df['a']).round(0).astype(int)
If you specifically want to round up as your question states, you can use np.ceil:
import numpy as np
df['a'] = np.ceil(df['a'])
See also Floor or ceiling of a pandas series in python?
Not sure there's much advantage to type converting to int; pandas and numpy love floats.
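Since "round", "round up" and "convert to integer" are easy to mix up, here is a short sketch comparing the three on made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.2, 2.5, -0.7])

rounded = s.round().astype(int)   # nearest integer; exact halves go to the even neighbour
ceiled = np.ceil(s).astype(int)   # always rounds up
truncated = s.astype(int)         # drops the fractional part (toward zero)

print(rounded.tolist(), ceiled.tolist(), truncated.tolist())
```

Note that astype(int) raises an error if the column contains NaN, so missing values need to be filled or dropped first.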
