How to pass a series to call a user defined function? - python

I am trying to pass a series to a user defined function and getting this error:
Function:
def scale(series):
sc=StandardScaler()
sc.fit_transform(series)
print(series)
Code for calling:
df['Value'].apply(scale) # df['Value'] is a Series having float dtype.
Error:
ValueError: Expected 2D array, got scalar array instead:
array=28.69.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Can anyone help address this issue?

The method apply will apply a function to each element in the Series (or in case of a DataFrame either each row or each column depending on the chosen axis). Here you expect your function to process the entire Series and to output a new Series in its stead.
You can therefore simply run:
StandardScaler().fit_transform(df['Value'].values.reshape(-1, 1))
StandardScaler excepts a 2D array as input where each row is a sample input that consists of one or more features. Even it is just a single feature (as seems to be the case in your example) it has to have the right dimensions. Therefore, before handing over your Series to sklearn I am accessing the values (the numpy representation) and reshaping it accordingly.
For more details on reshape(-1, ...) check this out: What does -1 mean in numpy reshape?
Now, the best bit. If your entire DataFrame consists of a single column you could simply do:
StandardScaler().fit_transform(df)
And even if it doesn't, you could still avoid the reshape:
StandardScaler().fit_transform(df[['Value']])
Note how in this case 'Value' is surrounded by 2 sets of braces so this time it is not a Series but rather a DataFrame with a subset of the original columns (in case you do not want to scale all of them). Since a DataFrame is already 2-dimensional, you don't need to worry about reshaping.
Finally, if you want to scale just some of the columns and update your original DataFrame all you have to do is:
>>> df = pd.DataFrame({'A': [1,2,3], 'B': [0,5,6], 'C': [7, 8, 9]})
>>> columns_to_scale = ['A', 'B']
>>> df[columns_to_scale] = StandardScaler().fit_transform(df[columns_to_scale])
>>> df
A B C
0 -1.224745 -1.397001 7
1 0.000000 0.508001 8
2 1.224745 0.889001 9

Related

Remove row based on sum of numpy array within each entry in df column

I feel I'm making this harder than it should be: what I have is a dataframe with some columns whose entries each contain numpy arrays (the names of the columns containing these arrays is in an array called names_of_cols_that_contain_arrays). What I want to do is filter out rows for which these numpy arrays have a sum value of zero. This is a similar question on which my code is based, but it doesn't seem to work with the iterator over rows in each column.
What I have currently in my code is
for col_name in names_of_cols_that_contain_arrays:
for i in range(len(df[col_name])):
df = df[df[col_name][i].sum() > 0.0]
which doesn't seem that efficient but is a first attempt that explictly goes through what I thought would be the correct method. But this appears to return a boolean, i.e.
Traceback
...
KeyError: True
In fact in most cases to the code above I get some error associated with a boolean being returned. Any pointers would be appreciated, thanks in advance!
IIUC:
You can try:
df=df.loc[df['names_of_cols_that_contain_arrays'].map(sum)>0]
#OR
df=df.loc[df['names_of_cols_that_contain_arrays'].map(np.sum).gt(0)]
Sample dataframe used:
from numpy import array
d={'names_of_cols_that_contain_arrays': {0: array([-1, 0, -8]),
1: array([-1, -2, 5])}}
df=pd.DataFrame(d)

Pandas concat: why is the `DataFrame` with duplicated index not working with concat()?

Code example:
a = pd.DataFrame({"a": [1,2,3],}, index=[1,2,2])
b = pd.DataFrame({"b": [1,4,5],}, index=[1,4,5])
pd.concat([a, b], axis=1)
It raises error: ValueError: Shape of passed values is (7, 2), indices imply (5, 2)
What I expected as a result:
Why does it not return like this? concat's default joining is outer so I think my thought is reasonable enough... Am I missing something?
TLDR: Why? I don't really know for sure, but I think it has to do with just the design of the package.
An index in pandas "is like an address, that’s how any data point across the dataframe or series can be accessed. Rows and columns both have indexes, rows indices are called as index and for columns its general column names." source
Now you are doing it where axis = 1, aka along the vertical axis. That means that we have an address which points to two different values. Hence we can still "access" these values by doing a[a.index == 2]. Do note however the index in a mathematical sense is now not a proper function because one value maps to two different values source. I am guessing the implementation was designed so that indices would be injective, surjective, or bijective in order to make it easier to design.
Thus, when attempting to concatenate, pandas wants to match all the indices together where possible and fill in nans where not possible. However, as the error says, it thinks the shape based off the indices is (5, 2) because of this address sharing two different values. So why doesn't it work? Because I believe pandas checks the shape it should be before hand, and then does the concatenation. In order to check the shape before hand it looks at the indices and therefore it breaks when it checks.
Do note too that this would not work with identical column names as well:
a = pd.DataFrame({"a": [1,2,3], 'b': [9,8,7]}, index=[1,2,2])
b = pd.DataFrame({"b": [1,4,5], 'bx': [1,4,3]}, index=[1,4,5]).rename(columns={'bx': 'b'})
pd.concat([a,b]) # axis=0 is the default
ValueError: Plan shapes are not aligned
Therefore pd.concat needs unique indices along whichever axis it is operating upon. You can't have two identical column names when you normally concatenate row wise, and likewise you can't be able to do it column wise.
Interestingly, for your original example, pd.concat([a, b], ignore_index=True, axis=1) also raises the same error, leading me to more strongly suspect that pandas is checking the shape before the concatenation.

Multiplying columns by another column in a dataframe

(Full disclosure that this is related to another question I asked, so bear with me if I should have appended it to what I wrote previously, even though the problem is different.)
I have a dataframe consisting of a column of weights and columns containing binary values of 0 and 1. I'd like to multiply every column within the dataframe by the weights column. However, I seem to be replacing every column within the dataframe with the weight column. I'm sure I'm missing something incredibly stupid/basic here--I'm rather new to pandas and python as a whole. What am I doing wrong?
celebfile = pd.read_csv(celebcsv)
celebframe = pd.DataFrame(celebfile)
behaviorfile = pd.read_csv(behaviorcsv)
behaviorframe = pd.DataFrame(behaviorfile)
celebbehavior = pd.merge(celebframe, behaviorframe, how ='inner', on = 'RespID')
celebbehavior2 = celebbehavior.copy()
def multiplycolumns(column):
for column in celebbehavior:
return celebbehavior[column]*celebbehavior['WEIGHT']
celebbehavior2 = celebbehavior2.apply(lambda column: multiplycolumns(column), axis=0)
print(celebbehavior2.head())
You have return statement in a for loop, which means the for loop is executed only once, to multiply a data frame with a column, you can use mul method with the correct axis parameter:
celebbehavior.mul(celebbehavior['WEIGHT'], axis=0)
read_csv
returns a pd.DataFrame... Not necessary to use pd.DataFrame on top of it.
mul with axis=0
You can use apply but that is awkward. Use mul(axis=0)... This should be all you need.
df = pd.read_csv(celebcsv).merge(pd.read_csv(behaviorcsv), on='RespID')
df = df.mul(df.WEIGHT, 0)
?
You said that it looks like you are just replacing with the weights column? Are you other columns all ones?
you can use the `mul' method to multiply the columns. However, just fyi if you do want to use apply you can bear in mind the following:
The apply function passes each series in the dataframe to the function. This looping is inherent to the apply function. Therefore first thing to say is that your loop within the function is redundant. Also you have a return statement inside it which is causing the behavior you do not want.
If each column is passed as the argument automatically all you need to do is tell the function what to multiply it by. In this case your weights series.
Here is an implementation using apply. Of course the undesirable here is that the weights are also multiplpied by themselves:
df = pd.DataFrame({'1' : [1, 1, 0, 1],
'2' : [0, 0, 1, 0],
'weights' : [0.5, 0.25, 0.1, 0.05]})
def multiply_columns(column, weights):
return column * weights
df.apply(lambda x: multiply_columns(x, df['weights']))

Why is 'series' type not a special case of 'dataframe' type in pandas? [duplicate]

In Pandas, when I select a label that only has one entry in the index I get back a Series, but when I select an entry that has more then one entry I get back a data frame.
Why is that? Is there a way to ensure I always get back a data frame?
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
In [3]: type(df.loc[3])
Out[3]: pandas.core.frame.DataFrame
In [4]: type(df.loc[1])
Out[4]: pandas.core.series.Series
Granted that the behavior is inconsistent, but I think it's easy to imagine cases where this is convenient. Anyway, to get a DataFrame every time, just pass a list to loc. There are other ways, but in my opinion this is the cleanest.
In [2]: type(df.loc[[3]])
Out[2]: pandas.core.frame.DataFrame
In [3]: type(df.loc[[1]])
Out[3]: pandas.core.frame.DataFrame
The TLDR
When using loc
df.loc[:] = Dataframe
df.loc[int] = Dataframe if you have more than one column and Series if you have only 1 column in the dataframe
df.loc[:, ["col_name"]] = Dataframe if you have more than one row and Series if you have only 1 row in the selection
df.loc[:, "col_name"] = Series
Not using loc
df["col_name"] = Series
df[["col_name"]] = Dataframe
You have an index with three index items 3. For this reason df.loc[3] will return a dataframe.
The reason is that you don't specify the column. So df.loc[3] selects three items of all columns (which is column 0), while df.loc[3,0] will return a Series. E.g. df.loc[1:2] also returns a dataframe, because you slice the rows.
Selecting a single row (as df.loc[1]) returns a Series with the column names as the index.
If you want to be sure to always have a DataFrame, you can slice like df.loc[1:1]. Another option is boolean indexing (df.loc[df.index==1]) or the take method (df.take([0]), but this used location not labels!).
Use df['columnName'] to get a Series and df[['columnName']] to get a Dataframe.
You wrote in a comment to joris' answer:
"I don't understand the design
decision for single rows to get converted into a series - why not a
data frame with one row?"
A single row isn't converted in a Series.
It IS a Series: No, I don't think so, in fact; see the edit
The best way to think about the pandas data structures is as flexible
containers for lower dimensional data. For example, DataFrame is a
container for Series, and Panel is a container for DataFrame objects.
We would like to be able to insert and remove objects from these
containers in a dictionary-like fashion.
http://pandas.pydata.org/pandas-docs/stable/overview.html#why-more-than-1-data-structure
The data model of Pandas objects has been choosen like that. The reason certainly lies in the fact that it ensures some advantages I don't know (I don't fully understand the last sentence of the citation, maybe it's the reason)
.
Edit : I don't agree with me
A DataFrame can't be composed of elements that would be Series, because the following code gives the same type "Series" as well for a row as for a column:
import pandas as pd
df = pd.DataFrame(data=[11,12,13], index=[2, 3, 3])
print '-------- df -------------'
print df
print '\n------- df.loc[2] --------'
print df.loc[2]
print 'type(df.loc[1]) : ',type(df.loc[2])
print '\n--------- df[0] ----------'
print df[0]
print 'type(df[0]) : ',type(df[0])
result
-------- df -------------
0
2 11
3 12
3 13
------- df.loc[2] --------
0 11
Name: 2, dtype: int64
type(df.loc[1]) : <class 'pandas.core.series.Series'>
--------- df[0] ----------
2 11
3 12
3 13
Name: 0, dtype: int64
type(df[0]) : <class 'pandas.core.series.Series'>
So, there is no sense to pretend that a DataFrame is composed of Series because what would these said Series be supposed to be : columns or rows ? Stupid question and vision.
.
Then what is a DataFrame ?
In the previous version of this answer, I asked this question, trying to find the answer to the Why is that? part of the question of the OP and the similar interrogation single rows to get converted into a series - why not a data frame with one row? in one of his comment,
while the Is there a way to ensure I always get back a data frame? part has been answered by Dan Allan.
Then, as the Pandas' docs cited above says that the pandas' data structures are best seen as containers of lower dimensional data, it seemed to me that the understanding of the why would be found in the characteristcs of the nature of DataFrame structures.
However, I realized that this cited advice must not be taken as a precise description of the nature of Pandas' data structures.
This advice doesn't mean that a DataFrame is a container of Series.
It expresses that the mental representation of a DataFrame as a container of Series (either rows or columns according the option considered at one moment of a reasoning) is a good way to consider DataFrames, even if it isn't strictly the case in reality. "Good" meaning that this vision enables to use DataFrames with efficiency. That's all.
.
Then what is a DataFrame object ?
The DataFrame class produces instances that have a particular structure originated in the NDFrame base class, itself derived from the PandasContainer base class that is also a parent class of the Series class.
Note that this is correct for Pandas until version 0.12. In the upcoming version 0.13, Series will derive also from NDFrame class only.
# with pandas 0.12
from pandas import Series
print 'Series :\n',Series
print 'Series.__bases__ :\n',Series.__bases__
from pandas import DataFrame
print '\nDataFrame :\n',DataFrame
print 'DataFrame.__bases__ :\n',DataFrame.__bases__
print '\n-------------------'
from pandas.core.generic import NDFrame
print '\nNDFrame.__bases__ :\n',NDFrame.__bases__
from pandas.core.generic import PandasContainer
print '\nPandasContainer.__bases__ :\n',PandasContainer.__bases__
from pandas.core.base import PandasObject
print '\nPandasObject.__bases__ :\n',PandasObject.__bases__
from pandas.core.base import StringMixin
print '\nStringMixin.__bases__ :\n',StringMixin.__bases__
result
Series :
<class 'pandas.core.series.Series'>
Series.__bases__ :
(<class 'pandas.core.generic.PandasContainer'>, <type 'numpy.ndarray'>)
DataFrame :
<class 'pandas.core.frame.DataFrame'>
DataFrame.__bases__ :
(<class 'pandas.core.generic.NDFrame'>,)
-------------------
NDFrame.__bases__ :
(<class 'pandas.core.generic.PandasContainer'>,)
PandasContainer.__bases__ :
(<class 'pandas.core.base.PandasObject'>,)
PandasObject.__bases__ :
(<class 'pandas.core.base.StringMixin'>,)
StringMixin.__bases__ :
(<type 'object'>,)
So my understanding is now that a DataFrame instance has certain methods that have been crafted in order to control the way data are extracted from rows and columns.
The ways these extracting methods work are described in this page:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing
We find in it the method given by Dan Allan and other methods.
Why these extracting methods have been crafted as they were ?
That's certainly because they have been appraised as the ones giving the better possibilities and ease in data analysis.
It's precisely what is expressed in this sentence:
The best way to think about the pandas data structures is as flexible
containers for lower dimensional data.
The why of the extraction of data from a DataFRame instance doesn't lies in its structure, it lies in the why of this structure. I guess that the structure and functionning of the Pandas' data structure have been chiseled in order to be as much intellectually intuitive as possible, and that to understand the details, one must read the blog of Wes McKinney.
If the objective is to get a subset of the data set using the index, it is best to avoid using loc or iloc. Instead you should use syntax similar to this :
df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
result = df[df.index == 3]
isinstance(result, pd.DataFrame) # True
result = df[df.index == 1]
isinstance(result, pd.DataFrame) # True
every time we put [['column name']] it returns Pandas DataFrame object,
if we put ['column name'] we got Pandas Series object
If you also select on the index of the dataframe then the result can be either a DataFrame or a Series or it can be a Series or a scalar (single value).
This function ensures that you always get a list from your selection (if the df, index and column are valid):
def get_list_from_df_column(df, index, column):
df_or_series = df.loc[index,[column]]
# df.loc[index,column] is also possible and returns a series or a scalar
if isinstance(df_or_series, pd.Series):
resulting_list = df_or_series.tolist() #get list from series
else:
resulting_list = df_or_series[column].tolist()
# use the column key to get a series from the dataframe
return(resulting_list)

A series query on groupby function

I have a data frame called active and it has 10 unique POS column values.
Then I group POS values and mean normalize the OPW columns and then store the normalized values as a seperate column ['resid'].
If I groupby on POS values shouldnt the new active data frame's POS columns contain only unique POS values??
For example:
df2 = pd.DataFrame({'X' : ['B', 'B', 'A', 'A'], 'Y' : [1, 2, 3, 4]})
print df2
df2.groupby(['X']).sum()
I get an output like this:
Y
X
A 7
B 3
In my example shouldn't I get an column with only unique Pos values as mentioned below??
POS Other Columns
Rf values
2B values
LF values
2B values
OF values
I can't be 100% sure without the actual data, but i'm pretty sure that the problem here is that you are not aggregating the data.
Let's go through the groupby step by step.
When you do active.groupby('POS'), what's actually happening is that you are slicing the dataframe per each unique POS, and passing each of these sclices, sequentially, to the applied function.
You can get a better vision of what's happening by using get_group (ex : active.groupby('POS').get_group('RF') )
So you're applying your meanNormalizeOPW function to each of those slices. That function creates a mean normalized value of the column 'resid' for each line of the passed dataframe. And you return that dataframe, ending with a shape that is similar than what was passed.
So if you just add an aggregation function to the returned df, it should work fine. I guess here you want a mean, so just change return df into return df.mean()

Categories

Resources