Can't use loc with DatetimeIndex - python

I can't select using loc when there is a DatetimeIndex.
test = pd.DataFrame(data=np.array([[0, 0], [0, 2], [1, 3]]), columns=pd.date_range(start='2019-01-01', end='2019-01-02', freq='D'))
test.loc[test>1, '2019-01-02']
I expect it to return pandas.Series([2, 3]), but it returns the error "Cannot index with multidimensional key"

In this case, your index is not a DatetimeIndex, only your columns are. The issue is that test>1 returns a DataFrame with the same shape as test, holding a Boolean for each cell that says whether the value is > 1. When you pass Booleans as the row selector, loc expects a 1-dimensional array, but since you're passing it a DataFrame (2-dimensional), you get the "multidimensional key" error. I believe what you want here is:
test.loc[test['2019-01-02']>1, '2019-01-02']
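A minimal sketch of the same fix, assuming the `test` frame from the question. Using an explicit `pd.Timestamp` as the column label is an illustrative choice, not part of the original answer; it avoids relying on string-to-date parsing of the column labels.

```python
import numpy as np
import pandas as pd

# Same data as in the question; only the columns form a DatetimeIndex.
test = pd.DataFrame(
    data=np.array([[0, 0], [0, 2], [1, 3]]),
    columns=pd.date_range(start="2019-01-01", end="2019-01-02", freq="D"),
)

col = pd.Timestamp("2019-01-02")  # explicit column label (illustrative assumption)

frame_mask = test > 1     # 2-D Boolean DataFrame, not usable as a row selector
row_mask = test[col] > 1  # 1-D Boolean Series, one entry per row

result = test.loc[row_mask, col]
print(frame_mask.shape, row_mask.shape)
print(result.tolist())
```

The row selector has to be one Boolean per row, which is why the comparison is made on a single column rather than on the whole frame.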

Related

Select values from different Dataframe column based on a list of index

Here is my issue, I have a dataframe, let's say:
df = pd.DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3, 5]})
I also have a list of index:
idx = [1, 2]
I would like to store in a list, the corresponding value in each column.
Meaning I want the value at index 1 of the first column and the value at index 2 of the second column.
I'm sure there is a simple answer to my issue, but I keep mixing everything up with iloc and cannot find an optimized method for my case (I have 1000 rows and 4 columns).
IIUC, you can extract the complete rows and then pick the diagonal elements:
result = np.diag(df.values[idx])
Alternative: convert the DataFrame to a NumPy array, then use NumPy indexing to access the required values:
result = df.values[idx, range(len(df.columns))]
OUTPUT:
array([6, 3])
Use:
list(df.values[idx, range(len(idx))])
Output:
[6, 3]
Here is a different way:
df.stack().loc[list(zip(idx,df.columns[:(len(idx))]))].to_numpy()
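The NumPy-based answers above can be sketched side by side as follows. This assumes the default integer index from the question, so that the entries of `idx` are also row positions; the variable names `cols`, `paired` and `diagonal` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [5, 6, 3, 4], "B": [1, 2, 3, 5]})
idx = [1, 2]  # row for column A, then row for column B

values = df.to_numpy()
cols = np.arange(len(idx))  # column positions 0, 1, ... paired with idx

paired = values[idx, cols]       # picks values[1, 0] and values[2, 1]
diagonal = np.diag(values[idx])  # same values via the diagonal of the selected rows

print(paired.tolist(), diagonal.tolist())
```

Pairing row and column positions directly avoids building the intermediate square block that the diagonal approach needs, which may matter when `idx` is long.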

How to pass a series to call a user defined function?

I am trying to pass a series to a user defined function and getting this error:
Function:
def scale(series):
    sc=StandardScaler()
    sc.fit_transform(series)
    print(series)
Code for calling:
df['Value'].apply(scale) # df['Value'] is a Series having float dtype.
Error:
ValueError: Expected 2D array, got scalar array instead:
array=28.69.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Can anyone help address this issue?
The method apply will apply a function to each element in the Series (or in case of a DataFrame either each row or each column depending on the chosen axis). Here you expect your function to process the entire Series and to output a new Series in its stead.
You can therefore simply run:
StandardScaler().fit_transform(df['Value'].values.reshape(-1, 1))
StandardScaler expects a 2D array as input, where each row is a sample that consists of one or more features. Even if there is just a single feature (as seems to be the case in your example), the input has to have the right dimensions. Therefore, before handing your Series over to sklearn, I am accessing the values (the numpy representation) and reshaping them accordingly.
For more details on reshape(-1, ...) check this out: What does -1 mean in numpy reshape?
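To see why the original apply call failed, here is a small sketch using only pandas and NumPy. The sample numbers and the helper name `record` are made up for illustration; the point is only that the function passed to `Series.apply` receives one scalar at a time, never the whole Series.

```python
import numpy as np
import pandas as pd

series = pd.Series([28.69, 30.10, 25.40])  # hypothetical sample data

seen_ndims = []

def record(x):
    # Series.apply calls this once per element, so x is a scalar here.
    seen_ndims.append(np.ndim(x))
    return x

series.apply(record)
print(seen_ndims)
```

Every recorded dimension is 0, which matches the "got scalar array instead" part of the error message.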
Now, the best bit. If your entire DataFrame consists of a single column you could simply do:
StandardScaler().fit_transform(df)
And even if it doesn't, you could still avoid the reshape:
StandardScaler().fit_transform(df[['Value']])
Note how in this case 'Value' is surrounded by 2 sets of square brackets, so this time it is not a Series but rather a DataFrame with a subset of the original columns (in case you do not want to scale all of them). Since a DataFrame is already 2-dimensional, you don't need to worry about reshaping.
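The shape difference between the three forms discussed above can be checked directly. This is a small sketch with made-up data; the column name 'Other' is hypothetical.

```python
import pandas as pd

# Hypothetical data with one column to scale and one to leave alone.
df = pd.DataFrame({"Value": [1.0, 2.0, 3.0], "Other": [4.0, 5.0, 6.0]})

as_series = df["Value"]                           # 1-D Series
as_frame = df[["Value"]]                          # 2-D DataFrame with one column
reshaped = df["Value"].to_numpy().reshape(-1, 1)  # 2-D NumPy array

print(as_series.shape, as_frame.shape, reshaped.shape)
```

Only the last two have a second dimension, which is what a 2D-array input requires.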
Finally, if you want to scale just some of the columns and update your original DataFrame all you have to do is:
>>> df = pd.DataFrame({'A': [1,2,3], 'B': [0,5,6], 'C': [7, 8, 9]})
>>> columns_to_scale = ['A', 'B']
>>> df[columns_to_scale] = StandardScaler().fit_transform(df[columns_to_scale])
>>> df
          A         B  C
0 -1.224745 -1.397001  7
1  0.000000  0.508001  8
2  1.224745  0.889001  9
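As a cross-check on the numbers above, the same result can be reproduced with pandas alone. This sketch assumes StandardScaler's default settings, which subtract the column mean and divide by the population standard deviation (ddof=0); it is not a replacement for the scaler, only an illustration of the arithmetic.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [0, 5, 6], "C": [7, 8, 9]})
columns_to_scale = ["A", "B"]

subset = df[columns_to_scale]
# Center each column and divide by its population standard deviation.
scaled = (subset - subset.mean()) / subset.std(ddof=0)

result = df.astype({"A": float, "B": float})  # make room for float results
result[columns_to_scale] = scaled
print(result)
```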

Why is it ok to have an index with lists as values but not ok for columns?

Consider the numpy.array i
i = np.empty((1,), dtype=object)
i[0] = [1, 2]
i
array([list([1, 2])], dtype=object)
Example 1
index
df = pd.DataFrame([1], index=i)
df
        0
[1, 2]  1
Example 2
columns
But
df = pd.DataFrame([1], columns=i)
Leads to this when I display it
df
TypeError: unhashable type: 'list'
However, df.T works!?
Question
Why is it necessary for index values to be hashable in a column context but not in an index context? And why only when it's displayed?
This is because of how pandas internally determines the string representation of the DataFrame object. Essentially, the difference between column labels and index labels here is that the column determines the format of the string representation (as the column could be a float, int, etc.).
The error thus happens because pandas stores a separate formatter object for each column in a dictionary and this object is retrieved using the column name. Specifically, the line that triggers the error is https://github.com/pandas-dev/pandas/blob/d1accd032b648c9affd6dce1f81feb9c99422483/pandas/io/formats/format.py#L420
The failure comes from that dictionary lookup: retrieving an item requires hashing the key, and here the key is a list. Python's built-in mutable containers, such as list, are not hashable, because their contents may change after a hash code has been produced. That is why the "unhashable type" error appears only when the DataFrame is displayed.
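The dictionary lookup described above can be reproduced in plain Python. The `formatters` dict here is a hypothetical stand-in for the per-column lookup table, not the actual pandas object.

```python
formatters = {}  # hypothetical stand-in for a per-column lookup table

list_label = [1, 2]
tuple_label = (1, 2)

formatters[tuple_label] = "ok"  # tuples are hashable, so they work as keys

try:
    formatters[list_label]  # the lookup has to hash the key first
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Using a tuple instead of a list as the label sidesteps the problem, since tuples of hashable items are themselves hashable.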

Pandas selecting with unaligned indexes

I have 2 series.
The first one contains a list of numbers with an index counting 0..8.
A = pd.Series([2,3,4,6,5,4,7,6,5], name=['A'], index=[0,1,2,3,4,5,6,7,8])
The second one only contains True values, but the index of the series is a subset of the first one.
B = pd.Series([1, 1, 1, 1, 1], name=['B'], index=[0,2,4,7,8], dtype=bool)
I'd like to use B as boolean vector to get the A-values for the corresponding indexes, like:
A[B]
[...]
IndexingError: Unalignable boolean Series key provided
Unfortunately this raises an error.
Do I need to align them first?
Does
A[B.index.values]
work for your version of pandas? (I see we have different versions because now the Series name has to be hashable, so your code gave me an error)
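Two ways to do the alignment explicitly, as a sketch. It assumes string names for the Series (the list-valued names in the question are rejected by newer pandas versions, as noted above), and the variable names `by_labels`, `mask` and `by_mask` are illustrative.

```python
import pandas as pd

A = pd.Series([2, 3, 4, 6, 5, 4, 7, 6, 5], name="A", index=range(9))
B = pd.Series(True, name="B", index=[0, 2, 4, 7, 8])

# Option 1: select by the labels where B is True.
by_labels = A.loc[B.index[B.to_numpy()]]

# Option 2: expand B to A's index, treating missing labels as False.
mask = B.reindex(A.index, fill_value=False).astype(bool)
by_mask = A[mask]

print(by_labels.tolist(), by_mask.tolist())
```

Option 2 keeps working if B later contains False entries as well, because it builds a full Boolean mask rather than relying on which labels are present.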

What's the difference between [] and [[]] in pandas? [duplicate]

I'm confused about the results for indexing columns in pandas.
Both
db['varname']
and
db[['varname']]
give me the column value of 'varname'. However it looks like there is some subtle difference, since the output from db['varname'] shows me the type of the value.
The first looks up a single key in your df, i.e. one specific column. The second passes a list of columns to sub-select from your df, so it returns all columns matching the values in the list.
The other subtle thing is that the first returns a Series object by default, whilst the second returns a DataFrame even if you pass a list containing a single item.
Example:
In [2]:
df = pd.DataFrame(columns=['VarName','Another','me too'])
df
Out[2]:
Empty DataFrame
Columns: [VarName, Another, me too]
Index: []
In [3]:
print(type(df['VarName']))
print(type(df[['VarName']]))
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
so when you pass a list then it tries to match all elements:
In [4]:
df[['VarName','Another']]
Out[4]:
Empty DataFrame
Columns: [VarName, Another]
Index: []
but without the additional [] then this will raise a KeyError:
df['VarName','Another']
KeyError: ('VarName', 'Another')
Because you're then trying to find a column named: 'VarName','Another' which doesn't exist
This is close to a dupe of another question, and I got this answer from it at: https://stackoverflow.com/a/45201532/1331446, credit to @SethMMorton.
Answering here as this is the top hit on Google and it took me ages to "get" this.
Pandas has no [[ operator at all.
When you see df[['col_name']] you're really seeing:
col_names = ['col_name']
df[col_names]
In consequence, the only thing that [[ does for you is that it makes the result a DataFrame, rather than a Series.
[ on a DataFrame looks at the type of the parameter; if it's a scalar, then you're only after one column, and it hands it back as a Series; if it's a list, then you must be after a set of columns, so it hands back a DataFrame (with only these columns).
That's it!
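A quick sketch to confirm this, using made-up data; the column name 'other' is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"col_name": [1, 2], "other": [3, 4]})  # hypothetical data

col_names = ["col_name"]

single = df["col_name"]     # scalar key, returns a Series
listed = df[col_names]      # list key, returns a DataFrame
literal = df[["col_name"]]  # the same list key written inline

print(type(single).__name__, type(listed).__name__, listed.equals(literal))
```

The inline `[[...]]` form and the named-list form produce equal DataFrames, which shows that the inner brackets are just an ordinary Python list.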
As @EdChum pointed out, [] will return pandas.core.series.Series whereas [[]] will return pandas.core.frame.DataFrame.
Both are different data structures in pandas.
For sklearn, it is better to use db[['varname']], which has a 2D shape.
for example:
from sklearn.preprocessing import KBinsDiscretizer
est = KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='uniform')
est.fit(db[['varname']]) # whereas using db['varname'] causes an error
In [84]: single_brackets = np.array( [ 0, 13, 31, 1313 ] )
In [85]: single_brackets.shape, single_brackets.ndim
Out[85]: ((4,), 1)
# (4, ) : is 4-Elements/Values
# 1 : is One_Dimensional array (Generally...In Pandas we call 1D-Array as "SERIES")
In [86]: double_brackets = np.array( [[ 0, 13, 31, 1313 ]] )
In [87]: double_brackets.shape, double_brackets.ndim
Out[87]: ((1, 4), 2)
#(1, 4) : is 1-row and 4-columns
# 2 : is Two_Dimensional array (Generally...In Pandas we call 2D-Array as "DataFrame")
This is the concept of NumPy ...don't blame Pandas
[ ] -> One_Dimensional array which yields SERIES
[[ ]] -> Two_Dimensional array which yields DataFrame
Still don't believe:
check this:
In [89]: three_brackets = np.array( [[[ 0, 13, 31, 1313 ]]] )
In [93]: three_brackets.shape, three_brackets.ndim
Out[93]: ((1, 1, 4), 3)
# (1, 1, 4) -> In general....(rows, rows, columns)
# 3 -> Three_Dimensional array
Work on creating some NumPy Arrays and 'reshape' and check 'ndim'
