I'm confused about the results for indexing columns in pandas.
Both
db['varname']
and
db[['varname']]
give me the column values for 'varname'. However, there seems to be some subtle difference, since the output of db['varname'] also shows me the dtype of the values.
The first looks up a specific key in your df (a single column); the second takes a list of columns to sub-select from your df, so it returns all columns matching the values in the list.
The other subtle difference is that the first returns a Series object by default, whilst the second returns a DataFrame, even if you pass a list containing a single item.
Example:
In [2]:
df = pd.DataFrame(columns=['VarName','Another','me too'])
df
Out[2]:
Empty DataFrame
Columns: [VarName, Another, me too]
Index: []
In [3]:
print(type(df['VarName']))
print(type(df[['VarName']]))
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
so when you pass a list then it tries to match all elements:
In [4]:
df[['VarName','Another']]
Out[4]:
Empty DataFrame
Columns: [VarName, Another]
Index: []
but without the additional [] then this will raise a KeyError:
df['VarName','Another']
KeyError: ('VarName', 'Another')
Because you're then trying to find a column named: 'VarName','Another' which doesn't exist
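A quick demonstration of both behaviours, with a throwaway df:

```python
import pandas as pd

df = pd.DataFrame({'VarName': [1, 2], 'Another': [3, 4]})

try:
    # a single key: the tuple ('VarName', 'Another'), not two columns
    df['VarName', 'Another']
except KeyError as e:
    print(e)

# the list form selects both columns as a DataFrame
print(df[['VarName', 'Another']].shape)  # (2, 2)
```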
This is close to a dupe of another question, and I got this answer from it at https://stackoverflow.com/a/45201532/1331446, credit to @SethMMorton.
Answering here as this is the top hit on Google and it took me ages to "get" this.
Pandas has no [[ operator at all.
When you see df[['col_name']] you're really seeing:
col_names = ['col_name']
df[col_names]
In consequence, the only thing that [[ does for you is make the result a DataFrame rather than a Series.
[ on a DataFrame looks at the type of the parameter: if it's a scalar, then you're only after one column, and it hands it back as a Series; if it's a list, then you must be after a set of columns, so it hands back a DataFrame (with only those columns).
That's it!
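A quick runnable check of this equivalence (column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({'col_name': [1, 2, 3], 'other': [4, 5, 6]})

# df[['col_name']] is literally df[list_of_names]
col_names = ['col_name']
print(df[col_names].equals(df[['col_name']]))  # True

print(type(df['col_name']).__name__)    # Series
print(type(df[['col_name']]).__name__)  # DataFrame
```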
As @EdChum pointed out, [] will return pandas.core.series.Series whereas [[]] will return pandas.core.frame.DataFrame.
Both are different data structures in pandas.
For sklearn, it is better to use db[['varname']], which has a 2D shape.
For example:
from sklearn.preprocessing import KBinsDiscretizer
est = KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='uniform')
est.fit(db[['varname']])  # using db['varname'] here raises an error
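A minimal runnable sketch of that, where db and its values are made up to stand in for the question's data:

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# made-up stand-in for db from the question
db = pd.DataFrame({'varname': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]})

est = KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='uniform')

# 2-D input (DataFrame): works
est.fit(db[['varname']])
print(est.transform(db[['varname']]).shape)  # (6, 3)

# 1-D input (Series): sklearn rejects it
try:
    est.fit(db['varname'])
except ValueError as e:
    print('ValueError:', e)
```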
In [84]: single_brackets = np.array( [ 0, 13, 31, 1313 ] )
In [85]: single_brackets.shape, single_brackets.ndim
Out[85]: ((4,), 1)
# (4, ) : is 4-Elements/Values
# 1 : is One_Dimensional array (Generally...In Pandas we call 1D-Array as "SERIES")
In [86]: double_brackets = np.array( [[ 0, 13, 31, 1313 ]] )
In [87]: double_brackets.shape, double_brackets.ndim
Out[87]: ((1, 4), 2)
#(1, 4) : is 1-row and 4-columns
# 2 : is Two_Dimensional array (Generally...In Pandas we call 2D-Array as "DataFrame")
This is the concept of NumPy ...don't blame Pandas
[ ] -> One_Dimensional array which yields SERIES
[[ ]] -> Two_Dimensional array which yields DataFrame
Still don't believe it?
Check this:
In [89]: three_brackets = np.array( [[[ 0, 13, 31, 1313 ]]] )
In [93]: three_brackets.shape, three_brackets.ndim
Out[93]: ((1, 1, 4), 3)
# (1, 1, 4) -> In general....(blocks, rows, columns)
# 3 -> Three_Dimensional array
Work on creating some NumPy Arrays and 'reshape' and check 'ndim'
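The suggestion above can be tried directly; a quick sketch:

```python
import numpy as np

arr = np.arange(4)                  # [0 1 2 3]
print(arr.shape, arr.ndim)          # (4,) 1    -> like a Series

row = arr.reshape(1, 4)
print(row.shape, row.ndim)          # (1, 4) 2  -> like a one-row DataFrame

col = arr.reshape(4, 1)
print(col.shape, col.ndim)          # (4, 1) 2  -> like a one-column DataFrame

cube = arr.reshape(1, 1, 4)
print(cube.shape, cube.ndim)        # (1, 1, 4) 3
```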
Related
I have a pandas DataFrame called df where df.shape is (53, 80) where indexes and columns are both int.
If I select the first row like this, I get :
df.loc[0].shape
(80,)
instead of :
(1,80)
But then df.loc[0:0].shape or df[0:1].shape both show the correct shape.
df.loc[0] returns a one-dimensional pd.Series object representing the data in a single row, extracted via indexing.
df.loc[0:0] returns a two-dimensional pd.DataFrame object representing the data in a dataframe with one row, extracted via slicing.
You can see this more clearly if you print the results of these operations:
import pandas as pd, numpy as np
df = pd.DataFrame(np.arange(9).reshape(3, 3))
res1 = df.loc[0]
res2 = df.loc[0:0]
print(type(res1), res1, sep='\n')
<class 'pandas.core.series.Series'>
0 0
1 1
2 2
Name: 0, dtype: int32
print(type(res2), res2, sep='\n')
<class 'pandas.core.frame.DataFrame'>
0 1 2
0 0 1 2
The convention follows NumPy indexing / slicing. This is natural since Pandas is built on NumPy arrays.
arr = np.arange(9).reshape(3, 3)
print(arr[0].shape)    # (3,), i.e. 1-dimensional
print(arr[0:1].shape)  # (1, 3), i.e. 2-dimensional
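One wrinkle worth noting: .loc slices by label and includes the end point, which is why df.loc[0:0] keeps one row, whereas NumPy-style half-open slicing (as in .iloc) would give an empty result for 0:0:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3))

# .loc slices by label and INCLUDES the end point,
# so loc[0:0] keeps one row, not zero
print(df.loc[0:0].shape)   # (1, 3)

# .iloc (like NumPy) uses half-open slices: 0:0 is empty
print(df.iloc[0:0].shape)  # (0, 3)
print(df.iloc[0:1].shape)  # (1, 3)
```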
When you call df.iloc[0], you are selecting the first row, and the type is a Series; whereas in the other case, df.iloc[0:1], you are slicing rows and the result is a DataFrame. Series, according to the pandas Series documentation, are:
One-dimensional ndarray with axis labels
whereas dataframe are Two-dimensional (pandas Dataframe documentation).
Try running following lines to see the difference:
print(type(df.iloc[0]))
# <class 'pandas.core.series.Series'>
print(type(df.iloc[0:1]))
# <class 'pandas.core.frame.DataFrame'>
Edit: As explained below in @floydian's comment, the problem was that calling a = np.array(a, dtype=d) broadcasts each scalar of the 2x2 array into both fields, producing a 2-D structured array, which was causing the problem.
I am aware that this has already been asked multiple times, and in fact I am looking at the Creating a Pandas DataFrame with a numpy array containing multiple types answer right now. But I still seem to have a problem with the conversion. It must be something very simple that I am missing, and I hope someone can be so kind as to point it out. Sample code below:
import numpy as np
import pandas as pd
a = np.array([[1, 2], [3, 4]])
d = [('x','float'), ('y','int')]
a = np.array(a, dtype=d)
# Try 1
df= pd.DataFrame(a)
# Result - ValueError: If using all scalar values, you must pass an index
# Try 2
i = [1,2]
df= pd.DataFrame(a, index=i)
# Result - Exception: Data must be 1-dimensional
I would define the array like this:
a = np.array([(1, 2), (3, 4)], dtype=[('x','float'), ('y', 'int')])
pd.DataFrame(a)
gets what you want.
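Put together, a minimal runnable version:

```python
import numpy as np
import pandas as pd

# build the structured array directly from tuples, one per row
a = np.array([(1, 2), (3, 4)], dtype=[('x', 'float'), ('y', 'int')])

df = pd.DataFrame(a)
print(df)
print(df.dtypes)  # x is float64, y is an integer dtype
```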
One option to separate it after the fact could be e.g.
pd.DataFrame(a.astype("float32").T, columns=a.dtype.names).astype({k: v[0] for k, v in a.dtype.fields.items()})
Out[296]:
x y
0 1.0 3
1 2.0 4
I have a DataFrame with mixed column name types: some column names are strings and some are tuples.
Is there a way to reorder the columns without changing the types of the column names?
For example, if all columns are strings, this works fine:
df = pd.DataFrame([["Alice", 34],
["Bob", 55]])
df.columns = ["name", "age"]
df[["age", "name"]]
# Out:
age name
0 34 Alice
1 55 Bob
If all columns are tuples, this also works with no problem:
df = pd.DataFrame([[5, 30],
[6, 31]])
df.columns = [(0,0), (1,1)]
df[[(1,1), (0,0)]]
# Out[15]:
(1, 1) (0, 0)
0 30 5
1 31 6
However, if the columns are mixed strings and tuples, there is an error.
df = pd.DataFrame([["Alice", 0, 34],
["Bob", 1, 55]])
df.columns = ["name", (0,0), "age"]
df[["age", "name", (0,0)]]
# Out:
ValueError: setting an array element with a sequence
I can probably fix this by converting the tuples in the columns to strings, or the strings to tuples, then converting back.
However, what I really want to know is what causes this error, and whether there is a way to get around it in a more elegant manner.
df[np.array(["age", "name", (0,0)],dtype=object)] works.
As you pointed out, NumPy complains because the array containing the column names holds both tuple and string values. But explicitly creating the array with the dtype=object specification tells it to hold arbitrary objects and not complain. If the dtype argument is skipped, the dtype is inferred, and NumPy assumes it is the same for the whole array, causing an error.
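A sketch of that workaround; building the key array with np.empty and dtype=object is one unambiguous way to keep the tuple as a single element:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([["Alice", 0, 34],
                   ["Bob", 1, 55]])
df.columns = ["name", (0, 0), "age"]

# build the key array with dtype=object up front so the tuple
# stays a single element instead of being expanded
cols = np.empty(3, dtype=object)
cols[:] = ["age", "name", (0, 0)]

print(df[cols])
```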
I'm trying to create a DataFrame in Pandas with the following code:
df_coefficients = pd.DataFrame(data = log_model.coef_, index = X.columns,
columns = ['Coefficients'])
However, I keep getting the following error:
Shape of passed values is (5, 1), indices imply (1, 5)
The values and indices are as follows:
Indices =
Index([u'Daily Time Spent on Site', u'Age', u'Area Income',
u'Daily Internet Usage', u'Male'],
dtype='object')
Values =
array([[ -4.45816498e-02, 2.18379839e-01, -7.63621392e-06,
-2.45264007e-02, 1.13334440e-03]])
How would I fix this? I've built the same type of table before and I've never gotten this error.
Any help would be appreciated.
Thanks
It looks like your Index and Values arrays have different shapes. As you can see, the Index array has single brackets while the Values array has double brackets.
That way the Values array has shape (1, 5), one row and five columns, while your five-element index implies five rows, so the shapes don't line up.
If you enter Values as you wrote in the question:
Values =
array([[ -4.45816498e-02, 2.18379839e-01, -7.63621392e-06,
-2.45264007e-02, 1.13334440e-03]])
and call Values.shape it returns
Values.shape
(1,5)
Instead if you set Values as:
Values = np.array([ -4.45816498e-02, 2.18379839e-01, -7.63621392e-06,
-2.45264007e-02, 1.13334440e-03])
then the shape of Values will be (5,), which fits the index array.
Your data has five columns and one row instead of one column and five rows. Just use the transposed version of it with .T:
df_coefficients = pd.DataFrame(data = log_model.coef_.T, index = X.columns,
columns = ['Coefficients'])
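A minimal sketch with a stand-in for log_model.coef_ (for a binary problem, sklearn stores logistic-regression coefficients with shape (1, n_features)):

```python
import numpy as np
import pandas as pd

# stand-in for log_model.coef_, shape (1, n_features)
coef = np.array([[-4.458e-02, 2.184e-01, -7.636e-06, -2.453e-02, 1.133e-03]])
index = ['Daily Time Spent on Site', 'Age', 'Area Income',
         'Daily Internet Usage', 'Male']

print(coef.shape)    # (1, 5)
print(coef.T.shape)  # (5, 1) -> matches the 5-element index

df_coefficients = pd.DataFrame(data=coef.T, index=index, columns=['Coefficients'])
print(df_coefficients)
```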
I am trying to convert a single column of a dataframe to a numpy array. Converting the entire dataframe has no issues.
df
viz a1_count a1_mean a1_std
0 0 3 2 0.816497
1 1 0 NaN NaN
2 0 2 51 50.000000
Both of these functions work fine:
X = df.as_matrix()
X = df.as_matrix(columns=df.columns[1:])
However, when I try:
y = df.as_matrix(columns=df.columns[0])
I get:
TypeError: Index(...) must be called with a collection of some kind, 'viz' was passed
The problem here is that you're passing just a single element, which in this case is the string title of that column; if you convert this to a list with a single element, then it works:
In [97]:
y = df.as_matrix(columns=[df.columns[0]])
y
Out[97]:
array([[0],
[1],
[0]], dtype=int64)
Here is what you're passing:
In [101]:
df.columns[0]
Out[101]:
'viz'
So it's equivalent to this:
y = df.as_matrix(columns='viz')
which results in the same error
The docs show the expected params:
DataFrame.as_matrix(columns=None)
Convert the frame to its NumPy-array representation.
Parameters: columns : list, optional, default: None
If None, return all columns; otherwise, return the specified columns.
as_matrix expects a list for the columns keyword and df.columns[0] isn't a list. Try
df.as_matrix(columns=[df.columns[0]]) instead.
Using the Index's tolist method works as well (note the slice, since df.columns[0] alone is a plain string with no tolist):
df.as_matrix(columns=df.columns[0:1].tolist())
When selecting multiple columns, for example the first ten, the command
df.as_matrix(columns=[df.columns[0:10]])
does not work, as it passes a list containing an Index object. However,
df.as_matrix(columns=df.columns[0:10].tolist())
works well.
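Note that as_matrix was deprecated and later removed in modern pandas; the current replacement is to_numpy(). A sketch of the equivalent selection with made-up data:

```python
import pandas as pd

# made-up stand-in for the df in the question
df = pd.DataFrame({'viz': [0, 1, 0],
                   'a1_count': [3, 0, 2],
                   'a1_mean': [2.0, None, 51.0]})

# one column as a 2-D array: select with a list, then convert
y = df[[df.columns[0]]].to_numpy()
print(y.shape)  # (3, 1)

# several columns
X = df[df.columns[0:2]].to_numpy()
print(X.shape)  # (3, 2)
```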