I have a DataFrame with mixed column name types: some column names are strings and some are tuples.
Is there a way to reorder the columns without changing the types of the column names?
For example, if all columns are strings, this works fine:
df = pd.DataFrame([["Alice", 34],
["Bob", 55]])
df.columns = ["name", "age"]
df[["age", "name"]]
# Out:
age name
0 34 Alice
1 55 Bob
If all columns are tuples, this also works with no problem:
df = pd.DataFrame([[5, 30],
[6, 31]])
df.columns = [(0,0), (1,1)]
df[[(1,1), (0,0)]]
# Out[15]:
(1, 1) (0, 0)
0 30 5
1 31 6
However, if the columns are mixed strings and tuples, there is an error.
df = pd.DataFrame([["Alice", 0, 34],
["Bob", 1, 55]])
df.columns = ["name", (0,0), "age"]
df[["age", "name", (0,0)]]
# Out:
ValueError: setting an array element with a sequence
I can probably fix this by converting the tuples in the columns to strings, or the strings to tuples, then converting back.
However, what I really want to know is what causes this error and whether there is a way to get around it in a more elegant manner.
df[np.array(["age", "name", (0, 0)], dtype=object)] works.
As you pointed out, the error arises because the array holding the column names contains both tuple and string values. Explicitly creating the array with dtype=object tells NumPy to store arbitrary Python objects and not complain. If the dtype argument is skipped, the dtype is inferred, and NumPy tries to find a single common dtype for the whole array, which fails for a mix of strings and tuples and causes the error.
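For completeness, a minimal end-to-end sketch of that workaround on the mixed-column frame from the question (nothing here beyond the np.array call above):
import numpy as np
import pandas as pd

df = pd.DataFrame([["Alice", 0, 34],
                   ["Bob", 1, 55]])
df.columns = ["name", (0, 0), "age"]

# dtype=object keeps the labels as arbitrary Python objects, so NumPy
# never tries to infer a common dtype from strings and tuples
selector = np.array(["age", "name", (0, 0)], dtype=object)
print(df[selector])
#    age   name  (0, 0)
# 0   34  Alice       0
# 1   55    Bob       1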
I have a dataframe like the one shown below
df = [{"condition": "a", "runtimes": "1,15,2.0,nan"}, {"condition": "b", "runtimes": "51,75,1.0,NaN"}]
df = pd.DataFrame(df)
print(df)
My objective is to
a) Create an output_list which concatenates/appends all the runtimes column values and stores them as a list
b) output_list should not contain NA/nan and should not contain duplicate values
c) All the values in the list should be of int datatype
I was trying something like below
for b in df.runtimes.tolist():
    print(type(b))
    for a in b.split(','):
        print(int(a, base=10))  # it threw an error here
ValueError: invalid literal for int() with base 10: '2.0'
I want all the runtimes values to be in int format (they can only be of int data type)
I expect my output to be a python list like as below
output_list = [1,2,15,51,75]
First split the values and use Series.explode, convert to numeric with missing values where a value does not match, then remove the missing values, sort, convert to integers, remove duplicates and finally convert to a list:
L = (pd.to_numeric(df.runtimes.str.split(',').explode(), errors='coerce')
.dropna()
.sort_values()
.astype(int)
.unique()
.tolist())
print(L)
[1, 2, 15, 51, 75]
Or, if possible, convert to floats with astype(float) instead of using pd.to_numeric:
L = (df.runtimes.str.split(',')
.explode()
.astype(float)
.dropna()
.sort_values()
.astype(int)
.unique()
.tolist())
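If it helps to see why each step is needed, here is a small sketch that breaks the first chain into named intermediate steps (same df as in the question):
import pandas as pd

df = pd.DataFrame([{"condition": "a", "runtimes": "1,15,2.0,nan"},
                   {"condition": "b", "runtimes": "51,75,1.0,NaN"}])

step1 = df.runtimes.str.split(',')             # each cell becomes a list of strings
step2 = step1.explode()                        # one string per row: '1', '15', '2.0', 'nan', ...
step3 = pd.to_numeric(step2, errors='coerce')  # '2.0' -> 2.0, 'nan'/'NaN' -> NaN
out = step3.dropna().astype(int).unique().tolist()
print(sorted(out))
# [1, 2, 15, 51, 75]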
I'm trying to use pandas' read_csv with the dtype parameter set to CategoricalDtype. It does generate the DataFrame with categories as expected but I have noticed that the categories themselves are object type instead of some kind of int. For example,
import pandas as pd
from io import StringIO
data = 'data\n1\n2\n3\n'
df = pd.read_csv(StringIO(data), dtype=pd.CategoricalDtype())
df['data']
results in
0 1
1 2
2 3
Name: data, dtype: category
Categories (3, object): ['1', '2', '3']
This is a bit surprising because if I create a list of numbers and then generate a Series, without using read_csv, the categories are int64.
lst = [1, 2, 3]
pd.Series(lst, dtype=pd.CategoricalDtype())
results in
0 1
1 2
2 3
dtype: category
Categories (3, int64): [1, 2, 3]
I know I can pass the categories explicitly to the CategoricalDtype to circumvent this, but this is a bit annoying. Is this behaviour expected?
Yes, this behavior is expected. When reading a csv, all data is initially stored as strings, and pandas essentially guesses (intelligently) whether or not a column is supposed to be something else after parsing the data (unless given a dtype beforehand). This is probably an oversimplification of how pandas interprets text-based files, so someone please correct me if I'm wrong or add anything I've missed.
If you remove the manual dtype in your pd.read_csv, pandas will read in your data and then accurately guess that the column should be of an int dtype. By manually setting dtype=pd.CategoricalDtype() (note you can also achieve the result with dtype="category"), pandas skips the implicit conversion to an int dtype before converting the column to a CategoricalDtype, which is why your categories have an object dtype.
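A small sketch of that difference (same data as in the question); reading first and converting afterwards keeps the inferred int64 categories:
import pandas as pd
from io import StringIO

data = 'data\n1\n2\n3\n'

# no dtype given: read_csv infers int64, so converting afterwards
# produces int64 categories
s = pd.read_csv(StringIO(data))['data'].astype('category')
print(s.cat.categories.dtype)    # int64

# dtype='category': the raw strings are categorised directly
s2 = pd.read_csv(StringIO(data), dtype='category')['data']
print(s2.cat.categories.dtype)   # object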
In your second example, the data in your list lst are all numeric. Since you aren't explicitly supplying the categories, pandas draws on the unique values in lst to create its categories. Since all the values in lst are ints, the resulting categories are of dtype int. If you want the categories in your second example to be strings, you'll need to recast lst to contain strings (e.g. lst = [str(x) for x in lst]), or better yet, you can replace the underlying categories with a copy that has an object/string dtype after creating the Series.
lst = [1, 2, 3]
s = pd.Series(lst, dtype=pd.CategoricalDtype())
# replace the underlying categories with a string version
s = s.cat.rename_categories(s.cat.categories.astype(str))
print(s)
0 1
1 2
2 3
dtype: category
Categories (3, object): ['1', '2', '3']
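And, as the question notes, you can instead hand read_csv a CategoricalDtype with the categories spelled out. Roughly (the explicit category values here are just an assumption for this toy data):
dtype = pd.CategoricalDtype(categories=[1, 2, 3])
df = pd.read_csv(StringIO(data), dtype={'data': dtype})
# with homogeneous (all-numeric) categories pandas converts the parsed
# strings to match them, so the categories should come out as int64
print(df['data'].cat.categories.dtype)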
Consider the numpy.array i
i = np.empty((1,), dtype=object)
i[0] = [1, 2]
i
array([list([1, 2])], dtype=object)
Example 1
index
df = pd.DataFrame([1], index=i)
df
0
[1, 2] 1
Example 2
columns
But
df = pd.DataFrame([1], columns=i)
Leads to this when I display it
df
TypeError: unhashable type: 'list'
However, df.T works!?
Question
Why is it necessary for index values to be hashable in a column context but not in an index context? And why only when it's displayed?
This is because of how pandas internally determines the string representation of the DataFrame object. Essentially, the difference between column labels and index labels here is that the column determines the format of the string representation (as the column could be a float, int, etc.).
The error thus happens because pandas stores a separate formatter object for each column in a dictionary and this object is retrieved using the column name. Specifically, the line that triggers the error is https://github.com/pandas-dev/pandas/blob/d1accd032b648c9affd6dce1f81feb9c99422483/pandas/io/formats/format.py#L420
The "unhashable type" error usually means that the type, list in this case, is mutable. Mutable types aren't hashable, because they may change after they have produced the hash code. This happens because you are trying to retrieve an item using a list as a key, but since a key has to be hashable, the retrieval fails.
This question already has answers here:
The difference between double brace `[[...]]` and single brace `[..]` indexing in Pandas
I'm confused about the results for indexing columns in pandas.
Both
db['varname']
and
db[['varname']]
give me the column value of 'varname'. However it looks like there is some subtle difference, since the output from db['varname'] shows me the type of the value.
The first looks up a specific key in your df, a specific column; the second is a list of columns to sub-select from your df, so it returns all columns matching the values in the list.
The other subtle thing is that the first by default returns a Series object, whilst the second returns a DataFrame, even if you pass a list containing a single item.
Example:
In [2]:
df = pd.DataFrame(columns=['VarName','Another','me too'])
df
Out[2]:
Empty DataFrame
Columns: [VarName, Another, me too]
Index: []
In [3]:
print(type(df['VarName']))
print(type(df[['VarName']]))
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
so when you pass a list then it tries to match all elements:
In [4]:
df[['VarName','Another']]
Out[4]:
Empty DataFrame
Columns: [VarName, Another]
Index: []
but without the additional [] then this will raise a KeyError:
df['VarName','Another']
KeyError: ('VarName', 'Another')
Because you're then trying to find a column named ('VarName', 'Another'), which doesn't exist.
This is close to a dupe of another question, and I got this answer from it at https://stackoverflow.com/a/45201532/1331446, credit to @SethMMorton.
Answering here as this is the top hit on Google and it took me ages to "get" this.
Pandas has no [[ operator at all.
When you see df[['col_name']] you're really seeing:
col_names = ['col_name']
df[col_names]
In consequence, the only thing that [[ does for you is that it makes the
result a DataFrame, rather than a Series.
[ on a DataFrame looks at the type of the parameter; if it's a scalar, then you're only after one column, and it hands it back as a Series; if it's a list, then you must be after a set of columns, so it hands back a DataFrame (with only those columns).
That's it!
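On a non-empty frame, the difference in what comes back looks like this (a small sketch with made-up data):
df = pd.DataFrame({'VarName': [1, 2], 'Another': [3, 4]})

print(df['VarName'])     # scalar key -> Series
# 0    1
# 1    2
# Name: VarName, dtype: int64

print(df[['VarName']])   # list of keys -> DataFrame with those columns
#    VarName
# 0        1
# 1        2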
As #EdChum pointed out, [] will return pandas.core.series.Series whereas [[]] will return pandas.core.frame.DataFrame.
Both are different data structures in pandas.
For sklearn, it is better to use db[['varname']], which has a 2D shape.
for example:
from sklearn.preprocessing import KBinsDiscretizer
est = KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='uniform')
est.fit(db[['varname']])  # using db['varname'] here would cause an error
In [84]: single_brackets = np.array( [ 0, 13, 31, 1313 ] )
In [85]: single_brackets.shape, single_brackets.ndim
Out[85]: ((4,), 1)
# (4, ) : is 4-Elements/Values
# 1 : is One_Dimensional array (Generally...In Pandas we call 1D-Array as "SERIES")
In [86]: double_brackets = np.array( [[ 0, 13, 31, 1313 ]] )
In [87]: double_brackets.shape, double_brackets.ndim
Out[87]: ((1, 4), 2)
#(1, 4) : is 1-row and 4-columns
# 2 : is Two_Dimensional array (Generally...In Pandas we call 2D-Array as "DataFrame")
This is the concept of NumPy ...don't blame Pandas
[ ] -> One_Dimensional array which yields SERIES
[[ ]] -> Two_Dimensional array which yields DataFrame
Still don't believe:
check this:
In [89]: three_brackets = np.array( [[[ 0, 13, 31, 1313 ]]] )
In [93]: three_brackets.shape, three_brackets.ndim
Out[93]: ((1, 1, 4), 3)
# (1, 1, 4) -> in general... (blocks, rows, columns)
# 3 -> Three_Dimensional array
Work on creating some NumPy Arrays and 'reshape' and check 'ndim'
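For example, a quick sketch along those lines:
import numpy as np

a = np.arange(4)           # shape (4,)      ndim 1 -> Series-like
b = a.reshape(1, 4)        # shape (1, 4)    ndim 2 -> DataFrame-like
c = a.reshape(1, 1, 4)     # shape (1, 1, 4) ndim 3
print(a.ndim, b.ndim, c.ndim)   # 1 2 3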
I am trying to convert a single column of a dataframe to a numpy array. Converting the entire dataframe has no issues.
df
viz a1_count a1_mean a1_std
0 0 3 2 0.816497
1 1 0 NaN NaN
2 0 2 51 50.000000
Both of these functions work fine:
X = df.as_matrix()
X = df.as_matrix(columns=df.columns[1:])
However, when I try:
y = df.as_matrix(columns=df.columns[0])
I get:
TypeError: Index(...) must be called with a collection of some kind, 'viz' was passed
The problem here is that you're passing just a single element, which in this case is the string title of that column; if you convert this to a list with a single element then it works:
In [97]:
y = df.as_matrix(columns=[df.columns[0]])
y
Out[97]:
array([[0],
[1],
[0]], dtype=int64)
Here is what you're passing:
In [101]:
df.columns[0]
Out[101]:
'viz'
So it's equivalent to this:
y = df.as_matrix(columns='viz')
which results in the same error
The docs show the expected params:
DataFrame.as_matrix(columns=None) Convert the frame to its Numpy-array
representation.
Parameters: columns: list, optional, default:None If None, return all
columns, otherwise, returns specified columns
as_matrix expects a list for the columns keyword and df.columns[0] isn't a list. Try
df.as_matrix(columns=[df.columns[0]]) instead.
Using the Index tolist method works as well
df.as_matrix(columns=df.columns[:1].tolist())
When giving multiple columns, for example the first ten, the command
df.as_matrix(columns=[df.columns[0:10]])
does not work, as it passes a list that wraps an Index object rather than the column names. However, using
df.as_matrix(columns=df.columns[0:10].tolist())
works well.
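The underlying difference is easy to check by looking at the types involved (a small sketch, assuming the df above):
cols = df.columns[0:10]          # a pandas Index object, not a plain list
print(type(cols))                # <class 'pandas.core.indexes.base.Index'>
print(type(cols.tolist()))       # <class 'list'> -- what as_matrix expects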