Resolving Reindexing only valid with uniquely valued Index objects - python

I have viewed many of the questions that come up with this error. I am running pandas 0.10.1.
import numpy as np
from pandas import DataFrame

df = DataFrame({'A': np.random.randn(5),
                'B': np.random.randn(5),
                'C': np.random.randn(5),
                'D': ['a', 'b', 'c', 'd', 'e']})
#gives error
df.take([2,0,1,2,3], axis=1).drop(['C'],axis=1)
#works fine
df.take([2,0,1,2,1], axis=1).drop(['C'],axis=1)
The only thing I can see is that in the former case I have the non-numeric column, which seems to be affecting the index somehow, but the command below returns an empty result:
df.take([2,0,1,2,3], axis=1).index.get_duplicates()
My index therefore appears unique as far as I can tell, so the reindexing error does not seem to apply. These related Q&As did not help either:
- problems with reindexing dataframes: Reindexing only valid with uniquely valued Index objects
- "Reindexing only valid with uniquely valued Index objects" does not seem to apply
- pandas Reindexing only valid with uniquely valued Index objects (I think my pandas version is recent enough that the bug described there should not be the problem)

Firstly, I believe you meant to test for duplicates using the following command:
df.take([2,0,1,2,3],axis=1).columns.get_duplicates()
because if you use index instead of columns, it will obviously return an empty result: take along axis=1 leaves the row index (the unique default 0 through 4) untouched. The above command returns, as expected:
['C']
Secondly, I think you're right: the non-numeric column is throwing it off, because even if you use the following, there is still an error:
df = DataFrame({'A': np.random.randn(5),
                'B': np.random.randn(5),
                'C': np.random.randn(5),
                'D': [str(x) for x in np.random.randn(5)]})
It could be a bug, because if you check out the core file 'index.py', on lines 86 and 1228, the types it expects are (respectively):
_engine_type = _index.ObjectEngine
_engine_type = _index.Int64Engine
and neither of those seems to expect a string, if you look deeper into the documentation. That's the best I've got, good luck! Let me know if you solve this, as I'm interested too.
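In the meantime, a possible workaround (just a sketch, and it assumes a pandas version where Index.duplicated is available) is to deduplicate the column labels by position before dropping, so that drop never has to reindex against duplicate labels:
taken = df.take([2, 0, 1, 2, 3], axis=1)
# keep only the first occurrence of each column label, then drop 'C' as before
deduped = taken.loc[:, ~taken.columns.duplicated()]
result = deduped.drop(['C'], axis=1)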

Related

TypeError: unhashable type: 'list' in pandas with Groupby or PivotTable

First of all, I want to note that my question is very similar to other questions asked before, but I tried their answers and nothing worked for me.
I'm trying to aggregate some info using more than one variable to group by. I can use pivot_table or groupby, both are fine for this, but I get the same error every time.
My code is:
import numpy as np
import pandas as pd

vars_agrup = ['num_vars', 'list_vars', 'Modelo']
metricas = ["MAE", "MAE_perc", "MSE", "R2"]
g = pd.pivot_table(df, index=vars_agrup, aggfunc=np.sum, values=metricas)
or
df.groupby(vars_agrup, as_index=False).agg(Count=('MAE','sum'))
Also, I tried using () instead of [] to avoid making it a list, but then the program searches for a single column called "'num_vars', 'list_vars', 'Modelo'", which doesn't exist. I tried ([]) and [()], and index instead of columns. It's always the same: grouping by one variable is fine; with multiple variables I get the error: TypeError: unhashable type: 'list'
For sure, all these variables are columns in df.
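For what it's worth, one common way to hit this exact message is when a grouping column itself contains list objects, which cannot be hashed as group keys; whether that applies here depends on what 'list_vars' holds. A minimal sketch that reproduces it:
import pandas as pd

demo = pd.DataFrame({'list_vars': [['a'], ['b']],  # list-valued cells are unhashable
                     'MAE': [1.0, 2.0]})
demo.groupby('list_vars').agg(Count=('MAE', 'sum'))  # TypeError: unhashable type: 'list'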
Edit: My df looks like this:

Understanding of .loc in Pandas to save variables in specific cell of a dataframe

I don't understand the behaviour of .loc or .at when I want to save a variable in a specific cell of a dataframe. Can somebody help me to understand, please?
My failing example:
import pandas as pd
import numpy as np
print(pd.__version__)
from platform import python_version
print(python_version())
df=pd.DataFrame(index=[0,1,2,3],columns=['A','B'])
df = pd.DataFrame({'a':[np.array([1,2,3]), np.array([4,5,6]), np.array([7,8,9]), np.array([10,11,12]), np.array([13,14,15])],'b':[5,5,12,123,6]})
display(df)
df.loc[0,'c']='string 0'
df.loc[1,'c']='string 1'
df.loc[2,'c']='string 2'
df.loc[3,'c']='string 3'
print(df.index.values)
testdata=np.array(np.arange(0,3648,1),dtype=np.float32)
print('----------testdata----------')
print(type(testdata))
print(testdata.dtype)
print(testdata.shape)
print('----------file_handle----------')
file_handle=np.array([1],dtype=np.int64)
print(file_handle)
print(type(file_handle))
print(file_handle.dtype)
if 'new_column' not in df.columns:
    df = df.assign(new_column=None)
display(df)
df.loc[file_handle,'new_column']=[testdata]
display(df)
Result: ValueError: Must have equal len keys and value when setting with an ndarray
But with df.at[file_handle[0], 'new_column'] = [testdata] or df.at[1, 'new_column'] = [testdata] it works. I don't understand. With df.loc[file_handle[0], 'new_column'] = testdata it does not work either.
In other places of my code, I can use [1] as a row index to assign dicts or scalars into one specific location, but not numpy arrays.
Thank you for your explanation and insight. I would be thankful to understand how to use .loc and .at, and what values they accept, both as row index and as item stored in the dataframe.
When you have an ndarray on the right-hand side, Pandas will not treat it like an arbitrary Python object that can be inserted into a single cell. Instead you run into a code path that tries to set multiple values at multiple locations from that array, hence the error message about "setting with an ndarray".
Consider some working multiloc code like
df.loc[[0,1,3], ['b', 'new_column']] = np.array([[4,5], [6,7], [8,9]])
Here, the selection on the left side has the same shape as the array on the right side, so it sets all the values successfully.
In your code, the list wrapping the testdata array of shape (3648,) is treated like a 2D array of shape (1, 3648) by Pandas in this operation. This shape does not match the selection on the left side, so Pandas throws an error about not being able to match them up.
The correct way to handle this is to use .at instead, which can only address a single location and therefore never hits the ndarray-setting code path.
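A minimal sketch of that approach (made-up data; it assumes the target column already has object dtype so it can hold an arbitrary object):
import numpy as np
import pandas as pd

df = pd.DataFrame({'b': [5, 5, 12]})
df = df.assign(new_column=None)            # object-dtype column can store arbitrary objects
testdata = np.arange(5, dtype=np.float32)  # stands in for the real data
df.at[1, 'new_column'] = testdata          # .at targets exactly one cell, so the array is stored as-is
print(df)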

Does iloc use the indices or the place of the row

I have extracted a few rows from a dataframe into a new dataframe. In this new dataframe the old indices remain. However, when I want to specify a range in this new dataframe, I have to use it like new indices, starting from zero. Why does that work? Whenever I try to use the old indices it gives an error.
germany_cases = virus_df_2[virus_df_2['location'] == 'Germany']
germany_cases = germany_cases.iloc[:190]
This is the code. The rows that I extracted from the dataframe virus_df_2 have indices between 16100 and 16590. I wanted to take the first 190 rows. In the second line of code I used iloc[:190] and it worked. However, when I tried to use iloc[16100:16290] it gave an error. What could be the reason?
In pandas there are two attributes, loc and iloc.
iloc is, as you have noticed, indexing based on the position of the rows in the dataframe, so you can reference the nth row using iloc[n].
To reference rows using the pandas index, which can be manually altered and whose labels need not be integers but can be strings or any other hashable objects (ones with the __hash__ method defined), you should use the loc attribute.
In your case, iloc fails because you are asking for positions that lie outside the region defined by your dataframe. You can use loc with the old labels instead and it will be fine.
At first the indexing notation can be hard to grasp, but it is very helpful in some circumstances, for example when sorting or performing grouping operations.
Quick example that might help:
df = pd.DataFrame(
    dict(
        France=[1, 2, 3],
        Germany=[4, 5, 6],
        UK=['x', 'y', 'z'],
    ))
df.loc[:, "Germany"].iloc[1:2]
Out:
1 5
Name: Germany, dtype: int64
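For the asker's concrete case (numbers made up for illustration), both indexers can reach the same rows, one by position and one by the old labels:
import pandas as pd

germany_cases = pd.DataFrame({'cases': [10, 20, 30, 40]},
                             index=[16100, 16101, 16102, 16103])
print(germany_cases.iloc[:2])            # first two rows, by position
print(germany_cases.loc[16100:16101])    # the same rows, by their old labels (loc slices are inclusive)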
Hope I could help.

Passing list-likes to .loc or [] with any missing labels is no longer supported

I want to create a modified dataframe with the specified columns.
I tried the following but it throws the error "Passing list-likes to .loc or [] with any missing labels is no longer supported":
# columns to keep
filtered_columns = ['text', 'agreeCount', 'disagreeCount', 'id', 'user.firstName', 'user.lastName', 'user.gender', 'user.id']
tips_filtered = tips_df.loc[:, filtered_columns]
# display tips
tips_filtered
Thank you
It looks like Pandas has deprecated this method of indexing. According to their docs:
This behavior is deprecated and will show a warning message pointing to this section. The recommended alternative is to use .reindex().
Using the new recommended method, you can filter your columns using:
tips_filtered = tips_df.reindex(columns=filtered_columns)
NB: To reindex rows, you would use reindex(index = ...) (More information here).
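A minimal sketch of the row case, with made-up labels:
import pandas as pd

df = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
# a missing label becomes a NaN row instead of raising a KeyError
print(df.reindex(index=['a', 'b', 'c']))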
Some of the columns in the list are not included in the dataframe. If you do want to do that, let us try reindex:
tips_filtered = tips_df.reindex(columns=filtered_columns)
I encountered the same error with missing row index labels rather than columns.
For example, say I have a dataset of products with the following ids: ['a','b','c','d']. I store those products in a dataframe with indices ['a','b','c','d']:
df = pd.DataFrame(['product a', 'product b', 'product c', 'product d'],
                  index=['a', 'b', 'c', 'd'])
Now let's assume I have an updated product index:
row_indices = ['b', 'c', 'd', 'e']
Here 'e' corresponds to a new product, 'product e'. Note that 'e' was not present in my original index ['a','b','c','d'].
If I try to pass this updated index to my df dataframe: df.loc[row_indices,:],
I'll get this nasty error message:
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['e'], dtype='object')."
To avoid this error I need to intersect my updated index with the original index first:
df.loc[df.index.intersection(row_indices), :]
This is in line with the recommendation in the pandas docs.
This error pops up when you index on labels that are not present. reset_index() worked for me: I was indexing a subset of the actual dataframe using the original indices, and in that case a label may simply not be present in the subset.
I had the same issue while trying to create new columns along with existing ones:
df = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "c"])

def foobar(a, b):
    return a, b

df[["c", "d"]] = df.apply(lambda row: foobar(row["a"], row["b"]), axis=1)
The solution was to add result_type="expand" as an argument of apply():
df[["c","d"]] = df.apply(lambda row: foobar(row["a"], row["b"]), axis=1, result_type="expand")

Pandas: Selecting column from data frame

Pandas beginner here. I'm looking to return a full column's data and I've seen a couple of different methods for this.
What is the difference between the two entries below, if any? It looks like they return the same thing.
loansData['int_rate']
loansData.int_rate
The latter is basically syntactic sugar for the former. There are (at least) a couple of gotchas:
If the name of the column is not a valid Python identifier (e.g., if the column name is my column name?!), you must use the former.
Somewhat surprisingly, you can only use the former form to completely correctly add a new column (see, e.g., here).
Example for the latter statement:
import pandas as pd

df = pd.DataFrame({'a': range(4)})
df.b = range(4)
>>> df.columns
Index([u'a'], dtype='object')
Nevertheless, df.b returns the correct results: the assignment created a plain Python attribute on the DataFrame object rather than a new column, which is why 'b' does not appear in df.columns.
They do return the same thing. The column names in pandas are akin to dictionary keys that refer to a series, and the column names themselves are also exposed as attributes of the dataframe object.
The first method is preferred as it allows for spaces and other characters that are illegal in Python identifiers.
For a more complete explanation, I recommend you take a look at this article:
http://byumcl.bitbucket.org/bootcamp2013/labs/pd_types.html#pandas-types
Search 'Access using dict notation' to find the examples where they show that these two methods return identical values.
They're the same, but for me the first method handles spaces in column names and illegal characters, so it is preferred. Example:
In [115]:
df = pd.DataFrame(columns=['a', ' a', '1a'])
df
Out[115]:
Empty DataFrame
Columns: [a, a, 1a]
Index: []
In [116]:
print(df.a)     # works
print(df[' a']) # works
print(df.1a)    # error
File "<ipython-input-116-4fa4129a400e>", line 3
print(df.1a)
^
SyntaxError: invalid syntax
Really, when you use dot notation it tries to find the key as an attribute; if for some reason you have used a column name that matches an existing DataFrame attribute, then the dot will not do what you expect.
Example:
In [121]:
df = pd.DataFrame(columns=['index'], data = np.random.randn(3))
df
Out[121]:
index
0 0.062698
1 -1.066654
2 -1.560549
In [122]:
df.index
Out[122]:
Int64Index([0, 1, 2], dtype='int64')
The above returned the dataframe's index, as opposed to the column named 'index'.
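In that situation, bracket notation still retrieves the column:
df['index']  # the column named 'index', not the axis labels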
If you are working on an ML project and want to extract the feature and target variables separately, the code below will be useful. It selects the features by taking the column names as a list and slicing it; in this code, data is the DataFrame.
total_col = list(data.columns)
Target_col_Y = total_col[-1]
Feature_col_X = total_col[0:-1]
print('The dependent variable is')
print(Target_col_Y)
print('The independent variables are')
print(Feature_col_X)
The output for the same can be obtained as given below:
The dependent variable is
output
The independent variables are
['age', 'job', 'marital', 'education','day_of_week', ... etc]
