I have two pandas.Series objects, say a and b, having the same index, and when performing the difference a - b I get the error
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
and I don't understand where it is coming from.
The Series a is obtained as a slice of a DataFrame whose index is a MultiIndex, and when I do a renaming
a.name = 0
the operation works fine (but if I rename to a tuple I get the same error).
Unfortunately, I am not able to reproduce the phenomenon in a minimal example (the difference of ad-hoc Series whose names are tuples seems to work fine).
Any ideas on why this is happening?
If relevant, the pandas version is 0.22.0.
EDIT
The full traceback of the error:
----------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-e4efbf202d3c> in <module>()
----> 1 one - two
~/venv/lib/python3.4/site-packages/pandas/core/ops.py in wrapper(left, right, name, na_op)
727
728 if isinstance(rvalues, ABCSeries):
--> 729 name = _maybe_match_name(left, rvalues)
730 lvalues = getattr(lvalues, 'values', lvalues)
731 rvalues = getattr(rvalues, 'values', rvalues)
~/venv/lib/python3.4/site-packages/pandas/core/common.py in _maybe_match_name(a, b)
137 b_has = hasattr(b, 'name')
138 if a_has and b_has:
--> 139 if a.name == b.name:
140 return a.name
141 else:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
EDIT 2
Some more details on how a and b are obtained:
I have a DataFrame df whose index is a MultiIndex (year, id_)
I have a Series factors whose index is the columns of df (something like the standard deviation of the columns)
Then:
tmp = df.loc[(year, id_)]
a = tmp[factors != 0]
b = factors[factors != 0]
diff = a - b
and executing the last line the error happens.
EDIT 3
And it keeps happening even if I reduce the data: the original df has around 1000 rows and columns, but after reducing to the last 10 rows and 5 columns the problem persists!
For example, by doing
df = df.iloc[-10:][df.columns[-5:]]
line = df.iloc[-3]
factors = factors[df.columns]
a = line[factors != 0]
b = factors[factors != 0]
diff = a - b
I keep getting the same error, while printing a and b I obtain
a:
end_bin_68.750_100.000 0.002413
end_bin_75.000_100.000 0.002614
end_bin_81.250_100.000 0.001810
end_bin_87.500_100.000 0.002313
end_bin_93.750_100.000 0.001609
Name: (2015, 10000030), dtype: float64
b:
end_bin_68.750_100.000 0.001244
end_bin_75.000_100.000 0.001242
end_bin_81.250_100.000 0.000918
end_bin_87.500_100.000 0.000659
end_bin_93.750_100.000 0.000563
Name: 1, dtype: float64
While if I manually create df and factors with these same values (also in the indices) the error does not happen.
EDIT 4
While debugging, when one gets to the function _maybe_match_name one obtains the following:
ipdb> type(a.name)
<class 'tuple'>
ipdb> type(b.name)
<class 'numpy.int64'>
ipdb> a.name == b.name
a = end_bin_68.750_100.000 0.002413
end_bin_75.000_100.000 0.002614
end_bin_81.250_100.000 0.001810
end_bin_87.500_100.000 0.002313
end_bin_93.750_100.000 0.001609
Name: (2015, 10000030), dtype: float64
b = end_bin_68.750_100.000 0.001244
end_bin_75.000_100.000 0.001242
end_bin_81.250_100.000 0.000918
end_bin_87.500_100.000 0.000659
end_bin_93.750_100.000 0.000563
Name: 1, dtype: float64
ipdb> (a.name == b.name)
array([False, False])
EDIT 5
Finally I got to a minimal example:
a = pd.Series([1, 2, 3])
a.name = np.int64(13)
b = pd.Series([4, 5, 6])
b.name = (123, 789)
a - b
this raises the error for me, with np.__version__ == '1.14.0' and pd.__version__ == '0.22.0'
When an operation is made between two pandas Series it tries to give a name to the resulting Series.
s1 = pd.Series(np.random.randn(5))
s2 = pd.Series(np.random.randn(5))
s1.name = "hello"
s2.name = "hello"
s3 = s1-s2
s3.name
>>> "hello"
If the name is not the same, then the resulting Series has no name.
s1 = pd.Series(np.random.randn(5))
s2 = pd.Series(np.random.randn(5))
s1.name = "hello"
s2.name = "goodbye"
s3 = s1-s2
s3.name
>>>
This is done by comparing the Series names with the function _maybe_match_name(), which you can see in pandas/core/common.py on GitHub (the traceback above points right at it).
In your case the comparison operator apparently compares an array with a tuple, which is not possible, and raises the ValueError exception (I haven't been able to reproduce the error myself).
I guess it is a bug; what is weird is that np.int64(42) == ("A", "B") doesn't raise an exception for me.
But I have a FutureWarning from numpy:
FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison.
Which makes me think that you are using an extremely recent numpy version (did you compile it from the master branch on GitHub?).
The bug will likely be corrected in next pandas release as it is a result of a future change in the behavior of numpy.
My guess is that the best thing to do is just to rename your Series before performing the operation, as you already did with b.name = None, or to change your numpy version (1.15.0 works well).
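For instance, a minimal sketch of the rename workaround, reusing the reproduction from EDIT 5 (this is just one option; changing the numpy version works too):

import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3])
a.name = np.int64(13)
b = pd.Series([4, 5, 6])
b.name = (123, 789)

# Clearing (or unifying) the names avoids the tuple-vs-scalar comparison
# inside _maybe_match_name(), so the subtraction proceeds normally.
a.name = None
b.name = None
print(a - b)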
Related
I have two dataframes, df1 and df2, each of which has a column containing a product code and a column containing a product price. I want to check the difference between the prices in the two dataframes and store the result of the function I created in a new dataframe "df3" containing the product code and the final price. Here is my attempt:
Function to calculate the difference in the way I want:
def range_calc(z, y):
    final_price = pd.DataFrame(columns = ["Market_price"])
    res = z-y
    abs_res = abs(res)
    if abs_res == 0:
        return (z)
    if z>y:
        final_price = (abs_res / z ) * 100
    else:
        final_price = (abs_res / y ) * 100
    return(final_price)
For loop I created to check the two df and use the function:
Last_df = pd.DataFrame(columns = ["Product_number", "Market_Price"])
for i in df1["product_ID"]:
    for x in df2["product_code"]:
        if i == x:
            Last_df["Product_number"] = i
            Last_df["Market_Price"] = range_calc(df1["full_price"], df2["tot_price"])
The problem is that I am getting this error every time:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Why you got the error message "The truth value of a Series is ambiguous"
You got the error message The truth value of a Series is ambiguous because you tried to feed a pandas.Series into an if-clause:
nums = pd.Series([1.11, 2.22, 3.33])
if nums == 0:
    print("nums == zero (nums is equal to zero)")
else:
    print("nums != zero (nums is not equal to zero)")
# AN EXCEPTION IS RAISED!
The error message is something like the following:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Somehow, you got a Series into the inside of the if-clause.
Actually, I know how it happened, but it will take me a moment to explain:
Well, suppose that you want the value out of row 3 and column 4 of a pandas dataframe.
If you attempt to extract a single value out of a pandas table in a specific row and column, then that value is sometimes a Series object, not a number.
Consider the following example:
# DATA:
# Name Age Location
# 0 Nik 31 Toronto
# 1 Kate 30 London
# 2 Evan 40 Kingston
# 3 Kyra 33 Hamilton
To create the dataframe above, we can write:
df = pd.DataFrame.from_dict({
'Name': ['Nik', 'Kate', 'Evan', 'Kyra'],
'Age': [31, 30, 40, 33],
'Location': ['Toronto', 'London', 'Kingston', 'Hamilton']
})
Now, let us try to get a specific row of data:
evans_row = df.loc[df['Name'] == 'Evan']
and we try to get a specific value out of that row of data:
evans_age = evans_row['Age']
You might think that evans_age is the integer 40, but you would be wrong.
Let us see what evans_age really is:
print(80*"*", "EVAN\'s AGE", type(Evans_age), sep="\n")
print(Evans_age)
We have:
EVAN's AGE
<class 'pandas.core.series.Series'>
2 40
Name: Age, dtype: int64
Evan's Age is not a number.
evans_age is an instance of the pandas.Series class.
After extracting a single cell out of a pandas dataframe you can write .tolist()[0] to extract the number out of that cell.
evans_real_age = evans_age.tolist()[0]
print(80*"*", "EVAN\'s REAL AGE", type(evans_real_age), sep="\n")
print(evans_real_age)
EVAN's REAL AGE
<class 'numpy.int64'>
40
The exception in your original code was probably thrown by if abs_res == 0.
If abs_res is a pandas.Series then abs_res == 0 returns another Series.
An if-statement cannot decide on its own what it means for an entire Series of numbers to be equal to zero.
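If you really do want to test a whole Series against zero, you have to say explicitly how the many booleans should be collapsed into one; a minimal sketch (abs_res here just mirrors the variable name from the question):

import pandas as pd

abs_res = pd.Series([0.0, 0.0, 0.0])

# (abs_res == 0) is a Series of booleans, one per element.
# .all() asks "are they all zero?"; .any() asks "is at least one zero?".
if (abs_res == 0).all():
    print("every element is zero")
if (abs_res == 0).any():
    print("at least one element is zero")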
Normally people just enter one input to an if-clause.
if (912):
    print("912 is True")
else:
    print("912 is False")
When an if-statement receives more than one value, then the python interpreter does not know what to do.
For example, what should the following do?
import pandas as pd
data = pd.Series([1, 565, 120, 12, 901])
if data:
    print("data is true")
else:
    print("data is false")
You should only input one value into an if-condition. Instead, you entered a pandas.Series object as input to the if-clause.
In your case, the pandas.Series only had one number in it. However, in general, pandas.Series contain many values.
The authors of the python pandas library assume that a series contains many numbers, even if it only has one.
The computer thought that you tried to put many different numbers inside of one single if-clause.
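If you know the Series holds exactly one value, you can also pull that scalar out before comparing, for example with .item() (a small sketch, not taken from the question's code):

import pandas as pd

abs_res = pd.Series([0.0])   # a Series that happens to contain a single value

# .item() returns the lone scalar and raises if there is more than one element,
# which makes the comparison unambiguous again.
if abs_res.item() == 0:
    print("the single value is zero")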
The difference between a "function definition" and a "function call"
Your original question was,
"I want to make a function if the common key is found"
Your use of the phrase "make a function" is incorrect. You probably meant, "I want to call a function if a common key is found."
The following are all examples of function "calls":
import pandas as pd
import numpy as np
z = foo(1, 91)
result = funky_function(811, 22, "green eggs and ham")
output = do_stuff()
data = np.random.randn(6, 4)
df = pd.DataFrame(data, index=dates, columns=list("ABCD"))
Suppose that you have two containers.
If you truly want to "make" a function if a common key is found, then you would have code like the following:
dict1 = {'age': 26, 'phone':"303-873-9811"}
dict2 = {'name': "Bob", 'phone':"303-873-9811"}
def foo(dict1, dict2):
    union = set(dict2.keys()).intersection(set(dict1.keys()))
    # if there is a shared key...
    if len(union) > 0:
        # make (create) a new function
        def bar(*args, **kwargs):
            pass
        return bar
new_function = foo(dict1, dict2)
print(new_function)
In Python, you "make" a function ("define" a function) with the def keyword. Writing a function's name followed by parentheses without def is known as a function call.
I think that your question should be re-titled.
You could write, "How do I call a function if two pandas dataframes have a common key?"
A second good question would be something like,
"What went wrong if we see the error message, ValueError: The truth value of a Series is ambiguous.?"
Your question was worded strangely, but I think I can answer it.
Generating Test Data
Your question did not include test data. If you ask a question on Stack Overflow again, please provide a small example of some test data.
The following is an example of data we can use:
product_ID full_price
0 prod_id 1-1-1-1 11.11
1 prod_id 2-2-2-2 22.22
2 prod_id 3-3-3-3 33.33
3 prod_id 4-4-4-4 44.44
4 prod_id 5-5-5-5 55.55
5 prod_id 6-6-6-6 66.66
6 prod_id 7-7-7-7 77.77
------------------------------------------------------------
product_code tot_price
0 prod_id 3-3-3-3 34.08
1 prod_id 4-4-4-4 45.19
2 prod_id 5-5-5-5 56.30
3 prod_id 6-6-6-6 67.41
4 prod_id 7-7-7-7 78.52
5 prod_id 8-8-8-8 89.63
6 prod_id 9-9-9-9 100.74
Products 1 and 2 are unique to data-frame 1
Products 8 and 9 are unique to data-frame 2
Both data-frames contain data for products 3, 4, 5, ..., 7.
The prices are slightly different between data-frames.
The test data above is generated by the following code:
import pandas as pd
from copy import copy
raw_data = [
    [
        "prod_id {}-{}-{}-{}".format(k, k, k, k),
        int("{}{}{}{}".format(k, k, k, k))/100
    ] for k in range(1, 10)
]
raw_data = [row for row in raw_data]
df1 = pd.DataFrame(data=copy(raw_data[:-2]), columns=["product_ID", "full_price"])
df2 = pd.DataFrame(data=copy(raw_data[2:]), columns=["product_code", "tot_price"])
for rowid in range(0, len(df2.index)):
    df2.at[rowid, "tot_price"] += 0.75
print(df1)
print(60*"-")
print(df2)
Add some error checking
It is considered to be "best-practice" to make sure that your function
inputs are in the correct format.
You wrote a function named range_calc(z, y). I recommend making sure that z and y are numbers, and not something else (such as a pandas Series object).
import inspect
import io

def range_calc(z, y):
    try:
        z = float(z)
        y = float(y)
    except (TypeError, ValueError):
        function_name = inspect.stack()[0][3]
        with io.StringIO() as string_stream:
            print(
                "Error: In " + function_name + "(). Inputs should be like decimal numbers.",
                "Instead, we have: " + str(type(z)) + " and " + str(type(y)),
                file=string_stream,
                sep="\n"
            )
            err_msg = string_stream.getvalue()
        raise ValueError(err_msg)
    # DO STUFF
    return
Now we get error messages:
import pandas as pd
data = pd.Series([1, 565, 120, 12, 901])
range_calc("I am supposed to be an integer", data)
# ValueError: Error: In range_calc(). Inputs should be like decimal numbers.
# Instead, we have: <class 'str'> and <class 'pandas.core.series.Series'>
Code which Accomplishes what you Wanted.
The following is some rather ugly code which computes what you wanted:
# You can continue to use your original `range_calc()` function unmodified
# Use the test data I provided earlier in this answer.
def foo(df1, df2):
    last_df = pd.DataFrame(columns = ["Product_number", "Market_Price"])
    df1_ids = set(df1["product_ID"].tolist())
    df2_ids = set(df2["product_code"].tolist())
    pids = df1_ids.intersection(df2_ids)  # common_product_ids
    for pid in pids:
        row1 = df1.loc[df1['product_ID'] == pid]
        row2 = df2.loc[df2["product_code"] == pid]
        price1 = row1["full_price"].tolist()[0]
        price2 = row2["tot_price"].tolist()[0]
        price3 = range_calc(price1, price2)
        row3 = pd.DataFrame([[pid, price3]], columns=["Product_number", "Market_Price"])
        last_df = pd.concat([last_df, row3])
    return last_df
# ---------------------------------------
last_df = foo(df1, df2)
The result is:
Product_number Market_Price
0 prod_id 6-6-6-6 1.112595
0 prod_id 7-7-7-7 0.955171
0 prod_id 4-4-4-4 1.659659
0 prod_id 5-5-5-5 1.332149
0 prod_id 3-3-3-3 2.200704
Note that one of many reasons that my solution is ugly is in the following line of code:
last_df = pd.concat([last_df, row3])
If last_df is large (thousands of rows), then the code will run very slowly.
This is because instead of inserting a new row of data, we:
copy the original dataframe,
append a new row of data to the copy, and
delete/destroy the original data-frame.
It is really silly to copy 10,000 rows of data only to add one new value, and then delete the old 10,000 rows.
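One common way around this, sketched below under the assumption that range_calc() and the test frames from above are available (foo_faster is just a hypothetical name), is to collect plain rows in a list and build the DataFrame once at the end:

import pandas as pd

def foo_faster(df1, df2):
    pids = set(df1["product_ID"]) & set(df2["product_code"])   # common product ids
    rows = []                                                   # plain list, cheap to append to
    for pid in pids:
        price1 = df1.loc[df1["product_ID"] == pid, "full_price"].tolist()[0]
        price2 = df2.loc[df2["product_code"] == pid, "tot_price"].tolist()[0]
        rows.append([pid, range_calc(price1, price2)])
    # build the result once instead of copying the growing frame on every iteration
    return pd.DataFrame(rows, columns=["Product_number", "Market_Price"])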
However, my solution has fewer bugs than your original code, relatively speaking.
Sometimes when you check a condition on a Series or DataFrame, the output is itself a Series (for example a Series of booleans), not a single True or False.
In this case you must use any, all, item, ...
Use the print function on your condition to see the Series.
Also, I must tell you your code is very, very slow and is O(n**2). You can first calculate df3 by joining df1 and df2 and then use the apply method for fast calculation.
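For example, a hedged sketch of that merge-then-apply idea, reusing the asker's range_calc() and the column names from the question (the join keys are an assumption about how the two frames line up):

import pandas as pd

# Inner-join on the product identifier, then compute the price difference row by row,
# instead of scanning df2 once for every row of df1.
df3 = df1.merge(df2, left_on="product_ID", right_on="product_code", how="inner")
df3["Market_Price"] = df3.apply(
    lambda r: range_calc(r["full_price"], r["tot_price"]), axis=1
)
result = df3[["product_ID", "Market_Price"]]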
Edit: It looks like this is a potential bug in pandas. Check out this GitHub issue, helpfully raised by NicMoetsch, noting that the unexpected behavior when assigning with dictionary values has to do with a difference between the frame's __setitem__() and __getitem__().
Earlier on in my code I rename some columns with a dictionary:
cols_dict = {
'Long_column_Name': 'first_column',
'Other_Long_Column_Name': 'second_column',
'AnotherLongColName': 'third_column'
}
for key, val in cols_dict.items():
    df.rename(columns={key: val}, inplace=True)
(I know the loop isn't necessary here — in my actual code I'm having to search the columns of a dataframe in a list of dataframes and get a substring match for the dictionary key.)
Later on I do some cleanup with applymap(), indexing with the dictionary values, and it works fine
pibs[cols_dict.values()].applymap(
lambda x: np.nan if ':' in str(x) else x
)
but when I try to assign the slice back to itself, I get a key error (full error message here).
pibs[cols_dict.values()] = pibs[cols_dict.values()].applymap(
lambda x: np.nan if ':' in str(x) else x
)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3079 try:
-> 3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: dict_values(['first_column', 'second_column', 'third_column'])
The code runs fine if I convert the dictionary values to a list
pibs[list(cols_dict.values())] = ...
so I guess I'm just wondering why I'm able to slice with dictionary values and run applymap() on it, but I'm not able to slice with dictionary values when I turn around and try to assign the result back to the dataframe.
Put simply: why does pandas recognize cols_dict.values() as a list of column names when it's used for indexing, but not when it's used for indexing for assignment?
The issue seems to be unrelated to the applymap(), as using aneroid's example without applymap():
import copy
cols_dict = {
'Long_column_Name': 'first_column',
'Other_Long_Column_Name': 'second_column',
'AnotherLongColName': 'third_column'
}
df = pd.DataFrame({'Long_column_Name': range(3),
'Other_Long_Column_Name': range(3, 6),
'AnotherLongColName': range(15, 10, -2),
})
df.rename(columns=cols_dict, inplace=True)
df[cols_dict.values()] = df[cols_dict.values()]
yields the same error.
Obviously it's not the operation part that doesn't work, but the assignment part, as
df = df[cols_dict.values()]
works fine.
Playing around with different DataFrame combinations showed that the 3 in the error message
ValueError: Wrong number of items passed 3, placement implies 1
isn't caused by the assignment portion, as trying to assign a four-column DataFrame throws a different error:
df2 = pd.DataFrame({'Long_column_Name': range(3),
'Other_Long_Column_Name': range(3, 6),
'AnotherLongColName': range(15, 10, -2),
'ShtClNm': range(10, 13)})
yields
ValueError: Wrong number of items passed 4, placement implies 1
Thus I tried assigning only one column, so that in theory only 1 item is passed, which worked fine without throwing an error.
df[cols_dict.values()] = df2['Long_column_Name']
The result however is not what was expected:
df
first_column second_column third_column (first_column, second_column, third_column)
0 0 3 15 0
1 1 4 13 1
2 2 5 11 2
So to me it seems like what is happening is that pandas doesn't recognize the cols_dict.values() passed to df[...] = as a list of column names, but instead as the name of one new column,
(first_column, second_column, third_column).
That's why it tries to fill that new column with the values passed for assignment. Since you passed too many (3) columns to assign to the one new column, it broke.
When you use list() in df[list(cols_dict.values())] = it works fine, because it then recognizes that a list of columns is passed.
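A small sketch of the two behaviours side by side, reusing the toy df and cols_dict from above (on the pandas version in question, the read succeeds while the bare assignment is what fails):

# reading with dict_values works ...
subset = df[cols_dict.values()]

# ... but for assignment pandas treats dict_values as the name of one new column,
# so wrapping it in list() is what says "these are three existing columns":
df[list(cols_dict.values())] = subset * 10
print(df)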
Diving deeper into pandas DataFrames, I think I've found the issue.
From my understanding, pandas uses __setitem__() for assignment and __getitem__() for look-ups. Both functions make use of convert_to_index_sliceable() (defined here), which returns a slice if whatever you've passed is sliceable and None if it isn't.
Both __getitem__() and __setitem__() first check whether convert_to_index_sliceable() returns None; if it doesn't return None, however, they differ.
__getitem__() converts the indexer to np.intp, which is numpy's indexing dtype, before returning the slice as follows:
# Do we have a slicer (on rows)?
indexer = convert_to_index_sliceable(self, key)
if indexer is not None:
    if isinstance(indexer, np.ndarray):
        indexer = lib.maybe_indices_to_slice(
            indexer.astype(np.intp, copy=False), len(self)
        )
    # either we have a slice or we have a string that can be converted
    # to a slice for partial-string date indexing
    return self._slice(indexer, axis=0)
__setitem__(), on the other hand, returns right away:
# see if we can slice the rows
indexer = convert_to_index_sliceable(self, key)
if indexer is not None:
    # either we have a slice or we have a string that can be converted
    # to a slice for partial-string date indexing
    return self._setitem_slice(indexer, value)
Assuming that no unnecessary code was added to __getitem__(), I think __setitem__() must be missing that code, since both pre-return comments state the exact same thing as to what indexer could possibly be.
I'm going to raise a GitHub issue asking if that is intended behavior or not.
Not a direct answer to your question of why you're able to fetch records with the dict.values() slicing but not set with it; however, it probably has to do with indexing, because if I use loc, it works fine.
Let's set it up:
cols_dict = {
'Long_column_Name': 'first_column',
'Other_Long_Column_Name': 'second_column',
'AnotherLongColName': 'third_column'
}
df = pd.DataFrame({'Long_column_Name': range(3),
'Other_Long_Column_Name': range(3, 6),
'AnotherLongColName': range(15, 10, -2),
})
df.rename(columns=cols_dict, inplace=True)
df
first_column second_column third_column
0 0 3 15
1 1 4 13
2 2 5 11
An applymap to use:
df[cols_dict.values()].applymap(lambda x: -1 if x % 2 == 0 else x ** 2)
first_column second_column third_column
0 -1 9 225
1 1 -1 169
2 -1 25 121
This line throws the error you got:
df[cols_dict.values()] = df[cols_dict.values()].applymap(lambda x: -1 if x % 2 == 0 else x ** 2)
# error thrown
But this works, with df.loc:
df.loc[:, cols_dict.values()] = df[cols_dict.values()].applymap(lambda x: -1 if x % 2 == 0 else x ** 2)
df
first_column second_column third_column
0 -1 9 225
1 1 -1 169
2 -1 25 121
Edit, some partial inference which could be wrong: Btw, the longer error shows what else might have been happening:
KeyError: dict_values(['first_column', 'second_column', 'third_column'])
During handling of the above exception, another exception occurred:
# later:
ValueError: Wrong number of items passed 3, placement implies 1
...which has gone through a section of insert and make_block which leads me to think it was trying to create columns and failed there. And that section was invoked for setitem but not for getitem - so the lookups occurring did not have the same result. I would have instead expected the "setting with copy" error.
Something happened overnight to my favorite np function and I don't understand what.
The code below used to work just fine, and now I get the following error.
data = {'text': ['Facotry One fired', 'Second value', 'Match'],
'H&S': [1, 0 , 0]}
df_test = pd.DataFrame(data, columns = ['text','H&S'])
df_test['H&S'] = np.where(df_test['text'].str.contains('fired'), 0, df_test['H&S'])
Expected Outcome
data = {'text': ['Facotry One fired', 'Second value', 'Match'],
'H&S': [0, 0 , 0]}
df_test = pd.DataFrame(data, columns = ['text','H&S'])
The error:
ValueError Traceback (most recent call last)
<ipython-input-28-d8607dc64cae> in <module>
4 df_test = pd.DataFrame(data, columns = ['text','H&S'])
5
----> 6 df_test['H&S'] = np.where(df_test['text'].str.contains('fired'), 0, df_test['H&S'])
~\anaconda3\lib\site-packages\pandas\core\generic.py in where(self, cond, other, inplace, axis, level, errors, try_cast)
8916
8917 other = com.apply_if_callable(other, self)
-> 8918 return self._where(
8919 cond, other, inplace, axis, level, errors=errors, try_cast=try_cast
8920 )
~\anaconda3\lib\site-packages\pandas\core\generic.py in _where(self, cond, other, inplace, axis, level, errors, try_cast)
8649 applied as a function even if callable. Used in __setitem__.
8650 """
-> 8651 inplace = validate_bool_kwarg(inplace, "inplace")
8652
8653 # align the cond to same shape as myself
~\anaconda3\lib\site-packages\pandas\util\_validators.py in validate_bool_kwarg(value, arg_name)
208 """ Ensures that argument passed in arg_name is of type bool. """
209 if not (is_bool(value) or value is None):
--> 210 raise ValueError(
211 f'For argument "{arg_name}" expected type bool, received '
212 f"type {type(value).__name__}."
ValueError: For argument "inplace" expected type bool, received type Series.
Look at line 7 of your stack trace. It contains:
lib\site-packages\pandas\core\generic.py
So you are calling the pandasonic version of where here (not the Numpythonic one).
Then look at the documentation of pandas.DataFrame.where, especially at the order of parameters.
Note that the third parameter is inplace and it should be of bool type.
Note also that the third parameter in your instruction is a DataFrame column (actually, just a Series).
So probably at the moment when you execute the offending instruction, np contains a DataFrame (instead of the Numpy module).
To check this, add print(type(np)) just before the offending instruction.
Under normal conditions you should get module.
But if I am right (somewhere earlier in your code you executed something like np = <some DataFrame>), you will get pandas.core.frame.DataFrame.
If I'm right, inspect your code, find where the Numpy module is overwritten with a DataFrame, and change np in this place to any other name.
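A minimal sketch of how that kind of shadowing plays out (the np = df_test line is deliberate here, purely to reproduce the symptom):

import numpy as np
import pandas as pd

df_test = pd.DataFrame({'text': ['Facotry One fired', 'Second value'], 'H&S': [1, 0]})

np = df_test                 # oops: the name np is now bound to a DataFrame
print(type(np))              # <class 'pandas.core.frame.DataFrame'>
# np.where(cond, 0, other) would now call DataFrame.where(), whose third
# positional argument is `inplace`, hence the "expected type bool" error.

import numpy as np           # rebinding the name fixes it (better: never reuse np)
print(type(np))              # <module 'numpy' ...>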
I am trying to change the series of a pandas DataFrame object using the iterrows() function. The DataFrame is full of random floats. Below is a sample of both pieces of code:
This one works:
for index, row in other_copy.iterrows():
    other_copy.loc[index] = (other_copy.loc[index] > 30)
But this one doesn't:
for index, row in other_copy.iterrows():
    top_3 = other_copy.loc[index].nlargest(3)
    minimum = min(top_3)
    other_copy.loc[index] = (other_copy.loc[index] > minimum)
The first one modifies the DataFrame, setting True and False accordingly. However, the second one gives me the below error:
TypeError                                 Traceback (most recent call last)
<ipython-input-116-11f6c908f54a> in <module>()
      1 for index,row in other_copy.iterrows():
----> 2     top_3 = other_copy.loc[index].nlargest(3)
      3     minimum = min(top_3)
      4     other_copy.loc[index] = (other_copy.loc[index] > minimum)

/opt/conda/lib/python3.6/site-packages/pandas/core/series.py in nlargest(self, n, keep)
   2061         dtype: float64
   2062         """
-> 2063         return algorithms.SelectNSeries(self, n=n, keep=keep).nlargest()
   2064
   2065     def nsmallest(self, n=5, keep='first'):

/opt/conda/lib/python3.6/site-packages/pandas/core/algorithms.py in nlargest(self)
    915
    916     def nlargest(self):
--> 917         return self.compute('nlargest')
    918
    919     def nsmallest(self):

/opt/conda/lib/python3.6/site-packages/pandas/core/algorithms.py in compute(self, method)
    952             raise TypeError("Cannot use method '{method}' with "
    953                             "dtype {dtype}".format(method=method,
--> 954                                                    dtype=dtype))
    955
    956         if n <= 0:

TypeError: Cannot use method 'nlargest' with dtype object
Am I missing something simple here? The minimum variable is just a float and the comparison should go through. I even tried using
int(minimum)
but it still gives me the same error. Also I'm able to use:
print(other_copy.loc[index] > minimum)
and this works as well to print the correct response. Any ideas why this might be happening? Sorry if this is something simple.
The problem isn't minimum, it's the code that sets minimum. When you slice out your row, it turns into a Series with dtype object (because there are mixed dtypes in your columns, object is the only dtype compatible with all of them).
When you try to run .nlargest() on this row slice, it clearly tells you the problem: TypeError: Cannot use method 'nlargest' with dtype object. You should therefore cast your series to a numeric type.
import pandas as pd
for index, row in other_copy.iterrows():
    top_3 = pd.to_numeric(other_copy.loc[index], errors='coerce').nlargest(3)
    minimum = min(top_3)
    other_copy.loc[index] = (other_copy.loc[index] > minimum)
This may cause another error if there are no entries that can be cast to numerics in the row, and it probably will fail if you try to do an unsafe comparison (like 'str' > 'float')
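A tiny illustration of what errors='coerce' does to a mixed-type row (the values are made up):

import pandas as pd

row = pd.Series([3.2, 'oops', 7, None], dtype=object)
print(pd.to_numeric(row, errors='coerce'))
# 0    3.2
# 1    NaN   <- the un-parsable string becomes NaN instead of raising
# 2    7.0
# 3    NaN
# dtype: float64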
I have two data frames D1 and D2. What I want to achieve is for any column pairs in D1 and D2 which are non-int and non-float type, I want to compute a distance metric using the formula
|A intersect B|/ |A union B|
I first defined a function
def jaccard_d(series1, series2):
    if (series1.dtype is not (pd.np.dtype(int) or pd.np.dtype(float))) and (series2.dtype is not (pd.np.dtype(int) or pd.np.dtype(float))):
        series1 = series1.drop_duplicates()
        series2 = series2.drop_duplicates()
        return len(set(series1).intersection(set(series2))) / len(set(series1).union(set(series2)))
    else:
        return np.nan
Then what I did was to first loop over all columns in D1, and then, for each fixed column in D1, use apply with my jaccard_d function. I try to avoid writing two-layer loops. Maybe there is a better way without writing any loops?
DC = dict.fromkeys(list(D1.columns))
INN = list(D2.columns)
for col in D1:
    DC[col] = dict(zip(INN, D2.apply(jaccard_d, D1[col])))
First, I am not sure whether I am using the apply function correctly: my jaccard_d function takes 2 Series as input, but here, for each iteration, I have D1[col] as one Series, and I want to use apply to apply D1[col] against all columns of D2.
Second, I get this error "'Series' objects are mutable, thus they cannot be hashed", which I don't quite understand. Any comments are appreciated.
I tried to just write a 2-layer loop and use my function jaccard_d to do that. It works. But I want to write more efficient code.
So after floundering around, and finding exactly where the error occurs, and checking the apply docs, I've deduced that you need to call apply thusly:
D2.apply(jaccard_d, args=(D1[col],))
Instead you were using
D2.apply(jaccard_d, axis=D1[col])
==================
I can reproduce your error message with a simple dataframe:
In [589]: df=pd.DataFrame(np.arange(12).reshape(6,2))
In [590]: df
Out[590]:
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
5 10 11
Putting a Series in set works, just as if we'd put a list in set:
In [591]: set(df[0]).union(set(df[1]))
Out[591]: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
But if I try to put a list containing a Series in the set I get your error.
In [592]: set([df[0]])
....
TypeError: 'Series' objects are mutable, thus they cannot be hashed
If the problem isn't with the set expressions then it occurs in the dict() one.
You did not specify where the error occurs, nor have you given an MCVE.
(but as it turns out this is a deadend)
========================
OK, simulating your code:
In [606]: DC=dict.fromkeys(list(df.columns))
In [607]: DC
Out[607]: {0: None, 1: None}
In [608]: INN=list(df.columns)
In [609]: INN
Out[609]: [0, 1]
In [610]: for col in df:
...: dict(zip(INN, df.apply(jaccard_d, df[col])))
....
----> 2 dict(zip(INN, df.apply(jaccard_d, df[col])))
/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
...
-> 4125 axis = self._get_axis_number(axis)
/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py in _get_axis_number(self, axis)
326
327 def _get_axis_number(self, axis):
--> 328 axis = self._AXIS_ALIASES.get(axis, axis)
....
TypeError: 'Series' objects are mutable, thus they cannot be hashed
So the problem is in
df.apply(jaccard_d, df[0])
The problem has nothing to do with jaccard_d. It occurs if I replace it with simple
def foo(series1, series2):
    print(series1)
    print(series2)
    return 1
======================
But look at the docs for apply
df.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
The 2nd argument, if not keyword, is the axis number. So we have been trying to use a Series as the axis number! No wonder it objects! That should have been obvious if I'd read the error trace more carefully.
Leaving the default axis=0, let's pass the other Series as args:
In [632]: df.apply(jaccard_d,args=(df[1],))
Out[632]:
0 0.0
1 1.0
dtype: float64
or in your loop:
In [643]: for col in df:
...: DC[col] = dict(zip(INN, df.apply(jaccard_d,args=(df[col],))))
In [644]: DC
Out[644]: {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}
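If you also want to drop the remaining explicit loop, the same DC can be built with a dict comprehension; this is purely a stylistic variant of the loop above, not a performance change:

# Series.to_dict() gives the same mapping as dict(zip(INN, ...)),
# since apply returns a Series indexed by the columns of df.
DC = {col: df.apply(jaccard_d, args=(df[col],)).to_dict() for col in df}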