Taking logarithm of column - python

I'm quite new to programming (in Python) and I would like to create a new variable that is the logarithm of a column (from an imported Excel file). I have tried different solutions from this site, but I keep getting an error. My latest error is AttributeError: 'str' object has no attribute 'log'.
I have already dropped all the values that are not "numbers", but I still don't know how to convert the values from strings to integers (if that is the problem, because 'int(neighborhood)' doesn't work).
This is the code I have now:
import pandas as pd
import numpy as np
df=pd.read_excel("kwb-2016_del_col_del_row.xls")
df = df[df.m_woz != "."] # drop rows with values "."
neighborhood=df[df.recs=="Neighborhood"]
neighborhood=neighborhood["m_woz"]
print(neighborhood)
np.log(neighborhood)
and this is the error I'm getting:
AttributeError Traceback (most recent call last)
<ipython-input-66-46698de51811> in <module>()
12 print(neighborhood)
13
---> 14 np.log(neighborhood)
AttributeError: 'str' object has no attribute 'log'
Could someone help me please?

Perhaps you are not removing the data you think you are?
Try printing the data types to see what they are.
In a DataFrame, your column might be filled with objects instead of numbers.
print(df.dtypes)
Also, you might want to look at these two pages
Select row from a DataFrame based on the type of the object (i.e. str)
Pandas: convert dtype 'object' to int
Here's an example I constructed and ran interactively that correctly gets the logarithms (don't type >>>):
>>> raw_data = {'m_woz': [42, 'def', 1.23, 45.6, '.xyz'],
...             'recs': ['Neighborhood', 'Neighborhood',
...                      'unknown', 'Neighborhood', 'whatever']}
>>> df = pd.DataFrame(raw_data, columns = ['m_woz', 'recs'])
>>> print(df.dtypes)
m_woz object
recs object
dtype: object
Note that the type is object, not float or int or str
Continuing on, here is what df and neighborhood look like:
>>> df
m_woz recs
0 42 Neighborhood
1 def Neighborhood
2 1.23 unknown
3 45.6 Neighborhood
4 .xyz whatever
>>> neighborhood=df[df.recs=="Neighborhood"]
>>> neighborhood
m_woz recs
0 42 Neighborhood
1 def Neighborhood
3 45.6 Neighborhood
And here are the tricks...
This line selects all rows in neighborhood whose m_woz value is an int or float (be careful to fix the indentation if you copy/paste this):
>>> df_num_strings = neighborhood[neighborhood['m_woz']
...                               .apply(lambda x: type(x) in (int, float))]
>>> df_num_strings
m_woz recs
0 42 Neighborhood
3 45.6 Neighborhood
Almost there... convert the values in the object column to floating point:
>>> df_float = df_num_strings['m_woz'].astype(str).astype(float)
>>> df_float
0 42.0
3 45.6
Finally, compute logarithms:
>>> np.log(df_float)
0 3.737670
3 3.819908
Name: m_woz, dtype: float64
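As an aside, pd.to_numeric can collapse the select-and-convert steps into one; a minimal sketch, assuming the file and column names from the question and that non-numeric m_woz entries should simply be dropped:
import numpy as np
import pandas as pd

df = pd.read_excel("kwb-2016_del_col_del_row.xls")
neighborhood = df.loc[df.recs == "Neighborhood", "m_woz"]
# Coerce non-numeric strings such as "." to NaN, then drop them.
neighborhood = pd.to_numeric(neighborhood, errors="coerce").dropna()
log_m_woz = np.log(neighborhood)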


Write ORC using Pandas with all values of sequence None

I want to write a simple dataframe as an ORC file. The only column is of an integer type. If I set all of its values to None, an exception is raised on to_orc.
I understand that pyarrow cannot infer the datatype from None values, but what can I do to fix the datatype for the output? Attempts to use .astype() only brought TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'.
Bonus points if the solution also works for
empty dataframes
nested types
Script:
data = {'a': [1, 2]}
df = pd.DataFrame(data=data)
print(df)
df.to_orc('a.orc') # OK
df['a'] = None
print(df)
df.to_orc('a.orc') # fails
Output:
a
0 1
1 2
a
0 None
1 None
Traceback (most recent call last):
File ... line 9, in <module>
...
File "pyarrow/_orc.pyx", line 443, in pyarrow._orc.ORCWriter.write
File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unknown or unsupported Arrow type: null
This is a known issue, see https://github.com/apache/arrow/issues/30317. The problem is that the ORC writer does not yet support writing an all-null column of generic object dtype (i.e. without a specific dtype). If you cast the column to, for example, float first, then the writing works.
Using the df from your example:
>>> df.dtypes
a object
dtype: object
# the column has generic object dtype, cast to float
>>> df['a'] = df['a'].astype("float64")
>>> df.dtypes
a float64
dtype: object
# now writing to ORC and reading back works
>>> df.to_orc('a.orc')
>>> pd.read_orc('a.orc')
a
0 NaN
1 NaN
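If several columns can end up all-null, the same cast can be generalized with a small helper; this is just a sketch built on the cast-to-float workaround above, not an official pandas API:
import pandas as pd

def cast_all_null_object_columns(df, dtype="float64"):
    # Cast object columns that contain only nulls to a concrete dtype
    # so pyarrow can infer an Arrow type for the ORC writer.
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object and out[col].isna().all():
            out[col] = out[col].astype(dtype)
    return out

df = pd.DataFrame({'a': [None, None]})
cast_all_null_object_columns(df).to_orc('a.orc')  # now succeeds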

DataFrame.drop leads to ufunc loop error with numpy.sin

Introduction
My code is supposed to import data from .xlsx files and make calculations based on it. The problem is that the unit of each column is saved in the second row of the sheet and is imported as the first entry of the data column, resulting in something like this:
import pandas as pd
import numpy as np
data = pd.DataFrame(data = {'alpha' : ['[°]', 180, 180, 180]})
data['sin'] = np.sin(data['alpha'])
Problem
Because the first cell is of str type, the column becomes object type. I thought I could solve this problem by rearranging the DataFrame, adding the following code between the two lines:
data = data.drop([0]).reset_index(drop = True)
data.astype({'alpha' : 'float64'})
The dataframe now looks like I want it to look and I suppose it should work as intended, but instead I get an AttributeError and a TypeError:
AttributeError: 'float' object has no attribute 'sin'
TypeError: loop of ufunc does not support argument 0 of type float which has no callable sin method
Any insight on why I get these errors and how to solve them would be appreciated!
You can use pandas' to_numeric conversion function like this:
data = pd.DataFrame(data = {'alpha' : ['[°]', 180, 180, 180]})
data['alpha'] = pd.to_numeric(data['alpha'], errors='coerce')
# is your alpha degrees or radians?
data['sin'] = np.sin(np.deg2rad(data['alpha']))
Output:
alpha sin
0 NaN NaN
1 180.0 1.224647e-16
2 180.0 1.224647e-16
3 180.0 1.224647e-16
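As for why the original attempt still failed: DataFrame.astype returns a new object rather than modifying the frame in place, so the cast in the question was discarded. A minimal sketch of that fix, assuming alpha is in degrees:
import numpy as np
import pandas as pd

data = pd.DataFrame(data={'alpha': ['[°]', 180, 180, 180]})
data = data.drop([0]).reset_index(drop=True)
# astype returns a new DataFrame; assign the result back, otherwise
# the column keeps its object dtype and np.sin raises the ufunc error.
data = data.astype({'alpha': 'float64'})
data['sin'] = np.sin(np.deg2rad(data['alpha']))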

Check which value in Pandas Dataframe Column is String

I have a DataFrame that consists of around 0.2 million records. When I pass this DataFrame as input to a model, it throws this error:
Cast string to float is not supported.
Is there any way I can check which particular value in the DataFrame is causing this error?
I've tried running this command and checking if any value is a string in the column.
False in map((lambda x: type(x) == str), trainDF['Embeddings'])
Output:
True
In pandas, when we convert such a mixed-type column we do
df['col'] = pd.to_numeric(df['col'], errors='coerce')
This returns NaN for the items that cannot be converted to float; you can then drop them with dropna or fill in a default value with fillna.
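The NaN mask from that conversion also answers the question of which values are the culprits; a small sketch, using a toy trainDF in place of the real one:
import pandas as pd

trainDF = pd.DataFrame({'Embeddings': ['100', '23.2', '44a', '453.2']})
converted = pd.to_numeric(trainDF['Embeddings'], errors='coerce')
# Rows that were non-null but failed to convert are the culprits.
print(trainDF[converted.isna() & trainDF['Embeddings'].notna()])  # row 2: '44a'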
You should loop over trainDF's indices and find the rows that have errors using try/except.
>>> import pandas as pd
>>> trainDF = pd.DataFrame({'Embeddings':['100', '23.2', '44a', '453.2']})
>>> trainDF
Embeddings
0 100
1 23.2
2 44a
3 453.2
>>> error_indices = []
>>> for idx, row in trainDF.iterrows():
... try:
... trainDF.loc[idx, 'Embeddings'] = float(row['Embeddings'])
... except:
... error_indices.append(idx)
...
>>> trainDF
Embeddings
0 100.0
1 23.2
2 44a
3 453.2
>>> trainDF.loc[error_indices]
Embeddings
2 44a

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced

I am trying to convert a CSV into a numpy array. In the numpy array, I am replacing a few elements with NaN. Then I want to find the indices of the NaN elements in the array. The code is:
import pandas as pd
import matplotlib.pyplot as plyt
import numpy as np
filename = 'wether.csv'
df = pd.read_csv(filename,header = None )
list = df.values.tolist()
labels = list[0]
wether_list = list[1:]
year = []
month = []
day = []
max_temp = []
for i in wether_list:
    year.append(i[1])
    month.append(i[2])
    day.append(i[3])
    max_temp.append(i[5])
mid = len(max_temp) // 2
temps = np.array(max_temp[mid:])
temps[np.where(np.array(temps) == -99.9)] = np.nan
plyt.plot(temps,marker = '.',color = 'black',linestyle = 'none')
# plyt.show()
print(np.where(np.isnan(temps))[0])
# print(len(pd.isnull(np.array(temps))))
When I execute this, I am getting a warning and an error. The warning is :
wether.py:26: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
temps[np.where(np.array(temps) == -99.9)] = np.nan
The error is :
Traceback (most recent call last):
File "wether.py", line 30, in <module>
print(np.where(np.isnan(temps))[0])
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
This is a part of the dataset which I am using:
83168,2014,9,7,0.00000,89.00000,78.00000, 83.50000
83168,2014,9,22,1.62000,90.00000,72.00000, 81.00000
83168,2014,9,23,0.50000,87.00000,74.00000, 80.50000
83168,2014,9,24,0.35000,82.00000,73.00000, 77.50000
83168,2014,9,25,0.60000,85.00000,75.00000, 80.00000
83168,2014,9,26,0.76000,89.00000,77.00000, 83.00000
83168,2014,9,27,0.00000,89.00000,79.00000, 84.00000
83168,2014,9,28,0.00000,90.00000,81.00000, 85.50000
83168,2014,9,29,0.00000,90.00000,79.00000, 84.50000
83168,2014,9,30,0.50000,89.00000,75.00000, 82.00000
83168,2014,10,1,0.02000,91.00000,75.00000, 83.00000
83168,2014,10,2,0.03000,93.00000,77.00000, 85.00000
83168,2014,10,3,1.40000,93.00000,75.00000, 84.00000
83168,2014,10,4,0.06000,89.00000,75.00000, 82.00000
83168,2014,10,5,0.22000,91.00000,68.00000, 79.50000
83168,2014,10,6,0.00000,84.00000,68.00000, 76.00000
83168,2014,10,7,0.17000,85.00000,73.00000, 79.00000
83168,2014,10,8,0.06000,84.00000,73.00000, 78.50000
83168,2014,10,9,0.00000,87.00000,73.00000, 80.00000
83168,2014,10,10,0.00000,88.00000,80.00000, 84.00000
83168,2014,10,11,0.00000,87.00000,80.00000, 83.50000
83168,2014,10,12,0.00000,88.00000,80.00000, 84.00000
83168,2014,10,13,0.00000,88.00000,81.00000, 84.50000
83168,2014,10,14,0.04000,88.00000,77.00000, 82.50000
83168,2014,10,15,0.00000,88.00000,77.00000, 82.50000
83168,2014,10,16,0.09000,89.00000,72.00000, 80.50000
83168,2014,10,17,0.00000,85.00000,67.00000, 76.00000
83168,2014,10,18,0.00000,84.00000,65.00000, 74.50000
83168,2014,10,19,0.00000,84.00000,65.00000, 74.50000
83168,2014,10,20,0.00000,85.00000,69.00000, 77.00000
83168,2014,10,21,0.77000,87.00000,76.00000, 81.50000
83168,2014,10,22,0.69000,81.00000,71.00000, 76.00000
83168,2014,10,23,0.31000,82.00000,72.00000, 77.00000
83168,2014,10,24,0.71000,79.00000,73.00000, 76.00000
83168,2014,10,25,0.00000,81.00000,68.00000, 74.50000
83168,2014,10,26,0.00000,82.00000,67.00000, 74.50000
83168,2014,10,27,0.00000,83.00000,64.00000, 73.50000
83168,2014,10,28,0.00000,83.00000,66.00000, 74.50000
83168,2014,10,29,0.03000,86.00000,76.00000, 81.00000
83168,2014,10,30,0.00000,85.00000,69.00000, 77.00000
83168,2014,10,31,0.00000,85.00000,69.00000, 77.00000
83168,2014,11,1,0.00000,86.00000,59.00000, 72.50000
83168,2014,11,2,0.00000,77.00000,52.00000, 64.50000
83168,2014,11,3,0.00000,70.00000,52.00000, 61.00000
83168,2014,11,4,0.00000,77.00000,59.00000, 68.00000
83168,2014,11,5,0.02000,79.00000,73.00000, 76.00000
83168,2014,11,6,0.02000,82.00000,75.00000, 78.50000
83168,2014,11,7,0.00000,83.00000,66.00000, 74.50000
83168,2014,11,8,0.00000,84.00000,65.00000, 74.50000
83168,2014,11,9,0.00000,84.00000,65.00000, 74.50000
83168,2014,11,10,1.20000,72.00000,65.00000, 68.50000
83168,2014,11,11,0.08000,77.00000,61.00000, 69.00000
83168,2014,11,12,0.00000,80.00000,61.00000, 70.50000
83168,2014,11,13,0.00000,83.00000,63.00000, 73.00000
83168,2014,11,14,0.00000,83.00000,65.00000, 74.00000
83168,2014,11,15,0.00000,82.00000,64.00000, 73.00000
83168,2014,11,16,0.00000,83.00000,64.00000, 73.50000
83168,2014,11,17,0.07000,84.00000,64.00000, 74.00000
83168,2014,11,18,0.00000,86.00000,71.00000, 78.50000
83168,2014,11,19,0.57000,78.00000,55.00000, 66.50000
83168,2014,11,20,0.05000,72.00000,56.00000, 64.00000
83168,2014,11,21,0.05000,77.00000,63.00000, 70.00000
83168,2014,11,22,0.22000,77.00000,69.00000, 73.00000
83168,2014,11,23,0.06000,79.00000,76.00000, 77.50000
83168,2014,11,24,0.02000,84.00000,78.00000, 81.00000
83168,2014,11,25,0.00000,86.00000,78.00000, 82.00000
83168,2014,11,26,0.07000,85.00000,77.00000, 81.00000
83168,2014,11,27,0.21000,82.00000,55.00000, 68.50000
83168,2014,11,28,0.00000,73.00000,53.00000, 63.00000
83168,2015,1,8,0.00000,80.00000,57.00000,
83168,2015,1,9,0.05000,72.00000,56.00000,
83168,2015,1,10,0.00000,72.00000,57.00000,
83168,2015,1,11,0.00000,80.00000,57.00000,
83168,2015,1,12,0.05000,80.00000,59.00000,
83168,2015,1,13,0.85000,81.00000,69.00000,
83168,2015,1,14,0.05000,81.00000,68.00000,
83168,2015,1,15,0.00000,81.00000,64.00000,
83168,2015,1,16,0.00000,78.00000,63.00000,
83168,2015,1,17,0.00000,73.00000,55.00000,
83168,2015,1,18,0.00000,76.00000,55.00000,
83168,2015,1,19,0.00000,78.00000,55.00000,
83168,2015,1,20,0.00000,75.00000,56.00000,
83168,2015,1,21,0.02000,73.00000,65.00000,
83168,2015,1,22,0.00000,80.00000,64.00000,
83168,2015,1,23,0.00000,80.00000,71.00000,
83168,2015,1,24,0.00000,79.00000,72.00000,
83168,2015,1,25,0.00000,79.00000,49.00000,
83168,2015,1,26,0.00000,79.00000,49.00000,
83168,2015,1,27,0.10000,75.00000,53.00000,
83168,2015,1,28,0.00000,68.00000,53.00000,
83168,2015,1,29,0.00000,69.00000,53.00000,
83168,2015,1,30,0.00000,72.00000,60.00000,
83168,2015,1,31,0.00000,76.00000,58.00000,
83168,2015,2,1,0.00000,76.00000,58.00000,
83168,2015,2,2,0.05000,77.00000,58.00000,
83168,2015,2,3,0.00000,84.00000,56.00000,
83168,2015,2,4,0.00000,76.00000,56.00000,
I am unable to rectify the error. How can I overcome the warning on line 26, and how can I solve this error?
Update :
When I try the same thing in a different way, reading the dataset from the file instead of converting it to a DataFrame, I do not get the error. What would be the reason for that? The code is:
weather_filename = 'wether.csv'
weather_file = open(weather_filename)
weather_data = weather_file.read()
weather_file.close()
# Break the weather records into lines
lines = weather_data.split('\n')
labels = lines[0]
values = lines[1:]
n_values = len(values)
# Break the list of comma-separated value strings
# into lists of values.
year = []
month = []
day = []
max_temp = []
j_year = 1
j_month = 2
j_day = 3
j_max_temp = 5
for i_row in range(n_values):
    split_values = values[i_row].split(',')
    if len(split_values) >= j_max_temp:
        year.append(int(split_values[j_year]))
        month.append(int(split_values[j_month]))
        day.append(int(split_values[j_day]))
        max_temp.append(float(split_values[j_max_temp]))
# Isolate the recent data.
i_mid = len(max_temp) // 2
temps = np.array(max_temp[i_mid:])
year = year[i_mid:]
month = month[i_mid:]
day = day[i_mid:]
temps[np.where(temps == -99.9)] = np.nan
# Remove all the nans.
# Trim both ends and fill nans in the middle.
# Find the first non-nan.
i_start = np.where(np.logical_not(np.isnan(temps)))[0][0]
temps = temps[i_start:]
year = year[i_start:]
month = month[i_start:]
day = day[i_start:]
i_nans = np.where(np.isnan(temps))[0]
print(i_nans)
What is wrong in the first code, and why doesn't the second even give a warning?
Posting as it might help future users.
As correctly pointed out by others, np.isnan won't work for object or string dtypes. If you're using pandas, as mentioned here you can directly use pd.isnull, which should work in your case.
>>> import pandas as pd
>>> import numpy as np
>>> var1 = ''
>>> var2 = np.nan
>>> type(var1)
<class 'str'>
>>> type(var2)
<class 'float'>
>>> pd.isnull(var1)
False
>>> pd.isnull(var2)
True
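Applied to an object array like the one in the question, pd.isnull succeeds where np.isnan raises; a small sketch with made-up values:
import numpy as np
import pandas as pd

temps = np.array([1.0, 'string', np.nan], dtype=object)
# np.isnan(temps) would raise TypeError on this object array;
# pd.isnull checks each element and handles mixed types.
print(pd.isnull(temps))               # [False False True]
print(np.where(pd.isnull(temps))[0])  # indices of the NaN entries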
Try replacing np.isnan with pd.isna. pandas' isna also supports category dtypes.
What's the dtype of temps? I can reproduce your warning and error with a string dtype:
In [26]: temps = np.array([1,2,'string',0])
In [27]: temps
Out[27]: array(['1', '2', 'string', '0'], dtype='<U21')
In [28]: temps==-99.9
/usr/local/bin/ipython3:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
#!/usr/bin/python3
Out[28]: False
In [29]: np.isnan(temps)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-29-2ff7754ed926> in <module>()
----> 1 np.isnan(temps)
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
First, comparing strings with the number gives this future warning.
Second, testing for nan produces the error.
Note that given the dtype, the nan assignment assigns a string value, not a float (np.nan is a float).
In [30]: temps[-1] = np.nan
In [31]: temps
Out[31]: array(['1', '2', 'string', 'nan'], dtype='<U21')
isnan(ndarray) fails when the ndarray dtype is object.
isnan(ndarray.astype(float)) would work, but strings cannot be coerced to float.
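Given that diagnosis, one fix for the first script is to force a float dtype when building the array; a sketch, using a stand-in list and assuming every max_temp entry is convertible to a number:
import numpy as np

max_temp = [-99.9, 83.5, 81.0, -99.9]  # stand-in for the question's list
mid = len(max_temp) // 2
# An explicit float dtype makes both the -99.9 comparison and
# np.isnan operate on numbers instead of strings or objects.
temps = np.array(max_temp[mid:], dtype=float)
temps[temps == -99.9] = np.nan
print(np.where(np.isnan(temps))[0])  # [1]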
This is likely a result of an unwanted float to string conversion. To repair it, just reverse it by adding string-to-float conversion (assuming data is convertible to a number) using float or np.float64:
np.isnan(float(str(np.nan)))
True
or
np.isnan(float(str("nan")))
True
rather than:
np.isnan(str(np.nan))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [164], line 1
----> 1 np.isnan(str(np.nan))
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Note that if your data is NOT convertible to numbers (floats), you need to use a string-compatible function such as pd.isna instead of np.isnan.
I came across this error when trying to transform my dataset using sklearn.preprocessing.OneHotEncoder. The error was thrown by the _check_unknown function defined in sklearn.utils._encode.
This was caused by the fact that, at transform time, one of the columns to be transformed had a type float64 as opposed to object - in my case an entire column was NaN.
The solution was to cast the dataframe to object type before invoking transform:
ohe.transform(data.astype("O"))
Note: this answer is only somewhat related to the title of the question, because this error also appears when working with Decimal types.
I got the same error when working with Decimal values. For some reason, one column of the dataframe I was using came in as Decimal. For example, calling .unique() on this column gave
[Decimal('0'), Decimal('95'), Decimal('38'), Decimal('25'),
Decimal('42'), Decimal('11'), Decimal('18'), Decimal('22'),
.....Decimal('220'), Decimal('724')]
The traceback showed that the failure happened when calling some numpy function. I managed to reproduce the error using the min and max values of the above array:
from decimal import Decimal
xmin, xmax = Decimal('0'), Decimal('724')
np.isnan([xmin, xmax])
which raises the error:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
The solution in this case was to cast all these values to int:
df.astype({col: int for col in desired_columns_to_convert})
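Casting to float instead is an alternative when the values may not all be integers; a quick sketch:
from decimal import Decimal
import numpy as np

vals = [Decimal('0'), Decimal('724')]
# Casting to float gives a numeric dtype that np.isnan supports.
arr = np.array(vals, dtype=float)
print(np.isnan(arr))  # [False False]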

Role of name of pandas.Series while doing difference

I have two pandas.Series objects, say a and b, having the same index, and when performing the difference a - b I get the error
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
and I don't understand where it is coming from.
The Series a is obtained as a slice of a DataFrame whose index is a MultiIndex, and when I do a renaming
a.name = 0
the operation works fine (but if I rename to a tuple I get the same error).
Unfortunately, I am not able to reproduce a minimal example of the phenomenon (the difference of ad-hoc Series whose names are tuples seems to work fine).
Any ideas on why this is happening?
If relevant, pandas version is 0.22.0
EDIT
The full traceback of the error:
----------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-e4efbf202d3c> in <module>()
----> 1 one - two
~/venv/lib/python3.4/site-packages/pandas/core/ops.py in wrapper(left, right, name, na_op)
727
728 if isinstance(rvalues, ABCSeries):
--> 729 name = _maybe_match_name(left, rvalues)
730 lvalues = getattr(lvalues, 'values', lvalues)
731 rvalues = getattr(rvalues, 'values', rvalues)
~/venv/lib/python3.4/site-packages/pandas/core/common.py in _maybe_match_name(a, b)
137 b_has = hasattr(b, 'name')
138 if a_has and b_has:
--> 139 if a.name == b.name:
140 return a.name
141 else:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
EDIT 2
Some more details on how a and b are obtained:
I have a DataFrame df whose index is a MultiIndex (year, id_)
I have a Series factors whose index is the columns of df (something like the standard deviation of the columns)
Then:
tmp = df.loc[(year, id_)]
a = tmp[factors != 0]
b = factors[factors != 0]
diff = a - b
and executing the last line the error happens.
EDIT 3
And it keeps happening even if I reduce the columns: the original df has around 1000 rows and columns, but after reducing to the last few rows and columns the problem persists!
For example, by doing
df = df.iloc[-10:][df.columns[-5:]]
line = df.iloc[-3]
factors = factors[df.columns]
a = line[factors != 0]
b = factors[factors != 0]
diff = a - b
I keep getting the same error, and printing a and b I obtain:
a:
end_bin_68.750_100.000 0.002413
end_bin_75.000_100.000 0.002614
end_bin_81.250_100.000 0.001810
end_bin_87.500_100.000 0.002313
end_bin_93.750_100.000 0.001609
Name: (2015, 10000030), dtype: float64
b:
end_bin_68.750_100.000 0.001244
end_bin_75.000_100.000 0.001242
end_bin_81.250_100.000 0.000918
end_bin_87.500_100.000 0.000659
end_bin_93.750_100.000 0.000563
Name: 1, dtype: float64
While if I manually create df and factors with these same values (also in the indices) the error does not happen.
EDIT 4
While debugging, when one gets to the function _maybe_match_name one obtains the following:
ipdb> type(a.name)
<class 'tuple'>
ipdb> type(b.name)
<class 'numpy.int64'>
ipdb> a.name == b.name
a = end_bin_68.750_100.000 0.002413
end_bin_75.000_100.000 0.002614
end_bin_81.250_100.000 0.001810
end_bin_87.500_100.000 0.002313
end_bin_93.750_100.000 0.001609
Name: (2015, 10000030), dtype: float64
b = end_bin_68.750_100.000 0.001244
end_bin_75.000_100.000 0.001242
end_bin_81.250_100.000 0.000918
end_bin_87.500_100.000 0.000659
end_bin_93.750_100.000 0.000563
Name: 1, dtype: float64
ipdb> (a.name == b.name)
array([False, False])
EDIT 5
Finally I got to a minimal example:
a = pd.Series([1, 2, 3])
a.name = np.int64(13)
b = pd.Series([4, 5, 6])
b.name = (123, 789)
a - b
this raises the error to me, np.__version__ == 1.14.0 and pd.__version__ == 0.22.0
When an operation is performed between two pandas Series, pandas tries to give a name to the resulting Series.
s1 = pd.Series(np.random.randn(5))
s2 = pd.Series(np.random.randn(5))
s1.name = "hello"
s2.name = "hello"
s3 = s1-s2
s3.name
>>> "hello"
If the name is not the same, then the resulting Series has no name.
s1 = pd.Series(np.random.randn(5))
s2 = pd.Series(np.random.randn(5))
s1.name = "hello"
s2.name = "goodbye"
s3 = s1-s2
s3.name
>>>
This is done by comparing the Series names with the function _maybe_match_name(), which is here on GitHub.
In your case the comparison operator apparently compares an array with a tuple, which is not possible, and raises the ValueError exception (I haven't been able to reproduce the error).
I guess it is a bug; what is weird is that np.int64(42) == ("A", "B") doesn't raise an exception for me.
But I have a FutureWarning from numpy:
FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison.
This makes me think that you are using an extremely recent numpy version (did you compile it from the master branch on GitHub?).
The bug will likely be corrected in the next pandas release, as it results from an upcoming change in numpy's behavior.
My guess is that the best thing to do is to rename your Series before performing the operation, as you already did (b.name = None), or to change your numpy version (1.15.0 works well).
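A minimal sketch of that rename workaround, built on the question's own reproducer:
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3])
a.name = np.int64(13)
b = pd.Series([4, 5, 6])
b.name = (123, 789)

# Clearing (or matching) the names sidesteps the ambiguous
# int64-vs-tuple comparison inside _maybe_match_name.
a.name = None
b.name = None
print(a - b)  # 0: -3, 1: -3, 2: -3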
