Has anyone run into a case like the one below? If a is a Timestamp and b is a datetime64, then comparing a < b works fine, but b < a raises an error.
If a can be compared to b, I thought we should be able to compare the other way around?
For example (Python 2.7):
>>> a
Timestamp('2013-03-24 05:32:00')
>>> b
numpy.datetime64('2013-03-23T05:33:00.000000000')
>>> a < b
False
>>> b < a
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "pandas\_libs\tslib.pyx", line 1080, in pandas._libs.tslib._Timestamp.__richcmp__ (pandas\_libs\tslib.c:20281)
TypeError: Cannot compare type 'Timestamp' with type 'long'
Many thanks in advance!
That's an interesting question. I've done some digging around and did my best to explain some of this, although one thing I still don't get is why pandas throws the error instead of numpy when we do b < a.
Regarding your question:
If a can be compared to b, I thought we should be able to compare the other way around?
That's not necessarily true. It just depends on the implementation of the comparison operators.
Take this test class for example:
class TestCom(int):
    def __init__(self, a):
        self.value = a

    def __gt__(self, other):
        print('TestComp __gt__ called')
        return True

    def __eq__(self, other):
        # compare the value stored in __init__
        return self.value == other
Here I have defined my __gt__ (>) method to always return True no matter what the other value is, while __eq__ (==) keeps its usual behavior.
Now check the following comparisons out:
a = TestCom(9)
print(a)
# Output: 9
# my def of __gt__
a > 100
# Output: TestComp __gt__ called
# True
a > '100'
# Output: TestComp __gt__ called
# True
'100' > a
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-486-8aee1b1d2500> in <module>()
1 # this will not use my def of __ge__
----> 2 '100' > a
TypeError: '>' not supported between instances of 'str' and 'TestCom'
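To make the dispatch rules behind this more concrete, here is a minimal sketch under Python 3. The Meters class is hypothetical and exists only for illustration. It shows that a comparison method can return NotImplemented to decline, at which point Python tries the reflected method on the other operand, and only raises TypeError when both sides decline.

```python
class Meters:
    """Hypothetical class, used only to illustrate operator dispatch."""

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        # Handle plain numbers; decline everything else so Python can
        # try the other operand's reflected method.
        if isinstance(other, (int, float)):
            return self.value < other
        return NotImplemented

    def __gt__(self, other):
        if isinstance(other, (int, float)):
            return self.value > other
        return NotImplemented


m = Meters(5)
left_first = m < 10    # calls Meters.__lt__(m, 10) directly
right_first = 10 > m   # int.__gt__ declines, so Python tries Meters.__lt__(m, 10)

try:
    "10" > m           # str declines and Meters declines as well
    both_declined = False
except TypeError:
    both_declined = True
```

Whether b < a works therefore depends on what each type's own methods choose to accept or decline, not on whether a < b works.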
So, going back to your case: looking at the Timestamp source code, the only thing I can think of is that pandas.Timestamp does some type checking and conversion where possible.
When we compare a with b (pd.Timestamp against np.datetime64), Timestamp.__richcmp__ does the comparison: if the other operand is of type np.datetime64, it converts it to pd.Timestamp and then compares.
# we can do the following to have a comparison of say b > a
# this converts a to np.datetime64 - .asm8 is equivalent to .to_datetime64()
b > a.asm8
# or we can convert b to datetime64[ms]
b.astype('datetime64[ms]') > a
# or convert to timestamp
pd.to_datetime(b) > a
What I found surprising, since I thought the issue was nanoseconds missing from the Timestamp, is that the comparison of np.datetime64 with pd.Timestamp still fails even if you do the following:
a = pd.Timestamp('2013-03-24 05:32:00.00000001')
a.nanosecond # returns 10
# doing the comparison again where they're both ns still fails
b < a
Looking at the source code, it seems like we can use the == and != operators. But even they don't work as expected. Take a look at the following example:
a = pd.Timestamp('2013-03-24 05:32:00.00000000')
b = np.datetime64('2013-03-24 05:32:00.00000000', 'ns')
b == a # returns False
a == b # returns True
I think this is the result of lines 149-152 or 163-166, where they return False if you're using == and True for !=, without actually comparing the values.
Edit:
The nanosecond feature was added in version 0.23.0. So you can do something like pd.Timestamp('2013-03-23T05:33:00.000000022', unit='ns'). So yes when you compare np.datetime64 it will be converted to pd.Timestamp with ns precision.
Just note that pd.Timestamp is supposed to be a replacement for Python's datetime:
Timestamp is the pandas equivalent of python's Datetime
and is interchangeable with it in most cases.
But Python's datetime doesn't support nanoseconds (there is a good answer on SO explaining why). pd.Timestamp supports comparison between the two even if your Timestamp has nanoseconds in it: when you compare a datetime object against a pd.Timestamp object with ns, _compare_outside_nanorange does the comparison.
Going back to np.datetime64, one thing to note, as explained nicely in this SO post, is that it's a wrapper around an int64 type. So it's not surprising that if I do the following:
1 > a
a > 1
Both will throw the error Cannot compare type 'Timestamp' with type 'int'.
So under the hood, when you do b > a, the comparison must be done at the int level by the np.greater() function (see also the ufunc docs).
Note: I'm unable to confirm this; the numpy docs are too complex to go through. If any numpy experts can comment on this, that would be helpful.
If this is the case, and the comparison of np.datetime64 is based on int, then the example above with a == b and b == a makes sense: when we do b == a we compare the int value of b against a pd.Timestamp, which will always return False for == and True for !=.
It's the same as doing, say, 123 == '123': this operation will not fail, it will just return False.
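The difference between equality and ordering for unrelated types can be checked with a short sketch. It assumes Python 3, and the Plain class is hypothetical: == and != fall back to an identity check when neither operand knows the other type, whereas ordering operators have no fallback and raise.

```python
class Plain:
    """Hypothetical class with no comparison methods of its own."""


p = Plain()

# == and != never fail: if both operands decline, Python falls back to identity.
eq_result = (p == 123)
ne_result = (p != 123)

# Ordering has no such fallback in Python 3, so it raises instead.
try:
    p < 123
    ordering_failed = False
except TypeError:
    ordering_failed = True
```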
Related
Is it good code style if a Python function returns different types depending on the arguments provided?
def foo(bar):
    if bar is None:
        return None
    elif bar == 1:
        return 1*1
    else:
        return [b*b for b in bar]
foo returns None if bar is None
foo returns 1 if bar == 1
foo returns a List of int if bar is a Tuple / List of integers
Examples:
>> foo(None)
None
>> foo(1)
1
>> foo((1, 2, 3, 4))
[1, 4, 9, 16]
Returning None or an int should be OK, but is it OK to return an int or a list of ints depending on the function arguments? You could argue that it is OK because the user knows which return types to expect and doesn't need type checking (in which case I would say it's not OK), but one could also argue that it would be better to split the function in two: one expecting an int and returning an int, and one expecting a list of ints and returning a list of ints.
It depends entirely on your use case. Here's an example in the standard library where the type of the result depends on the type of the input:
>>> import operator
>>> operator.add(1, 2)
3
>>> operator.add(1.0, 2.0)
3.0
This type of behavior is usually OK, and can be documented with @typing.overload.
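As a rough sketch of what such documentation could look like, the hypothetical square function below (loosely modeled on the foo from the question, not taken from any library) declares one overload per input type. The overloads only inform type checkers; a single implementation runs at runtime.

```python
from typing import List, Sequence, Union, overload


@overload
def square(x: int) -> int: ...
@overload
def square(x: Sequence[int]) -> List[int]: ...


def square(x: Union[int, Sequence[int]]) -> Union[int, List[int]]:
    # Single runtime implementation; the overloads above are for type checkers only.
    if isinstance(x, int):
        return x * x
    return [item * item for item in x]


single = square(3)
many = square([1, 2, 3])
```

A type checker can then infer int for square(3) and a list of int for square([1, 2, 3]) without the caller having to narrow a Union.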
Here's an example where the type of the result depends on the value of the input:
>>> import json
>>> json.loads('1')
1
>>> json.loads('[1]')
[1]
This type of behavior is usually reserved for serialization and deserialization, or APIs which blur the type / value boundary such as astype in np.int_(3).astype(bool).
On the other hand, here's an example of a function which is obviously poorly designed:
from typing import Union  # make sure to document the mixed return type

def is_even(x: int) -> Union[bool, str]:
    if x % 2 == 0:
        return True
    else:
        return "no"
Without knowing your specific use case, it's hard to give advice here.
There is no definite answer for this question and so formulating a clear comprehensive answer will be difficult, as can be seen in this very similar question.
In general, you should avoid returning different types from a function. As an example, I once worked on a project where one function looked something like this:
def get_df(url):
    data = get_data(url)
    df = reformat_data(data)
    if df_is_empty(df):
        return 'Empty dataframe'
    df_interp = interpolate_df(df)
    return df_interp
You may imagine that when I got the error
AttributeError: 'str' object has no attribute 'iloc'
this was very confusing for me and it took about half a day to figure out where this error came from. A much better solution would have been to raise a ValueError.
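A minimal sketch of that alternative follows. It uses plain lists instead of dataframes so that it stays self-contained; reformat_rows and get_rows are hypothetical stand-ins, not the original project's functions.

```python
def reformat_rows(rows):
    # Hypothetical stand-in for the reformatting step: drop missing rows.
    return [row for row in rows if row is not None]


def get_rows(raw_rows):
    rows = reformat_rows(raw_rows)
    if not rows:
        # Fail loudly at the source instead of returning a sentinel string.
        raise ValueError("no usable rows in input")
    return rows


try:
    get_rows([None, None])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

The caller now gets an exception that points at the real problem, rather than an AttributeError several calls later.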
Now, there always will be weird situations where this rule of thumb may be incorrect. For instance, when scraping websites, sometimes being able to retrieve images as well might be useful. In this case I'd include a flag in your function:
def scrape_website(url, get_images=False):
    data = do_stuff(url)
    ordered_data = order_data(data, get_images)
    return ordered_data
However I would rather split this up into two functions, one which returns the non-image data and one which returns only image data.
Is there a casting function which takes both the variable and type to cast to? Such as:
cast_var = cast_fn(original_var, type_to_cast_to)
I want to use it as an efficient way to cycle through a list of input parameters parsed from a string, some will need to be cast as int, bool, float etc...
All Python types are callable
new_val = int(old_val)
so no special function is needed. Indeed, what you are asking for is effectively just an apply function
new_val = apply(int, (old_val,))
which exists in Python 2, but was removed from Python 3 as it was never necessary; any expression that could be passed as the first argument to apply can always be used with the call "operator":
apply(expr, args) == expr(*args)
Short answer:
This works:
>>> t = int
>>> s = "9"
>>> t(s)
Or a full example:
def convert(type_id, value):
    return type_id(value)

convert(int, "3")  # --> 3
convert(str, 3.0)  # --> '3.0'
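For the use case in the question (cycling through parameters parsed from a string), the same idea extends to a list of converters. This is only a sketch: parse_bool and cast_params are hypothetical helpers. The one thing to watch is bool, since bool("False") is True (any non-empty string is truthy), so booleans need their own parser.

```python
def parse_bool(text):
    # bool("False") would be True, so map the accepted spellings explicitly.
    lowered = text.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise ValueError("not a boolean: %r" % text)


def cast_params(raw_values, converters):
    # Pair each raw string with the callable that should convert it.
    return [convert(raw) for raw, convert in zip(raw_values, converters)]


params = cast_params(["3", "2.5", "false", "abc"], [int, float, parse_bool, str])
```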
We've all made this kind of mistake in python:
if ( number < string ):
Python 2 silently accepts this and just gives incorrect output.
Thank goodness Python 3 finally raises a TypeError for this. But in some cases Python 2.7 is needed. Is there any way in Python 2.7 to guard against this mistake other than "just be careful" (which we all know doesn't work 100% of the time)?
You could explicitly convert both numbers to int. The string will get converted, and the number won't be affected (it's already an int). So this saves you the need to start remembering what type of value the number holds:
a = 11
b = "2"
print a > b # prints False, which isn't what you intended
print int(a) > int(b) # prints True
EDIT:
As noted in the comments, you cannot assume a number is an integer. However, applying the same train of thought with the proper function, float, should work just fine:
a = 11
b = "2"
print a > b # prints False, which isn't what you intended
print float(a) > float(b) # prints True
If you really, really want to be 100% sure that comparing strings and ints is impossible, you can overload the __builtin__.int (and __builtin__.float, etc. as necessary) method to disallow comparing ints (and floats, etc) with strings. It would look like this:
import __builtin__

_builtin_int = __builtin__.int  # keep a reference to the original type

class no_str_cmp_int(_builtin_int):
    def __lt__(self, other):
        if type(other) is str:
            raise TypeError
        # Python 2 ints have no __lt__ to delegate to, so compare as a plain int
        return _builtin_int(self) < other

    def __gt__(self, other):
        if type(other) is str:
            raise TypeError
        return _builtin_int(self) > other

    # implement __ge__, __le__ and others as necessary

# replace the builtin int type to disallow string comparisons
__builtin__.int = no_str_cmp_int
x = int(10)
Then, if you attempted to do something like this, you'd receive this error:
>>> print x < '15'
Traceback (most recent call last):
File "<pyshell#15>", line 1, in <module>
print x < '15'
File "tmp.py", line 7, in __lt__
raise TypeError
TypeError
There is a major caveat to this approach, though. It only replaces the int function, so every time you create an int, you have to pass it through the function, as I do in the declaration of x above. Literals will continue to be the original int type, and as far as I am aware there is no way to change this. However, if you create these objects properly, they will continue to work with the 100% assurance you desire.
Just convert the string or any other data type to float first.
We can only compare two values meaningfully when their data types are the same.
Suppose,
a = "10"
b = 9.3
c = 9
We want to add a, b and c. The correct way to add these three is to convert them to the same data type and then add.
a = float(a)
b = float(b)
c = float(c)
print a+b+c
You can check if each variable is an int like this :
if isinstance(number, int) and isinstance(string, int):
    if number < string:
        pass  # do something
    else:
        pass  # do something else
else:
    print "NaN"
Edit:
To check for floats too, the condition should be:
if isinstance(number, (int, float)) and isinstance(string, (int, float)):
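The same check can be wrapped in a small helper so that a mixed comparison fails loudly instead of printing a message. This is only a sketch: strict_less_than is a hypothetical name, and it uses numbers.Real (available since Python 2.6) to accept ints and floats while rejecting numeric strings.

```python
import numbers


def strict_less_than(left, right):
    # Refuse anything that is not a real number, including numeric strings.
    for value in (left, right):
        if not isinstance(value, numbers.Real):
            raise TypeError("expected a number, got %s" % type(value).__name__)
    return left < right


ok_result = strict_less_than(2, 11)

try:
    strict_less_than(11, "2")
    rejected = False
except TypeError:
    rejected = True
```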
I'm using Python 2.7.
Consider the following snippet of code (the example is contrived):
import datetime

class ScheduleData:
    def __init__(self, date):
        self.date = date

    def __eq__(self, other):
        try:
            return self.date == other.date
        except AttributeError as e:
            return self.date == other

    def __hash__(self):
        return hash(self.date)

schedule_set = set()
schedule_set.add(ScheduleData(datetime.date(2010, 8, 7)))
schedule_set.add(ScheduleData(datetime.date(2010, 8, 8)))
schedule_set.add(ScheduleData(datetime.date(2010, 8, 9)))
print (datetime.date(2010, 8, 8) in schedule_set)
schedule_list = list(schedule_set)
print (datetime.date(2010, 8, 8) in schedule_list)
The output from this is unexpected (to me, at least):
[08:02 PM toolscripts]$ python test.py
True
False
In the first case, the given date is found in schedule_set, as I have overridden the __hash__ and __eq__ functions.
From my understanding, the in operator checks against hash and equality for sets, but for lists it simply iterates over the items in the list and checks equality.
So what is happening here? Why does my second test for in on the list schedule_list fail?
Do I have to override some other function for lists?
The issue is the comparison is invoking an __eq__ function opposite of what you're looking for. The __eq__ method defined works when you have a ScheduleData() == datetime.date() but the in operator is performing the comparison in the opposite order, datetime.date() == ScheduleData() which is not invoking your defined __eq__. Only the class acting as the left-hand side will have its __eq__ called.
The reason this problem occurs in python 2 and not 3 has to do with the definition of datetime.date.__eq__ in the std library. Take for example the following two classes:
class A(object):
    def __eq__(self, other):
        print ('A.__eq__')
        return False

class B(object):
    def __eq__(self, other):
        print ('B.__eq__')

items = [A()]
B() in items
Running this code prints B.__eq__ under both Python 2 and Python 3. The B object is used as the lhs, just as your datetime.date object is used in Python 2. However, if I redefine B.__eq__ to resemble the Python 3 definition of datetime.date.__eq__:
class B(object):
    def __eq__(self, other):
        print ('First B.__eq__')
        if isinstance(self, other.__class__):
            print ('B.__eq__')
        return NotImplemented
Then:
First B.__eq__
A.__eq__
is printed under both Python 2 and 3. The return of NotImplemented causes the check with the arguments reversed.
Using timetuple in your class will fix this problem, as @TimPeters stated (an interesting quirk I was unaware of), though it seems that it need not be a function:
class ScheduleData:
    timetuple = None
is all you'd need in addition to what you have already.
@RyanHaining is correct. For a truly bizarre workaround, add this method to your class:
def timetuple(self):
    return None
Then your program will print True twice. The reasons for this are involved, having to do with an unfortunate history of comparisons in Python 2 being far too loose. The timetuple() workaround is mostly explained in this part of the docs:
Note In order to stop comparison from falling back to the
default scheme of comparing object addresses, datetime
comparison normally raises TypeError if the other comparand
isn’t also a datetime object. However, NotImplemented is
returned instead if the other comparand has a timetuple()
attribute. This hook gives other kinds of date objects a
chance at implementing mixed-type comparison. If not,
when a datetime object is compared to an object of a
different type, TypeError is raised unless the comparison
is == or !=. The latter cases return False or True,
respectively.
datetime was one of the first types added to Python that tried to offer less surprising comparison behavior. But, it couldn't become "really clean" until Python 3.
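For comparison, here is a sketch of the same class under Python 3, where no timetuple workaround should be needed: datetime.date.__eq__ returns NotImplemented for objects that are not dates, so the class's own __eq__ ends up being consulted whichever operand Python tries first.

```python
import datetime


class ScheduleData:
    def __init__(self, date):
        self.date = date

    def __eq__(self, other):
        # Compare against another ScheduleData, or against a bare date.
        try:
            return self.date == other.date
        except AttributeError:
            return self.date == other

    def __hash__(self):
        return hash(self.date)


target = datetime.date(2010, 8, 8)
schedule_list = [ScheduleData(datetime.date(2010, 8, day)) for day in (7, 8, 9)]

in_list = target in schedule_list
in_set = target in set(schedule_list)
```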
I'm currently parsing CSV tables and need to discover the "data types" of the columns. I don't know the exact format of the values. Obviously, everything that the CSV parser outputs is a string. The data types I am currently interested in are:
integer
floating point
date
boolean
string
My current thoughts are to test a sample of rows (maybe several hundred?) in order to determine the types of data present through pattern matching.
I am particularly concerned about the date data type - is there a Python module for parsing common date idioms (obviously I will not be able to detect them all)?
What about integers and floats?
ast.literal_eval() can get the easy ones.
Dateutil comes to mind for parsing dates.
For integers and floats you could always try a cast in a try/except section
>>> f = "2.5"
>>> i = "9"
>>> ci = int(i)
>>> ci
9
>>> cf = float(f)
>>> cf
2.5
>>> g = "dsa"
>>> cg = float(g)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for float(): dsa
>>> try:
... cg = float(g)
... except:
... print "g is not a float"
...
g is not a float
>>>
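Building on the try/except idea, a rough sketch of column-level detection could try each candidate converter against every sampled value and keep the first one that never fails. The names here (parse_iso_date, CANDIDATES, infer_column_type) are hypothetical, and the date check assumes ISO-formatted dates only; other layouts would need their own format strings or a library such as dateutil.

```python
import datetime


def parse_iso_date(text):
    # Assumes ISO dates (YYYY-MM-DD); other layouts need their own format string.
    return datetime.datetime.strptime(text, "%Y-%m-%d").date()


# Ordered from most to least specific; anything else is treated as a string.
CANDIDATES = [("integer", int), ("floating point", float), ("date", parse_iso_date)]


def infer_column_type(values):
    for name, convert in CANDIDATES:
        try:
            for value in values:
                convert(value)
        except ValueError:
            continue  # at least one value failed; try the next type
        return name
    return "string"


int_column = infer_column_type(["1", "2", "30"])
float_column = infer_column_type(["1", "2.5"])
date_column = infer_column_type(["2019-01-01", "2019-02-15"])
text_column = infer_column_type(["a", "1"])
```

The result only describes the values that were sampled, so a later row can still fail to convert.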
The data types I am currently interested in are...
These do not exist in a CSV file. The data is only strings. Only. Nothing more.
test a sample of rows
Tells you nothing except what you saw in the sample. The next row after your sample can be a string which looks entirely different from the sampled strings.
The only way you can process CSV files is to write CSV-processing applications that assume specific data types and attempt conversion. You cannot "discover" much about a CSV file.
If column 1 is supposed to be a date, you'll have to look at the string and work out the format. It could be anything: a number, or a typical Gregorian date in US or European format (there's no way to know whether 1/1/10 is US or European).
try:
    # some_format is a placeholder for whatever format string you expect
    x = datetime.datetime.strptime(row[0], some_format)
except ValueError:
    pass  # column is not valid.
If column 2 is supposed to be a float, you can only do this.
try:
    y = float(row[1])
except ValueError:
    pass  # column is not valid.
If column 3 is supposed to be an int, you can only do this.
try:
    z = int(row[2])
except ValueError:
    pass  # column is not valid.
There is no way to "discover" if the CSV has floating-point digit strings except by doing float on each row. If a row fails, then someone prepared the file improperly.
Since you have to do the conversion to see if the conversion is possible, you might as well simply process the row. It's simpler and gets you the results in one pass.
Don't waste time analyzing the data. Ask the folks who created it what's supposed to be there.
You may be interested in this python library which does exactly this kind of type guessing on both general python data and CSVs and XLS files:
https://github.com/okfn/messytables
https://messytables.readthedocs.org/ - docs
It happily scales to very large files, to streaming data off the internet etc.
There is also an even simpler wrapper library that includes a command line tool named dataconverters: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy!)
The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164
We tested ast.literal_eval(), but rescuing from errors is pretty slow. If you want to cast data that you receive entirely as strings, I think a regex would be faster.
Something like the following worked very well for us.
import datetime
import re

def guess_type(s):
    """Helper function to detect the appropriate type for a given string."""
    if s == "":
        return None
    elif re.match(r"\A[0-9]+\.[0-9]+\Z", s):
        return float
    elif re.match(r"\A[0-9]+\Z", s):
        return int
    # 2019-01-01 or 01/01/2019 or 01/01/19
    elif re.match(r"\A[0-9]{4}-[0-9]{2}-[0-9]{2}\Z", s) or \
            re.match(r"\A[0-9]{2}/[0-9]{2}/([0-9]{2}|[0-9]{4})\Z", s):
        return datetime.date
    elif re.match(r"\A(true|false)\Z", s):
        return bool
    else:
        return str
Tests:
assert guess_type("") is None
assert guess_type("this is a string") == str
assert guess_type("0.1") == float
assert guess_type("true") == bool
assert guess_type("1") == int
assert guess_type("2019-01-01") == datetime.date
assert guess_type("01/01/2019") == datetime.date
assert guess_type("01/01/19") == datetime.date