I'm trying to construct simple DataFrames. Both have a date column, and the first has one additional column:
import pandas as pd
import datetime as dt
import numpy as np
a = pd.DataFrame(np.array([
[dt.datetime(2018, 1, 10), 5.0]]), columns=['date', 'amount'])
print(a)
# date                amount
# 2018-01-10 00:00:00 5
b = pd.DataFrame(np.array([
[dt.datetime(2018, 1, 10)]]), columns=['date'])
print(b)
# date
# 2018-01-10
Why are the dates interpreted differently (with and without time)? This causes problems when I later try to merge the two frames.
Ok, so here is what happens. I will use the following code:
import pandas as pd
import datetime as dt
import numpy as np
a_val = np.array([[dt.datetime(2018, 1, 10), 5.0]])
a = pd.DataFrame(a_val, columns=['date', 'amount'])
b_val = np.array([[dt.datetime(2018, 1, 10)]])
b = pd.DataFrame(b_val, columns=['date'])
I just split the array contents out of the DataFrame constructor calls. First, let's print the a_val and b_val variables:
print(a_val, b_val)
# output: [[datetime.datetime(2018, 1, 10, 0, 0) 5.0]] [[datetime.datetime(2018, 1, 10, 0, 0)]]
So far so good: the objects are datetime.datetime.
Now let's access the values of the dataframe with .values:
print(a.values, b.values)
# output: [[datetime.datetime(2018, 1, 10, 0, 0) 5.0]] [['2018-01-10T00:00:00.000000000']]
Things are messed up here. Let's print the type of the date:
print(type(a.values[0][0]), type(b.values[0][0]))
# output: <class 'datetime.datetime'> <class 'numpy.datetime64'>
OK, so that's the thing: the second array contains only a datetime, so when pandas builds the DataFrame it infers the whole block as datetime64[ns], and the value comes back as a numpy.datetime64 object with different formatting. In the first array you have a datetime object plus a float, so pandas leaves the block as dtype object and both values stay as they are. (Note that np.array() itself keeps the datetimes as Python objects in both cases, as the print of a_val and b_val above shows.)
Short version: if you have a collection of different objects (dates, strings, ints, etc.), use a list, not a numpy array.
Both columns in a have dtype object because the intermediate numpy array has dtype object. I'd think that not implicitly interpreting mixed objects is probably good behavior.
a = pd.DataFrame([[dt.datetime(2018, 1, 10), 5.0]], columns=['date', 'amount'])
This seems to be more along the lines of what you want.
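As a quick check (a sketch, assuming a reasonably recent pandas): passing lists lets pandas infer dtypes per column, so both frames get a proper datetime64[ns] date column and the merge works:
import datetime as dt
import pandas as pd

# Pass lists directly so pandas can infer dtypes column by column
a = pd.DataFrame([[dt.datetime(2018, 1, 10), 5.0]], columns=['date', 'amount'])
b = pd.DataFrame([[dt.datetime(2018, 1, 10)]], columns=['date'])
print(a.dtypes)  # date: datetime64[ns], amount: float64
print(b.dtypes)  # date: datetime64[ns]
print(a.merge(b, on='date'))  # now merges cleanly on the datetime column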
I have the following dummy calculation in Python:
from datetime import datetime
import pandas as pd
result = ### based on some calculation
print(result)
With this I am getting an answer in the following format:
(
(
'date', pywintypes.datetime(2020, 6, 15, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True)), pywintypes.datetime(2020, 7, 15, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True))
),
(
'var1', 200, 340
),
(
'var2', 1200, -340
)
)
I fail to understand what this format is, exactly. How can I convert this data to a pandas DataFrame for further calculation?
Any pointers would be very helpful.
It seems like it's a tuple of tuples, but if you run this:
print(type(result))
you will get a better idea of what it is.
Given your tuple-of-tuples format, you could use:
import pandas as pd
df = pd.DataFrame(result).set_index(0).T
Output:
         date  var1  var2
1  2020-06-15   200  1200
2  2020-07-15   340  -340
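For example, here is a self-contained sketch; plain datetime objects stand in for the pywintypes.datetime values (an assumption on my part, but they behave the same for this reshaping):
from datetime import datetime
import pandas as pd

result = (
    ('date', datetime(2020, 6, 15), datetime(2020, 7, 15)),
    ('var1', 200, 340),
    ('var2', 1200, -340),
)
df = pd.DataFrame(result).set_index(0).T
Note that after the transpose the columns have object dtype, so you may want to convert them for further calculation:
df['date'] = pd.to_datetime(df['date'])
df[['var1', 'var2']] = df[['var1', 'var2']].astype(int)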
You can try
pd.DataFrame(list(result))
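which gives each tuple as a row (a sketch of the output, again using plain datetimes in place of the pywintypes values):
#       0                    1                    2
# 0  date  2020-06-15 00:00:00  2020-07-15 00:00:00
# 1  var1                  200                  340
# 2  var2                 1200                 -340
so you would still need the set_index(0).T step from the answer above to get one column per variable.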
I need the index to start at 1 rather than 0 when writing a Pandas DataFrame to CSV.
Here's an example:
In [1]: import pandas as pd
In [2]: result = pd.DataFrame({'Count': [83, 19, 20]})
In [3]: result.to_csv('result.csv', index_label='Event_id')
Which produces the following output:
In [4]: !cat result.csv
Event_id,Count
0,83
1,19
2,20
But my desired output is this:
In [5]: !cat result2.csv
Event_id,Count
1,83
2,19
3,20
I realize that this could be done by adding a sequence of integers shifted by 1 as a column to my data frame, but I'm new to Pandas and I'm wondering if a cleaner way exists.
Index is an object, and the default index starts from 0:
>>> result.index
Int64Index([0, 1, 2], dtype=int64)
You can shift this index by 1 with
>>> result.index += 1
>>> result.index
Int64Index([1, 2, 3], dtype=int64)
Just set the index before writing to CSV.
df.index = np.arange(1, len(df) + 1)
And then write it normally.
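For instance, a minimal end-to-end sketch with the frame from the question:
import numpy as np
import pandas as pd

result = pd.DataFrame({'Count': [83, 19, 20]})
result.index = np.arange(1, len(result) + 1)  # index is now 1, 2, 3
result.to_csv('result.csv', index_label='Event_id')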
source: In Python pandas, start row index from 1 instead of zero without creating additional column
Working example:
import pandas as pdas
dframe = pdas.read_csv(input_file)  # input_file is the path to your CSV
dframe.index = dframe.index + 1
Another way in one line:
df.shift()[1:]
Be aware, though, that shift() moves the values down one row, so the first row becomes NaN (dropped by the [1:]) and the last row of data is lost entirely; this only mimics a 1-based index.
In my opinion, best practice is to set the index with a RangeIndex:
import pandas as pd
result = pd.DataFrame(
{'Count': [83, 19, 20]},
index=pd.RangeIndex(start=1, stop=4, name='index')
)
>>> result
       Count
index
1         83
2         19
3         20
I prefer this because you can define the range, a possible step, and a name for the index in one line.
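If you do not want to hard-code the stop value, you can derive it from the data (a small variation on the same idea):
import pandas as pd

counts = [83, 19, 20]
result = pd.DataFrame(
    {'Count': counts},
    index=pd.RangeIndex(start=1, stop=len(counts) + 1, name='index')
)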
This worked for me:
df.index = np.arange(1, len(df)+1)
You can use this one:
import pandas as pd
result = pd.DataFrame({'Count': [83, 19, 20]})
result.index += 1
print(result)
or this one, with the help of the numpy library:
import pandas as pd
import numpy as np
result = pd.DataFrame({'Count': [83, 19, 20]})
result.index = np.arange(1, len(result)+1)
print(result)
np.arange creates a numpy array of values within the given interval, here (1, len(result)+1); you then assign that array to result.index.
Use this:
df.index = np.arange(1, len(df)+1)
Add ".shift()[1:]" while creating a data frame
data = pd.read_csv(r"C:\Users\user\path\data.csv").shift()[1:]
Forking from the original answer and adding my two cents:
if I'm not mistaken, starting from version 0.23 the default index object is of type RangeIndex.
From the official doc:
RangeIndex is a memory-saving special case of Int64Index limited to representing monotonic ranges. Using RangeIndex may in some instances improve computing speed.
For a huge index range that makes sense: the index is stored as a compact representation instead of the whole index being materialized at once, which saves memory.
Therefore, an example (using Series, but it applies to DataFrame also):
>>> import pandas as pd
>>>
>>> countries = ['China', 'India', 'USA']
>>> ds = pd.Series(countries)
>>>
>>>
>>> type(ds.index)
<class 'pandas.core.indexes.range.RangeIndex'>
>>> ds.index
RangeIndex(start=0, stop=3, step=1)
>>>
>>> ds.index += 1
>>>
>>> ds.index
RangeIndex(start=1, stop=4, step=1)
>>>
>>> ds
1 China
2 India
3 USA
dtype: object
>>>
As you can see, incrementing the index object changes the start and stop parameters.
This adds a column that accomplishes what you want:
df.insert(0, "Column Name", np.arange(1, len(df) + 1))
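If you go this route, you would then write the CSV without the index, since the 1-based ids now live in a regular column (a sketch using the Event_id label from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Count': [83, 19, 20]})
df.insert(0, 'Event_id', np.arange(1, len(df) + 1))
df.to_csv('result.csv', index=False)  # index=False: the ids are a column now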
Following on from TomAugspurger's answer, we could use a list comprehension rather than np.arange(), which removes the need to import numpy. You can use the following instead:
df.index = [i+1 for i in range(len(df))]
I have a dataframe with date values and would like to clamp them to 1 Jan 2000 or later. Since I need to do this element-wise, I use np.maximum(). The code below, however, gives
TypeError: Cannot compare type 'Timestamp' with type 'int'.
What's the appropriate method to deal with this kind of data type?
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': np.arange('1999-12', '2000-02', dtype='datetime64[D]')})
df['corrected_date'] = np.maximum(pd.to_datetime('20000101', format='%Y%m%d'), df['date'])
For me, comparing with a Series works:
s = pd.Series(pd.to_datetime('20000101', format='%Y%m%d'), index=df.index)
df['corrected_date'] = np.maximum(s, df['date'])
Or with DatetimeIndex:
i = np.repeat(pd.to_datetime(['20000101'], format='%Y%m%d'), len(df))
df['corrected_date'] = np.maximum(i, df['date'])
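Alternatively, if your pandas version supports it (an assumption on my part), Series.clip gives the same element-wise flooring without going through numpy:
df['corrected_date'] = df['date'].clip(lower=pd.to_datetime('20000101', format='%Y%m%d'))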
I am writing a function to extract values from datetimes over arrays. I want the function to operate on a Pandas DataFrame or a numpy ndarray.
The values should be returned in the same way as the Python datetime properties, e.g.
from datetime import datetime
dt = datetime(2016, 10, 12, 13)
dt.year
=> 2016
dt.second
=> 0
For a DataFrame this is reasonably easy to handle using applymap() (although there may well be a better way). I tried the same approach for numpy ndarrays using vectorize(), and I'm running into problems. Instead of the values I was expecting, I end up with very large integers, sometimes negative.
This was pretty baffling at first, but I figured out what is happening: the vectorized function is using item instead of __get__ to get the values out of the ndarray. This seems to automatically convert each datetime64 object to a long:
nd[1][0]
=> numpy.datetime64('1986-01-15T12:00:00.000000000')
nd[1].item()
=> 506174400000000000L
The long seems to be the number of nanoseconds since epoch (1970-01-01T00:00:00). Somewhere along the line the values are converted to integers and they overflow, hence the negative numbers.
So that's the problem. Please can someone help me fix it? The only thing I can think of is doing the conversion manually, but this would effectively mean reimplementing a chunk of the datetime module.
Is there some alternative to vectorize that doesn't use item()?
Thanks!
Minimal code example:
## DataFrame works fine
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'dts': [datetime(1970, 1, 1, 1), datetime(1986, 1, 15, 12),
datetime(2016, 7, 15, 23)]})
exp = pd.DataFrame({'dts': [1, 15, 15]})
df_func = lambda x: x.day
out = df.applymap(df_func)
assert out.equals(exp)
## numpy ndarray is more difficult
from numpy import datetime64 as dt64, timedelta64 as td64, vectorize # for brevity
# The unary function is a little more complex, especially for days and months where the minimum value is 1
nd_func = lambda x: int((dt64(x, 'D') - dt64(x, 'M') + td64(1, 'D')) / td64(1, 'D'))
nd = df.as_matrix()
exp = exp.as_matrix()
=> array([[ 1],
[15],
[15]])
# The function works as expected on a single element...
assert nd_func(nd[1][0]) == 15
# ...but not on an ndarray
nd_vect = vectorize(nd_func)
out = nd_vect(nd)
=> array([[ -105972749999999],
[ 3546551532709551616],
[-6338201187830896640]])
In Py3 the error is OverflowError: Python int too large to convert to C long.
In [214]: dts = df['dts'].values   # assumed setup: the datetime64 values from the question's DataFrame
In [215]: f=np.vectorize(nd_func,otypes=[int])
In [216]: f(dts)
...
OverflowError: Python int too large to convert to C long
but if I change the datetime units, it runs OK:
In [217]: f(dts.astype('datetime64[ms]'))
Out[217]: array([ 1, 15, 15])
We could dig into this in more depth, but this seems to be the simplest solution.
Keep in mind that vectorize is a convenience function; it makes iterating over multiple dimensions easier. But for a 1d array it is basically:
np.array([nd_func(i) for i in dts])
But note that we don't have to use iteration:
In [227]: ((dts.astype('datetime64[D]') - dts.astype('datetime64[M]') + td64(1,'D')) / td64(1,'D')).astype(int)
Out[227]: array([ 1, 15, 15])
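For reference, a self-contained version of that last approach (a sketch; I've used .values in place of the deprecated as_matrix()):
import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'dts': [datetime(1970, 1, 1, 1), datetime(1986, 1, 15, 12),
                           datetime(2016, 7, 15, 23)]})
dts = df['dts'].values  # datetime64[ns] array
# day of month = (day - start of month) + 1 day, counted in whole days
days = ((dts.astype('datetime64[D]') - dts.astype('datetime64[M]')
         + np.timedelta64(1, 'D')) / np.timedelta64(1, 'D')).astype(int)
print(days)  # [ 1 15 15]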