Python pandas: automatic conversion of string to Timestamp when constructing DataFrame

>>> pd.version.short_version
'0.15.2'
>>> D = [{'time': pd.Timestamp('2000/01/01'), 'value': 1}, {'time': '----', 'value': 3}]
>>> pd.DataFrame(D)
        time  value
0 2000-01-01      1
1 2015-03-03      3
The [1, 'time'] cell has been automatically filled in. This seems to happen only when the non-time string contains only characters like '-' and '/', which typically appear in datetimes.
>>> D = [{'time': '0', 'value': 3}, {'time': pd.Timestamp('2000/01/01'), 'value': 1}]
>>> pd.DataFrame(D)
                  time  value
0                    0      3
1  2000-01-01 00:00:00      1
Is this a bug, or can I stop it?
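One possible workaround, not from the original thread and worth verifying on your pandas version: force the columns to object dtype at construction time, which suppresses the datetime inference. Columns that really are datetimes can then be converted selectively with pd.to_datetime.
>>> D = [{'time': pd.Timestamp('2000/01/01'), 'value': 1}, {'time': '----', 'value': 3}]
>>> df = pd.DataFrame(D, dtype=object)  # both columns stay object; '----' is kept as a string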

Related

Inconsistent slicing [:] behavior on Pandas Dataframes

I have two DataFrames. The first has integers as its index; the second has datetimes as its index. The slice operator [:] behaves differently on these DataFrames.
Case 1
>>> df = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
>>> df
   A
0  1
1  2
2  3
>>> df[0:2]
   A
0  1
1  2
Case 2
>>> import datetime as dt
>>> a = dt.datetime(2000, 1, 1)
>>> b = dt.datetime(2000, 1, 2)
>>> c = dt.datetime(2000, 1, 3)
>>> df = pd.DataFrame({'A': [1, 2, 3]}, index=[a, b, c])
>>> df
            A
2000-01-01  1
2000-01-02  2
2000-01-03  3
>>> df[a:b]
            A
2000-01-01  1
2000-01-02  2
Why does the final row get excluded in case 1 but not in case 2?
Don't use it; it's better to use loc for consistency:
df = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
print(df.loc[0:2])
   A
0  1
1  2
2  3
import datetime

a = datetime.datetime(2000, 1, 1)
b = datetime.datetime(2000, 1, 2)
c = datetime.datetime(2000, 1, 3)
df = pd.DataFrame({'A': [1, 2, 3]}, index=[a, b, c])
print(df.loc[a:b])
            A
2000-01-01  1
2000-01-02  2
The reason why the last row is omitted can be found in the docs:
With DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience since it is such a common operation.
print(df[0:2])
   A
0  1
1  2
For selecting by datetimes, exact indexing is used:
... In contrast, indexing with Timestamp or datetime objects is exact, because the objects have exact meaning. These also follow the semantics of including both endpoints.
Okay, to understand this, let's first run an experiment:
import pandas as pd
import datetime as dt

a = dt.datetime(2000, 1, 1)
b = dt.datetime(2000, 1, 2)
c = dt.datetime(2000, 1, 3)
df = pd.DataFrame({'A': [4, 5, 6]}, index=[a, b, c])
Now let's use
df[0:2]
Which gives us
            A
2000-01-01  4
2000-01-02  5
Now this behavior is consistent with Python list slicing, but if you use
df[a:c]
You get
            A
2000-01-01  4
2000-01-02  5
2000-01-03  6
This is because df[a:c] does not use the default positional slicing: the indexes do not correspond to integers, so pandas falls back to label-based slicing, which includes the last element. If your indexes were integers, pandas would default to built-in positional slicing, which is why the two cases differ. As already mentioned in the answer by jezrael, it is better to use loc, as that has more consistency across the board.
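A short sketch of the distinction, using the df built above: iloc always slices by position and excludes the endpoint, while loc slices by label and includes both endpoints.
>>> df.iloc[0:2]   # positional, endpoint excluded
            A
2000-01-01  4
2000-01-02  5
>>> df.loc[a:c]    # label-based, both endpoints included
            A
2000-01-01  4
2000-01-02  5
2000-01-03  6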

How to replace string at specific index in pandas dataframe

I have following dataframe in pandas
   code             bucket
0     0   08:30:00-9:00:00
1     1  10:00:00-11:00:00
2     2  12:00:00-13:00:00
I want to replace the character at string index 7 (the last 0 of the first time) with 1; my desired dataframe is
   code             bucket
0     0   08:30:01-9:00:00
1     1  10:00:01-11:00:00
2     2  12:00:01-13:00:00
How to do it in pandas?
Use indexing with str:
df['bucket'] = df['bucket'].str[:7] + '1' + df['bucket'].str[8:]
Or list comprehension:
df['bucket'] = [x[:7] + '1' + x[8:] for x in df['bucket']]
print(df)
   code             bucket
0     0   08:30:01-9:00:00
1     1  10:00:01-11:00:00
2     2  12:00:01-13:00:00
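As a side note not covered in the original answer, pandas also offers Series.str.slice_replace, which expresses the same edit in a single call:
df['bucket'] = df['bucket'].str.slice_replace(7, 8, '1')  # replace the character at index 7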
Avoid string operations where possible
You lose a considerable amount of functionality by working with strings only. While this may be a one-off operation, you will find that repeated string manipulations will quickly become expensive in terms of time and memory efficiency.
Use pd.to_datetime instead
You can add additional series to your dataframe with datetime objects. Below is an example which, in addition, creates an object dtype series in the format you desire.
# split by '-' into 2 series
dfs = df.pop('bucket').str.split('-', expand=True)
# convert to datetime
dfs = dfs.apply(pd.to_datetime, axis=1)
# add 1s to first series
dfs[0] = dfs[0] + pd.Timedelta(seconds=1)
# create object series from 2 times
form = '%H:%M:%S'
dfs[2] = dfs[0].dt.strftime(form) + '-' + dfs[1].dt.strftime(form)
# join to original dataframe
res = df.join(dfs)
print(res)
   code                   0                   1                  2
0     0 2018-10-02 08:30:01 2018-10-02 09:00:00  08:30:01-09:00:00
1     1 2018-10-02 10:00:01 2018-10-02 11:00:00  10:00:01-11:00:00
2     2 2018-10-02 12:00:01 2018-10-02 13:00:00  12:00:01-13:00:00

Python: Converting datetime to ordinal

I have a list (actually a column in a pandas DataFrame, if that matters) of Timestamps, and I'm trying to convert every element of the list to ordinal format. So I run a for loop through the list (is there a faster way?) and use:
import datetime as dt
a = a.toordinal()
or
import datetime as dt
a = dt.datetime.toordinal(a)
However, the following happened (simplified):
In [1]: a
Out[1]: Timestamp('2019-12-25 00:00:00')
In [2]: b = dt.datetime.toordinal(a)
In [3]: b
Out[3]: 737418
In [4]: a = b
In [5]: a
Out[5]: Timestamp('1970-01-01 00:00:00.000737418')
The result makes absolutely no sense to me. Obviously, what I was trying to get is:
In [1]: a
Out[1]: Timestamp('2019-12-25 00:00:00')
In [2]: b = dt.datetime.toordinal(a)
In [3]: b
Out[3]: 737418
In [4]: a = b
In [5]: a
Out[5]: 737418
What went wrong?
[console output screenshot]
What went wrong?
Your question is a bit misleading, and the screenshot shows what is going on.
Normally, when you write
a = b
in Python, it will bind the name a to the object bound to b. In this case, you will have
id(a) == id(b)
In your case, however, contrary to your question, you're actually doing the assignment
a[0] = b
This will call a method of a, assigning b to its 0 index. The object's class determines what happens in this case. Here, specifically, a is a pandas.Series, and it converts the object in order to conform to its dtype.
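A minimal sketch of that coercion (made-up data; recent pandas versions may warn about, or refuse, this silent conversion):
>>> s = pd.Series([pd.Timestamp('2019-12-25')])  # dtype: datetime64[ns]
>>> s[0] = 737418                                # the int is coerced to the Series dtype
>>> s[0]
Timestamp('1970-01-01 00:00:00.000737418')       # read as 737418 nanoseconds since the epoch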
Please don't loop. It's not necessary.
#!/usr/bin/env python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'dates': [datetime(1990, 4, 28),
                             datetime(2018, 4, 13),
                             datetime(2017, 11, 4)]})
print(df)
print(df['dates'].dt.weekday_name)  # removed in pandas 1.0; use .dt.day_name() there
print(df['dates'].dt.weekday)
print(df['dates'].dt.month)
print(df['dates'].dt.year)
gives the dataframe:
       dates
0 1990-04-28
1 2018-04-13
2 2017-11-04
And the printed values
0    Saturday
1      Friday
2    Saturday
Name: dates, dtype: object
0    5
1    4
2    5
Name: dates, dtype: int64
0     4
1     4
2    11
Name: dates, dtype: int64
0    1990
1    2018
2    2017
Name: dates, dtype: int64
For the toordinal, you need to "loop" with apply:
print(df['dates'].apply(lambda x: x.toordinal()))
gives the following pandas series
0    726585
1    736797
2    736637
Name: dates, dtype: int64
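Since Timestamp subclasses datetime, map with the unbound method is an equivalent spelling (an alternative, not part of the original answer):
print(df['dates'].map(datetime.toordinal))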

Pandas rounds number to 0

I'm trying to assign a value to a cell, yet Pandas rounds it to zero. (I'm using Python 3.6)
in: df['column1']['row1'] = 1 / 331616
in: print(df['column1']['row1'])
out: 0
But if I try to assign this value to a standard Python dictionary key, it works fine.
in: {'column1': {'row1': 1/331616}}
out: {'column1': {'row1': 3.0155360416867704e-06}}
I've already done this, but it didn't help:
pd.set_option('precision', 50)
pd.set_option('chop_threshold', .00000000005)
Please, help.
pandas appears to be presuming that your datatype is an integer (int).
There are several ways to address this, either by setting the datatype to a float when the DataFrame is constructed OR by changing (or casting) the datatype (also referred to as a dtype) to a float on the fly.
setting the datatype (dtype) during construction:
>>> import pandas as pd
In making this simple DataFrame, we provide a single example value (1), and the columns are defined as containing floats during creation:
>>> df = pd.DataFrame([[1]], columns=['column1'], index=['row1'], dtype=float)
>>> df['column1']['row1'] = 1 / 331616
>>> df
       column1
row1  0.000003
converting the datatype on the fly:
>>> df = pd.DataFrame([[1]], columns=['column1'], index=['row1'], dtype=int)
>>> df['column1'] = df['column1'].astype(float)
>>> df['column1']['row1'] = 1 / 331616
>>> df
       column1
row1  0.000003
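One caveat worth adding, beyond the original answer: chained indexing like df['column1']['row1'] = ... can trigger pandas' SettingWithCopyWarning and may not write back at all; a single .loc call is the recommended way to assign once the dtype is float:
>>> df.loc['row1', 'column1'] = 1 / 331616  # one indexing operation, no chained assignment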
Your column's datatype is most likely set to int. You'll need to convert it either to float or to a mixed-type object dtype before assigning the value:
df = pd.DataFrame([1, 2, 3, 4, 5, 6])
df.dtypes
# 0    int64
# dtype: object
df[0][4] = 7 / 125
df
#    0
# 0  1
# 1  2
# 2  3
# 3  4
# 4  0
# 5  6
df[0] = df[0].astype('O')
df[0][4] = 7 / 22
df
#           0
# 0         1
# 1         2
# 2         3
# 3         4
# 4  0.318182
# 5         6
df.dtypes
# 0    object
# dtype: object

Pandas Return Only Repeated Results

I have a Pandas DataFrame with the columns:
UserID, Date, (other columns that we can ignore here)
I'm trying to select out only users that have visited on multiple dates. I'm currently doing it with groupby(['UserID', 'Date']) and a for loop, where I drop users with only one result, but I feel like there is a much better way to do this.
Thanks
It depends on the exact format of output you want, but you can count distinct Dates inside each UserID and keep those where this count > 1 (like HAVING COUNT(DISTINCT Date) > 1 in SQL):
>>> df
                  Date  UserID
0  2013-01-01 00:00:00       1
1  2013-01-02 00:00:00       2
2  2013-01-02 00:00:00       2
3  2013-01-02 00:00:00       1
4  2013-01-02 00:00:00       3
>>> g = df.groupby('UserID').Date.nunique()
>>> g
UserID
1    2
2    1
3    1
>>> g > 1
UserID
1     True
2    False
3    False
dtype: bool
>>> g[g > 1]
UserID
1    2
You can see that you get UserID = 1 as a result; it's the only user who visited on multiple dates.
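If you want the matching rows from the original frame rather than the counts, one way (a sketch building on g from above) is:
>>> df[df.UserID.isin(g[g > 1].index)]
                  Date  UserID
0  2013-01-01 00:00:00       1
3  2013-01-02 00:00:00       1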
To count unique dates for every UserID:
df.groupby("UserID").Date.agg(lambda s: len(s.unique()))
Then you can drop users with only one count.
For the sake of adding another answer, you can also use indexing with a list comprehension:
import numpy as np
import pandas as pd

DF = pd.DataFrame({'UserID': [1, 1, 2, 3, 4, 4, 5], 'Data': np.random.rand(7)})
# .ix from the original answer has been removed from pandas; .loc works here
DF.loc[[row for row in DF.index if list(DF.UserID).count(DF.UserID[row]) > 1]]
This might be as much work as your for loop, but it's just another option for you to consider.
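On a reasonably recent pandas you can also express the whole filter in one vectorized step with groupby/transform (an alternative sketch, not part of the original answers):
# keep rows whose UserID occurs more than once
DF[DF.groupby('UserID')['UserID'].transform('count') > 1]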
