Python: converting different date formats in a column

I am trying to convert a column which has different date formats.
For example:
month
2018-01-01 float64
2018-02-01 float64
2018-03-01 float64
2018-03-01 00:00:00 float64
2018-04-01 01:00:00 float64
2018-05-01 01:00:00 float64
2018-06-01 01:00:00 float64
2018-07-01 01:00:00 float64
I want to convert everything in the column to just month and year. For example I would like Jan-18, Feb-18, Mar-18, etc.
I have tried using this code to first convert my column to datetime:
df['month'] = pd.to_datetime(df['month'], format='%Y-%m-%d')
But it returns a float64:
Out
month
2018-01-01 00:00:00 float64
2018-02-01 00:00:00 float64
2018-03-01 00:00:00 float64
2018-04-01 01:00:00 float64
2018-05-01 01:00:00 float64
2018-06-01 01:00:00 float64
2018-07-01 01:00:00 float64
In my output to CSV the month format has been changed to 01/05/2016 00:00:00. Can you please help me convert to just month and year, e.g. Aug-18?
Thank you

I assume you have a Pandas dataframe. In this case, you can use pd.Series.dt.to_period:
s = pd.Series(['2018-01-01', '2018-02-01', '2018-03-01',
               '2018-03-01 00:00:00', '2018-04-01 01:00:00'])
res = pd.to_datetime(s).dt.to_period('M')
print(res)
0 2018-01
1 2018-02
2 2018-03
3 2018-03
4 2018-04
dtype: object
As you can see, this results in a series of dtype object, which is generally inefficient. A better idea is to set the day to the last of the month and maintain a datetime series internally represented by integers.
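If the goal is literally the Jan-18 labels from the question, a minimal sketch is to keep a real datetime series internally and only format to strings at output time. The MonthEnd pinning follows the suggestion above; strftime is display-only (this is an illustration, not part of the original answer):
import pandas as pd

s = pd.Series(['2018-01-01', '2018-02-01', '2018-03-01 00:00:00',
               '2018-04-01 01:00:00'])
ts = pd.to_datetime(s)

# Keep a true datetime series internally, pinned to the last day of each month
month_end = ts + pd.offsets.MonthEnd(0)

# Display-only strings in the 'Jan-18' style the question asks for
print(ts.dt.strftime('%b-%y').tolist())
# ['Jan-18', 'Feb-18', 'Mar-18', 'Apr-18']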

Related

How to append hour:min:sec to the DateTime in pandas Dataframe

I have the following dataframe, where date was set as the index column:
date        renormalized
2017-01-01  6
2017-01-08  5
2017-01-15  3
2017-01-22  3
2017-01-29  3
I want to append 00:00:00 to each datetime in the index column, to make it look like:
date                 renormalized
2017-01-01 00:00:00  6
2017-01-08 00:00:00  5
2017-01-15 00:00:00  3
2017-01-22 00:00:00  3
2017-01-29 00:00:00  3
It seems I'm stuck, with no solution to make it happen... It would be great if anyone can help.
Thanks,
AL
When the time is 0 for all instances, pandas doesn't show the time by default (although each entry is a Timestamp, so it has the time!). Your data is probably already normalized, and you can perform time-delta operations as usual.
You can see a target observation with df.index[0] for instance, or take a look at all the times with df.index.time.
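A minimal sketch of that inspection, built from the values shown in the question:
import pandas as pd

df = pd.DataFrame({'renormalized': [6, 5, 3, 3, 3]},
                  index=pd.to_datetime(['2017-01-01', '2017-01-08',
                                        '2017-01-15', '2017-01-22', '2017-01-29']))

print(df.index[0])    # 2017-01-01 00:00:00 -- a Timestamp, the time is there
print(df.index.time)  # [datetime.time(0, 0) datetime.time(0, 0) ...]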
You can use DatetimeIndex.strftime
df.index = pd.to_datetime(df.index).strftime('%Y-%m-%d %H:%M:%S')
print(df)
renormalized
date
2017-01-01 00:00:00 6
2017-01-08 00:00:00 5
2017-01-15 00:00:00 3
2017-01-22 00:00:00 3
2017-01-29 00:00:00 3
Or, if the index holds plain strings, you can simply concatenate:
df.index = df.index + ' 00:00:00'

Remove minute-level timestamps and keep only hourly data in Python

I have a bunch of timestamp data in a csv file like this:
2012-01-01 00:00:00, data
2012-01-01 00:01:00, data
2012-01-01 00:02:00, data
...
2012-01-01 00:59:00, data
2012-01-01 01:00:00, data
2012-01-01 01:01:00, data
I want to drop the minute-level rows and keep only one row per hour in Python, like the following:
2012-01-01 00:00:00, data
2012-01-01 01:00:00, data
2012-01-01 02:00:00, data
Could anyone help me? Thank you.
I believe you need pandas resample; here's an example of how to use it to achieve the output you want. However, keep in mind that since this is a resampling operation during frequency conversion, you must pass a function describing how the other columns should behave (summing all values corresponding to the new timeframe, calculating an average, calculating the difference, etc.), otherwise you will get back a DatetimeIndexResampler object. Here is an example:
import pandas as pd
index = pd.date_range('1/1/2000', periods=9, freq='40T')
series = pd.Series(range(9),index=index)
print(series)
Output:
2000-01-01 00:00:00 0
2000-01-01 00:40:00 1
2000-01-01 01:20:00 2
2000-01-01 02:00:00 3
2000-01-01 02:40:00 4
2000-01-01 03:20:00 5
2000-01-01 04:00:00 6
2000-01-01 04:40:00 7
2000-01-01 05:20:00 8
Applying resample hourly without passing the aggregation function:
print(series.resample('H'))
Output:
DatetimeIndexResampler [freq=<Hour>, axis=0, closed=left, label=left, convention=start, base=0]
After passing .sum():
print(series.resample('H').sum())
Output:
2000-01-01 00:00:00 1
2000-01-01 01:00:00 2
2000-01-01 02:00:00 7
2000-01-01 03:00:00 5
2000-01-01 04:00:00 13
2000-01-01 05:00:00 8
Freq: H, dtype: int64
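Note that .sum() aggregates the values within each hour; if the goal is instead to keep only the rows that fall exactly on the hour, as the question's desired output suggests, a hedged alternative is plain selection (the frame below is hypothetical minute-level data like the question's CSV):
import pandas as pd

idx = pd.date_range('2012-01-01 00:00', periods=180, freq='T')
df = pd.DataFrame({'data': range(180)}, index=idx)

hourly = df[df.index.minute == 0]  # keep only on-the-hour rows, no aggregation
# equivalent here: df.asfreq('H'), since every hour mark exists in the data
print(hourly.head(3))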

How to handle end of time series in pandas resample when upsampling?

I want to resample from hours to half-hours. I use .ffill() in the example, but I've tested .asfreq() as an intermediate step too.
The goal is to get intervals of half hours where the hourly values are spread among the upsampled intervals, and I'm trying to find a general solution for any ranges with the same problem.
import pandas as pd
index = pd.date_range('2018-10-10 00:00', '2018-10-10 02:00', freq='H')
hourly = pd.Series(range(10, len(index)+10), index=index)
half_hourly = hourly.resample('30min').ffill() / 2
The hourly series looks like:
2018-10-10 00:00:00 10
2018-10-10 01:00:00 11
2018-10-10 02:00:00 12
Freq: H, dtype: int64
And the half_hourly:
2018-10-10 00:00:00 5.0
2018-10-10 00:30:00 5.0
2018-10-10 01:00:00 5.5
2018-10-10 01:30:00 5.5
2018-10-10 02:00:00 6.0
Freq: 30T, dtype: float64
The problem with the last one is that there is no row representing 02:30:00.
I want to achieve something that is:
2018-10-10 00:00:00 5.0
2018-10-10 00:30:00 5.0
2018-10-10 01:00:00 5.5
2018-10-10 01:30:00 5.5
2018-10-10 02:00:00 6.0
2018-10-10 02:30:00 6.0
Freq: 30T, dtype: float64
I understand that the hourly series ends at 02:00, so there is no reason to expect pandas to insert the last half hour by default. However, after reading a lot of deprecated/old posts, some newer ones, the documentation, and the cookbook, I still wasn't able to find a straightforward solution.
Lastly, I also tested .mean(), but that didn't fill the NaNs, and interpolate() didn't average by hour as I wanted it to.
My .ffill() / 2 almost works as a way to spread hours to half hours in this case, but it seems like a hack for a problem that pandas probably already solves better.
Thanks in advance.
Your precise issue can be solved like this:
>>> import pandas as pd
>>> index = pd.date_range('2018-10-10 00:00', '2018-10-10 02:00', freq='H')
>>> hourly = pd.Series(range(10, len(index)+10), index=index)
>>> hourly.reindex(index.union(index.shift(freq='30min'))).ffill() / 2
2018-10-10 00:00:00 5.0
2018-10-10 00:30:00 5.0
2018-10-10 01:00:00 5.5
2018-10-10 01:30:00 5.5
2018-10-10 02:00:00 6.0
2018-10-10 02:30:00 6.0
Freq: 30T, dtype: float64
I suspect this is a minimal example, so I will try to solve it generically as well. Let's say you have multiple points to fill in each day:
>>> import pandas as pd
>>> x = pd.Series([1.5, 2.5], pd.DatetimeIndex(['2018-09-21', '2018-09-22']))
>>> x.resample('6h').ffill()
2018-09-21 00:00:00 1.5
2018-09-21 06:00:00 1.5
2018-09-21 12:00:00 1.5
2018-09-21 18:00:00 1.5
2018-09-22 00:00:00 2.5
Freq: 6H, dtype: float64
Employ a similar trick to include 6am, 12pm, and 6pm on 2018-09-22 as well: re-index with a shift equal to the span you want to have as an inclusive endpoint. In this case our shift is an extra day.
>>> import pandas as pd
>>> x = pd.Series([1.5, 2.5], pd.DatetimeIndex(['2018-09-21', '2018-09-22']))
>>> res = x.reindex(x.index.union(x.index.shift(freq='1D'))).resample('6h').ffill()
>>> res[:res.last_valid_index()] # drop the start of next day
2018-09-21 00:00:00 1.5
2018-09-21 06:00:00 1.5
2018-09-21 12:00:00 1.5
2018-09-21 18:00:00 1.5
2018-09-22 00:00:00 2.5
2018-09-22 06:00:00 2.5
2018-09-22 12:00:00 2.5
2018-09-22 18:00:00 2.5
Freq: 6H, dtype: float64

Pandas: Adding varying numbers of days to a date in a dataframe

I have a dataframe with a date column and then a number of days that I want to add to that column. I want to create a new column, 'Recency_Date', with the resulting value.
df:
fan Community Name Count Mean_Days Date_Min
0 855 AAA Games 6 353 2013-04-16
1 855 First Person Shooters 2 420 2012-10-16
2 855 Playstation 3 108 2014-06-12
3 3148 AAA Games 1 0 2015-04-17
4 3148 Mobile Gaming 1 0 2013-01-19
df info:
merged.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4627415 entries, 0 to 4627414
Data columns (total 5 columns):
fan int64
Community Name object
Count int64
Mean_Days int32
Date_Min datetime64[ns]
dtypes: datetime64[ns](1), int32(1), int64(2), object(1)
memory usage: 194.2+ MB
Sample data as csv:
fan,Community Name,Count,Mean_Days,Date_Min
855,AAA Games,6,353,2013-04-16 00:00:00
855,First Person Shooters,2,420,2012-10-16 00:00:00
855,Playstation,3,108,2014-06-12 00:00:00
3148,AAA Games,1,0,2015-04-17 00:00:00
3148,Mobile Gaming,1,0,2013-01-19 00:00:00
3148,Power PCs,2,0,2014-06-17 00:00:00
3148,XBOX,1,0,2009-11-12 00:00:00
3860,AAA Games,1,0,2012-11-28 00:00:00
3860,Minecraft,3,393,2011-09-07 00:00:00
4044,AAA Games,5,338,2010-11-15 00:00:00
4044,Blizzard Games,1,0,2013-07-12 00:00:00
4044,Geek Culture,1,0,2011-06-03 00:00:00
4044,Indie Games,2,112,2013-01-09 00:00:00
4044,Minecraft,1,0,2014-01-02 00:00:00
4044,Professional Gaming,1,0,2014-01-02 00:00:00
4044,XBOX,2,785,2010-11-15 00:00:00
4827,AAA Games,1,0,2010-08-24 00:00:00
4827,Gaming Humour,1,0,2012-05-05 00:00:00
4827,Minecraft,2,10,2012-03-21 00:00:00
5260,AAA Games,4,27,2013-09-17 00:00:00
5260,Indie Games,8,844,2011-06-08 00:00:00
5260,MOBA,2,0,2012-10-27 00:00:00
5260,Minecraft,5,106,2012-02-17 00:00:00
5260,XBOX,1,0,2011-06-15 00:00:00
5484,AAA Games,21,1296,2009-08-01 00:00:00
5484,Free to Play,1,0,2014-12-08 00:00:00
5484,Indie Games,1,0,2014-05-28 00:00:00
5484,Music Games,1,0,2012-09-12 00:00:00
5484,Playstation,1,0,2012-02-22 00:00:00
I've tried:
merged['Recency_Date'] = merged['Date_Min'] + timedelta(days=merged['Mean_Days'])
and:
merged['Recency_Date'] = pd.DatetimeIndex(merged['Date_Min']) + pd.DateOffset(merged['Mean_Days'])
But I'm having trouble finding something that will work for a Series rather than an individual int value. Any and all help would be much appreciated.
If the 'Date_Min' dtype is already datetime, then you can construct a TimedeltaIndex from your 'Mean_Days' column and add it:
In [174]:
import datetime as dt
import pandas as pd
df = pd.DataFrame({'Date_Min': [dt.datetime.now(), dt.datetime(2015,3,4), dt.datetime(2011,6,9)],
                   'Mean_Days': [1, 2, 3]})
df
Out[174]:
Date_Min Mean_Days
0 2015-09-15 14:02:37.452369 1
1 2015-03-04 00:00:00.000000 2
2 2011-06-09 00:00:00.000000 3
In [175]:
df['Date_Min'] + pd.TimedeltaIndex(df['Mean_Days'], unit='D')
Out[175]:
0 2015-09-16 14:02:37.452369
1 2015-03-06 00:00:00.000000
2 2011-06-12 00:00:00.000000
Name: Date_Min, dtype: datetime64[ns]
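Applied to the question's frame, a sketch might be (assuming 'Date_Min' is already datetime64, as the .info() output shows):
merged['Recency_Date'] = merged['Date_Min'] + pd.TimedeltaIndex(merged['Mean_Days'], unit='D')
In newer pandas, pd.to_timedelta(merged['Mean_Days'], unit='D') produces the same result and is the more common spelling.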

Matching two similar dates with different times

I have two dates in pandas dataframes (df1.a_date & df2.another_date) read from CSV files. They match at the date level (YYYY-MM-DD) but not at the time (HH:MM:SS). Both are read in as dtype: object.
I need to merge the two dataframes on the dates, but since they aren't exact, I probably need to convert them first. Any ideas?
edit:
I've tried using datetime.date to construct a new date from the pandas datetime, but that doesn't seem to work.
datetime.date(df.a_date.year, df.a_date.month, df.a_date.day)
pandas datetime objects don't have year, month, day accessors, though.
You can normalize a date column/DatetimeIndex index:
Note: At the moment normalize isn't exported to the dt accessor so we need to wrap with DatetimeIndex.
In [11]: df = pd.DataFrame(pd.date_range('2015-01-01 05:00', periods=3), columns=['datetime'])
In [12]: df
Out[12]:
datetime
0 2015-01-01 05:00:00
1 2015-01-02 05:00:00
2 2015-01-03 05:00:00
In [13]: df["date"] = pd.DatetimeIndex(df["datetime"]).normalize()
In [14]: df
Out[14]:
datetime date
0 2015-01-01 05:00:00 2015-01-01
1 2015-01-02 05:00:00 2015-01-02
2 2015-01-03 05:00:00 2015-01-03
This works if it's a DatetimeIndex too, use df.index rather than df[col_name].
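With both columns normalized, a sketch of the merge the question asks for might look like this (a_date and another_date are the question's column names; the merge key name 'date' is an assumption):
df1['date'] = pd.DatetimeIndex(pd.to_datetime(df1['a_date'])).normalize()
df2['date'] = pd.DatetimeIndex(pd.to_datetime(df2['another_date'])).normalize()
merged = df1.merge(df2, on='date')  # rows now match at day resolution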
Format the datetime to only include YYYY-MM-DD. Assuming d is a single datetime value from your dataframe:
'{:%Y-%m-%d}'.format(d)
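Applied to a whole column, the same format string can be mapped over the values (a sketch; the column is first converted so each value is a Timestamp):
df['a_date'] = pd.to_datetime(df['a_date'])
df['a_date_str'] = df['a_date'].apply('{:%Y-%m-%d}'.format)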
Assume dft is your dataframe and the 'index' column contains datetimes:
In [1804]: dft.head()
Out[1804]:
index A
0 2013-01-01 00:00:00 1.193366
1 2013-01-01 00:01:00 1.013425
2 2013-01-01 00:02:00 1.281902
3 2013-01-01 00:03:00 -0.043788
4 2013-01-01 00:04:00 -1.610164
You could convert the column to contain just the date, save it in a different column if you want, and operate on that:
In [1805]: dft['index'].apply(lambda v:v.date()).head()
Out[1805]:
0 2013-01-01
1 2013-01-01
2 2013-01-01
3 2013-01-01
4 2013-01-01
Name: index, dtype: object
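As an aside, newer pandas exposes this through the .dt accessor, so the apply can be avoided (assuming the column is already datetime64):
dft['date_only'] = dft['index'].dt.date         # python date objects, dtype object
dft['date_only'] = dft['index'].dt.normalize()  # keeps datetime64, time set to midnight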
