Data Cleaning Consequence on a Pre-Built Index

Objective:
To create an index that accommodates a pre-existing set of price data from a csv file. I can build an index using list comprehensions. Built that way, the construction gives me a filtered list of length 86,772 when run over 1/3/2007-8/30/2012 with 42 timestamps per day (i.e. 10-minute intervals). However, the price data coming from the csv has length 62,034. The difference in length is due to data cleaning.
That said, I am not sure how to overcome the mismatch between the real data and this pre-built (list comprehension) index.
Attempt:
Am I using the first two lines incorrectly?
data=pd.read_csv('___.csv', parse_dates={'datetime':[0,1]}).set_index('datetime')
dt_index = pd.DatetimeIndex([datetime.combine(i.date,i.time) for i in data.index])
ts = pd.Series(data.prices.values, dt_index)
Questions:
As I understand it, I should use 'combine' since I want the index construction to be completely informed by my csv file. And, 'combine' returns a new datetime object whose date components are equal to the given date object’s, and whose time components are equal to the given time object’s.
When I use parse_dates, does it lump the time and date together and treat the result as a 'date'?
Is there a better way to achieve the stated objective?
Traceback Error:
AttributeError: 'unicode' object has no attribute 'date'

You can write this neatly as follows:
ts = df.prices
Here's an example:
In [1]: df = pd.read_csv('prices.csv',
parse_dates={'datetime': [0,1]}).set_index('datetime')
In [2]: df # dataframe
Out[2]:
                     prices  duty
datetime
2012-11-12 10:00:00       1     0
2012-12-12 10:00:00       2     0
2012-12-12 11:00:00       3     1
In [3]: df.prices # timeseries
Out[3]:
datetime
2012-11-12 10:00:00 1
2012-12-12 10:00:00 2
2012-12-12 11:00:00 3
Name: prices
In [4]: ts = df.prices
You can groupby date like so (similar to this example from the docs):
In [5]: key = lambda x: x.date()
In [6]: df.groupby(key).sum()
Out[6]:
            prices  duty
2012-11-12       1     0
2012-12-12       5     1
In [7]: ts.groupby(key).sum()
Out[7]:
2012-11-12 1
2012-12-12 5
Where prices.csv contains:
date,time,prices,duty
11/12/2012,10:00,1,0
12/12/2012,10:00,2,0
12/12/2012,11:00,3,1
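For the length mismatch in the question, one possible approach is to let the csv define the index and then reindex the series onto the pre-built grid, so the cleaned-out timestamps show up as NaN. The following is only a sketch under assumptions: the csv content is inlined via io.StringIO, the date format is taken to be month/day/year, and the names raw, grid and aligned as well as the 30-minute grid are made up for illustration.

```python
import io
import pandas as pd

# Same content as prices.csv above, inlined so the sketch is self-contained.
csv_text = """date,time,prices,duty
11/12/2012,10:00,1,0
12/12/2012,10:00,2,0
12/12/2012,11:00,3,1
"""

raw = pd.read_csv(io.StringIO(csv_text))

# Combine the two text columns and parse them in one step
# (assumption: dates are month/day/year).
raw["datetime"] = pd.to_datetime(raw["date"] + " " + raw["time"],
                                 format="%m/%d/%Y %H:%M")
df = raw.set_index("datetime")[["prices", "duty"]]
ts = df["prices"]

# Daily totals, grouped by the calendar date of each timestamp.
daily = ts.groupby(ts.index.date).sum()

# Hypothetical pre-built grid: timestamps absent from the csv become NaN.
grid = pd.date_range("2012-12-12 10:00", "2012-12-12 11:00", freq="30min")
aligned = ts.reindex(grid)

print(daily)
print(aligned)
```

Whether the missing slots should stay NaN, be forward-filled, or be dropped depends on how the cleaned data is meant to be used.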

Related

How to work around the date range limit in Pandas for plotting?

Sorry if this question has been asked before, but I can't seem to find one that describes my current issue.
Basically, I have a large climate dataset that is not bound to "real" dates. The dataset starts at "year one" and goes to "year 9999". These dates are stored as strings such as Jan-01, Feb-01, Mar-01 etc, where the number indicates the year. When trying to convert this column to date time objects, I get an out of range error. (My reading into this suggests this is due to a 64bit limit on the possible datetime timestamps that can exist)
What is a good way to work around this problem/process the date information so I can effectively plot the associated data vs these dates, over this ~10,000 year period?
Thanks

The cftime library was created specifically for this purpose, and xarray has a convenient xr.cftime_range function that makes creating such a range easy:
In [3]: import xarray as xr, pandas as pd
In [4]: date_range = xr.cftime_range('0001-01-01', '9999-01-01', freq='D')
In [5]: type(date_range)
Out[5]: xarray.coding.cftimeindex.CFTimeIndex
This creates a CFTimeIndex object which plays nicely with pandas:
In [8]: df = pd.DataFrame({"date": date_range, "vals": range(len(date_range))})
In [9]: df
Out[9]:
date vals
0 0001-01-01 00:00:00 0
1 0001-01-02 00:00:00 1
2 0001-01-03 00:00:00 2
3 0001-01-04 00:00:00 3
4 0001-01-05 00:00:00 4
... ... ...
3651692 9998-12-28 00:00:00 3651692
3651693 9998-12-29 00:00:00 3651693
3651694 9998-12-30 00:00:00 3651694
3651695 9998-12-31 00:00:00 3651695
3651696 9999-01-01 00:00:00 3651696
[3651697 rows x 2 columns]
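If installing cftime and xarray is not an option, a different technique that may be enough for plotting is to use pandas Period objects, which are not subject to the nanosecond Timestamp limit. This is a sketch, not the approach from the answer above. It assumes the labels look like 'Jan-01' with the part after the dash being the year, and the names labels, MONTHS, to_period and decimal_year are hypothetical.

```python
import pandas as pd

# Hypothetical sample of the string labels described in the question.
labels = ["Jan-01", "Feb-01", "Dec-9999"]

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

def to_period(label):
    """Turn 'Mon-YYYY'-style text into a monthly Period."""
    month_name, year = label.split("-")
    return pd.Period(year=int(year), month=MONTHS[month_name], freq="M")

periods = pd.PeriodIndex([to_period(label) for label in labels])

# A plain numeric x axis (fractional years) works with any plotting library.
decimal_year = periods.year + (periods.month - 1) / 12

print(periods)
print(list(decimal_year))
```

The numeric axis sidesteps date handling in the plotting library entirely, at the cost of losing calendar-aware tick labels.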

Pandas: how to change only one column which is in a series contain same column name [duplicate]

I have a Dataframe, df, with the following column:
df['ArrivalDate'] =
...
936 2012-12-31
938 2012-12-29
965 2012-12-31
966 2012-12-31
967 2012-12-31
968 2012-12-31
969 2012-12-31
970 2012-12-29
971 2012-12-31
972 2012-12-29
973 2012-12-29
...
The elements of the column are pandas.tslib.Timestamp.
I want to just include the year and month. I thought there would be simple way to do it, but I can't figure it out.
Here's what I've tried:
df['ArrivalDate'].resample('M', how = 'mean')
I got the following error:
Only valid with DatetimeIndex or PeriodIndex
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I got the following error:
'Timestamp' object has no attribute '__getitem__'
Any suggestions?
Edit: I sort of figured it out.
df.index = df['ArrivalDate']
Then, I can resample another column using the index.
But I'd still like a method for reconfiguring the entire column. Any ideas?

If you want new columns showing year and month separately you can do this:
df['year'] = pd.DatetimeIndex(df['ArrivalDate']).year
df['month'] = pd.DatetimeIndex(df['ArrivalDate']).month
or...
df['year'] = df['ArrivalDate'].dt.year
df['month'] = df['ArrivalDate'].dt.month
Then you can combine them or work with them just as they are.

The df['date_column'] column has to be in datetime format.
df['month_year'] = df['date_column'].dt.to_period('M')
You could also use D for day, 2M for 2 months, etc. for different sampling intervals. If the data carries full timestamps, finer intervals such as 45Min for 45 minutes or 15Min for 15 minutes are also possible.
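A small self-contained sketch of the to_period approach, assuming a made-up visitors column to show that the resulting periods can be used directly as a grouping key:

```python
import pandas as pd

df = pd.DataFrame({
    "ArrivalDate": pd.to_datetime(["2012-12-31", "2012-12-29", "2013-01-02"]),
    "visitors": [10, 20, 5],  # hypothetical values, for illustration only
})

# Keep only the year and month of each date as a monthly Period.
df["month_year"] = df["ArrivalDate"].dt.to_period("M")

# Periods sort chronologically and work as groupby keys.
monthly = df.groupby("month_year")["visitors"].sum()

print(df)
print(monthly)
```

Unlike a formatted string, the period column keeps date semantics, so sorting and date arithmetic still behave as expected.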

You can directly access the year and month attributes, or request a datetime.datetime:
In [15]: t = pandas.Timestamp.now()
In [16]: t
Out[16]: Timestamp('2014-08-05 14:49:39.643701', tz=None)
In [17]: t.to_pydatetime() #datetime method is deprecated
Out[17]: datetime.datetime(2014, 8, 5, 14, 49, 39, 643701)
In [18]: t.day
Out[18]: 5
In [19]: t.month
Out[19]: 8
In [20]: t.year
Out[20]: 2014
One way to combine year and month is to make an integer encoding them, such as: 201408 for August, 2014. Along a whole column, you could do this as:
df['YearMonth'] = df['ArrivalDate'].map(lambda x: 100*x.year + x.month)
or many variants thereof.
I'm not a big fan of doing this, though, since it makes date alignment and arithmetic painful later and especially painful for others who come upon your code or data without this same convention. A better way is to choose a day-of-month convention, such as final non-US-holiday weekday, or first day, etc., and leave the data in a date/time format with the chosen date convention.
The calendar module is useful for obtaining the number value of certain days such as the final weekday. Then you could do something like:
import calendar
import datetime
df['AdjustedDateToEndOfMonth'] = df['ArrivalDate'].map(
    lambda x: datetime.datetime(
        x.year,
        x.month,
        max(calendar.monthcalendar(x.year, x.month)[-1][:5])
    )
)
If you only need the simpler task of formatting the datetime column as a string, you can use the strftime method of the datetime.datetime class, like this:
In [5]: df
Out[5]:
date_time
0 2014-10-17 22:00:03
In [6]: df.date_time
Out[6]:
0 2014-10-17 22:00:03
Name: date_time, dtype: datetime64[ns]
In [7]: df.date_time.map(lambda x: x.strftime('%Y-%m-%d'))
Out[7]:
0 2014-10-17
Name: date_time, dtype: object

If you want the unique month-year pair, using apply is pretty sleek.
df['mnth_yr'] = df['date_column'].apply(lambda x: x.strftime('%B-%Y'))
This outputs month-year in one column.
Don't forget to convert the column to datetime first; I generally forget.
df['date_column'] = pd.to_datetime(df['date_column'])

SINGLE LINE: Adding a column with 'year-month' pairs:
('pd.to_datetime' first converts the column to datetime dtype before the operation)
df['yyyy-mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y-%m')
Accordingly for an extra 'year' or 'month' column:
df['yyyy'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y')
df['mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%m')

Extracting the year, say from ['2018-03-04']:
df['Year'] = pd.DatetimeIndex(df['date']).year
Assigning to df['Year'] creates a new column. If you want to extract the month instead, just use .month

You can first convert your date strings with pandas.to_datetime, which gives you access to all of the numpy datetime and timedelta facilities. For example:
df['ArrivalDate'] = pandas.to_datetime(df['ArrivalDate'])
df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')

@KieranPC's solution is the correct approach for Pandas, but is not easily extensible to arbitrary attributes. For this, you can use getattr within a generator expression and combine using pd.concat:
# input data
list_of_dates = ['2012-12-31', '2012-12-29', '2012-12-30']
df = pd.DataFrame({'ArrivalDate': pd.to_datetime(list_of_dates)})
# define list of attributes required
L = ['year', 'month', 'day', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter']
# define generator expression of series, one for each attribute
date_gen = (getattr(df['ArrivalDate'].dt, i).rename(i) for i in L)
# concatenate results and join to original dataframe
df = df.join(pd.concat(date_gen, axis=1))
print(df)
ArrivalDate year month day dayofweek dayofyear weekofyear quarter
0 2012-12-31 2012 12 31 0 366 1 4
1 2012-12-29 2012 12 29 5 364 52 4
2 2012-12-30 2012 12 30 6 365 52 4
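On recent pandas releases the weekofyear attribute of the .dt accessor is, to my understanding, no longer available (it was deprecated in 1.1 and removed in 2.0). A variant of the same idea that takes the week number from .dt.isocalendar() instead might look like this. It is a sketch, and the names attrs and parts are my own.

```python
import pandas as pd

list_of_dates = ["2012-12-31", "2012-12-29", "2012-12-30"]
df = pd.DataFrame({"ArrivalDate": pd.to_datetime(list_of_dates)})

# Attributes that can still be read straight off the .dt accessor.
attrs = ["year", "month", "day", "dayofweek", "dayofyear", "quarter"]
parts = pd.concat(
    [getattr(df["ArrivalDate"].dt, name).rename(name) for name in attrs],
    axis=1,
)

# The ISO week number comes from isocalendar(), which returns a DataFrame
# with 'year', 'week' and 'day' columns.
parts["weekofyear"] = df["ArrivalDate"].dt.isocalendar().week

df = df.join(parts)
print(df)
```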

Thanks to jaknap32, I wanted to aggregate the results according to Year and Month, so this worked:
df_join['YearMonth'] = df_join['timestamp'].apply(lambda x:x.strftime('%Y%m'))
Output was neat:
0 201108
1 201108
2 201108

There are two steps to extract the year for the whole dataframe without using the apply method.
Step 1
Convert the column to datetime:
df['ArrivalDate']=pd.to_datetime(df['ArrivalDate'], format='%Y-%m-%d')
Step 2
Extract the year or the month using pd.DatetimeIndex():
pd.DatetimeIndex(df['ArrivalDate']).year

df['Month_Year'] = df['Date'].dt.to_period('M')
Result :
Date Month_Year
0 2020-01-01 2020-01
1 2020-01-02 2020-01
2 2020-01-03 2020-01
3 2020-01-04 2020-01
4 2020-01-05 2020-01

df['year_month']=df.datetime_column.apply(lambda x: str(x)[:7])
This worked fine for me. I didn't think pandas would interpret the resulting string as a date, but when I did the plot, the year_month strings were ordered properly... gotta love pandas!

Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I think the proper input here should be a string. Dropping the last three characters ('-31') leaves the year and month:
df['ArrivalDate'].astype(str).apply(lambda x: x[:-3])

Increment attributes of a datetime Series in pandas

I have a Series containing datetime64[ns] elements called series, and would like to increment the months. I thought the following would work fine, but it doesn't:
series.dt.month += 1
The error is
ValueError: modifications to a property of a datetimelike object are not supported. Change values on the original.
Is there a simple way to achieve this without needing to redefine things?

First, I created an example time series:
import datetime
t = [datetime.datetime(2015,4,18,23,33,58),datetime.datetime(2015,4,19,14,32,8),datetime.datetime(2015,4,20,18,42,44),datetime.datetime(2015,4,20,21,41,19)]
import pandas as pd
df = pd.DataFrame(t,columns=['Date'])
Timeseries:
df
Out[]:
Date
0 2015-04-18 23:33:58
1 2015-04-19 14:32:08
2 2015-04-20 18:42:44
3 2015-04-20 21:41:19
Now for the increment part, you can use a DateOffset:
df['Date']+pd.DateOffset(days=30)
Output:
Out[66]:
0 2015-05-18 23:33:58
1 2015-05-19 14:32:08
2 2015-05-20 18:42:44
3 2015-05-20 21:41:19
Name: Date, dtype: datetime64[ns]
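Since the question asks about incrementing the month rather than adding a fixed number of days, a calendar-aware offset may be closer to what is wanted. A minimal sketch, with made-up sample values and the names series and shifted chosen for illustration:

```python
import pandas as pd

series = pd.Series(pd.to_datetime([
    "2015-04-18 23:33:58",
    "2015-01-31 08:00:00",  # chosen to show end-of-month behaviour
]))

# months=1 moves each value to the same day of the next month and clips to
# the last valid day when that day does not exist (Jan 31 -> Feb 28).
shifted = series + pd.DateOffset(months=1)

print(shifted)
```

The result is a new Series; the original is left unchanged, so assign it back if the increment should be kept.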

Extracting hours from a csv with pandas

I have a csv that looks like this
time,result
1308959819,1
1379259923,2
1318632821,3
1375216682,2
1335930758,4
Times are in Unix format. I want to extract the hour from each time and group the file by those values.
I tried
times = pd.to_datetime(df.time, unit='s')
or even
times = pd.DataFrame(pd.to_datetime(df.time, unit='s'))
but in both cases I got an error with
times.hour
>>>AttributeError: 'DataFrame' object has no attribute 'hour'

You're getting that error because Series and DataFrames don't have hour attributes. You can access the information you want using the .dt convenience accessor (docs here):
>>> times = pd.to_datetime(df.time, unit='s')
>>> times
0 2011-06-24 23:56:59
1 2013-09-15 15:45:23
2 2011-10-14 22:53:41
3 2013-07-30 20:38:02
4 2012-05-02 03:52:38
Name: time, dtype: datetime64[ns]
>>> times.dt
<pandas.tseries.common.DatetimeProperties object at 0xb5de94c>
>>> times.dt.hour
0 23
1 15
2 22
3 20
4 3
dtype: int64
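To finish the task from the question, the hour values can be passed straight to groupby. This sketch inlines the csv from the question via io.StringIO; summing the result column is an assumption, since the question does not say which aggregation is wanted.

```python
import io
import pandas as pd

csv_text = """time,result
1308959819,1
1379259923,2
1318632821,3
1375216682,2
1335930758,4
"""

df = pd.read_csv(io.StringIO(csv_text))

# Convert Unix seconds to timestamps (UTC-based), then take the hour.
hours = pd.to_datetime(df["time"], unit="s").dt.hour

# Group the rows by hour; sum() is just one possible aggregation.
by_hour = df.groupby(hours)["result"].sum()

print(by_hour)
```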

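You can use the builtin datetime class to do this, applied to each value in the column:
import datetime
# fromtimestamp takes a single number, so map it over the column;
# note that it converts to local time, not UTC
hours = df.time.map(lambda t: datetime.datetime.fromtimestamp(t).hour)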
