I have a dataframe like this. How can I sort it by date?
df = pd.DataFrame({'Date':['Oct20','Nov19','Jan19','Sep20','Dec20']})
Date
0 Oct20
1 Nov19
2 Jan19
3 Sep20
4 Dec20
I am familiar with sorting a list of date strings:
a.sort(key=lambda date: datetime.strptime(date, "%d-%b-%y"))
Any thoughts? Should I split it?
First convert the column to datetimes, then get the positions of the sorted values with Series.argsort and use them to reorder the rows with DataFrame.iloc:
df = df.iloc[pd.to_datetime(df['Date'], format='%b%y').argsort()]
print (df)
Date
2 Jan19
1 Nov19
3 Sep20
0 Oct20
4 Dec20
Details:
print (pd.to_datetime(df['Date'], format='%b%y'))
0 2020-10-01
1 2019-11-01
2 2019-01-01
3 2020-09-01
4 2020-12-01
Name: Date, dtype: datetime64[ns]
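As an alternative sketch, and assuming pandas 1.1 or later (where sort_values accepts a key argument), the conversion can be passed directly as the sort key, so the original strings are kept and no helper column is needed. The name sorted_df is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['Oct20', 'Nov19', 'Jan19', 'Sep20', 'Dec20']})

# The key function receives the whole column and must return a Series of the same length;
# rows are ordered by the parsed dates while the 'Date' column keeps its strings.
sorted_df = df.sort_values('Date', key=lambda s: pd.to_datetime(s, format='%b%y'))
print(sorted_df)
```

Note that %b depends on the locale, so this assumes English month abbreviations.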
I would like to count the number of instances in the timelog, grouped by month. I have the following Pandas column:
print df['date_unconditional'][:5]
0 2018-10-15T07:00:00
1 2018-06-12T07:00:00
2 2018-08-28T07:00:00
3 2018-08-29T07:00:00
4 2018-10-29T07:00:00
Name: date_unconditional, dtype: object
Then I transformed it to datetime format
df['date_unconditional'] = pd.to_datetime(df['date_unconditional'].dt.strftime('%m/%d/%Y'))
print df['date_unconditional'][:5]
0 2018-10-15
1 2018-06-12
2 2018-08-28
3 2018-08-29
4 2018-10-29
Name: date_unconditional, dtype: datetime64[ns]
And then I tried counting them, but I keep getting an error:
df['date_unconditional'] = pd.to_datetime(df['date_unconditional'], errors='coerce')
print df['date_unconditional'].groupby(pd.Grouper(freq='M')).count()
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Use the key parameter of Grouper:
df['date_unconditional'] = pd.to_datetime(df['date_unconditional'], errors='coerce')
print (df.groupby(pd.Grouper(freq='M',key='date_unconditional'))['date_unconditional'].count())
2018-06-30 1
2018-07-31 0
2018-08-31 2
2018-09-30 0
2018-10-31 2
Freq: M, Name: date_unconditional, dtype: int64
Or create a DatetimeIndex with DataFrame.set_index, after which GroupBy.size can be used. The difference between the two is that count excludes missing values, while size does not.
df['date_unconditional'] = pd.to_datetime(df['date_unconditional'], errors='coerce')
print (df.set_index('date_unconditional').groupby(pd.Grouper(freq='M')).size())
2018-06-30 1
2018-07-31 0
2018-08-31 2
2018-09-30 0
2018-10-31 2
Freq: M, dtype: int64
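The difference between count and size only shows up when the counted column contains missing values. The following is a minimal sketch with a hypothetical value column that is not in the question. It groups by df['date'].dt.to_period('M') instead of pd.Grouper to stay independent of frequency-alias spelling across pandas versions; unlike Grouper, this key does not produce rows for empty months.

```python
import numpy as np
import pandas as pd

# 'date' and 'value' are illustrative column names; two values are deliberately missing.
df = pd.DataFrame({
    'date': pd.to_datetime(['2018-06-12', '2018-08-28', '2018-08-29',
                            '2018-10-15', '2018-10-29']),
    'value': [1.0, np.nan, 3.0, 4.0, np.nan],
})

month = df['date'].dt.to_period('M')          # monthly grouping key
counts = df.groupby(month)['value'].count()   # non-missing values per month
sizes = df.groupby(month)['value'].size()     # all rows per month, NaN included
print(counts)
print(sizes)
```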
I have a column in my dataframe which I want to convert to a Timestamp. However, it is in a bit of a strange format that I am struggling to manipulate. The column is in the format HHMMSS, but does not include the leading zeros.
For example for a time that should be '00:03:15' the dataframe has '315'. I want to convert the latter to a Timestamp similar to the former. Here is an illustration of the column:
message_time
25
35
114
1421
...
235347
235959
Thanks
Use Series.str.zfill to add the leading zeros and then to_datetime:
s = df['message_time'].astype(str).str.zfill(6)
df['message_time'] = pd.to_datetime(s, format='%H%M%S')
print (df)
message_time
0 1900-01-01 00:00:25
1 1900-01-01 00:00:35
2 1900-01-01 00:01:14
3 1900-01-01 00:14:21
4 1900-01-01 23:53:47
5 1900-01-01 23:59:59
In my opinion it is better here to create timedeltas with to_timedelta:
s = df['message_time'].astype(str).str.zfill(6)
df['message_time'] = pd.to_timedelta(s.str[:2] + ':' + s.str[2:4] + ':' + s.str[4:])
print (df)
message_time
0 00:00:25
1 00:00:35
2 00:01:14
3 00:14:21
4 23:53:47
5 23:59:59
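If the column holds integers rather than strings (an assumption, since the question does not state the dtype), the same timedeltas can be sketched with plain arithmetic instead of string padding: split the number into hours, minutes and seconds and convert the total number of seconds.

```python
import pandas as pd

# Assumes message_time is an integer column encoding HHMMSS without leading zeros.
df = pd.DataFrame({'message_time': [25, 35, 114, 1421, 235347, 235959]})

t = df['message_time']
hours = t // 10000
minutes = t // 100 % 100
seconds = t % 100

# Convert the total number of seconds into timedeltas.
df['message_time'] = pd.to_timedelta(hours * 3600 + minutes * 60 + seconds, unit='s')
print(df)
```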
I would like to count how many unique weekdays exist in a timestamp column. Here's the input; I want the output to be 4 (since 8/5 and 8/6 fall on a weekend).
captureTime
0 8/1/2017 0:05
1 8/2/2017 0:05
2 8/3/2017 0:05
3 8/4/2017 0:05
4 8/5/2017 0:05
5 8/6/2017 0:05
Using np.is_busday:
import numpy as np
import pandas as pd
df = pd.DataFrame( {
'captureTime':[ '8/1/2017 0:05', '8/2/2017 0:05', '8/3/2017 0:05',
'8/4/2017 0:05', '8/5/2017 0:05', '8/6/2017 0:05']})
df['captureTime'] = pd.to_datetime(df['captureTime'])
print(np.is_busday(df['captureTime'].values.astype('datetime64[D]')).sum())
prints
4
Above, every row that falls on a business day is counted, so repeated datetimes are counted more than once.
If you wish to count identical datetimes only once, you could use
np.is_busday(df['captureTime'].unique().astype('datetime64[D]')).sum()
Or, if you wish to remove datetimes that have identical date components, convert to datetime64[D] dtype before calling np.unique:
np.is_busday(np.unique(df['captureTime'].values.astype('datetime64[D]'))).sum()
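The same de-duplication by calendar date can be sketched in pandas alone. This is only an illustration: the extra 8/1/2017 9:30 row is added here to show a repeated date, and the explicit format string is an assumption about how the input is laid out.

```python
import pandas as pd

df = pd.DataFrame({'captureTime': [
    '8/1/2017 0:05', '8/1/2017 9:30',   # second entry repeats the date on purpose
    '8/2/2017 0:05', '8/3/2017 0:05', '8/4/2017 0:05',
    '8/5/2017 0:05', '8/6/2017 0:05']})

ts = pd.to_datetime(df['captureTime'], format='%m/%d/%Y %H:%M')

# Drop the time of day, keep each calendar date once, then count Monday-Friday (0-4).
unique_days = ts.dt.normalize().drop_duplicates()
n_weekdays = int((unique_days.dt.dayofweek < 5).sum())
print(n_weekdays)
```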
One way is to use pandas Series.dt.weekday:
df['captureTime'] = pd.to_datetime(df['captureTime'])
np.sum(df['captureTime'].dt.weekday.isin([0,1,2,3,4]))
It returns 4
You can use boolean indexing if you need the matching rows themselves:
df[df['captureTime'].dt.weekday.isin([0,1,2,3,4])]
captureTime
0 2017-08-01 00:05:00
1 2017-08-02 00:05:00
2 2017-08-03 00:05:00
3 2017-08-04 00:05:00
Convert to date time using pd.to_datetime, get the unique dayofweek list, and count all those under 5.
out = (df.captureTime.apply(pd.to_datetime).dt.dayofweek.unique() < 5).sum()
print(out)
4
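unique removes duplicates, leaving an array of distinct day-of-week values, from which the values under 5 are counted (0 to 4 are the weekdays). Note that this counts distinct days of the week, so two different Mondays are counted once.
Output of .dt.dayofweek: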
out = df.captureTime.apply(pd.to_datetime).dt.dayofweek
print(out)
0 1
1 2
2 3
3 4
4 5
5 6
Name: captureTime, dtype: int64
Assuming captureTime is already a datetime column, you can do this:
s = df['captureTime'].dt.weekday
s[s < 5].count() # 0-4 correspond to Monday-Friday; 5, 6 are Saturday, Sunday
I have stock data downloaded from Yahoo Finance. I want to pick up the rows corresponding to the start and end of each month. I am trying to do this with a pandas DataFrame, but I cannot find the correct method to get the start and end of the month. I would be grateful if somebody could help me solve this.
Please note that if the 1st of the month is a holiday and there is no data for it, I need to pick up the 2nd day's data. The same rule applies to the last day of the month. Thanks in advance.
Example data is
2016-01-05,222.80,222.80,217.00,217.75,15074800,217.75
2016-01-04,226.95,226.95,220.05,220.70,14092000,220.70
2015-12-31,225.95,226.55,224.00,224.45,11558300,224.45
2015-12-30,229.00,229.70,224.85,225.80,11702800,225.80
2015-12-29,228.85,229.95,227.50,228.20,7263200,228.20
2015-12-28,229.05,229.95,228.00,228.90,8756800,228.90
........
........
2015-12-04,240.00,242.15,238.05,241.10,11115100,241.10
2015-12-03,244.15,244.50,240.40,241.10,7155600,241.10
2015-12-02,250.55,250.65,243.75,244.60,10881700,244.60
2015-11-30,249.65,253.00,245.00,250.20,12865400,250.20
2015-11-27,243.00,250.50,242.80,249.70,15149900,249.70
2015-11-26,241.95,244.90,241.00,242.50,13629800,242.50
First, convert your date column to datetime format, then group by month, sort each group by date, and take the first/last row of each group with the head/tail methods, like so:
In [37]: df
Out[37]:
0 1 2 3 4 5 6
0 2016-01-05 222.80 222.80 217.00 217.75 15074800 217.75
1 2016-01-04 226.95 226.95 220.05 220.70 14092000 220.70
2 2015-12-31 225.95 226.55 224.00 224.45 11558300 224.45
3 2015-12-30 229.00 229.70 224.85 225.80 11702800 225.80
4 2015-12-29 228.85 229.95 227.50 228.20 7263200 228.20
5 2015-12-28 229.05 229.95 228.00 228.90 8756800 228.90
In [25]: import datetime
In [29]: df[0] = df[0].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
In [36]: df.groupby(df[0].apply(lambda x: x.month)).apply(lambda x: x.sort_values(0).head(1))
Out[36]:
0 1 2 3 4 5 6
0
1 1 2016-01-04 226.95 226.95 220.05 220.7 14092000 220.7
12 5 2015-12-28 229.05 229.95 228.00 228.9 8756800 228.9
In [38]: df.groupby(df[0].apply(lambda x: x.month)).apply(lambda x: x.sort_values(0).tail(1))
Out[38]:
0 1 2 3 4 5 6
0
1 0 2016-01-05 222.80 222.80 217.0 217.75 15074800 217.75
12 2 2015-12-31 225.95 226.55 224.0 224.45 11558300 224.45
You can merge the result dataframes, using pd.concat()
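A minimal sketch of the combined result, under a few assumptions: the column names Date and Close are illustrative (the sample data has no header), and the grouping key is a year-month period rather than the month number, so that the same month in different years is not merged into one group.

```python
import pandas as pd

# Rows taken from the sample data above; only the date and one price column are kept.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2016-01-05', '2016-01-04', '2015-12-31',
                            '2015-12-30', '2015-12-29', '2015-12-28']),
    'Close': [217.75, 220.70, 224.45, 225.80, 228.20, 228.90],
})

df = df.sort_values('Date')
month = df['Date'].dt.to_period('M')      # year-month key, e.g. 2015-12

first_rows = df.groupby(month).head(1)    # earliest available row per month
last_rows = df.groupby(month).tail(1)     # latest available row per month
result = pd.concat([first_rows, last_rows]).sort_values('Date')
print(result)
```

Because head and tail pick from the rows that actually exist, a missing 1st or last day of the month falls back to the nearest available trading day.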
For the first / last day of each month, you can use .resample() with 'BMS' (business month start) and 'BM' (business month end) like so (using pandas 0.18 syntax; newer pandas releases spell the latter alias 'BME'):
df.resample('BMS').first()
df.resample('BM').last()
This assumes that your data has a DatetimeIndex, as is usual when it is downloaded from Yahoo using pandas_datareader:
from datetime import datetime
from pandas_datareader.data import DataReader
df = DataReader('FB', 'yahoo', datetime(2015, 1, 1), datetime(2015, 3, 31))['Open']
df.head()
Date
2015-01-02 78.580002
2015-01-05 77.980003
2015-01-06 77.230003
2015-01-07 76.760002
2015-01-08 76.739998
Name: Open, dtype: float64
df.tail()
Date
2015-03-25 85.500000
2015-03-26 82.720001
2015-03-27 83.379997
2015-03-30 83.809998
2015-03-31 82.900002
Name: Open, dtype: float64
do:
df.resample('BMS').first()
Date
2015-01-01 78.580002
2015-02-02 76.110001
2015-03-02 79.000000
Freq: BMS, Name: Open, dtype: float64
and
df.resample('BM').last()
to get:
Date
2015-01-30 78.000000
2015-02-27 80.680000
2015-03-31 82.900002
Freq: BM, Name: Open, dtype: float64
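Note that resample labels each bin with the nominal business month start or end (2015-01-01 above, although the first row is 2015-01-02). If the actual dates of the first and last available rows are needed, one possible sketch is to group the index by month instead. The series below is synthetic stand-in data on a business-day index, not real prices.

```python
import pandas as pd

# Synthetic stand-in for a price series: business days only, first row on 2015-01-02.
idx = pd.bdate_range('2015-01-02', '2015-03-31')
s = pd.Series(range(len(idx)), index=idx, dtype='float64', name='Open')

dates = s.index.to_series()
month = s.index.to_period('M')

first_dates = dates.groupby(month).min()   # first date present in each month
last_dates = dates.groupby(month).max()    # last date present in each month

first_rows = s.loc[first_dates.to_numpy()]
last_rows = s.loc[last_dates.to_numpy()]
print(first_rows)
print(last_rows)
```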
Assuming you have downloaded data from Yahoo:
import pandas_datareader.data as web  # replaces the removed pandas.io.data module
import datetime
start = datetime.datetime(2016,1,1)
end = datetime.datetime(2016,5,1)
df = web.DataReader("AAPL", "yahoo", start, end)
You simply pick the month end and start rows with:
df[df.index.is_month_end]
df[df.index.is_month_start]
If you want to access a specific row, such as the first of the selected month-start rows, use positional indexing (.ix has been removed from pandas, so .iloc is used here):
df[df.index.is_month_start].iloc[0]