How to exclude non-working days in df.resample? - python

I have a df like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100, 2),columns=['A', 'B'])
df['Date'] = pd.date_range("1/1/2000", periods=100)
df.set_index('Date', inplace=True)
I want to resample it by week and get the last value, but when I use this statement it returns results whose dates still span all seven days of the week:
>>> df.resample('W').last()
A B
Date
2000-01-02 -0.233055 0.215712
2000-01-09 -0.031690 -1.194929
2000-01-16 -1.441544 -0.206924
2000-01-23 -0.225403 -0.058323
2000-01-30 -1.564966 -1.409034
2000-02-06 0.800451 -0.730578
2000-02-13 -0.265631 -0.161049
2000-02-20 0.252658 -0.458502
2000-02-27 1.982499 3.208221
2000-03-05 -0.391827 0.927733
2000-03-12 -0.723863 -0.076955
2000-03-19 -1.379905 0.259892
2000-03-26 -0.983180 1.734662
2000-04-02 0.139668 -0.834987
2000-04-09 0.854117 -0.421875
And I only want the results for 5 days a week (not including Saturday and Sunday); that is, the interval between returned dates should be 5 days, not 7. Can pandas do this, or do I have to resort to some third-party calendar library?

To select only the weekdays you can use:
df = df[df.index.weekday.isin(list(range(5)))]
This will give you your DataFrame restricted to Monday through Friday. The resampling step afterwards stays the same.
Comment
Calling resample('W') will re-create the missing dates in the index. I believe you want to do something else.
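If the weekly labels should also land on business days, one option (a sketch reusing the question's setup) is to drop the weekends first and anchor the weekly bins to Friday:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'])
df['Date'] = pd.date_range("1/1/2000", periods=100)
df.set_index('Date', inplace=True)
# keep Monday-Friday only, then take each week's last value;
# the 'W-FRI' anchor makes the bin labels fall on Fridays
weekdays = df[df.index.weekday < 5]
weekly_last = weekdays.resample('W-FRI').last()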

Related

Pandas: populating column values with date and time string based on conditions

I have a Pandas dataframe df that looks as follows:
created_time action_time
2021-03-05T07:18:12.281-0600 2021-03-05T08:32:19.153-0600
2021-03-04T15:34:23.373-0600 2021-03-04T15:37:32.360-0600
2021-03-01T04:57:47.848-0600 2021-03-01T08:37:39.083-0600
import pandas as pd
df = pd.DataFrame({'created_time': ['2021-03-05T07:18:12.281-0600', '2021-03-04T15:34:23.373-0600', '2021-03-01T04:57:47.848-0600'],
                   'action_time': ['2021-03-05T08:32:19.153-0600', '2021-03-04T15:37:32.360-0600', '2021-03-01T08:37:39.083-0600']})
I then create another column which represents the difference in minutes between these two columns:
df['elapsed_time'] = (pd.to_datetime(df['action_time']) - pd.to_datetime(df['created_time'])).dt.total_seconds() / 60
df['elapsed_time']
elapsed_time
74.114533
3.149783
219.853917
We assume that "action" can only take place during business hours (which we assume start at 8:30am).
I would like to create another column named created_time_adjusted, which adjusts created_time to 08:30am if created_time is before 08:30am.
I can parse out the date and time string that I need, as follows:
df['elapsed_time'] = pd.to_datetime(df['created_time']).dt.date.astype(str) + 'T08:30:00.000-0600'
But, this doesn't deal with the conditional.
I'm aware of a few ways that I might be able to do this:
replace
clip
np.where
loc
What is the best (and least hacky) way to accomplish this?
Thanks!
First of all, I think your life would be easier if you convert the columns to datetime dtypes from the get-go. Then it's just a matter of running an apply op on the 'created_time' column.
df.created_time = pd.to_datetime(df.created_time)
df.action_time = pd.to_datetime(df.action_time)
df.elapsed_time = df.action_time - df.created_time
time_threshold = pd.to_datetime('08:30').time()
df['created_time_adjusted'] = df.created_time.apply(
    lambda x: x.replace(hour=8, minute=30, second=0)
              if x.time() < time_threshold else x)
Output:
>>> df
created_time action_time created_time_adjusted
0 2021-03-05 07:18:12.281000-06:00 2021-03-05 08:32:19.153000-06:00 2021-03-05 08:30:00.281000-06:00
1 2021-03-04 15:34:23.373000-06:00 2021-03-04 15:37:32.360000-06:00 2021-03-04 15:34:23.373000-06:00
2 2021-03-01 04:57:47.848000-06:00 2021-03-01 08:37:39.083000-06:00 2021-03-01 08:30:00.848000-06:00
from datetime import timedelta

df['created_time'] = pd.to_datetime(df['created_time'])  # coerce to datetime
# isolate the rows earlier than 08:30 into df1
df1 = df.set_index(df['created_time']).between_time('00:00:00', '08:30:00', include_end=False)
# adjust the time to 08:30
df1['created_time'] = df1['created_time'].dt.normalize() + timedelta(hours=8, minutes=30)
# knit the before- and after-08:30 rows back together
df2 = df1.append(df.set_index(df['created_time']).between_time('08:30:00', '00:00:00', include_end=False)).reset_index(drop=True)
df2
(Note: in pandas 2.0+, between_time's include_end argument was replaced by inclusive, and DataFrame.append by pd.concat.)
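For completeness, a vectorized sketch using clip, one of the options the question lists (the floor series here is each row's own date at 08:30):
import pandas as pd
df = pd.DataFrame({'created_time': ['2021-03-05T07:18:12.281-0600',
                                    '2021-03-04T15:34:23.373-0600',
                                    '2021-03-01T04:57:47.848-0600']})
created = pd.to_datetime(df['created_time'])
# floor each timestamp at that day's 08:30
floor_830 = created.dt.normalize() + pd.Timedelta(hours=8, minutes=30)
df['created_time_adjusted'] = created.clip(lower=floor_830)
Unlike the replace approach above, this sets adjusted times to exactly 08:30:00.000 instead of preserving the sub-second part.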

How to create visualization from time series data in a .txt file in python

I have a .txt file with three columns: time, ticker, price. The times are spaced at 15-second intervals. This is what it looks like after loading it into a Jupyter notebook and putting it into a Pandas DataFrame:
time ticker price
0 09:30:35 EV 33.860
1 00:00:00 AMG 60.430
2 09:30:35 AMG 60.750
3 00:00:00 BLK 455.350
4 09:30:35 BLK 451.514
... ... ... ...
502596 13:00:55 TLT 166.450
502597 13:00:55 VXX 47.150
502598 13:00:55 TSLA 529.800
502599 13:00:55 BIDU 103.500
502600 13:00:55 ON 12.700
# NOTE: the first set of rows holds the data at market open;
# that's what the 00:00:00 timestamps are. They appear only
# alongside the 09:30:35 data.
I need to create a function that takes an input (a ticker) and then creates a bar chart that displays the data with 5-minute ticks (the data is every 20 seconds, so that's every 15 points in time).
So far I've thought about separating the "mm" part of the hh:mm:ss to get just the minutes in another column, and then writing a for loop that looks something like this:
for num in df['mm']:
    if num % 5 == 0:
        print('tick')
then somehow appending the "tick" to the "time" column for every 5 minutes of data (I'm not sure how I would do this), then using the time column as the index and only using data with the "tick" index in it (some kind of if statement). I'm not sure if this makes sense but I'm drawing a blank on this.
You should have a look at the built-in functions in pandas. In the following example I'm using a date + time format, but it shouldn't be hard to convert one to the other.
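For instance, a sketch of that conversion, using a few illustrative rows shaped like the question's data:
import pandas as pd
# parse hh:mm:ss strings into datetimes; pandas fills in a default
# date, which is harmless for intraday resampling
df = pd.DataFrame({"time": ["09:30:35", "09:30:55", "09:31:15"],
                   "ticker": ["EV", "EV", "EV"],
                   "price": [33.86, 33.90, 33.88]})
df["date"] = pd.to_datetime(df["time"], format="%H:%M:%S")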
Generate data
%matplotlib inline
import pandas as pd
import numpy as np
dates = pd.date_range(start="2020-04-01", periods=150, freq="20S")
df1 = pd.DataFrame({"date":dates,
"price":np.random.rand(len(dates))})
df2 = df1.copy()
df1["ticker"] = "a"
df2["ticker"] = "b"
df = pd.concat([df1,df2], ignore_index=True)
df = df.sample(frac=1).reset_index(drop=True)
Resample Timeseries every 5 minutes
Here you can try to see the output of
df1.set_index("date")\
.resample("5T")\
.first()\
.reset_index()
where we keep just the first element in each 5-minute bin (00:05, 00:10, and so on). To do the same for every ticker, we need a groupby:
out = df.groupby("ticker")\
.apply(lambda x: x.set_index("date")\
.resample("5T")\
.first()\
.reset_index())\
.reset_index(drop=True)
Plot function
def plot_tick(data, ticker):
    ts = data[data["ticker"] == ticker].reset_index(drop=True)
    ts.plot(x="date", y="price", kind="bar", title=ticker);
plot_tick(out, "a")
Then you can improve the plot or, eventually, try to use plotly.

Delete word from column in python

I have a CSV file which has four columns, like this:
Freq ID Date Name
0 2053 1998 apple
2 2054 1998 May-June. orange
3 2055 2019 apple
5 2056 1999 Oct-Nov orange
It is a large file, and I have to remove "May-June" from the Date column; wherever a year appears together with a month, I have to keep only the year.
How can I remove it with Python?
You can use pandas to read the file and extract the year from the Date column. Use the split() function to split on a space; the first item will be your year, like this:
import pandas as pd
df = pd.read_csv(filename)
df['Date'] = df["Date"].str.split(" ").str.get(0)
print(df)
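If some values might lack a space, an equivalent sketch with a regular expression extracts the four-digit year directly (data taken from the sample above):
import pandas as pd
df = pd.DataFrame({"Date": ["1998", "1998 May-June.", "2019", "1999 Oct-Nov"]})
# pull out the first run of four digits as the year
df["Date"] = df["Date"].str.extract(r"(\d{4})", expand=False)
print(df)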
Hi, I have recently come across a similar kind of problem. I resolved it with the snippet below; you can probably use it, and I think it is a fairly optimized solution.
import pandas as pd
import csv
from datetime import datetime

# keep only the first four characters (the year) and parse them
to_datetime = lambda d: datetime.strptime(d[:4], '%Y')

path = r"D:\python_poc"
filename = r"\Input.csv"
# the converter parses the Date column while reading, so a separate
# parse_dates pass is not needed
df = pd.read_csv(path + filename, converters={'Date': to_datetime})
df.to_csv(path + filename, index=False, quoting=csv.QUOTE_ALL)

How to automatically add data for the missing days in historical stock prices?

I would like to write a Python script that will check if there is any missing day. If there is, it should take the price from the previous available day and create a new entry for the missing day, as shown below. My data is in CSV files. Any ideas how it can be done?
Before:
MSFT,5-Jun-07,259.16
MSFT,3-Jun-07,253.28
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,29-May-07,243.31
After:
MSFT,5-Jun-07,259.16
MSFT,4-Jun-07,253.28
MSFT,3-Jun-07,253.28
MSFT,2-Jun-07,249.95
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,30-May-07,243.31
MSFT,29-May-07,243.31
My solution:
import pandas as pd
df = pd.read_csv("path/to/file/file.csv", names=list("abc"))  # read headerless CSV; name cols a, b, c
cols = df.columns                                      # store column order
df.b = pd.to_datetime(df.b)                            # convert col Date to datetime
df.set_index("b", inplace=True)                        # set col Date as index
df = df.resample("D").ffill().reset_index()            # resample days and fill values
df = df[cols]                                          # revert order
df.sort_values(by="b", ascending=False, inplace=True)  # sort by date
df["b"] = df["b"].dt.strftime("%-d-%b-%y")             # revert date format (%-d is POSIX-only; use %#d on Windows)
df.to_csv("data.csv", index=False, header=False)       # specify output file if needed
print(df.to_string())
Using the pandas library, this operation takes essentially a single line. But first we need to read in your data in the right format:
import io
import pandas as pd
s = u"""name,Date,Close
MSFT,30-Dec-16,771.82
MSFT,29-Dec-16,782.79
MSFT,28-Dec-16,785.05
MSFT,27-Dec-16,791.55
MSFT,23-Dec-16,789.91
MSFT,16-Dec-16,790.8
MSFT,15-Dec-16,797.85
MSFT,14-Dec-16,797.07"""
#df = pd.read_csv("path/to/file.csv") # read from file
df = pd.read_csv(io.StringIO(s)) # read string as file
cols = df.columns # store column order
df.Date = pd.to_datetime(df.Date) # convert col Date to datetime
df.set_index("Date",inplace=True) # set col Date as index
df = df.resample("D").ffill().reset_index() # resample Days and fill values
df
Returns:
Date name Close
0 2016-12-14 MSFT 797.07
1 2016-12-15 MSFT 797.85
2 2016-12-16 MSFT 790.80
3 2016-12-17 MSFT 790.80
4 2016-12-18 MSFT 790.80
5 2016-12-19 MSFT 790.80
6 2016-12-20 MSFT 790.80
7 2016-12-21 MSFT 790.80
8 2016-12-22 MSFT 790.80
9 2016-12-23 MSFT 789.91
10 2016-12-24 MSFT 789.91
11 2016-12-25 MSFT 789.91
12 2016-12-26 MSFT 789.91
13 2016-12-27 MSFT 791.55
14 2016-12-28 MSFT 785.05
15 2016-12-29 MSFT 782.79
16 2016-12-30 MSFT 771.82
Convert back to CSV with:
df = df[cols] # revert order
df.sort_values(by="Date",ascending=False,inplace=True) # sort by date
df["Date"] = df["Date"].dt.strftime("%-d-%b-%y") # revert date format
df.to_csv(index=False,header=False) #specify outputfile if needed
Output:
MSFT,30-Dec-16,771.82
MSFT,29-Dec-16,782.79
MSFT,28-Dec-16,785.05
MSFT,27-Dec-16,791.55
MSFT,26-Dec-16,789.91
MSFT,25-Dec-16,789.91
MSFT,24-Dec-16,789.91
MSFT,23-Dec-16,789.91
...
To do this, you would need to iterate through your dataframe using nested for loops. That would look something like:
for column in df:
    for row in df:
        do_something()
To give you an idea, the do_something() part of your code would probably be something like checking whether there was a gap between the dates. Then you would copy the other columns from the row above and insert a new row using:
df.loc[row] = [2, 3, 4]  # adding a row
df.index = df.index + 1  # shifting index
df = df.sort_index()     # sorting by index (df.sort() was removed in later pandas)
Hope this helped give you an idea of how you would solve this. Let me know if you want some more code!
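For reference, a concrete sketch of that gap-check idea (column names are assumed for illustration; rows are sorted newest-first, as in the question):
import pandas as pd
df = pd.DataFrame({"name": ["MSFT", "MSFT", "MSFT"],
                   "date": pd.to_datetime(["2007-06-05", "2007-06-03", "2007-06-01"]),
                   "close": [259.16, 253.28, 249.95]})
rows = []
for i in range(len(df) - 1):
    newer, older = df.iloc[i], df.iloc[i + 1]
    rows.append(newer)
    # one filler row per missing calendar day, copying the older price
    for back in range(1, (newer["date"] - older["date"]).days):
        filler = newer.copy()
        filler["date"] = newer["date"] - pd.Timedelta(days=back)
        filler["close"] = older["close"]
        rows.append(filler)
rows.append(df.iloc[-1])
filled = pd.DataFrame(rows).reset_index(drop=True)
print(filled)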
This code uses standard routines.
from datetime import datetime, timedelta
Input lines will have to be split on commas, and the dates parsed in two places in the main part of the code. I have therefore put this work in a single function.
def glean(s):
    msft, date_part, amount = s.split(',')
    if date_part.find('-') == 1:
        date_part = '0' + date_part
    date = datetime.strptime(date_part, '%d-%b-%y')
    return date, amount
Similarly, dates will have to be formatted for output with other pieces of data in a number of places in the main code.
def out(date, amount):
    date_str = date.strftime('%d-%b-%y')
    print(('%s,%s,%s' % ('MSFT', date_str, amount)).replace('MSFT,0', 'MSFT,'))
with open('before.txt') as before:
I read the initial line of data on its own to establish the first date for comparison with the date in the next line.
    previous_date, previous_amount = glean(before.readline().strip())
    out(previous_date, previous_amount)
    for line in before.readlines():
        date, amount = glean(line.strip())
I calculate the elapsed time between the current line and the previous line, to know how many lines to output in place of missing lines.
        elapsed = previous_date - date
setting_date is decremented from previous_date for the number of days that elapsed without data. One line is output for each missing day, if there were any.
        setting_date = previous_date
        for i in range(-1 + elapsed.days):
            setting_date -= timedelta(days=1)
            out(setting_date, previous_amount)
Now the available line of data is output.
        out(date, amount)
Now previous_date and previous_amount are reset to reflect the new values, for use against the next line of data, if any.
        previous_date, previous_amount = date, amount
Output:
MSFT,5-Jun-07,259.16
MSFT,4-Jun-07,259.16
MSFT,3-Jun-07,253.28
MSFT,2-Jun-07,253.28
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,30-May-07,248.71
MSFT,29-May-07,243.31

The frequency of DatetimeIndex is missing after groupby operation

Code
import pandas as pd
import numpy as np

dates = pd.date_range('20140301', periods=6)
id_col = np.array([[0, 1, 2, 0, 1, 2]])
data_col = np.random.randn(6, 4)
data = np.concatenate((id_col.T, data_col), axis=1)
df = pd.DataFrame(data, index=dates, columns=list('IABCD'))
print df

print "before groupby:"
for index in df.index:
    if not index.freq:
        print "key:%f, no freq:%s" % (key, index)

print "after groupby:"
gb = df.groupby('I')
for key, group in gb:
    #group = group.resample('1D', how='first')
    for index in group.index:
        if not index.freq:
            print "key:%f, no freq:%s" % (key, index)
The output:
I A B C D
2014-03-01 0 0.129348 1.466361 -0.372673 0.045254
2014-03-02 1 0.395884 1.001859 -0.892950 0.480944
2014-03-03 2 -0.226405 0.663029 0.355675 -0.274865
2014-03-04 0 0.634661 0.535560 1.027162 1.637099
2014-03-05 1 -0.453149 -0.479408 -1.329372 -0.574017
2014-03-06 2 0.603972 0.754232 0.692185 -1.267217
[6 rows x 5 columns]
before groupby:
after groupby:
key:0.000000, no freq:2014-03-01 00:00:00
key:0.000000, no freq:2014-03-04 00:00:00
key:1.000000, no freq:2014-03-02 00:00:00
key:1.000000, no freq:2014-03-05 00:00:00
key:2.000000, no freq:2014-03-03 00:00:00
key:2.000000, no freq:2014-03-06 00:00:00
But after I uncomment the statement:
#group = group.resample('1D', how='first')
It seems there is no problem then. The thing is, when I run this on a large dataset with some operations on the timestamps, I always get the error "cannot add integral value to timestamp without offset". Is it a bug, or did I miss something?
You are treating a groupby object as a DataFrame.
It is like a dataframe, but requires apply to generate a new structure (either reduced or an actual DataFrame).
The idiom is:
df.groupby(....).apply(some_function)
Doing something like: df.groupby(...).sum() is syntactic sugar for using apply. Functions which are naturally applicable to using this kind of sugar are enabled; otherwise they will raise an error.
In particular, you are accessing group.index, which can be, but is not guaranteed to be, a DatetimeIndex (when time grouping). The freq attribute of a DatetimeIndex is inferred when required (via inferred_freq).
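A tiny sketch of that distinction (index values are illustrative):
import pandas as pd
# a DatetimeIndex built from raw dates carries no freq...
idx = pd.DatetimeIndex(['2014-03-01', '2014-03-04', '2014-03-07'])
print(idx.freq)           # None: freq is not set
# ...but one can still be inferred on demand from the spacing
print(idx.inferred_freq)  # '3D'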
Your code is confusing: you are grouping, then resampling; resample does this for you, so you don't need the former step at all.
resample is the de facto equivalent of a groupby-apply (but has special handling for the time domain).
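In later pandas versions (where resample's how= argument was removed in favor of method calls), the combined idiom might look like this sketch:
import pandas as pd
import numpy as np
dates = pd.date_range('20140301', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df['I'] = [0, 1, 2, 0, 1, 2]
# group by id, then resample each group to daily frequency in one step
out = df.groupby('I').resample('1D').first()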
