Python for loop returns first value

Background: the code is supposed to reference a ticker symbol and the time of a trade, and then pull in the subsequent closing prices after the time of the trade (next_price_1m and next_price_2m).
Problem: when a ticker repeats, next_price_1m and next_price_2m repeat the values from the prior call for that ticker, even though the time of the trade has changed. My initial call to get_barset is working, but only for the first instance of the ticker.
Example output:
| symbol | transaction_time | next price 1m | next price 2m |
|--------|------------------|---------------|---------------|
| JPM    | 10:00 a.m.       | $90           | $91           |
| SPY    | 10:25 a.m.       | $260          | $261          |
| JPM    | 11:37 a.m.       | $90           | $91           |
| AAPL   | 2:25 p.m.        | $330          | $335          |
| JPM    | 3:02 p.m.        | $90           | $91           |
JPM should have different next_price_1m and next_price_2m on 2nd and 3rd calls.
Code:
trades_list = api.get_activities(date='2020-04-06')
data = []
for trade in trades_list:
    my_list_of_trade_order_ids = trade.order_id
    price = trade.price
    qty = trade.qty
    side = trade.side
    symbol = trade.symbol
    transaction_time = trade.transaction_time
    client_order_id = api.get_order(trade.order_id).client_order_id
    barset = api.get_barset(timeframe='minute', symbols=trade.symbol, limit=15, after=trade.transaction_time)
    df_bars = pd.DataFrame(barset)
    next_price_1m = df_bars.iat[0, 0].c
    next_price_2m = df_bars.iat[1, 0].c
    data.append({'price': price, 'qty': qty, 'side': side, 'symbol': symbol,
                 'transaction_time': transaction_time, 'client_order_id': client_order_id,
                 'next price 1m': next_price_1m, 'next price 2m': next_price_2m})
df = pd.DataFrame(data)
df

Thanks for the responses.
The issue was that the 'after' parameter expected a timestamp in a different format than the one provided by trade.transaction_time.
Rather than raising an error, the API ignored the time parameter and just returned the latest data.
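For reference, a minimal sketch of that kind of fix, assuming the endpoint accepts an ISO-8601 string for the `after` parameter (the exact format the API expects may differ; this is illustrative only):
import pandas as pd

# Sketch: normalize the activity timestamp before passing it as `after`.
# pd.Timestamp handles most timestamp strings/objects; .isoformat() is an
# assumption about what the endpoint wants - adjust if the API docs say otherwise.
after_ts = pd.Timestamp(trade.transaction_time).isoformat()
barset = api.get_barset(
    timeframe='minute',
    symbols=trade.symbol,
    limit=15,
    after=after_ts,
)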

Related

Very simple pandas column/row transform that I cannot figure out

I need to do a simple calculation on values in a dataframe, but I need some columns transposed first. Once they are transposed, I want to take the most recent amount / 2nd most recent amount, and then a binary result indicating whether that ratio is less than or equal to .5.
By most recent I mean closest to the date in the Date 2 column.
Have This
| Name | Amount | Date 1     | Date 2     |
|------|--------|------------|------------|
| Jim  | 100    | 2021-06-10 | 2021-06-15 |
| Jim  | 200    | 2021-05-11 | 2021-06-15 |
| Jim  | 150    | 2021-03-05 | 2021-06-15 |
| Bob  | 350    | 2022-06-10 | 2022-08-30 |
| Bob  | 300    | 2022-08-12 | 2022-08-30 |
| Bob  | 400    | 2021-07-06 | 2022-08-30 |
I Want this
| Name | Amount | Date 2     | Most Recent Amount (MRA) | 2nd Most Recent Amount (2MRA) | MRA / 2MRA | Less than or equal to .5 |
|------|--------|------------|--------------------------|-------------------------------|------------|--------------------------|
| Jim  | 100    | 2021-06-15 | 100                      | 200                           | .5         | 1                        |
| Bob  | 300    | 2022-08-30 | 300                      | 350                           | .85        | 0                        |
This is the original dataframe.
df = pd.DataFrame({'Name': ['Jim', 'Jim', 'Jim', 'Bob', 'Bob', 'Bob'],
                   'Amount': [100, 200, 150, 350, 300, 400],
                   'Date 1': ['2021-06-10', '2021-05-11', '2021-03-05', '2022-06-10', '2022-08-12', '2021-07-06'],
                   'Date 2': ['2021-06-15', '2021-06-15', '2021-06-15', '2022-08-30', '2022-08-30', '2022-08-30']})
And this is the result:
# take the groupby of the 'Name' column after sorting by 'Date 1' descending
g = df.sort_values('Date 1', ascending=False).groupby(['Name'])
# then we use the agg function to get the first of the 'Date 2' and 'Amount' columns
# and rename the result of the 'Amount' column to 'MRA'
first = g.agg({'Date 2': 'first', 'Amount': 'first'}).rename(columns={'Amount': 'MRA'}).reset_index()
# similarly, we take the second values by applying a lambda function
second = g.agg({'Date 2': 'first', 'Amount': lambda t: t.iloc[1]}).rename(columns={'Amount': '2MRA'}).reset_index()
df_T = pd.merge(first, second, on=['Name', 'Date 2'], how='left')
# then we use this function to add the two desired columns
def operator(x):
    return x['MRA'] / x['2MRA'], 1 if x['MRA'] / x['2MRA'] <= .5 else 0
# we apply the operator function to add the 'MRA/2MRA' and 'Less than or equal to .5' columns
df_T['MRA/2MRA'], df_T['Less than or equal to .5'] = zip(*df_T.apply(operator, axis=1))
Hope this helps. :)
One way to do what you've asked is:
df = (df[df['Date 1'] <= df['Date 2']]
      .groupby('Name', sort=False)['Date 1'].nlargest(2)
      .reset_index(level=0)
      .assign(**{
          'Amount': df.Amount,
          'Date 2': df['Date 2'],
          'recency': ['MRA', 'MRA2'] * len(set(df.Name.tolist()))
      })
      .pivot(index=['Name', 'Date 2'], columns='recency', values='Amount')
      .reset_index().rename_axis(columns=None))
df = df.assign(**{'Amount': df.MRA, 'MRA / MRA2': df.MRA / df.MRA2})
df = df.assign(**{'Less than or equal to .5': (df['MRA / MRA2'] <= 0.5).astype(int)})
df = pd.concat([df[['Name', 'Amount']], df.drop(columns=['Name', 'Amount'])], axis=1)
Input:
  Name  Amount      Date 1      Date 2
0  Jim     100  2021-06-10  2021-06-15
1  Jim     200  2021-05-11  2021-06-15
2  Jim     150  2021-03-05  2021-06-15
3  Bob     350  2022-06-10  2022-08-30
4  Bob     300  2022-08-12  2022-08-30
5  Bob     400  2021-07-06  2022-08-30
Output:
  Name  Amount      Date 2  MRA  MRA2  MRA / MRA2  Less than or equal to .5
0  Bob     300  2022-08-30  300   350    0.857143                         0
1  Jim     100  2021-06-15  100   200    0.500000                         1
Explanation:
Filter only for rows where Date 1 <= Date 2
Use groupby() and nlargest() to get the 2 most recent Date 1 values per Name
Use assign() to add back the Amount and Date 2 columns and create a recency column containing MRA and MRA2 for the pair of rows corresponding to each Name value
Use pivot() to turn the recency values MRA and MRA2 into column labels
Use reset_index() to restore Name and Date 2 to columns, and use rename_axis() to make the columns index anonymous
Use assign() once to restore Amount and add column MRA / MRA2, and again to add column named Less than or equal to .5
Use concat(), [] and drop() to rearrange the columns to match the output sequence shown in the question.
Here's the rough procedure you want:
sort_values by Name and Date 1 to get the data in order.
shift to get the previous date and 2nd most recent amount fields
Filter the dataframe for Date 1 <= Date 2.
groupby by Name and use head to get only the first row.
Now, your Amount column is your Most Recent Amount and your shifted Amount column is the 2nd Most Recent Amount. From there, you can do a simple division to get the ratio; a rough sketch of these steps follows below.
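A minimal sketch of that procedure, using the sample frame from the question (the per-Name shift, and filtering before shifting so the 2nd most recent amount only comes from rows at or before Date 2, are my own assumptions about the details):
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jim', 'Jim', 'Jim', 'Bob', 'Bob', 'Bob'],
    'Amount': [100, 200, 150, 350, 300, 400],
    'Date 1': pd.to_datetime(['2021-06-10', '2021-05-11', '2021-03-05',
                              '2022-06-10', '2022-08-12', '2021-07-06']),
    'Date 2': pd.to_datetime(['2021-06-15'] * 3 + ['2022-08-30'] * 3),
})

# Keep only rows at or before Date 2, then sort so the most recent Date 1
# comes first within each Name.
out = (df[df['Date 1'] <= df['Date 2']]
       .sort_values(['Name', 'Date 1'], ascending=[True, False]))
# Shift within each Name so the 2nd most recent amount sits next to the most recent.
out['2MRA'] = out.groupby('Name')['Amount'].shift(-1)
# The first remaining row per Name is now the most recent one.
out = out.groupby('Name', sort=False).head(1).rename(columns={'Amount': 'MRA'})
out = out.drop(columns='Date 1')
out['MRA / 2MRA'] = out['MRA'] / out['2MRA']
out['Less than or equal to .5'] = (out['MRA / 2MRA'] <= 0.5).astype(int)
print(out)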

Make date column into standard format using pandas

How can I use pandas to convert a dates column into a standard format, i.e. 12-08-1996? The data I have contains dates in a mix of formats.
I've tried some methods I found by searching online, but haven't found one that detects the format and makes it standard.
Here is what I've coded:
df = pd.read_excel(r'date cleanup.xlsx')
df.head(10)
df.DOB = pd.to_datetime(df.DOB) #Error is in this line
The error I get is:
ValueError: ('Unknown string format:', '20\\ december \\1992')
UPDATE:
Using
from dateutil import parser

for date in df.DOB:
    print(parser.parse(date))
This works great, but there is a value 20\\december \\1992 for which it gives the error highlighted above. I'm not familiar with all the formats that are in the data, which is why I was looking for a technique that can auto-detect the format and convert it to a standard one.
You could use dateparser library:
import dateparser
df = pd.DataFrame(["12 aug 1996", "24th december 2006", "20\\ december \\2007"], columns = ['DOB'])
df['date'] = df['DOB'].apply(lambda x :dateparser.parse(x))
Output
| | DOB | date |
|---|--------------------|------------|
| 0 | 12 aug 1996 | 1996-08-12 |
| 1 | 24th december 2006 | 2006-12-24 |
| 2 | 20\ december \2007 | 2020-12-07 |
EDIT
Note: there is a STRICT_PARSING setting which can be used to handle exceptions:
You can also ignore parsing incomplete dates altogether by setting STRICT_PARSING.
df['date'] = df['DOB'].apply(lambda x: dateparser.parse(x, settings={'STRICT_PARSING': True}) if len(str(x)) > 6 else None)
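If pulling in dateparser isn't an option, a rough alternative sketch (assuming the only troublesome characters are the stray backslashes) is to clean the strings and fall back to dateutil, which the question's update already uses, then format to the DD-MM-YYYY standard:
import pandas as pd
from dateutil import parser

def clean_parse(value):
    # Strip the stray backslashes, then let dateutil guess the format.
    try:
        return parser.parse(str(value).replace('\\', ' ').strip())
    except (ValueError, OverflowError):
        return None

parsed = df['DOB'].apply(clean_parse)
df['date'] = parsed.apply(lambda d: d.strftime('%d-%m-%Y') if d is not None else None)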

Dask: Memory efficient way aggregating different slices of dataframe and merging together

I have a Dask dataframe (df) of order histories for customers that I read in from a csv file.
Excerpt:
| customer_id | order_date | order_number | sales |
|-------------|------------|--------------|----------|
| 109230900 | 20190104 | order101 | 210.50 |
I first get total sum of sales per customer. Next I want to get aggregates of sales for different date intervals: sales within the last 7 days, last 14 days, and last 30 days. Today's date is 20190104 (YYYYMMDD).
Here's the result I want:
| customer_id | total_sales | sales_for_past_7_days | sales_for_past_14_days | sales_for_past_30_days |
|-------------|-------------|-----------------------|--------------------------|--------------------------|
| 109230900 | 5105.10 | 210.50 | 320.00 | 1045.05 |
My attempt:
import datetime as dt
import dask.dataframe as dd
from dask.distributed import Client

client = Client()
df = dd.read_csv('order_history.csv', blocksize=10e6)

total_sales = df.groupby('customer_id').agg({'sales': 'sum'}, split_out=100).rename(columns={'sales': 'total_sales'})
total_sales = client.persist(total_sales)

end_date = dt.datetime.strptime('20190104', '%Y%m%d')
for interval in [7, 14, 30]:
    start_date = int((end_date - dt.timedelta(days=interval)).strftime('%Y%m%d'))
    newcol = 'sales_for_past_{}_days'.format(interval)
    tempdf = df[df.order_date > start_date].groupby('customer_id')
    tempagg = tempdf.agg({'sales': 'sum'}, split_out=100).rename(columns={'sales': newcol})
    total_sales = dd.merge(total_sales, tempagg, how='left', left_index=True, right_index=True)
    total_sales = client.persist(total_sales)
print(total_sales.head())
Is there a smarter way of doing this? When running the above code, I get numerous warning messages about memory issues and about there being too many tasks (roughly 2 million tasks, by the way).
All of this is running as a single-machine cluster on a Linux box with 16GB of RAM and 16 cores. The code is run as a Python script, not in a Jupyter notebook.
python==3.7.0
dask==1.0.0
distributed==1.25.1
tornado==5.1
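One possible alternative (a sketch only, assuming order_date is stored as an integer YYYYMMDD column, as in the excerpt) is to build a masked sales column per interval up front and then aggregate everything in a single groupby, so the data is only shuffled once:
import datetime as dt
import dask.dataframe as dd

df = dd.read_csv('order_history.csv', blocksize=10e6)

end_date = dt.datetime.strptime('20190104', '%Y%m%d')
for interval in [7, 14, 30]:
    start_date = int((end_date - dt.timedelta(days=interval)).strftime('%Y%m%d'))
    col = 'sales_for_past_{}_days'.format(interval)
    # Zero out sales that fall outside the interval; summing the column
    # then gives the interval total per customer.
    df[col] = df['sales'].where(df['order_date'] > start_date, 0)

agg_spec = {'sales': 'sum',
            'sales_for_past_7_days': 'sum',
            'sales_for_past_14_days': 'sum',
            'sales_for_past_30_days': 'sum'}
result = (df.groupby('customer_id')
            .agg(agg_spec, split_out=100)
            .rename(columns={'sales': 'total_sales'}))
print(result.head())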

Pandas - Evaluate a condition for each row of a series

I have a dataset like this:
| Policy | Customer | Employee | CoverageDate | LapseDate  |
|--------|----------|----------|--------------|------------|
| 123    | 1234     | 1234     | 2011-06-01   | 2015-12-31 |
| 124    | 1234     | 1234     | 2016-01-01   | ?          |
| 125    | 1234     | 1234     | 2011-06-01   | 2012-01-01 |
| 124    | 5678     | 5555     | 2014-01-01   | ?          |
I'm trying to iterate through each policy for each employee of each customer (a customer can have many employees, an employee can have multiple policies) and compare the covered date against the lapse date for a particular employee. If the covered date and lapse date are within 5 days, I'd like to add that policy to a results list.
So, expected output would be:
| Policy | Customer | Employee |
|--------|----------|----------|
| 123    | 1234     | 1234     |
because policy 123's lapse date was within 5 days of policy 124's covered date.
So far, I've used this code:
import pandas
import datetime

# Pull in data from query
wd = pandas.read_csv('DATA')
wd = wd.set_index('Policy#')
wd = wd.rename(columns={'Policy#': 'Policy'})

Resultlist = []
for EMPID in wd.groupby(['EMPID', 'Customer']):
    for Policy in wd.groupby(['EMPID', 'Customer']):
        EffDate = pandas.to_datetime(wd['CoverageEffDate'])
        for Policy in wd.groupby(['EMPID', 'Customer']):
            check = wd['LapseDate'].astype(str)
            if check.any() == '?':  # here lies the problem - it's evaluating if ANY of the items == '?'
                print(check)
                continue
            else:
                LapseDate = pandas.to_datetime(wd['LapseDate']) + datetime.timedelta(days=5)
                if EffDate < LapseDate:
                    Resultlist.append(wd['Policy', 'Customer'])
print(Resultlist)
I'm trying to use the pandas .any() function to evaluate if the current row is a '?' (which means null data, i.e. the policy hasn't lapsed). However, it appears that this statement just evaluates if there is a '?' row in the entire column, not the current row. I need to determine this because if I compare the '?' value against a date I get an error.
Is there a way to reference just the row I'm iterating on for a conditional check? To my knowledge, I can't use the pandas apply function technique because I need each employee's policy data compared against any other policies they hold.
Thank you!
check.str.contains('?', regex=False) would return a boolean array showing which entries had a '?' in them (regex=False is needed because '?' is a regex metacharacter). Otherwise you might consider just iterating through, i.e.
check = wd['LapseDate'].astype(str)
for row in check:
    if row == '?':
        print(check)
but there's really no difference between checking for any match and returning if there is one, and iterating through all entries and returning when one matches.
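Building on that, a small sketch of how the per-row mask could be used to avoid comparing '?' against a date (column names are taken from the question and may need adjusting to the real data):
import pandas as pd

check = wd['LapseDate'].astype(str)
# True where a real lapse date exists, False where the placeholder '?' is present.
has_lapse = ~check.str.contains('?', regex=False)

# Only convert the rows that actually hold dates.
lapse_plus_5 = pd.to_datetime(wd.loc[has_lapse, 'LapseDate']) + pd.Timedelta(days=5)
eff = pd.to_datetime(wd.loc[has_lapse, 'CoverageEffDate'])
within_window = eff < lapse_plus_5  # row-by-row boolean Series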

Pandas group time into periods starting at different times

I have a time series with irregular frequency of samples. To get useful data out of this, I need to find 10-minute periods with roughly evenly spaced samples (which I have defined as the average timedelta between two samples being less than 20s).
Example Data:
(For the sake of this example, I will make it 10s intervals with avg 2s deltas.)
| timestamp           | speed |
|---------------------|-------|
| 2010-01-01 09:20:12 | 10    |
| 2010-01-01 09:20:14 | 14    |
| 2010-01-01 09:20:16 | 12    |
| 2010-01-01 09:20:27 | 18    |
| 2010-01-01 09:20:28 | 19    |
| 2010-01-01 09:20:29 | 19    |
The result I am hoping for is a grouping like the following. Note that the second group does not get included, because the samples are bunched together at the end of the 10s period (27, 28, 29), which implies an extra time gap of 7s and makes the average delta 3s.
| timestamp           | avg | std  | std_over_avg |
|---------------------|-----|------|--------------|
| 2010-01-01 09:20:10 | 12  | 1.63 | 0.136        |
EDIT:
I think I was combining multiple things in my question (and some incorrectly) so I would like to correct/clarify what I am looking for.
Referring back to the example data, I would like to group it into irregular periods of 10s; that is, if there is a gap in the data, the next 10s period should start from the timestamp of the next viable record. (Please ignore the previous mention of evenly spaced samples; it turns out I misinterpreted that requirement, and I can always filter it out at a later stage if need be.) So I would want something like this:
| period                                    | count | avg  | std   | std_over_avg |
|-------------------------------------------|-------|------|-------|--------------|
| 2010-01-01 09:20:12 - 2010-01-01 09:20:22 | 3     | 12   | 1.63  | 0.136        |
| 2010-01-01 09:20:27 - 2010-01-01 09:20:37 | 3     | 18.6 | 0.577 | 0.031        |
I have found a method for achieving most of what I wanted but it is ugly and slow. Hopefully someone can use this as a starting point to develop something more useful:
import datetime

group_num = 0
cached_future_time = None

def group_by_time(df, ind):
    global group_num
    global cached_future_time
    curr_time = ind
    future_time = ind + datetime.timedelta(minutes=10)
    # Assume records are sorted chronologically ascending for this to work.
    end = df.index.get_loc(future_time, method='pad')
    start = df.index.get_loc(curr_time)
    num_records = end - start
    if cached_future_time is not None and curr_time < cached_future_time:
        pass
    elif cached_future_time is not None and curr_time >= cached_future_time:
        group_num += 1
        # Only advance the cached_future_time mark if we have sufficient data points to make this group useful.
        if num_records >= 30:
            cached_future_time = future_time
    elif cached_future_time is None:
        cached_future_time = future_time
    return group_num

grp = df.groupby(lambda x: group_by_time(df, x))
Edit:
Ok, I found a much more Pandas-ic way to do this, which is also significantly faster than the ugly loop above. My downfall in the above answer was thinking that I needed to do most of the work for calculating the groups in the groupby function (and thinking there wasn't a way to apply such a method across all the rows intelligently).
# Add 10min to our timestamp and shift the values in that column 30 records
# into the future. We can then find all the timestamps that are 30 records
# newer but still within 10min of the original timestamp (ensuring that we have
# a 10min group with at least 30 records).
records["future"] = records["timestamp"] + datetime.timedelta(minutes=10)
records["group_num"] = None  # groups get filled in below
starts = list(records[(records["timestamp"] <= records.future.shift(30)) & records.group_num.isnull()].index)
group_num = 1
# For each of those starting timestamps, grab a slice up to 10min in the future
# and apply a group number.
for start in starts:
    group = records.loc[start:start + datetime.timedelta(minutes=10), 'group_num']
    # Only apply group_num to null values so that we get disjoint groups (no overlaps).
    missing = group[group.isnull()].index
    if len(missing) >= 30:
        records.loc[missing, 'group_num'] = group_num
        group_num += 1
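As a follow-up, once group_num is populated, the per-group stats from the desired output (count, avg, std, std_over_avg) fall out of a single groupby. This is a sketch assuming speed is the value column, as in the example data:
stats = (records.dropna(subset=['group_num'])
                .groupby('group_num')['speed']
                .agg(['count', 'mean', 'std']))
stats['std_over_avg'] = stats['std'] / stats['mean']
print(stats)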
