I am working with a large data set containing portfolio holdings of clients per date (i.e. in each time period, I have a number of stock investments for each person). My goal is to try and identify 'buys' and 'sells'. A buy happens when a new stock appears in a person's portfolio (compared to the previous period). A sell happens when a stock disappears in a person's portfolio (compared to the previous period). Is there an easy/efficient way to do this in Python? I can only think of a cumbersome way via for-loops.
Suppose we have the following dataframe, which can be created with the following code:
df = pd.DataFrame({'Date_ID':[1,1,1,1,2,2,2,2,2,2,3,3,3,3], 'Person':['a', 'a', 'b', 'b', 'a', 'a', 'a', 'a', 'b', 'b', 'a', 'a', 'a', 'b'], 'Stock':['x1', 'x2', 'x2', 'x3', 'x1', 'x2', 'x3', 'x4', 'x2', 'x3', 'x1', 'x2', 'x3', 'x3']})
I would like to create 'buy' and 'sell' columns that identify stocks that have been added to, or are about to be removed from, the portfolio. The buy column equals True if the stock newly appears in the person's portfolio (compared to the previous date). The sell column equals True if the stock disappears from the person's portfolio on the next date.
How to accomplish this (or something similar to identify trades efficiently) in Python?
You can group your dataframe by 'Person' first, because people are completely independent from each other.
Then, for each person, group by 'Date_ID' and, for each stock in a group, determine whether it was absent from the previous group (a buy) and whether it will be absent from the next group (a sell):
def get_person_indicators(df):
    """`df` here contains info for 1 person only."""
    g = df.groupby('Date_ID')['Stock']

    prev_stocks = g.agg(set).shift()
    was_bought = g.transform(lambda s: ~s.isin(prev_stocks[s.name])
                                       if not pd.isnull(prev_stocks[s.name])
                                       else False)

    next_stocks = g.agg(set).shift(-1)
    will_sell = g.transform(lambda s: ~s.isin(next_stocks[s.name])
                                      if not pd.isnull(next_stocks[s.name])
                                      else False)

    return pd.DataFrame({'was_bought': was_bought, 'will_sell': will_sell})

result = pd.concat([df, df.groupby('Person').apply(get_person_indicators)],
                   axis=1)
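As a quick sanity check (just a sketch, assuming the result frame built above), person 'a' should have x3 and x4 flagged as bought on Date_ID 2, and x4 flagged as will_sell there since it is gone by Date_ID 3:

# inspect one person to verify the flags on the toy data
print(result[result['Person'] == 'a'][['Date_ID', 'Stock', 'was_bought', 'will_sell']])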
Note:
For better memory usage you can change the dtype of the 'Stock' column from str to Categorical:
df['Stock'] = df['Stock'].astype('category')
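To gauge the saving on the example frame (a rough check, not part of the original answer), you can compare the column's memory footprint before and after the conversion:

# deep=True counts the actual string payloads of the object column
print(df['Stock'].astype(str).memory_usage(deep=True))
print(df['Stock'].astype('category').memory_usage(deep=True))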
I have this data:
import pandas as pd
data = {
'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33', '2022-11-03 00:00:35', '2022-11-03 00:00:46', '2022-11-03 00:01:21', '2022-11-03 00:01:30'],
'from': ['A', 'A', 'A', 'A', 'B', 'C'],
'to': ['B', 'B', 'B', 'C', 'C', 'B'],
'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']
}
df = pd.DataFrame(data)
I want to create two sets of CSVs:
One CSV for each type of vehicle (8 in total) where the rows will be grouped/aggregated by timestamp (into 15-minute intervals throughout the day) and by the "FROM" column - there will be no "TO" column here.
One CSV for each type of vehicle (8 in total) where the rows will be grouped/aggregated by timestamp (into 15-minute intervals throughout the day), by the "FROM" column and by the "TO" column.
The difference between the two sets is that one will count all FROM items and the other will group them and count them by pairs of FROM and TO.
The output will be an aggregated sum of vehicles of a given type for 15 minute intervals summed up by FROM column and also a combination of FROM and TO column.
1st output can look like this for each vehicle type:
2nd output:
I tried using Pandas groupby() and resample(), but due to my limited knowledge, with no success. I can do this in Excel, but very inefficiently. I want to learn Python and be more efficient, so I would like to code it in Pandas.
I tried df.groupby(['FROM', 'TO']).count(), but I lack the knowledge to use it for what I need. I keep either getting an error when I do something I should not, or the output is not what I need.
I tried df.groupby(pd.Grouper(freq='15Min')).count(), but it seems I may have an incorrect data type, and I don't know if this approach is applicable here.
If I understand you correctly, one approach could be as follows:
Data
import pandas as pd
# IIUC, you want e.g. '2022-11-03 00:00:06' to end up in the `00:15` bucket, so we need `to_offset`
from pandas.tseries.frequencies import to_offset

# adjusting the last 2 timestamps to get a different interval group
data = {'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33',
                      '2022-11-03 00:00:35', '2022-11-03 00:00:46',
                      '2022-11-03 00:20:21', '2022-11-03 00:21:30'],
        'from': ['A', 'A', 'A', 'A', 'B', 'C'],
        'to': ['B', 'B', 'B', 'C', 'C', 'B'],
        'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']}

df = pd.DataFrame(data)
print(df)

             timestamp from to type
0  2022-11-03 00:00:06    A  B  Car
1  2022-11-03 00:00:33    A  B  Car
2  2022-11-03 00:00:35    A  B  Van
3  2022-11-03 00:00:46    A  C  Car
4  2022-11-03 00:20:21    B  C  HGV
5  2022-11-03 00:21:30    C  B  Van

# e.g. for FROM we want:    `A`,   4 (COUNT), `00:15` (TIME-END)
# e.g. for FROM-TO we want: `A-B`, 3 (COUNT), `00:15` (TIME-END)
#                           `A-C`, 1 (COUNT), `00:15` (TIME-END)
Code
# convert the time strings to datetime and set the column as index
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)

# add a `15T` (== 15 minutes) offset to the datetime values
df.index = df.index + to_offset('15T')

# create a `dict` for renaming the columns
cols = {'timestamp': 'TIME-END', 'from': 'FROM', 'to': 'TO'}

# we're doing basically the same for both outputs, so let's use a for loop on a nested list
nested_list = [['from'], ['from', 'to']]

for item in nested_list:
    # groupby `item` (i.e. `['from']` and `['from','to']`),
    # use `.agg` to create a named output (`COUNT`) applied to `item[0]` (i.e. `from` in both cases)
    # and get the `count`. Finally, reset the index
    out = df.groupby(item).resample('15T').agg(COUNT=(item[0], 'count')).reset_index()

    # rename the columns using our `cols` dict
    out = out.rename(columns=cols)

    # convert timestamps like `2022-11-03 00:15:00` to `00:15:00`
    out['TIME-END'] = out['TIME-END'].dt.strftime('%H:%M:%S')

    # rearrange the order of the columns; for the second `item` we need to include `to` (now: `TO`)
    if 'TO' in out.columns:
        out = out.loc[:, ['FROM', 'TO', 'COUNT', 'TIME-END']]
    else:
        out = out.loc[:, ['FROM', 'COUNT', 'TIME-END']]

    # write the output to a csv file; e.g. use an `f-string` to customize the file name;
    # `index=False` avoids writing away the index
    out.to_csv(f'output_{"_".join(item)}.csv', index=False)  # i.e. 'output_from.csv', 'output_from_to.csv'
Output (loaded in excel)
Relevant documentation:
pd.to_datetime, df.set_index, .to_offset
df.groupby, .resample
df.rename
.dt.strftime
df.to_csv
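If you also need one file per vehicle type (the "8 in total" from the question), a minimal, untested sketch along the same lines could add 'type' to the grouping and split the result before writing; the file-name pattern below is made up:

# same idea as above, but grouped by vehicle type as well: one CSV per (type, grouping)
for item in [['from'], ['from', 'to']]:
    out = (df.groupby(['type'] + item)
             .resample('15T')
             .agg(COUNT=(item[0], 'count'))
             .reset_index())
    out = out.rename(columns={**cols, 'type': 'TYPE'})
    out['TIME-END'] = out['TIME-END'].dt.strftime('%H:%M:%S')
    for vehicle_type, chunk in out.groupby('TYPE'):
        chunk.drop(columns='TYPE').to_csv(
            f'output_{vehicle_type}_{"_".join(item)}.csv', index=False)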
I need to evaluate several million lines of performance logging for a manufacturing execution system. I need to group the data by date, class and name, and find the total "wait time" of numerous concurrently running transactions. The data comes in looking similar to what is in this dataframe:
import pandas as pd
d = {'START_DATE': ['2021-08-07 19:11:40', '2021-08-07 19:11:40', '2021-08-07 19:11:40',
'2021-08-07 19:20:40', '2021-08-07 19:20:40', '2021-08-07 19:20:40',
'2021-08-07 19:21:40', '2021-08-07 19:21:40', '2021-08-07 19:21:40',
'2021-08-10 19:20:40', '2021-08-10 19:20:40', '2021-08-10 19:20:40',
'2021-08-10 19:21:40', '2021-08-10 19:21:40', '2021-08-10 19:21:40'
],
'ELAPSED_TIME': ['00:00:00.465', '00:00:01.000', '00:00:00.165',
'00:00:00.100', '00:00:00.200', '00:03:00.000',
'00:05:00.000', '00:00:00.200', '00:00:03.000',
'00:00:00.100', '00:00:00.200', '00:03:00.000',
'00:05:00.000', '00:00:00.200', '00:00:03.000'
],
'TRANSACTION': ['a', 'b', 'c',
'a', 'd', 'c',
'e', 'a', 'b',
'a', 'd', 'c',
'e', 'a', 'b'
],
'USER': ['Bob', 'Bob', 'Bob',
'Biff', 'Biff', 'Biff',
'Biff', 'Biff', 'Biff',
'Bob', 'Bob', 'Bob',
'Bob', 'Bob', 'Bob'
],
'CLASS': ['AA', 'AA', 'AA',
'BB', 'BB', 'BB',
'BB', 'BB', 'BB',
'AA', 'AA', 'AA',
'AA', 'AA', 'AA'
]}
df = pd.DataFrame(data=d)
Notice how the transactions start at the same time and run concurrently with each other, but are "done" at different times. E.g. Bob's first set of transactions (rows 0-2) all take a different amount of time, but when I group by DATE, CLASS, and USER, I want the total wait time to be 1000 ms (based on the second row's wait time).
On 08/07/2021, Biff has two sets of transactions starting at different times, but they still overlap into one wait time of 360,000 ms (6 minutes).
Expected output would look something like:
DATE CLASS USER Wait
2021-08-07 AA Bob 1000
2021-08-07 BB Biff 360000
2021-08-10 AA Bob 360000
Like I mentioned, the actual data has several million lines of transactions. I am looking for help in finding something better (and hopefully faster) than what I have/found:

def getSecs1(grp):
    return pd.DatetimeIndex([]).union_many([
        pd.date_range(row.START_DATE, row.END_DATE, freq='25ms', closed='left')
        for _, row in grp.iterrows()
    ]).size

I add an END_DATE column by adding the milliseconds to the START_DATE. I have to do it in chunks of 25 ms, otherwise it would take way too long.
Any help/advice would be greatly appreciated.
Edit: changed the overlaps in the example to minutes.
This solution uses a package called staircase which is built on pandas and numpy for working with (mathematical) step functions. You can think of an interval as being a step function which goes from value 0 to 1 at the start of an interval and 1 to 0 at the end of an interval.
additional setup
convert START_DATE and ELAPSED_TIME to appropriate pandas time objects
df["START_DATE"] = pd.to_datetime(df["START_DATE"])
df["ELAPSED_TIME"] = pd.to_timedelta(df["ELAPSED_TIME"])
define daily bins
dates = pd.period_range("2021-08-07", "2021-08-10")
solution
Define a function which takes a dataframe, makes a step function from start and end times (calculated as start + duration), sets non-zero values to 1, slices the step function with the bins, and integrates.
import staircase as sc

def calc_dates_for_user(df_):
    return (
        sc.Stairs(  # create the step function from the intervals
            start=df_["START_DATE"],
            end=df_["START_DATE"] + df_["ELAPSED_TIME"],
        )
        # where two intervals overlap, the value of the step function will be 2;
        # make_boolean() sets all non-zero values to 1, effectively creating a union of intervals
        .make_boolean()
        .slice(dates)  # analogous to groupby
        # for each slice, integrate (which equals the total length of the union of intervals)
        # and divide by 1 ms so the wait is expressed in milliseconds, matching the expected output
        .integral() / pd.Timedelta("1ms")
    )
When we groupby USER and CLASS and apply this function we get a dataframe, indexed by these variables, with a column index corresponding to intervals in the period range
            [2021-08-07, 2021-08-08)  [2021-08-08, 2021-08-09)  [2021-08-09, 2021-08-10)  [2021-08-10, 2021-08-11)
USER CLASS
Biff BB                     360000.0                       0.0                       0.0                       0.0
Bob  AA                       1000.0                       0.0                       0.0                  360000.0
We'll clean it up like so
result = (
    df.groupby(["USER", "CLASS"])
    .apply(calc_dates_for_user)
    .melt(ignore_index=False, var_name="DATE", value_name="WAIT")  # melt the column index into a single column of daily intervals
    .query("WAIT != 0")      # filter out days where no time was recorded
    .reset_index()           # move USER and CLASS from the index back to columns
)
result then looks like this
   USER CLASS                      DATE      WAIT
0  Biff    BB  [2021-08-07, 2021-08-08)  360000.0
1   Bob    AA  [2021-08-07, 2021-08-08)    1000.0
2   Bob    AA  [2021-08-10, 2021-08-11)  360000.0
To get your expected result you can replace the DATE column with the day-start timestamps:
result["DATE"] = pd.IntervalIndex(result["DATE"]).left
note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
The project I'm working on requires me to find out which 'project' has been updated since the last time it was processed. For this purpose I have two dataframes which both contain three columns, the last one of which is a date signifying the last time a project is updated. The first dataframe is derived from a query on a database table which records the date a 'project' is updated. The second is metadata I store myself in a different table about the last time my part of the application processed a project.
I think I came pretty far but I'm stuck on the following error, see the code provided below:
lastmatch = pd.DataFrame({
'projectid': ['1', '2', '2', '3'],
'stage': ['c', 'c', 'v', 'v'],
'lastmatchdate': ['2020-08-31', '2013-11-24', '2013-11-24',
'2020-08-31']
})
lastmatch['lastmatchdate'] = pd.to_datetime(lastmatch['lastmatchdate'])
processed = pd.DataFrame({
'projectid': ['1', '2'],
'stage': ['c', 'v'],
'process_date': ['2020-08-30', '2013-11-24']
})
processed['process_date'] = pd.to_datetime(
processed['process_date']
)
unprocessed = lastmatch[~lastmatch.isin(processed)].dropna()
processed.set_index(['projectid', 'stage'], inplace=True)
lastmatch.set_index(['projectid', 'stage'], inplace=True)
processed.sort_index(inplace=True)
lastmatch.sort_index(inplace=True)
print(lastmatch['lastmatchdate'])
print(processed['process_date'])
to_process = lastmatch.loc[lastmatch['lastmatchdate'] > processed['process_date']]
The result I want to achieve is a dataframe containing the rows where the 'lastmatchdate' is greater than the date that the project was last processed (process_date). However this line:
to_process = lastmatch.loc[lastmatch['lastmatchdate'] > processed['process_date']]
produces ValueError: Can only compare identically-labeled Series objects. I think it might be a syntax issue I don't know of, or something I got wrong.
The output I expect is in this case:
                lastmatchdate
projectid stage
1         c        2020-08-31
So concretely the question is: how do I get a dataframe containing only the rows of another dataframe having the (datetime) value of column a greater than column b of the other dataframe.
merged = pd.merge(processed, lastmatch, left_index=True, right_index=True)
merged = merged.assign(to_process=merged['lastmatchdate'] > merged['process_date'])
You will get the following:
                process_date lastmatchdate  to_process
projectid stage
1         c       2020-08-30    2020-08-31        True
2         v       2013-11-24    2013-11-24       False
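To keep only the rows that still need processing (the expected output above), a follow-up filter on that boolean column is enough:

# rows whose last match is newer than the last processing run
print(merged.loc[merged['to_process'], ['lastmatchdate']])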
You've received the ValueError because you tried to compare two differently-labeled dataframes. If you want to compare two dataframes row by row, merge them first:
lastmatch = pd.DataFrame({
'projectid': ['1', '2', '2', '3'],
'stage': ['c', 'c', 'v', 'v'],
'lastmatchdate': ['2020-08-31', '2013-11-24', '2013-11-24',
'2020-08-31']
})
lastmatch['lastmatchdate'] = pd.to_datetime(lastmatch['lastmatchdate'])
processed = pd.DataFrame({
'projectid': ['1', '2'],
'stage': ['c', 'v'],
'process_date': ['2020-08-30', '2013-11-24']
})
processed['process_date'] = pd.to_datetime(
processed['process_date']
)
df = pd.merge(lastmatch, processed, on=['stage', 'projectid'])
df = df[df.lastmatchdate > df.process_date]
print(df)
projectid stage lastmatchdate process_date
0 1 c 2020-08-31 2020-08-30
I want to allocate downloaded data (CSV) into, for simplicity, say 3 categories. Has anyone got any tips, similar projects I could look at, or Python tools I should look at?
3 categories are...
Shares: Include the following a,b,c
Bonds: Include the following d,e,f
Cash: g
My downloaded data may have any combination of the above investments with any value.
https://docs.google.com/spreadsheets/d/1GU7jVLA-YzqRTxyLMdbymdJ6b1RtB09bpOjIDX6eJok/edit?usp=sharing
Those are 2 basic examples of what the data will be downloaded as and what I want it to be converted to.
The real data will have 10-15 investments and approximately 4 categories. I just want to know whether it is possible to sort like this? It gets tricky as we have longer investment names, and some are similar but sorted into different categories.
If someone could point me in the right direction, i.e. do I need a dictionary or some basic framework or code to look at, that would be awesome.
Keen to learn but don't know where to start, cheers - this is my first proper coding project.
I'm not too fussed about the formatting of the output; as long as it clearly categorises the info and sums each category I'm happy :)
You don't need a framework, just the builtins will do (as usual in Python).
from collections import defaultdict
# Input data "rows". These would probably be loaded from a file.
raw_data = [
    ('a', 1000.00),
    ('b', 2000.00),
    ('d', 3000.00),
    ('e', 4000.00),
    ('g', 5000.00),
    ('g', 10000.00),
    ('c', 5000.00),
    ('d', 2000.00),
    ('a', 4000.00),
    ('e', 5000.00),
]

# Category definitions, mapping a category name to the row "types" (first column).
categories = {
    'Shares': {'a', 'b', 'c'},
    'Bonds': {'d', 'e', 'f'},
    'Cash': {'g'},
}

# Build an inverse map that makes lookups faster later.
# This will look like e.g. {"a": "Shares", "b": "Shares", ...}
category_map = {}
for category, members in categories.items():
    for member in members:
        category_map[member] = category

# Initialize an empty defaultdict to group the rows with.
rows_per_category = defaultdict(list)

# Iterate through the raw data...
for row in raw_data:
    row_type = row[0]                         # grab the first column per row,
    category = category_map[row_type]         # map it through the category map (this will crash if the category is undefined),
    rows_per_category[category].append(row)   # and put it in the defaultdict.

# Iterate through the now collated rows in sorted-by-category order:
for category, rows in sorted(rows_per_category.items()):
    # Sum the second column (value) for the total.
    total = sum(row[1] for row in rows)
    # Print a header.
    print("###", category)
    # Print each row.
    for row in rows:
        print(row)
    # Print the total and an empty line.
    print("=== Total", total)
    print()
This will output something like
### Bonds
('d', 3000.0)
('e', 4000.0)
('d', 2000.0)
('e', 5000.0)
=== Total 14000.0
### Cash
('g', 5000.0)
('g', 10000.0)
=== Total 15000.0
### Shares
('a', 1000.0)
('b', 2000.0)
('c', 5000.0)
('a', 4000.0)
=== Total 12000.0
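Since the data actually arrives as a CSV, a pandas sketch (assuming, hypothetically, columns named 'investment' and 'value' in your file) does the same grouping and summing in a few lines:

import pandas as pd

# build the frame from the toy rows above; with a real file this would be pd.read_csv(...)
df = pd.DataFrame(raw_data, columns=['investment', 'value'])
df['category'] = df['investment'].map(category_map)   # reuse the inverse map defined earlier
print(df.groupby('category')['value'].sum())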
Suppose I'm managing many stock brokerage accounts, and each account has different types of stock in it. I'm trying to write some code to perform a stress test.
What I'm trying to do is, I have 2 dataframes:
Account information (dataframe):
account = pd.DataFrame({'account': ['1', '1', '1', '2', '2'], 'Stock type': ['A', 'A', 'B', 'B', 'C'], 'share value': ['100', '150', '200', '175', '85']})
stress test scenario(dataframe):
test = pd.DataFrame({'stock type': ['A', 'B', 'C', 'D'], 'stress shock': ['0.8', '0.7', '0.75', '0.6']})
Given these 2 dataframes, I want to calculate for each account, what's the share value after the stress shock.
i.e. for account #1, after shock value = 100*0.8 + 150*0.8 + 200*0.7 = 340
I tried a basic for loop, but my Jupyter notebook soon crashes (out of memory) during the run:
shocked = []
for i in range(len(account)):
    for j in range(len(test)):
        if account.loc[i, 'Stock type'] == test.loc[j, 'stock type']:
            shocked.append(account.loc[i, 'share value'] * test.loc[j, 'stress shock'])
We can first do a merge to bring the data of the two dataframes together. Then we calculate the after-shock value and finally take the sum per account:
merge = account.merge(test, left_on='Stock type', right_on='stock type')
merge['after_stress_shock'] = pd.to_numeric(merge['share value']) * pd.to_numeric(merge['stress shock'])
merge.groupby('account')['after_stress_shock'].sum()
account
1 340.00
2 186.25
Name: after_stress_shock, dtype: float64
Note I used pandas.to_numeric since your values are in string type.
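If you would rather have the result back as a regular dataframe instead of a Series, the same aggregation can be written with as_index=False (just a usage note, not part of the original answer):

# keep 'account' as a column rather than the index
result = merge.groupby('account', as_index=False)['after_stress_shock'].sum()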
Create a Series to map "stock type" to "stress shock", then use pandas.groupby.apply with a lambda function for the desired result (converting the string values to numeric first, as above):
stress_map = pd.to_numeric(test.set_index('stock type')['stress shock'])
account['share value'] = pd.to_numeric(account['share value'])
account.groupby('account').apply(lambda x: (x['Stock type'].map(stress_map) * x['share value']).sum())
[output]
account
1 340.00
2 186.25
dtype: float64