How to find the last 24 hours of data in a pandas data frame - python

I have a DataFrame with two columns, description and publishedAt. I sorted on the publishedAt column to get the rows in descending date order. Here is a sample of my data frame:
description publishedAt
13 Bitcoin price has failed to secure momentum in... 2018-05-06T15:22:22Z
16 Brian Kelly, a long-time contributor to CNBC’s... 2018-05-05T15:56:48Z
2 The bitcoin price is less than $100 away from ... 2018-05-05T13:14:45Z
12 Mati Greenspan, a senior analyst at eToro and ... 2018-05-04T16:05:37Z
52 A Singaporean startup developing ‘smart bankno... 2018-05-04T14:02:30Z
75 Cryptocurrencies are set to make a comeback on... 2018-05-03T08:10:19Z
76 The bitcoin price is hovering near its best le... 2018-04-30T16:26:57Z
74 In today’s climate of ICOs with 100 billion to... 2018-04-30T12:03:31Z
27 Investment guru Warren Buffet remains unsold o... 2018-04-29T17:22:19Z
22 The bitcoin price has increased by around $400... 2018-04-28T12:28:35Z
68 Bitcoin futures volume reached an all-time hig... 2018-04-27T16:32:44Z
14 Biotech-company-turned-cryptocurrency-investme... 2018-04-27T14:25:15Z
67 The bitcoin price has rebounded to $9,200 afte... 2018-04-27T06:24:42Z
Now I want the descriptions from the last 3 hours, 6 hours, 12 hours and 24 hours.
How can I find them?
Thanks

As a simple solution within pandas, you can use the DataFrame.last(offset) function. Be sure to set the publishedAt column as the DataFrame's DatetimeIndex first. A similar function for getting rows at the start of a DataFrame is DataFrame.first(offset).
Here is an example using the provided offsets:
df.last('24h')
df.last('12h')
df.last('6h')
df.last('3h')
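For example, a minimal sketch of the full flow (assuming publishedAt holds ISO 8601 strings like the sample above):
import pandas as pd

# parse the timestamps and move them into the index so .last() can use them
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df = df.set_index('publishedAt').sort_index()

last_24h = df.last('24h')['description']  # likewise '12h', '6h', '3h'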

Assuming that the dataframe is called df (and that publishedAt has been parsed to datetimes, e.g. with pd.to_datetime):
import datetime as dt

df[df['publishedAt'] >= (dt.datetime.now() - dt.timedelta(hours=3))]['description']  # hours = 6, 12, 24
If you need the intervals to be exclusive, i.e. the descriptions within the last 6 hours but not the ones within the last 3 hours, you can combine the conditions with element-wise logical operators from numpy such as numpy.logical_and(arr1, arr2) inside the first bracket.
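For instance, a sketch of the exclusive 3-to-6-hour band:
import datetime as dt
import numpy as np

now = dt.datetime.now()
# rows older than 3 hours but within the last 6 hours
mask = np.logical_and(df['publishedAt'] >= now - dt.timedelta(hours=6),
                      df['publishedAt'] < now - dt.timedelta(hours=3))
df[mask]['description']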

Related

How to find the maximum date value with conditions in python?

I have a three-column dataframe as follows. I want to calculate each fund's three-month return per day, so I need to get the date with recorded NAV data three months earlier. Should I use the max() function with the filter() function to deal with this problem? If so, how? If not, could you please help me figure out a better way to do this?
fund code   date         NAV
fund 1      2021-01-04   1.0000
fund 1      2021-01-05   1.0001
fund 1      2021-01-06   1.0023
...         ...          ...
fund 2      2020-02-08   1.0000
fund 2      2020-02-09   0.9998
fund 2      2020-02-10   1.0001
...         ...          ...
fund 3      2022-05-04   2.0021
fund 3      2022-05-05   2.0044
fund 3      2022-05-06   2.0305
I tried to combine the max() function with filter() as follows:
max(filter(lambda x: x<=df['date']-timedelta(days=91)))
But it didn't work.
Were this in excel, I know I could use the following functions to solve this problem:
{max(if(B:B<=B2-91,B:B))}
{max(if(B:B<=B3-91,B:B))}
{max(if(B:B<=B4-91,B:B))}
....
But with python, I don't know what I could do. I just learnt it three days ago. Please help me.
This picture is what I want if it were in Excel. The yellow area is the original data. The white part is the intermediate calculation and the red part is the result I want. To get this result, I need to divide the 3rd column by the 5th column.
I know that I could use the pct_change(periods=7) function to get the same results as in this picture. But here is the tricky part: the row 7 rows before does not necessarily hold the data from 7 days before, and not all the funds are recorded daily. Some funds are recorded weekly, some monthly. So I need to check first whether the data used for the division exists.
What you need is a sliding-window maximum (for your example, 1 week / 7 days).
I recreated your example as follows (to create the data frame you have):
import pandas as pd
import datetime
from random import randint

df = pd.DataFrame(columns=["fund code", "date", "NAV"])
date = datetime.datetime.strptime("2021-01-04", '%Y-%m-%d')
# fund 1: ten consecutive days, then a gap, then five more days
for i in range(10):
    df = df.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)}, ignore_index=True)
for i in range(20, 25):
    df = df.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)}, ignore_index=True)
# fund 2: five days overlapping fund 1's second block
for i in range(20, 25):
    df = df.append({"fund code": 'fund 2', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)}, ignore_index=True)
# note: DataFrame.append was removed in pandas 2.0; with recent pandas,
# collect the row dicts in a list and build the frame with pd.DataFrame(rows)
This will look like your example, with non-contiguous dates and two different funds.
A sliding-window maximum with a variable window length in days looks like this:
from collections import deque

class max_queue:
    def __init__(self, win=7):
        self.win = win        # window length in days
        self.queue = deque()
        self.date = None

    def append(self, date, value):
        # drop smaller values from the back: they can never become the max
        while self.queue and value > self.queue[-1][1]:
            self.queue.pop()
        # drop entries from the front that have fallen out of the window
        while self.queue and date - self.queue[0][0] >= datetime.timedelta(self.win):
            self.queue.popleft()
        self.queue.append((date, value))
        self.date = date

    def get_max(self):
        return self.queue[0][1]
Now you can simply iterate over the rows and get the max value in the time frame you are interested in:
mq = max_queue(7)
pre_code = ''
for idx, row in df.iterrows():
    code, date, nav, *_ = row
    if code != pre_code:
        # a new fund starts: reset the window
        mq = max_queue(7)
        pre_code = code
    mq.append(date, nav)
    df.at[idx, 'max'] = mq.get_max()
The result will look like this, with an added max column. This assumes each fund's rows are stored contiguously; otherwise you could keep a separate max_queue per fund.
Using a max-queue that only keeps track of the maximum in the window gives the optimal O(n) complexity for this problem, which matters if you are dealing with huge datasets, and especially with date ranges longer than a week.
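As a side note, a pandas-only sketch of the same 7-day window maximum is possible with a time-based rolling window (assuming the date column is datetime-typed; rolling('7D') looks back seven days within each fund):
# rolling needs a numeric dtype
df['NAV'] = df['NAV'].astype(float)
roll_max = (df.sort_values(['fund code', 'date'])
              .set_index('date')
              .groupby('fund code')['NAV']
              .rolling('7D')
              .max()
              .rename('max'))
print(roll_max.reset_index())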

Python Pandas Web Scraping

I'm trying to turn a list of tables on this page into a Pandas DataFrame:
https://intermediaries.hsbc.co.uk/products/product-finder/
I want to select only the customer type box, select each of its elements (from first to last), and click Find product to display the table for each one, before concatenating all the DataFrames into one DataFrame.
So far I have managed to select the first element and print the table, but I can't seem to turn it into a pandas DataFrame; I get a ValueError: Must pass 2-d input. shape=(1, 38, 12)
This is my code:
def product_type_button(self):
    select = Select(self.driver.find_element_by_id('Availability'))
    try:
        select.select_by_visible_text('First time buyer')
    except NoSuchElementException:
        print('The item does not exist')
    time.sleep(5)
    self.driver.find_element_by_xpath('//button[@type="button" and (contains(text(),"Find product"))]').click()
    time.sleep(5)

def create_dataframe(self):
    data1 = pd.read_html(self.driver.page_source)
    print(data1)
    data2 = pd.DataFrame(data1)
    time.sleep(5)
    data2.to_csv('Data1.csv')
I would like to find a way to print the table for each element, maybe selecting by index instead, and then concatenate everything into one DataFrame. Any help would be appreciated.
All data for the table is located inside a JavaScript file. You can use re/json to parse it and then construct the dataframe:
import re
import json
import requests
import pandas as pd
js_src = "https://intermediaries.hsbc.co.uk/component---src-pages-products-product-finder-js-9c7004fb8446c3fe0a07.js"
data = requests.get(js_src).text
data = re.search(r"JSON\.parse\('(.*)'\)", data).group(1)
data = json.loads(data)
df = pd.DataFrame(data)
print(df.head().to_markdown(index=False))
df.to_csv("data.csv", index=False)
Prints:
| Changed | NewProductCode | PreviousProductCode | ProductType | Deal Period (Fixed until) | ProductDescription1 | ProductTerm | Availability | Repayment Basis | Min LTV % | MaxLTV | Minimum Loan ? | Fees Payable Per | Rate | Reversionary Rate % | APR | BookingFee | Cashback | CashbackValue | ERC - Payable | Unlimited lump sum payments Premitted (without fees) | Unlimited overpayment permitted (without fees) | Overpayments | ERC | Portable | Completionfee | Free Legals for Remortgage | FreeStandardValuation | Loading |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Continued | 4071976 | nan | Fixed | 31.03.25 | Fee Saver* | 2 Year Fixed | FTB | CR and IO | 0% | 60% | 10,000 | 5,000,000 | 5.99% | 5.04% | 5.4 | 0 | No | 0 | Yes | No | No | Yes | Please refer to HSBC policy guide | Yes | 17 | Yes | Yes | nan |
| Continued | 4071977 | nan | Fixed | 31.03.25 | Fee Saver* | 2 Year Fixed | FTB | CR and IO | 0% | 70% | 10,000 | 2,000,000 | 6.04% | 5.04% | 5.4 | 0 | No | 0 | Yes | No | No | Yes | Please refer to HSBC policy guide | Yes | 17 | Yes | Yes | nan |
| Continued | 4071978 | nan | Fixed | 31.03.25 | Fee Saver* | 2 Year Fixed | FTB | CR and IO | 0% | 75% | 10,000 | 2,000,000 | 6.04% | 5.04% | 5.4 | 0 | No | 0 | Yes | No | No | Yes | Please refer to HSBC policy guide | Yes | 17 | Yes | Yes | nan |
| Continued | 4071979 | nan | Fixed | 31.03.25 | Fee Saver* | 2 Year Fixed | FTB | CR | 0% | 80% | 10,000 | 1,000,000 | 6.14% | 5.04% | 5.4 | 0 | No | 0 | Yes | No | No | Yes | Please refer to HSBC policy guide | Yes | 17 | Yes | Yes | nan |
| Continued | 4071980 | nan | Fixed | 31.03.25 | Fee Saver* | 2 Year Fixed | FTB | CR | 0% | 85% | 10,000 | 750,000 | 6.19% | 5.04% | 5.4 | 0 | No | 0 | Yes | No | No | Yes | Please refer to HSBC policy guide | Yes | 17 | Yes | Yes | nan |
and saves data.csv (screenshot from LibreOffice omitted).
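Since the parsed dataframe already contains every customer type in its Availability column (FTB appears in the sample above), the per-type tables you would otherwise click through can be obtained by filtering; a sketch:
# each customer type is just a subset of the one dataframe
for availability, group in df.groupby("Availability"):
    print(availability, len(group))

ftb_products = df[df["Availability"] == "FTB"]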
A minimal change to your code: pd.read_html returns a list of dataframes for all tables found on the webpage.
Since there is only one table on your page, data1 is a list of 1 dataframe. This is where the error Must pass 2-d input. shape=(1, 38, 12) comes from – data1 contains 1 dataframe of shape (38, 12).
You probably want to do just:
data2 = data1[0]
data2.to_csv(...)
(Also, no need to sleep after reading from the webpage)
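Putting it together, a sketch of the corrected method:
def create_dataframe(self):
    tables = pd.read_html(self.driver.page_source)  # list with one dataframe
    data2 = tables[0]                               # shape (38, 12)
    data2.to_csv('Data1.csv', index=False)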

Best way to store pandas df data for recall in analysis

I have a large df in pandas that has a company's product information. Here is a small sample of rows with only the columns I believe are needed to get the information I desire.
df = pd.DataFrame({'Customers': [1, 2, 3, 4, 5, 6] * 3,
                   'Product': ['Beer1', 'Beer2', 'Beer1', 'Beer4', 'Beer3', 'Beer5'] * 3,
                   'Packaging': ['6pk', 'keg', 'big_keg', '12pack', '22 oz bottle', '18pack'] * 3,
                   'Sale_Price': [25, 50, 75, 34, 54, 99] * 3})
I want to be able to pull the sale price:
def get_price(Customer, Product, Packaging):
    abc = df[(df['Customers'] == Customer) &
             (df['Product'] == Product) &
             (df['Packaging'] == Packaging)]
    price = abc.iloc[0]['Sale_Price']
    return price
The function I wrote works for getting one value, but I was wondering if there is a better way to get and store pricing information for an entity's products for later use, since I usually use the prices as inputs to a multiplication formula, as in the examples below:
Beer1 Run1: 365 12 packs, 43 big_kegs, 12 kegs
Beer2 Run1: 400 18 packs, 67 kegs
So Ex1 would look something like this: Revenue = (365 * 12 pack price + 43 * big_keg price + 12 * keg price)
My Question(s): How to alter the function above to account for the examples? How best to store all prices for later use?
More direct question based on a comment:
I have three arguments (maybe more, due to additional pack type possibilities): customer name, product name, packaging type, (additional pack type).
I need the sale price, and prices for multiple pack types.
So, say I have Beer1, Customer2, 12pack, big_keg: how would my function handle this? Is a function the best way, or should I create and store a master pricing dictionary or use another storage method?
Will probably need a weighted average at some point, but one question at a time.
Thanks in advance for your help.
If what you are looking for is the total revenue each packaging type brings in, you can simply use groupby:
df.groupby('Packaging')['Sale_Price'].sum()
output:
Packaging
12pack 102
18pack 297
22 oz bottle 162
6pk 75
big_keg 225
keg 150
Name: Sale_Price, dtype: int64
You can do the same for price info with the unique function:
df.groupby('Packaging')['Sale_Price'].unique()
Packaging
12pack [34]
18pack [99]
22 oz bottle [54]
6pk [25]
big_keg [75]
keg [50]
Name: Sale_Price, dtype: object
This also helps check whether each packaging type has a single price or several different sale prices in the dataframe.
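To address the storage question directly, one possible sketch is to turn the frame into a price lookup Series keyed by customer, product, and packaging; the orders list below is hypothetical:
# one price per (customer, product, packaging) combination
prices = (df.drop_duplicates(['Customers', 'Product', 'Packaging'])
            .set_index(['Customers', 'Product', 'Packaging'])['Sale_Price'])

# hypothetical production run: (customer, product, packaging, quantity)
orders = [(2, 'Beer2', 'keg', 67), (3, 'Beer1', 'big_keg', 43)]
revenue = sum(qty * prices[(cust, prod, pack)]
              for cust, prod, pack, qty in orders)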

Lifetimes package gives inconsistent results

I am using Lifetimes to compute CLV of some customers of mine.
I have transactional data and, by means of summary_data_from_transaction_data (the implementation can be found here), I would like to compute the recency, the frequency and the time interval T of each customer.
Unfortunately, it seems that the method does not compute the frequency correctly.
Here is the code for testing my dataset:
df_test = pd.read_csv('test_clv.csv', sep=',')
RFT_from_library = summary_data_from_transaction_data(df_test,
                                                      'Customer',
                                                      'Transaction date',
                                                      observation_period_end='2020-02-12',
                                                      freq='D')
According to the code, the result is:
frequency recency T
Customer
1158624 18.0 389.0 401.0
1171970 67.0 396.0 406.0
1188564 12.0 105.0 401.0
The problem is that customer 1171970 and customer 1188564 made 69 and 14 transactions respectively, so the frequencies should have been 68 and 13.
Printing the size of each customer confirms that:
print(df_test.groupby('Customer').size())
Customer
1158624 19
1171970 69
1188564 14
I did try to run the underlying code of summary_data_from_transaction_data directly, like this:
RFT_native = df_test.groupby('Customer', sort=False)['Transaction date'].agg(["min", "max", "count"])
observation_period_end = (
    pd.to_datetime('2020-02-12', format=None).to_period('D').to_timestamp()
)
# subtract 1 from count, as we ignore their first order.
RFT_native["frequency"] = RFT_native["count"] - 1
RFT_native["T"] = (observation_period_end - RFT_native["min"]) / np.timedelta64(1, 'D') / 1
RFT_native["recency"] = (RFT_native["max"] - RFT_native["min"]) / np.timedelta64(1, 'D') / 1
As you can see, the result is indeed correct.
min max count frequency T recency
Customer
1171970 2019-01-02 15:45:39 2020-02-02 13:40:18 69 68 405.343299 395.912951
1188564 2019-01-07 18:10:55 2019-04-22 14:27:08 14 13 400.242419 104.844595
1158624 2019-01-07 10:52:33 2020-01-31 13:50:36 19 18 400.546840 389.123646
Of course my dataset is much bigger, and a slight difference in frequency and/or recency considerably alters the computation of the BGF model.
What am I missing? Is there something that I should consider when using the method?
I might be a bit late to answer your query, but here it goes.
The documentation for the Lifetimes package defines frequency as:
frequency represents the number of repeat purchases the customer has made. This means that it’s one less than the total number of purchases. This is actually slightly wrong. It’s the count of time periods the customer had a purchase in. So if using days as units, then it’s the count of days the customer had a purchase on.
So, it's basically the number of time periods in which the customer made a repeat purchase, not the number of individual repeat purchases. A quick scan of your sample dataset confirmed that 1188564 and 1171970 each made 2 purchases on a single day, 13Jan2019 and 15Jun2019 respectively. Those two transactions are counted as one when calculating frequency, which makes the frequency calculated by summary_data_from_transaction_data 2 less than your manual count.
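You can verify this on your own data by counting distinct purchase days per customer; a sketch, assuming the column names from the question:
import pandas as pd

days = pd.to_datetime(df_test['Transaction date']).dt.normalize()
# distinct purchase days minus the first one reproduces the library's frequency
freq_check = days.groupby(df_test['Customer']).nunique() - 1
print(freq_check)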
According to the documentation, you need to set:
include_first_transaction = True
include_first_transaction (bool, optional) – Default: False. By default the first transaction is not included while calculating frequency and monetary_value. Can be set to True to include it. Should be False if you are going to use this data with any fitters in the lifetimes package.
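For completeness, a sketch of the same call with the flag set, keeping the arguments from the question:
RFT_all = summary_data_from_transaction_data(df_test,
                                             'Customer',
                                             'Transaction date',
                                             observation_period_end='2020-02-12',
                                             freq='D',
                                             # per the warning above, keep this False
                                             # if the data feeds a lifetimes fitter
                                             include_first_transaction=True)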

How to calculate the presence time of a student during a class session with image processing

I am trying to calculate the total presence time of students using face recognition, so that at the end of class I can get two things: 1) the total time a student was present, and 2) from which time to which time he was present, and the same for when he was not present (e.g. 9:00-9:20 (present), 9:20-9:22 (not present), 9:22-9:42 (present)).
This is the way I am doing it.
In a 40-minute class, a Python file runs every 2 minutes for 40 seconds.
Each time the file runs, it stores the IDs of the students that are present in a list and saves it to the DB. I made totalClassTime/2 columns in the table, since the file runs every 2 minutes. At the end of the class (after 40 minutes) it reads the data from the DB, calculates each student's total presence time, and saves that to the DB as well.
Is there a better way to do all this, so that I don't have to create classTime/2 columns in the table? Another ambiguity arises:
if for a student we get this data from the DB:
9:00  9:02  9:04  9:06  9:08  9:10  9:12  9:14  9:16  ...
 p     p     -     p     -     p     p     p     p    ...
when calculating the total presence time, it will add the time from 9:00 to 9:02, then consider 9:02-9:04 as absence time, and the same for 9:04-9:06. However, the student might have been present between 9:04 and 9:06. I have searched a lot but couldn't find a way to calculate the presence time accurately.
You could store each observation in a row instead of a column. Such a table looks like this:
classId | studentId | observationTime | present
----------------------------------------------------
1 1 9:00 p
1 1 9:02 p
1 1 9:04 -
1 1 9:06 p
1 1 9:08 -
1 1 9:10 p
...
Then to evaluate a student's presence time all rows containing observations of this student in the particular class can be selected and ordered by time. This can be achieved with a select statement similar to this one:
SELECT observationTime, present FROM observations WHERE classID='1' AND studentID='1' ORDER BY observationTime
Now, you can simply iterate over the result set of this query and calculate the presence times as you did before.
Your problem with the student having an unclear presence state between 9:04 and 9:06 can be solved by defining the time frame for which an observation is considered valid.
You have already split your class into two-minute frames (from 9:00 to 9:02, from 9:02 to 9:04, and so on). Now you can say that the 9:00 observation is valid for the time frame from 9:00 to 9:02, the 9:02 observation is valid for the time frame from 9:02 to 9:04, and so on. This lets you interpret the data from your example unambiguously: the 9:04 observation is valid for the time from 9:04 to 9:06. As the student was not observed at 9:04, he is considered absent in this slot. At the next observation at 9:06 he is present, so we consider him to be in the class from 9:06 to 9:08.
Obviously the student was not really away the whole time between 9:04 and 9:06 unless he magically materialized in his seat exactly at 9:06. But as we only look at the class every two minutes, we can only account for the student's presence at a two-minute resolution.
You are basically taking a sample of the state of the class at a point in time every two minutes and assuming it represents the state of the class for the whole two minutes.
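As an illustration, a sketch that turns an ordered result set into merged intervals and a total; the observation list is hypothetical, mirroring the example above:
from datetime import datetime, timedelta

observations = [("9:00", "p"), ("9:02", "p"), ("9:04", "-"),
                ("9:06", "p"), ("9:08", "-"), ("9:10", "p")]

slot = timedelta(minutes=2)   # each observation is valid for the next two minutes
intervals = []                # merged (start, end, present) runs
total_present = timedelta()

for time_str, mark in observations:
    start = datetime.strptime(time_str, "%H:%M")
    present = (mark == "p")
    if present:
        total_present += slot
    if intervals and intervals[-1][2] == present and intervals[-1][1] == start:
        intervals[-1] = (intervals[-1][0], start + slot, present)  # extend the run
    else:
        intervals.append((start, start + slot, present))

for start, end, present in intervals:
    print(f"{start:%H:%M}-{end:%H:%M} {'present' if present else 'not present'}")
print("total presence:", total_present)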
