How to grab last row of datetime in Pandas dataframe? - python

I currently have a very large .csv with 2 million rows. I've read in the csv and have only 2 columns, number and timestamp (in Unix time). My goal is to grab the last and largest number for each day (e.g. 1/1/2021, 1/2/2021, etc.)
I have converted the Unix time to datetime and used df.groupby('timestamp').tail(1), but am still not able to return the last row per day. Am I using the groupby wrong?
import pandas as pd

def main():
    df = pd.read_csv('blocks.csv', usecols=['number', 'timestamp'])
    print(df.head())
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
    x = df.groupby('timestamp').tail(1)
    print(x)

if __name__ == '__main__':
    main()
Desired Output:
number          timestamp
11,509,218          2021-01-01
11,629,315          2021-01-02
11,782,116          2021-01-03
12,321,123          2021-01-04
...

The "problem" lies in the grouper, use .dt.date for correct grouping (assuming your data is already sorted):
x = df.groupby(df['timestamp'].dt.date).tail(1)
print(x)
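If the rows might not be in chronological order, a small variant (continuing from the df built in the question) sorts first so that tail(1) really returns each day's last row:

df_sorted = df.sort_values('timestamp')
x = df_sorted.groupby(df_sorted['timestamp'].dt.date).tail(1)
print(x)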

Doesn't seem like you're specifying the aggregation function, nor the aggregation frequency (hour, day, minute?)
My take would be something along the lines of
df.resample("D", on="timestamp").max()
There's a couple of ways to group by time, alternatively
df.groupby(pd.Grouper(key='timestamp', axis=0,
                      freq='D', sort=True)).max()
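As a quick illustration on made-up data (the Unix timestamps below are invented and span two days):

import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime([1609459200, 1609502400, 1609545600], unit='s'),
    'number': [11509200, 11509218, 11629315],
})

# One row per calendar day, keeping the maximum of each column for that day
print(df.resample('D', on='timestamp').max())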
Regards

Related

How to correctly open txt timeseries files in pandas that have a comma and an unseparated date in the timestamp?

I have a dataset with txt files that contain timestamps with a comma. The data looks something like this:
TimeStamp, open, high, low, close, volume
20220401,00:00:00,1.31457,1.31468,1.3141,1.31428,141
20220401,00:01:00,1.31429,1.3144,1.3139,1.31405,157
20220401,00:02:00,1.31409,1.3142,1.31369,1.31405,120
What would be the most efficient way to parse dates in pandas?
I want to merge the date and time columns and convert them into a DatetimeIndex.
In your case, I would just assume that instead of a single column that contains a comma, you have two columns: one with the date, one with the time. Currently the first column is being read as the index. You can create a DatetimeIndex by using that index and the TimeStamp column (the time values):
import pandas as pd

df = pd.read_clipboard(sep=",")  # Insert pd.read_csv here ;)
idx = pd.DatetimeIndex(
    pd.to_datetime(
        df.index.astype(str) + " " + df["TimeStamp"]
    )
)
out = df.set_index(idx).drop(columns="TimeStamp")
                        open     high      low    close  volume
2022-04-01 00:00:00  1.31457  1.31468  1.31410  1.31428     141
2022-04-01 00:01:00  1.31429  1.31440  1.31390  1.31405     157
2022-04-01 00:02:00  1.31409  1.31420  1.31369  1.31405     120
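Alternatively, a sketch that names the seven fields explicitly at read time (the file name timeseries.txt is hypothetical, and this assumes every row matches the sample above):

import pandas as pd

# Skip the malformed 6-field header and assign our own names to the 7 fields
cols = ["date", "time", "open", "high", "low", "close", "volume"]
df = pd.read_csv("timeseries.txt", skiprows=1, header=None, names=cols)
df.index = pd.to_datetime(df["date"].astype(str) + " " + df["time"],
                          format="%Y%m%d %H:%M:%S")
df = df.drop(columns=["date", "time"])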

Iterating through a range of dates in Python with missing dates

Here I have a pandas data frame with the daily returns of stocks; the columns are date and return rate.
But I only want to keep the last day of each week, and the data has some missing days. What can I do?
import pandas as pd

df = pd.read_csv('Daily_return.csv')
df.Date = pd.to_datetime(df.Date)
count = 300
for last_day in ('2017-01-01' + 7n for n in range(count)):  # pseudocode
Actually my brain stops working at this point, with my limited imagination... Maybe the biggest problem is that "+7n"-style arithmetic is meaningless when some dates are missing.
I'll create a sample dataset with 40 dates and 40 sample returns, then sample 90 percent of that randomly to simulate the missing dates.
The key here is that you need to convert your date column into datetime if it isn't already, and make sure your df is sorted by the date.
Then you can groupby year/week and take the last value. If you run this repeatedly you'll see that the selected dates can change if the value dropped was the last day of the week.
Based on that
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['date'] = pd.date_range(start='04-18-2022', periods=40, freq='D')
df['return'] = np.random.uniform(size=40)

# Keep 90 percent of the records so we can see what happens when some days are missing
df = df.sample(frac=.9)

# In case your dates are actually strings
df['date'] = pd.to_datetime(df['date'])

# Make sure they are sorted from oldest to newest
df = df.sort_values(by='date')

df = df.groupby([df['date'].dt.isocalendar().year,
                 df['date'].dt.isocalendar().week], as_index=False).last()
print(df)
Output
        date    return
0 2022-04-24  0.299958
1 2022-05-01  0.248471
2 2022-05-08  0.506919
3 2022-05-15  0.541929
4 2022-05-22  0.588768
5 2022-05-27  0.504419
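Under the same assumptions (datetime dtype, sorted), resampling by week is a possible alternative; note that the index then shows the week-ending Sunday rather than the last date actually present in the data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'date': pd.date_range('04-18-2022', periods=40, freq='D'),
                   'return': np.random.uniform(size=40)}).sample(frac=.9)
df = df.sort_values(by='date')

# 'W' bins end on Sunday; .last() takes the final row that fell into each bin
print(df.resample('W', on='date').last())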

python pandas converting UTC integer to datetime

I am calling some financial data from an API which is storing the time values as (I think) UTC integers; an example value is used in the code below.
I cannot seem to convert the entire column into a usable date. I can do it for a single value using the following code, so I know this works, but I have thousands of rows with this problem and thought pandas would offer an easier way to update all the values.
from datetime import datetime
tx = int('1645804609719')/1000
print(datetime.utcfromtimestamp(tx).strftime('%Y-%m-%d %H:%M:%S'))
Any help would be greatly appreciated.
Simply use pandas.DataFrame.apply:
df['date'] = df.date.apply(
    lambda x: datetime.utcfromtimestamp(int(x) / 1000).strftime('%Y-%m-%d %H:%M:%S'))
Another way to do it is by using pd.to_datetime as recommended by Panagiotos in the comments:
df['date'] = pd.to_datetime(df['date'],unit='ms')
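For instance, on a small made-up frame (values taken from this thread), the whole column converts in one call:

import pandas as pd

df = pd.DataFrame({'date': [1584199972000, 1645804609719]})
df['date'] = pd.to_datetime(df['date'], unit='ms')
print(df)
#                  date
# 0 2020-03-14 15:32:52
# 1 2022-02-25 15:56:49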
You can use "to_numeric" to convert the column in integers, "div" to divide it by 1000 and finally a loop to iterate the dataframe column with datetime to get the format you want.
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'date': ['1584199972000', '1645804609719'], 'values': [30, 40]})
df['date'] = pd.to_numeric(df['date']).div(1000)
for i in range(len(df)):
    df.iloc[i, 0] = datetime.utcfromtimestamp(df.iloc[i, 0]).strftime('%Y-%m-%d %H:%M:%S')
print(df)
Output:
                  date  values
0  2020-03-14 15:32:52      30
1  2022-02-25 15:56:49      40

Modifying format of rows values in Pandas Data-frame

I have a dataset of 70000+ data points.
In the column 'date', half of the values are in a different (messier) format than the other (cleaner) half. How can I make the whole column match the format of the second half of my data frame?
I know how to do it manually, but it will take ages!
Thanks in advance!
EDIT
df['date'] = df['date'].apply(lambda x: dt.datetime.fromtimestamp(int(str(x)) / 1000).strftime('%Y-%m-%d %H:%M:%S') if str(x).isdigit() else x)
The date is still in a strange format.
EDIT 2
two data formats:
2012-01-01 00:00:00
2020-07-21T22:45:00+00:00
I've tried the below and it works. Note that this relies on two key assumptions:
1- Your date format follows one and ONLY ONE of the TWO formats in your example!
2- The final output is a string!
If so, this should do the trick; otherwise, it's a starting point and can be altered to look the way you want:
import pandas as pd
import datetime

# data sample
d = {'date': ['20090602123000', '20090602124500',
              '2020-07-22 18:45:00+00:00', '2020-07-22 19:00:00+00:00']}

# create dataframe
df = pd.DataFrame(data=d)
print(df)

                        date
0             20090602123000
1             20090602124500
2  2020-07-22 18:45:00+00:00
3  2020-07-22 19:00:00+00:00

# loop over records
for i, row in df.iterrows():
    # get date
    dateString = df.at[i, 'date']
    # check if it's the undesired format or the desired format
    # NOTE: I'm using the '+' substring to identify that; per my first assumption
    # above, you only have two formats, so that should work
    if '+' not in dateString:
        # reformat the datetime
        # NOTE: per my second assumption, I produce a string so I can append '+00:00'
        df.at[i, 'date'] = str(datetime.datetime.strptime(dateString, '%Y%m%d%H%M%S')) + '+00:00'
    else:
        continue
print(df)
                        date
0  2009-06-02 12:30:00+00:00
1  2009-06-02 12:45:00+00:00
2  2020-07-22 18:45:00+00:00
3  2020-07-22 19:00:00+00:00
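Under the same two assumptions, a vectorized sketch that avoids the row loop (not tested against the full 70000-row file) would reparse only the all-digit rows and rewrite them in the target string format:

import pandas as pd

df = pd.DataFrame({'date': ['20090602123000', '20090602124500',
                            '2020-07-22 18:45:00+00:00',
                            '2020-07-22 19:00:00+00:00']})

# Rows that are all digits are in the compact format; reparse just those
digits = df['date'].str.isdigit()
df.loc[digits, 'date'] = pd.to_datetime(
    df.loc[digits, 'date'], format='%Y%m%d%H%M%S'
).dt.strftime('%Y-%m-%d %H:%M:%S') + '+00:00'
print(df)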
You can format the first part of your dataframe:
import datetime as dt

df['date'] = df['date'].apply(
    lambda x: dt.datetime.fromtimestamp(int(str(x)) / 1000).strftime('%Y-%m-%d %H:%M:%S')
    if str(x).isdigit() else x)
This checks if all characters of the value are digits, and if so formats the date like the second part.
EDIT
The timestamps seem to be in milliseconds while fromtimestamp expects seconds, hence the / 1000.

GroupBy date for datetime row in python pandas

I want to generate the sum of distance and seconds traveled by day. I want to use a groupby function to calculate the sum of the orders per day.
I have the following code:
import pandas as pd
orders = pd.read_csv('complete.csv', delimiter=',', encoding='ISO-8859-1')
orders['datetime'] = pd.to_datetime(orders['datetime'])
orders.groupby(orders.datetime.dt.date).sum()
print(orders)
The complete csv file looks as follow:
datetime,restaurant,customer_address,amount,restaurant_address,meters,seconds
2018-01-01 15:41:37,Name,9711AR,50.5,9722AC,2268.3,606.0
2018-08-13 16:57:52,Name,9711AR,22.3,9722AC,2268.3,606.0
2018-09-21 17:38:53,Name,9711AR,66.89,9722AC,2268.3,606.0
2018-11-09 18:37:26,Name,9711AR,42.66,9722AC,2268.3,606.0
2018-01-01 18:28:04,Name,9711AJ,70.75,9746RD,4090.4,1039.5
I want to generate a sum of meters and seconds for each day.
I think I have some trouble with the 'datetime' object that it does not recognize it as a date or something.
Any ideas?
I think your code is good; the only issue is that orders.groupby(orders.datetime.dt.date).sum() does not update orders. You can add
orders = orders.groupby(orders.datetime.dt.date).sum()
if you want to do so.
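A minimal sketch of that fix, restricted to the numeric columns of interest so the string columns (restaurant, addresses) stay out of the sum:

import pandas as pd

orders = pd.read_csv('complete.csv', delimiter=',', encoding='ISO-8859-1')
orders['datetime'] = pd.to_datetime(orders['datetime'])

# Sum only meters and seconds per calendar day
daily = orders.groupby(orders['datetime'].dt.date)[['meters', 'seconds']].sum()
print(daily)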
