Element-wise maximum with date values - python

I have a dataframe with date values and would like to clip them so that every date is 1 Jan 2000 or later. Since I need to do this element-wise, I use np.maximum(). However, the code below raises
TypeError: Cannot compare type 'Timestamp' with type 'int'.
What's the appropriate method to deal with this kind of data type?
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': np.arange('1999-12', '2000-02', dtype='datetime64[D]')})
df['corrected_date'] = np.maximum(pd.to_datetime('20000101', format='%Y%m%d'), df['date'])

For me, comparing against a Series works:
s = pd.Series(pd.to_datetime('20000101', format='%Y%m%d'), index=df.index)
df['corrected_date'] = np.maximum(s, df['date'])
Or with DatetimeIndex:
i = np.repeat(pd.to_datetime(['20000101'], format='%Y%m%d'), len(df))
df['corrected_date'] = np.maximum(i, df['date'])
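As an alternative to building a Series or DatetimeIndex of repeated bounds, pandas' Series.clip can apply a lower bound directly. This is a minimal sketch, not the method from the answer above. The name floor is introduced here for illustration, and the sample frame is built with pd.date_range instead of np.arange.

```python
import pandas as pd

# Illustrative frame covering 1 Dec 1999 through 31 Jan 2000.
df = pd.DataFrame({'date': pd.date_range('1999-12-01', '2000-01-31', freq='D')})

# Hypothetical name for the lower bound.
floor = pd.Timestamp('2000-01-01')

# clip(lower=...) replaces every value below the bound with the bound itself.
df['corrected_date'] = df['date'].clip(lower=floor)

# Equivalent formulation: keep values that meet the condition, else use the bound.
same = df['date'].where(df['date'] >= floor, floor)
```

Both forms leave the dtype as datetime, so the result can be used in further date arithmetic.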

Related

How can I sort timestamp from following data dictionary?

Code:
import pandas as pd
from pycoingecko import CoinGeckoAPI
c=CoinGeckoAPI()
bdata=c.get_coin_market_chart_by_id(id='bitcoin',vs_currency='usd',days=30)
data_=pd.DataFrame(bdata)
print(data_)
data=pd.to_datetime(data_[prices],unit='ms')
print(data)
Requirement:
I need output with 4 columns:
Timestamp, Prices, Market_caps, Total_volume
I also want to convert the timestamps with to_datetime.
The code above just fetches the bitcoin data from pycoingecko.
You can convert this into a dataframe format like this:
import pandas as pd
from pycoingecko import CoinGeckoAPI
c=CoinGeckoAPI()
bdata=c.get_coin_market_chart_by_id(id='bitcoin',vs_currency='usd',days=30)
prices = pd.DataFrame(bdata['prices'], columns=['TimeStamp', 'Price']).set_index('TimeStamp')
market_caps = pd.DataFrame(bdata['market_caps'], columns=['TimeStamp', 'Market Cap']).set_index('TimeStamp')
total_volumes = pd.DataFrame(bdata['total_volumes'], columns=['TimeStamp', 'Total Volumes']).set_index('TimeStamp')
# combine the separate dataframes
df_market = pd.concat([prices, market_caps, total_volumes], axis=1)
# convert the index to a datetime dtype
df_market.index = pd.to_datetime(df_market.index, unit='ms')
Code adapted from this answer.
You can extract the timestamp column and convert it to dates as follows, with minimal changes to your code. You can then merge the new column back into your data:
import pandas as pd
from pycoingecko import CoinGeckoAPI
c=CoinGeckoAPI()
bdata=c.get_coin_market_chart_by_id(id='bitcoin',vs_currency='usd',days=30)
data_=pd.DataFrame(bdata)
print(data_)
#data=pd.to_datetime(data_["prices"],unit='ms')
df = pd.DataFrame([pd.Series(x) for x in data_["prices"]])
df.columns = ["timestamp","data"]
df=pd.to_datetime(df["timestamp"],unit='ms')
print(df)
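To get the four requested columns in one frame without calling the API, the approach can be sketched on a small hand-written payload. The sample values and the helper name to_frame are hypothetical. The payload shape (three keys, each a list of [millisecond timestamp, value] pairs) is assumed from the code in the answers above.

```python
import pandas as pd

# Hypothetical sample shaped like the response used above:
# each key maps to a list of [millisecond_timestamp, value] pairs.
bdata = {
    'prices': [[1621256400000, 45060.17], [1621252800000, 45080.23]],
    'market_caps': [[1621256400000, 8.3e11], [1621252800000, 8.4e11]],
    'total_volumes': [[1621256400000, 6.0e10], [1621252800000, 6.1e10]],
}

def to_frame(payload):
    """Build one frame with Timestamp, Prices, Market_caps, Total_volume."""
    names = {
        'prices': 'Prices',
        'market_caps': 'Market_caps',
        'total_volumes': 'Total_volume',
    }
    # One single-column frame per key, all indexed by the raw timestamp.
    parts = [
        pd.DataFrame(payload[key], columns=['Timestamp', label]).set_index('Timestamp')
        for key, label in names.items()
    ]
    out = pd.concat(parts, axis=1).reset_index()
    # Convert epoch milliseconds to datetimes, then sort chronologically.
    out['Timestamp'] = pd.to_datetime(out['Timestamp'], unit='ms')
    return out.sort_values('Timestamp', ignore_index=True)

df_market = to_frame(bdata)
print(df_market)
```

With the real response, passing bdata from get_coin_market_chart_by_id to the same helper should work as long as the payload has the assumed shape.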

Calculate Pandas df timedelta of index

Would anyone know how to calculate the time delta of the time stamp of the index?
import pandas as pd
import numpy as np
# simulate some data
# ===================================
np.random.seed(0)
dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='T')
dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False))
df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index()
df
Am I on the right track with df['elapsed_time'] = pd.TimedeltaIndex(df)?
This throws an error:
ValueError: Wrong number of items passed 2000, placement implies 1
This answer is beautiful!
It creates another column, which I called time_td, holding the difference between consecutive index timestamps. I then cast it to timedelta64[m], where m stands for minutes, which is the unit I am looking for.
df['time_td'] = df.index.to_series().diff().astype('timedelta64[m]')
I can then sum this time_td column with:
df.time_td.sum()
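Some newer pandas releases may reject the astype('timedelta64[m]') cast. A version-independent way to get the gaps in minutes is to keep the timedelta column as it is and derive a float column from total_seconds(). The tiny three-row index below is made up so the result is easy to check by hand, and the column name minutes is an assumption.

```python
import pandas as pd

# Small illustrative index: gaps of 5 and 30 minutes.
idx = pd.to_datetime(['2015-03-02 00:00', '2015-03-02 00:05', '2015-03-02 00:35'])
df = pd.DataFrame({'col': [1.0, 2.0, 3.0]}, index=idx)

# Difference between each index timestamp and the previous one.
gaps = df.index.to_series().diff()
df['time_td'] = gaps

# Same gaps expressed as float minutes; the first row has no predecessor (NaN).
df['minutes'] = gaps.dt.total_seconds() / 60

# sum() skips the NaN in the first row.
total_minutes = df['minutes'].sum()
print(total_minutes)
```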

Python - calculating difference between price extracting time

I need to create a new column whose value is:
the current fair_price minus the fair_price 15 minutes ago (or at the closest row)
I need to find the row from 15 minutes before and then calculate the difference.
import numpy as np
import pandas as pd
from datetime import timedelta
df = pd.DataFrame(pd.read_csv('./data.csv'))
def calculate_15min(row):
    end_date = pd.to_datetime(row['date']) - timedelta(minutes=15)
    mask = (pd.to_datetime(df['date']) <= end_date).head(1)
    price_before = df.loc[mask]
    return price_before['fair_price']
def calc_new_val(row):
    return 'show date 15 minutes before, maybe it will be null, nope'
df['15_min_ago'] = df.apply(lambda row: calculate_15min(row), axis=1)
myFields = ['pkey_id', 'date', '15_min_ago', 'fair_price']
print(df[myFields].head(5))
df[myFields].head(5).to_csv('output.csv', index=False)
I did this in Node.js, but Python is not my strong suit; maybe you have a quick solution...
pkey_id,date,fair_price,15_min_ago
465620,2021-05-17 12:28:30,45080.23,fair_price_15_min_before
465625,2021-05-17 12:28:35,45060.17,fair_price_15_min_before
465629,2021-05-17 12:28:40,45052.74,fair_price_15_min_before
465633,2021-05-17 12:28:45,45043.89,fair_price_15_min_before
465636,2021-05-17 12:28:50,45040.93,fair_price_15_min_before
465640,2021-05-17 12:28:56,45049.95,fair_price_15_min_before
465643,2021-05-17 12:29:00,45045.38,fair_price_15_min_before
465646,2021-05-17 12:29:05,45039.87,fair_price_15_min_before
465650,2021-05-17 12:29:10,45045.55,fair_price_15_min_before
465652,2021-05-17 12:29:15,45042.53,fair_price_15_min_before
465653,2021-05-17 12:29:20,45039.34,fair_price_15_min_before
466377,2021-05-17 12:42:50,45142.74,fair_price_15_min_before
466380,2021-05-17 12:42:55,45143.24,fair_price_15_min_before
466393,2021-05-17 12:43:00,45130.98,fair_price_15_min_before
466398,2021-05-17 12:43:05,45128.13,fair_price_15_min_before
466400,2021-05-17 12:43:10,45140.9,fair_price_15_min_before
466401,2021-05-17 12:43:15,45136.38,fair_price_15_min_before
466404,2021-05-17 12:43:20,45118.54,fair_price_15_min_before
466405,2021-05-17 12:43:25,45120.69,fair_price_15_min_before
466407,2021-05-17 12:43:30,45121.37,fair_price_15_min_before
466413,2021-05-17 12:43:36,45133.71,fair_price_15_min_before
466415,2021-05-17 12:43:40,45137.74,fair_price_15_min_before
466419,2021-05-17 12:43:45,45127.96,fair_price_15_min_before
466431,2021-05-17 12:43:50,45100.83,fair_price_15_min_before
466437,2021-05-17 12:43:55,45091.78,fair_price_15_min_before
466438,2021-05-17 12:44:00,45084.75,fair_price_15_min_before
466445,2021-05-17 12:44:06,45094.08,fair_price_15_min_before
466448,2021-05-17 12:44:10,45106.51,fair_price_15_min_before
466456,2021-05-17 12:44:15,45122.97,fair_price_15_min_before
466461,2021-05-17 12:44:20,45106.78,fair_price_15_min_before
466466,2021-05-17 12:44:25,45096.55,fair_price_15_min_before
466469,2021-05-17 12:44:30,45088.06,fair_price_15_min_before
466474,2021-05-17 12:44:35,45086.12,fair_price_15_min_before
466491,2021-05-17 12:44:40,45065.95,fair_price_15_min_before
466495,2021-05-17 12:44:45,45068.21,fair_price_15_min_before
466502,2021-05-17 12:44:55,45066.47,fair_price_15_min_before
466506,2021-05-17 12:45:00,45063.82,fair_price_15_min_before
466512,2021-05-17 12:45:05,45070.48,fair_price_15_min_before
466519,2021-05-17 12:45:10,45050.59,fair_price_15_min_before
466523,2021-05-17 12:45:16,45041.13,fair_price_15_min_before
466526,2021-05-17 12:45:20,45038.36,fair_price_15_min_before
466535,2021-05-17 12:45:25,45029.72,fair_price_15_min_before
466553,2021-05-17 12:45:31,45016.2,fair_price_15_min_before
466557,2021-05-17 12:45:35,45011.2,fair_price_15_min_before
466559,2021-05-17 12:45:40,45007.04,fair_price_15_min_before
This is the CSV
First, convert your date column to datetime dtype:
df['date']=pd.to_datetime(df['date'])
Then filter values:
date15min=df['date']-pd.offsets.DateOffset(minutes=15)
out=df.loc[df['date'].isin(date15min.tolist())]
Finally, do your calculations:
df['price_before_15min']=df['fair_price'].where(df['date'].isin((out['date']+pd.offsets.DateOffset(minutes=15)).tolist()))
df['price_before_15min']=df['price_before_15min'].diff()
df['date_before_15min']=date15min
Now if you print df, you will get your desired output.
Update:
To match on the minute component instead of the exact timestamp, make a slight change to the method above:
out=df.loc[df['date'].dt.minute.isin(date15min.dt.minute.tolist())]
df['price_before_15min']=df['fair_price'].where(df['date'].dt.minute.isin((out['date']+pd.offsets.DateOffset(minutes=15)).dt.minute.tolist()))
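Because the rows are not spaced exactly 15 minutes apart, an exact match on the timestamp will often find nothing. A different technique from the one above is pd.merge_asof, which looks up the closest earlier row. The sketch below uses a handful of rows from the CSV in the question. The names target, earlier, date_15_min_ago, price_15_min_ago and diff_15_min are introduced here for illustration.

```python
import pandas as pd

# A few rows taken from the CSV in the question.
df = pd.DataFrame({
    'pkey_id': [465620, 465653, 466377, 466419, 466559],
    'date': pd.to_datetime([
        '2021-05-17 12:28:30', '2021-05-17 12:29:20', '2021-05-17 12:42:50',
        '2021-05-17 12:43:45', '2021-05-17 12:45:40',
    ]),
    'fair_price': [45080.23, 45039.34, 45142.74, 45127.96, 45007.04],
}).sort_values('date')

# The moment each row should be compared against.
df['target'] = df['date'] - pd.Timedelta(minutes=15)

# Lookup table of (time, price) pairs, renamed to avoid column clashes.
earlier = df[['date', 'fair_price']].rename(
    columns={'date': 'date_15_min_ago', 'fair_price': 'price_15_min_ago'}
)

# For each target, take the last row at or before it (NaN if there is none).
out = pd.merge_asof(
    df, earlier,
    left_on='target', right_on='date_15_min_ago',
    direction='backward',
)
out['diff_15_min'] = out['fair_price'] - out['price_15_min_ago']
print(out[['pkey_id', 'date', 'fair_price', 'price_15_min_ago', 'diff_15_min']])
```

Passing direction='nearest' instead would pick the closest row on either side of the target, which may be closer to the "or the closest row" wording in the question. Both inputs must be sorted by their key columns.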

How to convert a column in a dataframe to an index datetime object?

I have a question about how to convert the column 'Timestamp' into a datetime index, and then drop the column once it has been converted into the index.
df = {'Timestamp':['20/01/2021 01:00:00.12 AM','20/01/2021 01:00:00.21 AM','20/01/2021 01:00:01.34 AM'],
'Value':['14','178','158','75']}
I tried the following, but obviously it didn't work.
df.Timestamp = pd.to_datetime(df.Timestamp.str[0])
df=df.set_index(['Timestamp'], drop=True)
FYI: the df is actually the result of a lot of text processing, so unfortunately I cannot just use read_csv and parse the datetimes there. :( So yes, the df is exactly as described above.
Thank you.
Don't enclose 'Timestamp' in square brackets.
import pandas as pd
df = pd.DataFrame({'Timestamp':['20/01/2021 01:00:00.12 AM','20/01/2021 01:00:00.21 AM','20/01/2021 01:00:01.34 AM'],
'Value':['14','178','158']})
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.set_index('Timestamp')
print(df)
## Output
Value
Timestamp
20/01/2021 01:00:00.12 AM 14
20/01/2021 01:00:00.21 AM 178
20/01/2021 01:00:01.34 AM 158
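Since the strings are day-first and use a 12-hour clock with AM/PM, it may be safer to pass an explicit format instead of relying on inference. This is a sketch under the assumption that every value follows the layout day/month/year hour:minute:second.fraction AM/PM. The variable fmt is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Timestamp': ['20/01/2021 01:00:00.12 AM',
                  '20/01/2021 01:00:00.21 AM',
                  '20/01/2021 01:00:01.34 AM'],
    'Value': ['14', '178', '158'],
})

# Assumed layout: day/month/year, 12-hour clock, fractional seconds, AM/PM.
fmt = '%d/%m/%Y %I:%M:%S.%f %p'
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format=fmt)

# set_index moves the column into the index and drops it from the columns.
df = df.set_index('Timestamp')
print(df.index)
```

After this, df.index is a DatetimeIndex and 'Timestamp' no longer appears among the columns.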

Cannot convert date column in dataframe to datetime and, as a consequence, cannot filter on it

I wanted to play around a bit with the New York Times coronavirus dataset.
I want to plot it and filter by date to show only the last few weeks. However, I get this error message:
TypeError: '>' not supported between instances of 'bool' and 'datetime.datetime', caused by this line: df_toplot = df[df['state'].isin(top_states) & df['state'] > da]. Somehow I can't manage to turn the date column into a datetime format; instead its type is reported as pandas.core.series.Series. How can I change that?
import datetime
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
df['date']= pd.to_datetime(df['date'])
print(type(df['date']))
#print(df['date'].iloc[1] > date_object)
variable = "deaths" # "cases"
#print(df.head())
d = pd.pivot_table(df, index= 'state', values= variable,aggfunc=np.sum)
top_states = d.nlargest(10, variable, keep='first').index.values
s = "2018-06-19 11:21:13.311"
da = datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')
print(type(df['date']))
df_toplot = df[df['state'].isin(top_states) & df['state'] > da]
df_toplot.pivot(index='date', columns='state', values=variable).plot()
plt.yscale('log')
I think you just made a small typo. You are comparing the state column, not the date column, against the date.
# Change this line
df_toplot = df[df['state'].isin(top_states) & df['state'] > da]
# To this (don't forget to separate conditions with parentheses)
df_toplot = df[(df['state'].isin(top_states)) & (df['date'] > da)]
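Two points from this thread can be checked on a small made-up frame: type(df['date']) is always a Series, so the dtype is what shows whether the conversion worked, and & binds more tightly than >, which is why the comparison needs its own parentheses. The data and the name cutoff below are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for the NYT dataset.
df = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-01', '2020-04-01', '2020-04-01']),
    'state': ['New York', 'New York', 'Ohio'],
    'deaths': [1, 10, 2],
})
top_states = ['New York']
cutoff = pd.Timestamp('2020-03-15')  # hypothetical cutoff date

# The column object is a Series; its dtype tells you it holds datetimes.
is_datetime = pd.api.types.is_datetime64_any_dtype(df['date'])

# Without parentheses, `a & b > c` is evaluated as `(a & b) > c`.
# Wrapping the comparison makes each condition a boolean Series first.
mask = df['state'].isin(top_states) & (df['date'] > cutoff)
df_toplot = df[mask]
print(df_toplot)
```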
