I'm pulling data from an API and placing it into a Pandas dataframe. I then want to create a new df that includes only the rows with today's date. I know how to select between two static dates, but I can't seem to filter by a 'today' timestamp.
import requests
import pandas as pd
from matplotlib import pyplot as plt
#Access API
r = requests.get('REMOVED')
x = r.json()
keys = x.keys()
old_df = pd.DataFrame(x['results'])
#set dataframe
df = old_df[['valid_from','valid_to','value_inc_vat']].copy()
df['valid_from'] = pd.to_datetime(df['valid_from'])
df['valid_to'] = pd.to_datetime(df['valid_to'])
#only today's rows
today = pd.Timestamp.today().date()
mask = (df['valid_from'] == today)
df_today = df.loc[mask]
Use Series.dt.date to compare by date:
mask = (df['valid_from'].dt.date == today)
df_today = df[mask]
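A minimal alternative sketch, assuming the timestamps are timezone-naive: normalize both sides to midnight, which keeps the fast datetime64 dtype instead of the object dtype that .dt.date produces.
today = pd.Timestamp.today().normalize()  # today's date at 00:00, still a Timestamp
mask = df['valid_from'].dt.normalize() == today
df_today = df[mask]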
I am trying to download data and add statistics and economic indicators; however, my data is on a daily basis and the indicators are on a yearly basis.
I tried to store year/indicator pairs as a dictionary, go through each day in the dates column returned from yfinance, and populate a list with the GDP Deflator for each day using the dictionary. Then I convert that list to a DataFrame, add it as a column to the dataframe returned from yfinance, and save it as a CSV.
However, when I look at the csv file, the GDP deflator for 2004 shows up for the last day in 2003, and for the last two days in 2004 the GDP Deflator is that of 2005.
What am I doing wrong?
code below:
import pandas as pd
import yfinance as yf
import world_bank_data as wb
df = pd.DataFrame() # Empty DataFrame
GDPD = []
df = yf.download(tickers = 'USDSGD=X' , period='max', interval='1d')
df.reset_index(inplace=True)
date = df['Date']
SGD_def_dict = {"Year":[],"GDP_Deflator":[]}
for i in range(len(date)):
    if date[i].year in SGD_def_dict['Year']:
        GDPD.append(list(SGD_def_dict.values())[-1][-1])
    else:
        SGD_def_dict["Year"].append(date[i].year)
        try:
            SGD_def_dict["GDP_Deflator"].append(wb.get_series('NY.GDP.DEFL.ZS', country='SGP', date=date[i].year, id_or_value='id', simplify_index=True))
        except:
            SGD_def_dict["GDP_Deflator"].append(float("nan"))
        #GDPD.append(list(SGD_def_dict.values())[-1][-1])
df2 = pd.DataFrame({"GDP_Deflator":GDPD})
df["GDP_Deflator"] = df2
df.to_csv(r'C:..WBTEST.csv')
The values drift because on the first day of each new year you append the deflator to the dictionary but never to GDPD (that append is commented out), so GDPD ends up one entry short per year and everything shifts when it is assigned positionally. Instead, build the year-to-deflator mapping once, then attach the same value to every day of that year, for example by merging on the year:
import pandas as pd
import yfinance as yf
import world_bank_data as wb
df = yf.download(tickers='USDSGD=X', period='max', interval='1d')
df.reset_index(inplace=True)
date = df['Date']
SGD_def_dict = {"Year": [], "GDP_Deflator": []}
for i in range(len(date)):
    year = date[i].year
    if year not in SGD_def_dict['Year']:
        SGD_def_dict["Year"].append(year)
        try:
            SGD_def_dict["GDP_Deflator"].append(wb.get_series('NY.GDP.DEFL.ZS', country='SGP', date=year, id_or_value='id', simplify_index=True))
        except:
            SGD_def_dict["GDP_Deflator"].append(float("nan"))
df['Year'] = df['Date'].dt.year
df = df.merge(pd.DataFrame(SGD_def_dict), on='Year')
df.drop(['Year'], axis=1, inplace=True)
df.to_csv(r'C:..WBTEST.csv')
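As an alternative sketch that skips the merge (reusing the same wb.get_series call as above), you can map each row's year straight to its deflator. Unlike merge, map preserves the original row order and leaves NaN for unmatched years.
deflators = {}
for year in df['Date'].dt.year.unique():
    try:
        deflators[year] = wb.get_series('NY.GDP.DEFL.ZS', country='SGP', date=year, id_or_value='id', simplify_index=True)
    except Exception:
        deflators[year] = float('nan')
# look up each day's year in the dict; rows keep their original order
df['GDP_Deflator'] = df['Date'].dt.year.map(deflators)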
How do I specify a start date and end date? For example, I want to extract the daily close price for AAPL from 01-01-2020 to 30-06-2021.
from alpha_vantage.timeseries import TimeSeries
import pandas as pd
api_key = ''
ts = TimeSeries(key = api_key, output_format = 'pandas')
data = ts.get_daily('AAPL', outputsize = 'full')
print(data)
Based on their documentation, it doesn't seem to have an option to filter by date directly in the API call.
Once you retrieve the data you can filter it within the dataframe.
from alpha_vantage.timeseries import TimeSeries
import pandas as pd
api_key = ''
ts = TimeSeries(key = api_key, output_format = 'pandas')
data = ts.get_daily('AAPL', outputsize = 'full')
data[0][(data[0].index >= '2020-01-01') & (data[0].index <= '2021-06-30')]
You can also use .loc to slice by date, since the date is the index. (Note that it is the bare data[0]['2020-01'] style of partial-string row indexing that is deprecated and will throw an error in a future version of pandas; .loc slicing remains supported.)
data[0].loc['2020-01-01':'2021-06-30']
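One caveat worth checking (an assumption about the returned frame, since Alpha Vantage may return the most recent rows first): label slicing wants a monotonic index, so sorting first is safer.
daily = data[0].sort_index()  # ensure the DatetimeIndex is ascending
daily.loc['2020-01-01':'2021-06-30']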
I have the dataframe below, called "df", and am calculating the sum by the unique id column "Id".
Can anyone help me optimize the code I have tried?
import pandas as pd
from datetime import datetime, timedelta
df= {'Date':['2019-01-11 10:23:45','2019-01-09 10:23:45', '2019-01-11 10:27:45',
'2019-01-11 10:25:45', '2019-01-11 10:30:45', '2019-01-11 10:35:45',
'2019-02-09 10:25:45'],
'Id':['100','200','300','100','100', '100','200'],
'Amount':[200,400,330,100,300,200,500],
}
df= pd.DataFrame(df)
df["Date"] = pd.to_datetime(df['Date'])
You can try groupby, then apply the adjustment within each sub-group rather than to the whole df:
s = {}
for x, y in df.groupby(['Id', 'NCC']):
    for i in y.index:
        start_date = y['Date'][i] - timedelta(seconds=300)
        end_date = y['Date'][i]
        mask = (y['Date'] >= start_date) & (y['Date'] < end_date)
        count = y.loc[mask]
        count = count.loc[(y['Sys'] == 1)]
        if len(count) == 0:
            s.update({i: 0})
        else:
            s.update({i: count['Amount'].sum()})
df['New'] = pd.Series(s)
If the original data frame has 2 million rows, it would probably be faster to convert the 'Date' column to an index and sort it. Then you can sub select each 5-minute interval:
df = df.set_index('Date').sort_index()
df['Sum_Amt'] = 0
for end in df.index:
    start = end - pd.Timedelta('5min')
    current_window = df[start:end]  # data frame with 5-minute look-back
    sum_amt = ...  # <calc logic applied to `current_window` goes here>
    df.at[end, 'Sum_Amt'] = sum_amt
    print(current_window)
    print()
I'm not following the logic for calculating Sum_Amt, so I left that out.
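That said, if the look-back is always a fixed five minutes, a vectorized sketch using a time-based rolling window may scale better to 2 million rows. This is an assumption-laden rewrite: it sums Amount per Id over the trailing window, uses closed='left' to match the original (Date >= start) & (Date < end) bounds, and assumes globally unique timestamps so the result aligns back cleanly.
df = df.sort_values('Date').set_index('Date')  # starting again from the raw frame
# trailing 5-minute sum of Amount within each Id; closed='left' excludes
# the current row, matching the [start, end) window above
rolled = (df.groupby('Id')['Amount']
            .rolling('5min', closed='left').sum()
            .reset_index(level='Id', drop=True))
df['Sum_Amt'] = rolled.fillna(0)  # empty windows become 0, as in the loop version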
I have a dataset spanning 3 years, which I have split into days. Now I want to store each month's data in a separate list/variable.
SDD2=Restaurant[Restaurant.Item == ' Soft Drink '].groupby(pd.Grouper(key='Date',freq='D')).sum()
print(SDD2)
This is the data I get from the code above; now I want to store each month's data in a separate variable/list.
You could store each month's data in a JSON or CSV file so it is easily accessible from your Python script.
For more information, check Python's json and csv modules.
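For example, a minimal sketch of that idea, assuming a DataFrame df with a datetime 'Date' column (the file names are hypothetical):
# write each month's rows to its own CSV, e.g. data_2018-02.csv
for month, group in df.groupby(pd.Grouper(key='Date', freq='M')):
    if not group.empty:  # Grouper can emit empty groups for gap months
        group.to_csv(f"data_{month:%Y-%m}.csv", index=False)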
You can just do df.groupby(pd.Grouper(key="Date", freq="M")) and then query the groups with get_group('date'), or optionally convert the grouped data to a dict keyed by month with either .apply(list).to_dict() or dict(list(groups)).
Example:
import pandas as pd
import numpy as np
# create some random dates
start = pd.to_datetime('2018-01-01')
end = pd.to_datetime('2019-12-31')
start_u = start.value//10**9
end_u = end.value//10**9
date_range = pd.to_datetime(np.random.randint(start_u, end_u, 30), unit='s')
# convert to DF
df = pd.DataFrame(date_range, columns=["Date"])
# Add random data
df['Data'] = np.random.randint(0, 100, size=(len(date_range)))
# Format to y-m-d
df['Date'] = pd.to_datetime(df['Date'].dt.strftime('%Y-%m-%d'))
print(df)
# group by month
grouped_df = df.groupby(pd.Grouper(key="Date", freq="M"))
# query the groups
print("\n\ngrouped data for feb 2018\n")
#print(grouped_df.get_group('2018-02-28'))
dict_of_list = dict(list(grouped_df))
feb_2018 = pd.Timestamp('2018-02-28')
if feb_2018 in dict_of_list:
    print(dict_of_list[feb_2018])
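If you'd rather call get_group than build the dict, the key must be the exact month-end timestamp the Grouper produced, and it raises KeyError for months with no rows (which is why the call above is commented out). A hedged sketch:
try:
    print(grouped_df.get_group(pd.Timestamp('2018-02-28')))
except KeyError:
    print("no rows for Feb 2018")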
My process is this:
1. Import a csv of data containing dates, activations, and cancellations
2. Subset the data by activated or cancelled
3. Pivot the data with aggfunc 'sum'
4. Convert back to data frames
Now I need to merge the two data frames together, but there are dates that exist in one data frame and not the other. Both data frames start Jan 1, 2017 and end Dec 31, 2017. Preferably, any index_month observation missing from one data frame should be filled in with a corresponding value of 0.
For reference, here's the code up to this point:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import datetime
%matplotlib inline
#import data
directory1 = r"C:\python\Contracts"
directory_source = os.path.join(directory1, "Contract_Data.csv")
df_source = pd.read_csv(directory_source)
#format date ranges as times
#df_source["Activation_Month"] = pd.to_datetime(df_source["Activation_Month"])
#df_source["Cancellation_Month"] = pd.to_datetime(df_source["Cancellation_Month"])
df_source["Activation_Day"] = pd.to_datetime(df_source["Activation_Day"])
df_source["Cancellation_Day"] = pd.to_datetime(df_source["Cancellation_Day"])
#subset the data based on status
df_active = df_source[df_source["Order Status"]=="Active"]
df_active = pd.DataFrame(df_active[["Activation_Day", "Event_Value"]].copy())
df_cancelled = df_source[df_source["Order Status"]=="Cancelled"]
df_cancelled = pd.DataFrame(df_cancelled[["Cancellation_Day", "Event_Value"]].copy())
#remove activations outside 2017 and cancellations outside 2017
df_cancelled = df_cancelled[(df_cancelled['Cancellation_Day'] > '2016-12-31') &
(df_cancelled['Cancellation_Day'] <= '2017-12-31')]
df_active = df_active[(df_active['Activation_Day'] > '2016-12-31') &
(df_active['Activation_Day'] <= '2017-12-31')]
#pivot the data to aggregate by day
df_active_aggregated = df_active.pivot_table(index='Activation_Day',
values='Event_Value',
aggfunc='sum')
df_cancelled_aggregated = df_cancelled.pivot_table(index='Cancellation_Day',
values='Event_Value',
aggfunc='sum')
#convert pivot tables back to useable dataframes
activations_aggregated = pd.DataFrame(df_active_aggregated.to_records())
cancellations_aggregated = pd.DataFrame(df_cancelled_aggregated.to_records())
#rename the time columns so they can be referenced when merging into one DF
activations_aggregated.columns = ["index_month", "Activations"]
#activations_aggregated = activations_aggregated.set_index(pd.DatetimeIndex(activations_aggregated["index_month"]))
cancellations_aggregated.columns = ["index_month", "Cancellations"]
#cancellations_aggregated = cancellations_aggregated.set_index(pd.DatetimeIndex(cancellations_aggregated["index_month"]))
I'm aware there are many posts that address issues similar to this but I haven't been able to find anything that has helped. Thanks to anyone that can give me a hand with this!
You can try:
activations_aggregated.merge(cancellations_aggregated, how='outer', on='index_month').fillna(0)
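One follow-up worth noting: an outer merge doesn't guarantee date order, and fillna(0) leaves float columns. A sketch that sorts the result and (assuming the Event_Value sums are whole numbers) casts the counts back to int:
merged = (activations_aggregated
          .merge(cancellations_aggregated, how='outer', on='index_month')
          .fillna(0)
          .sort_values('index_month')
          .reset_index(drop=True))
# cast back to int only if the summed values are whole numbers
merged[['Activations', 'Cancellations']] = merged[['Activations', 'Cancellations']].astype(int)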