I have a csv file with data every ~minute over 2 years, and I want to calculate 24-hour averages. Ideally the code would iterate over the data, calculate the averages, standard deviations, and R^2 between dataA and dataB for every 24hr period, and then output this new data into a new csv file (with a datestamp and the calculated data for each 24hr period).
The data has an unusual timestamp which I think might be tripping me up slightly. I've been trying different for loops to iterate over the data, but I'm not sure how to specify that I want the averages, etc. for each 24hr period.
This is the code I have so far, but I'm not sure how to complete the for loop to achieve what I want. If anyone can help, that would be great!
import pandas as pd
import numpy as np
from datetime import timedelta

# read the file in csv
data = pd.read_csv("Jacaranda_data_HST.csv")
# Extract the data columns from the csv and parse the timestamps
data_date = pd.to_datetime(data.iloc[:, 1])
dataA = data.iloc[:, 2]
dataB = data.iloc[:, 3]
# set the start and end dates of the data
start_date = data_date.iloc[0]
end_date = data_date.iloc[-1]
# for loop to run over every 24 hours of data
day_count = (end_date - start_date).days + 1
for single_date in (start_date + timedelta(n) for n in range(day_count)):
    # this prints the stats of the whole series every time --
    # I don't know how to restrict them to the current 24hr period
    print(np.mean(dataA), np.mean(dataB), np.std(dataA), np.std(dataB))
# output new csv file - **unsure how to call the data**
csvfile = "Jacaranda_new.csv"
outdf = pd.DataFrame()
#outdf['dataA_mean'] = ??
#outdf['dataB_mean'] = ??
#outdf['dataA_stdev'] = ??
#outdf['dataB_stdev'] = ??
outdf.to_csv(csvfile, index=False)
A simplified approach could be to group by calendar day in a dict. I don't have much experience with pandas time handling in DataFrames, so this could serve as an alternative.
You could create a dict where the keys are the dates of the data (without the time part), so you can later calculate the mean of all the data points that fall under each key.
data_date = data.iloc[:,1]
data_a = data.iloc[:,2]
data_b = data.iloc[:,3]
import collections
dd_a = collections.defaultdict(list)
dd_b = collections.defaultdict(list)
for date_str, data_point_a, data_point_b in zip(data_date, data_a, data_b):
    # we split the string by the first space, so we get only the date part
    date_part, _ = date_str.split(' ', maxsplit=1)
    dd_a[date_part].append(data_point_a)
    dd_b[date_part].append(data_point_b)
Now you can calculate the averages:
for date, v_list in dd_a.items():
    if len(v_list) > 0:
        print(date, 'mean:', sum(v_list) / len(v_list))

for date, v_list in dd_b.items():
    if len(v_list) > 0:
        print(date, 'mean:', sum(v_list) / len(v_list))
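If you'd rather let pandas handle the dates, here is a minimal sketch of the groupby route (assuming the timestamp column parses with pd.to_datetime; the names dataA/dataB are made up here, since the question selects columns by position). It also computes the R^2 the question asks for, as the squared Pearson correlation within each day:

import pandas as pd

data = pd.read_csv("Jacaranda_data_HST.csv")
# same column positions as in the question
dates = pd.to_datetime(data.iloc[:, 1])
df = pd.DataFrame({'dataA': data.iloc[:, 2].values,
                   'dataB': data.iloc[:, 3].values}, index=dates)

grouped = df.groupby(df.index.date)
out = pd.DataFrame({
    'dataA_mean': grouped['dataA'].mean(),
    'dataA_stdev': grouped['dataA'].std(),
    'dataB_mean': grouped['dataB'].mean(),
    'dataB_stdev': grouped['dataB'].std(),
    # R^2 as the squared Pearson correlation between the two series each day
    'r_squared': grouped.apply(lambda g: g['dataA'].corr(g['dataB']) ** 2),
})
out.index.name = 'date'
out.to_csv("Jacaranda_new.csv")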
To make this simple, let's say I have 3 datasets, a 1min OHLCV dataset, a 5min, and a Daily. Given these 3 datasets:
[1min, 5min, and Daily sample tables omitted]
...how can I merge them and turn them into something like the desired output table (also omitted) using Pandas/Python 3?
Every time a new 5-minute boundary is hit going down the 1min dataframe, the row from the 5min data whose time is <= the 1min time gets added to its respective columns. I've left the blanks in there to help visualize what's happening, but for simplicity's sake I can just forward fill the values. No backfilling, so as not to introduce a lookahead bias. The daily value would be the previous day's OHLCV data, except at 4:00 PM, when it would be the current day's data. This is because of how Alpha Vantage structures their dataframes. I also have a 15min and 60min dataset to go along with this, but I think once the 1min-to-5min merge is done, the same logic can apply.
I have included reproducible code below to get the exact dataframes I'm working with; however, you have to pip install alpha_vantage and get a free API key from here.
SIMPLE DESIRED SOLUTION - What I've outlined above.
ADVANCED DESIRED SOLUTION - Rather than forward filling the 5min data, ideally those empty spaces would consist of the respective running prices. For example, the next empty open price in the 5min_open column would be the same 1min_open at the beginning of that 5 minutes. Think of something like:
conversion = {'Open' : 'first', 'High' : 'max', 'Low' : 'min', 'Close' : 'last', 'Volume' : 'sum'}
That way the empty spaces in the 5min columns are updating as they should, but the simple solution would suffice.
I don't want to just upsample the 1min dataframe, because I want to include all of the previous data from the past daily dataframe; upsampling the 1min data to daily would lose a bunch of that past daily data. So I want the solution to NOT involve upsampling, but rather merging the different df's by datetime. Thanks!
Windows 10, Python 3:
from alpha_vantage.timeseries import TimeSeries
import pandas as pd
import sys, os
import os.path
from time import sleep

# Make a historical folder if it's not created already
if not os.path.exists("data/historical/"):
    os.makedirs("data/historical/")

api_call_limit_daily = 500
total_api_count = 0
timeframes = ['1min', '5min', '15min', '60min', 'daily']

ts = TimeSeries(key='YOUR_API_KEY_HERE', output_format='csv')
tickers = ['SPY']

for ticker in tickers:
    # Ensure uppercase
    ticker = str(ticker.upper())
    for timeframe in timeframes:
        print("Downloading dataset for", ticker, "-", timeframe, "...")
        if total_api_count > api_call_limit_daily:
            print("Daily API call limit reached. Closing...")
            sys.exit(0)
        if timeframe != 'daily':
            data, meta_data = ts.get_intraday_extended(symbol=ticker, interval=timeframe)
        elif timeframe == 'daily':
            data, meta_data = ts.get_daily(symbol=ticker)
        # Convert the data object to a dataframe, and reverse the order so oldest date's on top
        csv_list = list(data)
        data = pd.DataFrame.from_records(csv_list[1:], columns=csv_list[0])
        data = data.iloc[::-1]
        if timeframe == 'daily':
            data = data.rename(columns={"timestamp": "time"})
        print(data)
        df = data.set_index('time')
        total_api_count += 1
        df.to_csv("data/historical/" + ticker + "_" + timeframe + ".csv", index=True)
        print("Success...")
        # Sleep if we're not through tickers/timeframes yet re: api limits
        sleep(15)

print("Done!")
UPDATE
I have frankenstein'd an iterative process for the simple solution that works for each dataframe, except for the daily one. Still trying to work that out. Basically, for any time on the 1min rows, I want to display YESTERDAY's daily data, unless the time is >= 16:00:00, in which case it can be the current day's. I want that so as not to introduce forward peeking. Anyway, here's the code that accomplishes what I'm looking for iteratively, but I'm hoping there's a faster/cleaner way to do it:
import numpy as np

# Merge all timeframes together into one dataframe
for ticker in tickers:
    # Ensure uppercase
    ticker = str(ticker.upper())
    for timeframe in timeframes:
        # Define the 1min timeframe as the main df we'll add other columns to
        if timeframe == "1min":
            main_df = pd.read_csv("./data/historical/" + ticker + "_" + timeframe + ".csv")
            main_df['time'] = pd.to_datetime(main_df['time'])
            continue
        # Now add in some nan's for the next timeframe's columns
        main_df[timeframe + "_open"] = np.nan
        main_df[timeframe + "_high"] = np.nan
        main_df[timeframe + "_low"] = np.nan
        main_df[timeframe + "_close"] = np.nan
        main_df[timeframe + "_volume"] = np.nan
        # read in the next timeframe's dataset
        df = pd.read_csv("./data/historical/" + ticker + "_" + timeframe + ".csv")
        df['time'] = pd.to_datetime(df['time'])
        # Rather than doing a double for loop to iterate through both datasets, just keep a counter
        # of what row we're at in the second dataframe. Used as a row locater
        curr_df_row = 0
        # Do this for all datasets except the daily one
        if timeframe != 'daily':
            # Iterate through the main_df
            for i in range(len(main_df)):
                # If the time in the main df is >= the current timeframe's row's df, add the values to their columns
                if main_df['time'].iloc[i] >= df['time'].iloc[curr_df_row]:
                    main_df[timeframe + "_open"].iloc[i] = df['open'].iloc[curr_df_row]
                    main_df[timeframe + "_high"].iloc[i] = df['high'].iloc[curr_df_row]
                    main_df[timeframe + "_low"].iloc[i] = df['low'].iloc[curr_df_row]
                    main_df[timeframe + "_close"].iloc[i] = df['close'].iloc[curr_df_row]
                    main_df[timeframe + "_volume"].iloc[i] = df['volume'].iloc[curr_df_row]
                    curr_df_row += 1
        # Daily dataset logic would go here

print(main_df)
main_df.to_csv("./TEST.csv", index=False)
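For what it's worth, a vectorized way to get the simple (forward-filled) merge is pd.merge_asof, which matches each 1min row to the most recent earlier-or-equal row of the coarser frame. This is only a sketch under the same file layout as above (SPY files in data/historical/); the daily rule from the update (yesterday's bar until 16:00, then the current day's) is handled by shifting the daily timestamps to the close before merging:

import pandas as pd

main_df = pd.read_csv("./data/historical/SPY_1min.csv", parse_dates=['time'])

for timeframe in ['5min', '15min', '60min']:
    df = pd.read_csv("./data/historical/SPY_" + timeframe + ".csv", parse_dates=['time'])
    df = df.add_prefix(timeframe + "_").rename(columns={timeframe + "_time": "time"})
    # backward = most recent row with time <= the 1min time, so no lookahead
    main_df = pd.merge_asof(main_df.sort_values('time'), df.sort_values('time'),
                            on='time', direction='backward')

# Daily bars only become visible at the 16:00 close, so rows earlier in the
# day pick up the previous day's bar, matching the rule in the update
daily = pd.read_csv("./data/historical/SPY_daily.csv", parse_dates=['time'])
daily['time'] = daily['time'] + pd.Timedelta(hours=16)
daily = daily.add_prefix("daily_").rename(columns={"daily_time": "time"})
main_df = pd.merge_asof(main_df.sort_values('time'), daily.sort_values('time'),
                        on='time', direction='backward')

And for the advanced version, the running 5min values can be built from the 1min rows themselves by grouping on the 5-minute bucket and taking cumulative aggregates, mirroring the conversion dict above (the bucket alignment may need adjusting to however Alpha Vantage stamps its bars):

one_min = pd.read_csv("./data/historical/SPY_1min.csv", parse_dates=['time'])
g = one_min.groupby(one_min['time'].dt.floor('5min'))
one_min['5min_open'] = g['open'].transform('first')   # first open of the bucket
one_min['5min_high'] = g['high'].cummax()             # running max so far
one_min['5min_low'] = g['low'].cummin()               # running min so far
one_min['5min_close'] = one_min['close']              # running close = current close
one_min['5min_volume'] = g['volume'].cumsum()         # running sum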
I am working on a personal project collecting the data on Covid-19 cases. The data set only shows the total number of Covid-19 cases per state cumulatively. I would like to add a column that contains the new cases added that day. This is what I have so far:
import pandas as pd
from datetime import date
from datetime import timedelta
import numpy as np
#read the CSV from github
hist_US_State = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
#some code to get yesterday's date and the day before which is needed later.
today = date.today()
yesterday = today - timedelta(days = 1)
yesterday = str(yesterday)
day_before_yesterday = today - timedelta(days = 2)
day_before_yesterday = str(day_before_yesterday)
#Extract yesterday's and the day before's cases and combine them in one dataframe
yesterday_cases = hist_US_State[hist_US_State["date"] == yesterday]
day_before_yesterday_cases = hist_US_State[hist_US_State["date"] == day_before_yesterday]
total_cases = pd.DataFrame()
total_cases = day_before_yesterday_cases.append(yesterday_cases)
#Adding a new column called "new_cases" and this is where I get into trouble.
total_cases["new_cases"] = yesterday_cases["cases"] - day_before_yesterday_cases["cases"]
Can you please point out what I am doing wrong?
Because you defined total_cases as a concatenation (via append) of yesterday_cases and day_before_yesterday_cases, its number of rows is the sum of the rows of the other two dataframes. It looks like yesterday_cases and day_before_yesterday_cases both have 55 rows, so total_cases has 110 rows. Your last line is therefore trying to assign 55 values to a series of 110 values.
You may either want to reshape your data so that each date is its own column, or work with the two days' dataframes separately and align them before subtracting.
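For instance, a minimal sketch of getting a new_cases column for the whole history, rather than just yesterday (assuming the NYT file keeps its date, state, and cases columns), is to take per-state differences of the cumulative counts:

import pandas as pd

hist_US_State = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")

# within each state, the difference between consecutive cumulative totals
# is that day's new cases; the first day per state is left as NaN
hist_US_State = hist_US_State.sort_values(["state", "date"])
hist_US_State["new_cases"] = hist_US_State.groupby("state")["cases"].diff()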
I am fairly new to python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (~300).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd

def get_data(issue_date, stock_ticker):
    df = pd.read_csv(r'D:\Project\Data\Short_Interest\exampledata.csv')
    df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index label of the row of interest
    ix = short[short.Date.eq(issue_date)].index[0]
    # get the positional index for that label
    iloc_ix = short.index.get_loc(ix)
    # get the +/-10 rows around it (+11 because slice ends are exclusive), i.e. 10 trading days either side
    short_data = short.iloc[iloc_ix - 10: iloc_ix + 11]
    return [short_data]
I want to create a script that iterates over a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as follows:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row: iterate over the csv file, pass each row's values to the function, and accumulate the resulting DataFrame slices in a list.
I modified your function a bit: removed the DataFrame creation so it is only done once, and added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like day/month/year, so I parsed them in that format.
import pandas as pd
import datetime, csv

def get_data(df, issue_date, stock_ticker):
    '''Return a DataFrame slice for the ticker, centered on the issue date.
    '''
    short = df.loc[df.Symbol.eq(stock_ticker)]
    try:
        # get the index label of the row of interest
        ix = short[short.Date.eq(issue_date)].index[0]
        # get the positional index for that label
        iloc_ix = short.index.get_loc(ix)
        # get the +/-10 rows around it (+11 because slice ends are exclusive)
        short_data = short.iloc[iloc_ix - 10: iloc_ix + 11]
    except IndexError:
        msg = f'no data for {stock_ticker} on {issue_date}'
        #log.info(msg)
        print(msg)
        short_data = None
    return short_data

datafile = r'D:\Project\Data\Short_Interest\exampledata.csv'
df = pd.read_csv(datafile)
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")

results = []
with open('issues.csv') as issues:
    for ticker, date in csv.reader(issues):
        day, month, year = map(int, date.split('/'))
        # dt = datetime.datetime.strptime(date, r'%d/%m/%Y')
        date = datetime.date(year, month, day)
        s = get_data(df, date, ticker)
        results.append(s)
        # print(s)
Creating a single DataFrame or table for all that info may be problematic, especially since the date ranges are all different. You should probably ask a separate question about that; its MCVE should include just a few minimal Pandas Series with a couple of different date ranges and tickers.
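That said, if you do want everything in one long frame, the slices can simply be stacked with pd.concat; each row already carries its Symbol and Date, so no extra key is strictly needed. A minimal sketch, continuing from the results list built above:

# stack all non-empty slices into one long DataFrame
frames = [s for s in results if s is not None]
combined = pd.concat(frames, ignore_index=True)
print(combined)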
I have a 3-year dataset that I have split into days. Now I want to store each month's data in a separate list/variable.
SDD2=Restaurant[Restaurant.Item == ' Soft Drink '].groupby(pd.Grouper(key='Date',freq='D')).sum()
print(SDD2)
This is the data I get from the above code; now I want to store each month's data in a separate variable/list.
You could store each month's data in a JSON or CSV file so that it is easily accessible from your Python script.
For more information, check Python's json and csv modules.
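For instance, a minimal sketch of that idea, assuming SDD2 is the daily-summed frame from the question (so it keeps the DatetimeIndex produced by the Grouper): group by month and write one file per group.

import pandas as pd

# one CSV per month; the filename pattern is just an example
for month, group in SDD2.groupby(pd.Grouper(freq='M')):
    group.to_csv(f"soft_drink_{month:%Y-%m}.csv")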
You can just do df.groupby(pd.Grouper(key="Date", freq="M")) and then query the groups with get_group('date'), or optionally convert the grouped data to a dict of lists with either .apply(list).to_dict() or dict(list(groups)).
Example:
import pandas as pd
import numpy as np
# create some random dates
start = pd.to_datetime('2018-01-01')
end = pd.to_datetime('2019-12-31')
start_u = start.value//10**9
end_u = end.value//10**9
date_range = pd.to_datetime(np.random.randint(start_u, end_u, 30), unit='s')
# convert to DF
df = pd.DataFrame(date_range, columns=["Date"])
# Add random data
df['Data'] = np.random.randint(0, 100, size=(len(date_range)))
# Format to y-m-d
df['Date'] = pd.to_datetime(df['Date'].dt.strftime('%Y-%m-%d'))
print(df)
# group by month
grouped_df = df.groupby(pd.Grouper(key="Date", freq="M"))
# query the groups
print("\n\ngrouped data for feb 2018\n")
#print(grouped_df.get_group('2018-02-28'))
dict_of_list = dict(list(grouped_df))
feb_2018 = pd.Timestamp('2018-02-28')
if feb_2018 in dict_of_list:
    print(dict_of_list[feb_2018])
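And the .apply(list).to_dict() variant mentioned above, which maps each month-end Timestamp to a plain list of that month's values:

# month-end Timestamp -> list of that month's Data values
monthly_lists = df.groupby(pd.Grouper(key="Date", freq="M"))['Data'].apply(list).to_dict()
print(monthly_lists.get(feb_2018, []))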
I'm new to python, and pardon me if this question sounds silly -
I have a csv file that has 2 columns - Value and Timestamp. I'm trying to write code that takes 2 parameters - start_date and end_date - and traverses the csv file to obtain all the values between those 2 dates and print the sum of Value.
Below is my code. I'm trying to read and store the values in a list.
f_in = open('Users2.csv').readlines()
Value1 = []
Created = []
for i in range(1, len(f_in)):
    Value, created_date = f_in[i].split(',')
    Value1.append(Value)
    Created.append(created_date)
print(Value1)
print(Created)
My csv has the following format
10 2010-02-12 23:31:40
20 2010-10-02 23:28:11
40 2011-03-12 23:39:40
10 2013-09-10 23:29:34
420 2013-11-19 23:26:17
122 2014-01-01 23:41:51
When I run my code (File1.py) as below:
File1.py 2010-01-01 2011-03-31
The output should be 70
I'm running into the following issues:
1. The data in the csv is a timestamp (created_date), but the parameter passed should be a date, and I need to convert and get the data between those 2 dates regardless of time.
2. Once I have it in a list, as described above, how do I proceed with my calculation considering the condition in point 1?
You can try this:
import csv

data = csv.reader(open('filename.csv'))
start_date = 10
end_date = 30
times = [' '.join(i) for i in data if int(i[0]) in range(start_date, end_date)]
Depending on your file size, you may consider putting the values from the csv file into a database and then querying your results.
The csv module has DictReader, which allows you to predefine your column names; it greatly improves readability, especially while working with really big files.
import csv

COLUMN_NAMES = ['value', 'timestamp']

def sum_values(start_date, end_date):
    total = 0
    with open('Users2.csv', mode='r') as csvfile:
        table = csv.DictReader(csvfile, fieldnames=COLUMN_NAMES)
        for row in table:
            # ISO-formatted date strings compare correctly as text
            if start_date <= row['timestamp'] <= end_date:
                total += int(row['value'])
    return total
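A quick usage sketch (this assumes Users2.csv really is comma-separated, as your own split(',') suggests; the end bound is given a time so the whole end day is included):

print(sum_values('2010-01-01', '2011-03-31 23:59:59'))  # 70 for the sample values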
If you are open to using pandas, try this:
>>> import pandas as pd
>>> data = 'Users2.csv'
>>>
>>> dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
>>> df = pd.read_csv(data, names=['value', 'date'], parse_dates=['date'], date_parser=dateparse)
>>> result = df['value'][(df['date'] > '2010-01-01') &
... (df['date'] < '2011-03-31')
... ].sum()
>>> result
70
Since you said that the dates are timestamps, you can compare them as strings. Realizing that, what you want to achieve (sum the values if created is between start_date and end_date) can be done like this:
def sum_values(start_date, end_date):
    total = 0
    with open('Users2.csv') as f:
        for line in f:
            value, created = line.split(' ', 1)
            if created > start_date and created < end_date:
                total += int(value)
    return total
str.split(' ', 1) will split on ' ' but stop after the first split. start_date and end_date must be in the format yyyy-MM-dd hh:mm:ss, which I assume they are since they are in timestamp format. Just keep that in mind.
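The comparison works because ISO-format timestamps sort lexicographically in chronological order, for example:

# plain string comparisons, no date parsing needed
print('2010-02-12 23:31:40' > '2010-01-01')   # True
print('2010-02-12 23:31:40' < '2011-03-31')   # True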