I have tick data for 2 scrips (the scrip_name values are abc and xyz). Since the tick data is at second-level resolution, I want to convert it to OHLC (Open, High, Low, Close) at 1-minute level.
When the tick data contains only 1 scrip, I use the following code (OHLC of Single Scrip.py) to get the OHLC at 1-minute level. This code gives the desired result.
Code:
import os
import time
import datetime
import pandas as pd
import numpy as np
ticks=pd.read_csv(r'C:\Users\tech\Downloads\ticks.csv')
ticks=pd.DataFrame(ticks)
#ticks=ticks.where(ticks['scrip_name']=="abc")
#ticks=ticks.where(ticks['scrip_name']=="xyz")
ticks['timestamp'] = pd.to_datetime(ticks['timestamp'])
ticks=ticks.set_index(['timestamp'])
ohlc_prep=ticks.loc[:,['last_price']]
ohlc_1_min=ohlc_prep['last_price'].resample('1min').ohlc().dropna()
ohlc_1_min.to_csv(r'C:\Users\tech\Downloads\ohlc_1_min.csv')
Result:
However, when the tick data contains more than 1 scrip, this code doesn't work. What modifications should be made to the code to get the following result (filename: expected_result.csv), which is grouped by scrip_name?
Expected Result:
Here is the link to ticks data, python code for single scrip, result of single scrip, and desired result of multiple scrips: https://drive.google.com/file/d/1Y3jngm94hqAW_IJm-FAsl3SArVhnjGJE/view?usp=sharing
Any help is much appreciated.
thank you.
I think you need groupby like:
ticks['timestamp'] = pd.to_datetime(ticks['timestamp'])
ticks=ticks.set_index(['timestamp'])
ohlc_1_min=ticks.groupby('scrip_name')['last_price'].resample('1min').ohlc().dropna()
Or:
ohlc_1_min=(ticks.groupby(['scrip_name',
                           pd.Grouper(freq='1min', level='timestamp')])['last_price']
                 .ohlc()
                 .dropna())
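As a quick check of the groupby approach, here is a minimal sketch on a handful of made-up ticks (the timestamps and prices below are invented for illustration, not taken from the linked ticks.csv). It assumes the same column names as the question: timestamp, scrip_name and last_price.

```python
import pandas as pd

# Made-up ticks for two scrips; the values are illustrative only.
ticks = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-01-01 09:15:01", "2020-01-01 09:15:30", "2020-01-01 09:16:10",
        "2020-01-01 09:15:05", "2020-01-01 09:15:45", "2020-01-01 09:16:20",
    ]),
    "scrip_name": ["abc", "abc", "abc", "xyz", "xyz", "xyz"],
    "last_price": [100.0, 101.5, 99.0, 50.0, 49.5, 51.0],
}).set_index("timestamp")

# One OHLC bar per scrip and per minute; empty minutes are dropped.
ohlc_1_min = (
    ticks.groupby("scrip_name")["last_price"]
    .resample("1min")
    .ohlc()
    .dropna()
)

# Turn the (scrip_name, timestamp) index back into ordinary columns,
# which is usually the more convenient shape for writing to CSV.
flat = ohlc_1_min.reset_index()
print(flat)
```

The result is indexed by (scrip_name, timestamp), so flat.to_csv(..., index=False) should give one row per scrip and minute.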
Related
I got live data from Yahoo Finance as follows:
ndx = yf.Ticker("NDX")
# get stock info
print(ndx.info)
# get historical market data
hist = ndx.history(period="1825d")
I downloaded it and exported it to a CSV file as follows:
#Download stock data then export as CSV
df = yf.download("NDX", start="2016-01-01", end="2022-11-02")
df.to_csv('ndx.csv')
Viewed the data as follows:
df = pd.read_csv("ndx.csv")
df
The data was displayed as seen in the picture:
THE PROBLEM....
Any time I try to use the Date column, it throws KeyError: 'Date'. Here is my Auto ARIMA model and the error thrown. Please help.
ERROR THROWN
I want to be able to use the Date column. I tried parsing the Date column, but it throws the same error. I need help parsing the data first so that I can convert Date to a date format or a string. Thanks.
Always great to see people trying to learn financial analysis:
Before I get into the solution I just want to remind you to make sure you put your imports in your question (yfinance isn't always aliased as yf). Also make sure you type or copy/paste your code so that we can easily grab it and run it!
So, I am going to assume the variable "orig_df" is just the call to pd.read_csv('ndx.csv') since that's what the screenshot looks like.
Firstly, always check your data types of your columns after reading in the file:
(assuming you are using Jupyter)
orig_df = pd.read_csv('ndx.csv')
orig_df.dtypes
Date is an object, which just means string in pandas.
If orig_df instead comes straight from yfinance (for example yf.Ticker(...).history(...) or yf.download(...)), then "Date" is your index, so it does not act like a column.
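To make the column-versus-index difference concrete, here is a small sketch. The two CSV rows are made-up stand-ins for ndx.csv (the prices are not real data), and the variable names are only for illustration.

```python
import io
import pandas as pd

# Made-up stand-in for ndx.csv; the prices are illustrative only.
csv_text = "Date,Close\n2022-10-31,100.0\n2022-11-01,101.0\n"

# Plain read: "Date" is an ordinary column holding strings.
as_column = pd.read_csv(io.StringIO(csv_text))

# Parsed and used as the index: "Date" is no longer a column,
# so as_index["Date"] would raise KeyError: 'Date'.
as_index = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")
print("Date" in as_index.columns)   # False
print(as_index.index)               # the dates live here

# reset_index() moves the index back into a regular "Date" column.
restored = as_index.reset_index()
print(restored["Date"])
```

So when Date is the index, either work with df.index directly or call reset_index() before using df["Date"].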
How to fix and Run:
import pandas as pd
from statsmodels.api import tsa
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime as dt, timedelta
orig_df = pd.read_csv('ndx.csv', parse_dates=['Date'], index_col=0)
model = tsa.arima.ARIMA(np.log(orig_df['Close']), order=(10, 1, 10))
fitted = model.fit()
fc = fitted.get_forecast(5)
fc = (fc.summary_frame(alpha=0.05))
fc_mean = fc['mean']
fc_lower = fc['mean_ci_lower']
fc_upper = fc['mean_ci_upper']
orig_df.iloc[-50:,:].plot(y='Close', title='Nasdaq 100 Closing price', figsize=(10, 6))
# start from orig_df.index[-1] (the most recent trading day in the data, not today) and step forward one day at a time
future_5_days = [orig_df.index[-1] + timedelta(days=x) for x in range(1, 6)]
plt.plot(future_5_days, np.exp(fc_mean), label='mean_forecast', linewidth=1.5)
plt.fill_between(future_5_days,
np.exp(fc_lower),
np.exp(fc_upper),
color='b', alpha=.1, label='95% confidence')
plt.title('Nasdaq 5 Days Forecast')
plt.legend(loc='upper left', fontsize=8)
plt.show()
I have data of concentrations for every day of year 2005 until 2018. I want to read three columns of three different files and combine them into one, so I can plot them.
Data:file 1
time, mean_OMNO2d_003_ColumnAmountNO2CloudScreened
2005-01-01,-1.267651e+30
2005-01-02,4.90778397e+15
...
2018-12-31,-1.267651e+30
Data:file 2
time, OMNO2d_003_ColumnAmountNO2TropCloudScreened
2005-01-01,-1.267651e+30
2005-01-02,3.07444147e+15
...
Data:file 3
time, OMSO2e_003_ColumnAmountSO2_PBL
2005-01-01,-1.267651e+30
2005-01-02,-0.0144000314
...
I want to plot time and mean_OMNO2d_003_ColumnAmountNO2CloudScreened, OMNO2d_003_ColumnAmountNO2TropCloudScreened, OMSO2e_003_ColumnAmountSO2_PBL into one graph.
import glob
import pandas as pd
file_list = glob.glob('*.csv')
no= []
no2=[]
so2=[]
for f in file_list:
    df= pd.read_csv(f, skiprows=8, parse_dates =['time'], index_col ='time')
    df.columns=['no','no2','so2']
    no.append([df["no"]])
    no2.append([df["no2"]])
    so2.append([df["so2"]])
How do I solve the problem?
This is very doable. I had a similar problem with 3 files all in one plot. My understanding is that you want to compare levels of NO, NO2, and SO2, that each column is in comparable order, and that you want to compare across rows. If you are ok with importing matplotlib and numpy, something like this may work for you:
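import numpy as np
import matplotlib.pyplot as plt
# df is assumed to hold the three series as columns "no", "no2" and "so2"
NO = np.asarray(df["no"])
NO2 = np.asarray(df["no2"])
SO2 = np.asarray(df["so2"])
# replace "your_time_stamp" with the name of your time column
timestamp = np.asarray(df["your_time_stamp"])
plt.plot(timestamp, NO)
plt.plot(timestamp, NO2)
plt.plot(timestamp, SO2)
plt.savefig("name_of_plot.png")  # placeholder file name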
This will need some adjusting for your specific data frame, but I hope you see what I am getting at!
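Since each of the three files holds a single value column, one way to get them into one data frame is to read each file with time as the index and concatenate the results side by side. The sketch below uses in-memory text as a stand-in for the files (only the sample rows from the question). The helper name combine_series is hypothetical, and treating -1.267651e+30 as a missing-data fill value is an assumption based on how it appears in the samples.

```python
import io
import pandas as pd

# Assumption: values at or below this threshold are fill values, not measurements.
FILL_THRESHOLD = -1e30

def combine_series(sources, skiprows=0):
    """Read one-column CSVs indexed by 'time' and join them column-wise."""
    frames = []
    for src in sources:
        frames.append(pd.read_csv(src, skiprows=skiprows, skipinitialspace=True,
                                  parse_dates=["time"], index_col="time"))
    combined = pd.concat(frames, axis=1)            # align on the time index
    return combined.where(combined > FILL_THRESHOLD)  # fill values become NaN

# In-memory stand-ins for the three files, using the rows shown in the question.
file1 = io.StringIO("time, mean_OMNO2d_003_ColumnAmountNO2CloudScreened\n"
                    "2005-01-01,-1.267651e+30\n2005-01-02,4.90778397e+15\n")
file2 = io.StringIO("time, OMNO2d_003_ColumnAmountNO2TropCloudScreened\n"
                    "2005-01-01,-1.267651e+30\n2005-01-02,3.07444147e+15\n")
file3 = io.StringIO("time, OMSO2e_003_ColumnAmountSO2_PBL\n"
                    "2005-01-01,-1.267651e+30\n2005-01-02,-0.0144000314\n")

combined = combine_series([file1, file2, file3])
print(combined)
```

With real files you would pass the paths from glob.glob('*.csv') and the skiprows=8 used in the question. The SO2 values are many orders of magnitude smaller than the NO2 columns, so combined.plot(subplots=True) will probably be easier to read than one shared axis.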
I tried to run a function through multiple data frames, but I have a problem with it. My main questions are:
1) I tried to run a defined function with zip(df1, df2, df3,...) so that the outputs are new DF1, DF2, DF3,...; however, it failed. Is it possible to use zip to run a function over multiple dataframes and get dataframes as outputs?
2) If zip() is not an option, how do I make my function run in a loop? Currently I have only three dataframes, which are easy to handle separately, but I would like to know how to handle 50, 100, or even more dataframes.
Here is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#import scipy.stats as ss
# *********** 3 City Temperature files from NOAA ***********
# City 1
df1 = pd.pandas.read_csv('https://docs.google.com/spreadsheets/d/1Uj5N363dEVJZ9WVy2a_kkbJKJnyyE5qnEqOfzO0UCQE/gviz/tq?tqx=out:csv')
# City 2
df2 = pd.pandas.read_csv('https://docs.google.com/spreadsheets/d/13CgTdDCDzB_3WIYIRVMeLu6E36xzHSzRR5T_Ku0vThA/gviz/tq?tqx=out:csv')
# City 3
df3 = pd.pandas.read_csv('https://docs.google.com/spreadsheets/d/17pNZFIaV_NpQfSed-msIGu9jzzqF6JBvCZrBRiU2ZkQ/gviz/tq?tqx=out:csv')
def CleanDATA(data):
    data = data.drop(columns=['Annual'])
    data = data.drop(data.index[29:-1])
    data = data.drop(data.index[-1])
    monthname=[]
    Temp=[]
    for row in range(0,len(data)):
        for col in range(1,13):
            #monthname.append(str(col)+"-"+str(data['Year'][row]))
            monthname.append(str(data['Year'][row])+str(col))
            Temp.append(data.iloc[row,col])
    df0=pd.DataFrame()
    df0['Month']=monthname
    df0['Temperature']=Temp
    df0['Month']=pd.to_datetime(df0['Month'],format='%Y.0%m') #change the date form
    df0['Month'] = pd.to_datetime(df0['Month']).dt.date # remove time, only keep date
    data =df0[df0.applymap(np.isreal).all(1)] # remove non-numerical
    return data
data1 = CleanDATA(df1)
data2 = CleanDATA(df2)
data3 = CleanDATA(df3)
Also, I found an issue with Pandas while reading the following excel file:
https://drive.google.com/file/d/1V9fKpACbLrSi0NfB0FHSgc96PQerKkUF/view?usp=sharing (This is city 1 temperature data from 1990-2019)
2019 is ongoing, so NOAA stations only provide data up to this May. The Excel data labels all missing values with "M". I noticed that once a column contains an "M", I cannot use boxplot directly, even after dropping the 2019 row. The Spyder console says "items [Jun to Dec]" are missing (and the weird thing is that I can use the same data for an XY line plot). To draw the boxplot, I have to manually remove the 2019 row in Excel and then read the new file.
I would do it using dictionaries (or lists or other iterable).
cities = {'city1': 'https://...', 'city2': 'https://...', 'city3': 'https://...'}
df = {}
data = {}
for city, url in cities.items():
    df[city] = pd.read_csv(url)
    data[city] = CleanDATA(df[city])
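The same dictionary pattern scales to any number of data frames, and it also gives a place to deal with the "M" markers: a column that contains "M" is read as text rather than numbers, which is the likely reason boxplot complains even after the 2019 row is dropped. Below is a minimal sketch with two tiny made-up frames (the values are invented, and to_numeric_frame is a hypothetical helper, not part of CleanDATA).

```python
import pandas as pd

def to_numeric_frame(df):
    """Coerce every column to numbers; markers such as "M" become NaN."""
    return df.apply(pd.to_numeric, errors="coerce")

# Made-up stand-ins for the city tables; "M" marks a missing month.
raw = {
    "city1": pd.DataFrame({"Year": [2018, 2019], "Jan": ["30.1", "31.0"], "Jun": ["70.2", "M"]}),
    "city2": pd.DataFrame({"Year": [2018, 2019], "Jan": ["25.4", "26.3"], "Jun": ["65.0", "M"]}),
}

# Apply the same function to every frame, keeping the city name as the key.
cleaned = {name: to_numeric_frame(df) for name, df in raw.items()}

for name, df in cleaned.items():
    print(name)
    print(df.dtypes)
```

When reading the file yourself, passing na_values="M" to pd.read_csv should have a similar effect at load time.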
I want to create a graph of temperature and time data from a MySQL database. Using matplotlib and pandas with python3 on raspbian I am trying to insert the temperature in the Y axis and the time in the X axis.
The Y axis works fine, it plots the temps (float) without any issues. However, when I try to add time (time), it outputs erroneous data I assume because it has a different data type. If I use another column such as ID (int), then it works. I am unsure if I need to convert time into a string or if there is another way around it.
The answer might lie in Change data type of columns in Pandas which seems similar, but because I am inserting data from MySQL, I am unsure how I could apply it to my own problem.
My end goal is to have a cron job that runs every five minutes and outputs an image based line chart with temps from the last 24 hours on the Y axis, and the time values along the X axis and then copies it to my WWW folder for display via HTML. I know the script is missing anything after image output, but that is easy and I've done that before. I just can't get the chart to display the X axis time values.
Any assistance would be appreciated.
import matplotlib.pyplot as plt
import pandas as pd
import MySQLdb
def mysql_select_all():
    conn = MySQLdb.connect(host='localhost',
                           user='user',
                           passwd='password',
                           db='database')
    cursor = conn.cursor()
    sql = "SELECT id,time,date,temperature FROM table ORDER BY id DESC LIMIT 288"
    cursor.execute(sql)
    result = cursor.fetchall()
    df = pd.DataFrame(list(result),columns=["id","time","date","temperature"])
    x = df.time
    y = df.temperature
    plt.title("Temperature Data from Previous 24 Hours", fontsize="20")
    plt.plot(x, y)
    plt.xlabel("Time")
    plt.ylabel("Temperature (\u2103)")
    plt.tick_params(axis='both',which='major',labelsize=14)
    plt.savefig('test.png')
    cursor.close()
print("Start")
mysql_select_all()
print("End")
The above code currently outputs the below image.
First and Last lines from the DataFrame
id 688
time 0 days 09:55:01
date 2019-01-24
temperature 27.75
Name: 0, dtype: object
id 401
time 0 days 10:00:01
date 2019-01-23
temperature 24.4
Name: 287, dtype: object
Try the pandas.to_datetime() function. It can convert strings or integers to datetime format.
Original code:
df = pd.DataFrame(list(result),columns=["time","temperature"])
x = df.time
y = df.temperature
New code:
df = pd.DataFrame(list(result),columns=["time","temperature"])
df["time"]=df["time"].astype("datetime64[ns]")
#or the line below.
#df["time"]=pd.to_datetime(df["time"])
#assuming that df.time can be converted to datetime format.
x = df.time
y = df.temperature
The rest of the code can stay as it is, although df.plot() may be a more convenient way to draw the plot.
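One caveat, judging from the rows printed in the question: the time column shows up as a time-of-day offset ("0 days 09:55:01") next to a separate date column, so a direct datetime conversion of time alone may not work. In that case, adding the two columns together is one option. The sketch below rebuilds those two printed rows by hand instead of querying MySQL, and the column name timestamp is a hypothetical addition.

```python
import datetime as dt
import pandas as pd

# The two rows printed in the question, rebuilt by hand (no database needed).
df = pd.DataFrame({
    "id": [688, 401],
    "time": pd.to_timedelta(["09:55:01", "10:00:01"]),   # offset since midnight
    "date": [dt.date(2019, 1, 24), dt.date(2019, 1, 23)],
    "temperature": [27.75, 24.4],
})

# date (midnight of that day) + time (offset since midnight) = full timestamp
df["timestamp"] = pd.to_datetime(df["date"]) + df["time"]

# The query returns newest rows first; sort so the line runs left to right.
df = df.sort_values("timestamp")
print(df[["timestamp", "temperature"]])
```

plt.plot(df["timestamp"], df["temperature"]) should then put real dates and times on the X axis.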
I have the following Excel file, with time stamps in the format
20180821_2330
for a lot of days.
1) How would I format it as standard time so that I can plot it against the other sensor values?
2) I would like to have one big plot of, for example, the sensor 1 reading against all the days. Is that possible?
https://www.mediafire.com/file/m36ha4777d6epvd/median_data.xlsx/file
Is this what you are looking for? I improvised and created a dictionary 'n' whose 'timestamp' key stands in for your timestamp column, then turned it into a data frame. Basically, I think you should apply a function (called 'apply_fun' here) to the column that stores the timestamps; it passes each element to a helper that parses it with strptime() into a datetime.
import datetime
import pandas as pd
n = {'timestamp':['20180822_2330', '20180821_2334', '20180821_2334', '20180821_2330']}
data_series = pd.DataFrame(n)
def format_dates(n):
    x = n.find('_')
    y = datetime.datetime.strptime(n[:x]+n[x+1:], '%Y%m%d%H%M')
    return y
def apply_fun(dataset):
    dataset['timestamp2'] = dataset['timestamp'].apply(format_dates)
    return dataset
print(apply_fun(data_series))
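The same conversion can probably be done without a helper function, because pd.to_datetime accepts a format string and the underscore can simply be part of it. A minimal sketch, reusing sample values in the format from the question:

```python
import pandas as pd

stamps = pd.Series(["20180822_2330", "20180821_2334", "20180821_2330"], name="timestamp")

# The literal "_" in the format matches the separator in the raw strings.
parsed = pd.to_datetime(stamps, format="%Y%m%d_%H%M")
print(parsed)
```

Once the column is a real datetime, it can be used directly as the X axis when plotting the sensor values.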
As for the 2nd point, I am not able to reach the site because the McAfee agent at work blocks it. Once you have the 1st point working, you can ask about the 2nd separately.