My background is VBA and I am very new to Python, so please forgive me at the outset.
I have a .txt file with time series data.
My goal is to loop through the data and do simple comparisons, such as High - Close etc. Coming from VBA, this is straightforward for me there, namely (in simple terms):
Sub Loop()
    Dim arrTS() As Variant, i As Long
    arrTS = Array("Date", "Time", ..)
    For i = LBound(arrTS, 1) to UBound(arrTS, 1)
        Debug.Print arrTS(i, "High") - arrTS(i, "Close")
    Next i
End Sub
Now what I have in python is:
import os
import numpy as np
import urllib.request
import matplotlib.pyplot as plt
#load the .txt file
ES_D1 = np.loadtxt(fname=os.path.join(os.getcwd(), "ES", "D1", "ES_10122007_04122019_D1.txt"), dtype='str')
#now get the shape
print(ES_D1.shape)
Out: (3025, 8)
Can anyone recommend the best way to iterate through this file line by line, with reference to specific columns, and not iterate through each element?
Something like:
For i = 0 To 3025
print(ES_D1[i,4] - ES_D1[i,5])
Next i
The regular way to read csv/tsv files for me is this:
import os
filename = '...'
filepath = '...'
infile = os.path.join(filepath, filename)
with open(infile) as fin:
    for line in fin:
        parts = line.split('\t')
        # do something with the list "parts"
But in your case, using the pandas function read_csv() might be a better way:
import pandas as pd
# Control delimiters, rows, column names with read_csv
data = pd.read_csv(infile)
# View the first 5 lines
data.head()
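To connect this back to the High - Close comparison in the question, here is a minimal sketch. The sample rows and the column names (Date, Time, Open, High, Low, Close) are assumptions for illustration, since the real file layout is not shown.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real file; the column names are assumed.
sample = io.StringIO(
    "Date,Time,Open,High,Low,Close\n"
    "2007-12-10,17:00,1510.0,1520.5,1505.25,1515.0\n"
    "2007-12-11,17:00,1515.0,1523.0,1480.0,1482.5\n"
)
data = pd.read_csv(sample)

# Column arithmetic works on the whole column at once, no explicit loop needed.
data["HighMinusClose"] = data["High"] - data["Close"]

# Row-by-row access is still available when each row needs separate handling.
for row in data.itertuples(index=False):
    print(row.Date, row.HighMinusClose)
```

If the real file has no header row, read_csv accepts header=None and names=[...] so the columns can still be referred to by name.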
Creating the simple for loop was easier than I thought; here it is for others.
import os
import numpy as np
import urllib.request
import matplotlib.pyplot as plt
#load the .txt file
ES_D1 = np.loadtxt(fname=os.path.join(os.getcwd(), "ES", "D1", "ES_10122007_04122019_D1.txt"), dtype='str')
#now need to loop through the array
#this is the engine
for i in range(ES_D1.shape[0]):
    # values are loaded as strings, so convert to float for a numeric comparison
    if float(ES_D1[i,3]) > float(ES_D1[i,6]):
        print(ES_D1[i,0])
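Because dtype='str' loads every value as text, a comparison between two columns is a string comparison unless the values are converted first. As a rough sketch of the same check without an explicit loop, the two columns can be converted once and compared as whole arrays. The sample rows below are made up, and which columns hold which prices is an assumption.

```python
import numpy as np

# Hypothetical rows in the same all-strings layout that np.loadtxt(dtype='str') returns.
ES_D1 = np.array([
    ["12/10/2007", "17:00", "1510.00", "1520.50", "1505.25", "1515.00", "1518.00", "1000"],
    ["12/11/2007", "17:00", "1515.00", "1523.00", "1480.00", "1482.50", "1525.00", "1200"],
])

# Convert the two columns being compared to floats before comparing.
col3 = ES_D1[:, 3].astype(float)
col6 = ES_D1[:, 6].astype(float)

# Boolean mask: True where column 3 exceeds column 6.
mask = col3 > col6

# Select the first column (the date) of the matching rows.
matching_dates = ES_D1[mask, 0]
print(matching_dates)
```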
I'm trying to find a quick and easy way to read and plot the nth csv file in a folder.
I'm currently working with the following, which reads all files in the folder:
import os
import glob
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))
for file in csv_files:
    # read and plot the csv file
    Data = pd.read_csv(file,header=33)
    sns.lineplot(x=Data['x'],y=Data['y'],data=Data)
But is there a way to read and plot every 4th file, for example?
for count, file in enumerate(csv_files, start=1):
    # count % 4 is 0 only for the 4th, 8th, 12th, ... file
    if count % 4 == 0:
        Data = pd.read_csv(file,header=33)
        sns.lineplot(x=Data['x'],y=Data['y'],data=Data)
count increases by one for each file, so the condition is true only for every 4th file and only those files are read.
You can use a counter variable (a very beginner approach, but easy to start with).
counter = 0
for file in csv_files:
    # Increase the counter value for each iteration
    counter = counter + 1
    # read and plot the csv file
    Data = pd.read_csv(file,header=33)
    # For example, you want to plot the third file
    if counter == 3:
        sns.lineplot(x=Data['x'],y=Data['y'],data=Data)
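Since csv_files is an ordinary list, list slicing is another option. The sketch below uses made-up file names in place of the glob.glob() result; sorting first is suggested because glob does not promise any particular order.

```python
# Hypothetical file names standing in for the glob.glob() result.
csv_files = sorted(f"run_{i:02d}.csv" for i in range(1, 11))

# Start at index 3 (the 4th file) and step by 4.
every_fourth = csv_files[3::4]
print(every_fourth)

# A single file, e.g. the 3rd one, is plain indexing.
third_file = csv_files[2]
print(third_file)
```

The read and plot calls from the question can then loop over every_fourth instead of csv_files, so the skipped files are never opened.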
I would like to loop over and access my files of type .bin that each contain three values of type double (pitch, yaw, roll).
So far I have only been able to access a single file, using with open('annotations/01/frame_00004_pose.bin', 'rb') as fid:
I am aware that I need to change that line of code for my loop to work properly; I am just unsure how to proceed. My folder annotations contains subfolders 01-24, each holding many files of type .bin.
Here is what I have done so far.
import pandas as pd
import numpy as np
import os
pyr = pd.DataFrame(columns = ['pitch','yaw','roll'])
with os.scandir('annotations') as entries:
    for i in entries:
        with open('annotations/01/frame_00004_pose.bin', 'rb') as fid:
            data_array = np.fromfile(fid, np.float32)
        para = data_array[3:]
        pyr = pyr.append({'pitch':para[0],'yaw':para[1],'roll':para[2]},ignore_index = True)
print(pyr)
Any help would be appreciated.
Yes, using glob is a good idea:
import pandas as pd
import numpy as np
import os
import glob
rows = []
# '**' with recursive=True descends into every subfolder of annotations
entries = glob.glob('annotations/**/*.bin', recursive=True)
for entry in entries:
    with open(entry, 'rb') as fid:
        data_array = np.fromfile(fid, np.float32)
    para = data_array[3:]
    # DataFrame.append was removed in pandas 2.0, so collect the rows in a list
    rows.append({'pitch':para[0],'yaw':para[1],'roll':para[2]})
pyr = pd.DataFrame(rows, columns = ['pitch','yaw','roll'])
print(pyr)
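For anyone who wants to try the idea without the original data, here is a self-contained sketch that builds a small throwaway folder tree first. The file layout (six float32 values per file, with the last three being pitch, yaw, roll) is an assumption based on the data_array[3:] slice in the question, and the extra "file" column is only there to show which row came from which file.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Build a throwaway folder tree so the sketch runs on its own.
# Assumption: each file holds six float32 values, the last three being pitch, yaw, roll.
root = Path(tempfile.mkdtemp()) / "annotations"
samples = [("01", [0, 0, 0, 1.5, 2.5, 3.5]), ("02", [0, 0, 0, 4.0, 5.0, 6.0])]
for folder, values in samples:
    (root / folder).mkdir(parents=True)
    np.array(values, dtype=np.float32).tofile(root / folder / "frame_00001_pose.bin")

rows = []
# rglob walks every subfolder; sorting keeps the row order predictable.
for path in sorted(root.rglob("*.bin")):
    pitch, yaw, roll = np.fromfile(path, dtype=np.float32)[3:6]
    rows.append({
        "file": path.relative_to(root).as_posix(),
        "pitch": float(pitch),
        "yaw": float(yaw),
        "roll": float(roll),
    })

# Build the DataFrame once from the collected rows.
pyr = pd.DataFrame(rows)
print(pyr)
```

If the files really store doubles, as the question text says, the dtype would need to be np.float64 and the slice adjusted to match.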
The goal of this program is to read each column header and all of the data underneath it, turn that data into a list, and log it all to a text file. This works with small data, but with large amounts of data (2000 lines and up) the text file records values up to number 30, then the next element is '...', and then recording resumes correctly all the way up to the 2000th element.
I have tried everything I can think of. Please help. I almost punched a hole in the wall trying to fix this.
import csv
import pandas as pd
import os
import linecache
from tkinter import *
from tkinter import filedialog
def create_dict(df):
    # Creates an empty text file for the dictionary if it doesn't exist
    if not os.path.isfile("Dictionary.txt"):
        open("Dictionary.txt", 'w').close()
    # Opens the dictionary for reading and writing
    with open("Dictionary.txt", 'r+') as dictionary:
        column_headers = list(df)
        i = 0
        # Creates an entry in the dictionary for each header
        for header in column_headers:
            dictionary.write("==========================\n"
                             "\t=" + header + "=\n"
                             "==========================\n\n\n\n")
            dictionary.write(str(df[str(column_headers[i])]))
            #for line in column_info[:-1]:
            #    dictionary.write(line + '\n')
            dictionary.write('\n')
            i += 1
Some of these imports might not be used. I just included all of them.
You can write a pandas DataFrame directly to a text file:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(low = 1, high = 100, size =3000), columns= ['Random Number'])
filename = 'dictionary.txt'
with open(filename,'w') as file:
    df.to_string(file)
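As for where the '...' comes from: str() on a long Series or DataFrame gives the shortened display form, which is controlled by pandas display options such as display.max_rows, whereas to_string() renders every row. A small sketch with a made-up 2000-row column shows the difference.

```python
import pandas as pd

# Hypothetical column long enough to trigger truncation.
s = pd.Series(range(2000), name="Random Number")

# str() uses the display settings, so long output is shortened with "...".
short_text = str(s)

# to_string() ignores the display row limit and renders every row.
full_text = s.to_string()

print(len(short_text.splitlines()), len(full_text.splitlines()))
```

So inside the original loop, writing df[header].to_string() instead of str(df[header]) should keep the full column in the text file.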
I have a for loop that I want to:
1) Make a pivot table out of the data
2) Convert the 5min data to 30min data
My code is below:
import numpy as np
import pandas as pd
import os
import glob
os.chdir('C:/Users/george/Desktop/testing/output/test')
for filename in os.listdir('C:/Users/george/Desktop/testing/output/test'):
    data = pd.read_csv(filename,skiprows=[0])
    table = pd.pivot_table(data, values='SCADAVALUE',columns=['DUID'],index='SETTLEMENTDATE', aggfunc=np.sum)
    table.to_csv(filename+'pivoted.csv')
my_csv_files = []
for file in os.listdir("C:/Users/george/Desktop/testing/output/test"):
    if file.endswith("*pivoted.csv"):
        table.set_index(table.columns[0])
        table.index = pd.to_datetime(table.index)
        table_resampled = table.resample('30min',closed='right',label='right').mean()
        table_resampled = table_resampled.reset_index()
        table.to_csv(filename+'30min.csv')
The code performs the first loop, but the second loop does not work. Why is this? What's wrong with my code?
EDIT1:
See comment below
import numpy as np
import pandas as pd
import os
import glob
os.chdir('C:/Users/george/Desktop/testing/output/test')
for filename in os.listdir('C:/Users/george/Desktop/testing/output/test'):
    data = pd.read_csv(filename,skiprows=[0])
    table = pd.pivot_table(data, values='SCADAVALUE',columns=['DUID'],index='SETTLEMENTDATE', aggfunc=np.sum)
    table.to_csv(filename+'pivoted.csv')
my_csv_files = [] # what is this variable for?
for file in os.listdir("C:/Users/george/Desktop/testing/output/test"):
    # Note: endswith() does not expand wildcards, so "*pivoted.csv" never matches;
    # the literal suffix "pivoted.csv" would
    if file.endswith("*pivoted.csv"):
        # At this point you are not reading the file, but you should.
        # The 'table' variable still refers to the last iteration
        # of the 'for' loop a few lines above
        # However, better than re-reading the file, you can remove
        # the second 'for file in...' loop,
        # and just merge the code with the first loop
        table.set_index(table.columns[0])
        table.index = pd.to_datetime(table.index)
        table_resampled = table.resample('30min',closed='right',label='right').mean()
        table_resampled = table_resampled.reset_index()
        table.to_csv(filename+'30min.csv')
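To illustrate the merge suggested in the comments, here is a rough sketch that pivots and resamples in one pass. It uses a small made-up in-memory sample instead of the real files, so the values and the two DUID names are assumptions; only the column names come from the question.

```python
import io

import pandas as pd

# Hypothetical 5-minute sample; the real files also have a leading row that skiprows=[0] drops.
raw = io.StringIO(
    "SETTLEMENTDATE,DUID,SCADAVALUE\n"
    "2019-01-01 00:05:00,A,10\n"
    "2019-01-01 00:05:00,B,1\n"
    "2019-01-01 00:10:00,A,20\n"
    "2019-01-01 00:10:00,B,2\n"
    "2019-01-01 00:35:00,A,30\n"
    "2019-01-01 00:35:00,B,3\n"
)
data = pd.read_csv(raw, parse_dates=["SETTLEMENTDATE"])

# Step 1: pivot so each DUID becomes a column, indexed by timestamp.
table = pd.pivot_table(data, values="SCADAVALUE", columns="DUID",
                       index="SETTLEMENTDATE", aggfunc="sum")

# Step 2: resample the pivoted table while it is still in memory.
table_resampled = table.resample("30min", closed="right", label="right").mean()
print(table_resampled)

# File matching: endswith() takes a literal suffix, not a wildcard pattern.
print("data.csvpivoted.csv".endswith("pivoted.csv"))   # True
print("data.csvpivoted.csv".endswith("*pivoted.csv"))  # False
```

In the real loop, both steps would sit inside the single for filename in ... loop, and it is table_resampled (not table) that should be written to the '30min.csv' file.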
I wish to subtract each row from the preceding row in a .dat file and make a new column out of the result. Specifically, for the first column, time, I want to find the time interval for each timestep and store it in a new column. With help from the Stack Overflow community I wrote some rough code in pandas, but it is not working so far:
import pandas as pd
import numpy as np
from sys import argv
from pylab import *
import csv
script, filename = argv
# read flash.dat to a list of lists
datContent = [i.strip().split() for i in open("./flash.dat").readlines()]
# write it as a new CSV file
with open("./flash.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)
columns_to_keep = ['#time']
dataframe = pd.read_csv("./flash.csv", usecols=columns_to_keep)
df = pd.DataFrame({"#time": pd.date_range("24 sept 2016", periods=5*24, freq="1h")})
df["time"] = df["#time"] + [pd.Timedelta(minutes=m) for m in np.random.choice(a=range(60), size=df.shape[0])]
df["value"] = np.random.normal(size=df.shape[0])
df["prev_time"] = [np.nan] + df.iloc[:-1]["time"].tolist()
df["time_delta"] = df.time - df.prev_time
df
dataframe.plot(x='#time', y='time_delta', style='r')
print(dataframe)
show()
I am also sharing the file for your convenience; your help is much appreciated.
https://www.dropbox.com/s/w4jbxmln9e83355/flash.dat?dl=0
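One possible direction, as a minimal sketch: pandas has a diff() method that subtracts the preceding row from each row, which is the time-interval calculation described above. The sample below is made up, and it assumes the file is whitespace-separated with a header line that starts with #time; the real flash.dat layout has not been checked.

```python
import io

import pandas as pd

# Hypothetical whitespace-separated sample; the real flash.dat layout is assumed, not known.
sample = io.StringIO(
    "#time value\n"
    "0.0 1.0\n"
    "0.5 1.2\n"
    "1.5 0.9\n"
    "3.0 1.1\n"
)
df = pd.read_csv(sample, sep=r"\s+")

# diff() subtracts the preceding row from each row; the first row has no predecessor (NaN).
df["time_delta"] = df["#time"].diff()
print(df)
```

With the real file, the io.StringIO object would be replaced by the path to flash.dat, and the new column can then be plotted with df.plot(x='#time', y='time_delta').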