for loop performs first action but not the second - python

I have a for loop that I want to:
1) Make a pivot table out of the data
2) Convert the 5min data to 30min data
My code is below:
import numpy as np
import pandas as pd
import os
import glob
os.chdir('C:/Users/george/Desktop/testing/output/test')
for filename in os.listdir('C:/Users/george/Desktop/testing/output/test'):
    data = pd.read_csv(filename,skiprows=[0])
    table = pd.pivot_table(data, values='SCADAVALUE',columns=['DUID'],index='SETTLEMENTDATE', aggfunc=np.sum)
    table.to_csv(filename+'pivoted.csv')
my_csv_files = []
for file in os.listdir("C:/Users/george/Desktop/testing/output/test"):
    if file.endswith("*pivoted.csv"):
        table.set_index(table.columns[0])
        table.index = pd.to_datetime(table.index)
        table_resampled = table.resample('30min',closed='right',label='right').mean()
        table_resampled = table_resampled.reset_index()
        table.to_csv(filename+'30min.csv')
The code performs the first loop, but the second loop does not work. Why is this? What's wrong with my code?
EDIT1:

See the comments in the code below:
import numpy as np
import pandas as pd
import os
import glob
os.chdir('C:/Users/george/Desktop/testing/output/test')
for filename in os.listdir('C:/Users/george/Desktop/testing/output/test'):
    data = pd.read_csv(filename,skiprows=[0])
    table = pd.pivot_table(data, values='SCADAVALUE',columns=['DUID'],index='SETTLEMENTDATE', aggfunc=np.sum)
    table.to_csv(filename+'pivoted.csv')
my_csv_files = [] # what is this variable for?
for file in os.listdir("C:/Users/george/Desktop/testing/output/test"):
    if file.endswith("*pivoted.csv"):
        # At this point you are not reading the file, but you should.
        # The 'table' variable still refers to the last iteration
        # of the 'for' loop a few lines above.
        # However, better than re-reading the file, you can remove
        # the second 'for file in...' loop,
        # and just merge the code with the first loop
        table.set_index(table.columns[0])
        table.index = pd.to_datetime(table.index)
        table_resampled = table.resample('30min',closed='right',label='right').mean()
        table_resampled = table_resampled.reset_index()
        table.to_csv(filename+'30min.csv')
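
Merging the two loops, as the comments above suggest, might look like the sketch below. It is a sketch under assumptions, not the asker's final code: the helper names pivot_and_resample and process_folder and the _pivoted / _30min output names are made up for illustration. It also covers three details of the original second loop: str.endswith compares literally, so "*pivoted.csv" (with the asterisk) never matches a file name; set_index returns a new frame rather than changing table in place; and the last line wrote table instead of table_resampled.

```python
import glob
import os

import pandas as pd


def pivot_and_resample(data):
    """Pivot 5-minute readings by DUID, then average them into 30-minute bins."""
    table = pd.pivot_table(
        data,
        values="SCADAVALUE",
        columns="DUID",
        index="SETTLEMENTDATE",
        aggfunc="sum",
    )
    # The pivot index holds the timestamps as text; resample needs datetimes.
    table.index = pd.to_datetime(table.index)
    resampled = table.resample("30min", closed="right", label="right").mean()
    return table, resampled


def process_folder(folder):
    """Write <name>_pivoted.csv and <name>_30min.csv next to each input CSV.

    The output naming is an assumption made for this sketch.
    """
    for path in glob.glob(os.path.join(folder, "*.csv")):
        stem = os.path.splitext(path)[0]
        # Skip outputs from an earlier run so they are not pivoted again.
        if stem.endswith(("_pivoted", "_30min")):
            continue
        data = pd.read_csv(path, skiprows=[0])
        table, resampled = pivot_and_resample(data)
        table.to_csv(stem + "_pivoted.csv")
        resampled.to_csv(stem + "_30min.csv")
```

Calling process_folder('C:/Users/george/Desktop/testing/output/test') would then handle every CSV in that folder in a single pass, with no second loop needed.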

Related

Iterate through df and update based on prediction

I am not a Python programmer, so I am struggling with the following:
def py_model(df):
    import pickle
    import pandas as pd
    import numpy as np
    from pandas import Series,DataFrame
    filename = 'C:/aaaTENNIS-DATA/votingC.pkl'
    loaded_model = pickle.load(open(filename,'rb'))
    for index, row in df.iterrows():
        ab = row[['abc','def','ghi','jkl']]
        input = np.array(ab)
        df['Prediction'] =pd.DataFrame(loaded_model.predict([input]))
        df['AccScore'] =??
    return df
For each row of the dataframe, I wish to get a prediction and put it in df['Prediction'] and also get the model score and put it in another field.
You don't need to iterate; predict accepts all rows at once:
import pickle
filename = 'C:/aaaTENNIS-DATA/votingC.pkl'
loaded_model = pickle.load(open(filename,'rb'))
df['Prediction'] = loaded_model.predict(df[['abc','def','ghi','jkl']])
Tip #1: don't use input as a variable name; it shadows a built-in function in Python: https://docs.python.org/3/library/functions.html#input
Tip #2: don't put import statements inside a function; put them all at the beginning of your file.
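
The answer above does not cover the AccScore column. An accuracy score normally describes a whole model on labelled data, not a single row, so the sketch below uses the highest class probability per row as a per-row value instead. That is an assumption about what the asker wants, and it only works if the unpickled model exposes predict_proba (for a scikit-learn VotingClassifier, which the file name hints at, that requires voting='soft'). StandInModel and add_predictions are hypothetical names; the stand-in only exists so the example runs without the pickle file.

```python
import numpy as np
import pandas as pd


class StandInModel:
    """Hypothetical stand-in for the unpickled classifier (not the real model)."""

    def predict_proba(self, X):
        # Pretend the probability of class 1 is the mean of the row's features.
        p1 = np.asarray(X, dtype=float).mean(axis=1)
        return np.column_stack([1.0 - p1, p1])

    def predict(self, X):
        # Pick the class with the highest probability for each row.
        return self.predict_proba(X).argmax(axis=1)


FEATURES = ["abc", "def", "ghi", "jkl"]


def add_predictions(df, model):
    """Return a copy of df with one prediction and one confidence per row."""
    out = df.copy()
    X = out[FEATURES]
    out["Prediction"] = model.predict(X)
    # Highest class probability per row, used here as the per-row "score".
    out["AccScore"] = model.predict_proba(X).max(axis=1)
    return out
```

With the real model, add_predictions(df, loaded_model) would replace the iterrows loop entirely.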

Iterate through Time Series data from .txt file using Numpy Array

My background is VBA and I am very new to Python, so please forgive me at the outset.
I have a .txt file with time series data.
My goal is to loop through the data and do simple comparisons, such as High - Close etc. Coming from VBA, this is straightforward for me there, namely (in simple terms):
Sub Loop()
    Dim arrTS() As Variant, i As Long
    arrTS = Array("Date", "Time", ..)
    For i = LBound(arrTS, 1) to UBound(arrTS, 1)
        Debug.Print arrTS(i, "High") - arrTS(i, "Close")
    Next i
End Sub
Now what I have in python is:
import os
import numpy as np
import urllib.request
import matplotlib.pyplot as plt
#load the .txt file
ES_D1 = np.loadtxt(fname = os.getcwd()+"\ES\D1\ES_10122007_04122019_D1.txt", dtype='str')
#now get the shape
print(ES_D1.shape)
Out: (3025, 8)
Can anyone recommend the best way to iterate through this file line by line, with reference to specific columns, and not iterate through each element?
Something like:
For i = 0 To 3025
    print(ES_D1[i,4] - ES_D1[i,5])
Next i
The regular way to read csv/tsv files for me is this:
import os
filename = '...'
filepath = '...'
infile = os.path.join(filepath, filename)
with open(infile) as fin:
    for line in fin:
        parts = line.split('\t')
        # do something with the list "parts"
But in your case, using the pandas function read_csv() might be a better way:
import pandas as pd
# Control delimiters, rows, column names with read_csv
data = pd.read_csv(infile)
# View the first 5 lines
data.head()
Creating the simple for loop was easier than I thought; here it is for others.
import os
import numpy as np
import urllib.request
import matplotlib.pyplot as plt
#load the .txt file (raw string so the backslashes are taken literally)
ES_D1 = np.loadtxt(fname = os.getcwd()+r"\ES\D1\ES_10122007_04122019_D1.txt", dtype='str')
#now need to loop through the array
#this is the engine
for i in range(ES_D1.shape[0]):
    #the values were loaded as strings, so convert before comparing numerically
    if float(ES_D1[i,3]) > float(ES_D1[i,6]):
        print(ES_D1[i,0])
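
The same row-by-row comparison can also be written without an explicit loop. The sketch below is illustrative only: the sample rows are invented, dates_where_greater is a hypothetical helper, and columns 3 and 6 are assumed to hold numbers stored as text, which is what np.loadtxt(..., dtype='str') produces. Converting to float first matters because strings compare character by character, so '9.5' > '10.25' is true as text even though 9.5 is the smaller number.

```python
import numpy as np

# Hypothetical rows in the all-strings layout that
# np.loadtxt(..., dtype='str') returns (8 columns, as in the question).
rows = np.array([
    ["2019-01-02", "a", "b", "9.5", "c", "d", "10.25", "e"],
    ["2019-01-03", "a", "b", "11.0", "c", "d", "10.5", "e"],
    ["2019-01-04", "a", "b", "100.0", "c", "d", "99.0", "e"],
])


def dates_where_greater(table, left_col, right_col, date_col=0):
    """Return the date of every row where left_col > right_col numerically."""
    left = table[:, left_col].astype(float)
    right = table[:, right_col].astype(float)
    # Boolean mask selects whole rows; date_col picks the date out of each.
    return table[left > right, date_col]


print(dates_where_greater(rows, 3, 6))
```

A column difference such as High - Close follows the same pattern: rows[:, 3].astype(float) - rows[:, 6].astype(float) yields one value per row.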

How do I save my list to a dataframe keeping empty rows?

I'm trying to extract subject-verb-object triplets and then attach an ID. I am using a loop, so my list of extracted triplets keeps an empty result for the rows where no triplet was found. So it looks like:
[]
[trump,carried,energy]
[]
[clinton,doesn't,trust]
When I print mylist it looks as expected.
However when I try and create a dataframe from mylist I get an error caused by the empty rows
`IndexError: list index out of range`.
I tried to include an if statement to avoid this but the problem is the same. I also tried using reindex instead but the df2 came out empty.
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import spacy
import textacy
import csv, string, re
import numpy as np
import pandas as pd
#Import csv file with pre-processing already carried out
import pandas as pd
df = pd.read_csv("pre-processed_file_1.csv", sep=",")
#Prepare dataframe to be relevant columns and unicode
df1 = df[['text_1', 'id']].copy()
import StringIO
s = StringIO.StringIO()
tweets = df1.to_csv(encoding='utf-8');
nlp = spacy.load('en')
count = 0;
df2 = pd.DataFrame();
for row in df1.iterrows():
    doc = nlp(unicode(row));
    text_ext = textacy.extract.subject_verb_object_triples(doc);
    tweetID = df['id'].tolist();
    mylist = list(text_ext)
    count = count + 1;
    if (mylist):
        df2 = df2.append(mylist, ignore_index=True)
    else:
        df2 = df2.append('0','0','0')
Any help would be very appreciated. Thank you!
You're supposed to pass a DataFrame-shaped object to append; passing the raw values doesn't work. So use df2 = df2.append([['0','0','0']], ignore_index=True).
You can also wrap your processing in a function process_row, then do df2 = pd.DataFrame([process_row(row) for row in df1.iterrows()]). Note that while append won't work with empty rows, the DataFrame constructor just fills them in with None. If you want empty rows to be ['0','0','0'], you have several options:
- Have your processing function return ['0','0','0'] for empty rows
- Change the list comprehension to [process_row(row) if process_row(row) else ['0','0','0'] for row in df1.iterrows()]
- Do df2 = df2.fillna('0')
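
The first option can be sketched as follows. This is an illustration under assumptions, not the asker's pipeline: rows_to_frame and the column names are hypothetical, each row is assumed to yield at most one triplet, and the spaCy/textacy extraction is replaced by ready-made lists. Collecting plain lists and calling the DataFrame constructor once also avoids DataFrame.append, which was removed in pandas 2.0.

```python
import pandas as pd

EMPTY_ROW = ["0", "0", "0"]  # placeholder for rows where no triplet was found


def rows_to_frame(ids, triplets):
    """Build one DataFrame from per-row results, keeping rows with no triplet."""
    records = [
        # An empty list is falsy, so it is swapped for the placeholder.
        [row_id] + (list(triplet) if triplet else EMPTY_ROW)
        for row_id, triplet in zip(ids, triplets)
    ]
    return pd.DataFrame(records, columns=["id", "subject", "verb", "object"])


# Hypothetical extraction results, one entry per input row.
ids = [101, 102, 103, 104]
triplets = [[], ["trump", "carried", "energy"], [], ["clinton", "doesn't", "trust"]]
print(rows_to_frame(ids, triplets))
```

Because every input row produces exactly one record, the ids stay aligned with their triplets even when nothing was extracted.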

subtract consecutive rows from a .dat file

I wish to subtract each row from the preceding row in a .dat file and then make a new column out of the result. In my file, I want to do that with the first column, time: I want to find the time interval for each timestep and store it in a new column. With help from the Stack Overflow community I wrote some rough code in pandas, but it's not working so far:
import pandas as pd
import numpy as np
from sys import argv
from pylab import *
import csv
script, filename = argv
# read flash.dat to a list of lists
datContent = [i.strip().split() for i in open("./flash.dat").readlines()]
# write it as a new CSV file
with open("./flash.dat", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)
columns_to_keep = ['#time']
dataframe = pd.read_csv("./flash.csv", usecols=columns_to_keep)
df = pd.DataFrame({"#time": pd.date_range("24 sept 2016"),periods=5*24,freq="1h")})
df["time"] = df["#time"] + [pd.Timedelta(minutes=m) for m in np.random.choice(a=range(60), size=df.shape[0])]
df["value"] = np.random.normal(size=df.shape[0])
df["prev_time"] = [np.nan] + df.iloc[:-1]["time"].tolist()
df["time_delta"] = df.time - df.prev_time
df
dataframe.plot(x='#time', y='time_delta', style='r')
print dataframe
show()
I am also sharing the file for your convenience. Your help is much appreciated.
https://www.dropbox.com/s/w4jbxmln9e83355/flash.dat?dl=0
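
A minimal sketch of the row-to-row subtraction itself, assuming the file is whitespace-separated with a header line whose first column is named #time and holds numeric times. The sample text and the helper name add_time_delta are invented for illustration; the layout of the real flash.dat may differ. Series.diff() subtracts each value from the one after it, so no intermediate CSV and no manual shifting are needed.

```python
import io

import pandas as pd

# Hypothetical stand-in for flash.dat; the real file layout may differ.
SAMPLE = """\
#time value
0.0 10.0
0.5 11.0
1.5 13.0
3.0 12.0
"""


def add_time_delta(frame, time_col="#time"):
    """Return a copy with a time_delta column: each time minus the previous one."""
    out = frame.copy()
    # The first row has no predecessor, so its delta is NaN.
    out["time_delta"] = out[time_col].diff()
    return out


df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")
df = add_time_delta(df)
print(df)
```

For the real file, the io.StringIO(SAMPLE) argument would be replaced by the path to flash.dat, provided the assumptions about its layout hold.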

How to merge two programs with scheduled execution

I am trying to merge two programs, or write a third program that calls these two programs as functions. They are supposed to run one after the other, separated by an interval of a certain number of minutes, something like a makefile that will have a few more programs added later. I am able neither to merge them nor to put them into a form that lets me call them from a new main program.
program_master_id.py picks the *.csv files from a folder location and, after computing, appends to the master_ids.csv file in another folder location.
Program_master_count.py divides the count with respect to the count of Ids in the respective time series.
Program_1 master_id.py
import pandas as pd
import numpy as np
# csv file contents
# Need to change to path as the Transition_Data has several *.CSV files
csv_file1 = 'Transition_Data/Test_1.csv'
csv_file2 = '/Transition_Data/Test_2.csv'
#master file to be appended only
master_csv_file = 'Data_repository/master_lac_Test.csv'
csv_file_all = [csv_file1, csv_file2]
# read csv into df using list comprehension
# I use buffer here, replace stringIO with your file path
df_all = [pd.read_csv(csv_file) for csv_file in csv_file_all]
# processing
# =====================================================
# concat along axis=0, outer join on axis=1
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')
# custom function to handle/merge duplicates on Ids (axis=0)
def apply_func(group):
    return group.fillna(method='ffill').iloc[-1]
# remove Ids duplicates
merged_unique = merged.groupby(level='Ids').apply(apply_func)
# do the subtraction
df_master = pd.read_csv(master_csv_file, index_col=['Ids']).sort_index()
# select matching records and horizontal concat
df_matched = pd.concat([df_master,merged_unique.reindex(df_master.index)], axis=1)
# use broadcasting
df_matched.iloc[:, 1:] = df_matched.iloc[:, 1:].sub(df_matched.iloc[:, 0], axis=0)
print(df_matched)
Program_2 master_count.py #This gives neither an error nor any output.
import pandas as pd
import numpy as np
csv_file1 = '/Data_repository/master_lac_Test.csv'
csv_file2 = '/Data_repository/lat_lon_master.csv'
df1 = pd.read_csv(csv_file1).set_index('Ids')
# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()
# df1 and df2 has a duplicated column 00:00:00, use df1 without 1st column
temp = df2.join(df1.iloc[:, 1:])
# do the division by number of occurence of each Ids
# and add column 00:00:00
def my_func(group):
    num_obs = len(group)
    # process with column name after 00:30:00 (inclusive)
    group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
    return group
result = temp.groupby(level='Ids').apply(my_func)
I am trying to write a main program that will call master_ids.py first and then master_count.py. Is there a way to merge both into one program, or to write them as functions and call those functions in a new program? Please suggest.
Okay, let's say you have program1.py:
import pandas as pd
import numpy as np
def main_program1():
    csv_file1 = 'Transition_Data/Test_1.csv'
    ...
    return df_matched
And then program2.py:
import pandas as pd
import numpy as np
def main_program2():
    csv_file1 = '/Data_repository/master_lac_Test.csv'
    ...
    result = temp.groupby(level='Ids').apply(my_func)
    return result
You can now use these in a separate python program, say main.py
import time
import program1 # imports program1.py
import program2 # imports program2.py
df_matched = program1.main_program1()
print(df_matched)
# wait
min_wait = 1
time.sleep(60*min_wait)
# call the second one
result = program2.main_program2()
There are lots of ways to 'improve' these, but hopefully this will show you the gist. In particular, I would recommend using the if __name__ == "__main__": idiom (see the question "What does if __name__ == "__main__": do?") in each of the files, so that they can easily be executed from the command line or called from Python.
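
The idiom can be sketched as below. The two step functions are placeholders standing in for the real program bodies, and run_in_sequence is a hypothetical helper, not part of either original script. The sleep function is passed in as a parameter only so the waiting can be replaced when experimenting.

```python
import time


def main_program1():
    """Placeholder for the body of master_id.py (hypothetical stand-in)."""
    return "ids updated"


def main_program2():
    """Placeholder for the body of master_count.py (hypothetical stand-in)."""
    return "counts updated"


def run_in_sequence(steps, wait_seconds, sleep=time.sleep):
    """Call each step in order, pausing wait_seconds between consecutive steps."""
    results = []
    for position, step in enumerate(steps):
        if position > 0:
            sleep(wait_seconds)  # no pause before the first step
        results.append(step())
    return results


WAIT_SECONDS = 1  # short for demonstration; e.g. 60 * minutes in real use

if __name__ == "__main__":
    # Runs only when this file is executed directly, not when it is imported.
    for outcome in run_in_sequence([main_program1, main_program2], WAIT_SECONDS):
        print(outcome)
```

Importing this file from another script makes the functions available without triggering the run, while executing it directly performs the whole sequence. Adding further programs later means appending one more function to the list.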
Another option is a shell script, which for your 'master_id.py' and 'master_count.py' becomes (in its simplest form)
python master_id.py
sleep 60
python master_count.py
Saved as 'main.sh', this can be executed as
sh main.sh
