I am not a Python programmer, so I am struggling with the following:
def py_model(df):
    import pickle
    import pandas as pd
    import numpy as np
    from pandas import Series, DataFrame
    filename = 'C:/aaaTENNIS-DATA/votingC.pkl'
    loaded_model = pickle.load(open(filename, 'rb'))
    for index, row in df.iterrows():
        ab = row[['abc', 'def', 'ghi', 'jkl']]
        input = np.array(ab)
        df['Prediction'] = pd.DataFrame(loaded_model.predict([input]))
        df['AccScore'] = ??
    return df
For each row of the dataframe, I wish to get a prediction and put it in df['Prediction'] and also get the model score and put it in another field.
You don't need to iterate at all:

import pickle

filename = 'C:/aaaTENNIS-DATA/votingC.pkl'
# use a context manager so the file handle is closed after loading
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

df['Prediction'] = loaded_model.predict(df[['abc', 'def', 'ghi', 'jkl']])
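If you also want a per-row score for your AccScore column: assuming the pickled model is a scikit-learn classifier that supports predict_proba (e.g. a soft-voting VotingClassifier), you can take the probability of the predicted class in the same vectorised way. A sketch, not tested against your model:

# assumption: loaded_model is a scikit-learn classifier with predict_proba
proba = loaded_model.predict_proba(df[['abc', 'def', 'ghi', 'jkl']])  # shape (n_rows, n_classes)
df['AccScore'] = proba.max(axis=1)  # probability of the predicted class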
Tip #1: don't use input as a variable name; it shadows a Python built-in function: https://docs.python.org/3/library/functions.html#input
Tip #2: don't put import statements inside a function; put them all at the beginning of your file.
My background is VBA and I am very new to Python, so please forgive me at the outset.
I have a .txt file with time series data.
My goal is to loop through the data and do simple comparisons, such as High - Close etc. Coming from VBA, this is straightforward, namely (in simple terms):
Sub Loop()
    Dim arrTS() As Variant, i As Long
    arrTS = Array("Date", "Time", ..)
    For i = LBound(arrTS, 1) To UBound(arrTS, 1)
        Debug.Print arrTS(i, "High") - arrTS(i, "Close")
    Next i
End Sub
Now what I have in python is:
import os
import numpy as np
import urllib.request
import matplotlib.pyplot as plt

# load the .txt file (raw string so the backslashes aren't treated as escapes)
ES_D1 = np.loadtxt(fname=os.getcwd() + r"\ES\D1\ES_10122007_04122019_D1.txt", dtype='str')
# now get the shape
print(ES_D1.shape)
Out: (3025, 8)
Can anyone recommend the best way to iterate through this file row by row, referring to specific columns, rather than iterating through each element?
Something like:
For i = 0 To 3025
    print(ES_D1[i,4] - ES_D1[i,5])
Next i
The way I usually read csv/tsv files is this:
import os

filename = '...'
filepath = '...'
infile = os.path.join(filepath, filename)
with open(infile) as fin:
    for line in fin:
        parts = line.split('\t')
        # do something with the list "parts"
But in your case, using the pandas function read_csv() might be a better way:
import pandas as pd
# Control delimiters, rows, column names with read_csv
data = pd.read_csv(infile)
# View the first 5 lines
data.head()
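Once the data is in a DataFrame, the High - Close comparison works on whole columns by name. A sketch, assuming the file is tab-delimited and has a header row with 'High' and 'Close' columns (adjust sep and the names to your file):

import pandas as pd

data = pd.read_csv(infile, sep='\t')
# subtract whole columns at once - no explicit loop needed
print((data['High'] - data['Close']).head())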
Creating the simple for loop was easier than I thought; here it is for others.
import os
import numpy as np

# load the .txt file (raw string so the backslashes aren't treated as escapes)
ES_D1 = np.loadtxt(fname=os.getcwd() + r"\ES\D1\ES_10122007_04122019_D1.txt", dtype='str')

# now loop through the array - this is the engine
for i in range(ES_D1.shape[0]):
    # convert to float: with dtype='str' the values would otherwise compare lexicographically
    if float(ES_D1[i, 3]) > float(ES_D1[i, 6]):
        print(ES_D1[i, 0])
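As a side note, NumPy can do this comparison without an explicit loop at all. A sketch, assuming columns 3 and 6 hold numeric values (they are loaded as strings, so convert first):

high = ES_D1[:, 3].astype(float)
close = ES_D1[:, 6].astype(float)
print(ES_D1[high > close, 0])  # column 0 of every row where column 3 > column 6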
I have a for loop that I want to:
1) Make a pivot table out of the data
2) Convert the 5min data to 30min data
My code is below:
import numpy as np
import pandas as pd
import os
import glob

os.chdir('C:/Users/george/Desktop/testing/output/test')
for filename in os.listdir('C:/Users/george/Desktop/testing/output/test'):
    data = pd.read_csv(filename, skiprows=[0])
    table = pd.pivot_table(data, values='SCADAVALUE', columns=['DUID'], index='SETTLEMENTDATE', aggfunc=np.sum)
    table.to_csv(filename + 'pivoted.csv')

my_csv_files = []
for file in os.listdir("C:/Users/george/Desktop/testing/output/test"):
    if file.endswith("*pivoted.csv"):
        table.set_index(table.columns[0])
        table.index = pd.to_datetime(table.index)
        table_resampled = table.resample('30min', closed='right', label='right').mean()
        table_resampled = table_resampled.reset_index()
        table.to_csv(filename + '30min.csv')
The code performs the first loop, but the second loop does not work. Why is this? What's wrong with my code?
EDIT1:
See comment below
import numpy as np
import pandas as pd
import os
import glob

os.chdir('C:/Users/george/Desktop/testing/output/test')
for filename in os.listdir('C:/Users/george/Desktop/testing/output/test'):
    data = pd.read_csv(filename, skiprows=[0])
    table = pd.pivot_table(data, values='SCADAVALUE', columns=['DUID'], index='SETTLEMENTDATE', aggfunc=np.sum)
    table.to_csv(filename + 'pivoted.csv')

my_csv_files = []  # what is this variable for?
for file in os.listdir("C:/Users/george/Desktop/testing/output/test"):
    # endswith() does a literal string comparison, so the glob-style '*'
    # means this condition never matches; it would have to be "pivoted.csv"
    if file.endswith("*pivoted.csv"):
        # At this point you are not reading the file, but you should.
        # The 'table' variable still references the last iteration
        # of the 'for' loop a few lines above.
        # However, better than re-reading the file, you can remove
        # the second 'for file in...' loop
        # and just merge the code with the first loop.
        table.set_index(table.columns[0])
        table.index = pd.to_datetime(table.index)
        table_resampled = table.resample('30min', closed='right', label='right').mean()
        table_resampled = table_resampled.reset_index()
        table.to_csv(filename + '30min.csv')
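Following those comments, a merged version of the two loops might look like this (a sketch; paths and column names are taken from the question):

import numpy as np
import pandas as pd
import os

folder = 'C:/Users/george/Desktop/testing/output/test'
os.chdir(folder)
for filename in os.listdir(folder):
    data = pd.read_csv(filename, skiprows=[0])
    table = pd.pivot_table(data, values='SCADAVALUE', columns=['DUID'],
                           index='SETTLEMENTDATE', aggfunc=np.sum)
    table.to_csv(filename + 'pivoted.csv')
    # resample the pivoted table directly instead of re-reading it from disk
    table.index = pd.to_datetime(table.index)
    table_resampled = table.resample('30min', closed='right', label='right').mean()
    table_resampled.to_csv(filename + '30min.csv')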
I'm trying to extract subject-verb-object triplets and then attach an ID. I am using a loop, so my list of extracted triplets keeps empty results for the rows where no triplet was found. It looks like:
[]
[trump,carried,energy]
[]
[clinton,doesn't,trust]
When I print mylist it looks as expected.
However, when I try to create a dataframe from mylist, I get an error caused by the empty rows:
`IndexError: list index out of range`
I tried to include an if statement to avoid this, but the problem is the same. I also tried using reindex instead, but df2 came out empty.
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import spacy
import textacy
import csv, string, re
import numpy as np
import pandas as pd

# Import csv file with pre-processing already carried out
df = pd.read_csv("pre-processed_file_1.csv", sep=",")

# Prepare dataframe to be relevant columns and unicode
df1 = df[['text_1', 'id']].copy()

import StringIO
s = StringIO.StringIO()
tweets = df1.to_csv(encoding='utf-8')

nlp = spacy.load('en')
count = 0
df2 = pd.DataFrame()
for row in df1.iterrows():
    doc = nlp(unicode(row))
    text_ext = textacy.extract.subject_verb_object_triples(doc)
    tweetID = df['id'].tolist()
    mylist = list(text_ext)
    count = count + 1
    if mylist:
        df2 = df2.append(mylist, ignore_index=True)
    else:
        df2 = df2.append('0', '0', '0')
Any help would be very appreciated. Thank you!
You're supposed to pass a DataFrame-shaped object to append; passing the raw data doesn't work. So: df2 = df2.append([['0', '0', '0']], ignore_index=True)
You can also wrap your processing in a function process_row, then do df2 = pd.DataFrame([process_row(row) for row in df1.iterrows()]). Note that while append won't work with empty rows, the DataFrame constructor just fills them in with None. If you want empty rows to be ['0','0','0'], you have several options:

- Have your processing function return ['0','0','0'] for empty rows
- Change the list comprehension to [process_row(row) if process_row(row) else ['0','0','0'] for row in df1.iterrows()]
- Do df2 = df2.fillna('0')
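A minimal sketch of that approach, reusing the spacy/textacy calls from the question (process_row is a hypothetical helper; adjust the column lookup to your data):

def process_row(row):
    # row is an (index, Series) pair coming from df1.iterrows()
    doc = nlp(unicode(row[1]['text_1']))
    triples = list(textacy.extract.subject_verb_object_triples(doc))
    # return the first triple found, or the '0' placeholder row
    return list(triples[0]) if triples else ['0', '0', '0']

df2 = pd.DataFrame([process_row(row) for row in df1.iterrows()])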
I'm trying to filter a data frame based on the contents of a pre-defined array.
I've looked up several examples on StackOverflow but simply get an empty output.
I'm not able to figure out what I'm doing incorrectly. Could I please get some guidance here?
import pandas as pd
import numpy as np

csv_path = 'history.csv'
df = pd.read_csv(csv_path)
pre_defined_arr = ["A/B", "C/D", "E/F", "U/Y", "R/E", "D/F"]
distinct_count_column_headers = ['Entity']
distinct_elements = pd.DataFrame(df.drop_duplicates().Entity.value_counts(), columns=distinct_count_column_headers)
filtered_data = distinct_elements[distinct_elements['Entity'].isin(pre_defined_arr)]
print("Filtered data ... ")
print(filtered_data)
OUTPUT
Filtered data ...
Empty DataFrame
Columns: [Entity]
Index: []
Managed to do that using the filter function -> .filter(items=pre_defined_arr)
import pandas as pd
import numpy as np

csv_path = 'history.csv'
df = pd.read_csv(csv_path)
pre_defined_arr = ["A/B", "C/D", "E/F", "U/Y", "R/E", "D/F"]
distinct_count_column_headers = ['Entity']
distinct_elements_filtered = pd.DataFrame(df.drop_duplicates().Entity.value_counts().filter(items=pre_defined_arr), columns=distinct_count_column_headers)
It's strange that I bumped into just one answer suggesting the filter function. Almost 9 out of 10 out there talk about the .isin function, which didn't work in my case.
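For what it's worth, .isin most likely failed here because value_counts() puts the entity names into the index of distinct_elements, while its single column holds the counts. Filtering on the index should also work; a sketch:

filtered_data = distinct_elements[distinct_elements.index.isin(pre_defined_arr)]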
How can I name the data frame generated by the following code?
import re
import os
import csv
import codecs
import numpy as np
import matplotlib as p
import pdb
import pandas as pd
from pandas_ply import install_ply, X, sym_call
install_ply(pd)
(data_merged
 .groupby('index')
 .ply_select(
     count=X.index.count(),
     p_avg=X.item_price.mean()
 ))
Looking at your example, I assume that you mean to name the output variable of the dataframe, which would be data_out in the following:
data_out = (data_merged
            .groupby('index')
            .ply_select(
                count=X.index.count(),
                p_avg=X.item_price.mean()
            ))
Note that this is not actually giving the DataFrame a name; it's just naming the variable. You could bind another name to the same object, and both names would refer to the same DataFrame, since variable names in Python are just references to objects.
Series do have a name, stored in the name attribute. DataFrames are not named, but their columns are, since each column is a Series.
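A quick illustration of the name attribute:

import pandas as pd

s = pd.Series([1, 2, 3], name='prices')
print(s.name)             # prices

df = pd.DataFrame({'prices': [1, 2, 3]})
print(df['prices'].name)  # prices - each column is a Series and carries a name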