Going nuts trying to update a column of time entries in a dataframe. I am opening a csv file that has a column of time entries in UTC. I can take these times, convert them to Alaska Standard time, and print that new time out just fine. But when I attempt to put the time back into the dataframe, while I get no errors, I also don't get the new time in the dataframe. The old UTC time is retained. Code is below, I'm curious what it is I am missing. Is there something special about times?
import glob
import os
import pandas as pd
from datetime import datetime
from statistics import mean

def main():
    AKST = 9
    allDirectories = os.listdir('c:\\MyDir\\')
    for directory in allDirectories:
        curDirectory = directory.capitalize()
        print('Gathering data from: ' + curDirectory)
        dirPath = 'c:\\MyDir\\' + directory + '\\*.csv'
        # Files are named by date, so sorting by name gives us a proper date order
        files = sorted(glob.glob(dirPath))
        df = pd.DataFrame()
        for i in range(0, len(files)):
            data = pd.read_csv(files[i], usecols=['UTCDateTime', 'current_humidity', 'pm2_5_cf_1', 'pm2_5_cf_1_b'])
            dfTemp = pd.DataFrame(data)  # Temporary dataframe to hold our new info
            df = pd.concat([df, dfTemp], axis=0)  # Add new info to end of dataFrame
        print("Converting UTC to AKST, this may take a moment.")
        for index, row in df.iterrows():
            convertedDT = datetime.strptime(row['UTCDateTime'], '%Y/%m/%dT%H:%M:%Sz') - pd.DateOffset(hours=AKST)
            print("UTC: " + row['UTCDateTime'])
            df.at[index, 'UTCDateTime'] = convertedDT
            print("AKST: " + str(convertedDT))
            print("row['UTCDateTime'] = " + row['UTCDateTime'] + '\n')  # Should be updated with AKST, but is not!
Edit - Alternatively: is there a way to convert the date when it is first read into the dataframe? It seems like that would be faster than having two for loops.
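Regarding the edit: one way to do the conversion right after reading, without any explicit loops, is to parse the whole column at once with pd.to_datetime and subtract the offset. This is a sketch with an invented two-row sample in the same 'YYYY/MM/DDTHH:MM:SSz' format as the question:

```python
import pandas as pd

# Invented sample standing in for the CSV data; format matches the question
df = pd.DataFrame({'UTCDateTime': ['2023/01/15T12:00:00z', '2023/01/15T13:30:00z']})

AKST = 9  # hours behind UTC
df['UTCDateTime'] = (
    pd.to_datetime(df['UTCDateTime'], format='%Y/%m/%dT%H:%M:%Sz')
    - pd.Timedelta(hours=AKST)
)
print(df['UTCDateTime'].iloc[0])  # 2023-01-15 03:00:00
```

The vectorized parse replaces both the per-row strptime and the df.at writes, so it is usually much faster on large frames.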
From your code, it looks like the data is getting updated correctly in the dataframe, but you are printing row, which is not updated, because it was fetched from the dataframe before the update!
df.at[index, 'UTCDateTime'] = convertedDT  # here you are updating df
# but below you are printing row, not df
print("row['UTCDateTime'] = " + row['UTCDateTime'])
See sample code below and its output for the explanation.
data = pd.DataFrame({'Year': [1982, 1983], 'Statut': ['Yes', 'No']})
for index, row in data.iterrows():
    data.at[index, 'Year'] = '5000' + str(index)
    print('Printing row which is unchanged : ', row['Year'])
print('Updated Dataframe\n', data)
Output
Printing row which is unchanged : 1982
Printing row which is unchanged : 1983
Updated Dataframe
    Year Statut
0  50000    Yes
1  50001    No
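A minimal variant of that sample shows how to see the updated value while still inside the loop: read it back from the dataframe itself rather than from the row copy.

```python
import pandas as pd

data = pd.DataFrame({'Year': [1982, 1983], 'Statut': ['Yes', 'No']})
for index, row in data.iterrows():
    data.at[index, 'Year'] = 5000 + index
    # `row` is a copy taken before the update, so it keeps the old value...
    print('row copy      :', row['Year'])
    # ...but reading back from the frame shows the new value
    print('from dataframe:', data.at[index, 'Year'])
```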
Related
I am trying to read the file, then do the calculation and write the result back to the same file. But the result replaces the original existing data; how can I change that? Please help me.
import pandas as pd
df = pd.read_csv(r'C:\Users\Asus\Downloads\Number of Employed Persons by Status In Employment, Malaysia.csv')
print(df.to_string())
mean1 = df['Value'].mean()
sum1 = df['Value'].sum()
print ('Mean Value: ' + str(mean1))
print ('Sum of Value: ' + str(sum1))
df = pd.DataFrame([['Mean Value: ' + str(mean1)], ['Sum of Value: ' + str(sum1)]])
df.to_csv(r'C:\Users\Asus\Downloads\Number of Employed Persons by Status In Employment, Malaysia.csv', index=False)
print(df)
Do you want to add the data at the bottom of the file?
Overriding the data is not the best approach, in my opinion, but this is one solution:
import pandas as pd
df = pd.read_csv('data.csv')
mean1 = df['Value'].mean()
sum1 = df['Value'].sum()
df.loc[df.index.max() + 1] = ['Mean','Sum']
df.loc[df.index.max() + 1] = [mean1, sum1]
df.to_csv('data.csv', index=False)
Another option could be: save everything into an xlsx at the end (it is better to load the data from CSV if there is a lot of data) and keep dataframe1 on a first sheet and the analysis on a second sheet.
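That two-sheet xlsx variant could look like the sketch below. The file and column names are invented, and writing xlsx needs an Excel engine such as openpyxl installed:

```python
import pandas as pd

df = pd.DataFrame({'Value': [10, 20, 30]})  # stand-in for the loaded CSV
summary = pd.DataFrame({'Statistic': ['Mean', 'Sum'],
                        'Value': [df['Value'].mean(), df['Value'].sum()]})

# Raw data on the first sheet, analysis on the second
with pd.ExcelWriter('report.xlsx') as writer:
    df.to_excel(writer, sheet_name='data', index=False)
    summary.to_excel(writer, sheet_name='analysis', index=False)
```

This keeps the original data intact instead of overwriting it with the summary.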
updated question
By using the code below I am able to access the dataframe only after the for loop completes, but I want to use the most recently created column of the dataframe at intermediate times, i.e. after every 5 minutes, whichever is the last column of the dataframe. How can I achieve this?
@app.route("/sortbymax")
def sortbymax():
    df = updated_data()
    # here i want to use most recently created column
    df = create_links(df)
    df = df.sort_values(by=['perc_change'], ascending=False)
    return render_template('sortbymax.html', tables=[df.to_html(escape=False)], titles=df.columns.values)

def read_data():
    filename = r'c:\Users\91956\Desktop\bk.xlsm'
    df = pd.read_excel(filename)
    return df

def updated_data():
    df = read_data()
    for i in range(288):
        temp = read_data()
        x = datetime.datetime.now().strftime("%H:%M:%S")
        df['perc_change_' + x] = temp['perc_change']
        time.sleep(300)
    return df
I see you have an .xlsm file, which is a macro-enabled Excel workbook. You can read it, but if you want to change it with python then you will most probably lose the macro part of your excel.
For the python part:
This will copy the perc_change column every 5 minutes, with the respective name. However, bear in mind that this will work only for one day (it will replace existing columns after that). If you want it to work for longer periods, let me know and I will add day-month-year (whatever you want) to the column names.
import datetime
import time
import pandas as pd

def read_data():
    filename = r'c:\Users\91956\Desktop\bk.xlsm'
    df = pd.read_excel(filename)
    return df

def write_data(df):
    filename = r'c:\Users\91956\Desktop\bk.xlsm'
    df.to_excel(filename)

df = read_data()  # read excel for first time
for i in range(288):  # this will run for one day exactly
    temp = read_data()
    x = datetime.datetime.now().strftime("%H:%M")
    df['perc_change_' + x] = temp['perc_change']
    time.sleep(300)
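For the day-month-year variant mentioned above, putting the full date in the strftime format is enough to keep column names unique across days. A small sketch with a fixed timestamp so the result is deterministic:

```python
import datetime

def snapshot_column_name(now):
    # Date-stamped name so snapshots from different days never collide
    return 'perc_change_' + now.strftime('%d-%m-%Y_%H:%M')

print(snapshot_column_name(datetime.datetime(2024, 3, 1, 9, 30)))
# perc_change_01-03-2024_09:30
```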
I am working on converting a BLF file into a tab-separated file. I am able to extract all the useful information from the file into a list, as shown below. I want to calculate the difference between the timestamp values coming in one column. Please find my code so far:
import can
import csv
import datetime
import pandas as pd

filename = open('C:\\Users\\shraddhasrivastav\\Downloads\\BLF File\\output.csv', "w")
log = can.BLFReader('C:\\Users\\shraddhasrivastav\\Downloads\\BLF File\\test.blf')
log_output = []
for msg in log:
    msg = str(msg).split()
    data_list = msg[7:(7 + int(msg[6]))]
    log_output_entry = [msg[1], msg[3], msg[6], " ".join(data_list), msg[-1]]
    log_output_entry.insert(1, 'ID=')
    test_entry = " \t ".join(log_output_entry)  # join the list and remove string quotes in the csv file
    filename.write(test_entry + '\n')
df = pd.DataFrame(log_output)
df.columns = ['Timestamp', 'ID', 'DLC', 'Channel']
filename.close()  # Close the file outside the loop
The output I am getting so far is below:
Under my first column, I want the difference between consecutive timestamp values (e.g. 2nd row value minus 1st row timestamp value, 4th row timestamp value minus 3rd row timestamp value, and so on). What should I add to my code to achieve this?
Below is the screenshot of how I want my file's Timestamp field to look like. (Calculating the difference between consecutive rows)
You can use pandas.DataFrame.shift:
df['Time Delta'] = df['Timestamp'] - df['Timestamp'].shift(periods=1, axis=0)
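Equivalently, diff() is shorthand for subtracting the shifted column. A small self-contained sketch with made-up second-valued timestamps:

```python
import pandas as pd

df = pd.DataFrame({'Timestamp': pd.to_datetime([0.0, 0.5, 1.25], unit='s')})
# diff() computes row-to-row differences; the first row has no predecessor, so it is NaT
df['Time Delta'] = df['Timestamp'].diff()
print(df['Time Delta'].tolist())
```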
Keep in mind that the text file you currently write seems to have a variable number of columns per line, so it might be hard to load directly into pandas. Maybe the following would work:
import can
import pandas as pd

log = can.BLFReader('C:\\Users\\shraddhasrivastav\\Downloads\\BLF File\\test.blf')
log_output = []
for msg in log:
    msg = str(msg).split()
    data_list = msg[7:(7 + int(msg[6]))]
    # timestamp, ID, DLC, data bytes, channel
    log_output_entry = [msg[1], msg[3], msg[6], " ".join(data_list), msg[-1]]
    assert len(log_output_entry) == 5
    log_output.append(log_output_entry)

df = pd.DataFrame(log_output)
df.columns = ['Timestamp', 'ID', 'DLC', 'Data', 'Channel']
df['Timestamp'] = pd.to_datetime(df['Timestamp'].astype(float), unit='s')
df['Time Delta'] = df['Timestamp'] - df['Timestamp'].shift(periods=1, axis=0)
df.to_csv('C:\\Users\\shraddhasrivastav\\Downloads\\BLF File\\output_df.csv')
I am trying to download a number of .csv files, which I convert to pandas dataframes and append to each other.
The csv files can be accessed via a url which is created each day; using datetime the urls can be easily generated and put in a list.
I am able to open these individually from the list.
When I try to open a number of them and append them together, I get an empty dataframe. The code looks like this:
# Imports
import datetime
import pandas as pd

# Testing can open .csv file
data = pd.read_csv('https://promo.betfair.com/betfairsp/prices/dwbfpricesukwin01022018.csv')
data.iloc[:5]

# Taking headings to use to create new dataframe
data_headings = list(data.columns.values)

# Setting up string for url
path_start = 'https://promo.betfair.com/betfairsp/prices/dwbfpricesukwin'
file = ".csv"

# Getting dates which are used in url
start = datetime.datetime.strptime("01-02-2018", "%d-%m-%Y")
end = datetime.datetime.strptime("04-02-2018", "%d-%m-%Y")
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days)]

# Creating new dataframe which is appended to
for heading in data_headings:
    data = {heading: []}
df = pd.DataFrame(data, columns=data_headings)

# Creating list of urls
date_list = []
for date in date_generated:
    date_string = date.strftime("%d%m%Y")
    x = path_start + date_string + file
    date_list.append(x)

# Opening and appending csv files from list which contains urls
for full_path in date_list:
    data_link = pd.read_csv(full_path)
    df.append(data_link)

print(df)
I have checked the csv files and they are not empty. Any help would be appreciated.
Cheers,
Sandy
You are never storing the appended dataframe. The line:
df.append(data_link)
Should be
df = df.append(data_link)
However, this may be the wrong approach. You really want to use the array of URLs and concatenate them. Check out this similar question and see if it can improve your code!
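A concat-based version of that loop might look like the sketch below; `load_all` is a hypothetical helper, and `paths` can be the list of urls from the question or any file-like objects:

```python
import pandas as pd

def load_all(paths):
    # Read each source once, then concatenate in a single step
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)
```

Building the list first and calling pd.concat once avoids the repeated copying that growing a dataframe inside a loop causes.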
I really can't understand what you wanted to do here:
# Creating new dataframe which is appended to
for heading in data_headings:
    data = {heading: []}
df = pd.DataFrame(data, columns=data_headings)
By the way, try this:
for full_path in date_list:
    data_link = pd.read_csv(full_path)
    df = df.append(data_link.copy())
How do I loop through my excel sheet and add each 'Adjusted Close' to a dataframe? I want to summarize all the adjusted closes and make a stock index.
When I try the code below, the dataframe Percent_Change stays empty.
xls = pd.ExcelFile('databas.xlsx')
countSheets = len(xls.sheet_names)
Percent_Change = pd.DataFrame()
x = 0
for x in range(countSheets):
    data = pd.read_excel('databas.xlsx', sheet_name=x, index_col='Date')
    # Calculate the percent change from day to day
    Percent_Change[x] = pd.Series(data['Adj Close'].pct_change()*100, index=Percent_Change.index)
    stock_index = data['Percent_Change'].cumsum()
Unfortunately I do not have the data to replicate your complete example. However, there appears to be a bug in your code: you are looping over "x", and "x" is just an integer sheet position. You probably want to loop over the sheet names and append them to your DF. If you want to do that, your code should be:
import pandas as pd
xls = pd.ExcelFile('databas.xlsx')
# pep8 unto thyself only, it is conventional to use "_" instead of camelCase or to avoid longer names if at all possible
sheets = xls.sheet_names
Percent_Change = pd.DataFrame()
# using sheet instead of x is more "pythonic"
for sheet in sheets:
    data = pd.read_excel('databas.xlsx', sheet_name=sheet, index_col='Date')
    # Calculate the percent change from day to day; assign the Series directly
    # (reindexing against the initially empty Percent_Change.index is what left
    # the dataframe empty in the original code)
    Percent_Change[sheet] = data['Adj Close'].pct_change() * 100
    stock_index = Percent_Change[sheet].cumsum()
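To turn the per-sheet percent changes into a single index, one option is to average across columns and take the cumulative sum. This is a sketch with invented two-stock data and equal weighting assumed:

```python
import pandas as pd

Percent_Change = pd.DataFrame({
    'AAA': [1.0, -0.5, 2.0],   # invented daily % changes per stock
    'BBB': [0.0,  1.5, 1.0],
})
# Equal-weighted index: mean across stocks each day, then cumulative sum
stock_index = Percent_Change.mean(axis=1).cumsum()
print(stock_index.tolist())  # [0.5, 1.0, 2.5]
```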