I am trying to add time/duration values from a CSV file that I have but I have failed so far. Here's the sample csv that I'm trying to add up.
Is getting this output possible?
Output:
I have been trying to add up the datetime but I always fail:
finput = open("./Test.csv", "r")
while 1:
    line = finput.readline()
    if not line:
        break
    else:
        user = line.split(delim)[0]
        direction = line.split(delim)[1]
        duration = line.split(delim)[2]
        durationz = 0:00:00
        for k in duration:
            durationz += k
Also:
is there a specific way to declare a time value?
Use datetime.timedelta() objects to model the durations, and pass in the 3 components as seconds, minutes and hours.
Parse your file with the csv module; no point in re-inventing the character-separated-values-parsing wheel here.
Use a dictionary to track In and Out values per user; using a collections.defaultdict() object will make it easier to add new users:
from collections import defaultdict
from datetime import timedelta
import csv

durations = defaultdict(lambda: {'In': timedelta(), 'Out': timedelta()})

with open("./Test.csv", "rb") as inf:
    reader = csv.reader(inf, delimiter=delim)
    for name, direction, duration in reader:
        hours, minutes, seconds = map(int, duration.split(':'))
        duration = timedelta(hours=hours, minutes=minutes, seconds=seconds)
        durations[name][direction] += duration

for name, directions in durations.items():
    print '{:10} In {}'.format(name, directions['In'])
    print ' Out {}'.format(directions['Out'])
    print ' Total {}'.format(
        directions['In'] + directions['Out'])
timedelta() objects, when converted back to strings (such as when printing or formatting with str.format()), are rendered in the h:mm:ss format again.
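As a small illustration of that round trip, here is a sketch in Python 3 syntax. The helper name parse_duration is made up for this example and is not part of the answer's code; the two values are Jonathan's rows from the sample data.

```python
from datetime import timedelta

def parse_duration(text):
    """Turn an 'h:mm:ss' string into a timedelta (hypothetical helper)."""
    hours, minutes, seconds = map(int, text.split(':'))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

# Adding two durations gives another timedelta ...
total = parse_duration('0:00:37') + parse_duration('0:12:19')

# ... and converting it to a string gives h:mm:ss again.
print(total)  # 0:12:56
```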
Demo:
>>> import csv
>>> from collections import defaultdict
>>> from datetime import timedelta
>>> sample = '''\
... Johnny,In,0:02:36
... Kate,Out,0:02:15
... Paul,In,0:03:57
... Chris,In,0:01:26
... Jonathan,In,0:00:37
... Kyle,In,0:06:46
... Armand,Out,0:00:22
... Ryan,In,0:00:51
... Jonathan,Out,0:12:19
... '''.splitlines()
>>> durations = defaultdict(lambda: {'In': timedelta(), 'Out': timedelta()})
>>> reader = csv.reader(sample)
>>> for name, direction, duration in reader:
...     hours, minutes, seconds = map(int, duration.split(':'))
...     duration = timedelta(hours=hours, minutes=minutes, seconds=seconds)
...     durations[name][direction] += duration
...
>>> for name, directions in durations.items():
...     print '{:10} In {}'.format(name, directions['In'])
...     print ' Out {}'.format(directions['Out'])
...     print ' Total {}'.format(
...         directions['In'] + directions['Out'])
...
Johnny In 0:02:36
Out 0:00:00
Total 0:02:36
Kyle In 0:06:46
Out 0:00:00
Total 0:06:46
Ryan In 0:00:51
Out 0:00:00
Total 0:00:51
Chris In 0:01:26
Out 0:00:00
Total 0:01:26
Paul In 0:03:57
Out 0:00:00
Total 0:03:57
Jonathan In 0:00:37
Out 0:12:19
Total 0:12:56
Kate In 0:00:00
Out 0:02:15
Total 0:02:15
Armand In 0:00:00
Out 0:00:22
Total 0:00:22
First, you might find python's built-in csv module to be helpful. Instead of having to manually split your lines and assign data, you can just do the following:
import csv
with open("test.csv", mode="r") as f:
    reader = csv.reader(f)
    for row in reader:
        # this is equivalent to your own variable assignment code,
        # using a cool feature of python called tuple unpacking
        user, direction, duration = row
A dictionary would be a great way to group the data by user. Here is what that might look like:
...
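user_dict = {}
for row in reader:
    user, direction, duration = row
    # dict.get() only accepts its fallback as a second positional argument
    user_dict[user] = user_dict.get(user, {"in": "0:00:00", "out": "0:00:00"})
    # lower() maps the "In"/"Out" values in the file onto the "in"/"out" keys
    user_dict[user][direction.lower()] = duration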
Once that runs through the whole input csv you should have a dictionary containing an entry for every user, with every user entry containing their respective "in" and "out" values. If a user is missing either an in or an out value in the csv, that value stays at "0:00:00", which comes from the fallback passed as the second argument to dict.get().
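A small sketch of the grouping step, assuming Python 3 and a few made-up rows in the file's user, direction, duration layout. Note that dict.get() accepts its fallback value only as a second positional argument; dict.setdefault(), used here as an alternative, also stores the fallback in the dictionary.

```python
# Made-up rows standing in for what csv.reader would yield.
rows = [
    ("Jonathan", "In", "0:00:37"),
    ("Jonathan", "Out", "0:12:19"),
    ("Kate", "Out", "0:02:15"),
]

user_dict = {}
for user, direction, duration in rows:
    # setdefault() returns the existing entry, or stores and returns the fallback
    entry = user_dict.setdefault(user, {"in": "0:00:00", "out": "0:00:00"})
    entry[direction.lower()] = duration

print(user_dict["Kate"])  # {'in': '0:00:00', 'out': '0:02:15'}
```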
We could manually parse the times, but dealing with time addition ourselves would be a huge pain. Luckily, python has a built-in module for dealing with time, called datetime.
import csv
import datetime

user_dict = {}
with open("test.csv", mode="r") as f:
    reader = csv.reader(f)
    for row in reader:
        user, direction, duration = row
        hour, minute, second = duration.split(":")
        # since the split left us with strings, and datetime needs integers, we'll need to cast everything to an int.
        hour = int(hour)
        minute = int(minute)
        second = int(second)
        # (we could have done the above more concisely using a list comprehension, which would look like this:
        # hour, minute, second = [int(time) for time in duration.split(":")] )
        # to add time values we'll use the timedelta class in datetime, which takes days then seconds as its first two arguments.
        # We'll just use seconds, so we'll need to convert the hours and minutes first.
        seconds = second + minute*60 + hour*60*60
        duration = datetime.timedelta(0, seconds)
        # dict.get() only accepts its fallback as a second positional argument
        user_dict[user] = user_dict.get(user, {"in": datetime.timedelta(0,0), "out": datetime.timedelta(0,0)})
        # lower() maps the "In"/"Out" values in the file onto the "in"/"out" keys
        user_dict[user][direction.lower()] = duration
Looking at your example, we're just adding the in time to the out time (though if we wanted total time on the clock we would want to subtract the in time from the out time). We can do the addition part with the following:
output = []
for user, time_dict in user_dict.items():
    total = time_dict["in"] + time_dict["out"]
    output.append([user, time_dict["in"], time_dict["out"], total])

with open("output.csv", mode="w") as f:
    writer = csv.writer(f)
    writer.writerows(output)
That should give you close to what you want, though the output will be a single row for each user -- the data will appear horizontally instead of vertically.
All the code together:
import csv
import datetime

user_dict = {}
with open("test.csv", mode="r") as f:
    reader = csv.reader(f)
    for row in reader:
        user, direction, duration = row
        hour, minute, second = [int(time) for time in duration.split(":")]
        seconds = second + minute*60 + hour*60*60
        duration = datetime.timedelta(0, seconds)
        # dict.get() only accepts its fallback as a second positional argument
        user_dict[user] = user_dict.get(user, {"in": datetime.timedelta(0,0), "out": datetime.timedelta(0,0)})
        # lower() maps the "In"/"Out" values in the file onto the "in"/"out" keys
        user_dict[user][direction.lower()] = duration

output = []
for user, time_dict in user_dict.items():
    total = time_dict["in"] + time_dict["out"]
    output.append([user, time_dict["in"], time_dict["out"], total])

with open("output.csv", mode="w") as f:
    writer = csv.writer(f)
    header = ["name", "time in", "time out", "total time"]
    writer.writerow(header)
    writer.writerows(output)
There are a few things you can fix.
First, you can read every line in your file by doing for line in file.
You can't declare the variable durationz as 0:00:00. It simply doesn't work in python.
One thing you could do is make durationz 0, and parse the time by turning it into an amount of seconds. Some pseudocode:
split duration string by ":"
add 60 * 60 * hours to duration
add 60 * minutes to duration
add seconds to duration
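The pseudocode above could be turned into runnable Python roughly as follows. This is only a sketch: the function name to_seconds and the two sample values are assumptions made for the example, and the last lines show one way to turn the total back into h:mm:ss.

```python
def to_seconds(duration):
    """Convert an 'h:mm:ss' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    return 60 * 60 * hours + 60 * minutes + seconds

durationz = 0
for value in ["0:00:37", "0:12:19"]:
    durationz += to_seconds(value)
print(durationz)  # 776

# Convert the total back into hours, minutes and seconds for display.
minutes, seconds = divmod(durationz, 60)
hours, minutes = divmod(minutes, 60)
formatted = "{}:{:02d}:{:02d}".format(hours, minutes, seconds)
print(formatted)  # 0:12:56
```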
Related
I am very new to Python and need some help finding the maximum value in a column of data (time) that is imported from a CSV file. This is the code I have tried.
file = open ("results.csv")
unneeded = file.readline()
for line in file:
    data = file.readline ()
    linelist = line.split(",")
    hours = linelist[4]
    maxtime = 0
    for x in hours:
        if x > maxtime:
            maxtime = x
print (maxtime)
any help is appreciated
edit: i tried this code but it gives me the wrong answer :(
file = open ("results.csv")
unneeded = file.readline()
maxtime = 0
for line in file:
    data = file.readline ()
    linelist = line.split(",")
    hours = linelist[4]
    if hours > str(maxtime):
        maxtime = hours
print (maxtime)
[first few lines of results][1]
edit:
results cvs
[1]: https://i.stack.imgur.com/z3pEJ.png
I haven't tested it but this should work. Using the CSV library is easy for parsing CSV files.
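import csv

with open("results.csv") as file:
    csv_reader = csv.reader(file, delimiter=',')
    next(csv_reader)  # skip the header row, as the question's code does
    maxtime = 0  # initialise once, before the loop, so it is not reset for every row
    for row in csv_reader:
        # assumes column 4 holds plain numbers; compare numbers, not strings
        hours = float(row[4])
        if hours > maxtime:
            maxtime = hours
print (maxtime)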
My recommendation is using the pandas module for anything CSV-related.
Using dateutil, I create a dataset of dates and identification numbers whose date values are shuffled (no specific order) like so:
from dateutil.parser import *
from dateutil.rrule import *
from random import shuffle
dates = list(rrule(
DAILY,
dtstart=parse("19970902T090000"),
until=parse("19971224T000000")
))
shuffle(dates)
with open('out.csv', mode='w', encoding='utf-8') as out:
    for i,date in enumerate(dates):
        out.write(f'{i},{date}\n')
In this particular dataset, 1997-12-23 09:00:00 is the "largest" date. To extract the maximum date, we can simply compare the values as strings, provided they are formatted in the ISO 8601 date/time format:
from pandas import read_csv
# out.csv has no header row, so header=None keeps every row as data
df = read_csv('out.csv', names=['index', 'time'], header=None)
print(max(df['time']))
After running it, we indeed get 1997-12-23 09:00:00 printed in the terminal!
I am fairly new to Python and am having some issues reading in my CSV file. Each row has a sensor name, a datestamp and a reading. The same sensor name appears many times, so I have already built a list of the distinct names, called OPTIONS, shown below:
OPTIONS = []
with open('sensor_data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter = ',')
    for row in reader:
        if row[0] not in OPTIONS:
            OPTIONS.append(row[0])
        sensor_name = row[0]
        datastamp = row[1]
        readings = float(row[2])
print(OPTIONS)
OPTIONS prints fine, but now I am having issues retrieving any readings and using them to calculate the average and maximum reading for each unique sensor name.
here are a few lines of sensor_data.csv, which goes from 2018-01-01 to 2018-12-31 for sensor_1 to sensor_25.
Any help would be appreciated.
What you have in the readings variable is just the reading of the current row. One way to get the average is to keep track of the sum and count of readings (sum_readings and count_readings respectively) and, after the for loop, divide the sum by the count. You can get the maximum by initializing a max_readings variable with the smallest possible reading (I assume 0) and updating it whenever the current reading is larger (max_readings < readings).
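A minimal sketch of that running sum, count and maximum, kept per sensor name. It is written for Python 3, the sample rows are made up for the example (they stand in for sensor_data.csv), and the maximum is seeded with the first reading instead of 0 so that no minimum value has to be assumed.

```python
import csv
import io

# Stand-in for sensor_data.csv; these rows are invented for the example.
sample = io.StringIO(
    "sensor_1,2018-01-01,2.5\n"
    "sensor_2,2018-01-01,4.0\n"
    "sensor_1,2018-01-02,3.5\n"
)

stats = {}  # sensor name -> [sum of readings, count of readings, max reading]
for sensor_name, datestamp, reading in csv.reader(sample):
    reading = float(reading)
    if sensor_name not in stats:
        stats[sensor_name] = [reading, 1, reading]
    else:
        entry = stats[sensor_name]
        entry[0] += reading
        entry[1] += 1
        entry[2] = max(entry[2], reading)

for sensor_name, (total, count, highest) in stats.items():
    print(sensor_name, 'average:', total / count, 'max:', highest)
```

Keeping only three numbers per sensor avoids holding a full year of readings in memory, at the cost of not being able to compute other statistics later.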
import csv

OPTIONS = []
OPTIONS_READINGS = {}

with open('sensor_data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter = ',')
    for row in reader:
        if row[0] not in OPTIONS:
            OPTIONS.append(row[0])
            OPTIONS_READINGS[row[0]] = []
        sensor_name = row[0]
        datastamp = row[1]
        readings = float(row[2])
        # collect every reading under its sensor name
        OPTIONS_READINGS[row[0]].append(readings)
print(OPTIONS)

for option in OPTIONS_READINGS:
    print(option)
    readings = OPTIONS_READINGS[option]
    print('Max readings:', max(readings))
    print('Average readings:', sum(readings) / len(readings))
Edit: Sorry, I misread the question. If you want the maximum and average for each unique option, a more straightforward way is to use an additional dictionary, OPTIONS_READINGS, whose keys are the option names and whose values are the lists of readings. You can then find the maximum and average reading of an option with the expressions max(OPTIONS_READINGS[option]) and sum(OPTIONS_READINGS[option]) / len(OPTIONS_READINGS[option]) respectively.
A shorter version below
import csv
from collections import defaultdict
readings = defaultdict(list)
with open('sensor_data.csv', 'r') as f:
    reader = csv.reader(f, delimiter = ',')
    for row in reader:
        readings[row[0]].append(float(row[2]) )

for sensor_name,values in readings.items():
    print('Sensor: {}, Max readings: {}, Avg: {}'.format(sensor_name,max(values), sum(values)/ len(values)))
Below is some code written to open a CSV file. Its values are stored like this:
03/05/2017 09:40:19,21.2,35.0
03/05/2017 09:40:27,21.2,35.0
03/05/2017 09:40:38,21.1,35.0
03/05/2017 09:40:48,21.1,35.0
This is just a snippet of code I use in a real-time plotting program. It works, but the arrays keep growing, which is unclean: new values get added to the CSV while the program is running, so the arrays become very long. Is there a way to avoid ever-growing arrays like this?
To see the problem, create a CSV with the values above and run the program.
from datetime import datetime
import time
y = [] #temperature
t = [] #time object
h = [] #humidity
def readfile():
    readFile = open('document.csv', 'r')
    sepFile = readFile.read().split('\n')
    readFile.close()
    for idx, plotPair in enumerate(sepFile):
        if plotPair in '. ':
            # skip. or space
            continue
        if idx > 1: # to skip the first line
            xAndY = plotPair.split(',')
            time_string = xAndY[0]
            time_string1 = datetime.strptime(time_string, '%d/%m/%Y %H:%M:%S')
            t.append(time_string1)
            y.append(float(xAndY[1]))
            h.append(float(xAndY[2]))
    print([y])

while True:
    readfile()
    time.sleep(2)
This is the output I get:
[[21.1]]
[[21.1, 21.1]]
[[21.1, 21.1, 21.1]]
[[21.1, 21.1, 21.1, 21.1]]
[[21.1, 21.1, 21.1, 21.1, 21.1]]
Any help is appreciated.
You can use Python's deque if you also want to limit the total number of entries you keep. It gives you a list-like object with a maximum length: once it is full, each new entry pushes the oldest entry off the start.
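A quick sketch of that behaviour, using a maximum length of 3 and a handful of made-up temperature readings:

```python
from collections import deque

recent = deque(maxlen=3)  # keep only the 3 newest entries
for reading in [21.2, 21.2, 21.1, 21.1, 21.0]:
    recent.append(reading)  # once full, the oldest entry is dropped

print(list(recent))  # [21.1, 21.1, 21.0]
```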
Your list is growing because each call appends the whole file again. You need to re-read the file up to the point of your last entry and only then continue adding new entries. Assuming your timestamps are unique, you could use takewhile() to help with this; it reads entries for as long as a condition holds.
from itertools import takewhile
from collections import deque
from datetime import datetime
import csv
import time
max_length = 1000 # keep this many entries
t = deque(maxlen=max_length) # time object
y = deque(maxlen=max_length) # temperature
h = deque(maxlen=max_length) # humidity
def read_file():
    with open('document.csv', newline='') as f_input:
        csv_input = csv.reader(f_input)
        header = next(csv_input) # skip over the header line

        # If there are existing entries, read until the last read item is found again
        if len(t):
            list(takewhile(lambda row: datetime.strptime(row[0], '%d/%m/%Y %H:%M:%S') != t[-1], csv_input))

        for row in csv_input:
            print(row)
            t.append(datetime.strptime(row[0], '%d/%m/%Y %H:%M:%S'))
            y.append(float(row[1]))
            h.append(float(row[2]))

while True:
    read_file()
    print(t)
    time.sleep(1)
Also, it is easier to work with the entries using Python's built in csv library to read each of the values into a list for each row. As you have a header row, read this in using next() before starting the loop.
I have a Python module called HourlyCsvGeneration.py. I have some data, generated on an hourly basis, which is in sample.txt. Here is a sample of the data in sample.txt:
2014-07-24 15:00:00,1,1,1,1,1001
2014-07-24 15:01:00,1,1,1,1,1001
2014-07-24 15:02:00,1,1,1,1,1001
2014-07-24 15:15:00,1,1,1,1,1001
2014-07-24 15:16:00,1,1,1,1,1001
2014-07-24 15:17:00,1,1,1,1,1001
2014-07-24 15:30:00,1,1,1,1,1001
2014-07-24 15:31:00,1,1,1,1,1001
2014-07-24 15:32:00,1,1,1,1,1001
2014-07-24 15:45:00,1,1,1,1,1001
2014-07-24 15:46:00,1,1,1,1,1001
2014-07-24 15:47:00,1,1,1,1,1001
As you can see, there are 4 intervals (00-15, 15-30, 30-45 and 45-59), after which the next hour starts, and so on. I am writing code that reads the data in this txt file and generates 4 CSV files for every hour of the day. For the data above, the 4 CSV files should follow a naming convention like 2014-07-24 15:00.csv, containing the data between 15:00 and 15:15, 2014-07-24 15:15.csv, containing the data between 15:15 and 15:30, and so on for every hour. The Python code must handle all this.
Here is my current code snippet:-
import csv
def connection():
    fo = open("sample.txt", "r")
    data = fo.readlines()
    header = ['tech', 'band', 'region', 'market', 'code']
    for line in data:
        line = line.strip("\n")
        line = line.split(",")
        time = line[0]
        lines = [x for x in time.split(':') if x]
        i = len(lines)
        if i == 0:
            continue
        else:
            hour, minute, sec = lines[0], lines[1], lines[2]
            minute = int(minute)
            if minute >= 0 and minute < 15:
                print hour, minute
                print line[1:]
            elif minute >= 15 and minute < 30:
                print hour, minute
                print line[1:]
            elif minute >= 30 and minute < 45:
                print hour, minute
                print line[1:]
            elif minute >=45 and minute < 59:
                print hour, minute
                print line[1:]

connection()
line[1:] gives the right data for each interval, but I am kind of stuck on generating the CSV files and writing the data. So instead of printing line[1:], I want it to be written to the CSV file of that interval, named according to the convention described above.
Expected output:-
2014-07-24 15:00.csv must contain
1,1,1,1,1001
1,1,1,1,1001
1,1,1,1,1001
2014-07-24 15:15.csv must contain
1,1,1,1,1001
1,1,1,1,1001
1,1,1,1,1001
and so on for 2014-07-24 15:30.csv and 2014-07-24 15:45.csv. Keep in mind this is just a small chunk of data; the actual data covers every hour of the day. That means generating 4 CSV files for each hour, which is 24*4 files for one day. So how can I make my code more robust and efficient?
Any help? Thanks
Seems like a job for itertools.groupby, if the timestamps are strictly increasing in value:
from datetime import datetime as DateTime
from itertools import imap, groupby
from operator import itemgetter
get_first = itemgetter(0)
get_second = itemgetter(1)
def process_line(line):
    timestamp_string, _, values = line.partition(',')
    timestamp = DateTime.strptime(timestamp_string, '%Y-%m-%d %H:%M:%S')
    return (
        timestamp.replace(minute=timestamp.minute // 15 * 15, second=0),
        values
    )

def main():
    with open('sample.txt', 'r') as lines:
        for date, group in groupby(imap(process_line, lines), get_first):
            with open('{0:%Y-%m-%d %H_%M}.csv'.format(date), 'w') as out_file:
                out_file.writelines(imap(get_second, group))

if __name__ == '__main__':
    main()
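The code above is Python 2 (itertools.imap does not exist in Python 3, where the built-in map is already lazy). As a rough Python 3 sketch of the same grouping idea, the example below works on a few in-memory lines instead of sample.txt and collects the values per file name in a dictionary rather than writing files. The names interval_start and buckets are made up for the example.

```python
from datetime import datetime
from itertools import groupby

def interval_start(line):
    """Return the start of the 15-minute interval a line falls into."""
    timestamp = datetime.strptime(line.split(',', 1)[0], '%Y-%m-%d %H:%M:%S')
    return timestamp.replace(minute=timestamp.minute // 15 * 15, second=0)

lines = [
    '2014-07-24 15:00:00,1,1,1,1,1001',
    '2014-07-24 15:02:00,1,1,1,1,1001',
    '2014-07-24 15:15:00,1,1,1,1,1001',
    '2014-07-24 15:47:00,1,1,1,1,1001',
]

# groupby() only merges neighbouring lines, so the input must be in time order.
buckets = {
    '{:%Y-%m-%d %H_%M}.csv'.format(start): [line.split(',', 1)[1] for line in group]
    for start, group in groupby(lines, key=interval_start)
}

for file_name, values in buckets.items():
    print(file_name, values)
```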
Your problem is not trivial, because if you try to open all the output files at once, then you will run out of file descriptors and crash. So what you want to do is open a file in append mode, write one line, then close the file. This isn't a horribly efficient operation, so I wouldn't worry about efficiency yet.
outfile = open("2014-07-24 15:00.csv","a")
outfile.write("csv, line, data\n")
outfile.close()
I'd recommend using pandas for this. It takes care of a whole bunch of the dirty work for you.
from datetime import datetime
import pandas as pd

df = pd.read_table('DummyText.txt',sep=',',index_col=0,parse_dates=True,header=None)
# pd.datetime is no longer available in recent pandas releases, so use the standard datetime class
fname = (str(datetime(2014,7,24,15,0))+'.csv').replace(':','-')
df[datetime(2014,7,24,15,0):datetime(2014,7,24,15,15)].to_csv(fname,header=False)
I took the : out of the filename. It didn't seem to like that.
All you need to do with the above is setup some loops to cycle through the dates and times.
Here are some ideas that might help:
import csv
from datetime import datetime

def get_interval_start(minute_of_hour):
    # round down to the start of the 15-minute interval: 0, 15, 30 or 45
    return (minute_of_hour // 15) * 15

def connection():
    with open('some.csv', newline='') as f:
        reader = csv.reader(f)
        for row in reader:
            dateObject = datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S')
            interval_start = get_interval_start(dateObject.minute)
            newdate = dateObject.replace(minute=interval_start, second=0)
            file_name_of_new_csv = "%s.csv" % newdate.strftime("%Y-%m-%d %H:%M")
            # csv.writer() needs an open file object, not a file name; append so
            # that rows of the same interval end up in the same file
            with open(file_name_of_new_csv, 'a', newline='') as out:
                new_csv_writer = csv.writer(out, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
                new_csv_writer.writerow(row[0:])  # use row[1:] to leave out the timestamp

def main():
    connection()

if __name__=="__main__":
    main()
Hope it helps
Sorry. left new_csv_writer open.
In my CSV file, shown below, I'm trying to write a bit of code that cycles through the rows and finds the last unique class (a unique class is element 0 and element 1 merged together). In my example the last unique class is Class02 and CD1, which gives Class02CD1.
It then needs to look at element 4 for the students in that unique class (in this example there are only 2 students). If a number exists in any of them, it needs to get the time from element 2 and compare it to the current time. If the current time is 30 minutes or more after the time given, it should print the word "Late".
CSV File
uniq1,uniq2,three,four,five,six
Class01,CD2,data,data,,data
Class01,CD2,data,data,22,data
Class01,CD2,data,data,,data
Class01,CD2,data,data,4,data
Class02,CD3,data,data,,data
Class02,CD3,data,data,,data
Class02,CD3,data,data,,data
Class02,CD3,data,data,,data
Class02,CD3,data,data,,data
Class02,CD3,data,data,3,data
Class02,CD3,data,data,,data
DClass2,DE2,data,data,133,data
DClass2,DE2,data,data,24,data
Class02,CD1,13:01,data,,data
Class02,CD1,13:05,data,1,data
Cycle through unique elements to find the last class.
Check if a number exists in that unique class.
If the number exists then it should get the time from element 2
If the number exists and we've got a time then we should compare it to the current time.
If the current time is 30 minutes or more after the time in the last class then it should print "late", else do nothing.
Anyone know how to solve this, I'm completely lost and don't have a clue how to do this.
Prints
Class02CD1
Current: 2013-11-11 12:07:37.635000
Fetched: 2013-11-11 13:05:00
Calculated: 1382.61666667
late
First iterate over the file and find out the class on the last line, and then iterate over the file again and now if the class name on a given line equals the stored class name then apply your conditions there and use datetime module to find out the time difference.
from datetime import datetime
import csv
import time

with open('abc1') as f:
    reader = csv.reader(f, delimiter=',')
    for line in reader:
        pass
    class_name = ''.join(line[:2]) # save the name on the last line
    print class_name
    f.seek(0) # reset the file pointer to the start of the file
    for line in reader:
        cls_name = ''.join(line[:2])
        if cls_name == class_name:
            if line[-2]:
                current_dtime = datetime.now()
                fetched_time = datetime.strptime(line[2], '%H:%M')
                fetched_time = datetime(year=current_dtime.year,
                                        month = current_dtime.month,
                                        day = current_dtime.day,
                                        hour = fetched_time.hour,
                                        minute = fetched_time.minute
                                        )
                # total_seconds() keeps the sign; .seconds wraps around to a large
                # positive value when the fetched time is still in the future
                if ((current_dtime - fetched_time).total_seconds() / 60.0) > 30.0:
                    print "late"
This is the answer up to the time comparison part at the end.
import csv

rows_by_class = {}  # renamed from dict so the built-in type is not shadowed
key = False
with open('test.csv', mode='r') as infile:
    reader = csv.reader(infile,)
    first = True
    for row in reader:
        if first:
            # skip the header row
            first = False
            continue
        key = row[0]+row[1]
        if key in rows_by_class:
            rows_by_class[key].append(row)
        else:
            rows_by_class[key] = [row]

print rows_by_class
if(key):
    for row in rows_by_class[key]:
        # do time comparison
        pass
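The time comparison that the placeholder above leaves open could look roughly like the sketch below (Python 3 syntax). The function name is_late and the fixed "now" value are assumptions made for the example; real code would pass datetime.now(). Comparing timedelta objects directly avoids the wrap-around that the .seconds attribute shows for negative differences.

```python
from datetime import datetime, timedelta

def is_late(time_text, now, limit=timedelta(minutes=30)):
    """Return True if `now` is at least `limit` past the 'HH:MM' time given."""
    fetched = datetime.strptime(time_text, '%H:%M')
    # Put the fetched time on the same day as `now` before comparing.
    fetched = now.replace(hour=fetched.hour, minute=fetched.minute,
                          second=0, microsecond=0)
    return now - fetched >= limit

# The two rows of the last class from the sample file.
rows = [
    ['Class02', 'CD1', '13:01', 'data', '', 'data'],
    ['Class02', 'CD1', '13:05', 'data', '1', 'data'],
]
now = datetime(2013, 11, 11, 13, 40)  # fixed value so the example is repeatable

for row in rows:
    # Only rows with a number in element 4 are checked.
    if row[4] and is_late(row[2], now):
        print("Late")
```

With the fixed time used here, only the second row has a number in element 4, and 13:40 is 35 minutes after 13:05, so "Late" is printed once.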