Generating multiple CSV files on an hourly basis in Python - python

I have a Python module called HourlyCsvGeneration.py. I have some data which is being generated on an hourly basis and which is in sample.txt. Here is a sample of the data in sample.txt:
2014-07-24 15:00:00,1,1,1,1,1001
2014-07-24 15:01:00,1,1,1,1,1001
2014-07-24 15:02:00,1,1,1,1,1001
2014-07-24 15:15:00,1,1,1,1,1001
2014-07-24 15:16:00,1,1,1,1,1001
2014-07-24 15:17:00,1,1,1,1,1001
2014-07-24 15:30:00,1,1,1,1,1001
2014-07-24 15:31:00,1,1,1,1,1001
2014-07-24 15:32:00,1,1,1,1,1001
2014-07-24 15:45:00,1,1,1,1,1001
2014-07-24 15:46:00,1,1,1,1,1001
2014-07-24 15:47:00,1,1,1,1,1001
As you can see there are 4 intervals per hour: 00-15, 15-30, 30-45 and 45-59, and then the next hour starts, and so on. I am writing code that reads the data in this txt file and generates 4 CSV files for every hour in a day. So, analysing the above data, the 4 CSV files generated should follow a naming convention like 2014-07-24 15:00.csv, containing the data between 15:00 and 15:15, 2014-07-24 15:15.csv, containing the data between 15:15 and 15:30, and so on for every hour. The Python code must handle all this.
Here is my current code snippet:-
import csv

def connection():
    fo = open("sample.txt", "r")
    data = fo.readlines()
    header = ['tech', 'band', 'region', 'market', 'code']
    for line in data:
        line = line.strip("\n")
        line = line.split(",")
        time = line[0]
        lines = [x for x in time.split(':') if x]
        i = len(lines)
        if i == 0:
            continue
        else:
            hour, minute, sec = lines[0], lines[1], lines[2]
            minute = int(minute)
            if minute >= 0 and minute < 15:
                print hour, minute
                print line[1:]
            elif minute >= 15 and minute < 30:
                print hour, minute
                print line[1:]
            elif minute >= 30 and minute < 45:
                print hour, minute
                print line[1:]
            elif minute >= 45 and minute < 59:
                print hour, minute
                print line[1:]

connection()
line[1:] gives the right data for each interval, and I am kind of stuck generating the CSV files and writing the data. So instead of printing line[1:], I want it written to the CSV file for that interval, with the appropriate naming convention as explained in the description above.
Expected output:-
2014-07-24 15:00.csv must contain
1,1,1,1,1001
1,1,1,1,1001
1,1,1,1,1001
2014-07-24 15:15.csv must contain
1,1,1,1,1001
1,1,1,1,1001
1,1,1,1,1001
and so on for 15:30.csv and 15:45.csv. Keep in mind this is just a small chunk of data; the actual data covers every hour of the day, meaning 4 CSV files are generated for each hour, i.e. 24*4 files for one day. So how can I make my code more robust and efficient?
Any help? Thanks

Seems like a job for itertools.groupby, if the timestamps are strictly increasing in value:
from datetime import datetime as DateTime
from itertools import imap, groupby
from operator import itemgetter

get_first = itemgetter(0)
get_second = itemgetter(1)

def process_line(line):
    timestamp_string, _, values = line.partition(',')
    timestamp = DateTime.strptime(timestamp_string, '%Y-%m-%d %H:%M:%S')
    return (
        timestamp.replace(minute=timestamp.minute // 15 * 15, second=0),
        values
    )

def main():
    with open('sample.txt', 'r') as lines:
        for date, group in groupby(imap(process_line, lines), get_first):
            with open('{0:%Y-%m-%d %H_%M}.csv'.format(date), 'w') as out_file:
                out_file.writelines(imap(get_second, group))

if __name__ == '__main__':
    main()
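One caveat worth noting (not stated in the answer itself): groupby only merges consecutive equal keys, so if the input file is not sorted by timestamp, the same quarter-hour can come out as several separate groups, and the 'w' mode above would then overwrite earlier data. A tiny illustration:

from itertools import groupby

# groupby only merges *consecutive* equal keys, so unsorted input
# splits what should be one group into several:
print([k for k, _ in groupby([1, 1, 2, 1])])   # [1, 2, 1]
print([k for k, _ in groupby([1, 1, 1, 2])])   # [1, 2]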

Your problem is not trivial, because if you try to open all the output files at once, then you will run out of file descriptors and crash. So what you want to do is open a file in append mode, write one line, then close the file. This isn't a horribly efficient operation, so I wouldn't worry about efficiency yet.
outfile = open("2014-07-24 15:00.csv","a")
outfile.write("csv, line, data\n")
outfile.close()
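A minimal, self-contained sketch of that append-per-line idea (the helper name and the exact file-name format here are illustrative, not from the question):

# Derive the quarter-hour file name from a line's timestamp, then append
# the remaining fields to that file. ':' is avoided in the name so it also
# works on Windows.
def append_to_interval_file(raw_line):
    timestamp, _, values = raw_line.strip().partition(",")
    date_part, time_part = timestamp.split(" ")
    hour, minute, _ = time_part.split(":")
    interval = int(minute) // 15 * 15
    out_name = "%s %s-%02d.csv" % (date_part, hour, interval)
    with open(out_name, "a") as out_file:
        out_file.write(values + "\n")

append_to_interval_file("2014-07-24 15:16:00,1,1,1,1,1001")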

I'd recommend using pandas for this. It takes care of a whole bunch of the dirty work for you.
import pandas as pd
df = pd.read_table('DummyText.txt',sep=',',index_col=0,parse_dates=True,header=None)
fname = (str(pd.datetime(2014,7,24,15,0))+'.csv').replace(':','-')
df[pd.datetime(2014,7,24,15,0):pd.datetime(2014,7,24,15,15)].to_csv(fname,header=None)
I took the : out of the filename. It didn't seem to like that.
All you need to do with the above is set up some loops to cycle through the dates and times.
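Rather than hand-rolling those loops, pandas can also bucket a datetime index into 15-minute groups itself; a sketch, assuming df is the frame read in the snippet above:

import pandas as pd

df = pd.read_table('DummyText.txt', sep=',', index_col=0, parse_dates=True, header=None)

# pd.Grouper buckets the DatetimeIndex into fixed 15-minute intervals;
# empty intervals are skipped.
for interval_start, chunk in df.groupby(pd.Grouper(freq='15min')):
    if chunk.empty:
        continue
    fname = (str(interval_start) + '.csv').replace(':', '-')
    chunk.to_csv(fname, header=None)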

Here are some ways that might help:
import csv
from datetime import datetime

def get_higher_minute(minute_of_day):
    return (((minute_of_day / 15) + 1) % 4) * 15

def connection():
    with open('some.csv', 'rb') as f:
        reader = csv.reader(f)
        for row in reader:
            dateObject = datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S')
            minute_of_day = dateObject.minute
            higher_minute = get_higher_minute(minute_of_day)
            newdate = dateObject.replace(minute=higher_minute)
            file_name_of_new_csv = "%s.csv" % newdate.strftime("%Y-%m-%d %H:%M")
            # csv.writer needs an open file object, not a file name;
            # open in append mode so rows from the same interval share one file
            with open(file_name_of_new_csv, 'ab') as out_file:
                new_csv_writer = csv.writer(out_file, delimiter=',',
                                            quotechar='"', quoting=csv.QUOTE_ALL)
                new_csv_writer.writerow(row[1:])  # drop the timestamp, keep the data columns

def main():
    connection()

if __name__ == "__main__":
    main()
Hope it helps.

Related

Python: daily average function

I am trying to make a function that returns a list/array with the daily averages for a variable from any one of 3 CSV files.
Each csv file is similar to this:
date, time, variable1, variable2, variable3
2021-01-01,01:00:00,1.43738,25.838,22.453
2021-01-01,02:00:00,2.08652,21.028,19.099
2021-01-01,03:00:00,1.39101,23.18,20.925
2021-01-01,04:00:00,0.76506,22.053,19.974
The dates cover the entire year of 2021 in increments of 1 hour.
def daily_average(data, station, variable):
The function has 3 parameters:
data
station: One of the 3 csv files
variable: Either variable 1 or 2 or 3
Libraries such as datetime, calendar and numpy can be used
Pandas can also be used
Well.
First of all - try to do it yourself before asking a question.
It will help you to learn things.
But now to your question.
csv_lines_test = [
    "date, time, variable1, variable2, variable3\n",
    "2021-01-01,01:00:00,1.43738,25.838,22.453\n",
    "2021-01-01,02:00:00,2.08652,21.028,19.099\n",
    "2021-01-01,03:00:00,1.39101,23.18,20.925\n",
    "2021-01-01,04:00:00,0.76506,22.053,19.974\n",
]

import datetime as dt

def daily_average(csv_lines, date, variable_num):
    # variable_num should be 1-3
    avg_arr = []
    # Read csv file line by line.
    for i, line in enumerate(csv_lines):
        if i == 0:
            # Skip headers
            continue
        line = line.rstrip()
        values = line.split(',')
        date_csv = dt.datetime.strptime(values[0], "%Y-%m-%d").date()
        val_arr = [float(val) for val in values[2:]]
        if date == date_csv:
            avg_arr.append(val_arr[variable_num-1])
    return sum(avg_arr) / len(avg_arr)

avg = daily_average(csv_lines_test, dt.date(2021, 1, 1), 1)
print(avg)
If you want to read data directly from csv file:
with open("csv_file_path.csv", 'r') as f:
data = [line for line in f]
avg = daily_average(data, dt.date(2021, 1, 1), 1)
print(avg)
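Since the question also allows pandas, here is an alternative sketch; the column names are taken from the sample header, and the file path is just a placeholder:

import pandas as pd

# skipinitialspace strips the spaces after the commas in the header row
df = pd.read_csv("csv_file_path.csv", skipinitialspace=True)

# Group the rows by date and average the chosen variable for each day
daily_means = df.groupby("date")["variable1"].mean()
print(daily_means)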

How to use python to find the maximum value in a csv file list

I am very new to Python and need some help finding the maximum/highest value in a column of data (time) that is imported from a CSV file. This is the code I have tried:
file = open ("results.csv")
unneeded = file.readline()

for line in file:
    data = file.readline()
    linelist = line.split(",")
    hours = linelist[4]
    maxtime = 0
    for x in hours:
        if x > maxtime:
            maxtime = x

print (maxtime)
Any help is appreciated.
Edit: I tried this code but it gives me the wrong answer :(
file = open ("results.csv")
unneeded = file.readline()
maxtime = 0

for line in file:
    data = file.readline()
    linelist = line.split(",")
    hours = linelist[4]
    if hours > str(maxtime):
        maxtime = hours

print (maxtime)
Edit: [first few lines of results.csv][1]
[1]: https://i.stack.imgur.com/z3pEJ.png
I haven't tested it but this should work. Using the CSV library is easy for parsing CSV files.
import csv

maxtime = 0
with open("results.csv") as file:
    csv_reader = csv.reader(file, delimiter=',')
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        hours = float(row[4])  # compare numerically, not as strings
        if hours > maxtime:
            maxtime = hours

print(maxtime)
My recommendation is using the pandas module for anything CSV-related.
Using dateutil, I create a dataset of dates and identification numbers whose date values are shuffled (no specific order) like so:
from dateutil.parser import *
from dateutil.rrule import *
from random import shuffle

dates = list(rrule(
    DAILY,
    dtstart=parse("19970902T090000"),
    until=parse("19971224T000000")
))
shuffle(dates)

with open('out.csv', mode='w', encoding='utf-8') as out:
    for i, date in enumerate(dates):
        out.write(f'{i},{date}\n')
Thus, in this particular dataset, 1997-12-23 09:00:00 would be the "largest" date. Then, to extract the maximum date, we can just do it via string comparisons if it is formatted in the ISO 8601 date/time format:
from pandas import read_csv

df = read_csv('out.csv', names=['index', 'time'], header=None)
print(max(df['time']))
After running it, we indeed get 1997-12-23 09:00:00 printed in the terminal!
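This works because ISO 8601 timestamps sort lexicographically in the same order as chronologically. If the column were in a different format, parsing it first would be safer; a sketch:

from pandas import read_csv, to_datetime

df = read_csv('out.csv', names=['index', 'time'], header=None)
# Convert the strings to real timestamps before taking the maximum
print(to_datetime(df['time']).max())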

Arrays' Length exploding when appending Values to it from CSV

Below is some code written to open a CSV file. Its values are stored like this:
03/05/2017 09:40:19,21.2,35.0
03/05/2017 09:40:27,21.2,35.0
03/05/2017 09:40:38,21.1,35.0
03/05/2017 09:40:48,21.1,35.0
This is just a snippet of the code I use in a real-time plotting program, which fully works, but the arrays growing this large is messy. Normally new values keep being added to the CSV while the program is running, and the arrays get very long. Is there a way to avoid the arrays exploding like this?
To reproduce, just run the program (you will have to make a CSV with those values too) and you will see my problem.
from datetime import datetime
import time

y = []  # temperature
t = []  # time object
h = []  # humidity

def readfile():
    readFile = open('document.csv', 'r')
    sepFile = readFile.read().split('\n')
    readFile.close()
    for idx, plotPair in enumerate(sepFile):
        if plotPair in '. ':
            # skip '.' or space
            continue
        if idx > 1:  # to skip the first line
            xAndY = plotPair.split(',')
            time_string = xAndY[0]
            time_string1 = datetime.strptime(time_string, '%d/%m/%Y %H:%M:%S')
            t.append(time_string1)
            y.append(float(xAndY[1]))
            h.append(float(xAndY[2]))
            print([y])

while True:
    readfile()
    time.sleep(2)
This is the output I get:
[[21.1]]
[[21.1, 21.1]]
[[21.1, 21.1, 21.1]]
[[21.1, 21.1, 21.1, 21.1]]
[[21.1, 21.1, 21.1, 21.1, 21.1]]
Any help is appreciated.
You can use Python's deque if you also want to limit the total number of entries you wish to keep. It produces a list which features a maximum length. Once the list is full, any new entries push the oldest entry off the start.
The reason your list is growing is that you re-read the whole file on every call; you need to skip ahead to the point of your last entry before continuing to add new entries. Assuming your timestamps are unique, you could use takewhile() to help you do this, which consumes entries until a condition is met.
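To illustrate that behaviour with a toy example (not part of the answer's code): takewhile() consumes items from the shared iterator up to and including the first one that fails the predicate, so a loop that follows resumes right after it:

from itertools import takewhile

it = iter([1, 2, 3, 4, 5])
consumed = list(takewhile(lambda x: x != 3, it))  # pulls 1, 2 and also 3 (the failing item)
print(consumed)   # [1, 2]
print(list(it))   # [4, 5] - the remaining items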
from itertools import takewhile
from collections import deque
from datetime import datetime
import csv
import time

max_length = 1000  # keep this many entries

t = deque(maxlen=max_length)  # time object
y = deque(maxlen=max_length)  # temperature
h = deque(maxlen=max_length)  # humidity

def read_file():
    with open('document.csv', newline='') as f_input:
        csv_input = csv.reader(f_input)
        header = next(csv_input)  # skip over the header line
        # If there are existing entries, read until the last read item is found again
        if len(t):
            list(takewhile(lambda row: datetime.strptime(row[0], '%d/%m/%Y %H:%M:%S') != t[-1], csv_input))
        for row in csv_input:
            print(row)
            t.append(datetime.strptime(row[0], '%d/%m/%Y %H:%M:%S'))
            y.append(float(row[1]))
            h.append(float(row[2]))

while True:
    read_file()
    print(t)
    time.sleep(1)
Also, it is easier to work with the entries using Python's built in csv library to read each of the values into a list for each row. As you have a header row, read this in using next() before starting the loop.
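For illustration, this is the maxlen behaviour described above:

from collections import deque

d = deque(maxlen=3)
for value in [1, 2, 3, 4, 5]:
    d.append(value)
print(d)  # deque([3, 4, 5], maxlen=3) - the oldest entries were pushed out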

Writing Data to csv file to a single row

I need to write data to a single row, column by column, as the data comes in from the serial port. So far I have been reading and writing row by row.
But the requirement is that each new value needs to be written to the next column of a single row, in Python.
I need to end up with a file like this:
1,2,33,43343,4555,344323
That is, all the data should be in a single row across multiple columns, not in one column across multiple rows.
At the moment my code writes the data one value per row:
1
12
2222
3234
1233
131
but I want
1 , 12 , 2222 , 3234 , 1233 , 131
i.e. everything on a single row, across multiple columns.
import serial
import time
import csv

ser = serial.Serial('COM29', 57600)
timeout = time.time() + 60/6   # 5 minutes from now

while True:
    test = 0
    if test == 5 or time.time() > timeout:
        break
    ss = ser.readline()
    print ss
    s = ss.replace("\n", "")
    with open('C:\Users\Ivory Power\Desktop\EEG_Data\Othr_eeg\egg31.csv', 'ab') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',', lineterminator='\n')
        spamwriter.writerow([s])
    csvfile.close()
    time.sleep(0.02)
The csv module writes rows - every time you call writerow a newline is written and a new row is started. So, you can't call it multiple times and expect to get columns. You can, however, collect the data into a list and then write that list when you are done. The csv module is overkill for this.
import serial
import time
import csv

ser = serial.Serial('COM29', 57600)
timeout = time.time() + 60/6   # 5 minutes from now

data = []
for _ in range(5):
    data.append(ser.readline().strip())
    time.sleep(0.02)

with open('C:\Users\Ivory Power\Desktop\EEG_Data\Othr_eeg\egg31.csv', 'ab') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',', lineterminator='\n')
    spamwriter.writerow(data)

    # csv is overkill here unless the data itself contains commas
    # that need to be escaped. You could do this instead.
    # csvfile.write(','.join(data) + '\n')
UPDATE
One of the tricks to question writing here is to supply a short, runnable example of the problem. That way, everybody runs the same thing and you can talk about what's wrong in terms of code and output everyone can play with.
Here is the program updated with mock data. I changed the open to "wb" so that the file is overwritten if it already exists when the program runs. Run it and let me know how its results differ from what you want.
import csv
import time

filename = 'deleteme.csv'
test_row = '1,2,33,43343,4555,344323'
test_data = test_row.split(',')

data = []
for _ in range(6):
    data.append(test_data.pop(0).strip())
    time.sleep(0.02)

with open(filename, 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',', lineterminator='\n')
    spamwriter.writerow(data)

print repr(open(filename).read())
assert open(filename).read().strip() == test_row, 'got one row'
Assuming your serial port data wouldn't overrun your main memory heap, the following code should suit your need.
import serial
import time
import csv

ser = serial.Serial('COM29', 57600)
timeout = time.time() + 60/6   # 5 minutes from now

result = []
while True:
    test = 0
    if test == 5 or time.time() > timeout:
        break
    ss = ser.readline()
    print(ss)
    s = ss.replace("\n", "")
    result.append(s)
    time.sleep(0.02)

with open('C:\Users\Ivory Power\Desktop\EEG_Data\Othr_eeg\egg31.csv', 'w') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',', lineterminator='\n')
    spamwriter.writerow(result)  # pass the list itself (not [result]) so each value becomes its own column

Adding time/duration from CSV file

I am trying to add up time/duration values from a CSV file that I have, but I have failed so far. Each row holds a name, an In/Out direction, and a duration in h:mm:ss (the sample CSV and desired output were attached as screenshots, not reproduced here).
Is getting a per-user In/Out/Total output like that possible?
I have been trying to add up the durations but I always fail:
finput = open("./Test.csv", "r")
while 1:
    line = finput.readline()
    if not line:
        break
    else:
        user = line.split(delim)[0]
        direction = line.split(delim)[1]
        duration = line.split(delim)[2]
        durationz = 0:00:00
        for k in duration:
            durationz += k
Also:
is there a specific way to declare a time value?
Use datetime.timedelta() objects to model the durations, and pass in the 3 components as seconds, minutes and hours.
Parse your file with the csv module; no point in re-inventing the character-separated-values-parsing wheel here.
Use a dictionary to track In and Out values per user; using a collections.defaultdict() object will make it easier to add new users:
from collections import defaultdict
from datetime import timedelta
import csv

durations = defaultdict(lambda: {'In': timedelta(), 'Out': timedelta()})

with open("./Test.csv", "rb") as inf:
    reader = csv.reader(inf, delimiter=delim)
    for name, direction, duration in reader:
        hours, minutes, seconds = map(int, duration.split(':'))
        duration = timedelta(hours=hours, minutes=minutes, seconds=seconds)
        durations[name][direction] += duration

for name, directions in durations.items():
    print '{:10} In    {}'.format(name, directions['In'])
    print '           Out   {}'.format(directions['Out'])
    print '           Total {}'.format(
        directions['In'] + directions['Out'])
timedelta() objects, when converted back to strings (such as when printing, or when formatting with str.format()) are converted to the h:mm:ss format again.
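For example:

from datetime import timedelta

print(str(timedelta(hours=0, minutes=2, seconds=36)))  # 0:02:36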
Demo:
>>> import csv
>>> from collections import defaultdict
>>> from datetime import timedelta
>>> sample = '''\
... Johnny,In,0:02:36
... Kate,Out,0:02:15
... Paul,In,0:03:57
... Chris,In,0:01:26
... Jonathan,In,0:00:37
... Kyle,In,0:06:46
... Armand,Out,0:00:22
... Ryan,In,0:00:51
... Jonathan,Out,0:12:19
... '''.splitlines()
>>> durations = defaultdict(lambda: {'In': timedelta(), 'Out': timedelta()})
>>> reader = csv.reader(sample)
>>> for name, direction, duration in reader:
... hours, minutes, seconds = map(int, duration.split(':'))
... duration = timedelta(hours=hours, minutes=minutes, seconds=seconds)
... durations[name][direction] += duration
...
>>> for name, directions in durations.items():
... print '{:10} In {}'.format(name, directions['In'])
... print ' Out {}'.format(directions['Out'])
... print ' Total {}'.format(
... directions['In'] + directions['Out'])
...
Johnny In 0:02:36
Out 0:00:00
Total 0:02:36
Kyle In 0:06:46
Out 0:00:00
Total 0:06:46
Ryan In 0:00:51
Out 0:00:00
Total 0:00:51
Chris In 0:01:26
Out 0:00:00
Total 0:01:26
Paul In 0:03:57
Out 0:00:00
Total 0:03:57
Jonathan In 0:00:37
Out 0:12:19
Total 0:12:56
Kate In 0:00:00
Out 0:02:15
Total 0:02:15
Armand In 0:00:00
Out 0:00:22
Total 0:00:22
First, you might find python's built-in csv module to be helpful. Instead of having to manually split your lines and assign data, you can just do the following:
import csv

with open("test.csv", mode="r") as f:
    reader = csv.reader(f)
    for row in reader:
        user, direction, duration = row  # this is equivalent to your own variable assignment code,
                                         # using a cool feature of python called tuple unpacking
A dictionary would be a great way to group the data by user. Here is what that might look like:
...
user_dict = {}
for row in reader:
    user, direction, duration = row
    user_dict[user] = user_dict.get(user, {"in": "0:00:00", "out": "0:00:00"})
    user_dict[user][direction.lower()] = duration  # .lower() so "In"/"Out" in the CSV match the keys above
Once that runs through the whole input CSV you should have a dictionary containing an entry for every user, with each user entry holding their respective "in" and "out" values. If a user is missing either an in or an out value in the CSV, it stays at "0:00:00", which was supplied as the fallback (second argument) to dict.get().
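A quick illustration of that get() fallback (the name here is just for show):

user_dict = {}
# "Johnny" is not in the dict yet, so get() returns the fallback instead of raising KeyError
entry = user_dict.get("Johnny", {"in": "0:00:00", "out": "0:00:00"})
print(entry)  # {'in': '0:00:00', 'out': '0:00:00'}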
We could manually parse the times, but dealing with time addition ourselves would be a huge pain. Luckily, python has a built-in module for dealing with time, called datetime.
import csv
import datetime

user_dict = {}

with open("test.csv", mode="r") as f:
    reader = csv.reader(f)
    for row in reader:
        user, direction, duration = row
        hour, minute, second = duration.split(":")
        # since the split left us with strings, and datetime needs integers, we'll need to cast everything to an int.
        hour = int(hour)
        minute = int(minute)
        second = int(second)
        # (we could have done the above more concisely using a list comprehension, which would look like this:
        # hour, minute, second = [int(time) for time in duration.split(":")]
        # to add time values we'll use the timedelta function in datetime, which takes days then seconds as its arguments.
        # We'll just use seconds, so we'll need to convert the hours and minutes first.
        seconds = second + minute*60 + hour*60*60
        duration = datetime.timedelta(0, seconds)
        user_dict[user] = user_dict.get(user, {"in": datetime.timedelta(0, 0), "out": datetime.timedelta(0, 0)})
        user_dict[user][direction.lower()] = duration
Looking at your example, we're just adding the in time to the out time (though if we wanted total time on the clock we would want to subtract the in time from the out time). We can do the addition part with the following:
output = []
for user, time_dict in user_dict.items():
    total = time_dict["in"] + time_dict["out"]
    output.append([user, time_dict["in"], time_dict["out"], total])

with open("output.csv", mode="w") as f:
    writer = csv.writer(f)
    writer.writerows(output)
That should give you close to what you want, though the output will be a single row for each user -- the data will appear horizontally instead of vertically.
All the code together:
import csv
import datetime

user_dict = {}

with open("test.csv", mode="r") as f:
    reader = csv.reader(f)
    for row in reader:
        user, direction, duration = row
        hour, minute, second = [int(time) for time in duration.split(":")]
        seconds = second + minute*60 + hour*60*60
        duration = datetime.timedelta(0, seconds)
        user_dict[user] = user_dict.get(user, {"in": datetime.timedelta(0, 0), "out": datetime.timedelta(0, 0)})
        user_dict[user][direction.lower()] = duration

output = []
for user, time_dict in user_dict.items():
    total = time_dict["in"] + time_dict["out"]
    output.append([user, time_dict["in"], time_dict["out"], total])

with open("output.csv", mode="w") as f:
    writer = csv.writer(f)
    header = ["name", "time in", "time out", "total time"]
    writer.writerow(header)
    writer.writerows(output)
There are a few things you can fix.
First, you can read every line in your file by doing for line in file.
You can't declare the variable durationz as 0:00:00. It simply doesn't work in python.
One thing you could do is make durationz 0, and parse the time by turning it into an amount of seconds. Some pseudocode:
split duration string by ":"
add 60 * 60 * hours to duration
add 60 * minutes to duration
add seconds to duration
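A runnable version of that pseudocode might look like this (the function and variable names are illustrative):

def duration_to_seconds(duration_string):
    # split the h:mm:ss string and combine the parts into a total in seconds
    hours, minutes, seconds = duration_string.split(":")
    return 60 * 60 * int(hours) + 60 * int(minutes) + int(seconds)

total_seconds = 0
for duration in ["0:02:36", "0:12:19"]:
    total_seconds += duration_to_seconds(duration)
print(total_seconds)  # 895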
