Arrays' Length exploding when appending Values to it from CSV - python

below is some code written to open a CSV file. Its values are stored like this:
03/05/2017 09:40:19,21.2,35.0
03/05/2017 09:40:27,21.2,35.0
03/05/2017 09:40:38,21.1,35.0
03/05/2017 09:40:48,21.1,35.0
This is just a snippet of code I use in a real time plotting program, which fully works but the fact that the array is getting so big is unclean. Normally new values get added to the CSV while the program is running and the length of the arrays is very high. Is there a way to not have exploding arrays like this?
Just run the program, you will have to make a CSV with those values too and you will see my problem.
from datetime import datetime
import time
y = [] #temperature
t = [] #time object
h = [] #humidity
def readfile():
readFile = open('document.csv', 'r')
sepFile = readFile.read().split('\n')
readFile.close()
for idx, plotPair in enumerate(sepFile):
if plotPair in '. ':
# skip. or space
continue
if idx > 1: # to skip the first line
xAndY = plotPair.split(',')
time_string = xAndY[0]
time_string1 = datetime.strptime(time_string, '%d/%m/%Y %H:%M:%S')
t.append(time_string1)
y.append(float(xAndY[1]))
h.append(float(xAndY[2]))
print([y])
while True:
readfile()
time.sleep(2)
This is the output I get:
[[21.1]]
[[21.1, 21.1]]
[[21.1, 21.1, 21.1]]
[[21.1, 21.1, 21.1, 21.1]]
[[21.1, 21.1, 21.1, 21.1, 21.1]]
Any help is appreciated.

You can use Python's deque if you also want to limit the total number of entries you wish to keep. It produces a list which features a maximum length. Once the list is full, any new entries push the oldest entry off the start.
The reason your list is growing is that you need to re-read your file up to the point of you last entry before continuing to add new entries. Assuming your timestamps are unique, you could use takewhile() to help you do this, which reads entries until a condition is met.
from itertools import takewhile
from collections import deque
from datetime import datetime
import csv
import time
max_length = 1000 # keep this many entries
t = deque(maxlen=max_length) # time object
y = deque(maxlen=max_length) # temperature
h = deque(maxlen=max_length) # humidity
def read_file():
with open('document.csv', newline='') as f_input:
csv_input = csv.reader(f_input)
header = next(csv_input) # skip over the header line
# If there are existing entries, read until the last read item is found again
if len(t):
list(takewhile(lambda row: datetime.strptime(row[0], '%d/%m/%Y %H:%M:%S') != t[-1], csv_input))
for row in csv_input:
print(row)
t.append(datetime.strptime(row[0], '%d/%m/%Y %H:%M:%S'))
y.append(float(row[1]))
h.append(float(row[2]))
while True:
read_file()
print(t)
time.sleep(1)
Also, it is easier to work with the entries using Python's built in csv library to read each of the values into a list for each row. As you have a header row, read this in using next() before starting the loop.

Related

How to use python to find the maximum value in a csv file list

I am very new to python and need some help finding the maximum/highest value of in a column of data (time) that is imported from a csv file. this is the code i have tried.
file = open ("results.csv")
unneeded = file.readline()
for line in file:
data = file.readline ()
linelist = line.split(",")
hours = linelist[4]
maxtime = 0
for x in hours:
if x > maxtime:
maxtime = x
print (maxtime)
any help is appreciated
edit: i tried this code but it gives me the wrong answer :(
file = open ("results.csv")
unneeded = file.readline()
maxtime = 0
for line in file:
data = file.readline ()
linelist = line.split(",")
hours = linelist[4]
if hours > str(maxtime):
maxtime = hours
print (maxtime)
[first few lines of results][1]
edit:
results cvs
[1]: https://i.stack.imgur.com/z3pEJ.png
I haven't tested it but this should work. Using the CSV library is easy for parsing CSV files.
import csv
with open("results.csv") as file:
csv_reader = csv.reader(file, delimiter=',')
for row in csv_reader:
hours = row[4]
maxtime = 0
if hours > maxtime:
maxtime = x
print (maxtime)
file.close()
My recommendation is using the pandas module for anything CSV-related.
Using dateutil, I create a dataset of dates and identification numbers whose date values are shuffled (no specific order) like so:
from dateutil.parser import *
from dateutil.rrule import *
from random import shuffle
dates = list(rrule(
DAILY,
dtstart=parse("19970902T090000"),
until=parse("19971224T000000")
))
shuffle(dates)
with open('out.csv', mode='w', encoding='utf-8') as out:
for i,date in enumerate(dates):
out.write(f'{i},{date}\n')
So thus, in this particular dataset, 1997-12-23 09:00:00 would be the "largest" date. Then, to extract the maximum date, we can just do it via string comparisons if it is formatted in the ISO 8601 date/time format:
from pandas import read_csv
df = read_csv('out.csv', names=['index', 'time'], header=1)
print(max(df['time']))
After running it, we indeed get 1997-12-23 09:00:00 printed in the terminal!

How to compare date from csv(string) to actual date

filenameA ="ApptA.csv"
filenameAc = "CheckoutA.csv"
def checkouttenantA():
global filenameA
global filenameAc
import csv
import datetime
with open(filenameA, 'r') as inp, open(filenameAc, 'a' , newline = "") as out:
my_writer = csv.writer(out)
for row in csv.reader(inp):
my_date= datetime.date.today()
string_date = my_date.strftime("%d/%m/%Y")
if row[5] <= string_date:
my_writer.writerow(row)
Dates are saved in format %d/%m/%Y in an excel file on column [5]. I am trying to compare dates in csv file with actual date, but it is only comparing the %d part. I assume it is because dates are in string format.
Ok so there are a few improvements to make as well, which I'll put as an edit to this, but you're converting todays date to a string with strftime() and comparing the two strings, you should be converting the string date from the csv file to a datetime object and comparing those instead.
I'll add plenty of comments to try and explain the code and the reasoning behind it.
# imports should go at the top
import csv
# notice we are importing datetime from datetime (we are importing the `datetime` type from the module datetime
import from datetime import datetime
# try to avoid globals where possible (they're not needed here)
def check_dates_in_csv(input_filepath):
''' function to load csv file and compare dates to todays date'''
# create a list to store the rows which meet our criteria
# appending the rows to this will make a list of lists (nested list)
output_data = []
# get todays date before loop to avoid calling now() every line
# we only need this once and it'll slow the loop down calling it every row
todays_date = datetime.now()
# open your csv here using the function argument
with open(input_filepath, output_filepath) as csv_file:
reader = csv.reader(csv_file)
# iterate over the rows and grab the date in each row
for row in reader:
string_date = row[5]
# convert the string to a datetime object
csv_date = datetime.strptime(string_date, '%d/%m/%Y')
# compare the dates and append if it meets the criteria
if csv_date <= todays_date:
output_data.append(row)
# function should only do one thing, compare the dates
# save the output after
return output_data
# then run the script here
# this comparison is basically the entry point of the python program
# this answer explains it better than I could: https://stackoverflow.com/questions/419163/what-does-if-name-main-do
if __name__ == "__main__":
# use our new function to get the output data
output_data = check_dates_in_csv("input_file.csv")
# save the data here
with open("output.csv", "w") as output_file:
writer = csv.writer(output_file)
writer.writerows(output_data)
I would recommend to use Pandas for such tasks:
import pandas as pd
filenameA ="ApptA.csv"
filenameAc = "CheckoutA.csv"
today = pd.datetime.today()
df = pd.read_csv(filenameA, parse_dates=[5])
df.loc[df.iloc[:, 5] <= today].to_csv(filenameAc, index=False)

Remove specific rows from CSV file in python

I am trying to remove rows with a specific ID within particular dates from a large CSV file.
The CSV file contains a column [3] with dates formatted like "1962-05-23" and a column with identifiers [2]: "ddd:011232700:mpeg21:a00191".
Within the following date range:
01-01-1951 to 12-31-1951
07-01-1962 to 12-31-1962
01-01 to 09-30-1963
7-01 to 07-31-1965
10-01 to 10-31-1965
04-01-1966 to 11-30-1966
01-01-1969 to 12-31-1969
01-01-1970 to 12-31-1989
I want to remove rows that contain the ID ddd:11*
I think I have to create a variable that contains both the date range and the ID. And look for these in every row, but I'm very new to python so I'm not sure what would be an eloquent way to do this.
This is what I have now. -CODE UPDATED
import csv
import collections
import sys
import re
from datetime import datetime
csv.field_size_limit(sys.maxsize)
dateranges = [("01-01-1951","12-31-1951"),("07-01-1962","12-31-1962")]
dateranges = list(map(lambda dr: tuple(map(lambda x: datetime.strptime(x,"%m-%d-%Y"),dr)),dateranges))
def datefilter(x):
x = datetime.strptime(x,"%Y-%m-%d")
for r in dateranges:
if r[0]<=x and r[1]>=x: return True
return False
writer = csv.writer(open('filtered.csv', 'wb'))
for row in csv.reader('my_file.csv', delimiter='\t'):
if datefilter(row[3]):
if not row[2].startswith("dd:111"):
writer.writerow(row)
else:
writer.writerow(row)
writer.close()
I'd recommend using pandas: it's great for filtering tables. Nice and readable.
import pandas as pd
# assumes the csv contains a header, and the 2 columns of interest are labeled "mydate" and "identifier"
# Note that "date" is a pandas keyword so not wise to use for column names
df = pd.read_csv(inputFilename, parse_dates=[2]) # assumes mydate column is the 3rd column (0-based)
df = df[~df.identifier.str.contains('ddd:11')] # filters out all rows with 'ddd:11' in the 'identifier' column
# then filter out anything not inside the specified date ranges:
df = df[((pd.to_datetime("1951-01-01") <= df.mydate) & (df.mydate <= pd.to_datetime("1951-12-31"))) |
((pd.to_datetime("1962-07-01") <= df.mydate) & (df.mydate <= pd.to_datetime("1962-12-31")))]
df.to_csv(outputFilename)
See Pandas Boolean Indexing
Here is how I would approach that, but it may not be the best method.
from datetime import datetime
dateranges = [("01-01-1951","12-31-1951"),("07-01-1962","12-31-1962")]
dateranges = list(map(lambda dr: tuple(map(lambda x: datetime.strptime(x,"%m-%d-%Y"),dr)),dateranges))
def datefilter(x):
# The date format is different here to match the format of the csv
x = datetime.strptime(x,"%Y-%m-%d")
for r in dateranges:
if r[0]<=x and r[1]>=x: return True
return False
with open(main_file, "rb") as fp:
root = csv.reader(fp, delimiter='\t')
result = collections.defaultdict(list)
for row in root:
if datefilter(row[3]):
# use a regular expression or any other means to filter on id here
if row[2].startswith("dd:111"): #code to remove item
What I have done is create a list of tuples of your date ranges (for brevity, I only put 2 ranges in it), and then I convert those into datetime objects.
I have used maps for doing that in one line: first loop over all tuples in that list, applying a function which loops over all entries in that tuple and converts to a date time, using the tuple and list functions to get back to the original structure. Doing it the long way would look like:
dateranges2=[]
for dr in dateranges:
dateranges2.append((datetime.strptime(dr[0],"%m-%d-%Y"),datetime.strptime(dr[1],"%m-%d-%Y"))
dateranges = dateranges2
Notice that I just convert each item in the tuple into a datetime, and add the tuples to the new list, replacing the original (which I don't need anymore).
Next, I create a datefilter function which takes a datestring, converts it to a datetime, and then loops over all the ranges, checking if the value is in the range. If it is, we return True (indicating this item should be filtered), otherwise return False if we have checking all ranges with no match (indicating that we don't filter this item).
Now you can check out the id using any method that you want once the date has matched, and remove the item if desired. As your example is constant in the first few characters, we can just use the string startswith function to check the id. If it is more complex, we could use a regex.
My kinda approach workds like this -
import csv
import re
import datetime
field_id = 'ddd:11'
d1 = datetime.date(1951,1,01) #change the start date
d2 = datetime.date(1951,12,31) #change the end date
diff = d2 - d1
date_list = []
for i in range(diff.days + 1):
date_list.append((d1 + datetime.timedelta(i)).isoformat())
with open('mwevers_example_2016.01.02-07.25.55.csv','rb') as csv_file:
reader = csv.reader(csv_file)
for row in reader:
for date in date_list:
if row[3] == date:
print row
var = re.search('\\b'+field_id,row[2])
if bool(var) == True:
print 'olalala'#here you can make a function to copy those rows into another file or any list
import csv
import sys
import re
from datetime import datetime
csv.field_size_limit(sys.maxsize)
field_id = 'ddd:11'
dateranges = [("1951-01-01", "1951-12-31"),
("1962-07-01", "1962-12-31"),
("1963-01-01", "1963-09-30"),
("1965-07-01", "1965-07-30"),
("1965-10-01", "1965-10-31"),
("1966-04-01", "1966-11-30"),
("1969-01-01", "1989-12-31")
]
dateranges = list(map(lambda dr:
tuple(map(lambda x:
datetime.strptime(x, "%Y-%m-%d"), dr)),
dateranges))
def datefilter(x):
x = datetime.strptime(x, "%Y-%m-%d")
for r in dateranges:
if r[0] <= x and r[1] >= x:
return True
return False
output = []
with open('my_file.csv', 'r') as f:
reader = csv.reader(f, delimiter='\t', quotechar='"')
next(reader)
for row in reader:
if datefilter(row[4]):
var = re.search('\\b'+field_id, row[3])
if bool(var) == False:
output.append(row)
else:
output.append(row)
with open('output.csv', 'w') as outputfile:
writer = csv.writer(outputfile, delimiter='\t', quotechar='"')
writer.writerows(output)

Dealing with strings amidst int in csv file, None value

I'm reading in data from a csv file where some of the values are "None". The values that are being read in are then contained in a list.
The list is the passed to a function which requires all values within the list to be in int() format.
However I can't apply this with the "None" string value being present. I've tried replacing the "None" with None, or with "" but that hasn't worked, it results in an error. The data in the list also needs to stay in the same position so I cant just completely ignore it all together.
I could replace all "None" with 0 but None != 0 really.
EDIT: I've added my code so hopefully it'll make a bit more sense. Trying to create a line chart from data in csv file:
import csv
import sys
from collections import Counter
import pygal
from pygal.style import LightSolarizedStyle
from operator import itemgetter
#Read in file to data variable and set header variable
filename = sys.argv[1]
data = []
with open(filename) as f:
reader = csv.reader(f)
header = reader.next()
data = [row for row in reader]
#count rows in spreadsheet (minus header)
row_count = (sum(1 for row in data))-1
#extract headers which I want to use
headerlist = []
for x in header[1:]:
headerlist.append(x)
#initialise line chart in module pygal. set style, title, and x axis labels using headerlist variable
line_chart = pygal.Line(style = LightSolarizedStyle)
line_chart.title = 'Browser usage evolution (in %)'
line_chart.x_labels = map(str, headerlist)
#create lists for data from spreadsheet to be put in to
empty1 = []
empty2 = []
#select which data i want from spreadsheet
for dataline in data:
empty1.append(dataline[0])
empty2.append(dataline[1:-1])
#DO SOMETHING TO "NONE" VALUES IN EMPTY TWO SO THEY CAN BE PASSED TO INT CONVERTER ASSIGNED TO EMPTY 3
#convert all items in the lists, that are in the list of empty two to int
empty3 = [[int(x) for x in sublist] for sublist in empty2]
#add data to chart line by line
count = -1
for dataline in data:
while count < row_count:
count += 1
line_chart.add(empty1[count], [x for x in empty3[count]])
#function that only takes int data
line_chart.render_to_file("browser.svg")
There will be a lot of inefficiencies or weird ways of doing things, trying to slowly learn.
The above script gives chart:
With all the Nones set as 0, bu this doesn't really reflect the existence of Chrome pre a certain date. Thanks
Without seeing your code, I can only offer limited help.
It sounds like you need to utilize ast.literal_eval().
import ast
csvread = csv.reader(file)
list = []
for row in csvread:
list.append(ast.literal_eval(row[0]))

Generating multiple CSV files on hourly basis in Python

I have a python module called HourlyCsvGeneration.py. I have some data which is being generated on hourly basis which is is sample.txt. Here is the sample of the data in the sample.txt:-
2014-07-24 15:00:00,1,1,1,1,1001
2014-07-24 15:01:00,1,1,1,1,1001
2014-07-24 15:02:00,1,1,1,1,1001
2014-07-24 15:15:00,1,1,1,1,1001
2014-07-24 15:16:00,1,1,1,1,1001
2014-07-24 15:17:00,1,1,1,1,1001
2014-07-24 15:30:00,1,1,1,1,1001
2014-07-24 15:31:00,1,1,1,1,1001
2014-07-24 15:32:00,1,1,1,1,1001
2014-07-24 15:45:00,1,1,1,1,1001
2014-07-24 15:46:00,1,1,1,1,1001
2014-07-24 15:47:00,1,1,1,1,1001
As you can see there are 4 intervals 00-15, 15-30, 30,45 and 45-59 and the next hour starts and so on. I am writing the code that would read the data in this txt file and generating 4 CSV files for every hour in a day. So analysing the above data the 4 CSV files should be generated should have naming convention like 2014-07-24 15:00.csv containing the data between 15:00 and 15:15, 2014-07-24 15:15.csv containing the data between 15:15 and 15:30 and so on for every hour. The python code must handle all this.
Here is my current code snippet:-
import csv
def connection():
fo = open("sample.txt", "r")
data = fo.readlines()
header = ['tech', 'band', 'region', 'market', 'code']
for line in data:
line = line.strip("\n")
line = line.split(",")
time = line[0]
lines = [x for x in time.split(':') if x]
i = len(lines)
if i == 0:
continue
else:
hour, minute, sec = lines[0], lines[1], lines[2]
minute = int(minute)
if minute >= 0 and minute < 15:
print hour, minute
print line[1:]
elif minute >= 15 and minute < 30:
print hour, minute
print line[1:]
elif minute >= 30 and minute < 45:
print hour, minute
print line[1:]
elif minute >=45 and minute < 59:
print hour, minute
print line[1:]
connection()
[1:] gives the right data for each interval and I am kind off stuck in generating CSV files and writing the data. So instead of printing [1:], I want this to be get written in the csv file of that interval with the appropriate naming convention as explained in the above description.
Expected output:-
2014-07-24 15:00.csv must contain
1,1,1,1,1001
1,1,1,1,1001
1,1,1,1,1001
2014-07-24 15:15.csv must contain
1,1,1,1,1001
1,1,1,1,1001
1,1,1,1,1001
and so on for 15.30.csv and 15.45.csv. Keeping in mind this is just a small chunk of data. The actual data is for every hour of the data. Meaning generating 4 csv files for each hour that is 24*4 files for one day. So how can I make my code more robust and efficient?
Any help?Thanks
Seems like a job for itertools.groupby, if the timestamps are strictly increasing in value:
from datetime import datetime as DateTime
from itertools import imap, groupby
from operator import itemgetter
get_first = itemgetter(0)
get_second = itemgetter(1)
def process_line(line):
timestamp_string, _, values = line.partition(',')
timestamp = DateTime.strptime(timestamp_string, '%Y-%m-%d %H:%M:%S')
return (
timestamp.replace(minute=timestamp.minute // 15 * 15, second=0),
values
)
def main():
with open('sample.txt', 'r') as lines:
for date, group in groupby(imap(process_line, lines), get_first):
with open('{0:%Y-%m-%d %H_%M}.csv'.format(date), 'w') as out_file:
out_file.writelines(imap(get_second, group))
if __name__ == '__main__':
main()
Your problem is not trivial, because if you try to open all the output files at once, then you will run out of file descriptors and crash. So what you want to do is open a file in append mode, write one line, then close the file. This isn't a horribly efficient operation, so I wouldn't worry about efficiency yet.
outfile = open("2014-07-24 15:00.csv","a")
outfile.write("csv, line, data\n")
outfile.close()
I'd recommend using pandas for this. It takes care of a whole bunch of the dirty work for you.
import pandas as pd
df = pd.read_table('DummyText.txt',sep=',',index_col=0,parse_dates=True,header=None)
fname = (str(pd.datetime(2014,7,24,15,0))+'.csv').replace(':','-')
df[pd.datetime(2014,7,24,15,0):pd.datetime(2014,7,24,15,15)].to_csv(fname,header=None)
I took the : out of the filename. It didn't seem to like that.
All you need to do with the above is setup some loops to cycle through the dates and times.
Here is some ways that might help
import csv
from datetime import datetime
def get_higher_minute(minute_of_day):
return (((minute_of_day/ 15) + 1 ) % 4) * 15
def connection():
import csv
with open('some.csv', 'rb') as f:
reader = csv.reader(f)
for row in reader:
dateObject = datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S')
minute_of_day = dateObject.minute
higher_minute = get_higher_minute(minute_of_day)
newdate = dateObject.replace(minute = higher_minute)
file_name_of_new_csv = "%s.csv" % dateObject.strftime("%Y-%m-%d %H:%M")
new_csv_writer = csv.writer(file_name_of_new_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
new_csv_writer.writerow(row[0:])
new_csv_writer.close()
def main():
connection()
if __name__=="__main__":
main()
Hope it helps
Sorry. left new_csv_writer open.

Categories

Resources