python CSV writer - formatting - python

Quick question on how to properly write data back into a CSV file using the Python csv module. Currently I'm importing a file, pulling a column of dates, and making a column of days_of_the_week using the datetime module. I then want to write out a new CSV file (or overwrite the original one) containing one original element and the new element.
with open('new_dates.csv') as csvfile2:
    readCSV2 = csv.reader(csvfile2, delimiter=',')
    incoming = []
    for row in readCSV2:
        readin = row[0]
        time = row[1]
        year, month, day = (int(x) for x in readin.split('-'))
        ans = datetime.date(year, month, day)
        wkday = ans.strftime("%A")
        incoming.append(wkday)
        incoming.append(time)
with open('new_dates2.csv', 'w') as out_file:
    out_file.write('\n'.join(incoming))
The input file looks like this:
2017-03-02,09:25
2017-03-01,06:45
2017-02-28,23:49
2017-02-28,19:34
When using this code I end up with an output file that looks like this:
Friday
15:23
Friday
14:41
Friday
13:54
Friday
7:13
What I need is an output file that looks like this:
Friday,15:23
Friday,14:41
Friday,13:54
Friday,7:13
If I change the delimiter in out_file.write to a comma I just get one element of data per column, like this:
Friday 15:23 Friday 14:41 Friday 13:54 ....
Any thoughts would be appreciated. Thanks!

Being somewhat unclear on what format you want, I've assumed you just want a single space between wkday and time. For a quick fix, instead of appending both wkday and time separately, as in your example, append them together:
...
incoming.append('{} {}'.format(wkday,time))
...
OR, build your incoming as a list of lists:
...
incoming.append([wkday,time])
...
and change your write to:
with open('new_dates2.csv', 'w') as out_file:
    out_file.write('\n'.join([' '.join(t) for t in incoming]))

It seems you want Friday in column 0 and the time in column 1, so you need to change your incoming to a list of lists. That means the append statement should look like this:
...
incoming.append([wkday, time])
...
Then, it is better to use the csv.writer to write back to the file. You can write the whole incoming in one go without worrying about formatting.
# newline='' stops the csv module's line endings from being translated again
with open('new_dates2.csv', 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    writer.writerows(incoming)
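Putting the pieces together, the whole read, convert, and write cycle might look like the sketch below. The helper names to_weekday_rows and convert_file are made up for illustration, and the sketch assumes every row has exactly two columns: a YYYY-MM-DD date and a time.

```python
import csv
import datetime


def to_weekday_rows(rows):
    """Turn [date, time] rows into [weekday, time] rows (hypothetical helper)."""
    result = []
    for date_text, time_text in rows:
        day = datetime.datetime.strptime(date_text, "%Y-%m-%d")
        # %A gives the full weekday name in the current locale
        result.append([day.strftime("%A"), time_text])
    return result


def convert_file(in_path, out_path):
    """Read in_path, replace each date with its weekday, write out_path."""
    # newline='' lets the csv module control line endings itself
    with open(in_path, newline="") as src:
        rows = to_weekday_rows(csv.reader(src))
    with open(out_path, "w", newline="") as dst:
        csv.writer(dst).writerows(rows)
```

With the file names from the question, this would be called as convert_file('new_dates.csv', 'new_dates2.csv').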

Basically, your incoming array is a flat list, so you need to pair up consecutive elements when writing. You should be doing something like the following:
# your incoming array
incoming = ['Friday', '15:23', 'Friday', '14:41', 'Friday', '13:54', 'Friday', '7:13']
# pair each weekday (even index) with its time (odd index), one pair per line
with open('new_dates2.csv', 'w') as out_file:
    for i, j in zip(incoming[::2], incoming[1::2]):
        out_file.write(','.join((i, j)))
        out_file.write('\n')

You don't really need the csv module for this. I'm guessing at the input, but from the description it looks like:
2017-03-02,09:25
2017-03-01,06:45
2017-02-28,23:49
2017-02-28,19:34
This will parse it and write it in a new format:
import datetime
with open('new_dates.csv') as f1, open('new_dates2.csv','w') as f2:
    for line in f1:
        dt = datetime.datetime.strptime(line.strip(),'%Y-%m-%d,%H:%M')
        f2.write(dt.strftime('%A,%H:%M\n'))
Output file:
Thursday,09:25
Wednesday,06:45
Tuesday,23:49
Tuesday,19:34

Related

Data Cleansing Dates in Python

I am processing a .csv in Python which has a column full of timestamps in the below format:-
August 21st 2020, 13:58:19.000
The full content of a line in the .csv looks like this:-
"August 21st 2020, 14:55:12.000","joebloggs#gmail.com",IE,Ireland,"192.168.0.1",2
What I wish to do is populate a separate column with the respective date and time and so I'd have '2020-08-21' in a new column and '13:58:19' in an additional new column.
Looking at this, I think I need to firstly tidy up the date and remove the letters after the actual date (21st, 4th, 7th etc) so that they read 21, 4 etc...
I started writing the below, but this will replace the actual numbers too. How can I strip out just the letters after the numbers? I did think of a simple find/replace, but this would remove 'st' from August, for example. Hence I think a regex is needed to look for a space followed by 1 or 2 digits, check whether that is followed by one of ['nd','st','rd','th'], and if so, strip it out:-
with open('input_csv', 'r') as input_file:
    lines = input_file.readlines()
    for line in lines:
        line = re.sub(r"\s\d+", "", line)
        print(line)
Based on a suggestion to use the datetime module, I tried splitting to see if this helped:-
import re
from datetime import datetime
with open('input_csv.csv', 'r') as input_file:
    lines = input_file.readlines()[1:]
    for line in lines:
        line_lst = line.split(',')
        line_date = datetime.strptime(line_lst[0], '%B %d %y')
        line_time = datetime.strptime(line_lst[1], '%H:%M:%S.%f')
        print(line_date)
        print(line_time)
but I still receive an error:-
ValueError: time data '"May 12th 2020' does not match format '%B %d %y'
It's probably easier to use the datetime library. For example,
from datetime import datetime
with open('input_csv', 'r') as input_file:
    lines = input_file.readlines()
    for line in lines:
        # %Y is the four-digit year (%y only matches two digits)
        line_date = datetime.strptime(line, '%B %d %Y, %H:%M:%S.%f')
and then use strftime() to write it to the new file in the new format. To read more about these methods:
https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
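One caveat: strptime has no directive for ordinal suffixes such as "st" or "th", so the sample timestamps need them removed first. A minimal sketch, assuming the suffix always follows the day number directly and that month names are in English (the function name split_timestamp is illustrative only):

```python
import re
from datetime import datetime

# Matches st/nd/rd/th only when it directly follows a digit,
# so the "st" in "August" is left alone.
ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def split_timestamp(text):
    """Return (date, time) strings for e.g. 'August 21st 2020, 13:58:19.000'."""
    cleaned = ORDINAL_SUFFIX.sub("", text)
    parsed = datetime.strptime(cleaned, "%B %d %Y, %H:%M:%S.%f")
    return parsed.date().isoformat(), parsed.strftime("%H:%M:%S")
```

The input here is only the timestamp field, not the whole CSV line, so the field still has to be separated from the rest of the row first.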
I recommend using as many good libraries as you can for this: a lot of what you want to do has already been developed and tested.
Use a CSV parser
Parsing a CSV can look simple at first but become increasingly difficult as there are a lot of special rules around quoting (escaping) characters. Use Python's standard CSV module to do this:
import csv
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        date_val = row[0]
        print(f'Raw string: {date_val}')
and I get the following without any special code:
Raw string: August 21st 2020, 14:55:12.000
Use a really good date-time parser
As a couple of others already pointed out, it's important to use a date-time parser, but they didn't tell you that Python's standard parser cannot handle the ordinals like '12th' and '21st' (maybe they just didn't see those...). The 3rd-party dateutil lib can handle them, though:
from dateutil import parser
...
...
print(f'Raw string: {date_val}')
dt = parser.parse(date_val)
print(f'Date: {dt.date()}')
print(f'Time: {dt.time()}')
Raw string: August 21st 2020, 14:55:12.000
Date: 2020-08-21
Time: 14:55:12
Make a new CSV
Again, use the CSV module's writer class to take the modified rows from the reader and turn them into a CSV.
For simplicity, I recommend just accumulating new rows as you're reading and parsing the input. I've also included commented-out lines for copying a header, if you have one:
...
new_rows = []
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    # header = next(reader)
    # new_rows.append(header)
    for row in reader:
        date_val = row[0]
        remainder = row[1:]
        dt = parser.parse(date_val)
        new_row = [dt.date(), dt.time()]
        new_row += remainder
        new_rows.append(new_row)
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(new_rows)
My output.csv looks like this:
2020-08-21,14:55:12,joebloggs#gmail.com,IE,Ireland,192.168.0.1,2

How to extract a month from date in csv file?

I'm trying to get an output of all the employees who worked this month by extracting the month from the date but I get this error:
month = int(row[1].split('-')[1])
IndexError: list index out of range
A row in the attendance log csv looks like this:
"404555403","2020-10-14 23:58:15.668520","Chandler Bing"
I don't understand why it's out of range?
Thanks for any help!
import csv
import datetime
def monthly_attendance_report():
    """
    The function prints the attendance data of all employees from this month.
    """
    this_month = datetime.datetime.now().month
    with open('attendance_log.csv', 'r') as csvfile:
        content = csv.reader(csvfile, delimiter=',')
        for row in content:
            month = int(row[1].split('-')[1])
            if month == this_month:
                return row
monthly_attendance_report()
It works for me with the sample row. The problem is probably elsewhere in the CSV file: most CSV files start with a header row, and header text has no '-' to split on, so indexing the result with [1] fails. A csv.reader object cannot be sliced, so skip the header by calling next() on it before the loop:
next(content)
for row in content:
Processing the date by splitting the string is also fragile. Use the datetime module or something similar to parse it.
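As a rough sketch of that advice, the function below skips rows that are too short (blank lines) or whose second column is not a timestamp (a header), and parses the rest with datetime. It assumes the timestamp format shown in the sample row; the name rows_in_month and the header in the sample data are invented for illustration, and it collects every matching row instead of returning at the first one.

```python
import csv
import io
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"  # assumed from the sample row


def rows_in_month(lines, month):
    """Collect rows whose timestamp (column 1) falls in the given month."""
    matches = []
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # blank or malformed line
        try:
            stamp = datetime.strptime(row[1], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # header or unparseable timestamp
        if stamp.month == month:
            matches.append(row)
    return matches


# Made-up header and a blank line, followed by the row from the question
sample = io.StringIO(
    '"id","timestamp","name"\n'
    '\n'
    '"404555403","2020-10-14 23:58:15.668520","Chandler Bing"\n'
)
october = rows_in_month(sample, 10)
print(october)
```

Any iterable of lines works as input, so an open file object can be passed in place of the io.StringIO sample.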

Compare time in Python 2

I get a .csv file with values inside, and one of the columns contains durations in the format hh:mm:ss, for example 06:42:13 (6 hours, 42 minutes and 13 seconds). Now I want to compare this time with a given time, for example 00:00:00, because I have to handle the information in that row differently.
time is the value I got out of the .csv file
if time == 00:00:00:
    do something
else:
    do something different
That's what I want, but it obviously doesn't work the way I did it. I thought Python stored the time as a string, but when I compared it like this:
if time == "00:00:00":
it didn't work either.
That's how I get the values out of the .csv file:
import csv
import_list = []
with open("input.csv", "r") as csvfile:
    inputreader = csv.reader(csvfile, delimiter=';')
    for row in inputreader:
        import_list.append(row)
The .csv file looks like this:
Name; Duration; Tests; Warnings; Errors
Test1; 06:42:13; 2000; 2; 1
Test2; 00:00:00; 0; 0; 0
and so on.
Try it like this:
if time == " 00:00:00":
...
Each value has a leading space, because the fields in your file are separated by a semicolon followed by a space.
Alternatively you can change your code into this:
import csv
import_list = []
with open("input.csv", "r") as csvfile:
    inputreader = csv.reader(csvfile, delimiter=';')
    for row in inputreader:
        import_list.append([item.strip() for item in row])
Do this instead:
if time.strip() == "00:00:00":
    do something
else:
    do something different
Instead of doing string comparisons, you can use the built-in datetime library to create datetime objects. Use datetime.strptime to convert the time string.
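For example, a small sketch of that idea: parse each hh:mm:ss value with datetime.strptime and turn it into a timedelta, so durations can be compared directly. The helper name parse_duration is invented here, and the sketch assumes durations stay below 24 hours, since %H does not accept larger values.

```python
from datetime import datetime, timedelta


def parse_duration(text):
    """Convert 'hh:mm:ss' (optionally padded with spaces) to a timedelta."""
    parsed = datetime.strptime(text.strip(), "%H:%M:%S")
    return timedelta(hours=parsed.hour, minutes=parsed.minute, seconds=parsed.second)


ZERO = timedelta(0)

# Values as they come out of the file, including the leading space
for raw in [" 06:42:13", " 00:00:00"]:
    if parse_duration(raw) == ZERO:
        print(raw.strip(), "-> zero duration")
    else:
        print(raw.strip(), "-> non-zero duration")
```

Working with timedelta values also allows ordering comparisons such as parse_duration(a) > parse_duration(b), which plain strings only support by accident of formatting.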

Get a string in the last column using Python

I need to get a string from a CSV file. I know that I can use Python, but I've been looking for hours and still can't get it. The CSV looks like this:
DATE|CUST|PHONE|EMAIL|NAME|CLASS|QTY|AMOUNT|ID|TRX_ID|BOOKING CODE|PIN
01-02-2013 09:04:16|sdasd|43543|csdfd|Voucher Regular|REGULAR|1|2250000|G001T001|0062013000149|32143000341|MV1011302JSGUCFOM
01-02-2013 09:04:16|sdasd|43543|csdfd|Voucher Regular|REGULAR|2|1200000|G001T001|0062013000149|32143000341|MV4011302CBWDQYOU&MV4011302PVSEVAPJ
01-02-2013 11:01:13|ge|||Voucher Regular|REGULAR|1|600000|G001T001|20000027000005|32143000355|MV4011302UHKMJEEM
The string that I want to get is the PIN column (the last one), but a single PIN field can contain multiple PINs, separated by '&'.
Thanks for the help, been looking at solving this for hours.
Split on | and get the last entry:
pin = line.split('|')[-1]
Or more fancy:
import csv
# Python 3: open in text mode with newline='' rather than 'rb'
with open('bookings.csv', newline='') as csvfile:
    bookings = csv.reader(csvfile, delimiter='|')
    for values in bookings:
        print(values[-1])
When dealing with a csv file, just use the csv module:
import csv
from itertools import chain
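# Python 3: open in text mode with newline='' rather than 'rb'
with open('path/to/your/file.csv', newline='') as csvfile:
    tmp = (r['PIN'].split('&') for r in csv.DictReader(csvfile, delimiter='|'))
    # build the list while the file is still open, since tmp is lazy
    pins = list(chain.from_iterable(tmp))
for pin in pins:
    print(pin)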
Iterate through each line and use something like this:
'hello|world|and|noel'.split('|')[-1]
to get the last element

select certain dates inside loop for .csv file

Name,USAF,NCDC,Date,HrMn,I,Type,Dir,Q,I,Spd,Q
OXNARD,723927,93110,19590101,0000,4,SAO,270,1,N,3.1,1,
OXNARD,723927,93110,19590101,0100,4,SAO,338,1,N,1.0,1,
OXNARD,723927,93110,19590101,0200,4,SAO,068,1,N,1.0,1,
OXNARD,723927,93110,19590101,0300,4,SAO,068,1,N,2.1,1,
OXNARD,723927,93110,19590101,0400,4,SAO,315,1,N,1.0,1,
OXNARD,723927,93110,19590101,0500,4,SAO,999,1,C,0.0,1,
....
OXNARD,723927,93110,19590102,0000,4,SAO,225,1,N,2.1,1,
OXNARD,723927,93110,19590102,0100,4,SAO,248,1,N,2.1,1,
OXNARD,723927,93110,19590102,0200,4,SAO,999,1,C,0.0,1,
OXNARD,723927,93110,19590102,0300,4,SAO,068,1,N,2.1,1,
Here is a snippet of a csv file storing hourly wind speeds (Spd) in each row. What I'd like to do is select all hourly winds for each day in the csv file and store them into a temporary daily list storing all of that day's hourly values (24 if no missing values). Then I'll output the current day's list, create new empty list for the next day, locate hourly speeds in the next day, output that daily list, and so forth until the end of the file.
I'm struggling to find a good method to do this. One thought I have is to read in line i, determine the date (YYYY-MM-DD), then read in line i+1 and see if that date matches date i. If they match, then we're in the same day. If they don't, then we are onto the next day. But I can't even figure out how to read in the next line in the file...
Any suggestions for executing this method, or for a completely new (and better?!) method, are most welcome. Thank you in advance!
obs_in = open(csv_file).readlines()
for i in range(1,len(obs_in)):
    # Skip over the header lines
    if not str(obs_in[i]).startswith("Identification") and not str(obs_in[i]).startswith("Name"):
        name,usaf,ncdc,date,hrmn,i,type,dir,q,i2,spd,q2,blank = obs_in[i].split(',')
        current_dt = datetime.date(int(date[0:4]),int(date[4:6]),int(date[6:8]))
        current_spd = spd
        # Read in next line's date: is it in the same day?
        # If in the same day, then append spd into tmp daily list
        # If not, then start a new list for the next day
You can take advantage of the well-ordered nature of the data file and use csv.DictReader. Then you can quite simply build up a dictionary of the wind speeds organized by date, which you can process as you like. Note that the csv reader returns strings, so you might want to convert to other types as appropriate while you assemble the list.
import csv
from collections import defaultdict
bydate = defaultdict(list)
rdr = csv.DictReader(open('winds.csv','rt'))
for k in rdr:
    bydate[k['Date']].append(float(k['Spd']))
print(bydate)
defaultdict(<type 'list'>, {'19590101': [3.1000000000000001, 1.0, 1.0, 2.1000000000000001, 1.0, 0.0], '19590102': [2.1000000000000001, 2.1000000000000001, 0.0, 2.1000000000000001]})
You can obviously change the argument to the append call to a tuple, for instance append((float(k['Spd']), datetime.datetime.strptime(k['Date']+k['HrMn'], '%Y%m%d%H%M'))), so that you can also collect the times.
If the file has extraneous spaces, you can use the skipinitialspace parameter: rdr = csv.DictReader(open('winds.csv','rt'), fieldnames=ff, skipinitialspace=True), where ff is a list of the column names. If this still doesn't work, you can pre-process the header line:
bydate = defaultdict(list)
with open('winds.csv', 'rt') as f:
    # read the header ourselves and strip stray whitespace from each name
    fieldnames = [k.strip() for k in f.readline().split(',')]
    rdr = csv.DictReader(f, fieldnames=fieldnames, skipinitialspace=True)
    for k in rdr:
        bydate[k['Date']].append(k['Spd'])
bydate is accessed like a regular dictionary. To access a specific day's data, do bydate['19590101']. To get the list of dates that were processed, you can do bydate.keys().
If you want to convert them to Python datetime objects at the time of reading the file, you can import datetime, then replace the assignment line with bydate[datetime.datetime.strptime(k['Date'], '%Y%m%d')].append(k['Spd']).
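Because the rows are already in date order, another option (not used in the answer above) is itertools.groupby, which yields one group per run of consecutive rows sharing a key. The sketch below assumes the column positions from the sample header (Date at index 3, Spd at index 10) and uses plain csv.reader because the header repeats the names I and Q; daily_speeds is a made-up name.

```python
import csv
import io
from itertools import groupby

# A few rows copied from the sample data in the question
SAMPLE = """Name,USAF,NCDC,Date,HrMn,I,Type,Dir,Q,I,Spd,Q
OXNARD,723927,93110,19590101,0000,4,SAO,270,1,N,3.1,1,
OXNARD,723927,93110,19590101,0100,4,SAO,338,1,N,1.0,1,
OXNARD,723927,93110,19590102,0000,4,SAO,225,1,N,2.1,1,
"""

DATE_COL = 3   # assumed position of the Date column
SPEED_COL = 10  # assumed position of the Spd column


def daily_speeds(lines):
    """Yield (date, [speeds]) for each run of consecutive rows with the same date."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    for date, rows in groupby(reader, key=lambda row: row[DATE_COL]):
        yield date, [float(row[SPEED_COL]) for row in rows]


result = list(daily_speeds(io.StringIO(SAMPLE)))
for date, speeds in result:
    print(date, speeds)
```

Unlike the dictionary approach, this processes one day at a time, but it relies on the file being sorted by date: rows for the same day that are not adjacent would end up in separate groups.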
It can be something like this.
def dump(buf, date):
    """dumps buffered line into file 'spdYYYYMMDD.csv'"""
    if len(buf) == 0: return
    with open('spd%s.csv' % date, 'w') as f:
        for line in buf:
            f.write(line)
obs_in = open(csv_file).readlines()
# buf stores one day record
buf = []
# date0 is meant for time stamp for the buffer
date0 = None
for i in range(1,len(obs_in)):
    # Skip over the header lines
    if not str(obs_in[i]).startswith("Identification") and \
       not str(obs_in[i]).startswith("Name"):
        name,usaf,ncdc,date,hrmn,ii,type,dir,q,i2,spd,q2,blank = \
            obs_in[i].split(',')
        current_dt = datetime.date(int(date[0:4]),int(date[4:6]),int(date[6:8]))
        current_spd = spd
        # see if the time stamp of current record is different. if it is different
        # dump the buffer, and also set the time stamp of buffer
        if date != date0:
            dump(buf, date0)
            buf = []
            date0 = date
        # you change this. i am simply writing entire line
        buf.append(obs_in[i])
# when you get out the buffer should be filled with the last day's record.
# so flush that too.
dump(buf, date0)
I also found that I have to use ii instead of i for the field "I" of the data, as you used i for the loop counter.
I know this question is from years ago but just wanted to point out that a small bash script can neatly perform this task. I copied your example into a file called data.txt and this is the script:
#!/bin/bash
date=19590101
date_end=19590102
while [[ $date -le $date_end ]] ; do
    grep ",${date}," data.txt > file_${date}.txt
    date=`date +%Y%m%d -d ${date}+1day` # NOTE: MAC-OSX date differs
done
Note that this won't work on macOS, where the date command is the BSD implementation rather than the GNU one, so there you either need to use gdate (from coreutils) or change the options to match those of the macOS date.
If there are dates missing from the file the grep command produces an empty file - this link shows ways to avoid this:
how to stop grep creating empty file if no results
