Data Cleansing Dates in Python - python

I am processing a .csv in Python which has a column full of timestamps in the below format:-
August 21st 2020, 13:58:19.000
The full content of a line in the .csv looks like this:-
"August 21st 2020, 14:55:12.000","joebloggs#gmail.com",IE,Ireland,"192.168.0.1",2
What I wish to do is populate a separate column with the respective date and time and so I'd have '2020-08-21' in a new column and '13:58:19' in an additional new column.
Looking at this, I think I need to firstly tidy up the date and remove the letters after the actual date (21st, 4th, 7th etc) so that they read 21, 4 etc...
I started writing the below, but this will replace the actual numbers too. How can I strip out just the letters after the numbers? I did think of a simple find/replace, but this would remove 'st' from August, for example. Hence I think a regex is needed: look for a space followed by 1 or 2 digits, then check whether it is followed by one of ['nd','st','rd','th'] and, if so, strip this out:-
with open('input_csv', 'r') as input_file:
    lines = input_file.readlines()
    for line in lines:
        line = re.sub(r"\s\d+", "", line)
        print(line)
Based on a suggestion to use the datetime module, I tried splitting to see if this helped:-
import re
from datetime import datetime
with open('input_csv.csv', 'r') as input_file:
    lines = input_file.readlines()[1:]
    for line in lines:
        line_lst = line.split(',')
        line_date = datetime.strptime(line_lst[0], '%B %d %y')
        line_time = datetime.strptime(line_lst[1], '%H:%M:%S.%f')
        print(line_date)
        print(line_time)
but I still receive an error:-
ValueError: time data '"May 12th 2020' does not match format '%B %d %y'

It's probably easier to use the datetime library. For example,
from datetime import datetime
with open('input_csv', 'r') as input_file:
    lines = input_file.readlines()
    for line in lines:
        line_date = datetime.strptime(line, '%B %d %y, %H:%M:%S.%f')
and then use strftime() to write it to the new file in the new format. To read more about these methods:
https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
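Note that strptime() has no directive for ordinal suffixes such as 'st' or 'th', and %y expects a two-digit year, so the sample timestamps need a little cleaning first. Below is a minimal standard-library sketch, assuming the timestamp string has already been separated from the rest of the row and that month names are in English (%B is locale-dependent). The helper name parse_timestamp and the constant ORDINAL_SUFFIX are made up for illustration.

```python
import re
from datetime import datetime

# Matches an ordinal suffix only when it directly follows a digit,
# so the "st" in "August" is left alone.
ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def parse_timestamp(raw):
    """Parse e.g. 'August 21st 2020, 13:58:19.000' into a datetime."""
    cleaned = ORDINAL_SUFFIX.sub("", raw)
    # %Y is the four-digit year; %f accepts the '.000' fraction.
    return datetime.strptime(cleaned, "%B %d %Y, %H:%M:%S.%f")


dt = parse_timestamp("August 21st 2020, 13:58:19.000")
print(dt.strftime("%Y-%m-%d"))  # value for the new date column
print(dt.strftime("%H:%M:%S"))  # value for the new time column
```

The lookbehind is what keeps the substitution from touching month names: a suffix is removed only when the character before it is a digit.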

I recommend using as many good libraries as you can for this: a lot of what you want to do has already been developed and tested.
Use a CSV parser
Parsing a CSV can look simple at first but become increasingly difficult as there are a lot of special rules around quoting (escaping) characters. Use Python's standard CSV module to do this:
import csv
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        date_val = row[0]
        print(f'Raw string: {date_val}')
and I get the following without any special code:
Raw string: August 21st 2020, 14:55:12.000
Use a really good date-time parser
As a couple of others already pointed out, it's important to use a date-time parser, but they didn't tell you that Python's standard parser cannot handle the ordinals like '12th' and '21st' (maybe they just didn't see those...). The 3rd-party dateutil lib can handle them, though:
from dateutil import parser
...
...
print(f'Raw string: {date_val}')
dt = parser.parse(date_val)
print(f'Date: {dt.date()}')
print(f'Time: {dt.time()}')
Raw string: August 21st 2020, 14:55:12.000
Date: 2020-08-21
Time: 14:55:12
Make a new CSV
Again, use the CSV module's writer class to take the modified rows from the reader and turn them into a CSV.
For simplicity, I recommend just accumulating new rows as you're reading and parsing the input. I've also included commented-out lines for copying a header, if you have one:
...
new_rows = []
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    # header = next(reader)
    # new_rows.append(header)
    for row in reader:
        date_val = row[0]
        remainder = row[1:]
        dt = parser.parse(date_val)
        new_row = [dt.date(), dt.time()]
        new_row += remainder
        new_rows.append(new_row)
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(new_rows)
My output.csv looks like this:
2020-08-21,14:55:12,joebloggs#gmail.com,IE,Ireland,192.168.0.1,2

Related

How to remove time from date values pulled from a JSON file?

I am working in python and using a JSON file and pulling info from it and sending to a csv file. The code I am using is as follows:
import csv
import json
csv_kwargs = {
    'dialect': 'excel',
    'doublequote': True,
    'quoting': csv.QUOTE_MINIMAL
}
inpfile = open('checkin.json', 'r', encoding='utf-8')
outfile = open('checkin.csv', 'w', encoding='utf-8')
writer = csv.writer(outfile, **csv_kwargs, lineterminator="\n")
for line in inpfile:
    d = json.loads(line)
    writer.writerow([d['business_id'],d['date']])
inpfile.close()
outfile.close()
Each line of checkin.json has the keys business_id and date. The date values are in the form of 'MM:DD:YYYY HH:MM:SS', showing the date and then the time. Each business_id has multiple dates associated with it. I have included a line of the JSON file to show how each 'business_id' works and the dates associated with it:
{"business_id":"--1UhMGODdWsrMastO9DZw","date":"2016-04-26 19:49:16, 2016-08-30 18:36:57, 2016-10-15 02:45:18, 2016-11-18 01:54:50, 2017-04-20 18:39:06, 2017-05-03 17:58:02"}
My question is: how do I code this to keep the date but not the time, given that they are in the same value?
You can parse the date in your JSON as a timestamp and then truncate it to date using Python's built-in datetime module.
Import the module:
from datetime import datetime
Parse the date while writing:
for line in inpfile:
    d = json.loads(line)
    # The value holds several comma-separated timestamps; keep only the date part of each.
    dates = map(lambda dt: datetime.strptime(dt.strip(), '%Y-%m-%d %H:%M:%S').strftime('%Y-%m-%d'), d['date'].split(','))
    for date in dates:
        writer.writerow([d['business_id'], date])
The formatting for date values described in your question isn't consistent: first you say it's MM:DD:YYYY, but in the sample line from the JSON input file it appears to be YYYY-MM-DD. While such details may matter, that particular one doesn't affect the revised code below. What did matter was the fact that there can be more than one date per line, which is why I'm updating my answer.
import csv
import json
csv_kwargs = {
    'dialect': 'excel',
    'doublequote': True,
    'quoting': csv.QUOTE_MINIMAL,
}
with open('checkin.json', 'r', encoding='utf-8') as inpfile, \
     open('checkin.csv', 'w', encoding='utf-8', newline='') as outfile:
    writer = csv.writer(outfile, **csv_kwargs)
    for line in inpfile:
        d = json.loads(line)
        # Convert date value string into list of dates with the times removed.
        dates = [date.strip().split(' ')[0] for date in d['date'].split(',')]
        writer.writerow([d['business_id']] + dates)
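If you're strictly using this program to convert the JSON file to CSV, you can simply use string slices. For a value holding a single 'YYYY-MM-DD HH:MM:SS' timestamp, the date is the first 10 characters and the time starts at index 11:
date, time = d['date'][:10], d['date'][11:]
If you want to store it as a datetime object to do something else:
from datetime import datetime
dt = datetime.strptime(d['date'], '%Y-%m-%d %H:%M:%S')
# Other stuff
dt_string = dt.strftime('%Y-%m-%d')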

Copying Headers from CSV in Python not Working-- Delimiter Issue

I'm trying to copy headers from one CSV file to another and it's splitting the headers into individual characters, one character per column. I'm not sure why.
I've read through StackOverflow but couldn't find a question/solution to this problem.
first.csv file data
Date,Data
1/2/2019,a
12/1/2018,b
11/3/2018,c
Python Code
import csv
from datetime import datetime, timedelta
date_ = datetime.strftime(datetime.now(),'%Y_%m_%d')
with open('first.csv', 'r') as full_file, open('second.csv' + '_' + date_ + '.csv', 'w') as past_10_days:
    writer = csv.writer(past_10_days)
    writer.writerow(next(full_file)) #copy headers over from original file
    for row in csv.reader(full_file): #run through remaining rows
        if datetime.strptime(row[0],'%m/%d/%Y') > datetime.now() - timedelta(days=10): #write rows where timestamp is greater than today - 10
            writer.writerow(row)
Result I get:
D,a,t,e,D,a,t,a
1/2/2019,a
I'd like the result to just be
Date,Data
1/2/2019,a
Am I just missing setting an option? This is Python 3+
Thanks!
Change
writer.writerow(next(full_file))
To
writer.writerow(next(csv.reader(full_file)))
Your code is reading full_file as a text file, not as a CSV, so you'll just get the characters.
Ideally, as roganjosh pointed out, you should simply define the reader once, so the code should look like this:
reader = csv.reader(full_file)
writer.writerow(next(reader))
for row in reader:
    if datetime.strptime(row[0],'%m/%d/%Y') > datetime.now() - timedelta(days=10):
        writer.writerow(row)
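The reason for the one-character-per-column header is that writerow() iterates over whatever it is given, and iterating over a string yields its characters. For completeness, here is a self-contained sketch of the single-reader approach. It uses io.StringIO in place of the real files and a fixed "now" so the result is reproducible; the function name copy_recent_rows and its parameters are made up for illustration.

```python
import csv
import io
from datetime import datetime, timedelta


def copy_recent_rows(src, dst, days=10, now=None):
    """Copy the header, then rows whose first column is newer than now - days."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=days)
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    # next(reader) returns a list of fields, so the header is written intact.
    writer.writerow(next(reader))
    for row in reader:
        if datetime.strptime(row[0], "%m/%d/%Y") > cutoff:
            writer.writerow(row)


# In-memory stand-ins for first.csv and the output file.
src = io.StringIO("Date,Data\n1/2/2019,a\n12/1/2018,b\n11/3/2018,c\n")
dst = io.StringIO()
copy_recent_rows(src, dst, now=datetime(2019, 1, 5))
print(dst.getvalue())
```

With real files, open both with newline='' as the csv module documentation recommends, and pass the file objects in place of the StringIO buffers.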

Compare time in Python 2

I get a .csv file with values inside, and one of the columns contains durations in the format hh:mm:ss, for example 06:42:13 (6 hours, 42 minutes and 13 seconds). Now I want to compare this time with a given time, for example 00:00:00, because I have to handle the information in that row differently.
time is the value I got out of the .csv file
if time == 00:00:00:
    do something
else:
    do something different
That's what I want, but it obviously doesn't work the way I did it. I thought Python stored the time as a string, but when I compared it like this:
if time == "00:00:00":
it didn't work either.
That's how I get the values out of the .csv file:
import csv
import_list = []
with open("input.csv", "r") as csvfile:
    inputreader = csv.reader(csvfile, delimiter=';')
    for row in inputreader:
        import_list.append(row)
The .csv file looks like this:
Name; Duration; Tests; Warnings; Errors
Test1; 06:42:13; 2000; 2; 1
Test2; 00:00:00; 0; 0; 0
and so on.
Try it like this:
if time == " 00:00:00":
...
Each value has a leading space, because the file puts a space after every ';' delimiter.
Alternatively you can change your code into this:
import csv
import_list = []
with open("input.csv", "r") as csvfile:
    inputreader = csv.reader(csvfile, delimiter=';')
    for row in inputreader:
        import_list.append([item.strip() for item in row])
Do this instead:
if time.strip() == "00:00:00":
    do something
else:
    do something different
Instead of doing string comparisons, use the built-in datetime library to create datetime objects. Use datetime.strptime to convert the time string.
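A minimal sketch of that idea, assuming the durations always stay below 24 hours (the %H directive only accepts 00 to 23, so a longer duration would raise ValueError). The helper name parse_duration is made up for illustration. Converting to timedelta makes the values comparable as durations instead of as text.

```python
from datetime import datetime, timedelta


def parse_duration(text):
    """Convert an 'hh:mm:ss' string (surrounding spaces allowed) to a timedelta."""
    t = datetime.strptime(text.strip(), "%H:%M:%S")
    return timedelta(hours=t.hour, minutes=t.minute, seconds=t.second)


# A row as csv.reader would return it with delimiter=';' (note the leading spaces).
row = ["Test2", " 00:00:00", " 0", " 0", " 0"]

if parse_duration(row[1]) == timedelta(0):
    print("zero duration")
else:
    print("non-zero duration")
```

Because timedelta objects support ordering, checks such as parse_duration(a) > parse_duration(b) also work without any string handling.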

How to detect and edit date in file

I have a file that consists of a bunch of lines that have dates in them, for example:
1, '01-JAN-10', '04-JAN-10', 100, 'HELEN', 'PRICE'
2, 'MARK', 'TYER', '05-JAN-10', '06-JAN-10', 120
I want to change the date parts of the lines to a different format, but I don't know how to detect which part of the line has the date fields and I don't know how to replace them with the new date format. I already have a function called changeDate(date) that returns a correctly formatted date given a bad format date. This is my code so far:
def editFile(filename):
    f = open(filename)
    line = f.readline()
    while line:
        for word in line.split():
            pass  # detect if it is a date, and change to new format
        line = f.readline()
    f.close()
You can use strptime and try/except to do this. From the documentation for strptime:
Return a datetime corresponding to date_string, parsed according to format.
See "strftime() and strptime() Behavior" in the datetime documentation for more details.
from datetime import datetime
s = "1, '01-JAN-10', '04-FEB-28', 100, 'HELEN', 'PRICE'"
for word in s.replace(' ', '').replace('\'', '').split(','):
    try:
        # '%y-%b-%d' reads each field as year-month-day
        dt = datetime.strptime(word, '%y-%b-%d')
        print('{0}/{1}/{2}'.format(dt.month, dt.day, dt.year))
    except ValueError:
        # not a date: print the field unchanged
        print(word)
Result:
1
1/10/2001
2/28/2004
100
HELEN
PRICE
You can use a regex to detect the dates. Modifying the file in place is awkward, so it is simpler to write the new contents to a new file.
import re
with open('filename', 'r') as f:
    input_file = f.read()
# input_file = "1, '01-JAN-10', '04-JAN-10', 100, 'HELEN', 'PRICE'"
dates = re.findall(r'\d+-[A-Za-z]+-\d+', input_file) # output: ['01-JAN-10', '04-JAN-10']
for old in dates:
    # str.replace returns a new string, so assign the result back
    input_file = input_file.replace(old, changeDate(old)) # your changeDate(date) in your question
with open('new_file', 'w+') as f:
    f.write(input_file)
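A variation on the same idea is to let re.sub do the detection and the replacement in one pass by giving it a function as the replacement. The sketch below is only illustrative: change_date is a hypothetical stand-in for the asker's changeDate(), and it assumes the input dates are DD-MON-YY and the target format is YYYY-MM-DD, neither of which is stated in the question.

```python
import re
from datetime import datetime


def change_date(old):
    # Hypothetical stand-in for the asker's changeDate(); assumes DD-MON-YY input.
    return datetime.strptime(old, "%d-%b-%y").strftime("%Y-%m-%d")


# Two digits, a three-letter month abbreviation, two digits.
DATE_PATTERN = re.compile(r"\d{2}-[A-Za-z]{3}-\d{2}")


def convert_dates(text):
    """Replace every matching date in text, calling change_date once per match."""
    return DATE_PATTERN.sub(lambda m: change_date(m.group(0)), text)


line = "1, '01-JAN-10', '04-JAN-10', 100, 'HELEN', 'PRICE'"
print(convert_dates(line))
```

Since each match is rewritten exactly once, this avoids re-processing text that an earlier replacement already changed, which can happen when calling str.replace repeatedly in a loop.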

python CSV writer - formatting

Quick question on how to properly write data back into a CSV file using the Python csv module. Currently I'm importing a file, pulling a column of dates and making a column of days_of_the_week using the datetime module. I then want to write out a new csv file (or overwrite the original one) containing one original element and the new element.
with open('new_dates.csv') as csvfile2:
    readCSV2 = csv.reader(csvfile2, delimiter=',')
    incoming = []
    for row in readCSV2:
        readin = row[0]
        time = row[1]
        year, month, day = (int(x) for x in readin.split('-'))
        ans = datetime.date(year, month, day)
        wkday = ans.strftime("%A")
        incoming.append(wkday)
        incoming.append(time)
with open('new_dates2.csv', 'w') as out_file:
    out_file.write('\n'.join(incoming))
The input file looks like this:
2017-03-02,09:25
2017-03-01,06:45
2017-02-28,23:49
2017-02-28,19:34
When using this code I end up with an output file that looks like this:
Friday
15:23
Friday
14:41
Friday
13:54
Friday
7:13
What I need is an output file that looks like this:
Friday,15:23
Friday,14:41
Friday,13:54
Friday,7:13
If I change the delimiter in out_file.write to a comma I just get one element of data per column, like this:
Friday 15:23 Friday 14:41 Friday 13:54 ....
Any thoughts would be appreciated. Thanks!
Being somewhat unclear on what format you want, I've assumed you just want a single space between wkday and time. For a quick fix, instead of appending both wkday and time separately, as in your example, append them together:
...
incoming.append('{} {}'.format(wkday,time))
...
OR, build your incoming as a list of lists:
...
incoming.append([wkday,time])
...
and change your write to:
with open('new_dates2.csv', 'w') as out_file:
    out_file.write('\n'.join([' '.join(t) for t in incoming]))
It seems you want Friday in column 0 and the time in column 1, so you need to change your incoming to a list of lists. That means the append statement should look like this:
...
incoming.append([wkday, time])
...
Then, it is better to use the csv.writer to write back to the file. You can write the whole incoming in one go without worrying about formatting.
with open('new_dates2.csv', 'w') as out_file:
    writer = csv.writer(out_file)
    writer.writerows(incoming)
Basically, your incoming array is a flat list in which weekdays and times alternate. So what you should be doing is something like the following:
#your incoming array
incoming = ['Friday', '15:23', 'Friday', '14:41', 'Friday', '13:54', 'Friday', '7:13']
#pair up consecutive items and write one comma-separated line per pair
with open('new_dates2.csv', 'w') as out_file:
    for i,j in zip(incoming[::2], incoming[1::2]):
        out_file.write(','.join((i,j)))
        out_file.write('\n')
You don't really need the csv module for this. I'm guessing at the input, but from the description it looks like:
2017-03-02,09:25
2017-03-01,06:45
2017-02-28,23:49
2017-02-28,19:34
This will parse it and write it in a new format:
import datetime
with open('new_dates.csv') as f1, open('new_dates2.csv','w') as f2:
    for line in f1:
        dt = datetime.datetime.strptime(line.strip(),'%Y-%m-%d,%H:%M')
        f2.write(dt.strftime('%A,%H:%M\n'))
Output file:
Thursday,09:25
Wednesday,06:45
Tuesday,23:49
Tuesday,19:34
