How to detect and edit date in file - python

I have a file that consists of a bunch of lines that have dates in them, for example:
1, '01-JAN-10', '04-JAN-10', 100, 'HELEN', 'PRICE'
2, 'MARK', 'TYER', '05-JAN-10', '06-JAN-10', 120
I want to change the date parts of the lines to a different format, but I don't know how to detect which part of the line has the date fields and I don't know how to replace them with the new date format. I already have a function called changeDate(date) that returns a correctly formatted date given a bad format date. This is my code so far:
def editFile(filename):
    f = open(filename)
    line = f.readline()
    while line:
        for word in line.split():
            # detect if it is a date, and change to new format
            pass
        line = f.readline()
    f.close()

You can use strptime with try/except to do this:
strptime
Return a datetime corresponding to date_string, parsed according to
format.
See more details from strftime() and strptime() Behavior.
from datetime import datetime
s="1, '01-JAN-10', '04-FEB-28', 100, 'HELEN', 'PRICE'"
for word in s.replace(' ','').replace('\'','').split(','):
    try:
        # '%y-%b-%d' reads the fields as year-month-day; use '%d-%b-%y' if the day comes first
        dt=datetime.strptime(word,'%y-%b-%d')
        print('{0}/{1}/{2}'.format(dt.month, dt.day, dt.year))
    except ValueError:
        # not a date: print the field unchanged
        print(word)
Result:
1
1/10/2001
2/28/2004
100
HELEN
PRICE

You can use a regex to detect the dates. Modifying the file in place is awkward, so it is easier to write the new contents to a new file.
import re
with open('filename', 'r') as f:
    input_file = f.read()
# input_file = "1, '01-JAN-10', '04-JAN-10', 100, 'HELEN', 'PRICE'"
dates = re.findall(r'\d+-[A-Za-z]+-\d+', input_file) # output: ['01-JAN-10', '04-JAN-10']
for old in dates:
    # str.replace returns a new string, so the result must be assigned back
    input_file = input_file.replace(old, changeDate(old)) # your changeDate(date) in your question
with open('new_file', 'w+') as f:
    f.write(input_file)
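The two answers above can be combined: a regex finds the candidate tokens, and a callback passed to re.sub rewrites each match where it stands, so the rest of the line (quotes, commas, other fields) is left untouched. The following is only a minimal sketch. It assumes the dates are DD-MON-YY, and change_date is a hypothetical stand-in for the asker's changeDate, whose real output format is not shown in the question.

```python
import re
from datetime import datetime

# Two digits, three letters, two digits, e.g. 01-JAN-10
DATE_RE = re.compile(r"\b\d{2}-[A-Za-z]{3}-\d{2}\b")

def change_date(old):
    # Hypothetical stand-in for changeDate(); assumes DD-MON-YY input.
    return datetime.strptime(old, "%d-%b-%y").strftime("%Y-%m-%d")

def _replace(match):
    try:
        return change_date(match.group(0))
    except ValueError:
        # Looks like a date but is not one: leave the token unchanged.
        return match.group(0)

def convert_line(line):
    # re.sub calls _replace once per match and splices in its return value.
    return DATE_RE.sub(_replace, line)

print(convert_line("1, '01-JAN-10', '04-JAN-10', 100, 'HELEN', 'PRICE'"))
```

Applied line by line while copying the input file to a new file, this avoids both splitting on whitespace and calling replace once per date found.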

Related

Data Cleansing Dates in Python

I am processing a .csv in Python which has a column full of timestamps in the below format:-
August 21st 2020, 13:58:19.000
The full content of a line in the .csv looks like this:-
"August 21st 2020, 14:55:12.000","joebloggs#gmail.com",IE,Ireland,"192.168.0.1",2
What I wish to do is populate a separate column with the respective date and time and so I'd have '2020-08-21' in a new column and '13:58:19' in an additional new column.
Looking at this, I think I need to firstly tidy up the date and remove the letters after the actual date (21st, 4th, 7th etc) so that they read 21, 4 etc...
I started writing the code below, but it replaces the actual numbers too. How can I strip out just the letters after the numbers? I did think of a simple find/replace, but that would remove 'st' from August, for example. Hence I think a regex is needed that looks for a space followed by 1 or 2 digits, checks whether they are followed by one of ['nd','st','rd','th'] and, if so, strips that out:-
with open('input_csv', 'r') as input_file:
    lines = input_file.readlines()
for line in lines:
    line = re.sub(r"\s\d+", "", line)
    print(line)
Based on a suggestion to use the datetime module, I tried splitting to see if this helped:-
import re
from datetime import datetime
with open('input_csv.csv', 'r') as input_file:
    lines = input_file.readlines()[1:]
for line in lines:
    line_lst = line.split(',')
    line_date = datetime.strptime(line_lst[0], '%B %d %y')
    line_time = datetime.strptime(line_lst[1], '%H:%M:%S.%f')
    print(line_date)
    print(line_time)
but I still receive an error:-
ValueError: time data '"May 12th 2020' does not match format '%B %d %y'
It's probably easier to use the datetime library. For example,
from datetime import datetime
with open('input_csv', 'r') as input_file:
    lines = input_file.readlines()
for line in lines:
    line_date = datetime.strptime(line, '%B %d %y, %H:%M:%S.%f')
and then use strftime() to write it to the new file in the new format. To read more about these methods:
https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
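Note that strptime has no directive for ordinal suffixes such as 'st' or 'th', and %y expects a two-digit year, so the sample timestamps need a small clean-up and %Y before they will parse. A minimal sketch of the regex the question asks about, using only the standard library (the names ORDINAL_RE and parse_timestamp are made up for this example):

```python
import re
from datetime import datetime

# Matches st/nd/rd/th only when it directly follows a digit,
# so the "st" in "August" is never touched.
ORDINAL_RE = re.compile(r"(?<=\d)(?:st|nd|rd|th)\b")

def parse_timestamp(raw):
    # Drop surrounding whitespace and CSV quotes, then the ordinal suffix.
    cleaned = ORDINAL_RE.sub("", raw.strip().strip('"'))
    return datetime.strptime(cleaned, "%B %d %Y, %H:%M:%S.%f")

dt = parse_timestamp("August 21st 2020, 13:58:19.000")
print(dt.date().isoformat(), dt.time().isoformat())  # 2020-08-21 13:58:19
```

The lookbehind is what keeps month names safe. The function expects only the timestamp field, not the whole CSV line, so the line still has to be split into fields first.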
I recommend using as many good libraries as you can for this: a lot of what you want to do has already been developed and tested.
Use a CSV parser
Parsing a CSV can look simple at first but becomes increasingly difficult, as there are a lot of special rules around quoting (escaping) characters. Use Python's standard CSV module to do this:
import csv
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        date_val = row[0]
        print(f'Raw string: {date_val}')
and I get the following without any special code:
Raw string: August 21st 2020, 14:55:12.000
Use a really good date-time parser
As a couple of others already pointed out, it's important to use a date-time parser, but they didn't tell you that Python's standard parser cannot handle the ordinals like '12th' and '21st' (maybe they just didn't see those...). The 3rd-party dateutil lib can handle them, though:
from dateutil import parser
...
...
print(f'Raw string: {date_val}')
dt = parser.parse(date_val)
print(f'Date: {dt.date()}')
print(f'Time: {dt.time()}')
Raw string: August 21st 2020, 14:55:12.000
Date: 2020-08-21
Time: 14:55:12
Make a new CSV
Again, use the CSV module's writer class to take the modified rows from the reader and turn them into a CSV.
For simplicity, I recommend just accumulating new rows as you're reading and parsing the input. I've also included commented-out lines for copying a header, if you have one:
...
new_rows = []
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    # header = next(reader)
    # new_rows.append(header)
    for row in reader:
        date_val = row[0]
        remainder = row[1:]
        dt = parser.parse(date_val)
        new_row = [dt.date(), dt.time()]
        new_row += remainder
        new_rows.append(new_row)
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(new_rows)
My output.csv looks like this:
2020-08-21,14:55:12,joebloggs#gmail.com,IE,Ireland,192.168.0.1,2

How to remove time from date values pulled from a JSON file?

I am working in python and using a JSON file and pulling info from it and sending to a csv file. The code I am using is as follows:
import csv
import json
csv_kwargs = {
'dialect': 'excel',
'doublequote': True,
'quoting': csv.QUOTE_MINIMAL
}
inpfile = open('checkin.json', 'r', encoding='utf-8')
outfile = open('checkin.csv', 'w', encoding='utf-8')
writer = csv.writer(outfile, **csv_kwargs, lineterminator="\n")
for line in inpfile:
    d = json.loads(line)
    writer.writerow([d['business_id'],d['date']])
inpfile.close()
outfile.close()
checkin.json has two keys, business_id and date. The date values are in the form 'MM:DD:YYYY HH:MM:SS', showing the date and then the time. Each business_id has multiple dates associated with it. A line from the JSON file is shown below to illustrate how each 'business_id' and its associated dates are stored:
{"business_id":"--1UhMGODdWsrMastO9DZw","date":"2016-04-26 19:49:16, 2016-08-30 18:36:57, 2016-10-15 02:45:18, 2016-11-18 01:54:50, 2017-04-20 18:39:06, 2017-05-03 17:58:02"}
My question is: how do I code this to keep the date but not the time, given that they are in the same key value?
You can parse the date in your JSON as a timestamp and then truncate it to date using Python's built-in datetime module.
Import the module:
from datetime import datetime
Parse the date while writing:
for line in inpfile:
    d = json.loads(line)
    # The key is 'date' and the timestamps are comma-separated
    dates = map(lambda dt: datetime.strptime(dt.strip(), '%Y-%m-%d %H:%M:%S').strftime('%Y-%m-%d'), d['date'].split(','))
    for date in dates:
        writer.writerow([d['business_id'], date])
The formatting of the date values described in your question isn't consistent: first you say it's MM:DD:YYYY, but in the sample line from the JSON input file it appears to be YYYY-MM-DD. While such details may matter, that particular one doesn't affect the revised code below. What did matter was the fact that there can be more than one date per line, which is why I'm updating my answer.
import csv
import json
csv_kwargs = {
'dialect': 'excel',
'doublequote': True,
'quoting': csv.QUOTE_MINIMAL,
}
with open('checkin.json', 'r', encoding='utf-8') as inpfile, \
     open('checkin.csv', 'w', encoding='utf-8', newline='') as outfile:
    writer = csv.writer(outfile, **csv_kwargs)
    for line in inpfile:
        d = json.loads(line)
        # Convert date value string into list of dates with the times removed.
        dates = [date.strip().split(' ')[0] for date in d['date'].split(',')]
        writer.writerow([d['business_id']] + dates)
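Splitting on the space trusts that every entry really is a timestamp. If you would rather have malformed entries fail loudly, the same split can be paired with strptime. The sketch below assumes the YYYY-MM-DD HH:MM:SS layout seen in the sample line. The helper names dates_only and rows_for are invented for this example, and it produces one (business_id, date) pair per check-in instead of one wide row.

```python
import json
from datetime import datetime

def dates_only(date_field):
    # Split the comma-separated timestamps, validate each, keep the date part.
    return [
        datetime.strptime(ts.strip(), "%Y-%m-%d %H:%M:%S").date().isoformat()
        for ts in date_field.split(",")
        if ts.strip()
    ]

def rows_for(record):
    # One (business_id, date) pair per check-in.
    return [(record["business_id"], d) for d in dates_only(record["date"])]

line = '{"business_id":"--1UhMGODdWsrMastO9DZw","date":"2016-04-26 19:49:16, 2016-08-30 18:36:57"}'
for row in rows_for(json.loads(line)):
    print(row)
```

Each pair can be passed straight to writer.writerow. Whether one row per date or one row per business is preferable depends on how the CSV will be consumed.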
If you're strictly using this program to convert the JSON file to CSV, you can simply use string slices on each individual timestamp (the date part is the first 10 characters):
date, time = ts[:10], ts[11:]  # ts is a single timestamp such as '2016-04-26 19:49:16'
If you want to store it as a datetime object to do something else:
dt = datetime.strptime(ts, '%Y-%m-%d %H:%M:%S')  # needs: from datetime import datetime
# Other stuff
dt_string = dt.strftime('%Y-%m-%d')

how to remove DD in YYYYMMDD in python

I need to remove the day from a date. I tried to use datetime.strftime and datetime.strptime, but it didn't work. I need to create a tuple of 2 items (date, price) from a nested list, but I need to change the date format first.
Here's part of the code:
def get_data(my_csv):
    with open("my_csv.csv", "r") as csv_file:
        csv_reader = csv.reader(csv_file, delimiter = (','))
        next(csv_reader)
        data = []
        for line in csv_reader:
            data.append(line)
    return data
def get_monthly_avg(data):
    oldformat = '20040819'
    datetimeobject = datetime.strptime(oldformat,'%y%m%d')
    newformat = datetime.strftime('%y%m ')
You have a misprint in the date formats: 'Y' has to be capitalized (%Y is the four-digit year, %y is the two-digit year).
from datetime import datetime
# use datetime to convert
def strip_date(data):
    d = datetime.strptime(data,'%Y%m%d')
    return datetime.strftime(d,'%Y%m')
data = '20110513'
print (strip_date(data))
# or just cut off day (2 last symbols) from date string
print (data[:6])
The first variant is better because it verifies that the string is in a proper date format.
Output:
201105
201105
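To make the difference between the two variants concrete, here is a small sketch (to_year_month is a name chosen for this example) that feeds both a valid date and an impossible one, 31 February, through the strptime route. Slicing would return a year-month for both of them without complaint.

```python
from datetime import datetime

def to_year_month(yyyymmdd):
    # Raises ValueError if the string is not a real calendar date.
    return datetime.strptime(yyyymmdd, "%Y%m%d").strftime("%Y%m")

for value in ("20040819", "20040231"):
    try:
        print(value, "->", to_year_month(value))
    except ValueError:
        print(value, "-> rejected; slicing would have returned", value[:6])
```

If the input is already known to be clean, the slice is simpler and faster. The strptime version is worth its cost when the data comes from a file you do not control.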
You didn't specify any code, but this might work:
date = functionThatGetsDate()
date = date[0:6]

value error with datetime.strptime

I am trying to read dates from a txt file and have that converted to datetime format
Code:
from datetime import datetime, date
with open("birth.txt") as f:
    content = f.readlines()
content = [x.strip() for x in content]
for i in content:
    a = i.split(":")
    date_b = []
    date_b.append(a[-1])
    print date_b
    for j in date_b:
        date_object = datetime.strptime(str(j), '%m-%d-%Y')
        print date_object
Text File:
a:11-23-2001
b:02-14-2002
ValueError: time data ' 11-23-2001' does not match format '%m-%d-%Y'
Can someone help me resolve this error?
There are multiple problematic parts with your code. The error is caused by having a space before your date string although I'm not sure where it comes from given your file. Also, why are you even having the second loop? And you're overwriting the date_b in each line loop... Try this:
from datetime import datetime
with open("birth.txt") as f:
    dates = [] # store this outside of your loop
    for line in f: # read line by line
        v, d = line.strip().split(":")
        d = datetime.strptime(d.strip(), '%m-%d-%Y') # just in case of additional whitespace
        dates.append((v, d))
print(dates)
# [('a', datetime.datetime(2001, 11, 23, 0, 0)), ('b', datetime.datetime(2002, 2, 14, 0, 0))]
You can turn the latter into a dictionary, too (dict(dates)) or build a dictionary immediately.
Besides the issues that @zwer points out, you have the major inefficiency of reading the entire file into memory before processing it. This actually makes your job harder than it needs to be, because files in Python are iterable over their lines. You can do something like:
from datetime import datetime, date
with open('birth.txt') as f:
    for line in f:
        key, datestr = line.strip().split(':')
        # strip the date too: a space after the colon would otherwise break strptime
        dateobj = datetime.strptime(datestr.strip(), '%m-%d-%Y')
        print(dateobj)
Using the fact that the file is an iterator, you can write a one-line list comprehension to generate a full list of dates:
with open('birth.txt') as f:
    dates = [datetime.strptime(line.split(':')[1].strip(), '%m-%d-%Y') for line in f]
If the key has some significance, you can create a dictionary with a dictionary comprehension using a similar syntax:
with open('birth.txt') as f:
    dates = {key: datetime.strptime(datestr.strip(), '%m-%d-%Y') for key, datestr in (line.strip().split(':') for line in f)}
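The comprehension forms are compact, but they raise on any line without a colon, for example a trailing blank line. If the file may contain such lines, a plain loop with str.partition is a little more forgiving. This is a sketch only: parse_birthdays is a name invented here, and it takes any iterable of lines, so an open file object works as well as the list used below.

```python
from datetime import datetime

def parse_birthdays(lines):
    birthdays = {}
    for line in lines:
        # partition never raises: sep is '' when the line has no colon.
        key, sep, datestr = line.partition(":")
        if not sep:
            continue  # blank line or no colon; skip it
        birthdays[key.strip()] = datetime.strptime(datestr.strip(), "%m-%d-%Y")
    return birthdays

print(parse_birthdays(["a: 11-23-2001\n", "b:02-14-2002\n", "\n"]))
```

Stripping both the key and the date covers the stray space after the colon that caused the original ValueError.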

ValueError: time data '' does not match format '%d-%m-%Y %H:%M:%S'

What I am trying to do is:
1. Delete all rows where csv date is lower than 25.05.2016 23:59
2. Save the file with a different name
I have the following data in a csv in col A
WFQVG98765
FI Quality-Value-Growth
Some Random String 1
Datum
13-05-2016 23:59
14-05-2016 23:59
15-05-2016 23:59
16-05-2016 23:59
17-05-2016 23:59
18-05-2016 23:59
19-05-2016 02:03
.
.
.
.
This is what I have tried now
import csv
import datetime
from dateutil.parser import parse
def is_date(string):
    try:
        parse(string)
        return True
    except ValueError:
        return False
'''
1. Delete all rows where csv date is lower than 25.05.2016 23:59
2. Save the file with a different name
'''
cmpDate = datetime.datetime.strptime('25.05.2016 23:59:00', '%d.%m.%Y %H:%M:%S')
with open('WF.csv', 'r') as csvfile:
    csvReader = csv.reader(csvfile, delimiter=',')
    for row in csvReader:
        print (row[0])
        if is_date(row[0]) and not row[0].strip(' '):
            csvDate = datetime.datetime.strptime(row[0], '%d-%m-%Y %H:%M:%S')  # Error here: ValueError: time data '' does not match format '%d-%m-%Y %H:%M:%S'
I also tried this for the error line
            csvDate = datetime.datetime.strptime(row[0], '%d-%m-%Y %H:%M')  # But got the same error
            if csvDate<cmpDate:
                print (row[0]+'TRUE')
How can I delete the row if the condition is true, and finally save the file with a different name?
You can analyse each row to compare the dates, and save the rows you want to keep in a list. You can then store those rows into a new csv file and delete the old one if you don't need it anymore.
Here's a snippet that does what you're asking for:
import csv
from datetime import datetime
cmpDate = datetime.strptime('25.05.2016 23:59:00', '%d.%m.%Y %H:%M:%S')
def is_lower(date_str):
    try:
        # parse the argument, not the loop variable from the caller
        csvDate = datetime.strptime(date_str, '%d-%m-%Y %H:%M')
        return csvDate < cmpDate
    except ValueError:
        # not a date (e.g. a header line): keep the row
        return False
with open('WF.csv', 'r', newline='') as csvfile:
    csvReader = csv.reader(csvfile, delimiter=',')
    data = [row for row in csvReader if not is_lower(row[0])]
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    writer.writerows(data)
is_date() is giving you false positives. Be more strict when you check the date format and consistent when you load a date string into datetime - follow one of the principles of Zen of Python - "There should be one-- and preferably only one --obvious way to do it":
def is_date(date_string):
    try:
        datetime.datetime.strptime(date_string, '%d-%m-%Y %H:%M:%S')
        return True
    except ValueError:
        return False
In other words, don't mix dateutil.parser.parse() and datetime.datetime.strptime().
The datetime.datetime.strptime exception indicates you are passing a blank string to the function in row[0].
Once you get that issue resolved, you need to add code to write acceptable rows to a new file.
You're doing the wrong comparison when you call strip. Two things:
First of all, just use row[0].strip() (with no arguments). This will strip all whitespace (like newlines, etc), not just spaces.
Secondly, if is_date(row[0]) and not row[0].strip(' ') only passes when row[0] is empty, which is the opposite of what you want. This should be if row[0].strip() and is_date(row[0]):
Even better, given how your is_date function is implemented, you should probably just put your datetime creation into a function that handles errors. This is my usual pattern:
def parse_date(str_date):
    try:
        return datetime.datetime.strptime(str_date, '%d-%m-%Y %H:%M')
    except ValueError:
        return None
cmp_date = datetime.datetime.strptime('25.05.2016 23:59:00', '%d.%m.%Y %H:%M:%S')
output_rows = []
with open('WF.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        csv_date = parse_date(row[0].strip()) # returns a datetime or None
        if csv_date and csv_date >= cmp_date:
            output_rows.append(row)
# Finally, write output_rows to the output file
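For completeness, the writing step left as a comment above might look like the sketch below. One thing to watch: filtering on "parses as a date and is not before the cutoff" also drops the non-date lines at the top of the file. The header_rows parameter is an assumption based on the four such lines in the sample data; set it to 0 if those lines should not be carried over. The function name filter_csv and the output file name in the usage note are likewise made up.

```python
import csv
from datetime import datetime

def parse_date(text):
    try:
        return datetime.strptime(text.strip(), "%d-%m-%Y %H:%M")
    except ValueError:
        return None

def filter_csv(src, dst, cutoff, header_rows=4):
    # header_rows is an assumption: the sample shows four non-date lines on top.
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout)
        for index, row in enumerate(csv.reader(fin)):
            if index < header_rows:
                writer.writerow(row)  # copy header lines unchanged
                continue
            parsed = parse_date(row[0]) if row else None
            if parsed is not None and parsed >= cutoff:
                writer.writerow(row)
```

A call such as filter_csv('WF.csv', 'WF_filtered.csv', datetime(2016, 5, 25, 23, 59)) streams the rows straight to the new file, so nothing has to be accumulated in a list first.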
