How to remove time from date values pulled from a JSON file? - python

I am working in Python, pulling information from a JSON file and writing it to a CSV file. The code I am using is as follows:
import csv
import json
csv_kwargs = {
'dialect': 'excel',
'doublequote': True,
'quoting': csv.QUOTE_MINIMAL
}
inpfile = open('checkin.json', 'r', encoding='utf-8')
outfile = open('checkin.csv', 'w', encoding='utf-8')
writer = csv.writer(outfile, **csv_kwargs, lineterminator="\n")
for line in inpfile:
    d = json.loads(line)
    writer.writerow([d['business_id'],d['date']])
inpfile.close()
outfile.close()
checkin.json has the keys business_id and date. The date values are in the form 'MM:DD:YYYY HH:MM:SS', showing the date and then the time. Each business_id has multiple dates associated with it. A line from the JSON file, showing a business_id and its dates, is shown below:
{"business_id":"--1UhMGODdWsrMastO9DZw","date":"2016-04-26 19:49:16, 2016-08-30 18:36:57, 2016-10-15 02:45:18, 2016-11-18 01:54:50, 2017-04-20 18:39:06, 2017-05-03 17:58:02"}
My question is: how do I keep the date but drop the time, given that all the values sit under the same key?

You can parse the date in your JSON as a timestamp and then truncate it to date using Python's built-in datetime module.
Import the module:
from datetime import datetime
Parse the date while writing:
for line in inpfile:
    d = json.loads(line)
    # The key is 'date', and the individual timestamps are separated by commas.
    dates = map(lambda dt: datetime.strptime(dt.strip(), '%Y-%m-%d %H:%M:%S').strftime('%Y-%m-%d'), d['date'].split(','))
    for date in dates:
        writer.writerow([d['business_id'], date])

The date format described in your question isn't consistent: first you say it's MM:DD:YYYY, but in the sample line from the JSON input file it appears to be YYYY-MM-DD. Such details can matter, but that particular one doesn't affect the revised code below. What did matter is that there can be more than one date per line, which is why I'm updating my answer.
import csv
import json
csv_kwargs = {
'dialect': 'excel',
'doublequote': True,
'quoting': csv.QUOTE_MINIMAL,
}
with open('checkin.json', 'r', encoding='utf-8') as inpfile, \
     open('checkin.csv', 'w', encoding='utf-8', newline='') as outfile:
    writer = csv.writer(outfile, **csv_kwargs)
    for line in inpfile:
        d = json.loads(line)
        # Convert date value string into list of dates with the times removed.
        dates = [date.strip().split(' ')[0] for date in d['date'].split(',')]
        writer.writerow([d['business_id']] + dates)
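As a self-contained sketch of the same idea, the helper below takes the comma-separated value of the date key and returns only the date parts. The name strip_times is hypothetical, not something from the question. It assumes every entry has the YYYY-MM-DD HH:MM:SS layout shown in the sample line, and it checks that assumption with datetime.strptime instead of slicing blindly.

```python
from datetime import datetime

def strip_times(date_field):
    """Return the date part of each comma-separated timestamp in date_field."""
    dates = []
    for stamp in date_field.split(','):
        stamp = stamp.strip()
        if not stamp:
            continue  # skip empty pieces, e.g. from a trailing comma
        # strptime raises ValueError if an entry does not match the assumed layout
        parsed = datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S')
        dates.append(parsed.date().isoformat())
    return dates

sample = "2016-04-26 19:49:16, 2016-08-30 18:36:57, 2016-10-15 02:45:18"
print(strip_times(sample))
```

Inside the loop from the answer above, this could be used as writer.writerow([d['business_id']] + strip_times(d['date'])).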

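If you're strictly using this program to convert the JSON file to CSV, you can simply use string slices. For a single 'YYYY-MM-DD HH:MM:SS' entry, as in the sample line, the date is the first 10 characters:
stamp = d['date'].split(',')[0].strip()
date, time = stamp[:10], stamp[11:]
If you want to store it as a datetime object to do something else:
from datetime import datetime
dt = datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S')
# Other stuff
dt_string = dt.strftime('%Y-%m-%d')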

Related

Data Cleansing Dates in Python

I am processing a .csv in Python which has a column full of timestamps in the below format:-
August 21st 2020, 13:58:19.000
The full content of a line in the .csv looks like this:-
"August 21st 2020, 14:55:12.000","joebloggs#gmail.com",IE,Ireland,"192.168.0.1",2
What I wish to do is populate a separate column with the respective date and time and so I'd have '2020-08-21' in a new column and '13:58:19' in an additional new column.
Looking at this, I think I need to firstly tidy up the date and remove the letters after the actual date (21st, 4th, 7th etc) so that they read 21, 4 etc...
I started writing the code below, but it removes the numbers themselves. How can I strip out just the letters after the numbers? A simple find/replace would also remove the 'st' from 'August', so I think a regex is needed: look for a space followed by one or two digits, check whether they are followed by one of ['nd','st','rd','th'], and if so strip that suffix out:-
with open('input_csv', 'r') as input_file:
    lines = input_file.readlines()
    for line in lines:
        line = re.sub(r"\s\d+", "", line)
        print(line)
Based on a suggestion to use the datetime module, I tried splitting to see if this helped:-
import re
from datetime import datetime
with open('input_csv.csv', 'r') as input_file:
    lines = input_file.readlines()[1:]
    for line in lines:
        line_lst = line.split(',')
        line_date = datetime.strptime(line_lst[0], '%B %d %y')
        line_time = datetime.strptime(line_lst[1], '%H:%M:%S.%f')
        print(line_date)
        print(line_time)
but I still receive an error:-
ValueError: time data '"May 12th 2020' does not match format '%B %d %y'
It's probably easier to use the datetime library. For example,
from datetime import datetime
with open('input_csv', 'r') as input_file:
    lines = input_file.readlines()
    for line in lines:
        line_date = datetime.strptime(line, '%B %d %y, %H:%M:%S.%f')
and then use strftime() to write it to the new file in the new format. To read more about these methods:
https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
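Note that strptime has no directive for ordinal suffixes such as 21st, so the suffix has to be removed first, which is what the regex attempt in the question was aiming at. Below is a minimal sketch of that step. It assumes the timestamp has already been isolated from the rest of the row (for example with the csv module), that the year has four digits (%Y rather than %y), and that month names are in English. The helper name parse_ordinal_timestamp is hypothetical.

```python
import re
from datetime import datetime

# Matches st/nd/rd/th only when it directly follows a digit,
# so the "st" in "August" is left alone.
ORDINAL_SUFFIX = re.compile(r'(?<=\d)(?:st|nd|rd|th)\b')

def parse_ordinal_timestamp(text):
    """Parse strings like 'August 21st 2020, 14:55:12.000' into a datetime."""
    cleaned = ORDINAL_SUFFIX.sub('', text.strip())
    return datetime.strptime(cleaned, '%B %d %Y, %H:%M:%S.%f')

dt = parse_ordinal_timestamp('August 21st 2020, 14:55:12.000')
print(dt.date().isoformat())           # 2020-08-21
print(dt.time().strftime('%H:%M:%S'))  # 14:55:12
```

The lookbehind is the design choice that matters here: it ties the suffix to a preceding digit, so no separate check for month names is needed.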
I recommend using as many good libraries as you can for this: a lot of what you want to do has already been developed and tested.
Use a CSV parser
Parsing a CSV can look simple at first but become increasingly difficult as there are a lot of special rules around quoting (escaping) characters. Use Python's standard CSV module to do this:
import csv
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        date_val = row[0]
        print(f'Raw string: {date_val}')
and I get the following without any special code:
Raw string: August 21st 2020, 14:55:12.000
Use a really good date-time parser
As a couple of others already pointed out, it's important to use a date-time parser, but they didn't tell you that Python's standard parser cannot handle the ordinals like '12th' and '21st' (maybe they just didn't see those...). The 3rd-party dateutil lib can handle them, though:
from dateutil import parser
...
...
print(f'Raw string: {date_val}')
dt = parser.parse(date_val)
print(f'Date: {dt.date()}')
print(f'Time: {dt.time()}')
Raw string: August 21st 2020, 14:55:12.000
Date: 2020-08-21
Time: 14:55:12
Make a new CSV
Again, use the CSV module's writer class to take the modified rows from the reader and turn them into a CSV.
For simplicity, I recommend just accumulating new rows as you're reading and parsing the input. I've also included commented-out lines for copying a header, if you have one:
...
new_rows = []
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    # header = next(reader)
    # new_rows.append(header)
    for row in reader:
        date_val = row[0]
        remainder = row[1:]
        dt = parser.parse(date_val)
        new_row = [dt.date(), dt.time()]
        new_row += remainder
        new_rows.append(new_row)
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(new_rows)
My output.csv looks like this:
2020-08-21,14:55:12,joebloggs#gmail.com,IE,Ireland,192.168.0.1,2

How to use python to find the maximum value in a csv file list

I am very new to Python and need some help finding the maximum value in a column of data (time) that is imported from a CSV file. This is the code I have tried:
file = open ("results.csv")
unneeded = file.readline()
for line in file:
    data = file.readline ()
    linelist = line.split(",")
    hours = linelist[4]
maxtime = 0
for x in hours:
    if x > maxtime:
        maxtime = x
print (maxtime)
any help is appreciated
edit: i tried this code but it gives me the wrong answer :(
file = open ("results.csv")
unneeded = file.readline()
maxtime = 0
for line in file:
    data = file.readline ()
    linelist = line.split(",")
    hours = linelist[4]
    if hours > str(maxtime):
        maxtime = hours
print (maxtime)
[first few lines of results][1]
edit:
results cvs
[1]: https://i.stack.imgur.com/z3pEJ.png
I haven't tested it but this should work. Using the csv library makes parsing CSV files easy.
import csv
with open("results.csv", newline="") as file:
    csv_reader = csv.reader(file, delimiter=',')
    next(csv_reader)  # skip the header row, as in the question
    maxtime = 0.0  # initialise once, before the loop
    for row in csv_reader:
        hours = float(row[4])  # assumes the column holds plain numbers
        if hours > maxtime:
            maxtime = hours
print(maxtime)
My recommendation is using the pandas module for anything CSV-related.
Using dateutil, I create a dataset of dates and identification numbers whose date values are shuffled (no specific order) like so:
from dateutil.parser import *
from dateutil.rrule import *
from random import shuffle
dates = list(rrule(
    DAILY,
    dtstart=parse("19970902T090000"),
    until=parse("19971224T000000")
))
shuffle(dates)
with open('out.csv', mode='w', encoding='utf-8') as out:
    for i,date in enumerate(dates):
        out.write(f'{i},{date}\n')
Thus, in this particular dataset, 1997-12-23 09:00:00 would be the "largest" date. Then, to extract the maximum date, we can just do it via string comparisons if it is formatted in the ISO 8601 date/time format (header=None is used because out.csv has no header row):
from pandas import read_csv
df = read_csv('out.csv', names=['index', 'time'], header=None)
print(max(df['time']))
After running it, we indeed get 1997-12-23 09:00:00 printed in the terminal!
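The string comparison only gives the right maximum because ISO 8601 timestamps sort as text in chronological order. For other layouts, one option is to parse each value before comparing. The following is a sketch using only the standard library. The helper name latest_timestamp, the column index, and the format string are assumptions made for illustration.

```python
import csv
import io
from datetime import datetime

def latest_timestamp(lines, column, fmt):
    """Return the latest datetime found in the given column of CSV lines."""
    reader = csv.reader(lines)
    return max(datetime.strptime(row[column], fmt) for row in reader if row)

# In-memory stand-in for a file; day/month/year dates do not sort correctly as text.
data = io.StringIO(
    "0,23/12/1997 09:00:00\n"
    "1,02/09/1997 09:00:00\n"
    "2,30/11/1997 09:00:00\n"
)
print(latest_timestamp(data, 1, '%d/%m/%Y %H:%M:%S'))
```

With these made-up rows, a plain text max() would pick the 30/11/1997 entry, while the parsed comparison picks 23 December 1997.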

To store Datetime type in csv file

I want to store Datetime in CSV file.
I tried following code
date=location.iloc[0,1]
lastdate=location.iloc[len(location)-1,1]
import csv
f = open('numbers3.csv', 'w')
with f:
    writer = csv.writer(f)
    writer.writerow(fields)
    while(date==lastdate) or (date<lastdate):
        print(date)
        strr=str(date)
        writer.writerow(str(date))
        date=date+timedelta(minutes=15)
output for this is
And I want the following output
If you want to store a list of datetimes in one column of a CSV file, put each datetime in its own single-element list, like this:
import csv
datetimes = [['2020-10-10 13:12'], ['2020-10-10 13:12'], ['2020-10-10 13:12'], ['2020-10-10 13:12']]
myFile = open('/path/csvexample3.csv', 'w', newline='')
with myFile:
    writer = csv.writer(myFile)
    writer.writerows(datetimes)
Then you are good to go. Note that each date is in its own list, so datetimes is a list of lists.
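The same rule applies to writerow in the question's loop: it expects a sequence of fields, so passing a bare string writes one character per column. Here is a minimal sketch of the loop with that fixed. It assumes the start and end values are datetime objects, keeps the 15-minute step from the question, and uses a hypothetical helper named write_datetimes.

```python
import csv
import io
from datetime import datetime, timedelta

def write_datetimes(out, start, end, step=timedelta(minutes=15)):
    """Write one timestamp per row, from start to end inclusive."""
    writer = csv.writer(out, lineterminator='\n')
    current = start
    while current <= end:
        # Wrap the value in a list so it is written as a single field.
        writer.writerow([current.strftime('%Y-%m-%d %H:%M')])
        current += step

buffer = io.StringIO()  # stands in for open('numbers3.csv', 'w', newline='')
write_datetimes(buffer, datetime(2020, 10, 10, 13, 0), datetime(2020, 10, 10, 13, 45))
print(buffer.getvalue())
```

The in-memory buffer is only there to keep the sketch runnable without touching the file system; any writable text file object works in its place.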

How to compare date from csv(string) to actual date

filenameA ="ApptA.csv"
filenameAc = "CheckoutA.csv"
def checkouttenantA():
    global filenameA
    global filenameAc
    import csv
    import datetime
    with open(filenameA, 'r') as inp, open(filenameAc, 'a' , newline = "") as out:
        my_writer = csv.writer(out)
        for row in csv.reader(inp):
            my_date= datetime.date.today()
            string_date = my_date.strftime("%d/%m/%Y")
            if row[5] <= string_date:
                my_writer.writerow(row)
Dates are saved in format %d/%m/%Y in an excel file on column [5]. I am trying to compare dates in csv file with actual date, but it is only comparing the %d part. I assume it is because dates are in string format.
OK, so there are a few other improvements to make as well, but the main problem is this: you're converting today's date to a string with strftime() and comparing the two strings. Instead, you should convert the string date from the CSV file to a datetime object and compare those.
I'll add plenty of comments to try and explain the code and the reasoning behind it.
# imports should go at the top
import csv
# notice we are importing the `datetime` type from the module `datetime`
from datetime import datetime

# try to avoid globals where possible (they're not needed here)
def check_dates_in_csv(input_filepath):
    ''' function to load csv file and compare dates to todays date'''
    # create a list to store the rows which meet our criteria
    # appending the rows to this will make a list of lists (nested list)
    output_data = []
    # get todays date before loop to avoid calling now() every line
    # we only need this once and it'll slow the loop down calling it every row
    todays_date = datetime.now()
    # open your csv here using the function argument
    with open(input_filepath, newline='') as csv_file:
        reader = csv.reader(csv_file)
        # iterate over the rows and grab the date in each row
        for row in reader:
            string_date = row[5]
            # convert the string to a datetime object
            csv_date = datetime.strptime(string_date, '%d/%m/%Y')
            # compare the dates and append if it meets the criteria
            if csv_date <= todays_date:
                output_data.append(row)
    # function should only do one thing, compare the dates
    # save the output after
    return output_data

# then run the script here
# this comparison is basically the entry point of the python program
# this answer explains it better than I could: https://stackoverflow.com/questions/419163/what-does-if-name-main-do
if __name__ == "__main__":
    # use our new function to get the output data
    output_data = check_dates_in_csv("input_file.csv")
    # save the data here
    with open("output.csv", "w", newline='') as output_file:
        writer = csv.writer(output_file)
        writer.writerows(output_data)
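To see why the string comparison from the question misbehaves, note that strings are compared character by character, so with %d/%m/%Y the day dominates the result. A small sketch with made-up dates:

```python
from datetime import datetime

fmt = '%d/%m/%Y'
earlier = '15/06/2020'
later = '02/01/2021'

# As text, '0' < '1', so the later date wrongly counts as "smaller".
print(later <= earlier)  # True

# As datetime objects, the comparison is chronological.
print(datetime.strptime(later, fmt) <= datetime.strptime(earlier, fmt))  # False
```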
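I would recommend using pandas for such tasks:
import pandas as pd
filenameA = "ApptA.csv"
filenameAc = "CheckoutA.csv"
# pd.datetime has been removed from pandas; use pd.Timestamp instead
today = pd.Timestamp.today()
# dayfirst=True because the dates are stored as %d/%m/%Y
df = pd.read_csv(filenameA, parse_dates=[5], dayfirst=True)
df.loc[df.iloc[:, 5] <= today].to_csv(filenameAc, index=False)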

How to detect and edit date in file

I have a file that consists of a bunch of lines that have dates in them, for example:
1, '01-JAN-10', '04-JAN-10', 100, 'HELEN', 'PRICE'
2, 'MARK', 'TYER', '05-JAN-10', '06-JAN-10', 120
I want to change the date parts of the lines to a different format, but I don't know how to detect which part of the line has the date fields and I don't know how to replace them with the new date format. I already have a function called changeDate(date) that returns a correctly formatted date given a bad format date. This is my code so far:
def editFile(filename):
    f = open(filename)
    while line:
        line = f.readline()
        for word in line.split():
            #detect if it is a date, and change to new format
    f.close()
You can use strptime and try/except to do this:
strptime
Return a datetime corresponding to date_string, parsed according to
format.
See more details from strftime() and strptime() Behavior.
from datetime import datetime
s="1, '01-JAN-10', '04-FEB-28', 100, 'HELEN', 'PRICE'"
for word in s.replace(' ','').replace('\'','').split(','):
    try:
        dt=datetime.strptime(word,'%y-%b-%d')
        print('{0}/{1}/{2}'.format(dt.month, dt.day, dt.year))
    except Exception as e:
        print(word)
Result:
1
1/10/2001
2/28/2004
100
HELEN
PRICE
You can use a regex to detect the dates. It's hard to modify the file in place, so it may be easier to write the new contents to a new file.
import re
with open('filename', 'r') as f:
    input_file = f.read()
# input_file = "1, '01-JAN-10', '04-JAN-10', 100, 'HELEN', 'PRICE'"
dates = re.findall(r'\d+-[A-Za-z]+-\d+', input_file) # output: ['01-JAN-10', '04-JAN-10']
for old in dates:
    # str.replace returns a new string, so the result must be assigned back
    input_file = input_file.replace(old, changeDate(old)) # your changeDate(date) in your question
with open('new_file', 'w+') as f:
    f.write(input_file)
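A variant of the same idea is to let re.sub do the replacement through a callback, which avoids a separate find-then-replace pass. In this sketch, change_date is only a stand-in for the changeDate function from the question. It assumes the dates are DD-MON-YY and should become YYYY-MM-DD; both the input reading and the target format are assumptions.

```python
import re
from datetime import datetime

# Two digits, a three-letter month abbreviation, two digits.
DATE_PATTERN = re.compile(r'\d{2}-[A-Za-z]{3}-\d{2}')

def change_date(old):
    # Stand-in for the question's changeDate(); assumes DD-MON-YY input.
    return datetime.strptime(old, '%d-%b-%y').strftime('%Y-%m-%d')

def convert_dates(text):
    # The callback receives a match object and returns the replacement text.
    return DATE_PATTERN.sub(lambda m: change_date(m.group(0)), text)

line = "1, '01-JAN-10', '04-JAN-10', 100, 'HELEN', 'PRICE'"
print(convert_dates(line))
```

Because the pattern only matches the date-shaped fields, the other fields (numbers and names) pass through unchanged, wherever they appear in the line.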
