I am processing a .csv in Python which has a column full of timestamps in the below format:-
August 21st 2020, 13:58:19.000
The full content of a line in the .csv looks like this:-
"August 21st 2020, 14:55:12.000","joebloggs#gmail.com",IE,Ireland,"192.168.0.1",2
What I wish to do is populate a separate column with the respective date and time and so I'd have '2020-08-21' in a new column and '13:58:19' in an additional new column.
Looking at this, I think I need to firstly tidy up the date and remove the letters after the actual date (21st, 4th, 7th etc) so that they read 21, 4 etc...
I started writing the code below, but it removes the numbers themselves. How can I strip out just the letters after the numbers? A simple find/replace would also remove the 'st' from 'August', for example, so I think a regex is needed: look for a space followed by 1 or 2 digits, check whether they are followed by one of ['nd', 'st', 'rd', 'th'] and, if so, strip that suffix out:-
with open('input_csv', 'r') as input_file:
    lines = input_file.readlines()
for line in lines:
    line = re.sub(r"\s\d+", "", line)
    print(line)
Based on a suggestion to use the datetime module, I tried splitting to see if this helped:-
import re
from datetime import datetime
with open('input_csv.csv', 'r') as input_file:
    lines = input_file.readlines()[1:]
for line in lines:
    line_lst = line.split(',')
    line_date = datetime.strptime(line_lst[0], '%B %d %y')
    line_time = datetime.strptime(line_lst[1], '%H:%M:%S.%f')
    print(line_date)
    print(line_time)
but I still receive an error:-
ValueError: time data '"May 12th 2020' does not match format '%B %d %y'
It's probably easier to use the datetime library. For example,
from datetime import datetime
with open('input_csv', 'r') as input_file:
    lines = input_file.readlines()
for line in lines:
    # %Y is the four-digit year (%y would expect '20')
    line_date = datetime.strptime(line, '%B %d %Y, %H:%M:%S.%f')
and then use strftime() to write it to the new file in the new format. To read more about these methods:
https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
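A minimal sketch of that approach, assuming the timestamp has already been isolated from the rest of the line and that month names are in English (%B is locale-dependent). The names split_timestamp and ORDINAL_RE are illustrative and not from the original code. The ordinal suffix is removed first, because strptime has no directive for it:

```python
import re
from datetime import datetime

# Matches an ordinal suffix only when it directly follows a digit,
# so the "st" in "August" is left alone.
ORDINAL_RE = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def split_timestamp(raw):
    """Return (date, time) strings for e.g. 'August 21st 2020, 13:58:19.000'."""
    cleaned = ORDINAL_RE.sub("", raw)
    dt = datetime.strptime(cleaned, "%B %d %Y, %H:%M:%S.%f")
    return dt.strftime("%Y-%m-%d"), dt.strftime("%H:%M:%S")


print(split_timestamp("August 21st 2020, 13:58:19.000"))
# ('2020-08-21', '13:58:19')
```

The lookbehind keeps the digits in place and drops only the suffix, which is what the regex in the question was missing.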
I recommend using as many good libraries as you can for this: a lot of what you want to do has already been developed and tested.
Use a CSV parser
Parsing a CSV can look simple at first but become increasingly difficult as there are a lot of special rules around quoting (escaping) characters. Use Python's standard CSV module to do this:
import csv
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        date_val = row[0]
        print(f'Raw string: {date_val}')
and I get the following without any special code:
Raw string: August 21st 2020, 14:55:12.000
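To illustrate why the quoting rules matter, here is a small sketch that compares a naive split(',') with csv.reader on the sample line from the question. It reads from an in-memory io.StringIO rather than a real file, which is an assumption made only to keep the example self-contained:

```python
import csv
import io

sample = '"August 21st 2020, 14:55:12.000","joebloggs#gmail.com",IE,Ireland,"192.168.0.1",2\n'

# Naive splitting cuts the timestamp in two at its embedded comma
# and leaves the quote characters in place.
naive = sample.rstrip("\n").split(",")

# csv.reader honours the quotes and returns the timestamp as one field.
parsed = next(csv.reader(io.StringIO(sample)))

print(len(naive), repr(naive[0]))
print(len(parsed), repr(parsed[0]))
```

The naive split yields seven pieces with a broken first field, while the reader yields the six real columns.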
Use a really good date-time parser
As a couple of others already pointed out, it's important to use a date-time parser, but they didn't tell you that Python's standard parser cannot handle the ordinals like '12th' and '21st' (maybe they just didn't see those...). The 3rd-party dateutil lib can handle them, though:
from dateutil import parser
...
...
print(f'Raw string: {date_val}')
dt = parser.parse(date_val)
print(f'Date: {dt.date()}')
print(f'Time: {dt.time()}')
Raw string: August 21st 2020, 14:55:12.000
Date: 2020-08-21
Time: 14:55:12
Make a new CSV
Again, use the CSV module's writer class to take the modified rows from the reader and turn them into a CSV.
For simplicity, I recommend just accumulating new rows as you're reading and parsing the input. I've also included commented-out lines for copying a header, if you have one:
...
new_rows = []
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    # header = next(reader)
    # new_rows.append(header)
    for row in reader:
        date_val = row[0]
        remainder = row[1:]
        dt = parser.parse(date_val)
        new_row = [dt.date(), dt.time()]
        new_row += remainder
        new_rows.append(new_row)
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(new_rows)
My output.csv looks like this:
2020-08-21,14:55:12,joebloggs#gmail.com,IE,Ireland,192.168.0.1,2
I need to get a string from a CSV file. I know I can use Python for this, but I've been looking for hours and still can't work it out. The CSV looks like this:
DATE|CUST|PHONE|EMAIL|NAME|CLASS|QTY|AMOUNT|ID|TRX_ID|BOOKING CODE|PIN
01-02-2013 09:04:16|sdasd|43543|csdfd|Voucher Regular|REGULAR|1|2250000|G001T001|0062013000149|32143000341|MV1011302JSGUCFOM
01-02-2013 09:04:16|sdasd|43543|csdfd|Voucher Regular|REGULAR|2|1200000|G001T001|0062013000149|32143000341|MV4011302CBWDQYOU&MV4011302PVSEVAPJ
01-02-2013 11:01:13|ge|||Voucher Regular|REGULAR|1|600000|G001T001|20000027000005|32143000355|MV4011302UHKMJEEM
The string that I want to get is the PIN column (the last one), but in each row that column can contain multiple PINs, separated by '&'.
Thanks for the help, been looking at solving this for hours.
Split on | and get the last entry:
pin = line.split('|')[-1]
Or more fancy:
import csv
# Python 3: open in text mode with newline='' (the 'rb' mode was for Python 2)
with open('bookings.csv', newline='') as csvfile:
    bookings = csv.reader(csvfile, delimiter='|')
    for values in bookings:
        print(values[-1])
When dealing with a csv file, just use the csv module:
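import csv
from itertools import chain
# Python 3: open in text mode with newline='' (the 'rb' mode was for Python 2)
with open('path/to/your/file.csv', newline='') as csvfile:
    tmp = (r['PIN'].split('&') for r in csv.DictReader(csvfile, delimiter='|'))
    # consume the generator while the file is still open
    pins = list(chain.from_iterable(tmp))
for pin in pins:
    print(pin)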
Iterate through each line and use something like this:
'hello|world|and|noel'.split('|')[-1]
to get the last element
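Since the last field may hold several PINs joined by '&', the split-based approach needs one more split. A minimal sketch, using a sample row from the question; the helper name extract_pins is hypothetical:

```python
def extract_pins(line):
    """Return the list of PINs from the last '|'-separated field of a line."""
    last_field = line.rstrip("\r\n").split("|")[-1]
    return last_field.split("&")


row = (
    "01-02-2013 09:04:16|sdasd|43543|csdfd|Voucher Regular|REGULAR|2|1200000|"
    "G001T001|0062013000149|32143000341|MV4011302CBWDQYOU&MV4011302PVSEVAPJ\n"
)
print(extract_pins(row))
# ['MV4011302CBWDQYOU', 'MV4011302PVSEVAPJ']
```

Stripping the line ending first matters here, because the PIN is the last field and would otherwise keep the trailing newline.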
Name,USAF,NCDC,Date,HrMn,I,Type,Dir,Q,I,Spd,Q
OXNARD,723927,93110,19590101,0000,4,SAO,270,1,N,3.1,1,
OXNARD,723927,93110,19590101,0100,4,SAO,338,1,N,1.0,1,
OXNARD,723927,93110,19590101,0200,4,SAO,068,1,N,1.0,1,
OXNARD,723927,93110,19590101,0300,4,SAO,068,1,N,2.1,1,
OXNARD,723927,93110,19590101,0400,4,SAO,315,1,N,1.0,1,
OXNARD,723927,93110,19590101,0500,4,SAO,999,1,C,0.0,1,
....
OXNARD,723927,93110,19590102,0000,4,SAO,225,1,N,2.1,1,
OXNARD,723927,93110,19590102,0100,4,SAO,248,1,N,2.1,1,
OXNARD,723927,93110,19590102,0200,4,SAO,999,1,C,0.0,1,
OXNARD,723927,93110,19590102,0300,4,SAO,068,1,N,2.1,1,
Here is a snippet of a csv file storing hourly wind speeds (Spd) in each row. What I'd like to do is select all hourly winds for each day in the csv file and store them into a temporary daily list storing all of that day's hourly values (24 if no missing values). Then I'll output the current day's list, create new empty list for the next day, locate hourly speeds in the next day, output that daily list, and so forth until the end of the file.
I'm struggling to find a good method to do this. One thought I have is to read in line i, determine the date (YYYY-MM-DD), then read in line i+1 and see if that date matches date i. If they match, then we're in the same day. If they don't, then we are onto the next day. But I can't even figure out how to read in the next line in the file...
Any suggestions to execute this method or a completely new (and better?!) method are most welcome. Thank you in advance!
obs_in = open(csv_file).readlines()
for i in range(1,len(obs_in)):
    # Skip over the header lines
    if not str(obs_in[i]).startswith("Identification") and not str(obs_in[i]).startswith("Name"):
        name,usaf,ncdc,date,hrmn,i,type,dir,q,i2,spd,q2,blank = obs_in[i].split(',')
        current_dt = datetime.date(int(date[0:4]),int(date[4:6]),int(date[6:8]))
        current_spd = spd
        # Read in next line's date: is it in the same day?
        # If in the same day, then append spd into tmp daily list
        # If not, then start a new list for the next day
You can take advantage of the well-ordered nature of the data file and use csv.DictReader. Then you can build up a dictionary of the wind speeds organized by date quite simply, which you can process as you like. Note that the csv reader returns strings, so you might want to convert to other types as appropriate while you assemble the list.
import csv
from collections import defaultdict
bydate = defaultdict(list)
rdr = csv.DictReader(open('winds.csv','rt'))
for k in rdr:
    bydate[k['Date']].append(float(k['Spd']))
print(bydate)
defaultdict(<type 'list'>, {'19590101': [3.1000000000000001, 1.0, 1.0, 2.1000000000000001, 1.0, 0.0], '19590102': [2.1000000000000001, 2.1000000000000001, 0.0, 2.1000000000000001]})
You can obviously change the argument to the append call to a tuple, for instance append((float(k['Spd']), datetime.datetime.strptime(k['Date']+k['HrMn'], '%Y%m%d%H%M'))), so that you can also collect the times.
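As a quick check of how the Date and HrMn columns combine into a single timestamp, here is a minimal sketch; the helper name parse_obs_time is hypothetical, and the values come from the sample rows in the question:

```python
from datetime import datetime


def parse_obs_time(date_str, hrmn_str):
    """Combine 'YYYYMMDD' and 'HHMM' strings into one datetime."""
    return datetime.strptime(date_str + hrmn_str, "%Y%m%d%H%M")


print(parse_obs_time("19590101", "0100"))
# 1959-01-01 01:00:00
```

Note the lowercase %d for the day of the month; an uppercase %D is not a valid strptime directive.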
If the file has extraneous spaces, you can use the skipinitialspace parameter: rdr = csv.DictReader(open('winds.csv','rt'), skipinitialspace=True). If this still doesn't work, you can pre-process the header line:
bydate = defaultdict(list)
with open('winds.csv', 'rt') as f:
    # read the header ourselves and strip stray whitespace from each name
    fieldnames = [k.strip() for k in f.readline().split(',')]
    rdr = csv.DictReader(f, fieldnames=fieldnames, skipinitialspace=True)
    for k in rdr:
        bydate[k['Date']].append(k['Spd'])
bydate is accessed like a regular dictionary. To access a specific day's data, do bydate['19590101']. To get the list of dates that were processed, you can do bydate.keys().
If you want to convert them to Python datetime objects at the time of reading the file, you can import datetime, then replace the assignment line with bydate[datetime.datetime.strptime(k['Date'], '%Y%m%d')].append(k['Spd']).
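Because the rows are already ordered by date, a different technique, itertools.groupby, can hand back one day's list at a time, which is closer to the "output the day, then start the next" flow described in the question. This is a sketch under some assumptions: the name daily_speeds is hypothetical, the data is read from an in-memory sample, and columns are looked up by position because the header repeats the names "I" and "Q":

```python
import csv
import io
from itertools import groupby

sample = """Name,USAF,NCDC,Date,HrMn,I,Type,Dir,Q,I,Spd,Q
OXNARD,723927,93110,19590101,0000,4,SAO,270,1,N,3.1,1,
OXNARD,723927,93110,19590101,0100,4,SAO,338,1,N,1.0,1,
OXNARD,723927,93110,19590102,0000,4,SAO,225,1,N,2.1,1,
"""


def daily_speeds(f):
    """Yield (date, [speeds]) for each run of consecutive rows sharing a date."""
    reader = csv.reader(f)
    header = next(reader)
    date_idx = header.index("Date")
    spd_idx = header.index("Spd")
    # groupby only merges adjacent rows, so the file must be sorted by date.
    for date, rows in groupby(reader, key=lambda row: row[date_idx]):
        yield date, [float(row[spd_idx]) for row in rows]


for date, speeds in daily_speeds(io.StringIO(sample)):
    print(date, speeds)
```

Unlike the dictionary approach, this keeps only one day in memory at a time, but it relies on the dates being contiguous in the file.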
It can be something like this.
import datetime
def dump(buf, date):
    """dumps buffered lines into file 'spdYYYYMMDD.csv'"""
    if len(buf) == 0: return
    with open('spd%s.csv' % date, 'w') as f:
        for line in buf:
            f.write(line)
obs_in = open(csv_file).readlines()
# buf stores one day record
buf = []
# date0 is meant for time stamp for the buffer
date0 = None
for i in range(1,len(obs_in)):
    # Skip over the header lines
    if not str(obs_in[i]).startswith("Identification") and \
       not str(obs_in[i]).startswith("Name"):
        name,usaf,ncdc,date,hrmn,ii,type,dir,q,i2,spd,q2,blank = \
            obs_in[i].split(',')
        current_dt = datetime.date(int(date[0:4]),int(date[4:6]),int(date[6:8]))
        current_spd = spd
        # see if the time stamp of current record is different. if it is different
        # dump the buffer, and also set the time stamp of buffer
        if date != date0:
            dump(buf, date0)
            buf = []
            date0 = date
        # you change this. i am simply writing entire line
        buf.append(obs_in[i])
# when you get out the buffer should be filled with the last day's record.
# so flush that too.
dump(buf, date0)
I also found that I had to use ii instead of i for the field "I" of the data, because you used i as the loop counter.
I know this question is from years ago but just wanted to point out that a small bash script can neatly perform this task. I copied your example into a file called data.txt and this is the script:
#!/bin/bash
date=19590101
date_end=19590102
while [[ $date -le $date_end ]] ; do
grep ",${date}," data.txt > file_${date}.txt
date=`date +%Y%m%d -d ${date}+1day` # NOTE: MAC-OSX date differs
done
Note that this won't work on macOS, because the date command there is the BSD implementation rather than the GNU one, so on macOS you either need to use gdate (from coreutils) or change the options to match those of the BSD date.
If there are dates missing from the file the grep command produces an empty file - this link shows ways to avoid this:
how to stop grep creating empty file if no results
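For comparison, the same per-date split can be sketched in Python. Because files are created only for dates that actually occur in the data, no empty files appear. The function name split_by_date is hypothetical, the file_YYYYMMDD.txt naming follows the script above, and the output goes to a temporary directory purely for illustration:

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def split_by_date(lines, out_dir):
    """Write each date's lines to file_<date>.txt; return the dates found."""
    by_date = defaultdict(list)
    for line in lines:
        fields = line.split(",")
        # Skip the header and any line without a numeric date in column 4.
        if len(fields) > 3 and fields[3].isdigit():
            by_date[fields[3]].append(line)
    for date, day_lines in by_date.items():
        (Path(out_dir) / f"file_{date}.txt").write_text("".join(day_lines))
    return sorted(by_date)


sample_lines = [
    "Name,USAF,NCDC,Date,HrMn,I,Type,Dir,Q,I,Spd,Q\n",
    "OXNARD,723927,93110,19590101,0000,4,SAO,270,1,N,3.1,1,\n",
    "OXNARD,723927,93110,19590101,0100,4,SAO,338,1,N,1.0,1,\n",
    "OXNARD,723927,93110,19590102,0000,4,SAO,225,1,N,2.1,1,\n",
]

with tempfile.TemporaryDirectory() as tmp:
    print(split_by_date(sample_lines, tmp))
    print(sorted(p.name for p in Path(tmp).iterdir()))
```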