Importing CSV with Python by timestamp range - python

I have about 100 CSV files in one location (as of now; there will be more tomorrow), and 24-40 new files are added every day. What is the best way to import only the files from the past day, without having to type each file name as I do here:
data = pd.read_csv('/data/testingfile-PM_18707-2017_06_14-05_03_23__382.csv', delimiter = ';', low_memory=False)
data1 = pd.read_csv('/data/testingfile--PM_18707-2017_06_14-06_30_56__131.csv', delimiter = ';', low_memory=False)
Is it possible to write some timestamp recognition function?
from datetime import time
from datetime import date
from datetime import datetime
import fnmatch
def get_local_file(date, hour, path='data/'):
    """Get date+hour processing file from local drive
    :param date: str Processing date
    :param hour: str Processing hour
    :param path: str Path to file location
    :return: Pandas DF Retrieved DataFrame
    """
    hour = [time(i).strftime(%H) for i in range(24)]
    sdate = date.replace('-', '_') + "-" + str(hour)
    for p_file in os.listdir(path):
        if fnmatch.fnmatch(p_file, 'testingfile-PM*'+sdate+'*.csv'):
            return pd.read_csv(path+p_file, delimiter=';')
I found something like this, but I can't make it work.

If you are looking for a way to extract the date from the name of your CSV file, have a look at Python's datetime module (or, more precisely, its strptime method). It lets you parse strings into datetimes like this:
from datetime import datetime
name = "data/testingfile-PM_18707-2017_06_14-05_03_23__382.csv"
datepart = name.strip("data/testingfile-PM_18707-").split("__")[0] # quick and dirty: str.strip removes a set of characters, not a prefix; it happens to work for the two given examples
date = datetime.strptime(datepart,"%Y_%m_%d-%H_%M_%S")
print(datepart)
print(date)
2017_06_14-05_03_23
2017-06-14 05:03:23
So if you want to selectively open only 1 day old csvs, you could do something like this:
import glob
from datetime import datetime
import pandas as pd
now = datetime.now()
for csv in glob.glob("data/*.csv"):
    datepart = csv.strip("data/testingfile-PM_18707-").split("__")[0]
    date = datetime.strptime(datepart, "%Y_%m_%d-%H_%M_%S")
    # keep only files whose name timestamp is less than 24 hours old
    if (now - date).total_seconds() < 3600*24:
        pd.read_csv(csv, delimiter=';')
    else:
        print("Too old to care!")
Note that this has nothing to do with Pandas itself.
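The strip-based parsing above depends on the exact prefix of the two example names. A slightly more defensive sketch pulls the timestamp out with a regular expression instead. The pattern is an assumption inferred from the two file names in the question, and file_timestamp and is_recent are hypothetical helper names, not part of any library:

```python
import re
from datetime import datetime, timedelta

# Assumed naming scheme, inferred from the two example files in the question:
# ...-YYYY_MM_DD-HH_MM_SS__<digits>.csv
STAMP = re.compile(r"(\d{4}_\d{2}_\d{2}-\d{2}_\d{2}_\d{2})__\d+\.csv$")

def file_timestamp(path):
    """Return the datetime embedded in the file name, or None if there is none."""
    match = STAMP.search(path)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y_%m_%d-%H_%M_%S")

def is_recent(path, now, max_age=timedelta(days=1)):
    """True if the file name timestamp lies within max_age before now."""
    stamp = file_timestamp(path)
    return stamp is not None and timedelta(0) <= now - stamp < max_age

# Possible usage (not executed here):
#   frames = [pd.read_csv(p, delimiter=';')
#             for p in glob.glob('data/*.csv')
#             if is_recent(p, datetime.now())]
```

Files whose names do not match the pattern are simply skipped rather than raising a ValueError from strptime.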

Related

Issues with converting date time to proper format- Columns must be same length as key

I'm doing some data analysis on a dataset (https://www.kaggle.com/sudalairajkumar/covid19-in-usa) and I'm trying to convert the date and time column (lastModified) to the proper datetime format. My first attempt returned an error:
ValueError: hour must be in 0..23
so I tried doing this -
data_df[['date','time']] = data_df['lastModified'].str.split(expand=True)
data_df['lastModified'] = (pd.to_datetime(data_df.pop('date'), format='%d/%m/%Y') +
                           pd.to_timedelta(data_df.pop('time') + ':00'))
This gives an error: Columns must be same length as key
I understand this means that the two columns I'm splitting aren't the same size. How do I resolve this issue? I'm relatively new to Python. Please explain in an easy-to-understand manner. Thanks very much.
This is my whole code-
import pandas as pd
dataset_url = 'https://www.kaggle.com/sudalairajkumar/covid19-in-usa'
import opendatasets as od
od.download(dataset_url)
data_dir = './covid19-in-usa'
import os
os.listdir(data_dir)
data_df = pd.read_csv('./covid19-in-usa/us_covid19_daily.csv')
data_df
data_df[['date','time']] = data_df['lastModified'].str.split(expand=True)
data_df['lastModified'] = (pd.to_datetime(data_df.pop('date'), format='%d/%m/%Y') +
                           pd.to_timedelta(data_df.pop('time') + ':00'))
Looks like lastModified is in ISO format. I have used something like below to convert iso date string:
from dateutil import parser
from datetime import datetime
...
timestamp = parser.isoparse(lastModified).timestamp()
dt = datetime.fromtimestamp(timestamp)
...
On this line:
data_df[['date','time']] = data_df['lastModified'].str.split(expand=True)
For this assignment to work, the number of columns on both sides of the = must be the same. split can output multiple columns, but only if it finds the character it is told to split on. By default, it splits on whitespace. There is no whitespace in the lastModified values, so nothing is split and only one column comes back. See the pandas documentation for Series.str.split.
For that reason, this line should be like this, so it splits on the T:
data_df[['date','time']] = data_df['lastModified'].str.split('T', expand=True)
But the solution posted by @southiejoe is likely to be more reliable. These timestamps are in a standard format; parsing them is a previously-solved problem.
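To see why the assignment fails, it can help to compare the shapes of the two split results. This is a small sketch on a made-up Series; the sample strings are assumptions that only imitate the ISO-like layout of lastModified:

```python
import pandas as pd

# Hypothetical sample values imitating the lastModified column
s = pd.Series(["2020-12-06T00:00:00Z", "2020-12-05T17:30:00Z"])

# Default split is on whitespace: nothing to split, so one column comes back
by_whitespace = s.str.split(expand=True)

# Splitting on 'T' yields two columns: the date part and the time part
by_t = s.str.split("T", expand=True)

print(by_whitespace.shape)  # (2, 1)
print(by_t.shape)           # (2, 2)
```

Assigning a one-column result to the two keys ['date', 'time'] is what triggers "Columns must be same length as key".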
You need these libraries:
# imports
from dateutil import parser
from datetime import datetime
Then write something similar to convert the date and time column. Because the value is parsed in one step and never split, the "same length as key" problem does not arise.
# parse the ISO string into a POSIX timestamp
timestamp = parser.isoparse(lastModified).timestamp()
# convert the timestamp into a datetime object
dt = datetime.fromtimestamp(timestamp)
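The snippets above convert a single string. For a whole column, a minimal sketch with pd.to_datetime might look like the following. The sample DataFrame is made up, and the assumption is that the real lastModified values are ISO 8601 strings like these:

```python
import pandas as pd

# Hypothetical stand-in for the real data_df
sample_df = pd.DataFrame(
    {"lastModified": ["2020-12-06T00:00:00Z", "2020-12-05T17:30:00Z"]}
)

# Parse the whole column at once; with errors="coerce", values that
# cannot be parsed become NaT instead of raising an exception
sample_df["lastModified"] = pd.to_datetime(
    sample_df["lastModified"], utc=True, errors="coerce"
)

# Separate date and time parts, if they are still wanted
sample_df["date"] = sample_df["lastModified"].dt.date
sample_df["time"] = sample_df["lastModified"].dt.time

print(sample_df)
```

Checking sample_df["lastModified"].isna().sum() afterwards shows how many rows failed to parse, which may be a way to locate the values behind the original "hour must be in 0..23" error.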

How to extract date from filename in python?

I need to extract the event date written in the filename into a new column called event_date. I assume I can use regex, but I have not yet worked out the exact pattern.
The filename is written below
file_name = X-Y Cable Installment Monitoring (10-7-20).xlsx
The (10-7-20) is in mm-dd-yy format.
I expect the result to be df['event_date'] = 2020-10-07
How should I write my script to get the correct date from the filename?
Thanks in advance.
Use str.rsplit() with the datetime module.
Steps:
1. Extract the date.
2. Convert it into the required format.
from datetime import datetime
file_name = 'X-Y Cable Installment Monitoring (10-7-20).xlsx'
date = file_name.rsplit('(')[1].rsplit(')')[0] # '10-7-20'
date = datetime.strptime(date, "%m-%d-%y").strftime('%Y-%m-%d') # '2020-10-07'
Or via regex -
import re
regex = re.compile(r"(\d{1,2}-\d{1,2}-\d{2})") # pattern to capture date
matchArray = regex.findall(file_name)
date = matchArray[0]
date = datetime.strptime(date, "%m-%d-%y").strftime('%Y-%m-%d')
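To get the value into the event_date column the question asks for, the regex approach can be wrapped in a small function. This is only a sketch: extract_event_date is a hypothetical helper name, the DataFrame is a placeholder, and the pattern assumes the date always appears in parentheses as in the example filename:

```python
import re
from datetime import datetime

import pandas as pd

# Assumes the date is enclosed in parentheses, as in the example filename
DATE_IN_PARENS = re.compile(r"\((\d{1,2}-\d{1,2}-\d{2})\)")

def extract_event_date(file_name):
    """Return the mm-dd-yy date in the filename as a date, or None if absent."""
    match = DATE_IN_PARENS.search(file_name)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%m-%d-%y").date()

file_name = "X-Y Cable Installment Monitoring (10-7-20).xlsx"

# Placeholder data standing in for the contents of the Excel file
df = pd.DataFrame({"value": [1, 2]})
df["event_date"] = extract_event_date(file_name)

print(df)
```

Anchoring the pattern on the parentheses avoids accidentally matching other digit groups elsewhere in the name.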

How do I remove/strip the row number from a variable in Python

import csv
import pandas as pd
from datetime import datetime,time,date
from pandas.io.data import DataReader
fd = pd.read_csv('c:\\path\\to\\file.csv')
fd.columns = ['Date','Time']
datex = fd.Date
timex = fd.Time
timestr = datetime.strptime ( str(datex+" "+timex) , "%m/%d/%Y %H:%M")
So, what I'm trying to do is pass columns Date and Time to datetime. There are two columns, date and time containing, obviously, the date and time. But when I try the above method, I receive this error:
\n35760 08/07/2015 04:56\n35761 08/07/2015 04:57\n35762 08/07/2015 04:58\n35763 08/07/2015 04:59\ndtype: object' does not match format '%m/%d/%Y %H:%M'
So, how do I either strip or remove \nXXXXX from datex and timex? Or otherwise match the format?
# concatenate the two columns (date and time) element-wise into a single string column
dt_str = datex + timex
# remove any stray newline-plus-digits sequences and surrounding whitespace from the new column
dt_str = dt_str.str.replace(r'\n\d+', '', regex=True).str.strip()
# the strings are now ready to be converted to datetime
pd.to_datetime(dt_str, format='%m/%d/%Y%H:%M')
Use the parse_dates parameter built into pandas' read_csv :)
pd.read_csv('c:\\path\\to\\file.csv', parse_dates=True)
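For context, the "\n35760" fragments in the error are not in the data: calling str() on a whole Series returns its printed form, index labels included, and strptime then receives that entire text. A sketch that works on the column element-wise instead, using made-up CSV content with the same Date and Time layout as the question:

```python
import io

import pandas as pd

# Hypothetical CSV content imitating the two columns from the question
csv_text = "Date,Time\n08/07/2015,04:56\n08/07/2015,04:57\n"
fd = pd.read_csv(io.StringIO(csv_text))

# Join the columns row by row, then parse the resulting string column
fd["timestamp"] = pd.to_datetime(
    fd["Date"] + " " + fd["Time"], format="%m/%d/%Y %H:%M"
)

print(fd["timestamp"])
```

No stripping of row numbers is needed, because the index never becomes part of the strings being parsed.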

Addition of two dates on python 3

I am trying to combine a date and a time from a CSV file into one datetime variable. I have read questions about adding a timedelta and the official docs (https://docs.python.org/3/library/datetime.html#timedelta-objects), but I don't understand how it works.
My CSV row looks like ['2005.02.28', '17:38', '1.32690', '1.32720', '1.32680', '1.32720', '5']. I convert row[0] = 2005.02.28 to a date and row[1] = 17:38 to a time. Now I need to create a new datetime variable that looks like 2005.02.28 17:38. How can I do it?
import csv
import datetime as dt
with open('EURUSDM1.csv') as csvfile:
    datereader=csv.reader(csvfile)
    for row in datereader:
        date=dt.datetime.strptime(row[0], "%Y.%m.%d")
        time=dt.datetime.strptime(row[1], "%H:%M")
import csv
import datetime as dt
with open('EURUSDM1.csv') as csvfile:
    datereader = csv.reader(csvfile)
    for row in datereader:
        # join the date and time strings, then parse them in one call
        dstr = row[0] + ' ' + row[1]
        date = dt.datetime.strptime(dstr, "%Y.%m.%d %H:%M")
Once you have both the values in two variables, new_date and new_time, you could simply combine them to get the datetime, like this,
>>> new_date = dt.datetime.strptime(row[0], "%Y.%m.%d")
>>> new_time = dt.datetime.strptime(row[1], "%H:%M").time()
>>>
>>> dt.datetime.combine(new_date, new_time)
datetime.datetime(2005, 2, 28, 17, 38)
Note:- Avoid using date and time as variable names, as they are also part of the datetime library. Use a variable name that is relevant to your application context.

Convert YYYYMMDD filename to YYYYJD

I'm trying to write a python script to convert a folder of .asc files (365 files for every year in different folders organized by year) that have the yearmonthdate in their filename to have the yearjuliandate instead and the julian date needs to be 3 digits (ie 1 = 001).
The format they are in: ETos19810101.asc.
I want them to be as: ETos1981001.asc
How do I write this in Python where I can iterate over each file and convert it to the correct julian day?
I have this so far:
import os.path, os, glob
for filename in glob.glob(filepath + "/*.asc"):
    jdate = '%03d' %doy #creates 3 digit julian date
    doy = doy + 1
    filename.replace(int[-8:-4], jdate + 1)
Given a file name like the following (you can iterate over your file system with os.walk):
filename = 'ETos19810101.asc'
First, split the filename to get each significant part:
import os
name, ext = os.path.splitext(filename)
prefix = name[0:-8] # negative index => counted from the end of the string
strdate = name[-8:] # the date part, YYYYMMDD, is 8 characters long
Then you can parse the date:
from datetime import datetime
date = datetime.strptime(strdate, '%Y%m%d')
Now you are able to join everything together (%Y%j format the date the way you want):
newfilename = '{prefix}{date:%Y%j}{ext}'.format(prefix=prefix, date=date, ext=ext)
Finally rename the file:
os.rename(filename, newfilename)
Note that on Windows the last instruction fails if newfilename already exists (on Unix, os.rename silently replaces an existing file).
To avoid the error, remove the existing file first:
if os.path.exists(newfilename):
    os.remove(newfilename)
os.rename(filename, newfilename)
Use the %j specifier along with datetime.strptime, os.rename and the various os.path functions:
from datetime import datetime
from glob import glob
import os
for filename in glob(os.path.join(filepath, 'ETos*.asc')):
    try:
        dt = datetime.strptime(os.path.basename(filename), 'ETos%Y%m%d.asc')
    except ValueError as e:
        continue # rest of file name didn't have valid date - do the needful
    os.rename(filename, os.path.join(filepath, format(dt, 'ETos%Y%j.asc')))
You'll probably want a bit of handling around that and adjust to take into account your path, but that's the general principle.
For working with dates you should use the datetime module. Parse the date string with strptime. datetime has no method named after the Julian day (the same value is available as dt.timetuple().tm_yday or via the %j format code), but it's easy to compute one yourself:
def julian_day(dt):
    jan1 = dt.replace(month=1, day=1)
    return 1 + (dt - jan1).days
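As a quick check, julian_day can be combined with strptime to build the new name without touching the file system. This is a sketch; to_julian_name is a hypothetical helper name, and it assumes every name follows the ETosYYYYMMDD.asc pattern from the question:

```python
from datetime import datetime

def julian_day(dt):
    """Day of the year, counting 1 January as day 1."""
    jan1 = dt.replace(month=1, day=1)
    return 1 + (dt - jan1).days

def to_julian_name(filename):
    """Map ETosYYYYMMDD.asc to ETosYYYYJJJ.asc (JJJ is the zero-padded day of year)."""
    dt = datetime.strptime(filename, "ETos%Y%m%d.asc")
    return "ETos{:%Y}{:03d}.asc".format(dt, julian_day(dt))

print(to_julian_name("ETos19810101.asc"))  # ETos1981001.asc
print(to_julian_name("ETos19811231.asc"))  # ETos1981365.asc
```

Printing the old and new names first, and only then calling os.rename, gives a cheap dry run before renaming a few thousand files.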
