The code below works as I need it to, but I feel like there must be a better way. I have a folder with daily(ish) files inside of it. All of them have the same prefix and the date they were sent as the file name. On certain days, no file was sent at all. My task is to read the last file of each month (most of the time that is the last day of the month, but April's last file was the 28th, July's was the 29th, etc.).
This uses the pathlib module, which I would like to continue to use.
files = sorted(ROOT.glob('**/*.csv*'))
file_dates = [Path(file.stem).stem.replace('prefix_', '').split('_') for file in files]  # strip the prefix and keep a list of the date elements
dates = [pd.to_datetime(date[0] + '-' + date[1] + '-' + date[2]) for date in file_dates]  # build a proper date from the elements
x = pd.DataFrame(dates)
x['month'] = x[0].dt.strftime('%Y-%m') + '-01'
max_value = x.groupby(['month'])[0].max().reset_index()
max_value[0] = max_value[0].dt.strftime('%Y_%m_%d')
monthly_files = [str(ROOT / 'prefix_') + date + '.csv.xz' for date in max_value[0].values]
df = pd.concat([pd.read_csv(file, usecols=columns, sep='\t', compression='xz', dtype=object) for file in monthly_files])
I believe this is a case where, because I have a hammer (pandas), everything looks like a nail (I turn everything into a dataframe). I am also trying to get used to list comprehensions after several years of not using them.
There's probably a better way, but here's my try:
files = sorted(ROOT.glob('**/*.csv*'))
file_dates = [Path(file.stem).stem.replace('prefix_', '').split('_') for file in files]  # strip the prefix and keep a list of the date elements
df = pd.DataFrame(file_dates, columns=['y', 'm', 'd'], dtype='int')
monthly = [str(yy) + '-' + str(mm).zfill(2) + '-' + str(df.loc[(df['y'] == yy) & (df['m'] == mm), 'd'].max()).zfill(2)
           for yy in df['y'].unique()
           for mm in df.loc[df['y'] == yy, 'm'].unique()]
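A tidier pandas variant of the same idea is worth sketching: let groupby compute the per-month maximum directly, so only year/month combinations that actually occur are produced (this assumes the same y/m/d integer columns as above).
# Sketch: group by year and month and take the latest day in each group.
last_per_month = df.groupby(['y', 'm'])['d'].max()
monthly = [f"{y}-{int(m):02d}-{int(d):02d}" for (y, m), d in last_per_month.items()]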
So the file names would be prefix_<date> and the date is in format %Y-%m-%d.
from datetime import datetime as dt
from collections import defaultdict
from pathlib import Path

group_by_month = defaultdict(list)
files = []

# Assuming `folder` is the data folder path itself.
for file in Path(folder).iterdir():
    if file.is_file() and file.name.startswith('prefix_'):
        # Convert the date part of the file name to a datetime object
        converted_dt = dt.strptime(file.name.split('prefix_')[1],
                                   '%Y-%m-%d')
        # Group the dates by (year, month) so the same month in different years stays separate
        group_by_month[(converted_dt.year, converted_dt.month)].append(converted_dt)

# Get the max of the dates stored for each month.
max_dates = {month: max(dates)
             for month, dates in group_by_month.items()}

# Get the files that match the prefix and the max dates
for file in Path(folder).iterdir():
    for date in max_dates.values():
        if ('prefix_' + dt.strftime(date, '%Y-%m-%d')) in str(file):
            files.append(file)
PS: I haven't worked with pandas a lot, so I went with a native-Python approach to get the files that match the max date of each month.
To my knowledge this is going to be difficult to do with a list comprehension, since you have to compare the current element with the next element.
However, there are simpler solutions that will get you there without pandas.
The example below just loops over a string list of the file dates and keeps each date right before the month changes. Since your list is sorted, that should do the trick. I am assuming YYYY_MM_DD date formats.
files = sorted(ROOT.glob('**/*.csv*'))
file_dates = [Path(file.stem).stem.replace('prefix_', '') for file in files]
# adding a dummy date because we compare each element to the next one
file_dates.append('0000_00_00')
result = []
for i, j in enumerate(file_dates[:-1]):
    if j[:7] != file_dates[i + 1][:7]:  # compare the 'YYYY_MM' part with the next file's
        result.append(j)
monthly_files = [str(ROOT / 'prefix_') + date + '.csv.xz' for date in result]
df = pd.concat([pd.read_csv(file, usecols=columns, sep='\t', compression='xz', dtype=object) for file in monthly_files])
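An alternative sketch, assuming the same sorted list of prefix_YYYY_MM_DD.csv.xz paths: itertools.groupby can keep the last file of each month without the explicit index comparison.
from itertools import groupby
from pathlib import Path

files = sorted(ROOT.glob('**/*.csv*'))

def year_month(path):
    # 'prefix_2021_04_28.csv.xz' -> '2021_04'
    return Path(path.stem).stem.replace('prefix_', '')[:7]

# groupby on an already-sorted list groups consecutive files of the same month,
# so the last element of each group is the last file of that month
monthly_files = [list(group)[-1] for _, group in groupby(files, key=year_month)]
Here monthly_files holds the Path objects themselves, so they can be passed straight to pd.read_csv.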
Related
I'm trying to figure out the code to remove the rows in a CSV file where the date in the Date column starts with 202110 (followed by any day), so all rows from October should be removed.
Then I want to save the CSV with the original name + 'updated'. I think both the part where I try to remove the rows and the part where I save the file are incorrect. Could you help?
My current code is:
import os
import glob
import pandas as pd
from pathlib import Path
sourcefiles = sorted(Path(r'/Users/path/path/path').glob('*.csv'))
for file in sourcefiles:
    df = pd.read_csv(file)
    df2 = df[~df.Date.str.contains('202110')]
    df2.to_csv("Updated.csv")  # How to save with the original file name + the word "updated"?
Just to give an example of the CSV file: some of the rows contain dates in October, and these are the rows I need to remove before saving the CSV with 'updated' in the name. Thanks a lot for the help.
You can do something like this:
for file in sourcefiles:
    df = pd.read_csv(file)
    df.Date = pd.to_datetime(df.Date)
    condition = ~((df.Date.dt.year == 2021) & (df.Date.dt.month == 10))
    df_new = df.loc[condition]
    name, ext = file.name.split('.')
    df_new.to_csv(f'{name}_updated.{ext}')
This is assuming you have one dot in your filenames.
As you use pathlib, you can use file.parent and file.stem:
Replace:
df2.to_csv("Updated.csv")
By:
df2.to_csv(file.parent / f"{file.stem}_updated.csv")
I have a folder that is updated daily with a new version of each file, following this naming scheme: ['AA_06182020', 'AA_06202020', 'BTT_06182020', 'BTT_06202020', 'DC_06182020', 'DC_06202020', 'HOO_06182020', 'HOO_06202020']. The 06182020 in the file name is the date of the file (mm/dd/yyyy), with the more recent dates obviously being the newer versions of the file. Right now I have a script (that runs daily) which iterates over every file in the folder, but I want only the newest version of each file to be used. So far I've been able to retrieve a list of all the files, parse the date portion of each name into a datetime object, and append that to a new list. I'm unsure how to proceed from here: how do I sort the list by date and select only the newest version of each file for further processing?
from pathlib import Path
import pandas as pd
import re
from datetime import datetime
me_data = (r"Path To Folder")
pathlist = Path(me_data).glob('**/*.xlsx')
fyl = []
new_fyls = []
for path in pathlist:
    # because path is object not string
    path_in_str = str(path)
    fyl.append(path.stem)
for entry in fyl:
    typ, date1 = entry.split('_')
    dt = datetime.strptime(date1, '%m%d%Y')
    new_fyls.append((entry, dt))
I suggest you modify your 2nd loop a bit and use a dictionary. You can key on the filename prefix typ so that only one date is kept per file (plus the full filename for convenience). When you encounter a new date in the loop, compare it with the one already stored for that file and keep the more recent one.
files = {}  # the dictionary
for entry in fyl:
    typ, date1 = entry.split('_')
    dt = datetime.strptime(date1, '%m%d%Y')
    if typ not in files or files[typ][0] < dt:  # datetime supports comparison
        files[typ] = (dt, entry)
In the if statement, typ not in files handles the first time you encounter a file type in the loop, while the other condition checks whether the stored date needs updating.
Lastly, to get the most recent file names, take all the stored values and keep the second element of each tuple:
new_fyls = [row[1] for row in files.values()]
This produces ['AA_06202020', 'BTT_06202020', 'DC_06202020', 'HOO_06202020'] with your example.
You could try sorting using a lambda function, like this:
from datetime import datetime
files = ['AA_06182020', 'AA_06202020', 'BTT_06182020', 'BTT_06202020', 'DC_06182020', 'DC_06202020', 'HOO_06182020', 'HOO_06202020']
sorted_files = sorted(files, key=lambda x: datetime.strptime(x.split('_')[1], '%m%d%Y'), reverse=True)
This will produce a sorted files list with the newest files first (according to your naming convention).
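If only the newest file per prefix is needed (rather than the whole sorted list), here is a minimal follow-up sketch, assuming the same sorted_files from above:
# Because the list is sorted newest-first, the first file seen for each prefix
# is the newest one; keep only that first occurrence.
newest = {}
for f in sorted_files:
    prefix = f.split('_')[0]
    newest.setdefault(prefix, f)

print(list(newest.values()))
# ['AA_06202020', 'BTT_06202020', 'DC_06202020', 'HOO_06202020']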
I am concatenating 100 CSVs with names like XXX_XX_20112020.csv to create one file, let's say master.csv.
Can I extract the date from each file name and create a new column with that date populated for all records in that file? Should I be doing this before or after concatenation, and how?
If they all follow the same XXX_XX_20112020.csv pattern, then just do 'XXX_XX_20112020.csv'.rsplit('_',1)[-1].rsplit('.',1)[0]:
import datetime
file_name = 'XXX_XX_20112020.csv'
file_name_ending = file_name.rsplit('_',1)[-1]
date_part = file_name_ending.rsplit('.',1)[0]
date_part_parsed = datetime.datetime.strptime(date_part, "%d%m%Y").date()
So rsplit is used to split the file name on '_', and we do the same to get rid of the suffix, i.e. '.csv', by splitting on '.'. Now you need to turn the date string into a real date.
Read here:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime
strptime will turn the string into a datetime object when the right format is given.
Now you can make this a function and apply it to all the file names you have, for example:
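A minimal sketch of such a function (the name extract_date is only an illustration):
import datetime

def extract_date(file_name):
    # e.g. 'XXX_XX_20112020.csv' -> datetime.date(2020, 11, 20)
    date_part = file_name.rsplit('_', 1)[-1].rsplit('.', 1)[0]
    return datetime.datetime.strptime(date_part, "%d%m%Y").date()

print(extract_date('XXX_XX_20112020.csv'))  # 2020-11-20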
P.S: rsplit https://docs.python.org/3/library/stdtypes.html#str.rsplit
import os
import datetime
import pandas as pd

master_df = pd.DataFrame()
for file in os.listdir('folder_with_csvs'):
    # we access the last element after an underscore and everything before the dot before csv
    date_for_file = file.split('_')[-1].split('.')[0]
    date_for_file = datetime.datetime.strptime(date_for_file, "%d%m%Y").date()
    # os.listdir returns bare names, so join with the folder before reading
    df = pd.read_csv(os.path.join('folder_with_csvs', file))
    # Following line will put your date in the `POST_DATE` column for every record of this file
    df['POST_DATE'] = date_for_file
    master_df = pd.concat([master_df, df])

# Eventually
master_df.to_csv('master.csv')
I am trying to copy data from one Excel file to another automatically using Python; currently I have to manually update the date in the Excel file name every morning. Is there a way to automatically update the date in the Excel file name? I am very new to any form of programming and am trying to learn to keep my job.
I have tried to use the date/time functions, declare the date as a variable, and insert it into the code, but with no luck:
import datetime
Filedate = (datetime.date.today() - datetime.timedelta(1))
exceldate = Filedate.strftime("%Y") + Filedate.strftime("%m") + Filedate.strftime("%d")
import pyexcel as p
p.save_book_as(file_name="Q:\Valuations\Currency Options\YieldX Daily Statsexceldate.xls",  # CHANGE DATE - manual entry
               dest_file_name='YieldX Daily Stats20190522.xlsx')  # CHANGE DATE - manual entry
My approach is to split the filename into the part that contains the date and the rest, then replace the date with the current one.
import os
import datetime
import re
# get xls files
xls_files = [file for file in os.listdir(os.getcwd()) if file.endswith('.xls')]
# get current date
now = datetime.datetime.now()
# change names
for item in xls_files:
    # split name and date part
    name_parts = item.split('.')
    get_date = re.findall(r'\d+-\d+-\d+', name_parts[0])
    name_string_part = name_parts[0].replace(get_date[0], '')
    # create new name
    new_name = name_string_part + str(now.day) + '-' + str(now.month) + '-' + str(now.year) + '_' + '.xls'
    # rename file
    os.rename(item, new_name)
I believe what you are trying to do is open an Excel file every day and save it under the current date, where the previous Excel file has yesterday's date.
import datetime
import pyexcel as p
yesterday = (datetime.date.today()-datetime.timedelta(1)).strftime("%Y%m%d")
today = datetime.date.today().strftime("%Y%m%d")
p.save_book_as(file_name="Q:\Valuations\Currency Options\YieldX Daily Stats" + yesterday + ".xls",
               dest_file_name='YieldX Daily Stats' + today + '.xlsx')
The above code, when executed, will take the .xls file that was created yesterday (with yesterday's date in its name) and save it under a new name with the current date.
Example:
If a file named YieldX Daily Stats20190530.xls existed yesterday, today it will be saved as YieldX Daily Stats20190531.xlsx.
I am trying to download a number of .csv files, which I convert to pandas dataframes and append to each other.
The CSVs can be accessed via a URL that is created each day; using datetime the URLs can easily be generated and put in a list.
I am able to open each of these individually from the list.
When I try to open a number of them and append them together, I get an empty dataframe. The code looks like this:
#Imports
import datetime
import pandas as pd
#Testing can open .csv file
data = pd.read_csv('https://promo.betfair.com/betfairsp/prices/dwbfpricesukwin01022018.csv')
data.iloc[:5]
#Taking heading to use to create new dataframe
data_headings = list(data.columns.values)
#Setting up string for url
path_start = 'https://promo.betfair.com/betfairsp/prices/dwbfpricesukwin'
file = ".csv"
#Getting dates which are used in url
start = datetime.datetime.strptime("01-02-2018", "%d-%m-%Y")
end = datetime.datetime.strptime("04-02-2018", "%d-%m-%Y")
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days)]
#Creating new dataframe which is appended to
for heading in data_headings:
    data = {heading: []}
df = pd.DataFrame(data, columns=data_headings)
#Creating list of url
date_list = []
for date in date_generated:
    date_string = date.strftime("%d%m%Y")
    x = path_start + date_string + file
    date_list.append(x)
#Opening and appending csv files from list which contains url
for full_path in date_list:
    data_link = pd.read_csv(full_path)
    df.append(data_link)
print(df)
I have checked whether they are just empty CSVs, but they are not. Any help would be appreciated.
Cheers,
Sandy
You are never storing the appended dataframe. The line:
df.append(data_link)
Should be
df = df.append(data_link)
However, this may be the wrong approach. You really want to use the array of URLs and concatenate them. Check out this similar question and see if it can improve your code!
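A minimal sketch of that concat approach, reusing date_list from the question:
import pandas as pd

# read every daily CSV straight from its URL and concatenate them in one go
df = pd.concat((pd.read_csv(url) for url in date_list), ignore_index=True)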
I really can't understand what you wanted to do here:
#Creating new dataframe which is appended to
for heading in data_headings:
    data = {heading: []}
df = pd.DataFrame(data, columns=data_headings)
By the way, try this:
for full_path in date_list:
    data_link = pd.read_csv(full_path)
    df = df.append(data_link.copy())