How to iterate over folder, but only retrieve newest versions of files?

I have a folder that is updated daily with a new version of each file, following this naming scheme: ['AA_06182020', 'AA_06202020', 'BTT_06182020', 'BTT_06202020', 'DC_06182020', 'DC_06202020', 'HOO_06182020', 'HOO_06202020']. The 06182020 in the file name is the date of the file (mmddyyyy), with more recent dates being the newer versions of the file. Right now I have a script (that runs daily) which iterates over every file in the folder, but I want it to use only the newest version of each file. So far I've been able to retrieve a list of all the files, parse the date portion of each name into a datetime object, and append that to a new list. I'm unsure how to proceed from here: how do I sort the list by date and select only the newest version of each file for further processing?
from pathlib import Path
import pandas as pd
import re
from datetime import datetime
me_data = (r"Path To Folder")
pathlist = Path(me_data).glob('**/*.xlsx')
fyl = []
new_fyls = []
for path in pathlist:
    # because path is object not string
    path_in_str = str(path)
    fyl.append(path.stem)
for entry in fyl:
    typ, date1 = entry.split('_')
    dt = datetime.strptime(date1,'%m%d%Y')
    new_fyls.append((entry, dt))

I suggest you modify your second loop a bit to use a dictionary. Use the file type (typ) as the key so that only one date is kept per type (plus the filename, for convenience). When you encounter a new date in the loop, compare it with the one already stored for that type and keep the more recent one.
files = {} # the dictionary
for entry in fyl:
    typ, date1 = entry.split('_')
    dt = datetime.strptime(date1, '%m%d%Y')
    if typ not in files or files[typ][0] < dt: # datetime supports comparison
        files[typ] = (dt, entry)
In the if statement, typ not in files is true the first time a file type is encountered in the loop, while the other condition checks whether the stored entry needs updating.
Finally, to get the most recent file names, take all the stored values and keep the second element of each.
new_fyls = [row[1] for row in files.values()]
produces ['AA_06202020', 'BTT_06202020', 'DC_06202020', 'HOO_06202020'] with your example

You could try sorting using a lambda function, like this:
from datetime import datetime
files = ['AA_06182020', 'AA_06202020', 'BTT_06182020', 'BTT_06202020', 'DC_06182020', 'DC_06202020', 'HOO_06182020', 'HOO_06202020']
sorted_files = sorted(files, key=lambda x: datetime.strptime(x.split('_')[1], '%m%d%Y'), reverse=True)
This will produce a sorted files list with the newest files first (according to your naming convention).
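Sorting alone still leaves every version in the list. Here is a minimal sketch of one way to finish the job, assuming the same PREFIX_mmddyyyy naming: sort newest first, then keep only the first name seen for each prefix. The helper name newest_per_prefix is made up for this example.

```python
from datetime import datetime


def newest_per_prefix(names):
    """Return the newest name for each prefix, assuming PREFIX_mmddyyyy names."""
    def parse_date(name):
        # The date is everything after the last underscore.
        return datetime.strptime(name.rsplit('_', 1)[1], '%m%d%Y')

    newest = {}
    # Newest first, so the first name seen for a prefix is its latest version.
    for name in sorted(names, key=parse_date, reverse=True):
        prefix = name.rsplit('_', 1)[0]
        newest.setdefault(prefix, name)
    return sorted(newest.values())


files = ['AA_06182020', 'AA_06202020', 'BTT_06182020', 'BTT_06202020']
print(newest_per_prefix(files))  # ['AA_06202020', 'BTT_06202020']
```

Parsing the date instead of comparing the raw strings matters here, because mmddyyyy strings do not sort chronologically across years.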

Related

how to substitute value in url from value in list and use FOR loop to loop thru them all

I am trying to figure out how to replace "BATS" in the string in my code with the other exchanges in the list and loop thru them all to grab the stock data in one python script, rather than hard coding and creating multiple separate files.
I would also like to use the same logic to name the resulting local .csv file (BATS_2021-01-19.csv) after whichever exchange is being parsed. Here is my code.
import pandas as pd
import time
import os
import datetime
datetime = datetime.datetime.today().strftime('%Y-%m-%d')
exchanges = ["BATS","US","SG","LSE","V","TSE"]
df = pd.read_csv('https://eodhistoricaldata.com/api/eod-bulk-last-day/BATS?api_token=5f1343ba20.00275101&date=' + str(datetime))
Ticker = df['Code']
Date = df['Date']
Open = df['Open'].round(2)
High = df['High'].round(2)
Low = df['Low'].round(2)
Close = df['Adjusted_close'].round(2)
Volume = df['Volume']
total_df = pd.concat([Ticker, Date, Open, High, Low, Close, Volume],
                     axis=1, keys=['Ticker','Date','Open','High','Low','Close','Volume'])
filename = "BATS_"+(datetime)+".csv"
path = 'H:/EOD_DATA_RECENT/DOWNLOADS/'
full_path = os.path.join(path, filename)
total_df.to_csv(full_path, index=False)
print(total_df.head(5))

You can use string formatting to insert the exchange into the URL while looping over your exchanges.
Placing an f before the opening quotation mark lets you put variables directly into the string by surrounding the variable name with {}.
for ex in exchanges:
    print(f'https://eodhistoricaldata.com/api/eod-bulk-last-day/{ex}?api_token=5f1343ba20.00275101&date={str(datetime)}')
https://eodhistoricaldata.com/api/eod-bulk-last-day/BATS?api_token=5f1343ba20.00275101&date=2021-01-20
https://eodhistoricaldata.com/api/eod-bulk-last-day/US?api_token=5f1343ba20.00275101&date=2021-01-20
https://eodhistoricaldata.com/api/eod-bulk-last-day/SG?api_token=5f1343ba20.00275101&date=2021-01-20
https://eodhistoricaldata.com/api/eod-bulk-last-day/LSE?api_token=5f1343ba20.00275101&date=2021-01-20
https://eodhistoricaldata.com/api/eod-bulk-last-day/V?api_token=5f1343ba20.00275101&date=2021-01-20
https://eodhistoricaldata.com/api/eod-bulk-last-day/TSE?api_token=5f1343ba20.00275101&date=2021-01-20
You should also rename your datetime variable: assigning a string to the name datetime shadows the datetime module, so you won't be able to use the module again after that assignment.
datetime.datetime.now()
Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    datetime.datetime.now()
AttributeError: 'str' object has no attribute 'datetime'
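Putting both points together, a rough sketch of the loop could look like the following. The helper names build_url and build_output_path, the variable date_str, and the 'YOUR_API_TOKEN' placeholder are assumptions made for this example; the download and save steps are left as a comment so that the sketch runs without network access.

```python
import os
from datetime import date

BASE_URL = 'https://eodhistoricaldata.com/api/eod-bulk-last-day'
API_TOKEN = 'YOUR_API_TOKEN'  # placeholder, substitute your own token
OUTPUT_DIR = 'H:/EOD_DATA_RECENT/DOWNLOADS/'


def build_url(exchange, date_str, token=API_TOKEN):
    """Build the bulk end-of-day URL for one exchange."""
    return f'{BASE_URL}/{exchange}?api_token={token}&date={date_str}'


def build_output_path(exchange, date_str, folder=OUTPUT_DIR):
    """Build the local csv path, e.g. .../BATS_2021-01-19.csv."""
    return os.path.join(folder, f'{exchange}_{date_str}.csv')


# Named date_str so that it does not shadow the datetime module.
date_str = date.today().strftime('%Y-%m-%d')
exchanges = ["BATS", "US", "SG", "LSE", "V", "TSE"]
for ex in exchanges:
    url = build_url(ex, date_str)
    full_path = build_output_path(ex, date_str)
    # Here you would call pd.read_csv(url), build total_df as in the
    # question, and finish with total_df.to_csv(full_path, index=False).
    print(url, '->', full_path)
```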

Manipulating the values of each file in a folder using a dictionary and loop

How do I go about manipulating each file of a folder based on values pulled from a dictionary? Basically, say I have x files in a folder. I use pandas to reformat the dataframe, add a column which includes the date of the report, and save the new file under the same name and the date.
import pandas as pd
from pathlib import Path
import os
source = Path("Users/Yay/AlotofFiles/April")
items = os.listdir(source)
d_dates = {'0401': '04/1/2019', '0402': '4/2/2019', '0403': '04/03/2019'}
for item in items:
    for key, value in d_dates.items():
        df = pd.read_excel(item, header=None)
        df.columns = ['A', 'B','C']
        df[df['A'].str.contains("Awesome")]
        df['Date'] = value
        file_basic = "retrofile"
        short_date = key
        xlsx = ".xlsx"
        file_name = file_basic + short_date + xlsx
        df.to_excel(file_name)
I want each file to be unique and categorized by date. In this case, I would want three files, for example "retrofile0401.xlsx", which has a column containing "04/01/2019" and only the data from its own original file.
What actually happens is that the loop takes each item, creates three files with those values, moves on to the next item, and overwrites the files from the previous iteration, until I am left with three files that are all copies of the last file. The only differences are the file names and the date column. The names and dates are what I want, but the data is duplicated from the last file.
If I remove the second loop, it works the way I want, but then there's no way of categorizing the files based on the values in the dictionary.

Try the following. I'm only making input filenames explicit to make clear what's going on. You can continue to use yours from the source.
input_filenames = [
    'retrofile0401_raw.xlsx',
    'retrofile0402_raw.xlsx',
    'retrofile0403_raw.xlsx',
]
date_dict = {
    '0401': '04/1/2019',
    '0402': '4/2/2019',
    '0403': '04/03/2019',
}
for filename in input_filenames:
    date_key = filename[9:13]  # e.g. '0401'
    df = pd.read_excel(filename, header=None)
    df.columns = ['A', 'B', 'C']  # name the columns, as in the question
    df = df[df['A'].str.contains("Awesome")].copy()  # keep only the matching rows
    df['Date'] = date_dict[date_key]
    df.to_excel('retrofile{date_key}.xlsx'.format(date_key=date_key))
filename[9:13] takes the four characters at zero-based indices 9 through 12 of the filename, which are the ones that correspond to your date codes.
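If the file names might not always have the date code at exactly that position, a regular expression is a little more forgiving than a fixed slice. This is only a sketch under the assumption that the code is always four digits right after "retrofile"; the helper name date_key_from_name is made up for the example.

```python
import re

date_dict = {
    '0401': '04/1/2019',
    '0402': '4/2/2019',
    '0403': '04/03/2019',
}


def date_key_from_name(filename):
    """Return the four-digit date code that follows 'retrofile', or None."""
    match = re.search(r'retrofile(\d{4})', filename)
    return match.group(1) if match else None


for name in ['retrofile0401_raw.xlsx', 'retrofile0403_raw.xlsx', 'notes.xlsx']:
    key = date_key_from_name(name)
    if key in date_dict:
        # One output file per input file, named after its own date code.
        print(name, '->', f'retrofile{key}.xlsx', 'with date', date_dict[key])
    else:
        print(name, '-> skipped (no known date code)')
```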

Find the latest file for each calendar month in a folder

The code below works as I need it to, but I feel like there must be a better way. I have a folder with daily(ish) files inside of it. All of them have the same prefix followed by the date they were sent as the file name. On certain days, no file was sent at all. My task is to read the last file of each month (most of the time it is the last day, but April's last file was the 28th, July's was the 29th, etc.).
This is using the pathlib module, which I would like to continue using.
files = sorted(ROOT.glob('**/*.csv*'))
file_dates = [Path(file.stem).stem.replace('prefix_', '').split('_') for file in files] #replace everything but a list of the date elements
dates = [pd.to_datetime(date[0] + '-' + date[1] + '-' + date[2]) for date in file_dates] #construct the proper date format
x = pd.DataFrame(dates)
x['month'] = x[0].dt.strftime('%Y-%m') + '-01'
max_value = x.groupby(['month'])[0].max().reset_index()
max_value[0] = max_value[0].dt.strftime('%Y_%m_%d')
monthly_files = [str(ROOT / 'prefix_') + date + '.csv.xz' for date in max_value[0].values]
df = pd.concat([pd.read_csv(file, usecols=columns, sep='\t', compression='xz', dtype=object) for file in monthly_files])
I believe this is a case where, because I have a hammer (pandas), everything looks like a nail (I turn everything into a dataframe). I am also trying to get used to list comprehensions after several years of not using them.

There's probably better, but here's my try:
files = sorted(ROOT.glob('**/*.csv*'))
file_dates = [Path(file.stem).stem.replace('prefix_', '').split('_') for file in files] #replace everything but a list of the date elements
df = pd.DataFrame(file_dates, columns=['y', 'm', 'd'], dtype='int')
monthly = [str(yy)+'-'+str(mm)+'-'+str(df.loc[(df['y'] == yy) & (df['m'] == mm), 'd'].max()) for yy in df.y.unique() for mm in df.m.unique()]

So the file names would be prefix_<date> and the date is in format %Y-%m-%d.
import os
from datetime import datetime as dt
from collections import defaultdict
from pathlib import Path
group_by_month = defaultdict(list)
files = []
# Assuming the folder is the data folder path itself.
for file in Path(folder).iterdir():
    # file is a Path object, so test its name rather than the object itself
    if os.path.isfile(file) and file.name.startswith('prefix_'):
        # Convert the string date to a datetime object
        converted_dt = dt.strptime(file.name.split('prefix_')[1],
                                   '%Y-%m-%d')
        # Group the dates by (year, month) so different years stay separate
        group_by_month[(converted_dt.year, converted_dt.month)].append(converted_dt)
# Get the max of all the dates stored.
max_dates = {month: max(group_by_month[month])
             for month in group_by_month.keys()}
# Get the files that match the prefix and the max dates
for file in Path(folder).iterdir():
    for date in max_dates.values():
        if ('prefix_' + dt.strftime(date, '%Y-%m-%d')) in str(file):
            files.append(file)
PS: I haven't worked with pandas a lot. So, went with the native style to get the files that match the max date of a month.

To my knowledge this is going to be difficult to do with a list comprehension, since you have to compare the current element with the next element.
However, there are simpler solutions that will get you there without pandas.
The example below just loops over a list of date strings and keeps the last date before the month changes. Since your list is sorted, that should do the trick. I am assuming dates in YYYY_MM_DD format.
files = sorted(ROOT.glob('**/*.csv*'))
file_dates = [Path(file.stem).stem.replace('prefix_', '') for file in files]
#adding a dummy date because we're comparing to the next element
file_dates.append('0000_00_00')
result = []
for i, j in enumerate(file_dates[:-1]):
    # compare the 'YYYY_MM' part of this date with the next one
    if j[:7] != file_dates[i+1][:7]:
        result.append(j)
monthly_files = [str(ROOT / 'prefix_') + date + '.csv.xz' for date in result]
df = pd.concat([pd.read_csv(file, usecols=columns, sep='\t', compression='xz', dtype=object) for file in monthly_files])
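Another option that does not depend on the list being sorted is to keep one entry per month in a dictionary. This is a sketch only, assuming names of the form prefix_YYYY_MM_DD.csv.xz as in the question; the function name last_file_per_month is made up for the example.

```python
from pathlib import Path


def last_file_per_month(paths, prefix='prefix_'):
    """Map 'YYYY_MM' to the latest path, assuming prefix_YYYY_MM_DD.csv.xz names."""
    latest = {}
    for path in paths:
        # Drop every extension (.csv.xz) and the prefix to leave 'YYYY_MM_DD'.
        date_str = path.name.split('.')[0].replace(prefix, '')
        month_key = date_str[:7]
        # Zero-padded dates compare correctly as plain strings.
        if month_key not in latest or date_str > latest[month_key][0]:
            latest[month_key] = (date_str, path)
    return {month: path for month, (date_str, path) in sorted(latest.items())}


paths = [Path('prefix_2020_04_27.csv.xz'),
         Path('prefix_2020_05_31.csv.xz'),
         Path('prefix_2020_04_28.csv.xz')]
for month, path in last_file_per_month(paths).items():
    print(month, path.name)
```

With the real folder you would pass ROOT.glob('**/*.csv*') instead of the hard-coded list and then read the returned paths with pd.read_csv as before.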

Python: Batch rename files in a directory using a predefined list, sorting by date created

I am downloading a number of PDF documents from an online repository, but they are not coming through with the proper naming conventions. The files align with a list of names that I have located in an Excel spreadsheet.
What I would like to do is import the Excel spreadsheet, assign the names to a variable, and then use os.rename() to rename the files I have downloaded as a batch in order to match my list.
When I download the PDFs, each is given a randomly generated name rather than one based on the URL, and the names are regenerated each time the link is chosen. This creates a problem because I cannot sort the documents into the proper order for naming.
What I would like to do is sort the documents by "date created". By using sleep() I have the documents downloaded in the correct order, matching the instrument numbers, but I cannot figure out how to line them up properly to iterate through the names I would like to change.
Here is a sample of my code:
#Import packages
import pandas as pd
from selenium import webdriver
import os
from time import sleep
#Designate file locations / destinations
file = '/Users/username/Desktop/test.xlsx'
directory = '/Users/username/Downloads'
#Obtain instrument names
xl = pd.ExcelFile(file)
df1 = xl.parse('Sheet1', parse_cols=[2], names=['instrument'])
names = df1.instrument
prefix = xyz
#Obtain file location
imported_files = os.listdir(directory)
imported_files.remove('.DS_Store')
df1['importedFiles'] = imported_files
print(df1)
    instrument     importedFiles
0  146169-1975   2461030_123.PDF
1  147235-1975  2461030_2027.PDF
2  148367-1975   2461030_348.PDF
3  149563-1975  2461030_5327.PDF
4  171413-1977   2461030_555.PDF
5  186305-1977  2461030_5969.PDF
6  186726-1977  2461030_7610.PDF
7  186727-1978  2461030_7878.PDF
8  187748-1978  2461030_8733.PDF
#Set working directory
os.chdir('/Users/username/Downloads')
#Set a loop to rename
for x, y in zip(names, os.listdir('/Users/username/Downloads')):
    file_name, file_ext = os.path.splitext(y)
    new_names = ('{}_{}{}'.format(prefix, x, file_ext))
    print(new_names)
    os.rename(y, new_names)
    sleep(0.5)
When I print "new_names" the order of the names come out correctly in my console. However, when I take the next step to actually rename the files, the renaming doesn't work because of the randomly generated names coming from the imported files.
How can I make sure that the file names change in the same order that they are coming in? Or how can I change the order of the files so that when I rename them, they match the instrument strings coming in?
Thank you!

I was able to find an answer to my own question! In order to rename the files based on the instrument numbers I was pulling in from the Excel spreadsheet, I first had to reorganize the files I was downloading, which were being given random names.
I followed this video https://www.youtube.com/watch?v=hZP3y-gxyJg and used os.path.getatime on each file in my directory to get a timestamp, and then used a renaming loop to name the files after it. Strictly speaking, os.path.getatime returns the last access time rather than the creation time, so this relies on the files not having been opened since they were downloaded. This organized the files the way that I wanted, and I was able to rename them in the order I wanted. Here is the code I used:
import os
import time
from time import sleep
iterfiles = iter(os.listdir('/Users/username/Desktop'))
next(iterfiles)  # skip the first directory entry
for file_time in iterfiles:
    time_stamp = os.path.getatime(file_time)
    local_time = time.ctime(time_stamp)
    ext = 'PDF'
    print(local_time)
    time_name = ('{}.{}'.format(local_time, ext))
    os.rename(file_time, time_name)
    sleep(0.5)
#--------RENAME FILES BASED ON NAME----------#
iterinstrument = iter(os.listdir('/Users/username/Desktop'))
next(iterinstrument)  # skip the first directory entry
for x, y in zip(instrument_numbers, iterinstrument):
    file_name, file_ext = os.path.splitext(y)
    number, year = x.split('-')
    number = number.zfill(7)
    new_names = ('{}-{}{}{}'.format(county, year, number, file_ext))
    print(new_names)
    os.rename(y, new_names)
    sleep(0.5)
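For comparison, the two passes above can probably be collapsed into one by sorting the directory listing on a timestamp and zipping it with the names. The sketch below is not the code I actually ran: it assumes the modification time (os.path.getmtime) reflects the download order, the helper name plan_renames is made up, and the demonstration uses a temporary folder with invented timestamps so that it is safe to run.

```python
import os
import tempfile


def plan_renames(directory, names, ext='.PDF'):
    """Pair files (oldest modification time first) with the given names."""
    # Filtering on the extension also skips entries such as .DS_Store.
    files = [f for f in os.listdir(directory)
             if f.upper().endswith(ext.upper())]
    files.sort(key=lambda f: os.path.getmtime(os.path.join(directory, f)))
    return [(old, f'{new}{ext}') for old, new in zip(files, names)]


# Demonstration in a temporary folder with explicit, made-up timestamps.
with tempfile.TemporaryDirectory() as folder:
    for i, name in enumerate(['2461030_555.PDF', '2461030_123.PDF']):
        path = os.path.join(folder, name)
        open(path, 'w').close()
        os.utime(path, (1_000_000 + i, 1_000_000 + i))
    plan = plan_renames(folder, ['146169-1975', '147235-1975'])
    for old, new in plan:
        os.rename(os.path.join(folder, old), os.path.join(folder, new))
    print(plan)
```

Building the full list of (old, new) pairs before renaming anything makes it easy to print and check the plan first.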

Python - Read and plot only the latest files

I have multiple csv files (one generated per day) with a generic filename prefix (say file_), to which I append date stamps.
For example: file_2015_10_19, file_2015_10_18, and so on.
Now, I only want to read the 5 latest files and create a comparison plot.
Plotting is no issue for me, but sorting all the files and reading only the latest 5 is.

You need to list all the files and then sort them. There isn't a shortcut, I'm afraid.
You can sort them by last modified time, or parse the date component of each name and sort by that date:
import glob
import os
import datetime
file_mask = 'file_*'
ts = 'file_%Y_%m_%d'
path_to_files = r'/foo/bar/zoo/'
def get_date_from_file(s):
    # strip the directory and any extension so only 'file_YYYY_MM_DD' is parsed
    name = os.path.splitext(os.path.basename(s))[0]
    return datetime.datetime.strptime(name, ts)
all_files = glob.glob(os.path.join(path_to_files, file_mask))
sorted_files = sorted(all_files, key=lambda x: os.path.getmtime(x))[-5:]
sorted_by_date = sorted(all_files, key=get_date_from_file)[-5:]

import os
# list all files in the directory - returns a list of files
files = os.listdir('.')
# sort the list in reverse order
files.sort(reverse=True)
# the top 5 items in the list are the files you need
sorted_files = files[:5]
Hope this helps!
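If the folder may also contain files that do not follow the file_YYYY_MM_DD pattern, a slightly more defensive variant is sketched below. It assumes the date stamp is the whole file name apart from an optional extension; the function name latest_files is made up for the example.

```python
from datetime import datetime
from pathlib import Path


def latest_files(names, n=5, pattern='file_%Y_%m_%d'):
    """Return the n newest names, judged by the date stamp in each name."""
    dated = []
    for name in names:
        try:
            # Path(...).stem drops the directory and a trailing extension.
            dated.append((datetime.strptime(Path(name).stem, pattern), name))
        except ValueError:
            continue  # skip names that do not match the pattern
    dated.sort(reverse=True)  # newest first
    return [name for _, name in dated[:n]]


names = ['file_2015_10_17.csv', 'notes.txt',
         'file_2015_10_19.csv', 'file_2015_10_18.csv']
print(latest_files(names, n=2))  # ['file_2015_10_19.csv', 'file_2015_10_18.csv']
```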
