Python - Read and plot only the latest files

I have multiple CSV files (one generated per day) with a generic filename (say file_) to which I append date stamps.
For example: file_2015_10_19, file_2015_10_18, and so on.
Now, I only want to read the 5 latest files and create a comparison plot.
Plotting is no issue for me, but sorting all the files and reading only the latest 5 is.

You need to list all the files and then sort them; there isn't a shortcut, I'm afraid.
You can sort them by last-modified time, or parse the date component out of the filename and sort by that:
import glob
import os
import datetime

file_mask = 'file_*'
ts = 'file_%Y_%m_%d'
path_to_files = r'/foo/bar/zoo/'

def get_date_from_file(s):
    # parse the date stamp out of the filename, ignoring the directory part
    return datetime.datetime.strptime(os.path.basename(s), ts)

all_files = glob.glob(os.path.join(path_to_files, file_mask))

# option 1: sort by last-modified time and keep the 5 newest
sorted_files = sorted(all_files, key=os.path.getmtime)[-5:]

# option 2: sort by the date embedded in the filename
sorted_by_date = sorted(all_files, key=get_date_from_file)[-5:]
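With either list in hand, the comparison plot can be built by reading the five latest files into DataFrames. Below is a minimal sketch, not part of the answer above: pandas/matplotlib and the column name 'value' are assumptions for illustration.
import os
import pandas as pd
import matplotlib.pyplot as plt

# continues from the block above: sorted_by_date holds the 5 latest paths
for path in sorted_by_date:
    df = pd.read_csv(path)
    # 'value' is a hypothetical column name - substitute the column you actually plot
    plt.plot(df['value'], label=os.path.basename(path))
plt.legend()
plt.show()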

import os

# list all files in the directory - returns a list of names
files = os.listdir('.')

# sort in reverse: the zero-padded date stamps make lexicographic
# order match chronological order, so the newest files come first
files.sort(reverse=True)

# the top 5 items in the list are the files you need
sorted_files = files[:5]
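Note that os.listdir returns every entry in the directory, so if other files live alongside the file_* ones you may want to filter first; a small sketch using fnmatch (the filter is my addition, not part of the answer above):
import fnmatch
import os

# keep only names matching the file_* stamp convention before sorting
files = [f for f in os.listdir('.') if fnmatch.fnmatch(f, 'file_*')]
files.sort(reverse=True)
sorted_files = files[:5]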
Hope this helps!

Related

How to use the items of a list with glob.glob to get the files?

There are various CSV files in the directory with dates in their names (somename_20210202.csv, xyzname_20210305.csv, etc.). I would like to read the files for the date range given below. From the list of those dates, I created a list of file patterns. I then want to use each pattern with glob.glob to get the files, but globbed_files returns an empty list. My code is correct up to pattern_list. Please suggest where the problem is.
from datetime import timedelta, date
import pandas as pd
import numpy as np
import glob
import os
import datetime as dt

def daterange(date1, date2):
    for n in range(int((date2 - date1).days) + 1):
        yield date1 + timedelta(n)

start_dt = date(2020, 1, 15)  # note: leading zeros, as in date(2020,01,15), are a SyntaxError in Python 3
end_dt = date(2020, 2, 10)
abc = []
weekdays = [5, 6]
for d in daterange(start_dt, end_dt):  # renamed from dt to avoid shadowing the datetime alias
    if d.weekday() not in weekdays:
        abc.append(d.strftime("%d-%b-%Y"))
        # print(d.strftime("%d-%b-%Y"))
print(abc)

dir = r"C:\User\Folder"
pattern_list = []
for dates in abc:
    pattern = f'*_{dates}.csv'  # use wildcards (*)
    pattern_list.append(pattern)
print(pattern_list)

for x in pattern_list:
    globbed_files = glob.glob(os.path.join(dir, x))
    print(globbed_files)
Your pattern construction is correct, and so is the way you fetch files with the glob module. Just make sure .csv files matching those patterns actually exist in your directory ('C:\User\Folder'); otherwise globbed_files will be an empty list.
For example:
Create two or more files such as a_20210202.csv, b_20210202.csv and c_20210305.csv,
and set your pattern date list to match:
pattern_list = ['*_20210202.csv', '*_20210305.csv']
then:
dir = "your created files folder path"
for x in pattern_list:
    globbed_files = glob.glob(os.path.join(dir, x))
    print(globbed_files)
Note: this example is for demo purposes only; it creates the files statically and then fetches them.
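One thing worth double-checking against the question's filenames (somename_20210202.csv): the date strings there are built with strftime("%d-%b-%Y"), which yields 15-Jan-2020 and will never match a _20210202-style suffix. A small sketch that builds the stamps in %Y%m%d form instead, reusing the question's daterange and variables (the format change is a suggestion, not something the answer above states):
# build patterns that match somename_20210202.csv style names
abc = []
for d in daterange(start_dt, end_dt):
    if d.weekday() not in weekdays:
        abc.append(d.strftime("%Y%m%d"))  # 20200115, not 15-Jan-2020

pattern_list = [f'*_{stamp}.csv' for stamp in abc]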

Comparing multiple CSV file names and grouping them accordingly

I have, for example, 4 CSV files. I have many other files following the naming convention below, plus some files that don't have 'kd' in their name. I want to select the files with 'kd' and do the following:
kd_2020_2.csv
kd_2020_2_modified.csv
kd_2021_2.csv
kd_2021_2_modified.csv
pp_2012_2.csv
I want to group the two files that have the same name except for the 'modified' portion and then read those files and do some comparison (therefore, kd_2020_2.csv and kd_2020_2_modified.csv will be grouped together and so on).
So far, I got
import pandas as pd
import os
import glob
import difflib

os.chdir('C:\\New_folder')
FileList = glob.glob('*.csv')
print(FileList)
files = [f for f in FileList if 'kd' in f]
file_name = [files[i].split('.')[0] for i in range(len(files))]
for i in range(len(file_name)):
    if difflib.ndiff(file_name[i], file_name[i+1]) == 'modified':
        df[i] = pd.read_csv(FileList[i])
        df[i+1] = pd.read_csv(FileList[i+1])
It goes out of range since I am doing (i+1). Also, this is not what I intend to do: I want to compare each file name with all the other file names and read only the two files whose names match except for the 'modified' part. Thank you for your help.
You can use itertools.groupby to create groups based on the first 9 characters of the filenames:
from itertools import groupby
file_groups = [list(i) for j, i in groupby(sorted(FileList), lambda a: a[:9])]  # groupby only groups consecutive items, so sort first
This will output a list of pairs:
[['kd_2020_2.csv', 'kd_2020_2_modified.csv'], ['kd_2021_2.csv', 'kd_2021_2_modified.csv'], ['pp_2012_2.csv']]
You can then iterate the list and load the pairs and process them:
for i in file_groups:
    df1 = pd.read_csv(i[0])
    df2 = pd.read_csv(i[1])
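Since a group can hold a single file (pp_2012_2.csv above), guarding the pair access avoids an IndexError; a small sketch with that guard (the guard is my addition):
for group in file_groups:
    if len(group) < 2:
        continue  # no 'modified' counterpart to compare against
    df1 = pd.read_csv(group[0])
    df2 = pd.read_csv(group[1])
    # ... compare df1 and df2 here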

How to iterate over a folder, but only retrieve the newest versions of files?

I have a folder that is updated daily with a new version of each file, following this naming scheme: ['AA_06182020', 'AA_06202020', 'BTT_06182020', 'BTT_06202020', 'DC_06182020', 'DC_06202020', 'HOO_06182020', 'HOO_06202020']. The 06182020 in the file name is the date of the file (mmddyyyy); the more recent dates are, obviously, the newer versions of the file. Right now I have a script (which runs daily) that iterates over every file in the folder, but I want only the newest version of each file to be used. So far I've been able to retrieve a list of all the files, parse the date portion of each name into a datetime object, and append that to a new list. I'm unsure how to proceed from here: how do I sort the list by date and select only the newest version of each file for further processing?
from pathlib import Path
import pandas as pd
import re
from datetime import datetime

me_data = r"Path To Folder"
pathlist = Path(me_data).glob('**/*.xlsx')
fyl = []
new_fyls = []
for path in pathlist:
    # path is a Path object, not a string
    path_in_str = str(path)
    fyl.append(path.stem)
for entry in fyl:
    typ, date1 = entry.split('_')
    dt = datetime.strptime(date1, '%m%d%Y')
    new_fyls.append((entry, dt))
I suggest you modify your second loop a bit, using a dictionary. Key it on the filename prefix typ so only one date is kept per file (plus the filename, for convenience). When you encounter a new date in the loop, you compare it with the one previously stored for that file and keep the more recent one.
files = {}  # the dictionary
for entry in fyl:
    typ, date1 = entry.split('_')
    dt = datetime.strptime(date1, '%m%d%Y')
    if typ not in files or files[typ][0] < dt:  # datetime supports comparison
        files[typ] = (dt, entry)
In the if statement, typ not in files fires the first time you encounter a file in the loop, while the other condition checks whether the stored entry needs updating.
Lastly, to get the most recent file names, take all the stored values and keep the second element of each tuple:
new_fyls = [row[1] for row in files.values()]
This produces ['AA_06202020', 'BTT_06202020', 'DC_06202020', 'HOO_06202020'] with your example.
You could try sorting using a lambda function, like this:
from datetime import datetime
files = ['AA_06182020', 'AA_06202020', 'BTT_06182020', 'BTT_06202020', 'DC_06182020', 'DC_06202020', 'HOO_06182020', 'HOO_06202020']
sorted_files = sorted(files, key=lambda x: datetime.strptime(x.split('_')[1], '%m%d%Y'), reverse=True)
This will produce a sorted files list with the newest files first (according to your naming convention).
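If you then want just the newest entry per file prefix from that sorted list, keeping the first occurrence of each prefix works; a short sketch building on sorted_files (my addition, not part of the answer above):
newest = {}
for name in sorted_files:  # newest first after the reverse sort
    prefix = name.split('_')[0]
    newest.setdefault(prefix, name)  # the first hit per prefix is the newest
print(list(newest.values()))  # ['AA_06202020', 'BTT_06202020', 'DC_06202020', 'HOO_06202020']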

How to read multiple files and find the difference between the files?

I have multiple CSV files that I want to compare. The file contents are the same except for some additional changes, and I want to list those additional changes.
For eg:
files = ['1.csv', '2.csv', '3.csv']
I want to compare 1.csv and 2.csv, get the difference and store it somewhere, then compare 2.csv and 3.csv and store that diff somewhere.
for dirs in glob.glob(INPUT_PATH + "*"):
    if os.path.isdir(dirs):
        for files in glob.glob(dirs + '*/' + '/*.csv'):
            # this lists all the csv files, but how do I read them to get the difference?
You can use pandas to read each CSV into a DataFrame and collect them in a list, then compare them from that list:
import pandas as pd
dfList = []
dfList.append(pd.read_csv('FilePath'))
dfList[0] contains the content of the first CSV file, and so on.
So, to compare the first and second CSVs, you compare dfList[0] and dfList[1].
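A minimal sketch of that pairwise comparison, assuming pandas 1.1+ for DataFrame.compare and that consecutive files share identical columns and row labels (compare raises an error otherwise):
import glob
import pandas as pd

dfList = [pd.read_csv(f) for f in sorted(glob.glob('*.csv'))]

# compare each file with the next one and store the differences
for i in range(len(dfList) - 1):
    diff = dfList[i].compare(dfList[i + 1])
    diff.to_csv(f'diff_{i}_{i + 1}.csv')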
The first function compares 2 files, and the second function creates an additional file with the difference between the 2 files.
def compare(file_compared, file_master):
    """
    A = [100,200,300]
    B = [400,500,100]
    compare(A,B) = [200,300]
    """
    file_compared_list = []
    file_master_list = []
    with open(file_compared, 'r') as fc:
        for line in fc:
            file_compared_list.append(line.strip())
    with open(file_master, 'r') as fm:
        for line in fm:
            file_master_list.append(line.strip())
    return list(set(file_compared_list) - set(file_master_list))

def create_file(filename):
    diff = compare("file1.csv", "file2.csv")
    with open(filename, 'w') as f:
        for element in diff:
            f.write(element + '\n')  # add a newline so rows don't run together

create_file("test.csv")

Problems with CSV files in Python

Suppose I have a directory in which many .csv files are present, and I have Python code that can read only one CSV file, run some algorithm, and store the output in another CSV file. Now I need to update that Python code so that it checks the directory and stores the output for all the CSV files inside it in separate CSV files.
import pandas as pd
import statistics as st
import csv

data = pd.read_csv('1mb.csv')
x_or = list(range(len(data['Main Avg Power (mW)'])))
y_or = list(data['Main Avg Power (mW)'])
time = list(data['Time (s)'])
rt = 5000
i = time[rt]
k = i
tlist = []
for i in time:
    tlist.append(y_or[rt])
    rt += 1
    if i - k > 4:
        break
idp = st.mean(tlist)
sidp = st.stdev(tlist)
newlist = []
imax = max(tlist)
imin = min(min(tlist), idp - sidp)
while imax >= y_or[rt] >= imin - 1:
    newlist.append(y_or[rt])
    rt += 1
print(rt, "Mean idle power:", st.mean(newlist), "mW")
midp = st.mean(newlist)
with open('new_1pp.csv', 'w', newline='') as f:
    thewriter = csv.writer(f)
    thewriter.writerow(['Idle Power(mW)'])
    thewriter.writerow([midp])
This is the code I have done; please update it as required by the problem.
In your code you can use glob to list all the CSV files in a directory, then read them in one at a time, pass each through whatever algorithm you have, and then output a new file for each, e.g.
import glob
import os

# set the name of the directory you want to list the files in
csvdir = 'my_directory'

# get a list of all CSV files, assuming they have a '.csv' suffix
csvfiles = glob.glob(os.path.join(csvdir, '*.csv'))

# loop over all the files and run your algorithm
for csvfile in csvfiles:
    # read the csvfile using your current code
    # apply your algorithm
    # output a new file (e.g. with the same name as before, but with '_new' added)
    newfile = os.path.splitext(csvfile)[0] + '_new.csv'  # note: rstrip('.csv') would strip any trailing 'c'/'s'/'v'/'.' characters, not the suffix
    # save to 'newfile' using your current code
Does that help?
Update:
From the comments and the updated question, does the following code help:
import pandas as pd
import statistics as st
import csv
import glob
import os

# get the list of CSV files from the current directory
csvfiles = glob.glob('*.csv')

for csvfile in csvfiles:
    data = pd.read_csv(csvfile)
    x_or = list(range(len(data['Main Avg Power (mW)'])))
    y_or = list(data['Main Avg Power (mW)'])
    time = list(data['Time (s)'])
    rt = 5000
    i = time[rt]
    k = i
    tlist = []
    for i in time:
        tlist.append(y_or[rt])
        rt += 1
        if i - k > 4:
            break
    idp = st.mean(tlist)
    sidp = st.stdev(tlist)
    newlist = []
    imax = max(tlist)
    imin = min(min(tlist), idp - sidp)
    while imax >= y_or[rt] >= imin - 1:
        newlist.append(y_or[rt])
        rt += 1
    print(rt, "Mean idle power:", st.mean(newlist), "mW")
    midp = st.mean(newlist)
    # create the new file name using the old one and adding '_new'
    newfile = os.path.splitext(csvfile)[0] + '_new.csv'
    with open(newfile, 'w', newline='') as f:
        thewriter = csv.writer(f)
        thewriter.writerow(['Idle Power(mW)'])
        thewriter.writerow([midp])
