I have, for example, the following csv files. There are many other files with the same naming convention, including some that don't have 'kd' in their name. I want to select the files with 'kd' and do the following:
kd_2020_2.csv
kd_2020_2_modified.csv
kd_2021_2.csv
kd_2021_2_modified.csv
pp_2012_2.csv
I want to group the two files that have the same name except for the 'modified' portion and then read those files and do some comparison (therefore, kd_2020_2.csv and kd_2020_2_modified.csv will be grouped together and so on).
So far, I got
import pandas as pd
import os
import glob
import difflib
os.chdir('C:\\New_folder')
FileList = glob.glob('*.csv')
print(FileList)
files=[f for f in FileList if 'kd' in f]
file_name =[files[i].split('.')[0] for i in range(len(files))]
for i in range(len(file_name)):
    if difflib.ndiff(file_name[i], file_name[i+1]) == 'modified':
        df[i] = pd.read_csv(FileList[i])
        df[i+1] = pd.read_csv(FileList[i+1])
It goes out of range because of the (i+1) indexing. Also, this is not what I intend to do: I want to compare each file name with all the other file names and read only the two files whose names match except for the 'modified' part. Thank you for your help.
You can use itertools.groupby to create groups based on the first 9 characters of the filenames (groupby only groups consecutive items, so sort the list first if glob didn't return it sorted):
from itertools import groupby
file_groups = [list(i) for j, i in groupby(FileList, lambda a: a[:9])]
This will output a list of pairs:
[['kd_2020_2.csv', 'kd_2020_2_modified.csv'], ['kd_2021_2.csv', 'kd_2021_2_modified.csv'], ['pp_2012_2.csv']]
You can then iterate over the list, load each pair, and process it (skipping any group that isn't a pair, such as pp_2012_2.csv):
for pair in file_groups:
    if len(pair) == 2:
        df1 = pd.read_csv(pair[0])
        df2 = pd.read_csv(pair[1])
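If the shared prefix isn't always exactly 9 characters, a more robust variant is to key the groups on the filename with the _modified suffix stripped. A minimal sketch (the strip_modified helper is my own naming, not from the question):
import glob
import pandas as pd
from collections import defaultdict

def strip_modified(name):
    # 'kd_2020_2.csv' and 'kd_2020_2_modified.csv' both map to 'kd_2020_2'
    return name.rsplit('.', 1)[0].replace('_modified', '')

groups = defaultdict(list)
for f in glob.glob('*.csv'):
    if 'kd' in f:
        groups[strip_modified(f)].append(f)

for base, pair in groups.items():
    if len(pair) == 2:
        original, modified = sorted(pair, key=len)  # the shorter name is the unmodified file
        df_orig = pd.read_csv(original)
        df_mod = pd.read_csv(modified)
        # ... compare df_orig and df_mod here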
Related
Assume using glob, we read a folder which contains several csv files such as:
/C/share\AA_12345_1.csv
/C/share\AA_12345_2.csv
/C/share\AA_12345_3.csv
/C/share\BB_13_1.csv
/C/share\BB_13_2.csv
The goal is to append the csv files by filename group. For example, append
/C/share\AA_12345_1.csv
...
/C/share\AA_12345_3.csv
as one dataframe for /C/share\AA_12345, and
/C/share\BB_13_1.csv
/C/share\BB_13_2.csv
as one dataframe for /C/share\BB_13.
My current approach is using
res = [list(i) for j, i in groupby(lof,
lambda a:a.partition('\.*_\d*_?[0-9]_*(?=_)')[0])]
to get groups like [['/C/share\AA_12345_1.csv', '/C/share\AA_12345_2.csv', '/C/share\AA_12345_3.csv'], ['/C/share\BB_13_1.csv', '/C/share\BB_13_2.csv']]
and then, for each group, read the csvs and append them.
However, the result is still one big list [/C/share\AA_12345_1.csv, ..., /C/share\BB_13_2.csv]
Any idea/pointer on how to move forward?
Many thanks in advance!
Here's a different approach to group the similar files. Note that str.partition treats its argument as a literal string, not a regular expression, so the partition-based key was never what you expected; re.search does the actual pattern matching here (and since itertools.groupby only groups consecutive items, sort the list first):
import re
import itertools

key_func = lambda x: re.search(r"\w*\D_\d*", x).group()

d = sorted(d)  # d contains the list of csv files; groupby needs them sorted by key
res = [list(j) for i, j in itertools.groupby(d, key_func)]
for group in res:
    print(group)
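To then get one dataframe per filename group, you can read and concatenate within each group. A minimal sketch, reusing res and key_func from above:
import pandas as pd

group_dfs = {}
for group in res:
    key = key_func(group[0])  # e.g. 'AA_12345' or 'BB_13'
    group_dfs[key] = pd.concat((pd.read_csv(p) for p in group), ignore_index=True)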
I need to read 8 datasets:
df01, df02, df03, df04, df05, df06, df07, df08
This is my current approach:
#Set up file paths
filepath_2_df01 = "folderX/df01.csv"
filepath_2_df02 = "folderX/df02.csv"
filepath_2_df03 = "folderX/df03.csv"
filepath_2_df04 = "folderX/df04.csv"
filepath_2_df05 = "folderY/df05.csv"
filepath_2_df06 = "folderY/df06.csv"
filepath_2_df07 = "folderY/df07.csv"
filepath_2_df08 = "folderY/df08.csv"
#Read files
df01= pd.read_csv(filepath_2_df01)
df02= pd.read_csv(filepath_2_df02)
df03= pd.read_csv(filepath_2_df03)
df04= pd.read_csv(filepath_2_df04)
df05= pd.read_csv(filepath_2_df05)
df06= pd.read_csv(filepath_2_df06)
df07= pd.read_csv(filepath_2_df07)
df08= pd.read_csv(filepath_2_df08)
Is there a more concise way of doing that?
You can use glob for this:
import glob
dfs = []
for file in glob.glob('folderX/*.csv'):  # repeat with 'folderY/*.csv' to cover the second folder
    dfs.append(pd.read_csv(file))
for df in dfs:
    print(df)
use glob
from glob import glob
filenames = glob('/**/df*.csv', recursive=True)  # get list of filenames starting with df; recursive=True makes ** search subfolders as well (folderX and folderY, in your case)
dataframes = [pd.read_csv(f) for f in filenames] # using list comprehension, get a list of dataframes.
master_df = pd.concat(dataframes) #Concatenate list of dataframes to get a master dataframe
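If you still want eight separate dataframes rather than one concatenated frame, a dict keyed by the filename stem is a concise alternative. A sketch, assuming the folderX/folderY layout from the question:
import os
from glob import glob
import pandas as pd

paths = glob('folderX/*.csv') + glob('folderY/*.csv')
dfs = {os.path.splitext(os.path.basename(p))[0]: pd.read_csv(p) for p in paths}
# access them as dfs['df01'] ... dfs['df08']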
I have multiple CSV files, I want to compare them. The file contents are the same except for some additional changes, and I want to list those additional changes.
For eg:
files = ['1.csv', '2.csv', '3.csv']
I want to compare 1.csv and 2.csv, get the difference and store somewhere, then compare 2.csv and 3.csv, store the diff somewhere.
for dirs in glob.glob(INPUT_PATH + "*"):
    if os.path.isdir(dirs):
        for files in glob.glob(dirs + '*/' + '/*.csv'):
            # this lists all the csv files, but how do I read them to get the difference?
You can use pandas to read each csv into a dataframe and collect them in a list, then compare them from that list:
import pandas as pd
dfList = []
dfList.append(pd.read_csv('FilePath'))
dfList[0] contains the content of the first csv file, and so on.
So, to compare the first and second csv files, compare dfList[0] and dfList[1].
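A minimal sketch of the pairwise comparison itself, assuming the files share the same columns (merging with indicator=True flags rows that appear in only one of the two frames):
import glob
import pandas as pd

paths = sorted(glob.glob('*.csv'))  # e.g. ['1.csv', '2.csv', '3.csv']
for a, b in zip(paths, paths[1:]):
    merged = pd.read_csv(a).merge(pd.read_csv(b), how='outer', indicator=True)
    diff = merged[merged['_merge'] != 'both']  # rows unique to one of the two files
    diff.to_csv('diff_' + a[:-4] + '_vs_' + b[:-4] + '.csv', index=False)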
The first function compares two files, and the second function creates an additional file containing the difference between them.
def compare(file_compared, file_master):
    """
    A = [100,200,300]
    B = [400,500,100]
    compare(A,B) = [200,300]
    """
    file_compared_list = []
    file_master_list = []
    with open(file_compared, 'r') as fc:
        for line in fc:
            file_compared_list.append(line.strip())
    with open(file_master, 'r') as fm:
        for line in fm:
            file_master_list.append(line.strip())
    # lines present in file_compared but absent from file_master
    return list(set(file_compared_list) - set(file_master_list))

def create_file(filename):
    diff = compare("file1.csv", "file2.csv")
    with open(filename, 'w') as f:
        for element in diff:
            f.write(element + '\n')  # add the newline, otherwise the lines run together

create_file("test.csv")
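One caveat worth noting: the set difference ignores line order and collapses duplicate lines, so it only reports which distinct lines are new. If order or duplicates matter, difflib.unified_diff is a line-order-aware alternative.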
I have excel files named like "name1 01.01.2018.xlsx", "name2 01.01.2018.xlsx", "name2 12.23.2019.xlsx", and so on. I want to join all files with matching dates (the last 10 characters of the name, ignoring the .xlsx extension).
I can join all of them by doing:
import glob
import os
import pandas as pd
os.chdir('filepath')
files = [pd.read_excel(p, skipfooter=1) for p in glob.glob("*.xlsx")]
df = files[0].drop(files[0].tail(0).index).append([files[i].drop(files[i].tail(0).index) for i in range(1,len(files))])
How can I join only when the last characters match? I don't have a list of options for the last 10 characters, I want it to update automatically.
Well, first off, we need to reformat your code a bit. While the line to join the Dataframes is correct, it's very difficult to read and can be accomplished more efficiently:
import glob
import os
import pandas as pd
os.chdir('filepath')
files = [pd.read_excel(p, skipfooter=1) for p in glob.glob("*.xlsx")]
# drop the tail of all files
files = [f.drop(f.tail(0).index) for f in files]
# join all files
df = files[0].append(files[1:])
Then, we need to update it a bit so that you can check the filename of the files you loaded, and have some way to tie them back to the Dataframe you created.
import glob
import os
import pandas as pd
os.chdir('filepath')
# store the 10-character date stamp from each filename (e.g. '01.01.2018');
# note that p[-10:] would capture '.2018.xlsx', extension included, so slice the extension off
files = [(p[-15:-5], pd.read_excel(p, skipfooter=1)) for p in glob.glob("*.xlsx")]
# drop the tail of all files
files = [(p, f.drop(f.tail(0).index)) for p, f in files]
# group files by the date extracted from the filename
files = {p: [g for n, g in files if n == p] for p in set(p for p, f in files)}
# join all files with the same date
for key, value in files.items():
    files[key] = value[0].append(value[1:])
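Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on a recent pandas, the final loop would use pd.concat instead:
# join all files with the same date (pandas >= 1.4 style)
for key, value in files.items():
    files[key] = pd.concat(value, ignore_index=True)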
I have multiple csv files (one generated per day) with a generic filename (say file_) to which I append date stamps.
For example: file_2015_10_19, file_2015_10_18, and so on.
Now, I only want to read the 5 latest files and create a comparison plot.
For me plotting is no issue but sorting all the files and reading only the latest 5 is.
You need to list all the files and then sort them; there isn't a shortcut, I'm afraid.
You can sort them by last-modified time, or parse the date component of the filename and sort by that:
import glob
import os
import datetime

file_mask = 'file_*'
ts = 'file_%Y_%m_%d'
path_to_files = r'/foo/bar/zoo/'

def get_date_from_file(s):
    # parse the date from the filename only; strptime would fail on the full path
    return datetime.datetime.strptime(os.path.basename(s), ts)

all_files = glob.glob(os.path.join(path_to_files, file_mask))

# option 1: the 5 most recently modified files
sorted_files = sorted(all_files, key=os.path.getmtime)[-5:]
# option 2: the 5 latest files by the date embedded in the name
sorted_by_date = sorted(all_files, key=get_date_from_file)[-5:]
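From there, reading the five latest files for the comparison plot is straightforward. A sketch, assuming the files are plain csv:
import pandas as pd

latest_dfs = [pd.read_csv(f) for f in sorted_by_date]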
import os
# list all files in the directory - returns a list of files
files = os.listdir('.')
# sort the list in reverse order (newest first, since the zero-padded dates sort lexicographically)
files.sort(reverse=True)
# the first 5 items in the list are the files you need
sorted_files = files[:5]
Hope this helps!