I have Excel files named like "name1 01.01.2018.xlsx", "name2 01.01.2018.xlsx", "name2 12.23.2019.xlsx", and so on. I want to join all files with matching dates (the 10-character date stamp before the extension).
I can join all of them by doing:
import glob
import os
import pandas as pd
os.chdir('filepath')
files = [pd.read_excel(p, skipfooter=1) for p in glob.glob("*.xlsx")]
df = files[0].drop(files[0].tail(0).index).append([files[i].drop(files[i].tail(0).index) for i in range(1,len(files))])
How can I join only when the last characters match? I don't have a list of options for the last 10 characters, I want it to update automatically.
Well, first off, we need to reformat your code a bit. While the line that joins the DataFrames works, it's very difficult to read and can be written more efficiently:
import glob
import os
import pandas as pd
os.chdir('filepath')
files = [pd.read_excel(p, skipfooter=1) for p in glob.glob("*.xlsx")]
# drop the tail of all files (note: tail(0) selects no rows, so this drop is a
# no-op; skipfooter=1 in read_excel already skips the footer line)
files = [f.drop(f.tail(0).index) for f in files]
# join all files (DataFrame.append was removed in pandas 2.0; pd.concat does the same)
df = pd.concat(files, ignore_index=True)
Then, we need to update it a bit so that you can check the filename of the files you loaded, and have some way to tie them back to the Dataframe you created.
import glob
import os
import pandas as pd
os.chdir('filepath')
# store the 10-character date stamp from the original filename
# (p[-15:-5] skips the ".xlsx" extension; p[-10:] would include it)
files = [(p[-15:-5], pd.read_excel(p, skipfooter=1)) for p in glob.glob("*.xlsx")]
# drop the tail of all files
files = [(p, f.drop(f.tail(0).index)) for p, f in files]
# group files by last 10 characters of filename
files = {p: [g for n, g in files if n == p] for p in set(p for p, f in files)}
# join all files with same last 10 characters
for key, value in files.items():
    files[key] = pd.concat(value)
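The dictionary comprehension above rescans the file list once per key; as a side sketch (with hypothetical filenames), collections.defaultdict builds the same date-keyed groups in a single pass:

```python
from collections import defaultdict

# hypothetical filenames following the "name date.xlsx" pattern
filenames = ["name1 01.01.2018.xlsx", "name2 01.01.2018.xlsx",
             "name2 12.23.2019.xlsx"]

groups = defaultdict(list)
for p in filenames:
    date_key = p[-15:-5]  # the 10-character date, without the ".xlsx" extension
    groups[date_key].append(p)

print(dict(groups))
# {'01.01.2018': ['name1 01.01.2018.xlsx', 'name2 01.01.2018.xlsx'],
#  '12.23.2019': ['name2 12.23.2019.xlsx']}
```

In the real script each list value would hold the loaded DataFrames rather than names, but the keying logic is the same.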
I have a list of .csv files stored in a local folder and I'm trying to concatenate them into one single dataframe.
Here is the code I'm using :
import pandas as pd
import os
folder = r'C:\Users\_M92\Desktop\myFolder'
df = pd.concat([pd.read_csv(os.path.join(folder, f), delimiter=';') for f in os.listdir(folder)])
display(df)
Only one problem: it sometimes happens that one of the files is empty (0 columns, 0 rows), and in that case pandas throws an EmptyDataError: No columns to parse from file on line 6.
Do you have any suggestions for how to skip the empty csv files?
And, while we're at it, how to concatenate the csv files in a simpler or more efficient way?
Ideally, I would also like to add a column (to the dataframe df) to carry the name of each .csv.
You can check if a file is empty with:
import os
os.stat(FILE_PATH).st_size == 0
In your use case:
import os
df = pd.concat([
    pd.read_csv(os.path.join(folder, f), delimiter=';')
    for f in os.listdir(folder)
    if os.stat(os.path.join(folder, f)).st_size != 0
])
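As a quick sanity check of the st_size test (using a throwaway temporary file): a freshly created file really does report a size of 0. Note that a CSV containing only a header row is not size 0, so this check only catches truly empty files:

```python
import os
import tempfile

# create an empty temporary file (hypothetical stand-in for an empty csv)
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    empty_path = tmp.name

size_is_zero = os.stat(empty_path).st_size == 0
print(size_is_zero)  # True
os.remove(empty_path)
```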
Personally, I would skip the files with no content as they are read, merging the rest with a basic try/except:
import pandas as pd
import os
folder = r'C:\Users\_M92\Desktop\myFolder'
data = []
for f in os.listdir(folder):
    try:
        temp = pd.read_csv(os.path.join(folder, f), delimiter=';')
        # adding original filename column as per request
        temp['origin'] = f
        data.append(temp)
    except pd.errors.EmptyDataError:
        continue
df = pd.concat(data)
display(df)
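For reference, the exception being caught here can be reproduced without any files at all by feeding pandas an empty buffer (a minimal sketch, assuming a standard pandas install):

```python
import io
import pandas as pd

# a zero-byte input triggers the same error an empty csv file does
try:
    pd.read_csv(io.StringIO(""))
    caught = False
except pd.errors.EmptyDataError:
    caught = True

print(caught)  # True
```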
I have, for example, 4 csv files following the naming convention below, along with many other files that don't have 'kd' in their names. I want to select the files with 'kd' and do the following:
kd_2020_2.csv
kd_2020_2_modified.csv
kd_2021_2.csv
kd_2021_2_modified.csv
pp_2012_2.csv
I want to group the two files whose names match except for the 'modified' suffix, then read those files and do some comparison (so kd_2020_2.csv and kd_2020_2_modified.csv will be grouped together, and so on).
So far, I got
import pandas as pd
import os
import glob
import difflib
os.chdir('C:\\New_folder')
FileList = glob.glob('*.csv')
print(FileList)
files=[f for f in FileList if 'kd' in f]
file_name =[files[i].split('.')[0] for i in range(len(files))]
for i in range(len(file_name)):
    if difflib.ndiff(file_name[i], file_name[i+1]) == 'modified':
        df[i] = pd.read_csv(FileList[i])
        df[i+1] = pd.read_csv(FileList[i+1])
It is going out of range since I am doing (i+1). Also, this is not what I intend to do. I want to compare each file name with all the other file names and read only the two files with matching name except the part 'modified'. Thank you for your help.
You can use itertools.groupby to create groups based on the first 9 characters of the filenames (note that groupby only groups consecutive items, so sort the list first):
from itertools import groupby
file_groups = [list(g) for _, g in groupby(sorted(FileList), lambda a: a[:9])]
This will output a list of groups:
[['kd_2020_2.csv', 'kd_2020_2_modified.csv'], ['kd_2021_2.csv', 'kd_2021_2_modified.csv'], ['pp_2012_2.csv']]
You can then iterate the list and load the pairs and process them:
for i in file_groups:
    df1 = pd.read_csv(i[0])
    df2 = pd.read_csv(i[1])
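One caveat: files without a 'modified' partner (like pp_2012_2.csv above) produce single-element groups, so indexing the second item would raise an IndexError. A small sketch of a guard, using the example output above as data:

```python
# groups as produced by groupby on the example filenames
file_groups = [['kd_2020_2.csv', 'kd_2020_2_modified.csv'],
               ['kd_2021_2.csv', 'kd_2021_2_modified.csv'],
               ['pp_2012_2.csv']]

# keep only genuine original/modified pairs before reading them
pairs = [g for g in file_groups if len(g) == 2]
print(pairs)
# [['kd_2020_2.csv', 'kd_2020_2_modified.csv'],
#  ['kd_2021_2.csv', 'kd_2021_2_modified.csv']]
```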
I need to read 8 datasets:
df01, df02, df03, df04, df05, df06, df07, df08
This is my current approach:
#Set up file paths
filepath_2_df01= "folderX/df01.csv"
filepath_2_df02= "folderX/df02.csv"
filepath_2_df03= "folderX/df03.csv"
filepath_2_df04= "folderX/df04.csv"
filepath_2_df05= "folderY/df05.csv"
filepath_2_df06= "folderY/df06.csv"
filepath_2_df07= "folderY/df07.csv"
filepath_2_df08= "folderY/df08.csv"
#Read files
df01= pd.read_csv(filepath_2_df01)
df02= pd.read_csv(filepath_2_df02)
df03= pd.read_csv(filepath_2_df03)
df04= pd.read_csv(filepath_2_df04)
df05= pd.read_csv(filepath_2_df05)
df06= pd.read_csv(filepath_2_df06)
df07= pd.read_csv(filepath_2_df07)
df08= pd.read_csv(filepath_2_df08)
Is there a more concise way of doing that?
You can use glob for this:
import glob
dfs = []
for file in glob.glob('folderX/*.csv'):
    dfs.append(pd.read_csv(file))
for df in dfs:
    print(df)
Use glob:
from glob import glob
import pandas as pd
# get the list of filenames starting with df in a specific folder; with
# recursive=True, ** searches subfolders as well (folderX and folderY, in your case)
filenames = glob('/**/df*.csv', recursive=True)
# using a list comprehension, get a list of dataframes
dataframes = [pd.read_csv(f) for f in filenames]
# concatenate the list of dataframes to get a master dataframe
master_df = pd.concat(dataframes)
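Note that ** only descends into nested subfolders when recursive=True is passed; without it, ** behaves like a single *. A self-contained check in a throwaway directory (hypothetical layout):

```python
import os
import tempfile
from glob import glob

# build root/folderX/sub/df01.csv so the file sits two levels deep
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "folderX", "sub"))
open(os.path.join(root, "folderX", "sub", "df01.csv"), "w").close()

pattern = os.path.join(root, "**", "df*.csv")
non_recursive = glob(pattern)                  # ** acts like *, one level only
recursive = glob(pattern, recursive=True)      # ** now matches any depth
print(len(non_recursive), len(recursive))      # 0 1
```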
I want to perform an analysis on 24 Excel docs: 12 containing one side of the story, and the other 12 the other side. I managed to load them into Python, but when I try to get them into two separate dataframes, Python combines them back into one.
This is for a Windows server using Python 3.7.
import pandas as pd
import os
path = os.getcwd()
files = os.listdir(path)
files
files_car = [f for f in files if f.startswith("CAR")]
files_car
for f in files_car:
    data1 = pd.read_excel(f)
    car = car.append(data1)
path = os.getcwd()
files2 = os.listdir(path)
files2
files_ean = [f for f in files2 if f.startswith("ELEK")]
files_ean
ean = pd.DataFrame()
for x in files_ean:
    data2 = pd.read_excel(f)
    ean = ean.append(data2)
I expected that files_car would contain the 12 files that start with "CAR", and that files_ean would contain the 12 files that start with "ELEK".
This should do what you want, based on your comment:
import pandas as pd
import os
path = os.getcwd()
files = os.listdir(path)
car_dfs = [pd.read_excel(f) for f in files if f.startswith("CAR")]
ean_dfs = [pd.read_excel(f) for f in files if f.startswith("ELEK")]
car_df = pd.concat(car_dfs)
ean_df = pd.concat(ean_dfs)
Couple of points:
you don't need to recreate files by calling the same command (os.listdir(path)) again unless path has changed
you should not append to a DataFrame in a loop like you're doing, since this creates a copy every time you call append. It's better to build a list of DataFrames and then concatenate that list into one big DataFrame.
You could shorten this even more by just doing:
import pandas as pd
import os
path = os.getcwd()
files = os.listdir(path)
car_df = pd.concat([pd.read_excel(f) for f in files if f.startswith("CAR")])
ean_df = pd.concat([pd.read_excel(f) for f in files if f.startswith("ELEK")])
unless you have some need for the individual file DataFrames
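The "list of DataFrames, then one concat" pattern from the points above can be sketched with small in-memory frames standing in for the Excel files:

```python
import pandas as pd

# build the frames in a list first...
parts = [pd.DataFrame({"value": [i]}) for i in range(3)]

# ...then concatenate once at the end, instead of appending inside the loop
combined = pd.concat(parts, ignore_index=True)
print(list(combined["value"]))  # [0, 1, 2]
```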
I have multiple csv files (each file generated per day) with generic filename (say file_) and I append date-stamps to them.
For example: file_2015_10_19, file_2015_10_18and so on.
Now, I only want to read the 5 latest files and create a comparison plot.
For me plotting is no issue but sorting all the files and reading only the latest 5 is.
You need to list all the files and then sort them; there isn't a shortcut, I'm afraid.
You can sort them by last-modified time, or parse the date component out of each name and sort by that:
import glob
import os
import datetime
file_mask = 'file_*'
ts = 'file_%Y_%m_%d'
path_to_files = r'/foo/bar/zoo/'
def get_date_from_file(s):
    # parse the date from the basename so the directory part doesn't break strptime
    return datetime.datetime.strptime(os.path.basename(s), ts)
all_files = glob.glob(os.path.join(path_to_files, file_mask))
sorted_files = sorted(all_files, key=lambda x: os.path.getmtime(x))[-5:]
sorted_by_date = sorted(all_files, key=get_date_from_file)[-5:]
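To make the date-parsing route concrete, here is the sort applied to some hypothetical bare filenames (no directory prefix, so strptime sees exactly the file_%Y_%m_%d pattern):

```python
from datetime import datetime

names = ["file_2015_10_14", "file_2015_10_19", "file_2015_10_15",
         "file_2015_10_18", "file_2015_10_16", "file_2015_10_17"]

# parse the date stamp out of each name and keep the 5 most recent
latest_5 = sorted(names, key=lambda s: datetime.strptime(s, "file_%Y_%m_%d"))[-5:]
print(latest_5)
# ['file_2015_10_15', 'file_2015_10_16', 'file_2015_10_17',
#  'file_2015_10_18', 'file_2015_10_19']
```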
import os
# list all files in the directory - returns a list of names
files = os.listdir('.')
# sort the list in reverse order (newest first for zero-padded date stamps)
files.sort(reverse=True)
# the first 5 items in the list are the files you need
sorted_files = files[:5]
Hope this helps!
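This works because zero-padded file_%Y_%m_%d stamps sort the same way lexicographically as chronologically; a quick check on hypothetical names:

```python
# zero-padded date stamps: string order matches date order
files = ["file_2015_10_18", "file_2015_09_30", "file_2015_10_19"]
files.sort(reverse=True)
print(files[:2])  # ['file_2015_10_19', 'file_2015_10_18']
```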