I have seen many questions about importing multiple CSV files into a pandas DataFrame.
My question is: how can you import multiple CSV files but ignore the last CSV file in your directory? I have had a hard time finding the answer to this.
Also, let's assume that the CSV file names are all different, which is why the glob pattern is "/*.csv".
Any resources would also be greatly appreciated. Thank you!
import glob
import pandas as pd

path = r'C:\DRO\DCL_rawdata_files'  # use your path
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    list_.append(df)
frame = pd.concat(list_)
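Try this. It sorts the files by modification time, so allFiles[:-1] leaves out the most recently modified file: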
import os
import glob
import pandas as pd
def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)
path =r'C:\DRO\DCL_rawdata_files' # use your path
fmask = os.path.join(path, '*.csv')
allFiles = sorted(glob.glob(fmask), key=os.path.getmtime)
frame = get_merged_csv(allFiles[:-1], index_col=None, header=0)
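What counts as the "last" file depends on the sort key. The sketch below isolates the file-selection step so that choice is explicit. The helper name `all_but_newest` and the demo file names are hypothetical, and the modification times are set by hand only to make the demo deterministic.

```python
import os
import tempfile
from pathlib import Path


def all_but_newest(folder, pattern="*.csv"):
    """Return matching files sorted oldest-first, minus the most recently modified one."""
    files = sorted(Path(folder).glob(pattern), key=os.path.getmtime)
    return files[:-1]


# Demo in a throwaway directory with three hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    for age, name in enumerate(["b.csv", "c.csv", "a.csv"]):
        path = Path(tmp, name)
        path.write_text("x\n1\n")
        # Set explicit modification times so the order is deterministic.
        os.utime(path, (1_000_000 + age, 1_000_000 + age))
    kept_names = [p.name for p in all_but_newest(tmp)]

print(kept_names)  # a.csv is the newest, so it is left out
```

If "last" should mean last alphabetically instead, dropping the `key=` argument sorts by path and the same slice removes the final name.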
I have an Excel file that needs to be refreshed automatically every week. It must be extended by other Excel files. The problem is that these files have different names each time.
So, in my opinion, I cannot use code like:
import pandas as pd
NG = 'NG.xlsx'
df = pd.read_excel(NG)
because the filename is not always "NG" like in this case.
Do you have any ideas?
Best Greetz
You could read all the files in your folder like this, which works regardless of what the files are named:
import glob
import pandas as pd

# get data file names
path = r"C:\.......\folder_with_excel"
filenames = glob.glob(path + "/*.xlsx")

# read 'Sheet1' from each workbook, then combine the results
frames = []
for filename in filenames:
    xl_file = pd.ExcelFile(filename)
    frames.append(xl_file.parse('Sheet1'))
DF = pd.concat(frames, ignore_index=True)
Alternatively:
import os
import pandas as pd
path = os.getcwd()
files = os.listdir(path)  # list all the files in your directory
files_xlsx = [f for f in files if f.endswith('.xlsx')]  # keep only the .xlsx files
frames = []
for f in files_xlsx:
    info = pd.read_excel(f, '<sheet name>')  # remove '<sheet name>' if you don't need it
    frames.append(info)
df = pd.concat(frames)  # DataFrame.append was removed in pandas 2.0
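One practical wrinkle when matching every .xlsx file in a folder: while a workbook is open, Excel keeps a hidden lock file next to it whose name starts with `~$`, and a plain `*.xlsx` pattern picks it up. A minimal sketch of filtering those out, assuming a hypothetical helper name `list_workbooks` and made-up file names:

```python
import tempfile
from pathlib import Path


def list_workbooks(folder):
    """Return .xlsx files in `folder`, skipping Excel's '~$' lock files."""
    return sorted(
        p for p in Path(folder).glob("*.xlsx")
        if not p.name.startswith("~$")
    )


# Demo with empty placeholder files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["NG.xlsx", "week_02.xlsx", "~$NG.xlsx", "notes.txt"]:
        Path(tmp, name).touch()
    found = [p.name for p in list_workbooks(tmp)]

print(found)
```

The resulting paths can be passed to pd.read_excel one by one, whatever the files happen to be called that week.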
I am getting the error "No objects to concatenate". I cannot import .csv files from a main folder and its subdirectories and concatenate them into one DataFrame. I am using pandas. Older answers did not help me, so please do not mark this as a duplicate.
The folder structure looks like this:
main/*.csv
main/name1/name1/*.csv
main/name1/name2/*.csv
main/name2/name1/*.csv
main/name3/*.csv
import pandas as pd
import os
import glob
folder_selected = 'C:/Users/jacob/Documents/csv_files'
This does not work:
frame = pd.concat(map(pd.read_csv, glob.iglob(os.path.join(folder_selected, "/*.csv"))))
This does not work either:
csv_paths = glob.glob('*.csv')
dfs = [pd.read_csv(folder_selected) for folder_selected in csv_paths]
df = pd.concat(dfs)
Nor does this:
all_files = []
all_files = glob.glob (folder_selected + "/*.csv")
file_path = []
for file in all_files:
    df = pd.read_csv(file, index_col=None, header=0)
    file_path.append(df)
frame = pd.concat(file_path, axis=0, ignore_index=False)
You need to search the subdirectories recursively.
folder = 'C:/Users/jacob/Documents/csv_files'
path = folder+"/**/*.csv"
Using glob.iglob
df = pd.concat(map(pd.read_csv, glob.iglob(path, recursive=True)))
Using glob.glob
csv_paths = glob.glob(path, recursive=True)
dfs = [pd.read_csv(csv_path) for csv_path in csv_paths]
df = pd.concat(dfs)
Using os.walk
import fnmatch

file_paths = []
for base, dirs, files in os.walk(folder):
    for file in fnmatch.filter(files, '*.csv'):
        file_paths.append(os.path.join(base, file))
df = pd.concat([pd.read_csv(file) for file in file_paths])
Using pathlib
from pathlib import Path
files = Path(folder).rglob('*.csv')
df = pd.concat(map(pd.read_csv, files))
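Since "No objects to concatenate" simply means that pd.concat received an empty sequence, it can help to check the file list before concatenating. The following is a rough sketch under that assumption. The function name `read_csv_tree` and the demo files are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_tree(folder):
    """Concatenate every CSV under `folder`, including subdirectories."""
    paths = sorted(Path(folder).rglob("*.csv"))
    if not paths:
        # pd.concat raises "No objects to concatenate" on an empty list,
        # so report the actual cause instead.
        raise FileNotFoundError(f"no CSV files found under {folder}")
    return pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)


# Demo: one CSV at the top level and one two levels down.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp, "name1", "name2")
    nested.mkdir(parents=True)
    Path(tmp, "top.csv").write_text("a,b\n1,2\n")
    (nested / "deep.csv").write_text("a,b\n3,4\n")
    df = read_csv_tree(tmp)

print(df)
```

If the error message names the folder, a wrong path or pattern is much easier to spot than from the pandas traceback alone.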
Check out the Dask library, which can read many files into a single DataFrame:
>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')
See the Dask documentation:
https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files
Python’s pathlib is well suited to such tasks:
from pathlib import Path
import pandas as pd

FOLDER_SELECTED = 'C:/Users/jacob/Documents/csv_files'
path = Path(FOLDER_SELECTED) / "main"
# grab all csvs in main and subfolders
# pass the full path f, not f.name, which is only the file name
df = pd.concat(pd.read_csv(f) for f in path.rglob("*.csv"))
Note:
If the CSVs need preprocessing, you can write your own reader function that handles the issues and use it in place of pd.read_csv.
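As an illustration of such a reader function, here is a minimal sketch. The name `read_clean` and the specific cleanup rule (trimming and lower-casing header names) are assumptions chosen for the example, not something the files in the question are known to need.

```python
import io

import pandas as pd


def read_clean(source):
    """Read one CSV and normalise its header names before concatenation."""
    df = pd.read_csv(source)
    # Hypothetical cleanup rule: strip stray spaces and unify case,
    # so that " A " and "a" end up in the same column.
    df.columns = df.columns.str.strip().str.lower()
    return df


# In-memory stand-ins for files; with real data, pass the paths
# yielded by path.rglob("*.csv") instead.
sources = [io.StringIO(" A ,B\n1,2\n"), io.StringIO("a, b\n3,4\n")]
combined = pd.concat([read_clean(s) for s in sources], ignore_index=True)

print(combined)
```

Without the cleanup step, the two inputs would produce four differently named columns instead of two aligned ones.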
What's the best way to create a pandas dataframe from one file with any file name located in a specified folder?
I have used pathlib and it's not quite working as the output dataframe is not giving me anything.
from pathlib import Path
import pandas as pd
pth = r'C:\Users\HP\Desktop\IBM\New folder'
fle = Path(pth).glob('*.tsv')
someDf = pd.DataFrame(fle)
someDf
Edit:
I also tried doing the below, but the output dataframe combines all columns into one column separated by a backward slash. How do I fix this?
from pathlib import Path
import pandas as pd
pth = r'C:\Users\HP\Desktop\IBM\New folder'
fle = Path(pth).glob('*.tsv')
dfs = []
for filename in fle:
    dfs.append(pd.read_csv(filename))
dfs1 = pd.concat(dfs)
dfs1.head()
The way I did this seems complicated. Is there an easier way to do this?
Please try:
import os
import pandas as pd
pth = r'C:\Users\HP\Desktop\IBM\New folder'
for file_ in os.listdir(pth):
    if file_.endswith('.tsv'):  # skip anything that is not a .tsv file
        h = os.path.join(pth, file_)
        # print(h)
        someDf = pd.read_csv(h, sep='\t')  # .tsv files are tab-separated
someDf
Try
import pandas as pd
from glob import glob
# raw string, so the backslashes are not treated as escape sequences
files = glob(r'C:\Users\HP\Desktop\IBM\New folder\*.tsv')
if len(files) == 1:
    dfs = pd.read_csv(files[0], sep='\t')
else:
    dfs = pd.concat([pd.read_csv(file, sep='\t') for file in files])
The solution I found is below. I had missed the sep parameter in pd.read_csv().
from pathlib import Path
import pandas as pd
pth = r'C:\Users\HP\Desktop\IBM\New folder'
fle = Path(pth).glob('*.tsv')
dfs = []
for filename in fle:
    dfs.append(pd.read_csv(filename, sep='\t'))
dfs1 = pd.concat(dfs)
dfs1.head()
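The effect of the missing sep parameter can be reproduced without any files. This small sketch uses made-up in-memory data to show how the same tab-separated text is parsed with and without `sep='\t'`:

```python
import io

import pandas as pd

# Hypothetical tab-separated content standing in for a .tsv file.
tsv_text = "id\tname\n1\tfoo\n2\tbar\n"

# The default separator is a comma, so each line becomes a single field.
wrong = pd.read_csv(io.StringIO(tsv_text))
# With sep="\t" the tabs are treated as column boundaries.
right = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(wrong.shape, list(wrong.columns))
print(right.shape, list(right.columns))
```

In the first result the single column name still contains the tab character, which is what a "\t" between the values on screen indicates.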
I am trying to batch-analyze a folder full of .csv files, then save them out again based on the .csv name. However, I'm having trouble extracting just the file name and assigning it to the dataframe (df).
import glob
import pandas as pd
path = r'csv_in'
allFiles = glob.glob(path + '/*.csv')
for file_ in allFiles:
    df = pd.read_csv(file_, header=0)
    df.name = file_
    print(df.name)
The print result I get is "csv_in/*.csv".
The result I'm looking for is just the csv name, "*.csv"
Create a new column with [] and use os.path.basename with os.path.normpath:
import os
for file_ in allFiles:
    df = pd.read_csv(file_, header=0)
    df['name'] = os.path.basename(os.path.normpath(file_))
    # to remove the extension (.csv) as well:
    # df['name'] = os.path.splitext(os.path.basename(file_))[0]
    print(df['name'])
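The same split is also available through pathlib, where `.name` gives the file name and `.stem` drops the extension. A short sketch with made-up paths follows. The pure path classes are used only so the demo behaves the same on every operating system, and in real code `Path(file_).name` would be the usual form.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical paths in both separator styles.
posix_path = PurePosixPath("csv_in/results_2018.csv")
windows_path = PureWindowsPath(r"csv_in\results_2018.csv")

print(posix_path.name)    # file name with extension
print(posix_path.stem)    # file name without extension
print(windows_path.name)  # same result with backslash separators
```

The stem is convenient for building the output file name when saving each processed DataFrame.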
I am using the code below to read a set of CSV files from a folder into a DataFrame. However, this folder contains a sub-folder alongside the CSV files. How can I skip the sub-folder and read only the CSV files? The code below throws an error when I run it on this folder.
import pandas as pd
import glob
import numpy as np
import os
import datetime
import time
path = r'/Users/user/desktop/Sales/'
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    list_.append(df)
sale_df = pd.concat(list_)
sale_df
Error message: IsADirectoryError: [Errno 21] Is a directory: '/Users/user/desktop/Sales/2018-05-03/20180503000513-kevin#store.com-190982.csv-1525305907670.csv'
Could anyone assist with this? Thanks.
EDIT: The issue is that the subdirectory's name itself ends in '.csv'.
EDIT in code
path =r'/Users/user/desktop/Sales/2018-05-03/'
files_only = [file for file in
              glob.glob('/Users/user/desktop/Sales/2018-05-03/*.csv') if not
              os.path.isdir(file)]
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(files_only,index_col=None, header=0)
    list_.append(df)
sale_df = pd.concat(list_)
sale_df['filename'] = os.path.basename(csv)
sale_df.append(frame)
sale_df
I get the error below:
ValueError: No objects to concatenate
Could you please assist. Thanks..
My suggestion uses glob.glob to get a list of all files and directories that match the pattern, then uses os.path.isdir to drop the directories. The result is a list of ONLY the files matched by glob.glob().
import glob
import os
files_only = [file for file in glob.glob('/path/to/files/*.ext') if not os.path.isdir(file)]
You can then use the files_only list in your read_csv loop.
So in your code:
files_only = [file for file in glob.glob('/Users/user/desktop/Sales/2018-05-03/*.csv') if not os.path.isdir(file)]
list_ = []
for file in files_only:
    df = pd.read_csv(file, index_col=None, header=0)
    df['filename'] = os.path.basename(file)  # record which file each row came from
    list_.append(df)
sale_df = pd.concat(list_)
sale_df
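The filtering step can be checked in isolation by recreating the situation from the question: a directory whose name ends in .csv sitting next to a real CSV file. This is a minimal sketch using pathlib's `is_file()` rather than `os.path.isdir`. The helper name `csv_files_only` and the file names are made up for the demo.

```python
import tempfile
from pathlib import Path


def csv_files_only(folder):
    """Return entries matching *.csv that are regular files, not directories."""
    return sorted(p for p in Path(folder).glob("*.csv") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sales.csv").write_text("a\n1\n")
    # A directory whose name ends in .csv, as in the question.
    Path(tmp, "190982.csv-1525305907670.csv").mkdir()
    names = [p.name for p in csv_files_only(tmp)]

print(names)
```

Only the real file survives the filter, so passing this list to pd.read_csv should no longer hit a directory.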
You call allFiles = glob.glob(path + "/*.csv") even though your path variable already ends with a forward slash. As a result, the pattern becomes glob.glob("/Users/user/desktop/Sales//*.csv"), with a doubled slash.
See if fixing that helps with your error.
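One way to avoid the doubled slash altogether is to let a path library join the parts instead of concatenating strings. The sketch below compares the two approaches on the path from the question. `PurePosixPath` is used only so the output is the same on every platform, and in real code `Path(base) / "*.csv"` or `os.path.join(base, "*.csv")` would normally be used.

```python
from pathlib import PurePosixPath

base = "/Users/user/desktop/Sales/"  # note the trailing slash

# Plain string concatenation keeps both slashes.
concatenated = base + "/*.csv"
# Joining through pathlib normalises the separator.
joined = str(PurePosixPath(base) / "*.csv")

print(concatenated)  # /Users/user/desktop/Sales//*.csv
print(joined)        # /Users/user/desktop/Sales/*.csv
```

The joined pattern is the same whether or not the base path ends with a slash.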