Python - Copying csv files to DataFrame (but skip sub-folders)

I am using the code below to read a set of csv files from a folder into a DataFrame. However, this folder contains a sub-folder alongside the csv files. How can I skip the sub-folder and read only the csv files? The code below throws an error when I run it on the folder that has a sub-folder.
import pandas as pd
import glob
import numpy as np
import os
import datetime
import time
path = r'/Users/user/desktop/Sales/'
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    list_.append(df)
sale_df = pd.concat(list_)
sale_df
Error message : IsADirectoryError: [Errno 21] Is a directory:
'/Users/user/desktop/Sales/2018-05-03/20180503000513-kevin#store.com-
190982.csv-1525305907670.csv'
Could anyone assist with this? Thanks.
EDIT: The issue is that the subdirectory's own name ends with '.csv'.
EDIT in code
path = r'/Users/user/desktop/Sales/2018-05-03/'
files_only = [file for file in
              glob.glob('/Users/user/desktop/Sales/2018-05-03/*.csv') if not
              os.path.isdir(file)]
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(files_only, index_col=None, header=0)
    list_.append(df)
sale_df = pd.concat(list_)
sale_df['filename'] = os.path.basename(csv)
sale_df.append(frame)
sale_df
This produces the error below:
ValueError: No objects to concatenate
Could you please assist? Thanks.

My suggestion uses glob.glob to get a list of all files/directories matching the specified pattern, then uses the os module to check each match and discard directories. It returns a list of ONLY the files that matched glob.glob().
import glob
import os
files_only = [file for file in glob.glob('/path/to/files/*.ext') if not os.path.isdir(file)]
You can then use the files_only list in your read_csv loop.
So in your code:
files_only = [file for file in glob.glob('/Users/user/desktop/Sales/2018-05-03/*.csv') if not os.path.isdir(file)]
list_ = []
for file in files_only:
    df = pd.read_csv(file, index_col=None, header=0)
    df['filename'] = os.path.basename(file)  # tag each row with its source file
    list_.append(df)
sale_df = pd.concat(list_, ignore_index=True)
sale_df

You call allFiles = glob.glob(path + "/*.csv") even though your path variable already ends with a forward slash, so the pattern ends up as allFiles = glob.glob("/Users/user/desktop/Sales//*.csv") with a doubled slash.
See if fixing that helps with your error.
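As a small illustrative sketch (the path is just an example), building the pattern with os.path.join sidesteps the doubled slash entirely, and an os.path.isfile check filters out directories whose names happen to end in .csv:

```python
import glob
import os

path = r'/Users/user/desktop/Sales'    # with or without a trailing slash
pattern = os.path.join(path, '*.csv')  # no doubled separator either way
# keep regular files only, dropping directories named like '*.csv'
all_files = [f for f in glob.glob(pattern) if os.path.isfile(f)]
```

The resulting all_files list can then be fed to the read_csv loop unchanged.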

Related

merge excel files with dynamic names

I have an Excel file that needs to be refreshed automatically every week. It must be extended by other Excel files. The problem is that these files have different names each time.
So in my opinion I cannot use code like:
import pandas as pd
NG = 'NG.xlsx'
df = pd.read_excel(NG)
because the filename is not always "NG" like in this case.
Do you have any ideas?
Best Greetz
You could read all the files in your folder by doing this, because it allows you to ignore name changes:
import glob
import pandas as pd
# get data file names
path = r"C:\.......\folder_with_excel"
filenames = glob.glob(path + "/*.xlsx")
dfs = []
for filename in filenames:
    xl_file = pd.ExcelFile(filename)
    dfs.append(xl_file.parse('Sheet1'))
DF = pd.concat(dfs, ignore_index=True)
Alternatively:
import os
import pandas as pd
path = os.getcwd()
files = os.listdir(path)  # list all the files in your directory
files_xls = [f for f in files if f.endswith('.xlsx')]  # make a list of the xlsx files
frames = []
for f in files_xls:
    info = pd.read_excel(f, '<sheet name>')  # remove '<sheet name>' if you don't need it
    frames.append(info)
df = pd.concat(frames, ignore_index=True)

How to import multiple csv files and concatenate into one DataFrame using pandas

I have the problem "No objects to concatenate". I cannot import .csv files from main and its subdirectories to concatenate them into one DataFrame. I am using pandas. Old answers did not help me, so please do not mark this as a duplicate.
Folder structure is like that
main/*.csv
main/name1/name1/*.csv
main/name1/name2/*.csv
main/name2/name1/*.csv
main/name3/*.csv
import pandas as pd
import os
import glob
folder_selected = 'C:/Users/jacob/Documents/csv_files'
Does not work:
frame = pd.concat(map(pd.read_csv, glob.iglob(os.path.join(folder_selected, "/*.csv"))))
Does not work:
csv_paths = glob.glob('*.csv')
dfs = [pd.read_csv(folder_selected) for folder_selected in csv_paths]
df = pd.concat(dfs)
Does not work:
all_files = []
all_files = glob.glob(folder_selected + "/*.csv")
file_path = []
for file in all_files:
    df = pd.read_csv(file, index_col=None, header=0)
    file_path.append(df)
frame = pd.concat(file_path, axis=0, ignore_index=False)
You need to search the subdirectories recursively.
folder = 'C:/Users/jacob/Documents/csv_files'
path = folder+"/**/*.csv"
Using glob.iglob
df = pd.concat(map(pd.read_csv, glob.iglob(path, recursive=True)))
Using glob.glob
csv_paths = glob.glob(path, recursive=True)
dfs = [pd.read_csv(csv_path) for csv_path in csv_paths]
df = pd.concat(dfs)
Using os.walk
import fnmatch
file_paths = []
for base, dirs, files in os.walk(folder):
    for file in fnmatch.filter(files, '*.csv'):
        file_paths.append(os.path.join(base, file))
df = pd.concat([pd.read_csv(file) for file in file_paths])
Using pathlib
from pathlib import Path
files = Path(folder).rglob('*.csv')
df = pd.concat(map(pd.read_csv, files))
Check the Dask library as follows; it reads many files into one df:
>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')
Read their docs
https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files
Python’s pathlib is a tool for such tasks
from pathlib import Path
FOLDER_SELECTED = 'C:/Users/jacob/Documents/csv_files'
path = Path(FOLDER_SELECTED) / Path("main")
# grab all csvs in main and subfolders
df = pd.concat(pd.read_csv(f) for f in path.rglob("*.csv"))
Note:
If the CSVs need preprocessing, you can create a read_csv function to deal with issues and use it in place of pd.read_csv.
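As a hypothetical sketch of that note (the cleanup steps are just examples of per-file quirks you might need to absorb), a small wrapper can be dropped in wherever pd.read_csv was used:

```python
import pandas as pd

def read_csv_clean(path):
    """Hypothetical preprocessing wrapper: fix per-file issues before concatenating."""
    df = pd.read_csv(path)
    df.columns = df.columns.str.strip()  # e.g. normalise stray header whitespace
    df['source_file'] = str(path)        # keep track of where each row came from
    return df

# drop-in replacement for the line above:
# df = pd.concat(read_csv_clean(f) for f in path.rglob("*.csv"))
```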

Concatenating multiple dataframes. Issue with datapaths

I want to concatenate several csv files which I saved in a directory ./Errormeasure. In order to do so, I used the following answer from another thread https://stackoverflow.com/a/51118604/9109556
filepaths =[f for f in listdir('./Errormeasure')if f.endswith('.csv')]
df=pd.concat(map(pd.read_csv,filepaths))
print(df)
However, this code only works when I have the csv files I want to concatenate both in the ./Errormeasure directory and in the directory below, ./venv. This, however, is obviously not convenient.
When I have the csv files only in ./Errormeasure, I receive the following error:
FileNotFoundError: [Errno 2] File b'errormeasure_871687110001543570.csv' does not exist: b'errormeasure_871687110001543570.csv'
Can you give me some tips to tackle this problem? I am using PyCharm.
Thanks in advance!
Using os.listdir() only retrieves file names, not their parent folders, which pandas.read_csv() needs at either the relative (where the pandas script resides) or absolute level.
Instead, consider the recursive feature of the built-in glob (only available in Python 3.5+) to return the full paths of all csv files at the top level and in subfolders.
import glob
for f in glob.glob(dirpath + "/**/*.csv", recursive=True):
    print(f)
From there, build the data frames in a list comprehension (bypassing map; see List comprehension vs map) to be concatenated with pd.concat:
df_files = [pd.read_csv(f) for f in glob.glob(dirpath + "/**/*.csv", recursive=True)]
df = pd.concat(df_files)
print(df)
For Python < 3.5, consider os.walk() + os.listdir() to retrieve full paths of csv files:
import os
import pandas as pd
# COMBINE CSVs IN CURR FOLDER + SUB FOLDERS
fpaths = [os.path.join(dirpath, f)
          for f in os.listdir(dirpath) if f.endswith('.csv')] + \
         [os.path.join(fdir, fld, f)
          for fdir, flds, ffile in os.walk(dirpath)
          for fld in flds
          for f in os.listdir(os.path.join(fdir, fld)) if f.endswith('.csv')]
df = pd.concat([pd.read_csv(f) for f in fpaths])
print(df)
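Alternatively, a minimal fix that keeps the question's os.listdir approach is simply to join the folder back onto each name before reading. A runnable sketch (the folder and sample file here are demo stand-ins for the real ./Errormeasure data):

```python
import os
import pandas as pd

# demo setup: a folder with one sample csv, standing in for the real data
dirpath = './Errormeasure'
os.makedirs(dirpath, exist_ok=True)
with open(os.path.join(dirpath, 'errormeasure_sample.csv'), 'w') as fh:
    fh.write('value\n1\n')

# the fix: re-attach the folder name that os.listdir() drops
filepaths = [os.path.join(dirpath, f)
             for f in os.listdir(dirpath) if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))
```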
import pandas as pd
import glob
path = r'C:\Directory' # use your path
files = glob.glob(path + "/*.csv")
list_ = []  # avoid shadowing the built-in list
for file in files:
    df = pd.read_csv(file, index_col=None, header=0)
    list_.append(df)
frame = pd.concat(list_, axis=0, ignore_index=True)
Maybe you need to use '\' instead of '/':
file = glob.glob(os.path.join('your\\path', '*.csv'))
print(file)
You can run the code above in a for loop.

Python Fetching Name Of All CSV Files From Path And Writing Each To Different Folder

I am trying to open all files from a folder, store them in a DataFrame, append each csv file with another csv file called Append.csv, and write all the files, with their names, to a different folder.
For example, I have 5 csv files saved in a folder called CSV FILES FOLDER. These files are F1.csv, F2.csv, F3.csv, F4.csv and F5.csv. I open each file using pandas in a for loop, append Append.csv, and store the result in a different folder called NEW CSV FILES FOLDER as:
F1_APPENDED.csv
F2_APPENDED.csv
F3_APPENDED.csv
F4_APPENDED.csv
In other words, _APPENDED is added to each file name and the file is saved under the new name.
I have already defined the path for this folder but can't save to it. The code is below:
import pandas as pd
import glob
import os.path
import pathlib
path = r'C:\Users\Ahmed Ismail Khalid\Desktop\CSV FILES FOLDER'
allFiles = glob.glob(path + "/*.csv")
path1 = r'C:\Users\Ahmed Ismail Khalid\Desktop\Different Folder\Bitcoin Prices Hourly Based.csv'
outpath = r'C:\Users\Ahmed Ismail Khalid\Desktop\NEW CSV FILES FOLDER'
for f in allFiles:
    file = open(f, 'r')
    df1 = pd.read_csv(path1)
    df2 = pd.read_csv(f)
    output = pd.merge(df1, df2, how="inner", on="created_at")
    df3 = output.created_at.value_counts().rename_axis('created_at').reset_index(name='count')
    df3 = df3.sort_values(by=['created_at'])
    #print(df3,'\n\n')
    df3.to_csv(outpath+f, encoding='utf-8', index=False)
    #print(f,'\n\n')
How can I do this? I tried to look up the official documentation but couldn't understand it.
Any and all help would be appreciated
Thanks
Here, I added a line in the for loop where you can get just the file name. You can use that instead of the full path to the file when you write the file and indicate the output .csv filename.
import pandas as pd
import glob
import os.path
import pathlib
path =r'C:\Users\Ahmed Ismail Khalid\Desktop\CSV FILES FOLDER'
allFiles = glob.glob(path + "/*.csv")
path1 = r'C:/Users/Ahmed Ismail Khalid/Desktop/Different Folder/Bitcoin Prices Hourly Based.csv'
# You need to have a slash at the end so it knows it's a folder
outpath = r'C:/Users/Ahmed Ismail Khalid/Desktop/NEW CSV FILES FOLDER/'
for f in allFiles:
    _, fname = os.path.split(f)      # get just the file name
    fname, ext = os.path.splitext(fname)
    df1 = pd.read_csv(path1)
    df2 = pd.read_csv(f)
    output = pd.merge(df1, df2, how="inner", on="created_at")
    df3 = output.created_at.value_counts().rename_axis('created_at').reset_index(name='count')
    df3 = df3.sort_values(by=['created_at'])
    #print(df3,'\n\n')
    df3.to_csv(outpath+fname+'_appended.csv', encoding='utf-8', index=False)
    #print(f,'\n\n')

python, pandas, csv import and more

I have seen many questions in regards to importing multiple csv files into a pandas dataframe.
My question is: how can I import multiple csv files but ignore the last csv file in the directory? I have had a hard time finding the answer to this.
Also, let's assume that the csv file names are all different, which is why the glob pattern is "/*.csv".
Any resource would also be greatly appreciated. Thank you!
path =r'C:\DRO\DCL_rawdata_files' # use your path
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    list_.append(df)
frame = pd.concat(list_)
Try this:
import os
import glob
import pandas as pd
def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)
path =r'C:\DRO\DCL_rawdata_files' # use your path
fmask = os.path.join(path, '*.csv')
allFiles = sorted(glob.glob(fmask), key=os.path.getmtime)  # oldest to newest by modification time
frame = get_merged_csv(allFiles[:-1], index_col=None, header=0)  # skip the most recent file
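If "last" means alphabetically last rather than most recently modified, the same slicing trick works with a plain sorted() on the paths. A runnable sketch (the folder and file names below are illustrative demo data, not the question's):

```python
import glob
import os
import pandas as pd

# demo setup: three small csv files with differing names
os.makedirs('demo_csvs', exist_ok=True)
for name in ('a.csv', 'b.csv', 'c.csv'):
    with open(os.path.join('demo_csvs', name), 'w') as fh:
        fh.write('x\n1\n')

# sort alphabetically, then drop the last entry before reading
all_files = sorted(glob.glob(os.path.join('demo_csvs', '*.csv')))
frame = pd.concat((pd.read_csv(f) for f in all_files[:-1]), ignore_index=True)
```

Here frame holds the rows from a.csv and b.csv only; c.csv, the alphabetically last file, is excluded.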
