Concatenating multiple dataframes: issue with data paths - Python

I want to concatenate several csv files which I saved in a directory ./Errormeasure. In order to do so, I used the following answer from another thread https://stackoverflow.com/a/51118604/9109556
from os import listdir
import pandas as pd

filepaths = [f for f in listdir('./Errormeasure') if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))
print(df)
However, this code only works when I have the csv files both in the ./Errormeasure directory and in the directory below it, ./venv. This is obviously not convenient.
When I have the csv files only in ./Errormeasure, I receive the following error:
FileNotFoundError: [Errno 2] File b'errormeasure_871687110001543570.csv' does not exist: b'errormeasure_871687110001543570.csv'
Can you give me some tips to tackle this problem? I am using PyCharm.
Thanks in advance!

os.listdir() retrieves only file names, not the parent folder that pandas.read_csv() needs in order to resolve each file at a relative (where the pandas script resides) or absolute level.
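A minimal sketch of that fix, assuming the CSVs live directly in ./Errormeasure: join the folder onto each bare file name before reading.
import os
import pandas as pd

folder = './Errormeasure'
# join the parent folder onto each bare name so read_csv can locate the files
filepaths = [os.path.join(folder, f) for f in os.listdir(folder) if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))
print(df)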
Alternatively, consider the recursive feature of the built-in glob module (available in Python 3.5+) to return full paths of all csv files at the top level and in subfolders.
import glob

for f in glob.glob(dirpath + "/**/*.csv", recursive=True):
    print(f)
From there, build the data frames in a list comprehension (rather than map; see "List comprehension vs map") to be concatenated with pd.concat:
df_files = [pd.read_csv(f) for f in glob.glob(dirpath + "/**/*.csv", recursive=True)]
df = pd.concat(df_files)
print(df)
For Python < 3.5, consider os.walk() + os.listdir() to retrieve full paths of csv files:
import os
import pandas as pd

# combine CSVs in the current folder and all subfolders
fpaths = [os.path.join(dirpath, f)
          for f in os.listdir(dirpath) if f.endswith('.csv')] + \
         [os.path.join(fdir, fld, f)
          for fdir, flds, ffiles in os.walk(dirpath)
          for fld in flds
          for f in os.listdir(os.path.join(fdir, fld)) if f.endswith('.csv')]

df = pd.concat([pd.read_csv(f) for f in fpaths])
print(df)

import pandas as pd
import glob

path = r'C:\Directory'  # use your path
files = glob.glob(path + "/*.csv")

dfs = []  # avoid naming a variable "list", which shadows the built-in
for file in files:
    df = pd.read_csv(file, index_col=None, header=0)
    dfs.append(df)
frame = pd.concat(dfs, axis=0, ignore_index=True)
Maybe you need to use '\' instead of '/'. With your_path as a placeholder for your folder, and a wildcard added to the pattern:
file = glob.glob(os.path.join(your_path, '*.csv'))
print(file)
You can then loop over the resulting paths.

Related

Sort a list of CSVs in a folder that contains 10k+ files, faster

Hi, I'm a newbie in Python and in coding in general; this is my very first post.
I am trying to open and concatenate the last 20 files into a dataframe.
I succeed when working with a test folder that contains only 100 files, but as soon as I try my code on the real folder, which contains 10k files, it is very slow and takes about 5 minutes to finish.
Here is my attempt:
import pandas as pd
import glob
from datetime import datetime
import numpy as np
import os

path = r'K:/industriel/abc/03_LOG/PRODUCTION/CSV/'
path2 = r'K:/industriel/abc/03_LOG/PRODUCTION/IMG/'
os.chdir(path)

files = glob.glob(path + "/*.csv")
#files = filter(os.path.isfile, os.listdir(path))
files = [os.path.join(path, f) for f in files]
files.sort(key=lambda x: os.path.getctime(x), reverse=False)

dfs = pd.DataFrame()
for i in range(20):
    dfs = dfs.append(pd.read_csv(files[i].split('\\')[-1], delimiter=';', usecols=[0,1,3,4,9,10,20]))
dfs = dfs.reset_index(drop=True)
print(dfs.head(10))
Try reading all the individual files into a list and then concatenate them to form your dataframe at the end:
files = [os.path.join(path, f) for f in os.listdir(path) if f.endswith(".csv")]
files.sort(key=lambda x: os.path.getctime(x), reverse=False)

dfs = list()
for file in files[:20]:
    dfs.append(pd.read_csv(file, delimiter=';', usecols=[0,1,3,4,9,10,20]))
dfs = pd.concat(dfs)
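If sorting 10k files by creation time is the bottleneck, one possible speed-up (a sketch under the same folder layout, not benchmarked here) is os.scandir, whose directory entries cache stat information so each file is not stat'ed separately:
import os
import pandas as pd

# DirEntry objects cache stat results, so sorting a huge folder
# avoids one extra os.stat call per file
with os.scandir(path) as entries:
    csvs = [e for e in entries if e.name.endswith('.csv')]
csvs.sort(key=lambda e: e.stat().st_ctime)

dfs = pd.concat(pd.read_csv(e.path, delimiter=';', usecols=[0,1,3,4,9,10,20])
                for e in csvs[:20])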
You can use pd.concat() with a list of the read files. You can replace your code after files.sort(...) with the following:
dfs = pd.concat([
    pd.read_csv(file, delimiter=';', usecols=[0,1,3,4,9,10,20])
    for file in files[:20]
])
dfs = dfs.reset_index(drop=True)
print(dfs.head(10))

How to import multiple csv files and concatenate into one DataFrame using pandas

I get the error "No objects to concatenate". I cannot import .csv files from main and its subdirectories to concatenate them into one DataFrame. I am using pandas. Old answers did not help me, so please do not mark this as a duplicate.
The folder structure is like this:
main/*.csv
main/name1/name1/*.csv
main/name1/name2/*.csv
main/name2/name1/*.csv
main/name3/*.csv
import pandas as pd
import os
import glob
folder_selected = 'C:/Users/jacob/Documents/csv_files'
This does not work:
frame = pd.concat(map(pd.read_csv, glob.iglob(os.path.join(folder_selected, "/*.csv"))))
This does not work either:
csv_paths = glob.glob('*.csv')
dfs = [pd.read_csv(folder_selected) for folder_selected in csv_paths]
df = pd.concat(dfs)
Nor does this:
all_files = []
all_files = glob.glob(folder_selected + "/*.csv")
file_path = []
for file in all_files:
    df = pd.read_csv(file, index_col=None, header=0)
    file_path.append(df)
frame = pd.concat(file_path, axis=0, ignore_index=False)
You need to search the subdirectories recursively.
folder = 'C:/Users/jacob/Documents/csv_files'
path = folder+"/**/*.csv"
Using glob.iglob
df = pd.concat(map(pd.read_csv, glob.iglob(path, recursive=True)))
Using glob.glob
csv_paths = glob.glob(path, recursive=True)
dfs = [pd.read_csv(csv_path) for csv_path in csv_paths]
df = pd.concat(dfs)
Using os.walk
import fnmatch

file_paths = []
for base, dirs, files in os.walk(folder):
    for file in fnmatch.filter(files, '*.csv'):
        file_paths.append(os.path.join(base, file))
df = pd.concat([pd.read_csv(file) for file in file_paths])
Using pathlib
from pathlib import Path
files = Path(folder).rglob('*.csv')
df = pd.concat(map(pd.read_csv, files))
Check out the Dask library, which reads many files into one df, as follows:
>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')
Read their docs
https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files
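Note that dd.read_csv is lazy and returns a Dask dataframe; call .compute() at the end if you need an ordinary pandas DataFrame (same data*.csv pattern as above):
import dask.dataframe as dd

# builds the task graph lazily, then materializes a single pandas DataFrame
df = dd.read_csv('data*.csv').compute()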
Python’s pathlib is a good tool for such tasks:
from pathlib import Path

FOLDER_SELECTED = 'C:/Users/jacob/Documents/csv_files'
path = Path(FOLDER_SELECTED) / "main"

# grab all csvs in main and subfolders; pass the full Path f, not just f.name
df = pd.concat(pd.read_csv(f) for f in path.rglob("*.csv"))
Note:
If the CSVs need preprocessing, you can create a custom read function to deal with the issues and use it in place of pd.read_csv.
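A sketch of such a wrapper (the semicolon delimiter, skipped row, and column clean-up are hypothetical examples, not taken from the question):
def read_clean_csv(csv_path):
    # hypothetical clean-ups: semicolon delimiter, one junk header row
    df = pd.read_csv(csv_path, sep=';', skiprows=1)
    df.columns = [c.strip().lower() for c in df.columns]
    return df

df = pd.concat(read_clean_csv(f) for f in path.rglob("*.csv"))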

Import multiple excel files and merge into single pandas df with source name as column

I'm trying to merge a bunch of xlsx files into a single pandas dataframe in python. Furthermore, I want to include a column that lists the source file for each row. My code is as follows:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import glob
import os
# get the path for where the xlsx files are
path = os.getcwd()
files = os.listdir(path)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']
# create new dataframe
df = pd.DataFrame()
# read data from files and add into dataframe
for f in files_xlsx:
    data = pd.read_excel(f, 'Sheet 1')
    df['Source_file'] = f
    df = df.append(data)
However, when I look at the 'Source_file' column, it lists the final file read as the name for every row. I've spent way more time than I should trying to fix this. What am I doing wrong?
Within your for loop you are overwriting df on each iteration, so you only get back the final file.
What you need to do is declare a list beforehand and append to that.
Since you imported glob, let's use that as well.
files = glob.glob(os.path.join(os.getcwd(), '*.xlsx'))
dfs = [pd.read_excel(f, sheet_name='Sheet1') for f in files]
df = pd.concat(dfs)
If you want to add the filename into the df too, then:
files = glob.glob(os.path.join(os.getcwd(), '*.xlsx'))
dfs = [pd.read_excel(f, sheet_name='Sheet1') for f in files]
file_names = [os.path.basename(f) for f in files]
df = pd.concat(dfs, keys=file_names)
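Note that keys= puts the file names into the index rather than a column; if you want an actual Source_file column as asked, one sketch is to move that index level into a column afterwards:
# the outer index level holds the file names; turn it into a regular column
df = df.reset_index(level=0).rename(columns={'level_0': 'Source_file'})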
Using the pathlib module (available in Python 3.4+):
from pathlib import Path
files = list(Path.cwd().glob('*.xlsx'))
dfs = [pd.read_excel(f, sheet_name='Sheet1') for f in files]
file_names = [f.stem for f in files]
df = pd.concat(dfs, keys=file_names)
Or as a one-liner:
df = pd.concat([pd.read_excel(f) for f in Path.cwd().glob('*.xlsx')],
               keys=[f.stem for f in Path.cwd().glob('*.xlsx')], sort=False)
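The one-liner scans the folder twice (once for the frames, once for the keys); a sketch that globs only once:
# pair each stem with its frame in a single pass over the folder
pairs = [(f.stem, pd.read_excel(f)) for f in Path.cwd().glob('*.xlsx')]
df = pd.concat([d for _, d in pairs], keys=[s for s, _ in pairs], sort=False)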

Is there a way to create two dataFrames of two types of xlsx docs from my cwd

I want to perform an analysis on 24 Excel docs: 12 containing one side of the story and the other 12 the other side. I managed to load them into Python, but when I try to get them into two separate dataframes, Python combines them back into one.
This is for a Windows server using Python 3.7.
import pandas as pd
import os

path = os.getcwd()
files = os.listdir(path)
files

files_car = [f for f in files if f.startswith("CAR")]
files_car
for f in files_car:
    data1 = pd.read_excel(f)
    car = car.append(data1)

path = os.getcwd()
files2 = os.listdir(path)
files2

files_ean = [f for f in files2 if f.startswith("ELEK")]
files_ean
ean = pd.DataFrame()
for x in files_ean:
    data2 = pd.read_excel(f)
    ean = ean.append(data2)
I expected that files_car would contain the 12 files that start with "CAR" and that files_ean would contain the 12 files that start with "ELEK".
This should do what you want, based on your comment:
import pandas as pd
import os

path = os.getcwd()
files = os.listdir(path)

car_dfs = [pd.read_excel(f) for f in files if f.startswith("CAR")]
ean_dfs = [pd.read_excel(f) for f in files if f.startswith("ELEK")]

car_df = pd.concat(car_dfs)
ean_df = pd.concat(ean_dfs)
A couple of points:
You don't need to recreate files if you're going to run the same command (i.e. os.listdir(path)), unless path were to change.
You should not append to a DataFrame in a loop like you're doing, since this creates a copy every time you call append. It's better to create a list of DataFrames and then concatenate that list into one big DataFrame.
You could shorten this even more by just doing:
import pandas as pd
import os
path = os.getcwd()
files = os.listdir(path)
car_df = pd.concat([pd.read_excel(f) for f in files if f.startswith("CAR")])
ean_df = pd.concat([pd.read_excel(f) for f in files if f.startswith("ELEK")])
unless you have some need for the individual file DataFrames

Python Fetching Name Of All CSV Files From Path And Writing Each To Different Folder

I am trying to open all the files in a folder, store each in a dataframe, append another csv file called Append.csv to each of them, and write all the files under their names to a different folder.
For example, I have 5 csv files saved in a folder called CSV FILES FOLDER. These files are F1.csv, F2.csv, F3.csv, F4.csv and F5.csv. What I am trying to do is open each file using pandas in a for loop, append Append.csv, and store the result in a different folder called NEW CSV FILES FOLDER as:
F1_APPENDED.csv
F2_APPENDED.csv
F3_APPENDED.csv
F4_APPENDED.csv
In other words, _APPENDED is added to each file name, and the file is saved under that new name. I have already defined the path for this folder but can't save to it. The code is below:
import pandas as pd
import glob
import os.path
import pathlib

path = r'C:\Users\Ahmed Ismail Khalid\Desktop\CSV FILES FOLDER'
allFiles = glob.glob(path + "/*.csv")
path1 = r'C:\Users\Ahmed Ismail Khalid\Desktop\Different Folder\Bitcoin Prices Hourly Based.csv'
outpath = r'C:\Users\Ahmed Ismail Khalid\Desktop\NEW CSV FILES FOLDER'

for f in allFiles:
    file = open(f, 'r')
    df1 = pd.read_csv(path1)
    df2 = pd.read_csv(f)
    output = pd.merge(df1, df2, how="inner", on="created_at")
    df3 = output.created_at.value_counts().rename_axis('created_at').reset_index(name='count')
    df3 = df3.sort_values(by=['created_at'])
    #print(df3,'\n\n')
    df3.to_csv(outpath+f, encoding='utf-8', index=False)
    #print(f,'\n\n')
How can I do this? I tried to look up the official documentation but couldn't understand anything
Any and all help would be appreciated
Thanks
Here, I added a line in the for loop where you can get just the file name. You can use that instead of the full path to the file when you write the file and indicate the output .csv filename.
import pandas as pd
import glob
import os.path
import pathlib

path = r'C:\Users\Ahmed Ismail Khalid\Desktop\CSV FILES FOLDER'
allFiles = glob.glob(path + "/*.csv")
path1 = r'C:/Users/Ahmed Ismail Khalid/Desktop/Different Folder/Bitcoin Prices Hourly Based.csv'
# You need to have a slash at the end so it knows it's a folder
outpath = r'C:/Users/Ahmed Ismail Khalid/Desktop/NEW CSV FILES FOLDER/'

for f in allFiles:
    # no need to open(f) manually; read_csv takes the path itself
    _, fname = os.path.split(f)
    fname, ext = os.path.splitext(fname)
    df1 = pd.read_csv(path1)
    df2 = pd.read_csv(f)
    output = pd.merge(df1, df2, how="inner", on="created_at")
    df3 = output.created_at.value_counts().rename_axis('created_at').reset_index(name='count')
    df3 = df3.sort_values(by=['created_at'])
    #print(df3,'\n\n')
    df3.to_csv(outpath + fname + '_APPENDED.csv', encoding='utf-8', index=False)
    #print(f,'\n\n')
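Since pathlib is already imported, here is a sketch of the same name handling with Path objects (the folders are the same hypothetical ones as above):
from pathlib import Path

out_dir = Path(r'C:/Users/Ahmed Ismail Khalid/Desktop/NEW CSV FILES FOLDER')
src = Path(r'C:/Users/Ahmed Ismail Khalid/Desktop/CSV FILES FOLDER/F1.csv')
# .stem drops the extension, so this builds .../NEW CSV FILES FOLDER/F1_APPENDED.csv
out_file = out_dir / (src.stem + '_APPENDED.csv')
print(out_file)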
