I have a folder with several Excel files (.xls and .xlsx) that I am trying to read and concatenate into one single DataFrame. The problem I am facing is that Python does not read the files in the folder in the correct order.
My folder contains the following files:
190.xls, 195.xls, 198.xls, 202.xlsx, 220.xlsx and so on
This is my code:
import pandas as pd
from pathlib import Path
my_path = 'my_Dataset/'
xls_files = pd.concat([pd.read_excel(f2) for f2 in Path(my_path).rglob('*.xls')], sort=False)
xlsx_files = pd.concat([pd.read_excel(f1) for f1 in Path(my_path).rglob('*.xlsx')], sort=False)
all_files = pd.concat([xls_files, xlsx_files], sort=False).reset_index(drop=True)
I get what I want, except that the files are not concatenated in the order they appear in the folder:
in the all_files DataFrame I first have data from 202.xlsx and only then from 190.xls.
How can I solve this problem?
Thank you in advance!
Try using
import pandas as pd
from pathlib import Path
my_path = 'my_Dataset/'
files = sorted(
    list(Path(my_path).rglob('*.xls')) + list(Path(my_path).rglob('*.xlsx')),
    key=lambda x: int(x.stem)  # sort numerically on the filename stem
)
all_files = pd.concat([pd.read_excel(f) for f in files], sort=False).reset_index(drop=True)
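One caveat with key=lambda x: int(x.stem): it raises ValueError as soon as one filename stem is not purely numeric. If that can happen in your folder, a defensive key is safer; a small sketch (the sample filenames here are made up):

```python
from pathlib import Path

def numeric_then_name(p: Path):
    # purely numeric stems sort by their value; anything else sorts after them, by name
    return (0, int(p.stem), "") if p.stem.isdigit() else (1, 0, p.name)

names = ["202.xlsx", "190.xls", "notes.xlsx", "195.xls"]
ordered = sorted((Path(n) for n in names), key=numeric_then_name)
# ordered: 190.xls, 195.xls, 202.xlsx, notes.xlsx
```

The same function can be passed as key= to the sorted(...) call above in place of the int(x.stem) lambda.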
Update this
all_files = pd.concat([xls_files, xlsx_files], sort=False).reset_index(drop=True)
to this
all_files = pd.concat([xlsx_files, xls_files], sort=False).reset_index(drop=True)
I have an Excel file that needs to be refreshed automatically every week. It must be extended by other Excel files. The problem is that these files have different names each time.
So in my opinion I cannot use code like:
import pandas as pd
NG = 'NG.xlsx'
df = pd.read_excel(NG)
because the filename is not always "NG" like in this case.
Do you have any ideas?
Best Greetz
You could read all the files in your folder like this, since it does not depend on the file names:
import glob
import pandas as pd
# get data file names
path = r"C:\.......\folder_with_excel"
filenames = glob.glob(path + "/*.xlsx")
# read the first sheet of every workbook and concatenate them
dfs = []
for filename in filenames:
    xl_file = pd.ExcelFile(filename)
    dfs.append(xl_file.parse('Sheet1'))
df = pd.concat(dfs, ignore_index=True)
Alternatively:
import os
import pandas as pd
path = os.getcwd()
files = os.listdir(path)  # list all the files in your directory
files_xls = [f for f in files if f.endswith('.xlsx')]  # keep only the xlsx files
frames = []
for f in files_xls:
    info = pd.read_excel(f, '<sheet name>')  # remove '<sheet name>' if you don't need it
    frames.append(info)
df = pd.concat(frames, ignore_index=True)
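Since the question is about a file whose name changes every week, another option is to skip reading everything and instead pick the most recently modified workbook. A sketch, assuming the newest file is the one you want (the temporary files below exist only for demonstration):

```python
import os
import tempfile
from pathlib import Path

def newest_xlsx(folder):
    """Return the most recently modified .xlsx in folder, or None if there is none."""
    return max(Path(folder).glob("*.xlsx"), key=lambda p: p.stat().st_mtime, default=None)

# demonstration with a throwaway folder
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "old.xlsx").touch()
    (Path(tmp) / "new.xlsx").touch()
    os.utime(Path(tmp) / "old.xlsx", (1, 1))  # force an older timestamp on old.xlsx
    latest = newest_xlsx(tmp)
```

You would then load it with pd.read_excel(latest).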
I get the error "No objects to concatenate". I cannot import .csv files from main and its subdirectories to concatenate them into one DataFrame. I am using pandas. Old answers did not help me, so please do not mark this as a duplicate.
Folder structure is like that
main/*.csv
main/name1/name1/*.csv
main/name1/name2/*.csv
main/name2/name1/*.csv
main/name3/*.csv
import pandas as pd
import os
import glob
folder_selected = 'C:/Users/jacob/Documents/csv_files'
This does not work:
frame = pd.concat(map(pd.read_csv, glob.iglob(os.path.join(folder_selected, "/*.csv"))))
This does not work:
csv_paths = glob.glob('*.csv')
dfs = [pd.read_csv(folder_selected) for folder_selected in csv_paths]
df = pd.concat(dfs)
This does not work:
all_files = []
all_files = glob.glob (folder_selected + "/*.csv")
file_path = []
for file in all_files:
df = pd.read_csv(file, index_col=None, header=0)
file_path.append(df)
frame = pd.concat(file_path, axis=0, ignore_index=False)
You need to search the subdirectories recursively.
folder = 'C:/Users/jacob/Documents/csv_files'
path = folder+"/**/*.csv"
Using glob.iglob
df = pd.concat(map(pd.read_csv, glob.iglob(path, recursive=True)))
Using glob.glob
csv_paths = glob.glob(path, recursive=True)
dfs = [pd.read_csv(csv_path) for csv_path in csv_paths]
df = pd.concat(dfs)
Using os.walk
import fnmatch
import os

file_paths = []
for base, dirs, files in os.walk(folder):
    for file in fnmatch.filter(files, '*.csv'):
        file_paths.append(os.path.join(base, file))
df = pd.concat([pd.read_csv(file) for file in file_paths])
Using pathlib
from pathlib import Path
files = Path(folder).rglob('*.csv')
df = pd.concat(map(pd.read_csv, files))
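Whichever variant you use, the "No objects to concatenate" error from the question means the pattern matched nothing: pd.concat refuses an empty list. A small guard, sketched here with in-memory frames rather than real files:

```python
import pandas as pd

def safe_concat(frames):
    """Concatenate a (possibly empty) iterable of DataFrames without raising."""
    frames = list(frames)
    if not frames:
        return pd.DataFrame()  # nothing matched the glob: return an empty frame
    return pd.concat(frames, ignore_index=True)

empty = safe_concat([])
some = safe_concat([pd.DataFrame({"a": [1]}), pd.DataFrame({"a": [2]})])
```

An empty result then signals a wrong path or pattern instead of a crash deep inside pandas.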
Check out the Dask library, which can read many files into one dataframe:
>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')
Read their docs
https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files
Python’s pathlib is a tool for such tasks
from pathlib import Path
FOLDER_SELECTED = 'C:/Users/jacob/Documents/csv_files'
path = Path(FOLDER_SELECTED) / Path("main")
# grab all csvs in main and subfolders
df = pd.concat(pd.read_csv(f) for f in path.rglob("*.csv"))
Note:
If the CSVs need preprocessing, you can create a read_csv wrapper function to deal with the issues and use it in place of pd.read_csv.
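For example, such a wrapper could also tag each row with the file it came from, which helps when debugging a bad concatenation. A sketch using in-memory buffers instead of real files (the labels are made up):

```python
import io
import pandas as pd

def read_csv_tagged(src, label=None):
    """pd.read_csv wrapper that records which source each row came from."""
    df = pd.read_csv(src)
    df["source_file"] = label if label is not None else str(src)
    return df

frames = [read_csv_tagged(io.StringIO("a,b\n1,2\n"), label="one.csv"),
          read_csv_tagged(io.StringIO("a,b\n3,4\n"), label="two.csv")]
combined = pd.concat(frames, ignore_index=True)
```

With real files it would be pd.concat(read_csv_tagged(f) for f in path.rglob("*.csv")).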
What's the best way to create a pandas dataframe from one file with any file name located in a specified folder?
I have used pathlib, but it's not quite working: the output dataframe does not contain any of the file's data.
from pathlib import Path
import pandas as pd
pth = r'C:\Users\HP\Desktop\IBM\New folder'
fle = Path(pth).glob('*.tsv')
someDf = pd.DataFrame(fle)
someDf
Edit:
I also tried doing the below, but the output dataframe combines all columns into one column separated by a backward slash. How do I fix this?
from pathlib import Path
import pandas as pd
pth = r'C:\Users\HP\Desktop\IBM\New folder'
fle = Path(pth).glob('*.tsv')
dfs = []
for filename in fle:
    dfs.append(pd.read_csv(filename))
dfs1 = pd.concat(dfs)
dfs1.head()
The way I did this seems complicated. Is there an easier way to do this?
Please try:
import os
import pandas as pd
pth = r'C:\Users\HP\Desktop\IBM\New folder'
for file_ in os.listdir(pth):
    h = os.path.join(pth, file_)
    #print(h)
    someDf = pd.read_csv(h, sep='\t')
someDf
Try
import pandas as pd
from glob import glob
files = glob(r'C:\Users\HP\Desktop\IBM\New folder\*.tsv')  # raw string, because of the backslashes
if len(files) == 1:
    dfs = pd.read_csv(files[0], sep='\t')
else:
    dfs = pd.concat([pd.read_csv(file, sep='\t') for file in files])
The solution I found for this is as below. I missed the sep parameter in pd.read_csv().
from pathlib import Path
import pandas as pd
pth = r'C:\Users\HP\Desktop\IBM\New folder'
fle = Path(pth).glob('*.tsv')
dfs = []
for filename in fle:
    dfs.append(pd.read_csv(filename, sep='\t'))
dfs1 = pd.concat(dfs)
dfs1.head()
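To see why the sep parameter was the fix, compare parsing the same tab-separated text with and without it (an in-memory string stands in for a .tsv file here):

```python
import io
import pandas as pd

tsv = "a\tb\n1\t2\n"
wrong = pd.read_csv(io.StringIO(tsv))            # default sep=',' mashes everything into one column
right = pd.read_csv(io.StringIO(tsv), sep="\t")  # tab separator yields two proper columns
```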
I want to concatenate several csv files which I saved in a directory ./Errormeasure. In order to do so, I used the following answer from another thread https://stackoverflow.com/a/51118604/9109556
filepaths = [f for f in listdir('./Errormeasure') if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))
print(df)
However, this code only works when the csv files I want to concatenate exist both in the ./Errormeasure directory and in the directory below it, ./venv. This is obviously not convenient.
When I have the csv files only in ./Errormeasure, I receive the following error:
FileNotFoundError: [Errno 2] File b'errormeasure_871687110001543570.csv' does not exist: b'errormeasure_871687110001543570.csv'
Can you give me some tips to tackle this problem? I am using pycharm.
Thanks in advance!
Using os.listdir() only retrieves file names and not parent folders which is needed for pandas.read_csv() at relative (where pandas script resides) or absolute levels.
Instead consider the recursive feature of built-in glob (only available in Python 3.5+) to return full paths of all csv files at top level and subfolders.
import glob
for f in glob.glob(dirpath + "/**/*.csv", recursive=True):
print(f)
From there, build the data frames in a list comprehension (bypassing map; see "List comprehension vs map") to be concatenated with pd.concat:
df_files = [pd.read_csv(f) for f in glob.glob(dirpath + "/**/*.csv", recursive=True)]
df = pd.concat(df_files)
print(df)
For Python < 3.5, consider os.walk() + os.listdir() to retrieve full paths of csv files:
import os
import pandas as pd
# COMBINE CSVs IN CURR FOLDER + SUB FOLDERS
fpaths = [os.path.join(dirpath, f)
for f in os.listdir(dirpath) if f.endswith('.csv')] + \
[os.path.join(fdir, fld, f)
for fdir, flds, ffile in os.walk(dirpath)
for fld in flds
for f in os.listdir(os.path.join(fdir, fld)) if f.endswith('.csv')]
df = pd.concat([pd.read_csv(f) for f in fpaths])
print(df)
import pandas as pd
import glob
path = r'C:\Directory' # use your path
files = glob.glob(path + "/*.csv")
df_list = []  # avoid shadowing the built-in name "list"
for file in files:
    df = pd.read_csv(file, index_col=None, header=0)
    df_list.append(df)
frame = pd.concat(df_list, axis=0, ignore_index=True)
Maybe you need to use '\\' instead of '/':
file = glob.glob(os.path.join(your_path, '*.csv'))
print(file)
You can then process the resulting file list in a for loop.
I have a script to output a whole bunch of CSVs to folder c:\Scripts\CSV. This particular script is looping through all of the dataframes and counting the usage of the top 100 words in the data set. The top 100 words and their count are added to a list, the dataframes are concatenated, and then the csv should export. The print contains the correct information, but the script doesn't output any file.
#! python3
import pandas as pd
import os
path = r'Scripts\\CSV\\'
directory = os.path.join("c:\\",path)
appended_data = []
for root, dirs, files in os.walk(directory):
    for file in files:
        if file.endswith(".csv"):
            thread = pd.read_csv(directory + file)
            thread.columns = ['num', 'id', 'body', 'title', 'url']
            s = pd.Series(''.join(thread['body']).lower().split()).value_counts()[:100]
            appended_data.append(s)
thatdata = pd.concat(appended_data)
#print(appended_data)
thatdata.to_csv = (directory + 'somename.csv')
Try using pathlib instead. Note that PureWindowsPath is a pure path class that cannot touch the filesystem, so it has no glob(); use a concrete Path. Also, to_csv must be called as a method, not assigned to:
from pathlib import Path
directory = Path('c:/Scripts/CSV/')
for csv_f in directory.glob('**/*.csv'):
    ...  # process inputs
target_path = directory / 'somename.csv'
thatdata.to_csv(target_path)
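To make the root cause explicit: the original script assigned a string to the to_csv attribute instead of calling the method, so nothing was ever written. A minimal sketch of the difference, writing to an in-memory buffer with made-up data:

```python
import io
import pandas as pd

df = pd.DataFrame({"word": ["the", "and"], "count": [10, 7]})

# Broken: `df.to_csv = some_path` just rebinds the attribute and writes nothing.
# Correct: call to_csv with the target (a StringIO stands in for a file path here).
buf = io.StringIO()
df.to_csv(buf, index=False)
csv_text = buf.getvalue()
```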