Iterate through all sheets of all workbooks in a directory - python

I am trying to combine all sheets from all workbooks in a directory into a single DataFrame. I've tried with glob and with os.scandir, but either way I only get the first sheet of each workbook.
First attempt:
import pandas as pd
import glob
workbooks = glob.glob(r"\mydirectory\*.xlsx")
list = []
for file in workbooks:
    df = pd.concat(pd.read_excel(file, sheet_name=None), ignore_index=True)
    list.append(df)
dataframe = pd.concat(list, axis=0)
Second attempt:
import os
import pandas as pd
df = pd.DataFrame()
path = r"\mydirectory\"
with os.scandir(path) as files:
    for file in files:
        data = pd.read_excel(file, sheet_name=None)
        df = df.append(data)
I think the issue lies with the for loop but I'm too inexperienced to pin down the problem. Any help would be greatly appreciated, thx!!!

If I understand what you have written correctly, you want something like this:
import pandas as pd
import glob
# list of workbooks in directory
workbooks = glob.glob(r"\mydirectory\*.xlsx")
l = []
# for each file in the list
for file in workbooks:
    # pd.ExcelFile allows retrieving the sheet names
    xl_file = pd.ExcelFile(file)
    # concatenate DataFrames created from each sheet in the file
    df = pd.concat([pd.read_excel(file, sheet) for sheet in xl_file.sheet_names],
                   ignore_index=True)
    # append to list
    l.append(df)
# concatenate all file DataFrames into one DataFrame
dataframe = pd.concat(l, axis=0)
This loops through the sheets within each Excel file for the concatenation, which is the only difference from what you had already written.
Alternative:
Alternatively, without needing to first find the sheet names, the dictionary created by pd.read_excel(file, sheet_name=None) can be used.
import pandas as pd
import glob
# list of workbooks in directory
workbooks = glob.glob(r"\mydirectory\*.xlsx")
l = []
# for each file in the list
for file in workbooks:
    # concatenate the dictionary of DataFrames from pd.read_excel
    df = pd.concat(pd.read_excel(file, sheet_name=None), ignore_index=True)
    l.append(df)
# concatenate all file DataFrames into one DataFrame
dataframe = pd.concat(l, axis=0)
A good explanation/example of the use of sheet_name=None can be found here. In short, it returns a dictionary with one DataFrame per sheet. That dictionary can then be concatenated into one DataFrame, as above, or an individual sheet's DataFrame can be accessed through dictionary["sheet_name"].
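For illustration, a minimal sketch of working with that dictionary directly; the file name "report.xlsx" and the sheet name "January" below are hypothetical placeholders:
import pandas as pd

# "report.xlsx" and the sheet name "January" are placeholders for illustration
sheets = pd.read_excel("report.xlsx", sheet_name=None)  # dict of {sheet name: DataFrame}

print(list(sheets.keys()))                       # all sheet names in the workbook
january = sheets["January"]                      # DataFrame for a single sheet
combined = pd.concat(sheets, ignore_index=True)  # every sheet stacked into one DataFrame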

Related

How to use python to merge multiple sheets from an excel file and values from particular cells

I have an excel file with multiple sheets, the actual data I need from each sheet is from cell B7 to F38, how can I merge all the sheets' data into one by using Python?
Collect the data with pandas, then concatenate the contents of the sheets (if they have the same column names) and insert the resulting DataFrame somewhere (e.g. on the first sheet):
import xlwings as xw
import pandas as pd

path = r"test.xlsx"

# read every sheet into a dict of DataFrames and stack them row-wise
df_dict = pd.read_excel(path, sheet_name=None)
df_result = pd.concat(df_dict.values(), axis=0)

# write the combined DataFrame back to the first sheet with xlwings
wb = xw.Book(path)
ws = wb.sheets[0]
ws["A1"].options(pd.DataFrame, index=False).value = df_result

How to combine data from different tables on the same Excel list in pandas?

I have one Excel file with several different tables on the same sheet (List1).
I want to unify them into one table in pandas. Could you advise me how to do it in pandas?
Thanks in advance!
If you want to concatenate the first sheet of some excel files, you can use the following code block:
import os
import pandas as pd
cwd = os.path.abspath('')
files = os.listdir(cwd)
## gets the first sheet of a given file
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel(file), ignore_index=True)
df.head()
df.to_excel('total_sales.xlsx')
and if you want to merge various sheets of a given excel file in one pandas data frame, you can use the following code:
##gets all sheets of a given file
df_total = pd.DataFrame()
for file in files:  # loop through Excel files
    if file.endswith('.xlsx'):
        excel_file = pd.ExcelFile(file)
        sheets = excel_file.sheet_names
        for sheet in sheets:  # loop through sheets inside an Excel file
            df = excel_file.parse(sheet_name=sheet)
            df_total = df_total.append(df)
df_total.to_excel('combined_file.xlsx')
You can put your different tables in various sheets of one excel file or different excel files and then concatenate them using the above codes.
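Note that DataFrame.append was deprecated and then removed in pandas 2.0, so on current pandas the same sheet-by-sheet loop can be written by collecting the frames in a list and calling pd.concat once. A sketch under that assumption:
import os
import pandas as pd

cwd = os.path.abspath('')
files = os.listdir(cwd)

## gets all sheets of a given file, using pd.concat since DataFrame.append was removed in pandas 2.0
frames = []
for file in files:  # loop through Excel files
    if file.endswith('.xlsx'):
        excel_file = pd.ExcelFile(file)
        for sheet in excel_file.sheet_names:  # loop through sheets inside an Excel file
            frames.append(excel_file.parse(sheet_name=sheet))
df_total = pd.concat(frames, ignore_index=True)
df_total.to_excel('combined_file.xlsx')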

How to read multiple ann files (from brat annotation) within a folder into one pandas dataframe?

I can read one ann file into pandas dataframe as follows:
df = pd.read_csv('something/something.ann', sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
df.head()
But I don't know how to read multiple ann files into one pandas dataframe. I tried to use concat, but the result is not what I expected.
How can I read many ann files into one pandas dataframe?
It sounds like you need to use glob to pull in all the .ann files from a folder and add them to a list of dataframes. After that you probably want to join/merge/concat etc. as required.
I don't know your exact requirements but the code below should get you close. As it stands at the moment the script assumes, from where you are running the Python script, you have a subfolder called files and in that you want to pull in all the .ann files (it will not look at anything else). Obviously review and change as required as it's commented per line.
import pandas as pd
import glob
path = r'./files' # use your path
all_files = glob.glob(path + "/*.ann")
# create empty list to hold dataframes from files found
dfs = []
# for each file in the path above ending .ann
for file in all_files:
    # open the file
    df = pd.read_csv(file, sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
    # add this new (temp during the looping) frame to the end of the list
    dfs.append(df)

# at this point you have a list of frames with each list item as one .ann file. Like [annFile1, annFile2, etc.] - just not those names.

# handle a list that is empty
if len(dfs) == 0:
    print('No files found.')
    # create a dummy frame
    df = pd.DataFrame()
# or have only one item/frame and get it out
elif len(dfs) == 1:
    df = dfs[0]
# or concatenate more than one frame together
else:  # modify this join as required
    df = pd.concat(dfs, ignore_index=True)
    df = df.reset_index(drop=True)

# check what you've got
print(df.head())

How to concatenate multiple csv files into a single csv file using a column as index using python

I have to merge different csv files which contain features about a place based on place_id into one so that I can create a model to predict a rating for a particular place.
I have already tried using pandas.concat and merging the files through the Linux terminal, but I just get null values for all the other features because the place_id keeps repeating.
#importing libraries
import pandas as pd
import numpy as np
import glob
#creating a single dataframe
fileList = glob.glob('chef*.csv')
fileList.append('rating_final.csv')
dfList = []
for file in fileList:
    print(file)
    df = pd.read_csv(file)
    dfList.append(df)
concatDf = pd.concat(dfList, axis=0)
I expect to get a csv file with different features according to a single place_id but what I get is a csv file in which place_id keeps on repeating with a single feature only.
Try this,
import pandas as pd
df2 = pd.read_csv('rating_final.csv')
df2.to_csv('chef*.csv', mode='a', header=False, index=False)
test_df = pd.concat([pd.read_csv('chef*.csv'), df2], ignore_index=True, sort=True)
print(test_df)
The merged output will be available in chef*.csv file.
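Since the goal is one row of features per place_id rather than stacked rows, an alternative sketch merges the files on that key instead of concatenating them. This assumes every CSV actually contains a place_id column, and the output file name is made up:
import glob
from functools import reduce

import pandas as pd

fileList = glob.glob('chef*.csv')
fileList.append('rating_final.csv')

# read every file, then merge them pairwise on the shared place_id key
dfList = [pd.read_csv(file) for file in fileList]
concatDf = reduce(lambda left, right: pd.merge(left, right, on='place_id', how='outer'), dfList)

# 'merged_features.csv' is just an example output name
concatDf.to_csv('merged_features.csv', index=False)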

python efficient way to append all worksheets in multiple excel into pandas dataframe

I have around 20+ xlsx files, and each xlsx file might contain a different number of worksheets. But thank god, all the columns are the same in all worksheets and all xlsx files. By referring to here, I got some idea. I have been trying a few ways to import and append all Excel files (all worksheets) into a single dataframe (around 4 million rows of records).
Note: I did check here as well, but it only covers the file level; mine goes from the file level down to the worksheet level.
I have tried the code below:
# import all necessary package
import pandas as pd
from pathlib import Path
import glob
import sys
# set source path
source_dataset_path = "C:/Users/aaa/Desktop/Sample_dataset/"
source_dataset_list = glob.iglob(source_dataset_path + "Sales transaction *")
for file in source_dataset_list:
    #xls = pd.ExcelFile(source_dataset_list[i])
    sys.stdout.write(str(file))
    sys.stdout.flush()
    xls = pd.ExcelFile(file)
    out_df = pd.DataFrame()  ## create empty output dataframe
    for sheet in xls.sheet_names:
        sys.stdout.write(str(sheet))
        sys.stdout.flush()  ## view the Excel file's sheet names
        #df = pd.read_excel(source_dataset_list[i], sheet_name=sheet)
        df = pd.read_excel(file, sheetname=sheet)
        out_df = out_df.append(df)  ## this will append rows of one dataframe to another (just like your expected output)
Question:
My approach is to first read every single Excel file and get the list of sheets inside it, then load the sheets and append them all. The looping seems not very efficient, especially since the data size increases with every append.
Is there any other, more efficient way to import and append all sheets from multiple Excel files?
Use sheet_name=None in read_excel to return an ordered dict of DataFrames created from all sheet names, then join them together with concat and finally DataFrame.append to the final DataFrame:
out_df = pd.DataFrame()
for f in source_dataset_list:
    df = pd.read_excel(f, sheet_name=None)
    cdf = pd.concat(df.values())
    out_df = out_df.append(cdf, ignore_index=True)
Another solution:
cdf = [pd.read_excel(excel_names, sheet_name=None).values()
       for excel_names in source_dataset_list]
out_df = pd.concat([pd.concat(x) for x in cdf], ignore_index=True)
If I understand your problem correctly, setting sheet_name=None in pd.read_excel does the trick.
import os
import pandas as pd
path = "C:/Users/aaa/Desktop/Sample_dataset/"
dfs = [
    pd.concat(pd.read_excel(path + x, sheet_name=None))
    for x in os.listdir(path)
    if x.endswith(".xlsx") or x.endswith(".xls")
]
df = pd.concat(dfs)
I have a pretty straightforward solution if you want to read all the sheets.
import pandas as pd
df = pd.concat(pd.read_excel(path + file_name, sheet_name=None),
               ignore_index=True)
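That one-liner covers a single workbook; to cover every workbook in a folder it can be wrapped in a comprehension over glob. A sketch, where the folder pattern is an assumption to adapt:
import glob
import pandas as pd

# the folder pattern below is an assumption; point it at your directory of workbooks
all_files = glob.glob("C:/Users/aaa/Desktop/Sample_dataset/*.xlsx")
df = pd.concat(
    (pd.concat(pd.read_excel(f, sheet_name=None), ignore_index=True) for f in all_files),
    ignore_index=True,
)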
