I am trying to stack multiple workbooks by sheet in Python. My folder currently contains over 50 individual workbooks, separated by date; each usually contains up to 3 sheets, although not always.
The sheets include "Pay", "Tablet", "SIMO".
I want to try to stack all data from the "Tablet" sheet into a new workbook and have been using the following code.
import os
import pandas as pd

path = r"path_location"
files = os.listdir(path)
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel(file, sheet_name='Tablet'), ignore_index=True)
df.head()
df.to_csv('Tablet_total.csv')
However, after checking the files I realised this is not pulling from all workbooks that have the sheet "Tablet". I suspect this may be because not all workbooks have this sheet, but in case I missed anything I'd greatly appreciate some ideas as to what I may be doing wrong.
Also, as a final request: the sheet "Tablet" has unnecessary columns at the beginning in every workbook.
I have tried incorporating df.drop(index=df.index[:7], axis=0, inplace=True) into my loop, yet this only removes the first 7 rows from the first iteration. Again, any support with this would be greatly appreciated.
Many thanks
First, I would check that you do not have any .xls files or other excel file suffixes with:
import os
path = r"path_location"
files = os.listdir(path)
print({
    file.split('.')[-1]  # [-1] so filenames containing extra dots still report their real suffix
    for file in files
})
Then, I would check that you don't have any sheet names with trailing white space or capitalization issues with:
import os
import pandas
path = r"path_location"
files = os.listdir(path)
print({
    sheet_name
    for file in files
    if file.endswith('.xlsx')
    for sheet_name in pandas.ExcelFile(os.path.join(path, file)).sheet_names
})
I would use pandas.concat() with a list comprehension to concatenate the sheets. I would also add a check to ensure that the workbook has a sheet named 'Tablet'. Finally, if you don't want the first seven columns, you should a) drop them from each dataframe as it is read in, before it is concatenated with the other dataframes, and b) first include all the rows and then specify the columns with .iloc[:, 7:]
import os
import pandas
path = r"path_location"
files = os.listdir(path)
df = pandas.concat([
    pandas.read_excel(os.path.join(path, file), sheet_name='Tablet').iloc[:, 7:]
    for file in files
    if file.endswith('.xlsx') and 'Tablet' in pandas.ExcelFile(os.path.join(path, file)).sheet_names
])
df.head()
Check whether you have Excel files with other extensions - .xlsm, .xlsb and others.
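For example, a quick sketch that tallies the extensions present in the folder (assuming the same path variable as in the question):
import os
from collections import Counter

path = r"path_location"
# tally files per extension, e.g. {'.xlsx': 48, '.xlsm': 2}
print(Counter(os.path.splitext(file)[1].lower() for file in os.listdir(path)))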
In order to remove the seven rows on each iteration, you need to read each sheet into a temporary dataframe and delete the rows from that:
df_tmp = pd.read_excel(file, sheet_name='Tablet')
df_tmp.drop(index=df_tmp.index[:7], axis=0, inplace=True)
Since pandas.DataFrame.append is deprecated, use concat() instead.
df = pd.concat([df, df_tmp], ignore_index=True)
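Putting the pieces together, a minimal sketch of the corrected loop (assuming the same path layout as in the question):
import os
import pandas as pd

path = r"path_location"
df = pd.DataFrame()
for file in os.listdir(path):
    if file.endswith('.xlsx'):
        # read into a temporary dataframe first...
        df_tmp = pd.read_excel(os.path.join(path, file), sheet_name='Tablet')
        # ...so the first 7 rows are dropped on every iteration, not just the first
        df_tmp.drop(index=df_tmp.index[:7], axis=0, inplace=True)
        df = pd.concat([df, df_tmp], ignore_index=True)
df.to_csv('Tablet_total.csv')
If some workbooks lack the "Tablet" sheet, add the sheet-name check from the answer above before calling read_excel.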
Related
I am trying to collect multiple csv files into one Excel workbook, keeping the name of each csv file as its sheet name, but the loop does not save a sheet for each step and I only get the last sheet.
for i in range(0,len(dir)):
for filee in os.listdir(dir):
if filee.endswith(".csv"):
file_path = os.path.join(dir, filee)
df = pd.read_csv(file_path, on_bad_lines='skip')
df.to_excel("output.xlsx",sheet_name=filee, index=False)
i=i+1
I have tried ExcelWriter but the file raised an error.
Could anyone help to fix this problem?
Regards
This code would produce a SyntaxError, since the first for loop is not defined properly. However, let's assume it is only an IndentationError and move on to the for-loop body.
For each .csv file, the loop reads it into a pandas.DataFrame and writes it to output.xlsx. In other words, you overwrite the file on each iteration, which is why you only see the last sheet.
Please have a look at this link: Add worksheet to existing Excel file with pandas
Usually the problem is the type of the sheet name. For example, in df.to_excel("Output.xlsx", sheet_name='1'), if I don't put the 1 in quotation marks I get an error; the sheet name must always be of str type.
For example, I have the following csv files in my Google Colab files:
With the following code, I first read all of them into df and then transfer them to the Excel file (in separate sheets).
import pandas as pd

df = {}
for i in range(1, 5):
    df[i] = pd.read_csv('sample_data/file' + str(i) + '.csv')

with pd.ExcelWriter('output.xlsx') as writer:
    for i in range(1, 5):
        df[i].to_excel(writer, sheet_name=str(i))
It works fine for me and I don't get any errors.
You can use a dict comprehension to store all the dataframes keyed by file name, then pass the dict to a function that unpacks it and writes each dataframe to its own sheet.
from pathlib import Path
import pandas as pd
path = "/path/to/csv/files"
def write_sheets(file_map: dict) -> None:
    with pd.ExcelWriter(f"{path}/output.xlsx", engine="xlsxwriter") as writer:
        for sheet_name, df in file_map.items():
            df.to_excel(writer, sheet_name=sheet_name, index=False)

file_mapping = {Path(file).stem: pd.read_csv(file) for file in Path(path).glob("*.csv")}
write_sheets(file_mapping)
I am attempting to import a large group of Excel files, and the code that selects what to import is included below.
df = pd.read_excel(file, sheet_name=['Sheet1', 'Sheet2'])
I know that the Excel files use either Sheet1 or Sheet2, but they do not use both. This makes my code error out. Is there any way to tell pandas to try importing Sheet1 and, if that errors, try Sheet2?
Thanks for any help.
try:
    df = pd.read_excel(file, sheet_name=['Sheet1'])
except ValueError:  # recent pandas raises ValueError when a sheet is not found
    df = pd.read_excel(file, sheet_name=['Sheet2'])
Assuming your Excel files aren't too large to import everything, you could do this:
df = pd.read_excel(file, sheet_name=None)
That would return all the sheets in the file as a dict, where the key is sheet name and the value is the dataframe. You can then test for the key you want and use that sheet, and drop the rest.
(Edit: I'll note that this may be a heavy-handed approach, but I tried to generalize the answer to how to select one or more sheets when you aren't sure of their names)
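For instance, a small sketch of picking whichever sheet is present from the returned dict (assuming file as in the question):
import pandas as pd

sheets = pd.read_excel(file, sheet_name=None)  # {sheet name: DataFrame}
# use Sheet1 if it exists, otherwise fall back to Sheet2
df = sheets.get('Sheet1', sheets.get('Sheet2'))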
I have around 10k .csv files, named data0, data1 and so on in sequence. I want to combine them into a master sheet in one file, or at least a couple of sheets, using Python, because I think there is a limit of around 1,070,000 rows in one Excel file?
import pandas as pd
import os

master_df = pd.DataFrame()
for file in os.listdir(os.getcwd()):
    if file.endswith('.csv'):
        # DataFrame.append is deprecated, so concatenate instead
        master_df = pd.concat([master_df, pd.read_csv(file)])
master_df.to_csv('master file.CSV', index=False)
A few things to note:
Please check your csv file content first. Columns can easily be mismatched when reading a csv that contains text (maybe a ; in the content). Alternatively, you can try changing the csv engine:
df = pd.read_csv(csvfilename, sep=';', encoding='utf-8', engine='python')
If you want to combine everything into one sheet, you can concat into one dataframe first, then call to_excel:
df = pd.concat([df, sh_tmp], axis=0, sort=False)
Note: concat or append is a straightforward way to combine data. However, 10k files can make performance a real concern. If you face performance issues, try accumulating the dataframes in a list instead of calling pd.concat repeatedly, as sketched after this list.
Excel has a maximum row limit. 10k files would easily exceed the limit (1,048,576 rows per sheet). You might change the output to a csv file or split it across multiple .xlsx files.
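A minimal sketch of that list-based approach (assuming the csv files sit in the current working directory):
import os
import pandas as pd

# collect each csv as its own dataframe, then concatenate once at the end
frames = [pd.read_csv(f) for f in os.listdir(os.getcwd()) if f.endswith('.csv')]
master_df = pd.concat(frames, ignore_index=True)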
---- update on the third point ----
You can try grouping the data first (1,000,000 rows each), then writing the groups to sheets one by one.
row_limit = 1000000
master_df['group'] = master_df.index // row_limit
with pd.ExcelWriter(path_out) as writer:  # the context manager saves the file on exit
    for gr in range(0, master_df['group'].max() + 1):
        master_df.loc[master_df['group'] == gr].to_excel(writer, sheet_name='Sheet' + str(gr), index=False)
I have been able to generate several CSV files through an API. Now I am trying to combine all the CSVs into a single master file so that I can work on it, but it does not work. The code below is what I have attempted. What am I doing wrong?
import glob
import pandas as pd
from pandas import read_csv

master_df = pd.DataFrame()
for file in files:  # 'files' is presumably a list of csv paths, e.g. from glob.glob
    df = read_csv(file)
    master_df = pd.concat([master_df, df])
    del df
master_df.to_csv("./master_df.csv", index=False)
Although it is hard to tell what the precise problem is without more information (i.e., the error message and your pandas version), I believe it is that in the first iteration master_df and df do not have the same columns: master_df is an empty DataFrame, whereas df has whatever columns are in your CSV. If this is indeed the problem, I'd suggest storing all your dataframes (each of which represents one CSV file) in a single list and then concatenating all of them at once, like so:
import pandas as pd
df_list = [pd.read_csv(file) for file in files]
pd.concat(df_list, sort=False).to_csv("./master_df.csv", index=False)
Don't have time to find/generate a set of CSV files and test this right now, but am fairly sure this should do the job (assuming pandas version 0.23 or compatible).
Firstly, I ask the admins not to close this topic. The last time I opened one, it was closed on the grounds that there were similar topics, but those are not the same. Thanks in advance.
Every day I receive 15-20 Excel files with a huge number of worksheets (more than 200). Fortunately the worksheet names and counts are the same across all the Excel files. I want to merge all the Excel files into one file with multiple worksheets. I am new to Python; I have watched and read a lot about the options but could not find a way. Thanks for your support.
An example of what I tried: I have two files with two sheets each (the actual sheet count is huge, as mentioned above), and I want to merge both files into one file, sum.xlsx, with the same two sheets.
(screenshots of the two input workbooks and the desired sum.xlsx omitted)
import os
import openpyxl
import pandas as pd

files = os.listdir(r'C:\Python\db')  # the folder where the files are located
os.chdir(r'C:\Python\db')  # change the working directory
df = pd.DataFrame()  # create an empty data frame
wb = openpyxl.load_workbook(r'C:\Python\db\Data1.xlsx')  # load one file to extract the sheet names
sh_name = wb.sheetnames  # the list of sheet names
for i in sh_name:
    for f in files:
        data = pd.read_excel(f, sheet_name=i)
        df = df.append(data)
    df.to_excel('sum.xlsx', index=False, sheet_name=i)
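For what it's worth, a minimal sketch of one way to make this work (paths follow the question; the key fixes are writing all sheets through a single pd.ExcelWriter instead of overwriting sum.xlsx per sheet, and building a fresh dataframe for each sheet):
import os
import pandas as pd

folder = r'C:\Python\db'
files = [f for f in os.listdir(folder) if f.endswith('.xlsx')]

# take the sheet names from the first workbook; the question says
# all workbooks share the same sheet names and counts
sheet_names = pd.ExcelFile(os.path.join(folder, files[0])).sheet_names

with pd.ExcelWriter('sum.xlsx') as writer:
    for sheet in sheet_names:
        # stack this sheet across all workbooks, then write it once
        combined = pd.concat(
            [pd.read_excel(os.path.join(folder, f), sheet_name=sheet) for f in files],
            ignore_index=True,
        )
        combined.to_excel(writer, sheet_name=sheet, index=False)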