Convert multiple xlsm files automatically to multiple csv files by using pandas


I have 300 raw data files (.xlsm) and want to extract the useful data and turn them into csv files as input for a subsequent neural network. For now I am trying this with 10 files as an example. I have successfully extracted the information I need, but I don't know how to convert the results to csv files with the same names. For a single file we can use df.to_csv, but how about for all the files? With a for loop?
import glob
import pandas as pd
import numpy as np
import csv
import os
excel_files = glob.glob('../../Versuch/Versuche/RohBeispiel/*.xlsm')
directory = '/Beispiel'
for files in excel_files:
    data = pd.read_excel(files)
    # getting the list of rows and columns you need
    list_of_dfs = pd.DataFrame(data.values[0:600, 12:26],
                               columns=data.columns[12:26]).drop(['Sauberkeit', 'Temparatur'], axis=1)
    # converting pandas dataframe columns to numeric: string into float
    cols = ['KonzA', 'KonzB', 'KonzC', 'TempA',
            'TempB', 'TempC', 'Modul1', 'Modul2',
            'Modul3', 'Modul4', 'Modul5', 'Modul6']
    list_of_dfs[cols] = list_of_dfs[cols].apply(pd.to_numeric, errors='coerce', axis=1)
    # Filling down each column through missing data
    list_of_dfs[cols] = list_of_dfs[cols].ffill()
    csvfilename = files.split('/')[-1].split('.')[0] + '.csv'
    newtempfile = os.path.join(directory, csvfilename)
    print(newtempfile)
    print(list_of_dfs.head(2))
The problem is solved. I added the following at the end of the loop body:
    folder_name = 'Beispiel'
    csvfilename = files.split('/')[-1].split('.')[0] + '.csv'  # change into csv files
    newtempfile = os.path.join(folder_name, csvfilename)
    # Verify if directory exists
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)  # If not, create it
    print(newtempfile)
    list_of_dfs.to_csv(newtempfile, index=False)
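A hedged aside on the filename step: splitting on '/' and '.' assumes forward slashes in the path and no extra dots in the file name. A minimal sketch of the same mapping with pathlib follows; the helper name csv_path_for and the sample file name run.01.xlsm are illustrative and not part of the original code.

```python
from pathlib import Path


def csv_path_for(excel_path, out_dir):
    """Return out_dir/<same name>.csv for a given Excel file path."""
    # with_suffix swaps only the final extension, so 'run.01.xlsm' -> 'run.01.csv'
    return Path(out_dir) / Path(excel_path).with_suffix(".csv").name


# Hypothetical input file; only the path is manipulated, nothing is read from disk
target = csv_path_for("../../Versuch/Versuche/RohBeispiel/run.01.xlsm", "Beispiel")
print(target)
```

The returned path can be passed straight to to_csv, since pandas accepts Path objects as well as strings.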

The easiest way of doing this is to take the filename from the Excel file and then use os.path.join() to build the output path in the directory you want.
directory = "C:/Test"
for files in excel_files:
    csvfilename = os.path.basename(files).replace('.xlsm', '.csv')
    newtempfile = os.path.join(directory, csvfilename)
Since you already have the DataFrame you want to write to the csv file, just add the above code to the loop and change the output csv file to 'newtempfile', and that should do it.
df.to_csv(newtempfile)
Hope this helps. :)
Updated Code:
cols = ['KonzA', 'KonzB', 'KonzC', 'TempA',
        'TempB', 'TempC', 'Modul1', 'Modul2',
        'Modul3', 'Modul4', 'Modul5', 'Modul6']
excel_files = glob.glob('../../Versuch/Versuche/RohBeispiel/*.xlsm')
for file in excel_files:
    data = pd.read_excel(file, usecols=cols)  # import only the columns you need to the dataframe
    csvfilename = os.path.basename(file).replace('.xlsm', '.csv')
    newtempfile = os.path.join(directory, csvfilename)
    # converting pandas dataframe columns to numeric: string into float
    data[cols] = data[cols].apply(pd.to_numeric, errors='coerce', axis=1)
    # forward-fill missing values and assign the result back to the dataframe
    data[cols] = data[cols].ffill()
    data.to_csv(newtempfile)
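To make the coerce-then-fill step concrete, here is a small sketch on an in-memory DataFrame. The column names come from the question, but the values are invented for illustration. It assigns the results back to the frame, because calling fillna with inplace=True on a selection such as data[cols] only changes a temporary copy.

```python
import pandas as pd

# Invented sample values; "bad" and None stand in for unreadable cells
df = pd.DataFrame({
    "KonzA": ["1.5", "bad", "3.0"],
    "TempA": ["20", None, "22"],
})
cols = ["KonzA", "TempA"]

# Strings become floats; anything unparseable becomes NaN
df[cols] = df[cols].apply(pd.to_numeric, errors="coerce")

# Fill each NaN with the last valid value above it, then store the result
df[cols] = df[cols].ffill()
print(df)
```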

Related

Extracting multiple excel files as Pandas data frame

I'm trying to create a data ingestion routine that loads data from multiple Excel files, each with multiple tabs and columns, into pandas DataFrames. The tabs are structured the same way in every file. Any help would be appreciated!
folder = "specified_path"
files = os.listdir(folder)
sheet_contents = {}
for file in files:
    data = pd.ExcelFile(os.path.join(folder, file))
    file_data = {}
    for sheet in data.sheet_names:
        file_data[sheet] = data.parse(sheet)
    sheet_contents[file[:-5]] = file_data

One way to create a DataFrame for each Excel file (stored in a specific folder, each holding multiple sheets) is to combine pandas.read_excel and pandas.concat. Passing sheet_name=None to pandas.read_excel reads all the sheets of the file at once and returns them as a dict of DataFrames keyed by sheet name.
Try this:
import os
import pandas as pd
folder = 'specified_path'
excel_files = os.listdir(folder)
list_of_dfs = []
for file in excel_files:
    df = pd.concat(pd.read_excel(os.path.join(folder, file), sheet_name=None), ignore_index=True)
    df['excelfile_name'] = file.split('.')[0]
    list_of_dfs.append(df)
To access one of the DataFrames created, use its index (e.g. list_of_dfs[0]):
print(type(list_of_dfs[0]))
<class 'pandas.core.frame.DataFrame'>
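One thing ignore_index=True discards is which sheet each row came from. As a sketch under stated assumptions, the dict below stands in for what read_excel returns with sheet_name=None (the sheet names and values are made up), and pd.concat keeps the dict keys as an index level that can then be turned into a column.

```python
import pandas as pd

# Stand-in for pd.read_excel(path, sheet_name=None): a dict of sheet name -> DataFrame
sheets = {
    "Jan": pd.DataFrame({"value": [1, 2]}),
    "Feb": pd.DataFrame({"value": [3]}),
}

# The dict keys become the outer index level, named "sheet" here
combined = pd.concat(sheets, names=["sheet", "row"])

# Move the sheet name out of the index into an ordinary column
combined = combined.reset_index(level="sheet")
print(combined)
```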

Append filename to column header when reading multiple csv files

I want to read multiple .csv files and append the datetime part of their filename to the column header. Each csv file contains data acquired at a specific datetime. Each filename has the following format:
yyyy-mm-dd_hh-mm-ss_someothertext
Each file contains only one column of data.
I successfully import multiple files as a list of dataframes as follows:
import pandas as pd
import glob
path = r'C:\Users\...' #path
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
I then concatenate the files into one dataframe such that each column contains the data from one of the files:
frame = pd.concat(li, axis=1, ignore_index=True)
However, this is where I lose the filename information. The column headers are now just a series of numbers. My question is: how can I append the datetime portion of each filename to its respective column header in frame?
The closest I have got is being able to append the whole filename, not just the datetime part, in a roundabout way by transposing frame, adding the whole filename as a new column, transposing back, then setting the filename row as the header row...
import os
frame=pd.DataFrame.transpose(frame)
frame['filename'] = os.path.basename(filename)
frame=pd.DataFrame.transpose(frame)
frame.reset_index(drop=True)
frame.columns = frame.iloc[6628] #row 6628 is where the row with the filenames ends up after transposing
This seems terribly inefficient though and ends up with the whole filename as the header rather than just the datetime part.

This would be my suggested approach, squeezing the DataFrame and using regex:
import re
import os
import glob
import pandas as pd
path = r'C:\Users\....'  # raw string, so the backslashes are not treated as escapes
files = glob.glob(os.path.join(path, '*.csv'))
li = []
for file in files:
    name = os.path.basename(file)  # get filename
    # extract yyyy-mm-dd_hh-mm-ss from filename
    date = re.search(r'\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}', name).group(0)
    # read file, squeeze the single column to a Series, rename to the datetime stamp
    li.append(pd.read_csv(file, index_col=None, header=0).squeeze('columns').rename(date))
frame = pd.concat(li, axis=1, ignore_index=False)
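If the stamp is needed as a real datetime rather than a string (for sorting columns, say), the regex match can be parsed with the standard library. This is a sketch only: the helper name stamp_from_name is hypothetical, and the sample file names simply follow the yyyy-mm-dd_hh-mm-ss_someothertext pattern from the question.

```python
import re
from datetime import datetime

STAMP = re.compile(r"\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}")


def stamp_from_name(name):
    """Return the datetime encoded at the start of a file name, or None."""
    match = STAMP.search(name)
    if match is None:
        return None  # the name does not contain a yyyy-mm-dd_hh-mm-ss stamp
    return datetime.strptime(match.group(0), "%Y-%m-%d_%H-%M-%S")


print(stamp_from_name("2019-06-30_14-05-09_someothertext.csv"))
print(stamp_from_name("notes.csv"))
```

Returning None for non-matching names avoids the AttributeError that calling .group(0) on a failed search would raise.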

Grabbing a single Excel worksheet from multiple workbooks into a pandas dataframe and saving this

I need to extract an Excel worksheet from multiple workbooks, save it to a DataFrame, and in turn save that DataFrame.
I have a spreadsheet that is generated at the end of each month (e.g. June 2019.xlsx, May 2019.xlsx, April 2019.xlsx).
I need to grab the worksheet 'Sheet1' from each of these workbooks and convert these to a DataFrame (df1).
I would like to have this dataframe saved.
As a nice to have, I would also like some way just to append the next month's data after the initial 'data grab'.
I'm relatively new to this, so I haven't made much progress.
import os
import glob
import pandas as pd
import xlrd
import json
import io
import flatten_json
files = glob.glob('/Users/ngove/Documents/Python Scripts/2019/*.xlsx')
dfs={}
for f in files:
    dfs[os.path.splitext(os.path.basename(f))[0]] = pd.read_excel(f)

You can put all of your files in one directory (e.g. the current directory), collect the Excel file names in a list (e.g. files_xls), then iterate over them and use pandas.read_excel to get the respective DataFrames (e.g. list_frames).
Below, you can find an example:
import os
import pandas as pd
path = os.getcwd()  # get cur dir
files = os.listdir(path)  # get all the files in your cur dir
# get only the xls, xlsx or xlsm files (this depends on you)
files_xls = [f for f in files if f.endswith(('.xls', '.xlsx', '.xlsm'))]
list_frames = []
for f in files_xls:
    print("Processing file: %s" % f)
    try:
        # the following will give you the dataframe
        # the params depend on your data format
        data = pd.read_excel(f, sheet_name='Sheet1', header=0, index_col=None)
    except Exception as exc:
        # skip files that cannot be read instead of appending stale data
        print("Skipping %s: %s" % (f, exc))
        continue
    list_frames.append(data)
# at the end you can concat your data if you want and remove any duplicates
df = pd.concat(list_frames, sort=False).fillna(0)
df = df.drop_duplicates()
# at the end you can save it
with pd.ExcelWriter("your_title" + ".xlsx", engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name="Sheets1", index=False)
I hope this helps.
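For the "nice to have" in the question (appending next month's data after the initial grab), one possible approach is to concatenate the new frame onto the saved one and drop rows that were already present. The sketch below uses invented in-memory frames; in practice each would come from pd.read_excel, and the column names here are assumptions.

```python
import pandas as pd

# Invented data: 'existing' is the frame saved earlier, 'new_month' the latest grab
existing = pd.DataFrame({"month": ["April 2019", "May 2019"], "total": [10, 12]})
new_month = pd.DataFrame({"month": ["May 2019", "June 2019"], "total": [12, 15]})

# Stack the frames, then drop rows that are exact repeats of earlier rows
combined = pd.concat([existing, new_month], ignore_index=True).drop_duplicates()
combined = combined.reset_index(drop=True)
print(combined)
```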

I interpreted your statement that you want to save the DataFrame as meaning that you want to save it as a combined Excel file. This will combine all files in the specified folder that end in xlsx.
import os
import pandas as pd
os.chdir("H:/Python/Reports/")  # edit this to be your path
path = os.getcwd()
files = os.listdir(path)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']
# read 'Sheet1' of every workbook, then combine the frames in one step
frames = [pd.read_excel(f, sheet_name='Sheet1') for f in files_xlsx]
df = pd.concat(frames)
with pd.ExcelWriter('Combined_Data.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)
You could update the code to grab only the 2019 files by changing that one line to this:
files_xlsx = [f for f in files if f[-9:] == '2019.xlsx']
I referenced this question for most of the code, updated it for xlsx, and added the file save portion.
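Since the workbooks are named after months, the file names themselves can be parsed to select a year and to put the files in calendar order before combining them. This is only a sketch: the helper month_of and the extra name notes.xlsx are hypothetical, and it assumes English month names in the "Month YYYY.xlsx" pattern from the question.

```python
import os
from datetime import datetime

# File names as in the question, plus one hypothetical non-monthly file
names = ["June 2019.xlsx", "April 2019.xlsx", "May 2019.xlsx", "notes.xlsx"]


def month_of(name):
    """Parse 'June 2019.xlsx' into a datetime, or return None if it does not fit."""
    stem = os.path.splitext(name)[0]
    try:
        return datetime.strptime(stem, "%B %Y")
    except ValueError:
        return None


# Keep only names that parse as a month, ordered chronologically
monthly = sorted((n for n in names if month_of(n) is not None), key=month_of)
files_2019 = [n for n in monthly if month_of(n).year == 2019]
print(files_2019)
```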

How to handle another Excel file in Python

Good Morning.
I'm starting with Python and I have a problem.
I need to find all .xls files (all have the same header) and merge all into a single DataFrame, so I need to say that the first line of the file should be ignored.
The current code I'm using is this:
os.chdir("file folder path")
fileLista = glob.glob('*.xls')
df = list()
for arquivo in fileLista:
    df = df.append(pd.read_excel(arquivo))
Company= pd.concat(df)
Company.columns = Company.columns.str.strip()
I am using glob to return all the files with the .xls extension,
df.append is meant to collect all the files that were returned so they can be put into a DataFrame,
pd.concat is meant to combine them into a single DataFrame,
and the strip call removes the spaces in the column headers.
When I run the code, it returns this error:
"error: 'NoneType' object is not iterable"
Can anyone help me with this mistake?

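What about this instead? list.append changes the list in place and returns None, so df = df.append(...) replaces your list with None.
fileLista = glob.glob('*.xls')
Company = pd.DataFrame()
for arquivo in fileLista:
    df = pd.read_excel(arquivo)
    Company = pd.concat([Company, df])
Company.columns = Company.columns.str.strip()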
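The root cause of the error in the question can be reproduced without pandas at all. The short sketch below (with a placeholder string instead of a DataFrame) shows that a list's append method returns None, so assigning its result back to the list variable throws the list away.

```python
frames = []

# append mutates the list in place; its return value is always None
result = frames.append("first item")

print(result)  # None
print(frames)  # ['first item']
```

So an alternative fix for the original loop is to call df.append(...) without assigning the result, and then pass the list to pd.concat.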

This should do what you want.
import pandas as pd
import glob
# read every workbook and combine the frames in one step
frames = [pd.read_excel(f) for f in glob.glob("C:/your_path_here/*.xlsx")]
all_data = pd.concat(frames, ignore_index=True)
print(all_data)
Here is another option to consider.
import pandas as pd
# filenames
excel_names = ["C:/your_path_here/Book1.xlsx", "C:/your_path_here/Book2.xlsx", "C:/your_path_here/Book3.xlsx"]
# read them in
excels = [pd.ExcelFile(name) for name in excel_names]
# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None, index_col=None) for x in excels]
# delete the first row for all frames except the first
# i.e. remove the header row -- assumes it's the first
frames[1:] = [df[1:] for df in frames[1:]]
# concatenate them..
combined = pd.concat(frames)
# write it out
combined.to_excel("c.xlsx", header=False, index=False)
# Results go to the current working directory if no other path is given,
# e.g. C:\Users\Excel\.spyder-py3
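If "the first line of the file should be ignored" means that each file has an extra line above the real header, the skiprows parameter is one alternative to slicing rows off afterwards. The sketch below uses pd.read_csv on in-memory text purely as a stand-in so it runs without any Excel files; pd.read_excel accepts a skiprows argument as well. The sample content is invented.

```python
import io

import pandas as pd

# Invented file content: one title line, then the real header and two data rows
raw = "Report generated 2019\nname,amount\nA,1\nB,2\n"

# skiprows=1 discards the title line, so 'name,amount' is read as the header
df = pd.read_csv(io.StringIO(raw), skiprows=1)
print(df)
```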

Python Pandas add Filename Column CSV

My python code works correctly in the below example. My code combines a directory of CSV files and matches the headers. However, I want to take it a step further - how do I add a column that appends the filename of the CSV that was used?
import pandas as pd
import glob
globbed_files = glob.glob("*.csv")  # creates a list of all csv files
data = []  # pd.concat takes a list of dataframes as an argument
for csv in globbed_files:
    frame = pd.read_csv(csv)
    data.append(frame)
bigframe = pd.concat(data, ignore_index=True)  # don't want pandas to try to align row indexes
bigframe.to_csv("Pandas_output2.csv")

This should work:
import os
for csv in globbed_files:
    frame = pd.read_csv(csv)
    frame['filename'] = os.path.basename(csv)
    data.append(frame)
frame['filename'] creates a new column named filename and os.path.basename() turns a path like /a/d/c.txt into the filename c.txt.
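The same idea can be tried end to end without touching the disk. In this sketch the dict fake_files is a made-up stand-in for real csv files (path mapped to file content), and DataFrame.assign is used as an equivalent way to add the filename column.

```python
import io
import os

import pandas as pd

# Made-up stand-in for files on disk: path -> csv text
fake_files = {
    "/data/a.csv": "x,y\n1,2\n",
    "/data/b.csv": "x,y\n3,4\n5,6\n",
}

frames = []
for path, text in fake_files.items():
    frame = pd.read_csv(io.StringIO(text))
    # assign returns a copy with the extra column holding the bare file name
    frames.append(frame.assign(filename=os.path.basename(path)))

bigframe = pd.concat(frames, ignore_index=True)
print(bigframe)
```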

Mike's answer above works perfectly. In case any googlers run into the following error:
>>> TypeError: cannot concatenate object of type "<type 'str'>";
only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
It's possibly because the separator is not correct. I was using a custom csv file whose separator was ^. Because of that, I needed to pass the separator in the pd.read_csv call.
import os
for csv in globbed_files:
    frame = pd.read_csv(csv, sep='^')
    frame['filename'] = os.path.basename(csv)
    data.append(frame)
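A quick way to check whether the separator is the issue is to look at the shape of the parsed frame. This sketch uses invented in-memory text with ^ as the delimiter: with the default comma separator every line lands in a single column, while sep='^' splits it as intended.

```python
import io

import pandas as pd

# Invented sample using '^' as the field separator
raw = "name^amount\nA^1\nB^2\n"

wrong = pd.read_csv(io.StringIO(raw))           # default sep=',' finds no commas
right = pd.read_csv(io.StringIO(raw), sep="^")  # split on the actual separator

print(wrong.shape)  # (2, 1): everything ended up in one column
print(right.shape)  # (2, 2)
```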

The files variable contains the list of all csv files in your present directory, such as
['FileName1.csv', 'FileName2.csv']. You also need to remove ".csv" from each name, which you can do with the .split() function. Below is the simple logic; this should work for you.
files = glob.glob("*.csv")
for i in files:
    df = pd.read_csv(i)
    df['New Column name'] = i.split(".")[0]
    df.to_csv(i.split(".")[0] + ".csv")
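One caveat with split(".")[0] is that it cuts at the first dot, which truncates names that contain more than one. A small comparison with os.path.splitext, which removes only the final extension, is sketched below; the second file name is a made-up example.

```python
import os

# 'report.v2.csv' is a hypothetical name with a dot inside the stem
names = ["FileName1.csv", "report.v2.csv"]

by_split = [n.split(".")[0] for n in names]
by_splitext = [os.path.splitext(n)[0] for n in names]

print(by_split)     # ['FileName1', 'report']
print(by_splitext)  # ['FileName1', 'report.v2']
```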
