I have done extensive research but have not been able to come up with the right solution for my needs.
I have a Python script which generates multiple .csv files.
Each .csv file only has data in columns A and B.
Each .csv file has a unique name, and I'm trying to work out how to copy each .csv file, based on its name, into the tab/sheet of the same name in an existing Excel workbook.
The .csv files will always be in the same folder.
I would ideally like to use Python for this task.
You can try something like this:
import os
import glob
import csv
from xlsxwriter.workbook import Workbook

# Note: XlsxWriter always creates a new file; it cannot modify an existing workbook.
workbook = Workbook('Existing.xlsx')
for csvfile in glob.glob(os.path.join('.', '*.csv')):
    # Name the worksheet after the CSV file (without the folder or extension)
    sheet_name = os.path.splitext(os.path.basename(csvfile))[0]
    worksheet = workbook.add_worksheet(sheet_name)
    with open(csvfile, 'r', newline='') as f:
        reader = csv.reader(f)
        for r, row in enumerate(reader):
            for c, col in enumerate(row):
                worksheet.write(r, c, col)  # write the CSV content into the sheet
workbook.close()
Reading the data with the csv module and writing the .xlsx file with XlsxWriter (https://pypi.python.org/pypi/XlsxWriter) is recommended.
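Since the question asks for the data to land in a workbook that already exists, a library that can load and re-save a file may be a better fit. Below is a minimal sketch using openpyxl, assuming it is installed and the workbook is an .xlsx file. The function names, and the choice to create a sheet when no sheet of that name exists, are assumptions of this sketch rather than part of the answer above.

```python
import csv
import glob
import os


def sheet_name_for(csv_path):
    """Return the CSV file name without folder or extension, e.g. './Sales.csv' -> 'Sales'."""
    return os.path.splitext(os.path.basename(csv_path))[0]


def copy_csvs_into_workbook(folder, workbook_path):
    """Copy every CSV in `folder` into the sheet of the same name in an existing workbook."""
    # Imported here so that sheet_name_for() stays usable without openpyxl installed.
    import openpyxl

    wb = openpyxl.load_workbook(workbook_path)
    for csv_path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        name = sheet_name_for(csv_path)
        # Assumption: a missing sheet is created rather than treated as an error.
        ws = wb[name] if name in wb.sheetnames else wb.create_sheet(title=name)
        with open(csv_path, newline="", encoding="utf-8") as f:
            for r, row in enumerate(csv.reader(f), start=1):
                for c, value in enumerate(row, start=1):
                    ws.cell(row=r, column=c, value=value)
    wb.save(workbook_path)
```

It would be called as copy_csvs_into_workbook('.', 'Existing.xlsx'). Cells outside the range covered by the CSV are left as they were, and every value is written as a string, exactly as the csv module returns it.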
I am just starting out in data analysis and Python, and my current job is to import large CSV files of tweets and save them as .xlsx with Unicode (UTF-8) encoding. I have been doing it the classic way, one by one, but I have hundreds of them and more will come, so I need to automate it.
The process I need to follow, in order not to lose data, is the following.
I have tried to do it with Python, but so far I have only managed to do it folder by folder (an improvement on file by file). The code seems to lose some data, and I think that is because it only opens the file as CSV and saves it as .xlsx (I don't know exactly, because the code is pieced together from others on the internet, sorry).
import os
import glob
import csv
import openpyxl # from https://pythonhosted.org/openpyxl/ or PyPI (e.g. via pip)

# Works on the current directory; one .xlsx is written per .csv
for csvfile in glob.glob(os.path.join('.', '*.csv')):
    wb = openpyxl.Workbook()
    ws = wb.active
    with open(csvfile, 'rt', encoding='UTF-8', newline='') as f:
        reader = csv.reader(f)
        for r, row in enumerate(reader, start=1):
            for c, val in enumerate(row, start=1):
                ws.cell(row=r, column=c).value = val
    wb.save(os.path.splitext(csvfile)[0] + '.xlsx')
I am trying to improve it by adding new things, but if someone knows how to do this exact process in Python, VBA or another language, I would be very grateful if you could share.
Edit: To answer the comment: after running some file comparisons, it seems that the only difference is the format; there doesn't seem to be any loss of data itself. However, my client is asking me to automate the process while keeping the format of the first file. The first one is the format I want and the second one is the automatically generated file:
Thank you
Instead of using openpyxl directly, I would use pandas, which internally uses openpyxl to do the detailed work. Together with the standard library pathlib, this short script will do the same:
from pathlib import Path
import pandas as pd

p = Path('.')
for csvfile in p.glob('**/*.csv'):  # '**' also searches subfolders
    df = pd.read_csv(csvfile)
    excelfile = csvfile.with_suffix('.xlsx')
    df.to_excel(excelfile, index=False)  # index=False omits pandas' row-number column
    print(csvfile.parent, csvfile.name, excelfile.name)
I have a large dataset which consists of floats, ints, and strings in cells.
My original dataset is in CSV format. When converting to .xlsx, I get the "numbers stored as text" error.
I have seen this, which gives a script for when you are manually writing to cells,
as well as this, which shows how to convert CSV to text.
This is my splicing of the two scripts:
import csv
import os
import glob
import xlsxwriter
from xlsxwriter import Workbook

workbook = xlsxwriter.Workbook('file.xlsx', {'strings_to_numbers': True})
for csvfile in glob.glob(os.path.join('.', '*.csv')):
    workbook = Workbook(csvfile[:-4] + '.xlsx')
    worksheet = workbook.add_worksheet()
    with open(csvfile, 'rt', encoding='utf8') as f:
        reader = csv.reader(f)
        for r, row in enumerate(reader):
            for c, col in enumerate(row):
                worksheet.write(r, c, col)
    workbook.close()
This doesn't resolve the numbers-stored-as-text issue; the issue persists.
I want numbers to be written as numbers when converting the CSV to .xlsx.
The first line of your for loop overwrites the workbook variable that you defined with the "strings_to_numbers" option. Try removing that line, since it looks like you want to add worksheets to a single workbook, correct?
Based on the examples you linked, it looks like this is just a small copy/paste error :).
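If the goal is instead one .xlsx per CSV, the per-file Workbook(...) call inside the loop can be given the same {'strings_to_numbers': True} options dictionary. A different route is to convert each cell before writing it. Below is a minimal sketch of that second approach using only the standard library; to_number is a hypothetical helper written for this illustration, not an XlsxWriter function.

```python
def to_number(text):
    """Return an int or float if `text` parses as one, else the original string."""
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text


# Example row as csv.reader would return it: every cell is a string.
row = ["42", "3.14", "hello", ""]
converted = [to_number(cell) for cell in row]
```

In the loop above this would be used as worksheet.write(r, c, to_number(col)). Two caveats: float() also accepts strings such as 'nan' and 'inf', and identifiers with leading zeros (for example '007') lose them when converted, so columns like that may need to be excluded.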
I have a folder with a large number of Excel workbooks. Is there a way to convert every file in this folder into a CSV file using Python's xlrd, xlutils, and XlsxWriter?
I would like the names of the newly converted CSV files to end in '_convert.csv'.
Otherwise...
Is there a way to merge all the Excel workbooks in the folder to create one large file?
I've been searching for ways to do both, but nothing has worked...
Using pywin32, this will find all the .xlsx files in the indicated directory and open and resave them as .csv. It is relatively easy to figure out the right commands with pywin32...just record an Excel macro and perform the open/save manually, then look at the resulting macro.
import os
import glob
import win32com.client

xl = win32com.client.gencache.EnsureDispatch('Excel.Application')
for f in glob.glob('tmp/*.xlsx'):
    fullname = os.path.abspath(f)
    xl.Workbooks.Open(fullname)
    xl.ActiveWorkbook.SaveAs(Filename=fullname.replace('.xlsx', '.csv'),
                             FileFormat=win32com.client.constants.xlCSVMSDOS,
                             CreateBackup=False)
    xl.ActiveWorkbook.Close(SaveChanges=False)
I will give it a try with my library, pyexcel:
from pyexcel import Book, BookWriter
import glob
import os

for f in glob.glob("your_directory/*.xlsx"):
    fullname = os.path.abspath(f)
    converted_filename = fullname.replace(".xlsx", "_converted.csv")
    book = Book(f)
    converted_csvs = BookWriter(converted_filename)
    converted_csvs.write_book_reader(book)
    converted_csvs.close()
If an .xlsx file has several sheets, I imagine you will get one CSV file per sheet. The naming convention is: "file_converted_%s.csv" % your_sheet_name. The script saves all converted CSV files in the same directory as the .xlsx files.
In addition, if you want to merge everything into one file, that is easy as well.
from pyexcel.cookbook import merge_all_to_a_book
import glob
merge_all_to_a_book(glob.glob("your_directory/*.xlsx"), "output.xlsx")
If you want to do more, please read the tutorial
Look at OpenOffice's Python library; I suspect OpenOffice supports MS document files.
Python has no native support for Excel files.
Sure. Iterate over your files using something like glob and feed them into one of the modules you mention. With xlrd, you'd use open_workbook to open each file by name. That will give you back a Book object. You'll then want to have nested loops that iterate over each Sheet object in the Book, each row in the Sheet, and each Cell in the Row. If your rows aren't too wide, you can append each Cell in a Row into a Python list and then feed that list to the writerow method of a csv.writer object.
Since it's a high-level question, this answer glosses over some specifics like how to call xlrd.open_workbook and how to create a csv.writer. Hopefully googling for examples on those specific points will get you where you need to go.
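The loop described above can be sketched as follows, assuming xlrd is installed (note that xlrd 2.0 and later reads only .xls files, not .xlsx). The function names and the output naming scheme, which adds the sheet name before '_convert.csv', are assumptions of this sketch. It uses Sheet.row_values to get a whole row as a list instead of looping over individual cells.

```python
import csv
import os


def converted_name(excel_path, sheet_name):
    """Build an output name such as 'report_Sheet1_convert.csv' next to the input file."""
    base = os.path.splitext(excel_path)[0]
    return "{}_{}_convert.csv".format(base, sheet_name)


def workbook_to_csvs(excel_path):
    """Write one CSV per sheet of the workbook and return the list of files written."""
    # Imported here so converted_name() can be used without xlrd installed.
    import xlrd

    book = xlrd.open_workbook(excel_path)
    written = []
    for sheet in book.sheets():
        out_path = converted_name(excel_path, sheet.name)
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            for rownum in range(sheet.nrows):
                # row_values returns the row's cell values as a plain list
                writer.writerow(sheet.row_values(rownum))
        written.append(out_path)
    return written
```

To process a whole folder, the function could be called once per path returned by glob.glob('*.xls').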
You can use this function to read the data from each file
import xlrd

def getXLData(Filename, min_row_len=1, get_datemode=False, sheetnum=0):
    Data = []
    book = xlrd.open_workbook(Filename)
    sheet = book.sheets()[sheetnum]
    rowcount = 0
    while rowcount < sheet.nrows:
        row = sheet.row_values(rowcount)
        if len(row) >= min_row_len:
            Data.append(row)
        rowcount += 1
    if get_datemode:
        return Data, book.datemode
    else:
        return Data
and this function to write the data after you combine the lists together
import csv

def writeCSVFile(filename, data, headers=None):
    # Prepend the header row, if one was given
    if headers:
        data = [headers] + list(data)
    # Python 3: open in text mode with newline='' (Python 2 used 'wb')
    with open(filename, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(data)
Keep in mind you may have to re-format the data, especially if there are dates or integers in the Excel files since they're stored as floating point numbers.
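As an illustration of that re-formatting, here is a minimal standard-library sketch. Both helper names are hypothetical, and the date conversion assumes the workbook uses the 1900 date system (datemode 0 in xlrd); xlrd's own xldate_as_tuple(value, datemode) covers both date systems.

```python
from datetime import datetime, timedelta


def tidy_number(value):
    """Turn whole-number floats such as 3.0 into ints; leave everything else alone."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value


def excel_serial_to_datetime(serial):
    """Convert a 1900-system Excel serial (datemode 0) to a datetime.

    Valid for serials from 61 (1 March 1900) onward; earlier ones are off by a
    day because of Excel's fictitious 29 February 1900.
    """
    return datetime(1899, 12, 30) + timedelta(days=serial)


cleaned = [tidy_number(v) for v in [1.0, 2.5, "abc"]]
new_year = excel_serial_to_datetime(41275.0)  # serial for 1 January 2013
```

Which columns hold dates has to be known in advance (or read from the cell types), because a date serial is otherwise indistinguishable from an ordinary number.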
Edited to add code calling the above functions:
import glob

filelist = glob.glob("*.xls*")
alldata = []
headers = []
for filename in filelist:
    data = getXLData(filename)
    headers = data.pop(0) # omit this line if files do not have a header row
    alldata.extend(data)
writeCSVFile("Output.csv", alldata, headers)
I want to write a Python script that reads in an Excel spreadsheet and saves some of its worksheets as CSV files.
How can I do this?
I tried to do my own code, see it below.
import xlrd
import csv

def csv_from_excel():
    wb = xlrd.open_workbook('ArquivoAgencias.xls')
    sh = wb.sheet_by_name('AGENCIA')
    # Python 3: text mode with newline='' (Python 2 used 'wb')
    with open('AgenciaFile.csv', 'w', newline='') as AgenciaFile:
        wr = csv.writer(AgenciaFile, quoting=csv.QUOTE_ALL)
        for rownum in range(sh.nrows):  # xrange in Python 2
            wr.writerow(sh.row_values(rownum))
But I don't know how to convert the XLS file to a CSV with ; as the delimiter. I would appreciate it if anyone has any ideas.
Thanks.
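One possible approach to the delimiter, sketched with the standard library only: csv.writer accepts a delimiter argument, so the rows can be written with ';' between fields. The helper name and the sample rows are made up for this illustration, and it writes to an in-memory buffer rather than to AgenciaFile.csv.

```python
import csv
import io


def rows_to_semicolon_csv(rows):
    """Return the rows as CSV text with ';' as the field delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=";", quoting=csv.QUOTE_ALL)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample rows; in the script above they would come from sh.row_values(rownum).
text = rows_to_semicolon_csv([["id", "name"], [1, "Centro; Sul"]])
```

Because QUOTE_ALL wraps every field in quotes, a semicolon inside a value does not get mistaken for a field separator. In the function above the same effect would come from adding delimiter=';' to the existing csv.writer call.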
I am trying to extract the header row (the first row) from multiple files, each of which has multiple sheets. The output from each sheet should be saved and appended to a new master file that contains all the headers from each sheet and each file.
The easiest way I have found is to use the row_slice method. However, the output is a list of Cell objects and I cannot seem to access their indices.
I am looking for a way to save the extracted data into a new workbook.
Here is the code I have so far:
from xlrd import open_workbook, cellname

# Raw string so the backslashes in the Windows path are taken literally
book = open_workbook(r'E:\Files_combine\MOU worksheets 2012\Walmart-GE_MOU 2012-209_worksheet_v03.xls')
last_index = len(book.sheet_names())
for sheet_index in range(last_index):
    sheet = book.sheet_by_index(sheet_index)
    print(sheet.name)
    print(sheet.row_slice(0, 1))
I cannot get the output and store it as input to a new file. Also, any ideas on how to automate this process for 100+ files would be appreciated.
You can store the output in a CSV file, and you can use os.listdir and a for loop to loop over all the file names:
import csv
import os
from xlrd import open_workbook, cellname

EXCEL_DIR = r'E:\Files_combine\MOU worksheets 2012'
with open("headers.csv", 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for file_name in os.listdir(EXCEL_DIR):
        if file_name.endswith("xls"):
            book = open_workbook(os.path.join(EXCEL_DIR, file_name))
            for index, name in enumerate(book.sheet_names()):
                sheet = book.sheet_by_index(index)
                # writerow takes a sequence; row_slice returns Cell objects,
                # so use row_values to get the plain values of the first row
                writer.writerow(sheet.row_values(0))