I am just starting out in data analysis and Python, and my current job is to import large CSV files of tweets and save them as xlsx, encoded as Unicode UTF-8. I have been doing it the classic way, one by one, but I have hundreds of them and more will come, so I need to automate it.
The process I need to follow, in order not to lose data, is the following.
I have tried to do it with Python, but so far I have only managed to do it folder by folder (an improvement over file by file). The code seems to lose some data, and I think that is because it only opens the file as CSV and saves it as xlsx (I don't know exactly, because the code is a collection of snippets from the internet, sorry).
import os
import glob
import csv
import openpyxl  # from https://pythonhosted.org/openpyxl/ or PyPI (e.g. via pip)

# Convert every CSV file in the current working directory.
for csvfile in glob.glob(os.path.join('.', '*.csv')):
    wb = openpyxl.Workbook()
    ws = wb.active
    # newline='' is what the csv module expects when reading files.
    with open(csvfile, 'rt', encoding='UTF-8', newline='') as f:
        reader = csv.reader(f)
        for r, row in enumerate(reader, start=1):
            for c, val in enumerate(row, start=1):
                ws.cell(row=r, column=c).value = val
    wb.save(csvfile.replace('.csv', '.xlsx'))
I am trying to improve it by adding new things, but if someone knows how to do the exact process in Python, VBA or another language, I would be very grateful if you could share it.
Edit: To answer the comment: after running some file comparisons, it seems that the only difference is the formatting; there does not seem to be any loss of data. However, my client is asking me to automate the process while keeping the format of the first file. The first one is the format I want and the second one is the automatically generated file:
Thank you
Instead of using openpyxl directly, I would use pandas, which internally uses openpyxl to do the detailed work. Together with the standard library pathlib, this short script will do the same:
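from pathlib import Path
import pandas as pd

p = Path('.')
# '**/*.csv' also matches CSV files in subfolders.
for csvfile in p.glob('**/*.csv'):
    df = pd.read_csv(csvfile)
    excelfile = csvfile.with_suffix('.xlsx')
    # index=False keeps the DataFrame's row index out of the sheet.
    df.to_excel(excelfile, index=False)
    print(csvfile.parent, csvfile.name, excelfile.name)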
Related
my code goes as follows:
import csv

with open('Remarks_Drug.csv', newline='', encoding='utf-8') as myFile:
    reader = csv.reader(myFile)
    for row in reader:
        product = row[0].lower()
        filename = row[1]
        product_patterns = ', '.join([i.split("+")[0].strip() for i in product.split(",")])
        print(product_patterns, filename)
which outputs as below: (where film-coated tab should be one column and the filename should be another column)
film-coated tablet RECD outcome AUBAGIO IAIN-21 AoR.txt
solution for injection 093 Acceptance NO Safety profil.txt
I want to write this to a CSV file with product_patterns in one column and filename in another. I wrote the code below, but only the last row gets written. Can anyone please help me with the looping here? The code I wrote is:
with open('drug_output.csv', 'a') as csvfile:
    fieldnames = ['product_patterns', 'filename']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writerow({'product_patterns': product_patterns, 'filename': filename})
Depending on the environment you can use, it might be more practical to rely on a dedicated package for this.
The pandas package in particular seems useful in your case.
Then you can load the csv using:
import pandas as pd
df=pd.read_csv(file_path)
After doing the necessary manipulations, you can save it again with
df.to_csv(file_path)
This will save you a lot of issues that typically occur when parsing line by line, and it should also increase performance a bit. Pandas is a pretty good package to learn anyway if you need to do some data manipulation.
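If you would rather stay with the csv module, the usual reason that only the last row ends up in the file is that the writing code runs once, after the loop has finished. The following is a minimal sketch, assuming the input layout from the question (product text in the first column, file name in the second); the function name extract_patterns is made up for illustration.

```python
import csv

def extract_patterns(in_path, out_path):
    """Copy (product_patterns, filename) pairs from in_path to out_path."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=["product_patterns", "filename"])
        writer.writeheader()
        for row in csv.reader(src):
            product = row[0].lower()
            # Keep the part before "+" of every comma-separated item.
            patterns = ", ".join(p.split("+")[0].strip() for p in product.split(","))
            # writerow is called inside the loop, once per input row.
            writer.writerow({"product_patterns": patterns, "filename": row[1]})
```

Calling extract_patterns('Remarks_Drug.csv', 'drug_output.csv') would then write one output row per input row.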
I have a simple excel file:
A1 = 200
A2 = 300
A3 = =SUM(A1:A2)
This file works in Excel and shows the proper value for SUM, but when I use the openpyxl module for Python I cannot get the value in data_only=True mode.
Python code from shell:
wb = openpyxl.load_workbook('writeFormula.xlsx', data_only = True)
sheet = wb.active
sheet['A3']
<Cell Sheet.A3> # python response
print(sheet['A3'].value)
None # python response
while:
wb2 = openpyxl.load_workbook('writeFormula.xlsx')
sheet2 = wb2.active
sheet2['A3'].value
'=SUM(A1:A2)' # python response
Any suggestions as to what I am doing wrong?
It depends upon the provenance of the file. data_only=True depends upon the value of the formula being cached by an application like Excel. If, however, the file was created by openpyxl or a similar library, then it's probable that the formula was never evaluated and, thus, no cached value is available and openpyxl will report None as the value.
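One way to see whether a file carries cached results at all is to look inside the workbook: an .xlsx file is a ZIP archive, and each formula cell stores its formula in an f element and its cached result, if any, in a v element. The following is a rough sketch using only the standard library. It assumes the sheet of interest is stored as xl/worksheets/sheet1.xml, which is common but not guaranteed, and the function name formula_cells_without_cache is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by worksheet parts in .xlsx files.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

def formula_cells_without_cache(xlsx_path, sheet_file="xl/worksheets/sheet1.xml"):
    """Return references of formula cells that have no cached <v> value."""
    with zipfile.ZipFile(xlsx_path) as z:
        root = ET.fromstring(z.read(sheet_file))
    missing = []
    for cell in root.iter(NS + "c"):
        has_formula = cell.find(NS + "f") is not None
        value = cell.find(NS + "v")
        # A missing or empty <v> means no application has stored a result.
        if has_formula and (value is None or value.text is None):
            missing.append(cell.get("r"))
    return missing
```

Any cell reference returned here is one for which data_only=True has nothing to report.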
I have replicated the issue with Openpyxl and Python.
I am currently using openpyxl version 2.6.3 and Python 3.7.4. Also I am assuming that you are trying to complete an exercise from ATBSWP by Al Sweigart.
I tried and tested Charlie Clark's answer, considering that Excel may indeed cache values. I opened the spreadsheet in Excel, copied and pasted the formula into the same exact cell, and finally saved the workbook. Upon reopening the workbook in Python with Openpyxl with the data_only=True option, and reading the value of this cell, I saw the proper value, 500, instead of the wrong value, the None type.
I hope this helps.
I had the same issue. This may not be the most elegant solution, but this is what worked for me:
import xlwings
from openpyxl import load_workbook
excel_app = xlwings.App(visible=False)
excel_book = excel_app.books.open('writeFormula.xlsx')
excel_book.save()
excel_book.close()
excel_app.quit()
workbook = load_workbook(filename='writeFormula.xlsx', data_only=True)
I have a suggestion for this problem: convert the xlsx file to csv :).
You will still have the original xlsx file. The conversion is done by LibreOffice (that is the subprocess call() line). You can also use pandas for this as a more Pythonic way.
from subprocess import call
from openpyxl import load_workbook
import csv

filename = "test"
wb = load_workbook(filename + ".xlsx")
spread_range = wb['Sheet1']
# whatever formula is in cell A1 is printed here unevaluated
print(spread_range.cell(row=1, column=1).value)
wb.close()
# this step can be done with subprocess or os.system():
# libreoffice --headless --convert-to csv $filename --outdir $outdir
call("libreoffice --headless --convert-to csv " + filename + ".xlsx", shell=True)
with open(filename + ".csv", newline='') as f:
    data = list(csv.reader(f))  # csv.reader is no longer shadowed by a variable
print(data[0][0])
or
# importing pandas as pd
import pandas as pd
# read an excel file and convert
# into a dataframe object
df = pd.DataFrame(pd.read_excel("Test.xlsx"))
# show the dataframe
df
I hope this helps somebody :-)
Yes, @Beno is right. If you want to edit the file without touching it, you can make a little "robot" that edits your Excel file.
WARNING: This is a recursive way to edit the Excel file. These libraries depend on your machine, so make sure you set time.sleep properly before continuing with the rest of the code.
For instance, I use time.sleep, subprocess.Popen, and pywinauto.keyboard.send_keys to add a random character to any cell you choose and then save the file. After that, data_only=True works perfectly.
For more info, see the pywinauto.keyboard documentation.
# import these modules
import subprocess
from pywinauto.keyboard import send_keys
import time
import pygetwindow as gw
import pywinauto

excel_path = r"C:\Program Files\Microsoft Office\root\Office16\EXCEL.EXE"
excel_file_path = r"D:\test.xlsx"

def focus_to_window(window_title=None):  # function to focus to window. https://stackoverflow.com/a/65623513/8903813
    window = gw.getWindowsWithTitle(window_title)[0]
    if not window.isActive:
        pywinauto.application.Application().connect(handle=window._hWnd).top_window().set_focus()

subprocess.Popen([excel_path, excel_file_path])
time.sleep(1.5)  # wait for Excel to open. Depends on your machine, set it properly
focus_to_window("Excel")  # focus to that opened file
send_keys('%{F3}')  # Excel's name box | ALT+F3
send_keys('AA1{ENTER}')  # whatever cell you want to insert something into | Type 'AA1' then press Enter
send_keys('Stackoverflow.com')  # put whatever you want | Type 'Stackoverflow.com'
send_keys('^s')  # save | CTRL+S
send_keys('%{F4}')  # exit | ALT+F4
print("Done")
Sorry for my bad english.
As others already mentioned, openpyxl only reads the cached formula value in data_only mode. I have used PyWin32 to open and save each XLSX file before it is processed by openpyxl, so that the formula results can be read. This works well for me, as I don't process large files. This solution will work only if you have MS Excel installed on your PC.
import os
import win32com.client
from openpyxl import load_workbook

# Open and save the XLSX file in Excel so the result of each stored formula
# is evaluated and cached, which lets openpyxl read it.
excel_file = os.path.join(path, file)  # path and file are defined elsewhere
excel = win32com.client.gencache.EnsureDispatch('Excel.Application')
excel.DisplayAlerts = False  # disable prompts to overwrite the existing file
excel.Workbooks.Open(excel_file)
excel.ActiveWorkbook.SaveAs(excel_file, FileFormat=51, ConflictResolution=2)
excel.DisplayAlerts = True  # re-enable prompts
excel.ActiveWorkbook.Close()
wb = load_workbook(excel_file, data_only=True)  # data_only=True returns the cached values
# read your formula values with openpyxl and do other stuff here
I ran into the same issue. After reading through this thread I managed to fix it by simply opening the excel file, making a change then saving the file again. What a weird issue.
I am saving some data from a .h5 file to an Excel file.
I am using openpyxl for that. I may not be doing it in a good way, but it seems to be taking too much time for a (quite) small .h5 file.
Do you have any recommendations?
I am currently taking a look at XlsxWriter, but is it really good enough?
Here is the simple code I am using:
from openpyxl import Workbook
from tables import *
import os
import time

def saveExcel(pyTableName):
    t1 = time.time()
    wb_write = Workbook()
    wsh_write = wb_write.active
    r = 2
    with openFile(pyTableName, 'r') as f:
        tab = f.getNode('/absoluteData')
        for row in tab.iterrows():
            wsh_write.cell(row=r, column=1).value = row['sheet']
            wsh_write.cell(row=r, column=2).value = str(row['IDnum'])+','+str(row['name'])
            wsh_write.cell(row=r, column=3).value = row['line']
            wsh_write.cell(row=r, column=4).value = row['is_1']
            wsh_write.cell(row=r, column=5).value = row['is_0']
            wsh_write.cell(row=r, column=6).value = row['is_unknown']
            wsh_write.cell(row=r, column=7).value = row['is_ok']
            r += 1
    wb_write.save(os.path.join(os.getcwd(), 'Results.xlsx'))
    print("SAVED in: %s" % (time.time() - t1))
And some performance data after running this code:
For a pyTable with 235200 rows x 17 cols it needed 152.976000071 secs
Both openpyxl and xlsxwriter are suitable for the task; xlsxwriter is probably the fastest for just writing files, but openpyxl also has a write_only mode for this kind of task, which is very fast if you also have lxml installed. If you don't have lxml installed yet, installing it should give a considerable speedup.
There are several limiting factors:
converting from the source objects to Python to XML (in this case probably h5, numpy, Python and XML)
the fact that xlsx doesn't support streaming
In openpyxl we've tried to simplify the API so that you can simply append rows to a worksheet without worrying too much about coordinates.
Your modified code might look something like this:
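from openpyxl import Workbook
wb = Workbook(write_only=True)
ws = wb.create_sheet("Sheet1")
for row in tab.iterrows():
    # write-only sheets take each row as a list (or tuple) of cell values
    ws.append([row['sheet'], '%s,%s' % (row['IDnum'], row['name'])])
wb.save('Results.xlsx')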
If you do wish to follow the CSV route then it's probably best to use h5dump and define a data source in Excel which might also allow you to choose the columns the way you want.
You can simply write to CSV and load that into Excel. Here's the rough code:
import os
import numpy as np
from tables import *

with openFile(pyTableName, 'r') as f:
    tab = f.getNode('/absoluteData')
    outpath = os.path.join(os.getcwd(), 'Results.csv')
    np.savetxt(outpath, tab, delimiter=',')
That is, you should be able to write the entire CSV using NumPy (or Pandas if you want fancier options, perhaps), without any slow Python loops.
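For tables that mix text and numbers, the standard csv module is another option worth considering. The following is a minimal sketch, assuming rows that can be indexed by the field names used in the question; the function name rows_to_csv and the column headings are made up for illustration.

```python
import csv

def rows_to_csv(rows, out_path):
    """Write an iterable of dict-like rows to out_path as CSV."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["sheet", "id_name", "line"])
        # writerows consumes the generator, so no list of rows is built first.
        writer.writerows(
            (row["sheet"], "%s,%s" % (row["IDnum"], row["name"]), row["line"])
            for row in rows
        )
```

With the question's table this could be called as rows_to_csv(tab.iterrows(), 'Results.csv') while the file is still open.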
You can also consider Pandas' to_excel method: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html
I have a folder with a large number of Excel workbooks. Is there a way to convert every file in this folder into a CSV file using Python's xlrd, xlutils, and XlsxWriter?
I would like the newly converted CSV files to have the extension '_convert.csv'.
OTHERWISE...
Is there a way to merge all the Excel workbooks in the folder to create one large file?
I've been searching for ways to do both, but nothing has worked...
Using pywin32, this will find all the .xlsx files in the indicated directory and open and resave them as .csv. It is relatively easy to figure out the right commands with pywin32...just record an Excel macro and perform the open/save manually, then look at the resulting macro.
import os
import glob
import win32com.client

xl = win32com.client.gencache.EnsureDispatch('Excel.Application')
for f in glob.glob('tmp/*.xlsx'):
    fullname = os.path.abspath(f)
    xl.Workbooks.Open(fullname)
    xl.ActiveWorkbook.SaveAs(Filename=fullname.replace('.xlsx', '.csv'),
                             FileFormat=win32com.client.constants.xlCSVMSDOS,
                             CreateBackup=False)
    xl.ActiveWorkbook.Close(SaveChanges=False)
I will give a try with my library pyexcel:
from pyexcel import Book, BookWriter
import glob
import os

for f in glob.glob("your_directory/*.xlsx"):
    fullname = os.path.abspath(f)
    converted_filename = fullname.replace(".xlsx", "_converted.csv")
    book = Book(f)
    converted_csvs = BookWriter(converted_filename)
    converted_csvs.write_book_reader(book)
    converted_csvs.close()
If you have a xlsx that has more than 2 sheets, I imagine you will have more than 2 csv files generated. The naming convention is: "file_converted_%s.csv" % your_sheet_name. The script will save all converted csv files in the same directory where you had xlsx files.
In addition, if you want to merge all in one, it is super easy as well.
from pyexcel.cookbook import merge_all_to_a_book
import glob
merge_all_to_a_book(glob.glob("your_directory/*.xlsx"), "output.xlsx")
If you want to do more, please read the tutorial
Look at OpenOffice's Python library; I suspect OpenOffice supports MS document files.
Python has no native support for Excel files.
Sure. Iterate over your files using something like glob and feed them into one of the modules you mention. With xlrd, you'd use open_workbook to open each file by name. That will give you back a Book object. You'll then want to have nested loops that iterate over each Sheet object in the Book, each row in the Sheet, and each Cell in the Row. If your rows aren't too wide, you can append each Cell in a Row into a Python list and then feed that list to the writerow method of a csv.writer object.
Since it's a high-level question, this answer glosses over some specifics like how to call xlrd.open_workbook and how to create a csv.writer. Hopefully googling for examples on those specific points will get you where you need to go.
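To make the outline above a little more concrete, here is a rough sketch of the row loop and the csv.writer call. It is simplified to the first sheet of each workbook, and it assumes .xls input (recent xlrd releases read only that format). The function names sheet_to_csv and convert_folder are made up for illustration.

```python
import csv
import glob
import os

def sheet_to_csv(sheet, csv_path):
    """Write every row of an xlrd-style sheet (nrows, row_values) to csv_path."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for rx in range(sheet.nrows):
            # row_values returns the cell values of one row as a list.
            writer.writerow(sheet.row_values(rx))

def convert_folder(folder):
    """Convert the first sheet of every .xls file in folder to <name>_convert.csv."""
    import xlrd  # third-party; imported here so sheet_to_csv works without it
    for path in glob.glob(os.path.join(folder, "*.xls")):
        book = xlrd.open_workbook(path)
        base, _ = os.path.splitext(path)
        sheet_to_csv(book.sheet_by_index(0), base + "_convert.csv")
```

Looping over book.sheets() instead of taking sheet_by_index(0) would cover workbooks with several sheets, at the cost of needing one output name per sheet.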
You can use this function to read the data from each file
import xlrd

def getXLData(Filename, min_row_len=1, get_datemode=False, sheetnum=0):
    Data = []
    book = xlrd.open_workbook(Filename)
    sheet = book.sheets()[sheetnum]
    rowcount = 0
    while rowcount < sheet.nrows:
        row = sheet.row_values(rowcount)
        if len(row) >= min_row_len:
            Data.append(row)
        rowcount += 1
    if get_datemode:
        return Data, book.datemode
    return Data
and this function to write the data after you combine the lists together
import csv

def writeCSVFile(filename, data, headers=None):
    if headers:
        data = [headers] + list(data)
    # text mode with newline="" is what the csv module expects on Python 3
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(data)
Keep in mind you may have to re-format the data, especially if there are dates or integers in the Excel files since they're stored as floating point numbers.
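As an illustration of that re-formatting step, the sketch below converts an Excel date serial number to a datetime and turns whole-number floats back into integers, using only the standard library. The function names are made up for illustration, and the datemode argument is assumed to have the same meaning as the book.datemode value returned by getXLData (0 for the 1900 date system, 1 for the 1904 system). xlrd also ships its own xldate_as_datetime helper, which may be preferable in practice.

```python
from datetime import datetime, timedelta

def excel_serial_to_datetime(serial, datemode=0):
    """Convert an Excel date serial number to a datetime.

    For the 1900 system (datemode 0) this is only valid for serials
    from 61 (1 March 1900) onwards.
    """
    epoch = datetime(1904, 1, 1) if datemode else datetime(1899, 12, 30)
    return epoch + timedelta(days=serial)

def tidy_number(value):
    """Turn floats with no fractional part (e.g. 3.0) back into ints."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value
```

For example, excel_serial_to_datetime(43831) gives 1 January 2020, and tidy_number(3.0) gives the integer 3 while leaving 2.5 and strings unchanged.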
Edited to add code calling the above functions:
import glob

filelist = glob.glob("*.xls*")
alldata = []
headers = []
for filename in filelist:
    data = getXLData(filename)
    headers = data.pop(0)  # omit this line if files do not have a header row
    alldata.extend(data)
writeCSVFile("Output.csv", alldata, headers)
What is the best way to read Excel (XLS) files with Python (not CSV files)?
Is there a built-in package which is supported by default in Python to do this task?
I highly recommend xlrd for reading .xls files. But there are some limitations(refer to xlrd github page):
Warning
This library will no longer read anything other than .xls files. For
alternatives that read newer file formats, please see
http://www.python-excel.org/.
The following are also not supported but will safely and reliably be
ignored:
- Charts, Macros, Pictures, any other embedded object, including embedded worksheets.
- VBA modules
- Formulas, but results of formula calculations are extracted.
- Comments
- Hyperlinks
- Autofilters, advanced filters, pivot tables, conditional formatting, data validation
Password-protected files are not supported and cannot be read by this
library.
voyager mentioned the use of COM automation. Having done this myself a few years ago, be warned that doing this is a real PITA. The number of caveats is huge and the documentation is lacking and annoying. I ran into many weird bugs and gotchas, some of which took many hours to figure out.
UPDATE:
For newer .xlsx files, the recommended library for reading and writing appears to be openpyxl (thanks, Ikar Pohorský).
You can use pandas to do this. First install the required libraries (openpyxl reads .xlsx files; xlrd is needed for old .xls files):
$ pip install pandas openpyxl xlrd
See code below:
import pandas as pd
xls = pd.ExcelFile(r"yourfilename.xls")  # use r before an absolute file path
sheetX = xls.parse(2)  # 2 is the zero-based sheet index (the third sheet); use 0 if the file has only one sheet
var1 = sheetX['ColumnName']
print(var1[1])  # 1 is the row number...
You can choose any one of them http://www.python-excel.org/
I would recommend the Python xlrd library (note that xlrd 2.0 and later reads only .xls files).
install it using
pip install xlrd
import using
import xlrd
to open a workbook
workbook = xlrd.open_workbook('your_file_name.xls')
open sheet by name
worksheet = workbook.sheet_by_name('Name of the Sheet')
open sheet by index
worksheet = workbook.sheet_by_index(0)
read cell value
worksheet.cell(0, 0).value
I think Pandas is the best way to go. There is already one answer here with Pandas using ExcelFile function, but it did not work properly for me. From here I found the read_excel function which works just fine:
import pandas as pd
dfs = pd.read_excel("your_file_name.xlsx", sheet_name="your_sheet_name")
print(dfs.head(10))
P.S. You need to have the xlrd installed for read_excel function to work
Update 21-03-2020: As you may see here, there are issues with the xlrd engine and it is going to be deprecated. The openpyxl is the best replacement. So as described here, the canonical syntax should be:
dfs = pd.read_excel("your_file_name.xlsx", sheet_name="your_sheet_name", engine="openpyxl")
For xlsx I like the solution posted earlier at https://web.archive.org/web/20180216070531/https://stackoverflow.com/questions/4371163/reading-xlsx-files-using-python. It uses modules from the standard library only.
def xlsx(fname):
    import zipfile
    from xml.etree.ElementTree import iterparse
    z = zipfile.ZipFile(fname)
    strings = [el.text for e, el in iterparse(z.open('xl/sharedStrings.xml')) if el.tag.endswith('}t')]
    rows = []
    row = {}
    value = ''
    for e, el in iterparse(z.open('xl/worksheets/sheet1.xml')):
        if el.tag.endswith('}v'):  # Example: <v>84</v>
            value = el.text
        if el.tag.endswith('}c'):  # Example: <c r="A3" t="s"><v>84</v></c>
            if el.attrib.get('t') == 's':
                value = strings[int(value)]
            letter = el.attrib['r']  # Example: AZ22
            while letter[-1].isdigit():
                letter = letter[:-1]
            row[letter] = value
            value = ''
        if el.tag.endswith('}row'):
            rows.append(row)
            row = {}
    return rows
The improvements added are fetching content by sheet name, using re to get the column, and checking whether shared strings are used.
def xlsx(fname, sheet):
    import zipfile
    from xml.etree.ElementTree import iterparse
    import re
    z = zipfile.ZipFile(fname)
    strings = []  # stays empty when the workbook has no shared strings
    if 'xl/sharedStrings.xml' in z.namelist():
        # Get shared strings
        strings = [element.text for event, element
                   in iterparse(z.open('xl/sharedStrings.xml'))
                   if element.tag.endswith('}t')]
    # Map each sheet name to its sheetId
    sheetdict = {element.attrib['name']: element.attrib['sheetId']
                 for event, element in iterparse(z.open('xl/workbook.xml'))
                 if element.tag.endswith('}sheet')}
    rows = []
    row = {}
    value = ''
    if sheet in sheetdict:
        # assumes the sheet is stored as sheet<sheetId>.xml, which is common but not guaranteed
        sheetfile = 'xl/worksheets/sheet' + sheetdict[sheet] + '.xml'
        for event, element in iterparse(z.open(sheetfile)):
            # get value or index to shared strings
            if element.tag.endswith('}v') or element.tag.endswith('}t'):
                value = element.text
            # If value is a shared string, use value as an index
            if element.tag.endswith('}c'):
                if element.attrib.get('t') == 's':
                    value = strings[int(value)]
                # strip the row number so that only the column letter(s) remain
                letter = re.sub(r'\d', '', element.attrib['r'])
                row[letter] = value
                value = ''
            if element.tag.endswith('}row'):
                rows.append(row)
                row = {}
    return rows
If you need the old XLS format, the code below handles the ANSI code page 'cp1251'.
import xlrd

file = u'C:/Landau/task/6200.xlsx'
try:
    book = xlrd.open_workbook(file, encoding_override="cp1251")
except Exception:
    # fall back to the encoding recorded in the file itself
    book = xlrd.open_workbook(file)
print("The number of worksheets is {0}".format(book.nsheets))
print("Worksheet name(s): {0}".format(book.sheet_names()))
sh = book.sheet_by_index(0)
print("{0} {1} {2}".format(sh.name, sh.nrows, sh.ncols))
print("Cell D30 is {0}".format(sh.cell_value(rowx=29, colx=3)))
for rx in range(sh.nrows):
    print(sh.row(rx))
For older .xls files, you can use xlrd.
You can either use xlrd directly by importing it, like below:
import xlrd
wb = xlrd.open_workbook(file_name)
Or you can use the pandas pd.read_excel() method. Although xlrd is the default engine, do not forget to specify it; the engine name is passed as a string:
import pandas as pd
pd.read_excel(file_name, engine="xlrd")
Both of them work for older .xls file formats.
In fact, I came across this when I used openpyxl and got the error below:
InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.
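Because the extension does not always match the contents, it can help to check the first bytes of a file before choosing a library. Legacy .xls files use the OLE2 compound-file container and .xlsx files are ZIP archives, and each starts with a fixed signature. The following is a small sketch using only the standard library; the function name guess_excel_format is made up for illustration, and other OLE2 or ZIP files (for example .doc or .docx) would be reported the same way.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary .xls container
ZIP_MAGIC = b"PK\x03\x04"  # .xlsx is a ZIP archive

def guess_excel_format(path):
    """Guess the real format from the first bytes, ignoring the extension."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    return "unknown"
```

A result of "xls" points towards xlrd, "xlsx" towards openpyxl, and "unknown" suggests the file may be plain text such as CSV.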
You can use any of the libraries listed here (like Pyxlreader that is based on JExcelApi, or xlwt), plus COM automation to use Excel itself for the reading of the files, but for that you are introducing Office as a dependency of your software, which might not be always an option.
You might also consider running the (non-python) program xls2csv. Feed it an xls file, and you should get back a csv.
Python Excelerator handles this task as well. http://ghantoos.org/2007/10/25/python-pyexcelerator-small-howto/
It's also available in Debian and Ubuntu:
sudo apt-get install python-excelerator
with open(csv_filename) as file:
    data = file.read()
with open(xl_file_name, 'w') as file:
    file.write(data)
You can copy CSV data into a file for Excel as above, using only built-in functions. Note that this copies the text unchanged; it does not create a real .xlsx workbook. CSV files can also be handled with the built-in csv module's DictReader and DictWriter, which treat each row as a Python dictionary and make this much easier.
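As a minimal sketch of the DictReader and DictWriter approach, the function below copies a CSV file while upper-casing one column. It assumes the input has a header row; the function name uppercase_column is made up for illustration.

```python
import csv

def uppercase_column(in_path, out_path, column):
    """Copy a CSV file, upper-casing the values of one named column."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)  # the header row supplies the keys
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Each row is a dictionary keyed by column name.
            row[column] = row[column].upper()
            writer.writerow(row)
```

Rows are read and written one at a time, so the whole file never has to fit in memory.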
I am not aware of any built-in package for Excel, but I came across openpyxl. It is pretty straightforward and simple. You can see the code snippet below; hope this helps.
import openpyxl
book = openpyxl.load_workbook(filename)
sheet = book.active
result =sheet['AP2']
print(result.value)
For older Excel files there is the OleFileIO_PL module that can read the OLE structured storage format used.
If the file with the .xls extension is really just delimited text, this works for me on Python 3 using only the built-in open() and pandas:
import pandas
df = pandas.read_csv(open(f, encoding='UTF-8'), sep='\t')
Note that the file I'm using is tab-delimited. less or a text editor should be able to read such a file so that you can sniff out the delimiter.
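The delimiter can also be guessed programmatically with csv.Sniffer from the standard library. The following is a minimal sketch, assuming the file really is text in the given encoding; the function name sniff_delimiter and the list of candidate delimiters are made up for illustration.

```python
import csv

def sniff_delimiter(path, encoding="utf-8", candidates="\t,;|"):
    """Guess the delimiter of a text file from its first few kilobytes."""
    with open(path, newline="", encoding=encoding) as f:
        sample = f.read(4096)
    # sniff raises csv.Error when no consistent delimiter is found.
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
```

The result can be passed straight to read_csv as the sep argument.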
I did not have a lot of luck with xlrd because of – I think – UTF-8 issues.