I have quite a lot of xlsx files, and it is a pain to convert them one by one to tab-delimited files. I would like to know whether there is any way to do this in Python. Here is what I found and what I tried, without success.
I found and tried the solution from Mass Convert .xls and .xlsx to .txt (Tab Delimited) on a Mac, but it did not work.
I also tried to do it for one file to see how it works, but with no success:
#!/usr/bin/python
import xlrd
import csv

def main():
    # open the xlsx file
    myfile = xlrd.open_workbook('myfile.xlsx')
    # I don't know the name of the sheet
    mysheet = myfile.sheet_by_index(0)
    # open the output csv
    myCsvfile = open('my.csv', 'wb')
    # write the rows into it
    wr = csv.writer(myCsvfile, delimiter="\t")
    for rownum in xrange(mysheet.nrows):
        wr.writerow(mysheet.row_values(rownum))
    myCsvfile.close()

if __name__ == '__main__':
    main()
There is no real need for the main function here.
I'm not sure about your indentation problems, but this is how I would write what you have (and it should work, per the first comment above):
#!/usr/bin/python
import xlrd
import csv

# open the output csv (in Python 3, use text mode with newline='')
with open('my.csv', 'w', newline='') as myCsvfile:
    # define a writer
    wr = csv.writer(myCsvfile, delimiter="\t")
    # open the xlsx file
    myfile = xlrd.open_workbook('myfile.xlsx')
    # get the first sheet
    mysheet = myfile.sheet_by_index(0)
    # write the rows
    for rownum in range(mysheet.nrows):
        wr.writerow(mysheet.row_values(rownum))
Why go through so much pain when you can do it in three lines:
import pandas as pd

df = pd.read_excel('myfile.xlsx')
df.to_csv('myfile.txt',  # write to a new file, not back over the .xlsx
          sep="\t",
          index=False)
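Since the original question is about many files, the same pandas call can be wrapped in a loop over a glob pattern. This is a sketch, assuming the files sit in one directory and should become .txt files alongside them:

```python
import glob
import os

import pandas as pd

def xlsx_dir_to_tsv(directory='.'):
    """Convert every .xlsx file in `directory` to a tab-delimited .txt file."""
    for xlsx_path in glob.glob(os.path.join(directory, '*.xlsx')):
        # same base name, .txt extension
        txt_path = os.path.splitext(xlsx_path)[0] + '.txt'
        pd.read_excel(xlsx_path).to_csv(txt_path, sep='\t', index=False)

xlsx_dir_to_tsv()
```

Only the first sheet of each workbook is read here; pass sheet_name to read_excel if you need others.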
Right now my final output is in Excel format. I want to compress my Excel file using gzip. Is there a way to do it?
import pandas as pd
import gzip
import re
def renaming_ad_unit():
    with gzip.open('weekly_direct_house.xlsx.gz') as f:
        df = pd.read_excel(f)
    result = df['Ad unit'].to_list()
    for index, a_string in enumerate(result):
        modified_string = re.sub(r"\([^()]*\)", "", a_string)
        df.at[index, 'Ad unit'] = modified_string
    return df.to_excel('weekly_direct_house.xlsx', index=False)
Yes, this is possible.
To create a gzip file, you can open the file like this:
with gzip.open('filename.xlsx.gz', 'wb') as f:
    ...
Unfortunately, when I tried this, I got the error OSError: Negative seek in write mode. This is because the Pandas Excel writer moves backwards in the file as it writes, making multiple passes, which the gzip module does not allow.
To work around this, I create a temporary file and write the Excel data there. Then I read the file back and write it to the compressed archive.
I wrote a short program to demonstrate this. It reads an excel file from a gzip archive, prints it out, and writes it back to another gzip file.
import pandas as pd
import gzip
import tempfile
def main():
    with gzip.open('apportionment-2020-table02.xlsx.gz') as f:
        df = pd.read_excel(f)
    print(df)
    with tempfile.TemporaryFile() as excel_f:
        df.to_excel(excel_f, index=False)
        with gzip.open('output.xlsx.gz', 'wb') as gzip_f:
            excel_f.seek(0)
            gzip_f.write(excel_f.read())

if __name__ == '__main__':
    main()
Here's the file I'm using to demonstrate this: Link
You could also use io.BytesIO to create a file in memory, write the Excel data into it, and then write that buffer to disk as gzip.
I used the link to the Excel file from Nick ODell's answer.
import pandas as pd
import gzip
import io
df = pd.read_excel('https://www2.census.gov/programs-surveys/decennial/2020/data/apportionment/apportionment-2020-table02.xlsx')
buf = io.BytesIO()
df.to_excel(buf)
buf.seek(0) # move to the beginning of file
with gzip.open('output.xlsx.gz', 'wb') as f:
    f.write(buf.read())
Similar to Nick ODell's answer, but with the buffer managed by a context manager.
import pandas as pd
import gzip
import io
df = pd.read_excel('https://www2.census.gov/programs-surveys/decennial/2020/data/apportionment/apportionment-2020-table02.xlsx')
with io.BytesIO() as buf:
    df.to_excel(buf)
    buf.seek(0)  # move to the beginning of the file
    with gzip.open('output.xlsx.gz', 'wb') as f:
        f.write(buf.read())
Tested on Linux
I'm trying to save Excel files with multiple sheets as corresponding CSV files.
I tried the following method:
import xlrd
from openpyxl import Workbook, load_workbook
import pathlib
import shutil
import pandas as pd
def strip_xlsx(inputdir, file_name, targetdir):
    wb = load_workbook(inputdir)
    sheets = wb.sheetnames
    for s in sheets:
        temp_df = pd.read_excel(inputdir, sheet_name=s)
        temp_df.to_csv(targetdir + "/" + file_name.strip(".xlsx") + "_" + s + ".csv", encoding='utf-8-sig')
Here inputdir is an absolute path to the Excel file (say "/Users/me/test/t.xlsx"), file_name is just the name of the file ("t.xlsx"), and targetdir is the directory where I wish to save the CSV files.
The method works well, though it is super slow. I'm a newbie to Python and feel I implemented it very inefficiently.
I would appreciate tips from the masters.
You may have better luck if you keep everything in pandas. I see you are using openpyxl to get the sheet names; you can do this in pandas too. As for speed, you'll just have to see:
EDIT:
As Charlie (the person who probably knows the most about openpyxl on the planet) pointed out, using just openpyxl will be faster. In this case about 25% faster (9.29 ms -> 6.87 ms for my two-sheet test):
from os import path, mkdir
from openpyxl import load_workbook
import csv
def xlsx_to_multi_csv(xlsx_path: str, out_dir: str = '.') -> None:
    """Write each sheet of an Excel file to a csv."""
    # make the out directory if it does not exist (this is not EAFP)
    if not path.exists(out_dir):
        mkdir(out_dir)
    # set the prefix
    prefix = path.splitext(xlsx_path)[0]
    # load the workbook
    wb = load_workbook(xlsx_path, read_only=True)
    for sheet_name in wb.sheetnames:
        # generate the out path
        out_path = path.join(out_dir, f'{prefix}_{sheet_name}.csv')
        # open that file
        with open(out_path, 'w', newline='') as file:
            # create the writer
            writer = csv.writer(file)
            # get the sheet
            sheet = wb[sheet_name]
            for row in sheet.rows:
                # write each row to the csv
                writer.writerow([cell.value for cell in row])

xlsx_to_multi_csv('data.xlsx')
You just need to specify a path to save the CSVs to, then iterate through the dictionary that pandas returns when sheet_name=None, saving each frame to the directory:
import os
import pandas as pd

csv_path = r'\path\to\dir'
for name, df in pd.read_excel('xl_path', sheet_name=None).items():
    df.to_csv(os.path.join(csv_path, f'{name}.csv'))
I have 200 files with dates in the file name. I would like to add the date from each file name into a new column in that file.
I created a script in Python:
import pandas as pd
import os
import openpyxl
import csv
os.chdir(r'\\\\\\\')
for file_name in os.listdir(r'\\\\\\'):
    with open(file_name, 'r') as csvinput:
        reader = csv.reader(csvinput)
        all = []
        row = next(reader)
        row.append('FileName')
        all.append(row)
        for row in reader:
            row.append(file_name)
            all.append(row)

with open(file_name, 'w') as csvoutput:
    writer = csv.writer(csvoutput, lineterminator='\n')
    writer.writerows(all)

if file_name.endswith('.csv'):
    workbook = openpyxl.load_workbook(file_name)
    workbook.save(file_name)

csv_filename = pd.read_csv(r'\\\\\\')
csv_data = pd.read_csv(csv_filename, header=0)
csv_data['filename'] = csv_filename
Right now I see "InvalidFileException: File is not a zip file", and only the first file gets the column with the file name added.
Can you please advise what I am doing wrong? BTW, I'm using Python 3.4.
Many thanks,
Lukasz
First problem, this section:
with open(file_name, 'w') as csvoutput:
    writer = csv.writer(csvoutput, lineterminator='\n')
    writer.writerows(all)
should be indented so that it is included in the for loop. As written, it is only executed once, after the loop; this is why you only get one output file.
Second problem: the exception is probably caused by openpyxl.load_workbook(file_name). openpyxl can only open actual Excel files (which are .zip files with a different extension), not CSV files. Why do you want to open and re-save the file at all? I think you can just remove those three lines.
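Putting both fixes together, here is a sketch of the corrected loop: the openpyxl lines are dropped, and the write-back happens inside the loop so every file is updated. The function name and the directory argument are placeholders:

```python
import csv
import os

def append_filename_column(directory):
    """Add a 'FileName' column to every CSV in `directory`, in place."""
    for file_name in os.listdir(directory):
        if not file_name.endswith('.csv'):
            continue
        path = os.path.join(directory, file_name)
        with open(path, 'r', newline='') as csvinput:
            rows = list(csv.reader(csvinput))
        rows[0].append('FileName')          # header row
        for row in rows[1:]:
            row.append(file_name)           # data rows get the file name
        # write back inside the loop, so every file is rewritten
        with open(path, 'w', newline='') as csvoutput:
            csv.writer(csvoutput).writerows(rows)
```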
I have a script with
path = r"mypath\myfile.xlsx"
with open(path) as f:
    reader = csv.reader(f)
but it won't work, because the code is trying to open an xlsx file with a module made for CSV files.
So, does an equivalent expression for xlsx files exist?
The equivalent of the highlighted code for xlsx sheets is:
import pandas as pd

path = r"mypath\myfile.xlsx"
with open(path, 'rb') as f:  # xlsx is a binary format, so open in 'rb' mode
    df = pd.read_excel(f)
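If what you want is specifically something row-iterable like csv.reader, openpyxl can provide that too. A minimal sketch (the function name is an assumption, not a library API):

```python
from openpyxl import load_workbook

def xlsx_rows(path):
    """Yield rows of the first sheet as lists, the way csv.reader yields rows."""
    wb = load_workbook(path, read_only=True)
    ws = wb.active
    for row in ws.iter_rows(values_only=True):
        yield list(row)
    wb.close()
```

Usage mirrors the csv version: for row in xlsx_rows(path): ...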
I saw this post to append a sheet using xlutils.copy:
https://stackoverflow.com/a/38086916/2910740
Is there any solution which uses only openpyxl?
I found a solution. It was very easy:
def store_excel(self, file_name, sheet_name):
    if os.path.isfile(file_name):
        self.workbook = load_workbook(filename=file_name)
        self.worksheet = self.workbook.create_sheet(sheet_name)
    else:
        self.workbook = Workbook()
        self.worksheet = self.workbook.active
        self.worksheet.title = time.strftime(sheet_name)
    ...
    self.worksheet.cell(row=row_num, column=col_num).value = data
I would recommend storing data in a CSV file, which is a ubiquitous file format made specifically to store tabular data. Excel supports it fully, as do most open source Excel-esque programs.
In that case, it's as simple as opening up a file to append to it, rather than write or read:
with open("output.csv", "a", newline="") as csvfile:
    wr = csv.writer(csvfile, dialect='excel')
    wr.writerow(YOUR_LIST)
As for openpyxl:
end_of_sheet = your_sheet.max_row
will return the number of rows in your sheet, so you can start writing at the position after it.
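A minimal sketch of appending with openpyxl, combining the load-or-create pattern from the earlier answer with max_row (the function name and file name are placeholders); note that ws.append() already writes to the row after the last used one, so you rarely need max_row explicitly:

```python
import os

from openpyxl import Workbook, load_workbook

def append_row(file_name, row):
    """Append one row to the active sheet, creating the workbook if needed."""
    if os.path.isfile(file_name):
        wb = load_workbook(file_name)
    else:
        wb = Workbook()
    ws = wb.active
    ws.append(row)  # appends after the last used row
    wb.save(file_name)
```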