So, I have successfully built a function that gathers some data, converts it to a dataframe, and then writes it to an Excel document. The problem I am having is that I don't know how to create a dynamic file name or storage location. I have the path hard coded in the pandas call like so:
developmentdata = pd.DataFrame(dev_names).to_excel(r'C:\Desktop\Dev\devproj\test.xlsx', header=False, index=False)
and the excel is titled 'test.xlsx'. What can I do to fix this?
Python offers many different possibilities for programmatically constructing text strings. A modern approach that I personally would recommend is to use f-strings. There are also many different possibilities to construct and format dates, but I think that python's datetime module should suit your needs. As an example:
from datetime import datetime
# Variables with identifying information
output_dir = r"C:\Desktop\Dev\devproj"
language = "php"
location = "boston"
# Retrieve current date as a formatted string
current_date = datetime.now().strftime("%Y%m%d")
# Construct output path using f-strings
file_name = f"{location}_{language}_{current_date}.xlsx"
output_path = rf"{output_dir}\{file_name}"
print(output_path)
# Output:
# C:\Desktop\Dev\devproj\boston_php_20190821.xlsx
Note 1: to use f-strings you must have python 3.6 or higher.
Note 2: for dates in file names I'd always recommend to use "year-month-day" formatting (as done in the code example). This has the benefit that when you sort multiple of these files by name, you automatically get a proper chronological sorting as well.
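The same idea can be wrapped in a small helper so the file name is rebuilt on every run. This is only a sketch: the function name `build_output_path` and the `"reports"` folder are made up for illustration, and it uses `pathlib` instead of joining strings by hand.

```python
from datetime import date
from pathlib import Path


def build_output_path(output_dir, location, language, when=None):
    """Return <output_dir>/<location>_<language>_<YYYYMMDD>.xlsx as a Path."""
    when = when or date.today()          # default to today's date
    file_name = f"{location}_{language}_{when:%Y%m%d}.xlsx"
    return Path(output_dir) / file_name  # pathlib picks the right separator


# A fixed date is passed here only to make the example reproducible
output_path = build_output_path("reports", "boston", "php", when=date(2019, 8, 21))
print(output_path.name)  # boston_php_20190821.xlsx
```

The resulting path can be passed straight to the call from the question, e.g. `pd.DataFrame(dev_names).to_excel(output_path, header=False, index=False)`.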
I am trying to use the pyarrow.dataset.write_dataset function to write data into HDFS. However, if I write into a directory that already exists and has some data, the data is overwritten as opposed to a new file being created. Is there a way to conveniently "append" to an already existing dataset without having to read in all the data first? I do not need the data to be in one file, I just don't want to delete the old one.
What I currently do, which doesn't work:
import pyarrow.dataset as ds
parquet_format = ds.ParquetFileFormat()
write_options = parquet_format.make_write_options(
    use_deprecated_int96_timestamps = True,
    coerce_timestamps = None,
    allow_truncated_timestamps = True)
ds.write_dataset(data = data, base_dir = 'my_path', filesystem = hdfs_filesystem, format = parquet_format, file_options = write_options)
Currently, the write_dataset function uses a fixed file name template (part-{i}.parquet, where i is a counter if you are writing multiple batches; in case of writing a single Table i will always be 0).
This means that when writing multiple times to the same directory, it might indeed overwrite pre-existing files if those are named part-0.parquet.
How you can solve this is by ensuring that write_dataset uses unique file names for each write through the basename_template argument, eg:
ds.write_dataset(data=data, base_dir='my_path',
basename_template='my-unique-name-{i}.parquet', ...)
If you want to have automatically a unique name each time you write, you could eg generate a random string to include in the file name. One option for this is using the python uuid stdlib module: basename_template = "part-{i}-" + uuid.uuid4().hex + ".parquet".
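Another option could be to include the current time of writing in the filename to make it unique, eg with basename_template = "part-{:%Y%m%d}-{{i}}.parquet".format(datetime.datetime.now()). Note that this format only has day resolution, so add the time of day (eg %Y%m%d%H%M%S) if you write more than once per day.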
See https://issues.apache.org/jira/browse/ARROW-10695 for some more discussion about this (customizing the template), and I opened a new issue specifically about the issue of silently overwriting data: https://issues.apache.org/jira/browse/ARROW-12358
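Both suggestions can be combined into one helper that returns a fresh template for every write. This is a minimal sketch, with `unique_basename_template` being a made-up name; it only builds the string and keeps the `{i}` placeholder in the template for `write_dataset` to fill in.

```python
import uuid
from datetime import datetime


def unique_basename_template(prefix="part"):
    """Build a basename template that differs on every call but keeps '{i}'."""
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")  # when the write happened
    token = uuid.uuid4().hex                          # random part, avoids clashes
    # '{{i}}' in an f-string leaves a literal '{i}' for write_dataset to fill in
    return f"{prefix}-{stamp}-{token}-{{i}}.parquet"


template = unique_basename_template()
print(template)
```

It would then be passed as `ds.write_dataset(..., basename_template=unique_basename_template())`, alongside the other arguments from the question.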
For those that are here to work out how to use make_write_options() with write_dataset, try this:
import pyarrow.dataset as ds
parquet_format = ds.ParquetFileFormat()
write_options = parquet_format.make_write_options(use_deprecated_int96_timestamps = False, coerce_timestamps = 'us', allow_truncated_timestamps = True)
I have lots of Excel files (xlsx format) and want to read and process them.
For example, the file names are ex201901, ex201902, ..., ex201912.
The names follow the format exYYYYMM.
Importing these files into pandas one at a time, in the usual way, is easy.
import pandas as pd
df201901 = pd.read_excel(r'C:\users\ex201901.xlsx')
df201902 = pd.read_excel(r'C:\users\ex201902.xlsx')
df201903 = pd.read_excel(r'C:\users\ex201903.xlsx')
df201904 = pd.read_excel(r'C:\users\ex201904.xlsx')
....
df201912 = pd.read_excel(r'C:\users\ex201912.xlsx')
However, this seems boring and tedious.
In SAS, I would use the Macro() syntax, but in Python I have no idea how to handle this.
Can you help me handle such multiple, repeated jobs in an easy way, like a SAS MACRO()?
Thanks for reading.
Given that you'll probably want to work with all the data frames together afterwards, putting each one into its own local variable is a code smell. In general, whenever a task feels repetitive because you're doing the same thing over and over, that calls for a loop of some sort. Since you're planning to use pandas, chances are you'll be iterating again soon (now that you have your files, you'll probably be transforming their rows), so it's worth looking into how loops and control flow work in Python (and in pandas) in general; good tutorials are plentiful.
In your particular case, depending on what kind of processing you are planning on doing afterwards, you'd probably benefit from having something like
df2019 = [pd.read_excel(rf'C:\users\ex2019{str(i).zfill(2)}.xlsx') for i in range(1, 13)]
With that, you can access the individual data frames through e.g. df2019[5] to get the data frame corresponding to June, or you can collapse all of them into a single data frame using df = pd.concat(df2019) if that's what suits your need.
If you have less structure in your file names, glob can come in handy. With that, the above could become something like
import glob
df2019 = list(map(pd.read_excel, glob.glob(r'C:\users\ex2019*.xlsx')))
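One thing to keep in mind is that glob does not guarantee any particular order, so positional access such as df2019[5] may not line up with the months. A possible way around that is to key the frames by the YYYYMM part of the file name. The sketch below assumes the exYYYYMM naming from the question; `load_by_period` is a made-up helper, and the `reader` argument stands in for `pd.read_excel` so the example runs without any files.

```python
from pathlib import PureWindowsPath


def load_by_period(paths, reader):
    """Return {'201901': reader(path), ...} keyed by the YYYYMM in the file name."""
    frames = {}
    for path in sorted(paths):              # sort so months come out in order
        stem = PureWindowsPath(path).stem   # e.g. 'ex201901'
        period = stem[2:]                   # drop the 'ex' prefix -> '201901'
        frames[period] = reader(path)
    return frames


paths = [r"C:\users\ex201902.xlsx", r"C:\users\ex201901.xlsx"]
# A dummy reader is used here; in practice this would be pd.read_excel
frames = load_by_period(paths, reader=lambda p: f"frame for {p}")
print(list(frames))  # ['201901', '201902']
```

With real data this would be called as `load_by_period(glob.glob(r'C:\users\ex2019*.xlsx'), pd.read_excel)`, and `frames['201906']` would be the June data frame.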
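You can use the os module from the standard library. Its listdir function returns the names of all files in a folder. Check the code below:
import os, re
import pandas as pd
# FILE_PATH is the folder that contains the Excel files
listDir = os.listdir(FILE_PATH)
dfList = []
for aFile in listDir:
    # match ex201901.xlsx through ex201912.xlsx (two-digit month, literal dot)
    if re.search(r'^ex2019(0[1-9]|1[0-2])\.xlsx$', aFile):
        tmpDf = pd.read_excel(os.path.join(FILE_PATH, aFile))
        dfList.append(tmpDf)
outDf = pd.concat(dfList)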
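When filtering a directory listing with a regular expression, two details are easy to get wrong: the dot before the extension has to be escaped, and the month part has to allow 10 to 12 as well as 01 to 09. The sketch below isolates just the filtering step so it can be checked on plain file names; `monthly_files` is a made-up helper and the sample names are invented.

```python
import re

# exYYYYMM.xlsx with a valid month (01-12) and a literal dot
MONTHLY_FILE = re.compile(r"^ex(\d{4})(0[1-9]|1[0-2])\.xlsx$")


def monthly_files(names, year):
    """Return the names that match exYYYYMM.xlsx for the given year, sorted."""
    selected = []
    for name in names:
        match = MONTHLY_FILE.match(name)
        if match and match.group(1) == str(year):
            selected.append(name)
    return sorted(selected)


names = ["ex201912.xlsx", "ex201901.xlsx", "ex201913.xlsx",
         "ex201901xxlsx", "ex201801.xlsx", "notes.txt"]
print(monthly_files(names, 2019))  # ['ex201901.xlsx', 'ex201912.xlsx']
```

In practice `names` would come from `os.listdir(FILE_PATH)`.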
I am working on a process to automate generation of offer letters for candidates. The candidate information is in Excel and contains standard information needed for offer letter generation such as candidate name, date of joining, location, job title, CTC etc.
Is there a way to generate multiple offer letters (output file name _.docx) while preserving the formatting of the docx template?
Using Stack Overflow's help, I was able to utilize the python-docx package and generate multiple offer letters. This approach, however, strips all the formatting from the offer letter.
import os
from pandas import *
import datetime
from docxtpl import DocxTemplate
doc = DocxTemplate("\\template\\offer_letter_template.docx")
xls = ExcelFile("\\data\\candidate_data.xlsx")
df = xls.parse(xls.sheet_names[0])
print (df.to_json(orient='records'))
Output:
[{"offer_letter_date":"July 27, 2019","candidate_name":"John Wick","candidate_email":"john.wick#gmail.com","candidate_location":"NYC","candidate_job_title":"Business Development Executive","candidate_ctc":283000},{"offer_letter_date":"July 17, 2019","candidate_name":"Jane Doe","candidate_email":"jane.doe#gmail.com","candidate_location":"NYC","candidate_job_title":"Business Development Executive","candidate_ctc":290000}]
context = df.to_json(orient='records')
doc.render(context)
I am struggling with creating a loop around context so that candidate information is saved in respective file rather than one file itself. Can someone please help?
Jinja2 for word templating was really helpful but I could not replicate it with a loop.
It is possible to create multiple docx files. Unfortunately, the docxtpl documentation does not mention that once you load the template, replacements are done in place, which prevents any further context replacements.
A workaround which you may like is to reopen the template at every iteration.
Something like:
# to_dict gives a list of dicts (to_json would give a single string)
context = df.to_dict(orient='records')
for record in context:
    # reload the template, since render() modifies the loaded document
    doc = DocxTemplate("\\template\\offer_letter_template.docx")
    doc.render(record)
    doc.save("docs-folder\\%s.docx" % record["candidate_name"])
^Might need some revision, but you get the point.
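The loop needs two things: a list with one dictionary per candidate (the JSON string printed in the question has to be parsed first, or `df.to_dict(orient='records')` used instead), and a distinct output name per candidate. The sketch below covers only that part, using a shortened copy of the sample output. The naming scheme and the `output_name` helper are assumptions, since the question does not spell out the intended file name pattern.

```python
import json
import re


def output_name(record):
    """Build a file name such as 'John_Wick.docx' from one candidate record."""
    # Replace anything that is not a letter or digit, to keep the name file-system safe
    safe = re.sub(r"[^A-Za-z0-9]+", "_", record["candidate_name"]).strip("_")
    return f"{safe}.docx"


# Shortened version of the JSON printed in the question
records_json = (
    '[{"candidate_name": "John Wick", "candidate_ctc": 283000},'
    ' {"candidate_name": "Jane Doe", "candidate_ctc": 290000}]'
)
records = json.loads(records_json)  # a list of dicts, one per candidate
file_names = [output_name(record) for record in records]
print(file_names)  # ['John_Wick.docx', 'Jane_Doe.docx']
```

Inside the loop, each `record` would be passed to `doc.render(record)` and `output_name(record)` to `doc.save(...)`.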
This is probably an easy fix, but I can't seem to figure it out...
I'm outputting a list to CSV in Python using the following code:
w = csv.writer(file('filename.csv','wb'))
w.writerows(mylist)
One of the list items is a ratio, so it contains values like '23/54', '9/12', etc. Excel is recognizing some of these values (like 9/12) as a date. What's the easiest way to solve this?
Thanks
Because csv is a text-only format, you cannot tell Excel anything about how to interpret the data, I am afraid.
You'd have to generate actual Excel files (using xlwt for example, documentation and tutorials available on http://www.python-excel.org/).
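You could do this:
import csv
# somelist contains data like '12/51','9/43' etc
mylist = ["'" + val + "'" for val in somelist]
# Python 3: open in text mode with newline='' (Python 2 used 'wb')
with open('filename.csv', 'w', newline='') as f:
    w = csv.writer(f)
    for me in mylist:
        w.writerow([me])
This writes each value wrapped in single quotes, so Excel should read it as text rather than as a date.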
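A variation that is often suggested is to write each value as a small text formula, ="9/12", which Excel evaluates to the plain string when it opens the CSV. That behaviour is an assumption about how Excel imports CSV files and is worth checking with your own Excel version; other CSV consumers will see the formula text literally. The sketch writes to an in-memory buffer so the resulting CSV text can be inspected; `as_excel_text` is a made-up helper.

```python
import csv
import io


def as_excel_text(value):
    # Wrap the value as a text formula; inner double quotes are doubled
    escaped = value.replace('"', '""')
    return f'="{escaped}"'


ratios = ["23/54", "9/12"]
buffer = io.StringIO()  # stands in for open('filename.csv', 'w', newline='')
writer = csv.writer(buffer)
for ratio in ratios:
    writer.writerow([as_excel_text(ratio)])

csv_text = buffer.getvalue()
print(csv_text)
```

The csv module quotes these fields on its own because they contain double quotes, so no extra quoting options are needed.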
I have to port an algorithm from an Excel sheet to python code but I have to reverse engineer the algorithm from the Excel file.
The Excel sheet is quite complicated: it contains many cells with formulas that refer to other cells (which can in turn contain a formula or a constant).
My idea is to analyze the sheet with a Python script, building a sort of table of dependencies between cells, that is:
A1 depends on B4,C5,E7 formula: "=sqrt(B4)+C5*E7"
A2 depends on B5,C6 formula: "=sin(B5)*C6"
...
The xlrd Python module can read an XLS workbook, but at the moment I can only access the value of a cell, not the formula.
For example, with the following code I can only get the value of a cell:
import xlrd
#open the .xls file
xlsname="test.xls"
book = xlrd.open_workbook(xlsname)
#build a dictionary of the names->sheets of the book
sd={}
for s in book.sheets():
    sd[s.name]=s
#obtain Sheet "Foglio 1" from sheet names dictionary
sheet=sd["Foglio 1"]
#print the cell J143 (xlrd row and column indices are zero-based)
print(sheet.cell(142,9))
However, there seems to be no way to get the formula from the Cell object returned by the .cell(...) method.
The documentation says that it is possible to get a string version of the formula (in English, because no information about function name translation is stored in the Excel file). It mentions formulas (expressions) in the Name and Operand classes, but I cannot understand how to get instances of these classes from the Cell instance that must contain them.
Could you suggest a code snippet that gets the formula text from a cell?
[Dis]claimer: I'm the author/maintainer of xlrd.
The documentation references to formula text are about "name" formulas; read the section "Named references, constants, formulas, and macros" near the start of the docs. These formulas are associated sheet-wide or book-wide to a name; they are not associated with individual cells. Examples: PI maps to =22/7, SALES maps to =Mktng!$A$2:$Z$99. The name-formula decompiler was written to support inspection of the simpler and/or commonly found usages of defined names.
Formulas in general are of several kinds: cell, shared, and array (all associated with a cell, directly or indirectly), name, data validation, and conditional formatting.
Decompiling general formulas from bytecode to text is a "work-in-progress", slowly. Note that supposing it were available, you would then need to parse the text formula to extract the cell references. Parsing Excel formulas correctly is not an easy job; as with HTML, using regexes looks easy but doesn't work. It would be better to extract the references directly from the formula bytecode.
Also note that cell-based formulas can refer to names, and name formulas can refer both to cells and to other names. So it would be necessary to extract both cell and name references from both cell-based and name formulas. It may be useful to you to have info on shared formulas available; otherwise having parsed the following:
B2 =A2
B3 =A3+B2
B4 =A4+B3
B5 =A5+B4
...
B60 =A60+B59
you would need to deduce the similarity between the B3:B60 formulas yourself.
In any case, none of the above is likely to be available any time soon -- xlrd priorities lie elsewhere.
Update: I have gone and implemented a little library to do exactly what you describe: extracting the cells & dependencies from an Excel spreadsheet and converting them to python code. Code is on github, patches welcome :)
Just to add that you can always interact with excel using win32com (not very fast but it works). This does allow you to get the formula. A tutorial can be found here [cached copy] and details can be found in this chapter [cached copy].
Essentially you just do:
app.ActiveWorkbook.ActiveSheet.Cells(r,c).Formula
As for building a table of cell dependencies, a tricky thing is parsing the excel expressions. If I remember correctly the Trace code you mentioned does not always do this correctly. The best I have seen is the algorithm by E. W. Bachtal, of which a python implementation is available which works well.
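Once the formula text is available, a dependency table like the one in the question can be built for simple formulas with a naive regular expression. This is only a sketch under strong assumptions: it looks for plain A1-style references and will get sheet-qualified references, ranges, defined names and references inside string literals wrong, which is why a real formula parser is the better choice for anything non-trivial. `cell_references` is a made-up helper.

```python
import re

# A1-style reference, optionally with $ markers; must not be part of a longer
# identifier and must not be followed by '(' (which would make it a function name)
CELL_REF = re.compile(r"(?<![A-Za-z0-9_])\$?([A-Z]{1,3})\$?(\d+)(?![A-Za-z0-9_(])")


def cell_references(formula):
    """Return the distinct cell references in a formula, in order of appearance."""
    refs = []
    for column, row in CELL_REF.findall(formula):
        ref = column + row          # '$A$2' is normalised to 'A2'
        if ref not in refs:
            refs.append(ref)
    return refs


formulas = {"A1": "=sqrt(B4)+C5*E7", "A2": "=sin(B5)*C6"}
dependencies = {cell: cell_references(text) for cell, text in formulas.items()}
print(dependencies)
# {'A1': ['B4', 'C5', 'E7'], 'A2': ['B5', 'C6']}
```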
So I know this is a very old post, but I found a decent way of getting the formulas from all the sheets in a workbook as well as having the newly created workbook retain all the formatting.
First step is to save a copy of your .xlsx file as .xls
-- Use the .xls as the filename in the code below
Using Python 2.7
from lxml import etree
from StringIO import StringIO
import xlsxwriter
import subprocess
from xlrd import open_workbook
from xlutils.copy import copy
from xlsxwriter.utility import xl_cell_to_rowcol
import os
file_name = '<YOUR-FILE-HERE>'
dir_path = os.path.dirname(os.path.realpath(file_name))
subprocess.call(["unzip",str(file_name+"x"),"-d","file_xml"])
xml_sheet_names = dict()
with open_workbook(file_name,formatting_info=True) as rb:
    wb = copy(rb)
    workbook_names_list = rb.sheet_names()
    for i,name in enumerate(workbook_names_list):
        xml_sheet_names[name] = "sheet"+str(i+1)
sheet_formulas = dict()
for i, k in enumerate(workbook_names_list):
    xmlFile = os.path.join(dir_path,"file_xml/xl/worksheets/{}.xml".format(xml_sheet_names[k]))
    with open(xmlFile) as f:
        xml = f.read()
    tree = etree.parse(StringIO(xml))
    context = etree.iterparse(StringIO(xml))
    sheet_formulas[k] = dict()
    for _, elem in context:
        if elem.tag.split("}")[1]=='f':
            cell_key = elem.getparent().get(key="r")
            cell_formula = elem.text
            sheet_formulas[k][cell_key] = str("="+cell_formula)
sheet_formulas
Structure of Dictionary 'sheet_formulas'
{'Worksheet_Name': {'A1_cell_reference':'cell_formula'}}
Example results:
{u'CY16': {'A1': '=Data!B5',
'B1': '=Data!B1',
'B10': '=IFERROR(Data!B12,"")',
'B11': '=IFERROR(SUM(B9:B10),"")',
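Since an .xlsx file is a zip archive, the same extraction can also be sketched with only the standard library, without the unzip call or lxml. The function names are made up, and the default part name `xl/worksheets/sheet1.xml` is an assumption that relies on the same sheetN numbering as the code above. Cells that only reuse a shared formula have an empty `<f>` element and are skipped here.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by worksheet XML parts inside an .xlsx archive
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def formulas_in_sheet_xml(xml_bytes):
    """Map cell reference -> formula text for one worksheet XML part."""
    root = ET.fromstring(xml_bytes)
    formulas = {}
    for cell in root.iterfind(".//m:c", NS):
        formula = cell.find("m:f", NS)
        # Skip cells without a formula and cells with an empty shared-formula stub
        if formula is not None and formula.text:
            formulas[cell.get("r")] = "=" + formula.text
    return formulas


def formulas_in_xlsx(xlsx_file, part="xl/worksheets/sheet1.xml"):
    """Read one worksheet part straight out of the .xlsx zip archive."""
    with zipfile.ZipFile(xlsx_file) as archive:
        return formulas_in_sheet_xml(archive.read(part))
```

Calling `formulas_in_xlsx('<YOUR-FILE-HERE>')` returns a dictionary in the same `{'A1_cell_reference': 'cell_formula'}` shape as the inner dictionaries above.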
It seems that it is impossible now to do what you want with xlrd. You can have a look at this post for the detailed description of why it is so difficult to implement the functionality you need.
Note that the development team does a great job of providing support on the python-excel Google group.
I know this post is a little late but there's one suggestion that hasn't been covered here. Cut all the entries from the worksheet and paste using paste special (OpenOffice). This will convert the formulas to numbers so there's no need for additional programming and this is a reasonable solution for small workbooks.
Yes! It works for me with win32com.
# python -m pip install pywin32
import win32com.client
Excel = win32com.client.Dispatch("Excel.Application")
file=r'path Excel file'
wb = Excel.Workbooks.Open(file)
sheet = wb.ActiveSheet
#Get value
val = sheet.Cells(1,1).value
# Get Formula
sheet.Cells(6,2).Formula