Python function inputting variables into strings - python

I have written a pandas DataFrame to an Excel file using XlsxWriter. I'm trying to create a totals row at the top. To do so, I'm trying to write a function that dynamically populates the totals based on the column I choose.
Here is an example of what I'm intending to do:
worksheet.write_formula('G4', '=SUM(G4:G21)')
#G4 = Where total should be placed
I need this to be a function because the row counts can change (the summation range should be dynamic), and I want an easy way to apply this formula to various columns.
Therefore I've come up with the following:
def get_totals(column):
    start_row = '4'  # row which the totals will be on
    row_count = str(tl_report.shape[0])  # number of rows in the table, so I can sum up every row
    return worksheet.write_formula(str(column + start_row), "'=SUM(" + str(column + start_row) + ":" + str(column + row_count) + ")'")
When running get_totals("G") it just results in 0. I suspect it has to do with the str() calls I had to apply, because they add single quotes to the formula and therefore render it incorrectly.
However, I cannot take the str() calls out because I cannot concatenate ints apparently?
Maybe I'm coding this all wrong; I'm new to Python. Any help appreciated.
Thank you!

You could also do something like this:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8],
                   'C': [np.nan, np.nan, np.nan, np.nan]})

writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=2)
workbook = writer.book
worksheet = writer.sheets['Sheet1']

def get_totals(start_row, sum_column, column1='A', column2='B'):
    for row in range(start_row, df.shape[0] + start_row):
        worksheet.write_formula(f'{sum_column}{row}', f'=SUM({column1}{row},{column2}{row})')

get_totals(4, 'C')
writer.close()

In almost all cases XlsxWriter methods support two forms of notation to designate the position of cells: row-column notation and A1 notation.
Row-column notation uses a zero-based index for both row and column, while A1 notation uses the standard Excel alphanumeric sequence of column letter and 1-based row. For example:
(6, 2) # Row-column notation.
('C7') # The same cell in A1 notation.
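XlsxWriter also ships a utility that converts between the two notations, which is handy for checking your indices. A minimal sketch using the real xl_rowcol_to_cell helper:

```python
from xlsxwriter.utility import xl_rowcol_to_cell

# Convert a zero-indexed (row, col) pair to A1 notation.
cell = xl_rowcol_to_cell(6, 2)
print(cell)  # C7
```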
So for your case you could do the following and set the row-column values programmatically (you may have to adjust by -1 to get zero indexing):
worksheet.write_formula(start_row, start_column, '=SUM(G4:G21)')
For the formula you could use one of XlsxWriter's utility functions:
from xlsxwriter.utility import xl_range
my_range = xl_range(3, 6, 20, 6) # G4:G21
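Putting the two together, here is one way the questioner's totals function could look with row-column notation. This is only a sketch: the write_total name and signature are mine, not part of XlsxWriter, and it assumes the totals cell sits above the data and sums the n_rows cells below it.

```python
import io
import xlsxwriter
from xlsxwriter.utility import xl_range

def write_total(worksheet, total_row, col, n_rows):
    """Write =SUM(...) at (total_row, col), summing the n_rows cells below it.

    All indices are zero-based; returns the formula string for inspection.
    """
    rng = xl_range(total_row + 1, col, total_row + n_rows, col)
    formula = f'=SUM({rng})'
    worksheet.write_formula(total_row, col, formula)
    return formula

# Demo against an in-memory workbook.
workbook = xlsxwriter.Workbook(io.BytesIO())
worksheet = workbook.add_worksheet()
print(write_total(worksheet, 3, 6, 17))  # =SUM(G5:G21)
workbook.close()
```

Because the row count comes in as a parameter, you can pass df.shape[0] and the range grows with the table.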

Related

Converting every other csv file column from python list to value

I have several large CSV files, each with 100 columns and 800k rows. Starting from the first column, every other column has cells that look like a Python list: for example, cell A2 contains [1000], cell A3 contains [2300], and so forth. Column 2 is fine and contains plain numbers, but columns 1, 3, 5, 7, ..., 99 are like column 1, with their values inside brackets. Is there an efficient way to remove the list brackets [] from those columns and make their cells normal numbers?
files_directory = r"D:\my_files"
dir_files = os.listdir(files_directory)
for file in dir_files:
    edited_csv = pd.read_csv("%s\%s" % (files_directory, file))
    for column in list(edited_csv.columns):
        if (column % 2) != 0:
            edited_csv[column] = ?
Please try:
import pandas as pd

df = pd.read_csv('file.csv', header=None)
df.columns = df.iloc[0]
df = df[1:]
for x in df.columns[::2]:
    df[x] = df[x].apply(lambda x: float(x[1:-1]))
print(df)
When reading the cells, for example column_1[3], which in this case is [4554.8433], pandas reads them as strings rather than lists. If you parse a cell into a real list (for example with ast.literal_eval), you can then index into it to get the numerical value:
import ast
value = ast.literal_eval(column_1[3])  # the string '[4554.8433]' becomes the list [4554.8433]
print(value[0])  # prints 4554.8433 instead of [4554.8433]
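For whole columns, the bracket-stripping in the answer's loop can also be done without apply, using pandas' vectorized string methods. A sketch on toy data (the column layout here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({0: ['[1000]', '[2300]'], 1: [5, 6]})

# Strip the surrounding brackets and convert to numbers in one vectorized step.
df[0] = df[0].str.strip('[]').astype(float)
print(df[0].tolist())  # [1000.0, 2300.0]
```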

ipysheet.sheet converting to DataFrame with saving manual changes done

The aim is to create an interactive dataframe where I can edit the values of cells without coding.
To me it seems it should work in the following way:
creating an ipysheet.sheet
editing cells manually
converting it to a pandas DataFrame
The problem is:
after creating an ipysheet.sheet I manually changed the values of some cells and then converted it to a pandas DataFrame, but the changes are not reflected in that DataFrame; if you just display the sheet without converting, you can see the changes.
d = {'col1': [2, 8], 'col2': [3, 6]}
df = pd.DataFrame(data=d)

sheet1 = ipysheet.sheet(rows=len(df.columns) + 1, columns=3)
first_col = df.columns.to_list()
first_col.insert(0, 'Attribute')
column = ipysheet.column(0, value=first_col, row_start=0)
cell_value1 = ipysheet.cell(0, 1, 'Format')
cell_value2 = ipysheet.cell(0, 2, 'Description')

sheet1  # displaying sheet1
ipysheet.to_dataframe(sheet1)  # converting to pd.DataFrame
Solved by predefining all empty cells as np.nan. You can then edit them manually and the changes carry over to the DataFrame when converting.

Sorting columns with pandas / xslxwriter after adding formulas

I have a dataframe with a few columns; they're actually irrelevant to this problem, but I want to sort my columns in a specific order.
Now, the issue is that I have a bunch of formulas that refer to Excel tables (which I'm creating with xlsxwriter's worksheet.add_table), for example:
planned_units = '=Table1[@[Spend]]/Table1[@[CP]]'
So if I add those formulas by simply adding a column in pandas:
df['newformula'] = planned_units
it won't work, I think because I'd be adding a formula that references a table before actually adding the table. But sorting the columns before adding the formulas won't work either, because:
I'm adding the formulas later (after creating the table), yet I also want to sort the columns I just added
if I add formulas referencing an Excel table before add_table, those formulas won't work in Excel
It seems that xlsxwriter doesn't let me sort columns in any way (maybe I'm wrong?), so I don't see any way of sorting the columns once I have my final 'product' (after adding all the columns with formulas).
It's still better to have working formulas than sorted columns, but I will happily welcome any ideas on how to sort them at this point.
thanks!
PS Code example:
import pandas as pd
import xlsxwriter

# simple dataframe with 3 columns
input_df = pd.DataFrame({'column_a': ['x', 'y', 'z'],
                         'column_b': ['red', 'white', 'blue'],
                         'column_c': ['a', 'e', 'i'],
                         })
output_file = 'output.xlsx'

# formula I want to add
column_concatenation = '=CONCATENATE(Table1[@[column_a]], " ", Table1[@[column_b]])'
# now if adding formulas with pandas were possible, I would do it like this:
# input_df['concatenation'] = column_concatenation
# but it's not possible, since Excel gives you errors while opening!

# adding an excel table with xlsxwriter:
workbook = xlsxwriter.Workbook(output_file)
worksheet = workbook.add_worksheet("Sheet with formula")

# here I would change the column order only IF formulas added with pandas worked! so no-no
'''
desired_column_order = ['column_b', 'concatenation', 'column_c', 'column_a']
input_df = input_df[desired_column_order]
'''
data = input_df
worksheet.add_table('A1:D4', {'data': data.values.tolist(),
                              'columns': [{'header': c} for c in data.columns.tolist()] +
                                         [{'header': 'concatenation',
                                           'formula': column_concatenation}],
                              'style': 'Table Style Medium 9'})
workbook.close()
Now, before workbook.close(), I'd love to use that 'desired_column_order' list to re-order my columns after I've added the formulas.
thanks :)
It looks like there are two issues here: sorting and the table formula.
Sorting is something that Excel does at runtime, in the Excel application; it isn't a property of, or something that can be triggered in, the file format. Since XlsxWriter only deals with the file format, it cannot do any sorting. However, the data can be sorted in Python/Pandas prior to writing it with XlsxWriter.
The formula issue is due to the fact that Excel had an original [#This Row] syntax (Excel 2007) and a later @ syntax (Excel 2010+). See the XlsxWriter docs on Working with Worksheet Tables - Columns:
The Excel 2007 style [#This Row] and Excel 2010 style @ structural references are supported within the formula. However, other Excel 2010 additions to structural references aren't supported and formulas should conform to Excel 2007 style formulas.
So basically you need to use the Excel 2007 syntax, since that is what is stored in the file format, even if Excel displays the Excel 2010+ syntax externally.
When you add formulas via the add_table() method, XlsxWriter does the conversion for you, but if you add the formulas another way, such as via Pandas, you need to use the Excel 2007 syntax. So instead of a formula like this:
=CONCATENATE(Table1[@[column_a]], " ", Table1[@[column_b]])
You need to add this:
=CONCATENATE(Table1[[#This Row],[column_a]], " ", Table1[[#This Row],[column_b]])
(You can see why they moved to the shorter syntax in later Excel versions.)
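If you ever need to do that conversion yourself before injecting formulas via Pandas, it can be sketched with a small regex. The helper name is mine, not part of XlsxWriter, and it only handles the simple Table[@[Col]] form:

```python
import re

def to_2007_syntax(formula):
    # Rewrite Excel 2010 'Table[@[Col]]' references to the Excel 2007
    # 'Table[[#This Row],[Col]]' form that is stored in the file format.
    return re.sub(r'\[@\[([^\]]+)\]\]', r'[[#This Row],[\1]]', formula)

print(to_2007_syntax('=Table1[@[Spend]]/Table1[@[CP]]'))
# =Table1[[#This Row],[Spend]]/Table1[[#This Row],[CP]]
```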
Then your program will work as expected:
import pandas as pd
import xlsxwriter

input_df = pd.DataFrame({'column_a': ['x', 'y', 'z'],
                         'column_b': ['red', 'white', 'blue'],
                         'column_c': ['a', 'e', 'i'],
                         })
output_file = 'output.xlsx'

column_concatenation = '=CONCATENATE(Table1[[#This Row],[column_a]], " ", Table1[[#This Row],[column_b]])'
input_df['concatenation'] = column_concatenation

workbook = xlsxwriter.Workbook(output_file)
worksheet = workbook.add_worksheet("Sheet with formula")

desired_column_order = ['column_b', 'concatenation', 'column_c', 'column_a']
input_df = input_df[desired_column_order]
data = input_df

# Make the columns wider for clarity.
worksheet.set_column(0, 3, 16)

worksheet.add_table('A1:D4', {'data': data.values.tolist(),
                              'columns': [{'header': c} for c in data.columns.tolist()],
                              'style': 'Table Style Medium 9'})
workbook.close()

Reading values from Pandas dataframe rows into equations and entering result back into dataframe

I have a dataframe. For each row of the dataframe, I need to read values from two column indexes, pass these values to a set of equations, enter the result of each equation into its own column index in the same row, then go to the next row and repeat.
After reading the responses to similar questions I tried:
import pandas as pd

DF = pd.read_csv("...")
Equation_1 = f(x, y)
Equation_2 = g(x, y)

for index, row in DF.iterrows():
    a = DF[m]
    b = DF[n]
    DF[p] = Equation_1(a, b)
    DF[q] = Equation_2(a, b)
Rather than iterating over DF, reading and entering new values for each row, this code iterates over DF and enters the same values for each row. I am not sure what I am doing wrong here.
Also, from what I have read, it is actually faster to treat the DF as a NumPy array and perform the calculation over the entire array at once rather than iterating. I'm not sure how I would go about this.
Thanks.
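A minimal sketch of the whole-array (vectorized) approach the question asks about; the column names m, n, p, q mirror the question's placeholders, and the two equations are arbitrary examples:

```python
import pandas as pd

df = pd.DataFrame({'m': [1.0, 2.0, 3.0], 'n': [4.0, 5.0, 6.0]})

# Vectorized: each expression operates on whole columns at once, no loop needed.
df['p'] = df['m'] + df['n']        # stand-in for Equation_1
df['q'] = df['m'] * df['n'] ** 2   # stand-in for Equation_2
print(df['p'].tolist())  # [5.0, 7.0, 9.0]
```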
Turns out that this is extremely easy. All that must be done is to define two variables and assign the desired columns to them, then set the target column equal to the expression containing those variables.
Pandas already knows that it must apply the expression to every row and return each value to its proper index. I didn't realize it would be this easy and was looking for more explicit code.
e.g.,
import pandas as pd

df = pd.read_csv("...")  # df is a large 2D array
A = df[0]
B = df[1]
df[3] = f(A, B)  # f is your equation, defined elsewhere, e.g. def f(A, B): return A + B
# If your equations are simple enough, do operations column-wise in Pandas:
import pandas as pd

test = pd.DataFrame([[1, 2], [3, 4], [5, 6]])
test             # Default column names are 0, 1
test[0]          # This is column 0
test.iloc[:, 0]  # Column 0 selected by position, returned as a Series
test.columns = ['S', 'Q']  # Column names are easier to use
test             # Column names! Use them column-wise:
test['result'] = test.S**2 + test.Q
test             # results stored in the DataFrame

# For more complicated stuff, try apply, as in "Python pandas apply on more columns":
def toyfun(row):
    return row['S'] - row['Q']**2

test['out2'] = test[['S', 'Q']].apply(toyfun, axis=1)

# You can also define the column names when you generate the DataFrame:
test2 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=list('AB'))

Read CSV without string formatting in python

I have a CSV file and I would like to read it cell by cell so that I can write it into Excel. I am using csv.reader and enumerating the result so that I can put values into the corresponding cells in Excel.
With the current code, once I enumerate, the values turn into strings. If I write to Excel with sheet.write(rowi, coli, value), all cells are formatted as text. I can't have this, because I need to sum columns afterward and they need to be treated as numbers.
For example, my text file will have: 1, a, 3, 4.0, 5, 6, 7
After first enumeration, the first row: (0, '1, a, 3, 4.0, 5, 6, 7')
After second enumeration, first column of first row: (0, 0, '1')
QUESTION: How can I read this csv file to yield (0, 0, 1) (etc.)?
Here's some code I'm working with:
import csv, xlwt

with open('file.csv', newline='') as csvfile:
    data = csv.reader(csvfile, delimiter=",")
    wbk = xlwt.Workbook()
    sheet = wbk.add_sheet("file")
    for rowi, row in enumerate(data):
        for coli, value in enumerate(row):
            sheet.write(rowi, coli, value)
            # print(rowi, coli, value) gives (rowi, coli, 'value')
import csv, xlwt

with open('file.csv', newline='') as csvfile:
    data = csv.reader(csvfile, delimiter=",")
    wbk = xlwt.Workbook()
    sheet = wbk.add_sheet("file")
    for rowi, row in enumerate(data):
        for coli, value in enumerate(row):
            sheet.write(rowi, coli, value)
    wbk.save("workbook_file")
Even though print(rowi, coli, value) shows 'value' with quotes, the cell in the output file should show it without quotes.
If your data is in the format 1, 2, 3 and not 1,2,3 include this after your for coli, value in enumerate(row): line:
value = value.lstrip(" ")
Well, I think the csv module of Python is still lacking a crystal ball... More seriously, in a CSV file there is no indication of the type of a value: integer, float, string or date. By default, the reader transforms a row into a list of strings.
If you want some columns to be integers, you can add a list of booleans to your script. Say you have 4 columns and the third is integer:
int_col = [False, False, True, False]
...
for rowi, row in enumerate(data):
    for coli, value in enumerate(row):
        val = int(value) if int_col[coli] else value
        sheet.write(rowi, coli, val)
You can also try to guess which columns are integers by reading n rows (for example n = 10) and, for each column where you found n integers, treating that column as integer.
Or you can even imagine a two-pass operation: the first pass determines the types of the columns and the second does the inserts.
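That sampling idea can be sketched as a first pass over a few rows; the helper name and the 10-row sample size here are illustrative, not from the answer:

```python
import csv
import io

def guess_int_columns(rows, sample=10):
    """First pass: mark a column as integer if every sampled value parses as int."""
    sampled = rows[:sample]
    flags = []
    for c in range(len(sampled[0])):
        ok = True
        for r in sampled:
            try:
                int(r[c])
            except ValueError:
                ok = False
                break
        flags.append(ok)
    return flags

rows = list(csv.reader(io.StringIO("1,a,3\n4,b,6\n")))
print(guess_int_columns(rows))  # [True, False, True]
```

The returned flags can then drive the int() conversion in the second pass shown above.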
I find Python's standard library functions a bit lacking for processing CSV files. I prefer to work with pandas when possible.
import xlwt
from pandas.io.parsers import read_csv
df = read_csv('file.csv')
#number the columns sequentially
df.columns = [i for i, e in enumerate(df.columns)]
#unstack the columns to make 2 indices plus a column, make row come before col,
#sort row major order, and then unset the indices to get a DataFrame
newDf = df.unstack().swaplevel(0,1).sort_index().reset_index()
#rename the cols to reflect the types of data
newDf.columns = ['row', 'col', 'value']
#write to excel
newDf.to_excel('output.xls', index=False)
This will also keep the row and column numbers as integer values. I took an example CSV file and both row and col were integer-valued, not strings.
