xlwings UDF function erases the next cell in Excel - Python

I have a little problem in Excel using xlwings and I really don't know how to fix it.
When I use a UDF that returns, for example, a pandas DataFrame, suppose the DataFrame is 3 columns wide (no particular condition on rows). If I write some data in the 4th column in Excel, the DataFrame will erase it as soon as I recalculate the sheet, even though the DataFrame is only 3 columns wide and doesn't use that column at all.
I don't know if I'm clear enough. Let me know!
Thank you very much in advance.
import xlwings as xw

@xw.func
@xw.ret(expand='table')
def hello(nb):
    nb = int(nb)
    return [["hello", "you"] for i in range(nb)]
(screenshots: before and after recalculating the sheet)

According to the xlwings documentation, a result that expands needs an empty row at the bottom and an empty column to the right; if not, xlwings will overwrite whatever is there:
http://docs.xlwings.org/en/stable/api.html#xlwings.xlwings.ret

Related

Converting 0-1 values in a dataset to the name of the column where the cell value is 1

I have a CSV dataset with 0-1 values for the features of the elements. I want to iterate over each cell and replace the 1s with the name of their column. There are more than 500 thousand rows and 200 columns, and because the table is exported from another annotation tool that I update often, I want to find a way to do this automatically in Python.
This is not the real table, but a sample I was using while trying to write some code. I tried a few things, without success.
I would really appreciate it if you could share your knowledge with me; it would be a huge help. The final result I want is of the type (abonojnë, token_pos_verb). If you know any method to do this in Excel without the help of Python, even better.
Thank you,
Brikena
Text,Comment,Role,ParentID,doc_completeness,lemma,MultiWord_Expr,token,pos,punctuation,verb,noun,adjective
abonojnë,,,,,,,1,1,0,1,0,0
çokasin,,,,,,,1,1,0,1,0,1
gërgasin,,,,,,,1,1,0,1,0,0
godasin,,,,,,,1,1,0,1,0,0
përkasin,,,,,,,1,1,1,1,0,0
përdjegin,,,,,,,1,1,0,1,0,0
lakadredhin,,,,,,,1,1,0,1,1,0
përdredhin,,,,,,,1,1,0,1,0,0
spërdredhin,,,,,,,1,1,0,1,0,0
përmbledhin,,,,,,,1,1,0,1,0,0
shpërdredhin,,,,,,,1,1,0,1,0,0
arsejnë,,,,,,,1,1,0,1,1,0
çapëlejnë,,,,,,,1,1,0,1,0,0
Using pandas, this is quite easy:
# pip install pandas
import pandas as pd
# read data (here example with csv, but use "read_excel" for excel)
df = pd.read_csv('input.csv').set_index('Text')
# reshape and export
(df.mul(df.columns).where(df.eq(1))
.stack().rename('xxx')
.groupby(level=0).apply('_'.join)
).to_csv('output.csv') # here use "to_excel" for excel format
output file:
Text,xxx
abonojnë,token_pos_verb
arsejnë,token_pos_verb_noun
godasin,token_pos_verb
gërgasin,token_pos_verb
lakadredhin,token_pos_verb_noun
përdjegin,token_pos_verb
përdredhin,token_pos_verb
përkasin,token_pos_punctuation_verb
përmbledhin,token_pos_verb
shpërdredhin,token_pos_verb
spërdredhin,token_pos_verb
çapëlejnë,token_pos_verb
çokasin,token_pos_verb_adjective
An update for those who may find it helpful in the future. Thank you to @mozway for helping me. A friend of mine suggested working with an Excel formula, because the solution with pandas and groupby eliminates duplicates. I need all the duplicates: since it's an annotated corpus, it's normal for repeated words to appear in every context, not only in the first occurrence.
The other alternative is this:
Use a second sheet in the Excel file, write the formula =IF(Sheet1!B2=1,Sheet2!B$1,"") in the first cell with 0-1 values, and drag it across all the other cells. This keeps all the occurrences of the words. It's quick, and it works like magic.
I hope this can be helpful to others who want to convert a 0-1 dataset to feature names without having to code.
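If you'd rather stay in pandas, a row-wise variant also keeps every occurrence, since it never groups rows together. This is a sketch, not part of the original answer, and the tiny DataFrame below is invented for illustration:

```python
import pandas as pd

# tiny stand-in for the annotated corpus; note the repeated word
df = pd.DataFrame(
    {'token': [1, 1], 'pos': [1, 1], 'verb': [1, 0]},
    index=pd.Index(['abonojnë', 'abonojnë'], name='Text'),
)
# build one label per row, so duplicated index entries are preserved
labels = df.apply(
    lambda row: '_'.join(col for col, val in row.items() if val == 1),
    axis=1,
)
print(labels.tolist())  # ['token_pos_verb', 'token_pos']
```

labels.to_csv('output.csv') would then write one line per occurrence, duplicates included.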

Merging and cleaning up csv files in Python

I have been using pandas but am open to all suggestions; I'm not an expert at scripting and am at a complete loss. My goal is the following:
Merge multiple CSV files. I was able to do this in pandas and have a DataFrame with the merged dataset.
(screenshot of how the merged dataset looks)
Delete the duplicated "GEO" columns after the first one. This part doesn't let me use df = df.loc[:,~df.columns.duplicated()] because the columns are not technically duplicated: the repeated column names end with .1, .2, etc., which I'm guessing the concat adds. Another problem is that some columns have a duplicated name but hold different data. I have been using the first row as the index since it's always the same coded values, but this row is unnecessary and will be deleted later in the script. This is my biggest problem right now.
Delete certain columns, such as the ones with the "Margins". I use ~df2.columns.str.startswith for this and have no trouble with it.
Replace spaces, ":" and ";" with underscores in the first row. I have no clue how to do this.
Insert a new column, write the '=TEXT(B1,0)' formula for the whole column (the formula changing to B2, B3, etc.), then copy the column and paste as values. I was able to do this in openpyxl, although I was having trouble and was not able to check the final output due to Excel trouble.
source = excel.Workbooks.Open(filename)
excel.Range("C1:C1337").Select()
excel.Selection.Copy()
excel.Selection.PasteSpecial(Paste=constants.xlPasteValues)
Not sure if it works, and I was wondering whether this is possible in pandas or win32com, or if I should stay with openpyxl. Thanks all!
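For the duplicated-GEO and underscore steps specifically, a pandas sketch (the column names below are invented, and it assumes the .1/.2 suffixes come from the concat):

```python
import re

import pandas as pd

# toy stand-in for the merged dataset; repeats are suffixed .1, .2, ...
df = pd.DataFrame({
    'GEO': ['A', 'B'], 'GEO.1': ['A', 'B'], 'GEO.2': ['A', 'B'],
    'Total: Pop': [1, 2], 'Total: Pop.1': [3, 4],  # same name, different data
})
# drop only the suffixed copies of GEO, leaving other ".n" columns alone
df = df.drop(columns=[c for c in df.columns if re.fullmatch(r'GEO\.\d+', c)])
# replace spaces, ":" and ";" in the header row with underscores
df.columns = [re.sub('[ :;]', '_', c) for c in df.columns]
print(list(df.columns))  # ['GEO', 'Total__Pop', 'Total__Pop.1']
```

Restricting the pattern to GEO avoids touching same-named columns that actually hold different datasets.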

python: looping through indices in Excel and replacing them with strings

So I have an Excel sheet with the following format:
Now what I'm looking to do is to loop through each index cell in column A and assign all cells the same value until the next 0 is reached, so for example:
I have tried importing the Excel file into a pandas DataFrame and then using for loops to do this, but I can't seem to make it work. Any suggestions or directions to the appropriate method would be much appreciated!
Thank you for your time.
Edit:
Using @wen-ben's method, s.index = pd.Series((s.index == 0).cumsum()).map({1: 'bananas', 2: 'cherries', 3: 'pineapples'})
just enters the first element (bananas) for all cells in column A.
Assuming you have the DataFrame s, use cumsum:
s.index=pd.Series((s.index==0).cumsum()).map({1:'bananas',2:'cherries',3:'pineapples'})
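A minimal, self-contained reproduction of the cumsum idea (the data and labels below are invented for illustration):

```python
import pandas as pd

# toy version of the sheet: the index restarts at 0 for each new block
s = pd.Series(range(5), index=[0, 1, 2, 0, 1])
# each 0 starts a new block; cumsum numbers the blocks 1, 2, ...
blocks = pd.Series((s.index == 0).cumsum())
s.index = blocks.map({1: 'bananas', 2: 'cherries'})
print(s.index.tolist())
# ['bananas', 'bananas', 'bananas', 'cherries', 'cherries']
```

The cumulative sum only increments at the rows where the index is 0, so every row between two zeros falls into the same block and receives the same label.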

Maintaining Formulae when Adding Rows/Columns

I'm doing some Excel automation in Python using openpyxl, and I'm having an issue when I try to insert columns or rows into my sheet.
I'm modifying an existing Excel sheet that has basic formulas in it (e.g. =F2-G2); however, when I insert a row or column before these cells, the formulas do not adjust accordingly, like they would if you performed that action in Excel.
For example, inserting a column before column F should change the formula to =G2-H2, but instead it stays as =F2-G2...
Is there any way to work around this issue? I can't really iterate through all the cells and fix the formulas by hand, because the file contains many columns with formulas in them.
openpyxl is a file-format library, not an application like Excel, and it does not attempt to provide the same functionality. Translating the formulae in cells that are moved should be possible with the library's tokeniser, but this would ignore any formulae elsewhere on the same worksheet or in the same workbook that refer to the moved cells.
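That tokeniser is exposed as openpyxl.formula.translate.Translator, which can shift a single formula by a cell offset. A minimal sketch of translating =F2-G2 one column to the right (the origin/destination cells are arbitrary; only their offset matters):

```python
from openpyxl.formula.translate import Translator

# moving from H2 to I2 is a one-column shift, so F2 -> G2 and G2 -> H2
shifted = Translator('=F2-G2', origin='H2').translate_formula('I2')
print(shifted)  # '=G2-H2'
```

As the answer notes, this only rewrites the formulas you feed it; it does not find other cells that reference the moved range.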
Easy: just iterate from the inserted row down to the max row and change each formula's row number accordingly. The code below is just an example:

# insert a new row at the identified position
ws.insert_rows(InsertedRowNo)
# after every insert, adjust the formula row numbers for all rows below it
for i in range(InsertedRowNo, ws.max_row + 1):
    ws.cell(row=i, column=20).value = '=HYPERLINK(VLOOKUP(TRIM(A{0}),dict!$A$2:$B$1001,2,0),A{0})'.format(i)

How can I preserve a pandas multi-index between a to_excel() and a read_excel()?

According to the pandas documentation for read_excel, I can put the index column names on a separate line, and the method will then infer which columns should be used as indices.
I want to create an Excel file from a multi-indexed dataframe that can be read in as such, but I can't figure out how to get pandas to write to_excel in such a way that this additional row is created (from a multi-indexed dataframe).
I can't imagine that storing a multi-indexed dataframe as an Excel worksheet and then pulling it back in later is that uncommon a use case, so I'm wondering if I just haven't figured out how to do this.
Here's an example of a dataframe I'd like to 'freeze' in Excel before reading back in without having to tell read_excel which columns are the indices:
ipdb> my_df
                               Date    Amount
Rec Section        Row
0   Top Section    2     2015-05-01   -105.00
1   Middle Section 3     2015-05-04  90247.60
2   Middle Section 4     2015-05-05  -2992.99
3   Bottom Section 5     2015-05-08   -800.00
In my example, there are three index columns: Rec, Section, and Row.
When I write this to Excel and then read it back in, I don't want to have to tell read_excel this. Since read_excel seems to be able to infer the index names when they appear on a separate row, I want it to just figure things out (assuming I write the Excel file correctly).
What am I missing?
I was encountering the same issue when trying to write a pivot table to Excel. I was able to get this to work by modifying the frame.py file in ../pandas/core: changing if self.columns.nlevels > 1 to if self.columns.nlevels > 1 and not index got me what I needed.
As this functionality is still not supported by pandas, you may still encounter odd output. Also, this will likely not solve the issue for read_excel either. Hopefully this helps a little!
Credit to 'onesandzeros', whose GitHub comment I referenced.
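For completeness, the round trip does work in recent pandas, but read_excel will not infer the index levels on its own; index_col still has to be passed explicitly. A sketch (using an in-memory buffer instead of a file, purely for illustration; it assumes openpyxl is installed as the Excel engine):

```python
import io

import pandas as pd

df = pd.DataFrame(
    {'Date': ['2015-05-01', '2015-05-04'], 'Amount': [-105.00, 90247.60]},
    index=pd.MultiIndex.from_tuples(
        [(0, 'Top Section', 2), (1, 'Middle Section', 3)],
        names=['Rec', 'Section', 'Row'],
    ),
)
buf = io.BytesIO()
df.to_excel(buf)  # writes the index-name row below the header
buf.seek(0)
# the index columns still have to be named explicitly on the way back in
back = pd.read_excel(buf, index_col=[0, 1, 2])
print(list(back.index.names))  # ['Rec', 'Section', 'Row']
```

The level names survive the round trip once index_col is given; only the "figure it out with no arguments" part remains unsupported.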
