I have a CSV dataset with 0-1 values for the features of the elements. I want to iterate over each cell and replace the 1 values with the name of their column. There are more than 500 thousand rows and 200 columns, and because the table is exported from another annotation tool that I update often, I want a way to do this automatically in Python.
Below is not the real table, but a sample I was using while trying to write the code. I tried a few approaches, but without success.
I would really appreciate it if you could share your knowledge with me; it would be a huge help. The final result I want is of the type (abonojnë, token_pos_verb). If you know a method to do this in Excel without the help of Python, it would be even better.
Thank you,
Brikena
Text,Comment,Role,ParentID,doc_completeness,lemma,MultiWord_Expr,token,pos,punctuation,verb,noun,adjective
abonojnë,,,,,,,1,1,0,1,0,0
çokasin,,,,,,,1,1,0,1,0,1
gërgasin,,,,,,,1,1,0,1,0,0
godasin,,,,,,,1,1,0,1,0,0
përkasin,,,,,,,1,1,1,1,0,0
përdjegin,,,,,,,1,1,0,1,0,0
lakadredhin,,,,,,,1,1,0,1,1,0
përdredhin,,,,,,,1,1,0,1,0,0
spërdredhin,,,,,,,1,1,0,1,0,0
përmbledhin,,,,,,,1,1,0,1,0,0
shpërdredhin,,,,,,,1,1,0,1,0,0
arsejnë,,,,,,,1,1,0,1,1,0
çapëlejnë,,,,,,,1,1,0,1,0,0
Using pandas, this is quite easy:
# pip install pandas
import pandas as pd
# read data (here example with csv, but use "read_excel" for excel)
df = pd.read_csv('input.csv').set_index('Text')
# reshape and export
(df.mul(df.columns).where(df.eq(1))   # 1 -> column name, everything else -> NaN
   .stack().rename('xxx')             # drop the NaNs; one row per (word, feature)
   .groupby(level=0).apply('_'.join)  # join the feature names per word
).to_csv('output.csv') # here use "to_excel" for excel format
output file:
Text,xxx
abonojnë,token_pos_verb
arsejnë,token_pos_verb_noun
godasin,token_pos_verb
gërgasin,token_pos_verb
lakadredhin,token_pos_verb_noun
përdjegin,token_pos_verb
përdredhin,token_pos_verb
përkasin,token_pos_punctuation_verb
përmbledhin,token_pos_verb
shpërdredhin,token_pos_verb
spërdredhin,token_pos_verb
çapëlejnë,token_pos_verb
çokasin,token_pos_verb_adjective
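For anyone puzzling over the chain, here is what each step produces on a two-row toy frame (shortened column set; a sketch, not the original data — the explicit .dropna() just makes the NaN removal unambiguous across pandas versions):

```python
import pandas as pd

df = pd.DataFrame({'token': [1, 1], 'verb': [1, 0], 'noun': [0, 1]},
                  index=pd.Index(['abonojnë', 'arsejnë'], name='Text'))

# step 1: multiply each cell by its column name -> 1 becomes the name, 0 becomes ''
named = df.mul(df.columns)
# step 2: keep only the cells that were 1; the rest become NaN
kept = named.where(df.eq(1))
# step 3: stack to one row per (word, feature), drop the NaNs, join per word
out = kept.stack().dropna().groupby(level=0).apply('_'.join)
print(out.to_dict())  # {'abonojnë': 'token_verb', 'arsejnë': 'token_noun'}
```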
An update for those who may find it helpful in the future. Thank you to @mozway for helping me. A friend of mine suggested working with an Excel formula, because the pandas solution with groupby eliminates duplicates. I need all the duplicates: since it's an annotated corpus, it's normal that words repeat, and each one should appear in every context, not only its first occurrence.
The other alternative is this:
Use a second sheet on the excel file, writing the formula =IF(Sheet1!B2=1,Sheet2!B$1,"") in the first cell with 0-1 values and drag it in all the other cells. This keeps all the occurrences of the words. It's quick and it works like magic.
I hope this can be helpful to others who want to convert a 0-1 dataset to feature names without having to code.
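For those who would still rather stay in Python, a row-wise variant keeps the duplicates too, because it never groups: it builds the joined string per row with a plain column assignment. A sketch on a few of the sample rows (the flag-column list is assumed from the sample header):

```python
import pandas as pd

# a few rows in the layout of the sample data from the question
df = pd.DataFrame(
    {'Text': ['abonojnë', 'çokasin', 'përkasin'],
     'token': [1, 1, 1], 'pos': [1, 1, 1],
     'punctuation': [0, 0, 1], 'verb': [1, 1, 1],
     'noun': [0, 0, 0], 'adjective': [0, 1, 0]})

flag_cols = ['token', 'pos', 'punctuation', 'verb', 'noun', 'adjective']
# join the names of the columns holding a 1, row by row;
# assigning a column keeps every row, duplicates included
df['features'] = df[flag_cols].apply(
    lambda row: '_'.join(c for c in flag_cols if row[c] == 1), axis=1)
print(df[['Text', 'features']].to_string(index=False))
```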
I'm doing some excel sheet Python automation using openpyxl and I'm having an issue when I try to insert columns or rows into my sheet.
I'm modifying an existing Excel sheet that has basic formulas in it (e.g. =F2-G2); however, when I insert a row or column before these cells, the formulas do not adjust accordingly as they would if you performed that action in Excel.
For example, inserting a column before column F should change the formula to =G2-H2, but instead it stays at =F2-G2...
Is there any way to work around this issue? I can't really iterate through all the cells and fix the formulas by hand, because the file contains many columns with formulas in them.
openpyxl is a file format library, not an application like Excel, and it does not attempt to provide the same functionality. Translating formulae in cells that are moved should be possible with the library's tokeniser, but this ignores any formulae elsewhere in the same worksheet or workbook that refer to the moved cells.
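For the narrower case of a known formula that has to move with its cell, the tokeniser mentioned above is exposed through openpyxl's Translator, which shifts relative references by the offset between an origin and a destination cell. A minimal sketch (the cell addresses here are illustrative):

```python
from openpyxl.formula.translate import Translator

# a formula originally written for cell H2; after inserting one column
# before F, the cell now lives at I2, so its relative references
# should shift right by one column as well
shifted = Translator("=F2-G2", origin="H2").translate_formula("I2")
print(shifted)  # =G2-H2
```

Note that this still leaves untouched any formulas in *other* cells that point at the moved range, which is the limitation described above.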
Easy: just iterate from your inserted row down to the max row and change each formula's row number accordingly. The code below is just an example:
# insert a new row at the identified position
ws.insert_rows(InsertedRowNo)
# every time you insert a new row, adjust the row numbers in all formulas at and below it
for i in range(InsertedRowNo, ws.max_row + 1):
    ws.cell(row=i, column=20).value = \
        '=HYPERLINK(VLOOKUP(TRIM(A{0}),dict!$A$2:$B$1001,2,0),A{0})'.format(i)
According to the pandas documentation for read_excel, I can put the index column names on a separate line and then the method will infer which columns should be used as indices.
I want to create an Excel file from a multi-indexed dataframe that can be read in as such, but I can't figure out how to get pandas to write to_excel in such a way that this additional row is created (from a multi-indexed dataframe).
I can't imagine that storing a multi-indexed dataframe as an Excel worksheet and then pulling it back in later is that uncommon a use case, so I'm wondering if I just haven't figured out how to do this.
Here's an example of a dataframe I'd like to 'freeze' in Excel before reading back in without having to tell read_excel which columns are the indices:
ipdb> my_df
Date Amount
Rec Section Row
0 Top Section 2 2015-05-01 -105.00
1 Middle Section 3 2015-05-04 90247.60
2 Middle Section 4 2015-05-05 -2992.99
3 Bottom Section 5 2015-05-08 -800.00
In my example, there are three index columns: Rec, Section, and Row.
When I write this to Excel and then read it back in, I don't want to have to tell it this. Since read_excel seems to have a method that infers the index names when they appear on a separate row, I want to have it just figure it out (assuming I correctly write the Excel file).
What am I missing?
I was encountering the same issue when trying to write a pivot table to Excel. I was able to get this to work by modifying the frame.py file in ../pandas/core. Changing if self.columns.nlevels > 1 to if self.columns.nlevels > 1 and not index got me what I needed.
As this functionality is still not supported by Pandas, you may still encounter funny output. Also, this will likely not solve the issue for read_excel either. Hopefully this helps a little!
Credit to 'onesandzeros', whose GitHub comment I took this from.
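Patching pandas internals is fragile, so as a hedge: the round-trip that is reliably supported is to write with to_excel and pass index_col explicitly on read, which does preserve the three index levels. A minimal sketch (the file name is arbitrary, and an Excel engine such as openpyxl is assumed to be installed):

```python
import pandas as pd

# a small frame shaped like the one in the question
df = pd.DataFrame(
    {'Date': ['2015-05-01', '2015-05-04'],
     'Amount': [-105.00, 90247.60]},
    index=pd.MultiIndex.from_tuples(
        [(0, 'Top Section', 2), (1, 'Middle Section', 3)],
        names=['Rec', 'Section', 'Row']))

df.to_excel('frozen.xlsx')
# reading back still needs the index columns named explicitly
back = pd.read_excel('frozen.xlsx', index_col=[0, 1, 2])
print(list(back.index.names))  # ['Rec', 'Section', 'Row']
```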