I have a CSV dataset in which the features of each element are encoded as 0/1 values. I want to iterate over every cell and replace each 1 with the name of its column. There are more than 500 thousand rows and 200 columns, and because the table is exported from an annotation tool that I update often, I want to find a way to do this automatically in Python.
The data below is not the real table but a small sample I was using while trying to write the code. I tried a few approaches, but without success.
I would really appreciate it if you could share your knowledge with me; it would be a huge help. The final result I want is of the form (abonojnë, token_pos_verb). If you know a way to do this in Excel without Python, that would be even better.
Thank you,
Brikena
Text,Comment,Role,ParentID,doc_completeness,lemma,MultiWord_Expr,token,pos,punctuation,verb,noun,adjective
abonojnë,,,,,,,1,1,0,1,0,0
çokasin,,,,,,,1,1,0,1,0,1
gërgasin,,,,,,,1,1,0,1,0,0
godasin,,,,,,,1,1,0,1,0,0
përkasin,,,,,,,1,1,1,1,0,0
përdjegin,,,,,,,1,1,0,1,0,0
lakadredhin,,,,,,,1,1,0,1,1,0
përdredhin,,,,,,,1,1,0,1,0,0
spërdredhin,,,,,,,1,1,0,1,0,0
përmbledhin,,,,,,,1,1,0,1,0,0
shpërdredhin,,,,,,,1,1,0,1,0,0
arsejnë,,,,,,,1,1,0,1,1,0
çapëlejnë,,,,,,,1,1,0,1,0,0
Using pandas, this is quite easy:
# pip install pandas
import pandas as pd
# read data (here example with csv, but use "read_excel" for excel)
df = pd.read_csv('input.csv').set_index('Text')
# reshape and export
(df.mul(df.columns).where(df.eq(1))
.stack().rename('xxx')
.groupby(level=0).apply('_'.join)
).to_csv('output.csv') # here use "to_excel" for excel format
output file:
Text,xxx
abonojnë,token_pos_verb
arsejnë,token_pos_verb_noun
godasin,token_pos_verb
gërgasin,token_pos_verb
lakadredhin,token_pos_verb_noun
përdjegin,token_pos_verb
përdredhin,token_pos_verb
përkasin,token_pos_punctuation_verb
përmbledhin,token_pos_verb
shpërdredhin,token_pos_verb
spërdredhin,token_pos_verb
çapëlejnë,token_pos_verb
çokasin,token_pos_verb_adjective
An update for those who may find it helpful in the future. Thank you to @mozway for helping me. A friend of mine suggested working with an Excel formula instead, because the pandas solution with groupby collapses duplicate words into a single row. I need all the duplicates: this is an annotated corpus, so repeated words are normal and each one should appear in every context, not only at its first occurrence.
The alternative is this:
Use a second sheet in the Excel file, write the formula =IF(Sheet1!B2=1,Sheet2!B$1,"") in the first cell that corresponds to a 0-1 value, and drag it across all the other cells. As written, the formula takes the column name from row 1 of Sheet2, so the header row presumably has to be copied there first. This keeps every occurrence of the words. It's quick and it works like magic.
I hope this can be helpful to others who want to convert a 0-1 dataset to feature names without having to code.
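For anyone who would still rather stay in pandas, here is a minimal sketch that keeps duplicate rows by building the label row by row instead of grouping on the word. The inline sample data, the `features` column name and the `feature_cols` variable are my own assumptions for illustration, not part of the original answer.

```python
import io

import numpy as np
import pandas as pd

# Small inline sample (assumed for illustration); note the repeated word.
csv_text = """Text,Comment,token,pos,verb,noun
abonojnë,,1,1,1,0
arsejnë,,1,1,1,1
abonojnë,,1,0,0,1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Every column except the word itself is treated as a 0/1 feature.
feature_cols = [c for c in df.columns if c != "Text"]

# Boolean matrix: True wherever the cell equals 1 (empty cells become False).
mask = df[feature_cols].eq(1).to_numpy()
names = np.array(feature_cols, dtype=object)

# One label per row, so repeated words keep their own row and their own label.
df["features"] = ["_".join(names[row]) for row in mask]

print(df[["Text", "features"]])
```

Because nothing is grouped, the output has exactly as many rows as the input, and writing `df[["Text", "features"]]` out with `to_csv` keeps them in the original order.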
I have the following dataframe (Current Dataframe), loaded from a CSV, that I want to use for some sampling tests.
For that I want to keep all of the current columns, but split Element_Count and Tag_Count into separate columns built from the values they contain (e.g. link: 10 inside Element_Count, and likewise for Tag_Count).
I want to extract each value and turn it into a column. The final dataframe would be something like this (obviously depending on the values inside Element_Count and Tag_Count):
Index (the 0, 1, 2, etc. from the dataframe itself), PageID, Uri, A, AA, AAA, link, etc., html, etc. Here link stands for every key found in Element_Count, each becoming a column that holds its value (for the first row in the picture, 44 in the row for that specific URL), and html stands for every key found in Tag_Count across all rows, handled the same way.
The current code to generate the dataframe is the following:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore") #to ignore some warnings which have no effect in this particular case.
df = pd.read_csv('test.csv', sep=';')
df.head()
I have searched Google and this site for answers, to no avail.
I have tried changing the test CSV to achieve my goal, with no success. After seeing a question on here, I also tried:
pd.DataFrame(df.ranges.str.split(',').tolist())
but that did not give the desired result either.
Any ideas on how to achieve this with dataframes, or by any other method?
(If I have forgotten to mention anything that you feel is important for understanding the problem, please say so and I will edit it in.)
Edit :
Although logic would suggest that the element and tag counts should be dictionaries that are easy to split, that is not the case, as shown in the printed output.
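Since the actual cell contents are only visible in the screenshot, the following is just a sketch under a loud assumption: that each Element_Count and Tag_Count cell is a string containing name: number pairs (such as link: 44). The sample rows, the `PAIR` pattern and the `expand_counts` helper are hypothetical names made up for this illustration.

```python
import re

import pandas as pd

# Hypothetical sample; the real cell format may differ from this assumption.
df = pd.DataFrame({
    "PageID": [1, 2],
    "Uri": ["http://a.example", "http://b.example"],
    "Element_Count": ["link: 44, img: 3", "link: 10"],
    "Tag_Count": ["html: 1, div: 20", "html: 1, span: 5"],
})

# Matches "name: 123" pairs wherever they occur in the string.
PAIR = re.compile(r"([\w-]+)\s*:\s*(\d+)")


def expand_counts(series):
    """Turn a column of 'name: count' strings into one integer column per name."""
    rows = [
        {name: int(value) for name, value in PAIR.findall(str(cell))}
        for cell in series
    ]
    # Names missing from a row become 0 instead of NaN.
    return pd.DataFrame(rows, index=series.index).fillna(0).astype(int)


elements = expand_counts(df["Element_Count"])
tags = expand_counts(df["Tag_Count"])

# Keep the original columns and append the expanded ones.
result = pd.concat(
    [df.drop(columns=["Element_Count", "Tag_Count"]), elements, tags], axis=1
)
print(result)
```

If the same name can appear in both Element_Count and Tag_Count, the result would contain duplicate column names, so adding a prefix to one of the two expanded frames (for example with `add_prefix`) may be worth considering.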
I am trying to export a pandas dataframe to Excel so that all columns are in text format. By default, DataFrame.to_excel() lets Excel decide the data type. Exporting a column with [1,2,'w'] results in the cells containing 1 and 2 being numeric and the cell containing 'w' being text. I'd like all rows in the column to be text (i.e. ['1','2','w']).
I was able to solve the problem by assigning the column I need to be text using the .astype(str). However, if the data is large, I am concerned that I will run into performance issues. If I understand correctly, df[col] = df[col].astype(str) makes a copy of the data, which is not efficient.
import pandas as pd
df = pd.DataFrame({'a':[1,2,'w'], 'b':['x','y','z']})
df['a'] = df['a'].astype(str)
df.to_excel(r'c:\tmp\test.xlsx')
Is there a more efficient way to do this?
I searched SO several times and didn't see anything on this. Forgive me if this has been answered before. This is my first post, and I'm really happy to participate in this cool forum.
Edit: Thanks to the comments I've received, I see that "Converting a series of ints to strings - Why is apply much faster than astype?" gives me other options besides astype(str). This is really useful. I also wanted to know whether astype(str) was inefficient because it made a copy of the data, which I now see that it does not.
I don't think you'll have performance issues with that approach, since the converted column replaces the original rather than being kept alongside it. You can also convert the whole dataframe to string type using
df = df.astype(str)
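If only some columns need to be text, `astype` also accepts a mapping from column name to type. This is a small sketch on the sample frame from the question; the extra numeric column `n` is an assumption added only to show that other columns are left alone.

```python
import pandas as pd

# Sample frame from the question plus a hypothetical numeric column "n".
df = pd.DataFrame({
    "a": [1, 2, "w"],
    "b": ["x", "y", "z"],
    "n": [1.5, 2.5, 3.5],
})

# Convert only column "a" to strings; "b" and "n" keep their dtypes.
df = df.astype({"a": str})

print(df["a"].tolist())
print(df.dtypes)
```

After this, `df.to_excel(...)` can be called as in the question, with column `a` holding strings and `n` still numeric.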