I have a csv dataset with the values 0-1 for the features of the elements. I want to iterate each cell and replace the values 1 with the name of its column. There are more than 500 thousand rows and 200 columns and, because the table is exported from another annotation tool which I update often, I want to find a way in Python to do it automatically.
This is not the table, but a sample test which I was using while trying to write a code I tried some, but without success.
I would really appreciate it if you can share your knowledge with me. It will be a huge help. The final result I want to have is of the type: (abonojnë, token_pos_verb). If you know any method that I can do this in Excel without the help of Python, it would be even better.
Thank you,
Brikena
Text,Comment,Role,ParentID,doc_completeness,lemma,MultiWord_Expr,token,pos,punctuation,verb,noun,adjective
abonojnë,,,,,,,1,1,0,1,0,0
çokasin,,,,,,,1,1,0,1,0,1
gërgasin,,,,,,,1,1,0,1,0,0
godasin,,,,,,,1,1,0,1,0,0
përkasin,,,,,,,1,1,1,1,0,0
përdjegin,,,,,,,1,1,0,1,0,0
lakadredhin,,,,,,,1,1,0,1,1,0
përdredhin,,,,,,,1,1,0,1,0,0
spërdredhin,,,,,,,1,1,0,1,0,0
përmbledhin,,,,,,,1,1,0,1,0,0
shpërdredhin,,,,,,,1,1,0,1,0,0
arsejnë,,,,,,,1,1,0,1,1,0
çapëlejnë,,,,,,,1,1,0,1,0,0
Using pandas, this is quite easy:
# pip install pandas
import pandas as pd
# read data (here example with csv, but use "read_excel" for excel)
df = pd.read_csv('input.csv').set_index('Text')
# reshape and export
(df.mul(df.columns).where(df.eq(1))
.stack().rename('xxx')
.groupby(level=0).apply('_'.join)
).to_csv('output.csv') # here use "to_excel" for excel format
output file:
Text,xxx
abonojnë,token_pos_verb
arsejnë,token_pos_verb_noun
godasin,token_pos_verb
gërgasin,token_pos_verb
lakadredhin,token_pos_verb_noun
përdjegin,token_pos_verb
përdredhin,token_pos_verb
përkasin,token_pos_punctuation_verb
përmbledhin,token_pos_verb
shpërdredhin,token_pos_verb
spërdredhin,token_pos_verb
çapëlejnë,token_pos_verb
çokasin,token_pos_verb_adjective
An update to those who may find it helpful in the future. Thank you to #mozway for helping me. A friend of mine suggested working with Excel formula because the solution with Pandas and gropuby eliminates duplicates. Since I need all the duplicates, because it's an annotated corpus, it's normal that there are repeated words that should appear in every context, not only the first occurrence.
The other alternative is this:
Use a second sheet on the excel file, writing the formula =IF(Sheet1!B2=1,Sheet2!B$1,"") in the first cell with 0-1 values and drag it in all the other cells. This keeps all the occurrences of the words. It's quick and it works like magic.
I hope this can be helpful to others who want to convert a 0-1 dataset to feature names without having to code.
I'm trying to filter through a CSV and make a new CSV which is the exact same except for it gets rid of any rows that have a value of greater than 100 billion in the 'marketcap' column.
The code I've written so just spits out the same CSV as the original out over again and doesn't cut out any lines from the old CSV to the new CSV.
Code:
db = pd.read_csv('SF1_original.csv')
db = db[db['marketcap']<= 100000000000]
db.to_csv('new_SF1_original.csv')
Example of old CSV (It's long don't look through whole thing, just to give you an idea):
ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000
Example New CSV (Exact same when this line should have been cut):
,ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
0,A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000.0,7107000000.0,,4982000000.0,2125000000.0,,10.219,-30000000.0,1368000000.0,1368000000.0,1160000000.0,131000000.0,2.41,0.584,665000000.0,111000000.0,554000000.0,665000000.0,281000000.0,96000000.0,0.0,0.0,0.0,202000000.0,298000000.0,0.133,298000000.0,202000000.0,202000000.0,0.3,0.3,0.3,4486000000.0,,4486000000.0,50960600000.0,,,354000000.0,0.8059999999999999,1.0,1086000000.0,0.484,0.0,0.0,4337000000.0,,1567000000.0,42000000.0,42000000.0,0.0,2621000000.0,2067000000.0,554000000.0,51663600000.0,1368000000.0,-160000000.0,2068000000.0,111000000.0,0.0,1192000000.0,-208000000.0,-42000000.0,384000000.0,0.0,131000000.0,131000000.0,131000000.0,0.0,0.0,0.057999999999999996,915000000.0,171000000.0,635000000.0,0.0,11.517000000000001,,,1408000000.0,0.0,114.3,,,1445000000.0,131000000.0,2246000000.0,2246000000.0,290000000.0,,,,,0.0,625000000.0,1.0,452000000.0,439000000.0,440000000.0,5.1160000000000005,7107000000.0,0.0,71000000.0,113000000.0,16.189,2915000000.0
I've seen two questions somewhat related to this on StackOverflow, but they haven't helped me much. This one uses CSV library instead of pandas (which is an option for me). This one is more helpful since it uses pandas but still hasn't been interacted with and isn't exactly the same as my use case.
You can get the indexes of the rows with "marketcap" over 100 billion rows like so:
df.loc[df["marketcap"] > 100000000000]["marketcap"].index
All that's left to do is drop them from the DataFrame:
df.drop(df.loc[df["marketcap"] > 100000000000]["marketcap"].index, inplace=True)
Reading from CSV and writing to the CSV is already correctly taken care of in your code.