I am using a CSV file of crypto news headlines to practice string manipulation and string methods. The CSV looks something like this:
publishdate headlinetext
20130504 COnSTELlATIon DaG iS nOW liStEd On kucoiN eXC?haNGE
20130511 ItA*lys cRypTOCUrREnCy BITgrAil suspeNds OpERatIOnS
20130511 THe diffeRENCe bETWEEn sHarEs aNd cRYpToCUrReN€CiES
20130512 fedS seIzE 47 mIlLION In bItCoinS in FAke ID ST=ing
20130514 ThE diG sTarteD ASiCboOST neTwORK AnD b#ItcoIN cAsH
20130516 BINAncE far atualiZAO progRaMadA NEsTa QuarTAFEIR?a
20130516 tHe EUropeaN UniOn IS pLaNninG tO rEgULAtE Bi=TcOIn
20130516 i!BeROBiT HELpiNG bItcOIn To GO MAinStream IN sPaIN
20130521 EuropES sMALLEr €banKS WELComE CrypTOcuRRENCy uSerS
20130604 BiTcoIn btc hIghER bTc= price BRiNgs mOrE Btc SCAmS
20130610 BITCOin brEAkS 9000 iN latEST LANDmARK Pr#iCE pOinT
20130613 ubcoiN mArkEt movEs ItS HEAdQUArTERS To €SiNgaPorE
20130624 reeds jEwelErs TaKinG bitCOiN ONLiNe And IN Sto$Res
20130705 CoNtrOvE!RSY turnS to cLoSuRe as LItePAy SHUts DowN
20130709 bUll rESiSTAncE BITCOIn pRicE nEeDs brEAK AbOve 9K*
20130714 DIVoRcE DISpUte co#Uple fIghtS ovEr 830k Of BITcoin
20130718 10K agaIn For BitcoiN buT oT!her CRyptOs OUTperfORM
20130724 FACebOoKS liBRa crYptoCUrrency wHER$E aRE ThE BANkS
20130726 COULd eNjIn coiN ReacH A neW AlltiM=E hIGh in APRIL
20130827 the GReaT Tug of WaR betWE=eN bItCOiNS anD AltCOINs
20130827 The SacRAMento kINgs mINE EthEreuM ETh for C#HArItY
20130905 cryPtOCuRREncY aTMs tHE KEY T*o WIdeSPRead ADoPtiOn
20130909 GraySCales EtHereUM TRusT pRICE= VaLUeS Eth aT 6000
(...)
Then I used pandas to read the CSV.
import pandas as pd
news_headlines = pd.read_csv('/content/sample_data/crypto_headlines.csv')
news_headlines
Now I need to get at the strings so I can work with them: change them to lower or upper case and then remove the special characters.
However, I don't know which method I should use to extract a string from the variable I created, news_headlines.
Let's say I wanted to extract the 2nd row, with the publish date 20130511.
Any help?
Thanks in advance
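You can use iloc, as in the example below, to extract the string in the second row. iloc is an indexer, so it takes square brackets rather than parentheses, and its positions are zero-based, so [1, 1] means the second row and the second column: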
news_headlines.iloc[1, 1]
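Once the headline is out of the DataFrame it is an ordinary Python string, so the usual str methods apply. Here is a minimal sketch. It uses the second headline from the sample as a literal instead of reading the CSV, and it assumes that "special characters" means anything other than ASCII letters, digits and spaces:

```python
import re

# Second headline from the sample, i.e. the value news_headlines.iloc[1, 1] would return
headline = "ItA*lys cRypTOCUrREnCy BITgrAil suspeNds OpERatIOnS"

lowered = headline.lower()
uppered = headline.upper()

# Assumption: "special characters" = anything that is not a-z, 0-9 or a space
cleaned = re.sub(r"[^a-z0-9 ]", "", lowered)
print(cleaned)  # italys cryptocurrency bitgrail suspends operations
```

To clean the whole column at once, pandas offers vectorized equivalents such as news_headlines['headlinetext'].str.lower() and .str.replace(pattern, '', regex=True), assuming the column is named headlinetext as in the sample header.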
Please forgive me, I am teaching myself and am pretty new to this. I have searched and found nothing that addressed my issue; there are lots of answers dealing with os and file directories, but I couldn't figure out how to apply them here. I am also not very familiar with regex; I tried using it as well but kept getting errors.
So I have a large text file (9GB) that is actually a list of discussion board posts.
I have a list of words I want to add to the stoplist for topic modeling. (I can do that)
However I also want to add any term that ends with any of the words in my list.
A sample of my data and lists is below.
txt = ['satoshiFounderSr MemberOfflineActivity Merit Welcome to the new Bitcoin forumNovember PMMeritedbyVlad Vlad Claymore krogothmanhattan negeroy Referee Vod suchmoon alani Lesbian Cow cryptohunter hv janggernaut matt Jeremycoin MaoChao Kda roslinpl gold MicroGuy elokk notaek BitcoinFX EcuaMobi Lutpin Lincoln Echo Nomad avatar kiyoshi saugwurm BALIK anggriani teeGUMES dooglus bitbollo klarki franckuestein legendster techman Provok mrcash paxmao jeks Cent MrCryptHodl DireWolfM BarbieCasino theunbeatable mindrust fillippone Mister k LFC Bitcoin nutildah Oceat digit Woshib ubay undeadbitcoiner pushups btcrocks realdantreccia Dq Atabey limtjoehua LoyceV anonymousminer MagicByt vizique coinlocket Altcoinsintel baeva OgNasty o solo miner Janation Kalemder sujonali MoparMiningLLC Eddyc jonemil Kryptowolf green slmn TyfrTR cr mprep Searing EFS adaseb notbatman Lucius boltz layer gfx seoincorporation AGD Phinnaeus Gage tabas pawel Lafu pangu Blind Legs Parker itod Potato Chips wonko Arriemoller Coin ruletheworld Halab coupable o e l e o TheBeardedBaby MoxnatyShmel monsanto amishmanish xtraelv Husna QA madnessteat Bthd taikuri dvd rw Toxic styca WorldCoiner bubbalex xyzzy V saya jets crypto trader xzEXrP xlcus solosequenosenada VB MishaSER dragonvslinux Zocadas jahepahit risatrakib chimk Porfirii YuT Coin adrianto famososMuertos angel Financisto RareFortune jakoylantern bere kin mdayonliner sncc squallw cryptjh jazmuzika wishxy markleal BlackHatCoiner an sha ldah DEMENTOR mustangy TaShoKi Adriane Poker Player StackItUp PIOUPIOU loreRex tasadar wego Gustavo Livecoins Palmholder CryptoPravda barjan Crypto Collection collapse jukeee Cuk ng bitc in LBTC Pyrojason M BTC vanobe shortcircuit Toqo Vxv BiT pOL songsunling bitcoinokulu AlexMay Kaonashi Neo Baudrillard RussaX morkaii Welcome to the new Bitcoin forumThe old forum can still be reached here http bitcoinsourceforgenet boards indexphpI ll repost some selected threads here and add updated answers to questions where I 
canFAQhttp bitcoinsourceforgenet wiki indexphppage FAQDownloadhttp sourceforgenet projects bitcoin files satoshiFounderSr MemberOfflineActivity Merit Welcome to the new Bitcoin forumNovember PMMeritedbyVlad Vlad Claymore krogothmanhattan negeroy Referee Vod suchmoon alani Lesbian Cow cryptohunter hv janggernaut matt Jeremycoin MaoChao Kda roslinpl gold MicroGuy elokk notaek BitcoinFX EcuaMobi Lutpin Lincoln Echo Nomad avatar kiyoshi saugwurm BALIK anggriani teeGUMES dooglus bitbollo klarki franckuestein legendster techman Provok mrcash paxmao jeks Cent MrCryptHodl DireWolfM BarbieCasino theunbeatable mindrust fillippone Mister k LFC Bitcoin nutildah Oceat digit Woshib ubay undeadbitcoiner pushups btcrocks realdantreccia Dq Atabey limtjoehua LoyceV anonymousminer MagicByt vizique coinlocket Altcoinsintel baeva OgNasty o solo miner Janation Kalemder sujonali MoparMiningLLC Eddyc jonemil Kryptowolf green slmn TyfrTR cr mprep Searing EFS adaseb notbatman Lucius boltz layer gfx seoincorporation AGD Phinnaeus Gage tabas pawel Lafu pangu Blind Legs Parker itod Potato Chips wonko Arriemoller Coin ruletheworld Halab coupable o e l e o TheBeardedBaby MoxnatyShmel monsanto amishmanish xtraelv Husna QA madnessteat Bthd taikuri dvd rw Toxic styca WorldCoiner bubbalex xyzzy V saya jets crypto trader xzEXrP xlcus solosequenosenada VB MishaSER dragonvslinux Zocadas jahepahit risatrakib chimk Porfirii YuT Coin adrianto famososMuertos angel Financisto RareFortune jakoylantern bere kin mdayonliner sncc squallw cryptjh jazmuzika wishxy markleal BlackHatCoiner an sha ldah DEMENTOR mustangy TaShoKi Adriane Poker Player StackItUp PIOUPIOU loreRex tasadar wego Gustavo Livecoins Palmholder CryptoPravda barjan Crypto Collection collapse jukeee Cuk ng bitc in LBTC Pyrojason M BTC vanobe shortcircuit Toqo Vxv BiT pOL songsunling bitcoinokulu AlexMay Kaonashi Neo Baudrillard RussaX morkaii Welcome to the new Bitcoin forumThe old forum can still be reached here http bitcoinsourceforgenet boards 
indexphpI ll repost some selected threads here and add updated answers to questions where I canFAQhttp bitcoinsourceforgenet wiki indexphppage FAQDownloadhttp sourceforgenet projects bitcoin files satoshiFounderSr MemberOfflineActivity Merit Welcome to the new Bitcoin forumNovember PMMeritedbyVlad Vlad Claymore krogothmanhattan negeroy Referee Vod suchmoon alani Lesbian Cow cryptohunter hv janggernaut matt Jeremycoin MaoChao Kda roslinpl gold MicroGuy elokk notaek BitcoinFX EcuaMobi Lutpin Lincoln Echo Nomad avatar kiyoshi saugwurm BALIK anggriani teeGUMES dooglus bitbollo klarki franckuestein legendster techman Provok mrcash paxmao jeks Cent MrCryptHodl DireWolfM BarbieCasino theunbeatable mindrust fillippone Mister k LFC Bitcoin nutildah Oceat digit Woshib ubay undeadbitcoiner pushups btcrocks realdantreccia Dq Atabey limtjoehua LoyceV anonymousminer MagicByt vizique coinlocket Altcoinsintel baeva OgNasty o solo miner Janation Kalemder sujonali MoparMiningLLC Eddyc jonemil Kryptowolf green slmn TyfrTR cr mprep Searing EFS adaseb notbatman Lucius boltz layer gfx seoincorporation AGD Phinnaeus Gage tabas pawel Lafu pangu Blind Legs Parker itod Potato Chips wonko Arriemoller Coin ruletheworld Halab coupable o e l e o TheBeardedBaby MoxnatyShmel monsanto amishmanish xtraelv Husna QA madnessteat Bthd taikuri dvd rw Toxic styca WorldCoiner bubbalex xyzzy V saya jets crypto trader xzEXrP xlcus solosequenosenada VB MishaSER dragonvslinux Zocadas jahepahit risatrakib chimk Porfirii YuT Coin adrianto famososMuertos angel Financisto RareFortune jakoylantern bere kin mdayonliner sncc squallw cryptjh jazmuzika wishxy markleal BlackHatCoiner an sha ldah DEMENTOR mustangy TaShoKi Adriane Poker Player StackItUp PIOUPIOU loreRex tasadar wego Gustavo Livecoins Palmholder CryptoPravda barjan Crypto Collection collapse jukeee Cuk ng bitc in LBTC Pyrojason M BTC vanobe shortcircuit Toqo Vxv BiT pOL songsunling bitcoinokulu AlexMay Kaonashi Neo Baudrillard RussaX morkaii Welcome to 
the new Bitcoin forumThe old forum can still be reached here http bitcoinsourceforgenet boards indexphpI ll repost some selected threads here and add updated answers to questions where I canFAQhttp bitcoinsourceforgenet wiki indexphppage FAQDownloadhttp sourceforgenet projects bitcoin files Welcome to the new Bitcoin forumNovember PMMeritedbyVlad Vlad Claymore krogothmanhattan negeroy Referee Vod suchmoon alani Lesbian Cow cryptohunter hv janggernaut matt Jeremycoin MaoChao Kda roslinpl gold MicroGuy elokk notaek BitcoinFX EcuaMobi Lutpin Lincoln Echo Nomad avatar kiyoshi saugwurm BALIK anggriani teeGUMES dooglus bitbollo klarki franckuestein legendster techman Provok mrcash paxmao jeks Cent MrCryptHodl DireWolfM BarbieCasino theunbeatable mindrust fillippone Mister k LFC Bitcoin nutildah Oceat digit Woshib ubay undeadbitcoiner pushups btcrocks realdantreccia Dq Atabey limtjoehua LoyceV anonymousminer MagicByt vizique coinlocket Altcoinsintel baeva OgNasty o solo miner Janation Kalemder sujonali MoparMiningLLC Eddyc jonemil Kryptowolf green slmn TyfrTR cr mprep Searing EFS adaseb notbatman Lucius boltz layer gfx seoincorporation AGD Phinnaeus Gage tabas pawel Lafu pangu Blind Legs Parker itod Potato Chips wonko Arriemoller Coin ruletheworld Halab coupable o e l e o TheBeardedBaby MoxnatyShmel monsanto amishmanish xtraelv Husna QA madnessteat Bthd taikuri dvd rw Toxic styca WorldCoiner bubbalex xyzzy V saya jets crypto trader xzEXrP xlcus solosequenosenada VB MishaSER dragonvslinux Zocadas jahepahit risatrakib chimk Porfirii YuT Coin adrianto famososMuertos angel Financisto RareFortune jakoylantern bere kin mdayonliner sncc squallw cryptjh jazmuzika wishxy markleal BlackHatCoiner an sha ldah DEMENTOR mustangy TaShoKi Adriane Poker Player StackItUp PIOUPIOU loreRex tasadar wego Gustavo Livecoins Palmholder CryptoPravda barjan Crypto Collection collapse jukeee Cuk ng bitc in LBTC Pyrojason M BTC vanobe shortcircuit Toqo Vxv BiT pOL songsunling bitcoinokulu AlexMay 
Kaonashi Neo Baudrillard RussaX morkaii ',
'satoshiFounderSr MemberOfflineActivity Merit Repost Bitcoin MaturationNovember PMMeritedbyescrowms NeuroticFish finist x icopress jankeman bitcoinbitcoin Bitcoin MaturationPosted Thu of Oct UTC From the user s perspective the bitcoin maturation process can be broken down into stages The initial network transaction that occurs when you first click Generate Coins The time between that initial network transaction and when the bitcoin entry is ready to appear in the All Transactions list The change of the bitcoin entry from outside the All Transaction field to inside it The time between when the bitcoin appears in the All Transfers list and when the Description is ready to change to Generated matures in x more blocks The change of the Description to Generated matures in x more blocks The time between when the Description says Generated matures in x more blocks to when it is ready to change to Generated The change of the Description to Generated The time after the Description has changed to GeneratedWhich stages require network connectivity significant local CPU usage and or significant remote CPU usage Do any of these stages have names sirius m Re Bitcoin MaturationPosted Thu of Oct UTC As far as I know there s no network transaction when you click Generate Coins your computer just starts calculating the next proof of work The CPU usage is when you re generating coinsIn this example the network connection is used when you broadcast the information about the proof of work block you ve created that which entitles you to the new coin Generating coins successfully requires constant connectivity so that you can start working on the next block when someone gets the current block before yousatoshiFounderSr MemberOfflineActivity Merit Repost Bitcoin MaturationNovember PMMeritedbyescrowms NeuroticFish finist x icopress jankeman bitcoinbitcoin Bitcoin MaturationPosted Thu of Oct UTC From the user s perspective the bitcoin maturation process can be broken down into stages The 
initial network transaction that occurs when you first click Generate Coins The time between that initial network transaction and when the bitcoin entry is ready to appear in the All Transactions list The change of the bitcoin entry from outside the All Transaction field to inside it The time between when the bitcoin appears in the All Transfers list and when the Description is ready to change to Generated matures in x more blocks The change of the Description to Generated matures in x more blocks The time between when the Description says Generated matures in x more blocks to when it is ready to change to Generated The change of the Description to Generated The time after the Description has changed to GeneratedWhich stages require network connectivity significant local CPU usage and or significant remote CPU usage Do any of these stages have names sirius m Re Bitcoin MaturationPosted Thu of Oct UTC As far as I know there s no network transaction when you click Generate Coins your computer just starts calculating the next proof of work The CPU usage is when you re generating coinsIn this example the network connection is used when you broadcast the information about the proof of work block you ve created that which entitles you to the new coin Generating coins successfully requires constant connectivity so that you can start working on the next block when someone gets the current block before yousatoshiFounderSr MemberOfflineActivity Merit Repost Bitcoin MaturationNovember PMMeritedbyescrowms NeuroticFish finist x icopress jankeman bitcoinbitcoin Bitcoin MaturationPosted Thu of Oct UTC From the user s perspective the bitcoin maturation process can be broken down into stages The initial network transaction that occurs when you first click Generate Coins The time between that initial network transaction and when the bitcoin entry is ready to appear in the All Transactions list The change of the bitcoin entry from outside the All Transaction field to inside it The 
time between when the bitcoin appears in the All Transfers list and when the Description is ready to change to Generated matures in x more blocks The change of the Description to Generated matures in x more blocks The time between when the Description says Generated matures in x more blocks to when it is ready to change to Generated The change of the Description to Generated The time after the Description has changed to GeneratedWhich stages require network connectivity significant local CPU usage and or significant remote CPU usage Do any of these stages have names sirius m Re Bitcoin MaturationPosted Thu of Oct UTC As far as I know there s no network transaction when you click Generate Coins your computer just starts calculating the next proof of work The CPU usage is when you re generating coinsIn this example the network connection is used when you broadcast the information about the proof of work block you ve created that which entitles you to the new coin Generating coins successfully requires constant connectivity so that you can start working on the next block when someone gets the current block before youRepost Bitcoin MaturationNovember PMMeritedbyescrowms NeuroticFish finist x icopress jankeman ',
'satoshiFounderSr MemberOfflineActivity Merit Repost Request Make this anonymousNovember PMMeritedbyxtraelv anonguy Request Make this anonymousPosted Thu of Oct UTC Are there any plans to make this service anonymouseg Being able to route BitCoin through TorsatoshiFounderSr MemberOfflineActivity Merit Repost Request Make this anonymousNovember PMMeritedbyxtraelv anonguy Request Make this anonymousPosted Thu of Oct UTC Are there any plans to make this service anonymouseg Being able to route BitCoin through TorsatoshiFounderSr MemberOfflineActivity Merit Repost Request Make this anonymousNovember PMMeritedbyxtraelv anonguy Request Make this anonymousPosted Thu of Oct UTC Are there any plans to make this service anonymouseg Being able to route BitCoin through TorRepost Request Make this anonymousNovember PMMeritedbyxtraelv ',
'satoshiFounderSr MemberOfflineActivity Merit Re Repost Bitcoin MaturationNovember PMMeritedbyhold coins NeuroticFish It s important to have network connectivity while you re trying to generate a coin block and at the moment it is successfully generated During generation when the status bar says Generating and you re using CPU to find a proof of work you must constantly keep in contact with the network to receive the latest block If your block does not link to the latest block it may not be accepted When you successfully generate a block it is immediately broadcast to the network Other nodes must receive it and link to it for it to be accepted as the new latest blockThink of it as a cooperative effort to make a chain When you add a link you must first find the current end of the chain If you were to locate the last link then go off for an hour and forge your link come back and link it to the link that was the end an hour ago others may have added several links since then and they re not going to want to use your link that now branches off the middleAfter a block is created the maturation time of blocks is to make absolutely sure the block is part of the main chain before it can be spent Your node isn t doing anything with the block during that time just waiting for other blocks to be added after yours You don t have to be online during that timesatoshiFounderSr MemberOfflineActivity Merit Re Repost Bitcoin MaturationNovember PMMeritedbyhold coins NeuroticFish It s important to have network connectivity while you re trying to generate a coin block and at the moment it is successfully generated During generation when the status bar says Generating and you re using CPU to find a proof of work you must constantly keep in contact with the network to receive the latest block If your block does not link to the latest block it may not be accepted When you successfully generate a block it is immediately broadcast to the network Other nodes must receive it and link to it for 
it to be accepted as the new latest blockThink of it as a cooperative effort to make a chain When you add a link you must first find the current end of the chain If you were to locate the last link then go off for an hour and forge your link come back and link it to the link that was the end an hour ago others may have added several links since then and they re not going to want to use your link that now branches off the middleAfter a block is created the maturation time of blocks is to make absolutely sure the block is part of the main chain before it can be spent Your node isn t doing anything with the block during that time just waiting for other blocks to be added after yours You don t have to be online during that timesatoshiFounderSr MemberOfflineActivity Merit Re Repost Bitcoin MaturationNovember PMMeritedbyhold coins NeuroticFish It s important to have network connectivity while you re trying to generate a coin block and at the moment it is successfully generated During generation when the status bar says Generating and you re using CPU to find a proof of work you must constantly keep in contact with the network to receive the latest block If your block does not link to the latest block it may not be accepted When you successfully generate a block it is immediately broadcast to the network Other nodes must receive it and link to it for it to be accepted as the new latest blockThink of it as a cooperative effort to make a chain When you add a link you must first find the current end of the chain If you were to locate the last link then go off for an hour and forge your link come back and link it to the link that was the end an hour ago others may have added several links since then and they re not going to want to use your link that now branches off the middleAfter a block is created the maturation time of blocks is to make absolutely sure the block is part of the main chain before it can be spent Your node isn t doing anything with the block during that 
time just waiting for other blocks to be added after yours You don t have to be online during that timeRe Repost Bitcoin MaturationNovember PMMeritedbyhold coins NeuroticFish ',
'satoshiFounderSr MemberOfflineActivity Merit Re Repost Request Make this anonymousNovember PM There will be a proxy setting in version so you can connect through TOR I ve done a careful scrub to make sure it doesn t use DNS or do anything that would leak your IP while in proxy modesatoshiFounderSr MemberOfflineActivity Merit Re Repost Request Make this anonymousNovember PM There will be a proxy setting in version so you can connect through TOR I ve done a careful scrub to make sure it doesn t use DNS or do anything that would leak your IP while in proxy modesatoshiFounderSr MemberOfflineActivity Merit Re Repost Request Make this anonymousNovember PM There will be a proxy setting in version so you can connect through TOR I ve done a careful scrub to make sure it doesn t use DNS or do anything that would leak your IP while in proxy modeRe Repost Request Make this anonymousNovember PM ',
'satoshiFounderSr MemberOfflineActivity Merit Repost How anonymous are bitcoinsNovember PMMeritedbylivingfree xtraelv bitcoinbitcoin How anonymous are bitcoinsCan nodes on the network tell from which and or to which bitcoin address coins are being sent Do blocks contain a history of where bitcoins have been transfered to and from Can nodes tell which bitcoin addresses belong to which IP addresses Is there a command line option to enable the sock proxy the first time that bitcoin starts What happens if you send bitcoins to an IP address that has multiple clients connected through network address translation NAT satoshiFounderSr MemberOfflineActivity Merit Repost How anonymous are bitcoinsNovember PMMeritedbylivingfree xtraelv bitcoinbitcoin How anonymous are bitcoinsCan nodes on the network tell from which and or to which bitcoin address coins are being sent Do blocks contain a history of where bitcoins have been transfered to and from Can nodes tell which bitcoin addresses belong to which IP addresses Is there a command line option to enable the sock proxy the first time that bitcoin starts What happens if you send bitcoins to an IP address that has multiple clients connected through network address translation NAT satoshiFounderSr MemberOfflineActivity Merit Repost How anonymous are bitcoinsNovember PMMeritedbylivingfree xtraelv bitcoinbitcoin How anonymous are bitcoinsCan nodes on the network tell from which and or to which bitcoin address coins are being sent Do blocks contain a history of where bitcoins have been transfered to and from Can nodes tell which bitcoin addresses belong to which IP addresses Is there a command line option to enable the sock proxy the first time that bitcoin starts What happens if you send bitcoins to an IP address that has multiple clients connected through network address translation NAT Repost How anonymous are bitcoinsNovember PMMeritedbylivingfree xtraelv ']
from nltk.corpus import stopwords
stop = list(stopwords.words("english"))
stop.append("brand newofflineactivity")
stop.append("newbieofflineactivity")
stop.append("jr. memberofflineactivity")
stop.append("memberofflineactivity")
stop.append("full memberofflineactivity")
stop.append("sr. memberofflineactivity")
stop.append("hero memberofflineactivity")
stop.append("legendaryofflineactivity")
stop.append("vipofflineactivity")
stop.append("donaterofflineactivity")
stop.append("staffofflineactivity")
stop.append("moderatorofflineactivity")
stop.append("global moderatorofflineactivity")
stop.append("administratorofflineactivity")
stop.append("founderofflineactivity")
stop.append("merit")
stop.append("re")
stop.append("bitcoin")
stop.append("bitcoins")
stop.append("brand")
stop.append("full")
stop.append("global")
stop.append("hero")
stop.append("jr")
stop.append("newofflineactivity")
stop.append("sr")
stop.append("pm")
stop.append("com")
stop.append("www")
stop.append("http")
stop.append("")
stpexp = ['*brand newofflineactivity', '*newbieofflineactivity', '*jr. memberofflineactivity',
'*memberofflineactivity', '*full memberofflineactivity', '*sr. memberofflineactivity',
'*hero memberofflineactivity', '*legendaryofflineactivity', '*vipofflineactivity',
'*donaterofflineactivity', '*staffofflineactivity', '*moderatorofflineactivity', '*global moderatorofflineactivity',
'*administratorofflineactivity', '*founderofflineactivity']
So I want to search all of the items in 'txt' for all of the variations of the words in the list 'stpexp' and then append all of those variations to my list of stopwords 'stop'.
Any assistance would be greatly appreciated.
You can use the built-in fnmatch library.
For example, if you want to find all the words in your text that end with 'thepattern', you can do it like this:
import fnmatch
txt = ["longerthepattern is the word i want", "thisisthepattern and it works"]
pattern = '*thepattern'
to_add_to_stoplist = []
for sentence in txt:
    # keep only the words of this sentence that match the wildcard pattern
    filters = fnmatch.filter(sentence.split(" "), pattern)
    to_add_to_stoplist += filters
print(to_add_to_stoplist)
And it outputs:
['longerthepattern', 'thisisthepattern']
You can add this list of words to the stopwords.
EDIT:
Here is a version that uses list comprehensions to check multiple patterns. It no longer uses fnmatch; it uses the str.endswith method instead.
Note that str.endswith requires the patterns to be a tuple, not a list.
txt = ["longerthepattern removeme is the word i want", "thisisthepattern and it works"]
patterns = ("pattern","veme")
def my_func(sentence):
    # words of the sentence whose lowercase form ends with any of the patterns
    return [x for x in sentence.split(" ") if x.lower().endswith(patterns)]
to_add_to_stop = [word for sentence in txt for word in my_func(sentence)]
print(to_add_to_stop)
It outputs:
['longerthepattern', 'removeme', 'thisisthepattern']
SECOND EDIT:
I added the .lower() call in the list comprehension to ensure that the words we compare with the patterns are all lowercase, since the patterns are lowercase as well.
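Applied to the question's data, the same idea could look like the sketch below. The sample lines and the two suffixes are shortened, partly made-up stand-ins for the real posts and for the stpexp list, and words_ending_with is a hypothetical helper name. Matches are collected in a set so that a 9GB input does not add the same word over and over:

```python
# Suffixes must be a tuple for str.endswith; lowercase, without the leading '*'
suffixes = ("memberofflineactivity", "legendaryofflineactivity")

def words_ending_with(lines, suffixes):
    """Return the set of lowercased words that end with any of the suffixes."""
    found = set()
    for line in lines:
        for word in line.split():
            lowered = word.lower()
            if lowered.endswith(suffixes):
                found.add(lowered)
    return found

# Illustrative sample; for a large file, pass the open file object instead,
# which yields one line at a time without loading everything into memory.
sample = [
    "satoshiFounderSr MemberOfflineActivity Merit Welcome",
    "someuserFullMemberOfflineActivity LegendaryOfflineActivity Merit Re",
]

stop = ["merit", "re"]  # stand-in for the existing stop list
stop.extend(sorted(words_ending_with(sample, suffixes)))
print(stop)
```

Note that splitting on whitespace means a multi-word entry such as 'sr. memberofflineactivity' can never match as a single word; only its last word is compared.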
Check the following text piece
IN THE HIGH COURT OF GUJARAT AT AHMEDABAD
R/CRIMINAL APPEAL NO. 251 of 2009
FOR APPROVAL AND SIGNATURE:
HONOURABLE MR.JUSTICE R.P.DHOLARIA
==========================================================
1 Whether Reporters of Local Papers may be allowed to see the judgment ?
2 To be referred to the Reporter or not ?
3 Whether their Lordships wish to see the fair copy of the judgment ?
4 Whether this case involves a substantial question of law as to the interpretation of the Constitution of India or any order made thereunder ?
========================================================== STATE OF GUJARAT,S M RAO,FOOD INSPECTOR,OFFICE OF THE Versus DHARMESHBHAI NARHARIBHAI GANDHI ========================================================== Appearance: MS HB PUNANI, APP (2) for the Appellant(s) No. 1 MR DK MODI(1317) for the Opponent(s)/Respondent(s) No. 1 ==========================================================
CORAM: HONOURABLE MR.JUSTICE R.P.DHOLARIA
Date : 12/03/2019
ORAL JUDGMENT
1. The appellant State of Gujarat has
preferred the present appeal under section 378(1)
(3) of the Code of Criminal Procedure, 1973
against the judgment and order of acquittal dated
Page 1 of 12
R/CR.A/251/2009 JUDGMENT
17.11.2008 rendered by learned 2nd Additional
Civil Judge and Judicial Magistrate, First Class,
Nadiad in Food Case No.1 of 2007.
The short facts giving rise to the
present appeal are that on 10.11.2006 at about
18.00 hours, the complainant visited the place of
the respondent accused situated at Juna
Makhanpura, Rabarivad, Nadiad along with panch
witness and the respondent was found dealing in
provisional items. The complainant identified
himself as a Food Inspector and after giving
intimation in Form No.6 has purchased muddamal
sample of mustard seeds in the presence of the
panchas for the purpose of analysis. Thereafter,
the complainant Food Inspector has divided the
said sample in equal three parts and after
completing formalities of packing and sealing
obtained signatures of the vendor and panchas and
out of the said three parts, one part was sent to
the Public Analyst, Vadodara for analysis and
remaining two parts were sent to the Local Health
Authority, Gandhinagar. Thereafter, the Public
Analyst forwarded his report. In the said report,
it is stated that the muddamal sample of mustard
seeds is misbranded which is in breach of the
provisions of the Food Adulteration Act, 1954
(for short “the Act”) and the Rules framed
thereunder. It is alleged that, therefore, the
sample of mustard seeds was misbranded and,
thereby, the accused has committed the offence.
**Page 2 of 12
R/CR.A/251/2009* JUDGMENT*
Hence, the complaint came to be lodged against
the respondent accused.
I want to write a program that follows the constraints below. Bear in mind that this is only a single file; I have about 40k files and the program should run on all of them. The files differ somewhat, but the basic format of every file is the same.
Constraints.
It should start the text extraction after the "metadata". Metadata is the data about the file from the start of the file, i.e. "In the high court of gujarat", up to "Oral Judgment". In all the files I have, there are various POINTS after this metadata ends, and I need each of these points as a separate paragraph (the text above has 2 points, and I need them in different paragraphs).
Check the lines in italics: these are the page headers in the text/PDF file. I need to remove them, as they have no meaning for the text content I want.
The files are available in both TEXT and PDF format, so I can use either. But I am new to Python, so I don't know how or where to start; I have only basic knowledge of Python.
This data is going to be made into a "corpus" for further processing in building a large expert system, so I hope it is clear what needs to be done.
Read the official Python docs!
Start with Python's basic str type and its methods. One of its methods, find, returns the position of a substring in your text (or -1 if it is not there).
Use Python's slicing notation to extract the portion of text you need, e.g.
text = """YOUR TEXT HERE..."""
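meta_start = 'IN THE HIGH COURT OF GUJARAT'  # str.find is case-sensitive, so match the file's capitalization
meta_end = 'ORAL JUDGMENT'
pos1 = text.find(meta_start)
pos2 = text.find(meta_end)
if pos2 > pos1 and pos1 > -1:
    # both markers found: slice with the positions (not the marker strings)
    # to drop the metadata and keep everything after 'ORAL JUDGMENT'
    text1 = text[pos2 + len(meta_end):]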
After that, you can go ahead and save your extracted text to a database.
Of course, a better and more complicated solution would be to use regular expressions, but that's another story -- try finding the right way for yourself!
As to italics and other text formatting, you won't ever be able to mark it out in plain text (unless you have some 'meta' markers, like e.g. [i] tags).
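The remaining two constraints, dropping the page headers and splitting the body into points, could be sketched with the re module as follows. This relies on some assumptions: page headers are a "Page N of M" line followed by a line starting with "R/" and ending in "JUDGMENT", each point starts with a number and a full stop, and the second point in the sample is numbered "2." (the original text does not show its number). The sample is a shortened stand-in and extract_points is a hypothetical helper name:

```python
import re

# Assumed header shapes; the optional asterisks cover leftover italics markers
PAGE_LINE = re.compile(r"^\**Page \d+ of \d+\**$")
CASE_LINE = re.compile(r"^\**R/\S+\s+JUDGMENT\**$")
# A point starts with a number, a full stop and whitespace ("1. ", "2. ").
# Dates such as "17.11.2008" do not match because no space follows the dot.
POINT_START = re.compile(r"^\d+\.\s")

def extract_points(text, marker="ORAL JUDGMENT"):
    """Return the numbered points after the metadata, one string per point."""
    pos = text.find(marker)
    if pos == -1:
        return []
    body = text[pos + len(marker):]
    lines = [ln.strip() for ln in body.splitlines()]
    kept = [ln for ln in lines
            if ln and not PAGE_LINE.match(ln) and not CASE_LINE.match(ln)]
    points = []
    for ln in kept:
        if POINT_START.match(ln) or not points:
            points.append(ln)        # start a new point
        else:
            points[-1] += " " + ln   # continuation of the current point
    return points

sample = """IN THE HIGH COURT OF GUJARAT AT AHMEDABAD
CORAM: HONOURABLE MR.JUSTICE R.P.DHOLARIA
ORAL JUDGMENT
1. The appellant State of Gujarat has
preferred the present appeal
Page 1 of 12
R/CR.A/251/2009 JUDGMENT
17.11.2008 rendered by learned Judge.
2. The short facts giving rise to the
present appeal are as follows.
"""

points = extract_points(sample)
print("\n\n".join(points))
```

Running the same function over each of the 40k files is then a plain loop over the file paths; the patterns would likely need adjusting once they are tried against more than one judgment.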
I'm making an application to clean cellphone numbers. I'm using the phonenumbers package, so I call phonenumbers.parse(Cell No, Country Initials)
and on the console the result looks like this:
Country Code: ### National Number: ###
I was planning to just delete some text to get the area code and national number, but I remembered that the lengths of the area code and national number differ between countries.
Is there a way to get just the area code and the national number separately?
phonenumbers.parse returns a PhoneNumber object.
After y = phonenumbers.parse("020 8366 1177", "GB"), you can access the attributes by e.g. y.country_code or set them by y.country_code = 49 etc.
Extraction of the area code is a bit tricky as not all countries have the concept of an area code. See this code snippet of how to correctly get the area code and national subscriber number separately.
Problem
I am trying to output statistics about a table, followed by more table data using Pandas and numpy.
When I execute the following code:
import pandas as pd
import numpy as np
data = pd.read_csv(r'c:\Documents\DS\CAStateBuildingMetrics.csv')
waterUsage = data["Water Use (All Water Sources) (kgal)"]
dept = data[["Department Name", "Property Id"]]
mean = str(waterUsage.mean())
median = str(waterUsage.median())
most = str(waterUsage.mode())
hw1 = open(r'c:\Documents\DS\testFile', "a")
hw1.write("Mean Water Usage Median Water Usage Most Common Usage Amounts\n")
hw1.write(mean+' '+median+' '+most)
np.savetxt(r'c:\Documents\DS\testFile', dept.values, fmt='%s')
The table output by np.savetxt is written into c:\Documents\DS\testFile before the statistics about mean, median, and mode water usage are written into the file. Below is the output I am describing:
Here is a sample of the table output, which ends up being 1700 rows.
Capitol Area Development Authority 1259182
Capitol Area Development Authority 1259200
Capitol Area Development Authority 1259218
California Department of Forestry and Fire Protection 3939905
California Department of Forestry and Fire Protection 3939906
California Department of Forestry and Fire Protection 3939907
After this, the script outputs the statistics in this format
Mean Water Usage Median Water Usage Most Common Usage Amounts
6913.1633414932685 182.35 0 165.0
Type: float64
Question
How do I adjust the behavior to guarantee that the statistics appear before the table?
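The issue, as pointed out by @hpaulj, is that np.savetxt is not writing through the file object you already opened. Passing the path makes np.savetxt open the file a second time and write the table straight away, while the statistics written through hw1 are still sitting in its buffer and only reach the file when hw1 is flushed or closed.
Replacing
np.savetxt(r'c:\Documents\DS\testFile', dept.values, fmt='%s')
with
np.savetxt(hw1, dept.values, fmt='%s')
hw1.close()
writes all the information to the same file in the expected order. Closing the file explicitly also follows best practice for file handling in Python.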
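To see why the order flips, here is a small standard-library sketch of the same effect. It deliberately uses plain open/write calls in place of pandas and np.savetxt, and the scratch path and line contents are made up; the second handle stands in for np.savetxt opening the path on its own.

```python
import os
import tempfile

# Hypothetical scratch file, so nothing real gets overwritten.
path = os.path.join(tempfile.mkdtemp(), "testFile")

# Reproduce the problem: two independent handles on the same path.
stats = open(path, "a")
stats.write("stats line\n")        # held in the handle's buffer, not yet on disk
with open(path, "w") as table:     # stands in for np.savetxt(path, ...)
    table.write("table row\n")     # reaches the disk when this block exits
stats.close()                      # only now is the stats line flushed
with open(path) as f:
    wrong_order = f.read()

# The fix: send every write through one handle.
with open(path, "w") as out:
    out.write("stats line\n")
    out.write("table row\n")       # np.savetxt(out, ...) would go here
with open(path) as f:
    right_order = f.read()

print(repr(wrong_order))
print(repr(right_order))
```

Using a with block for the single handle also takes care of closing it, so there is no separate close() call to forget.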
Given an input file, e.g.
<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
</p>
</doc>
<srcset>
The desired result is a nested dictionary that stores:
/setid
/docid
/segid
text
I've been using a defaultdict and reading the xml file with BeautifulSoup and nested loops, i.e.
from io import StringIO
from collections import defaultdict
from bs4 import BeautifulSoup
srcfile = """<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
</p>
</doc>
<srcset>"""
#ntok = NISTTokenizer()
eval_docs = defaultdict(lambda: defaultdict(dict))
with StringIO(srcfile) as fin:
    bsoup = BeautifulSoup(fin.read(), 'html5lib')
    setid = bsoup.find('srcset')['setid']
    for doc in bsoup.find_all('doc'):
        docid = doc['docid']
        for seg in doc.find_all('seg'):
            segid = seg['id']
            eval_docs[setid][docid][segid] = seg.text
[out]:
>>> eval_docs
defaultdict(<function __main__.<lambda>>,
{'newstest2015': defaultdict(dict,
{'1012-bbc': {'1': 'India and Japan prime ministers meet in Tokyo',
'2': "India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.",
'3': 'Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.',
'4': 'High on the agenda are plans for greater nuclear co-operation.',
'5': 'India is also reportedly hoping for a deal on defence collaboration between the two nations.'},
'1018-lenta.ru': {'1': 'FANO Russia will hold a final Expert Session',
'2': 'The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.',
'3': 'The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.',
'4': 'At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.'}})})
Is there a simpler way to read the file and get the same eval_docs nested dictionary?
Can it be done easily without using BeautifulSoup?
Note that in the example, there's only one setid and one docid but the actual file has more than one of those.
Since what you have is HTML that merely looks like XML, you can't rely on XML-based tools. In most cases your options are:
Implement a SAX parser
Use BS4 (which you are already doing)
Use lxml
With any of the alternatives you will end up spending more time and effort, and writing more code, to handle this. What you have is really sleek and easy. I wouldn't look for another solution if I were you.
PS: What could be simpler than 10 lines of code!
I don't know if you'll find this simpler, but here's an alternative, using lxml as others have suggested.
Step 1: Convert the XML data into a normalized table (a list of lists)
from lxml import etree
tree = etree.parse('source.xml')
segs = tree.xpath('//seg')
normalized_list = []
for seg in segs:
    # walk up from seg: p -> doc -> srcset
    srcset = seg.getparent().getparent().getparent().attrib['setid']
    doc = seg.getparent().getparent().attrib['docid']
    normalized_list.append([srcset, doc, seg.attrib['id'], seg.text])
Step 2: Use defaultdict like you did in your original code
from collections import defaultdict
d = defaultdict(lambda: defaultdict(dict))
for i in normalized_list:
    # each row is [setid, docid, segid, text]
    d[i[0]][i[1]][i[2]] = i[3]
Depending on how you're keeping the source file, you'll have to use one of these methods to parse the XML:
tree = etree.parse('source.xml'): when you want to parse a file directly. You won't need StringIO, and etree closes the file automatically.
tree = etree.fromstring(source): where source is a string object, like in your question. This returns the root element rather than a tree object, but tree.xpath('//seg') works on it the same way.
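If installing lxml is not an option, the same nested dictionary can probably be built with the standard library's xml.etree.ElementTree. This is only a sketch under one important assumption: the input must be well-formed XML, so the final tag here is a proper closing </srcset>, unlike the sample in the question. The shortened sample and the helper name build_eval_docs are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened, well-formed stand-in for the file in the question.
source = """<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">High on the agenda are plans for greater nuclear co-operation.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
</p>
</doc>
</srcset>"""


def build_eval_docs(xml_text):
    """Return a nested mapping of setid -> docid -> segid -> segment text."""
    root = ET.fromstring(xml_text)
    eval_docs = defaultdict(lambda: defaultdict(dict))
    # iter() also yields the element itself, so this works whether
    # <srcset> is the root or is nested inside a wrapper element.
    for srcset in root.iter("srcset"):
        setid = srcset.get("setid")
        for doc in srcset.iter("doc"):
            docid = doc.get("docid")
            for seg in doc.iter("seg"):
                eval_docs[setid][docid][seg.get("id")] = seg.text
    return eval_docs


eval_docs = build_eval_docs(source)
print(eval_docs["newstest2015"]["1012-bbc"]["1"])
```

Because the loops go from srcset down to seg, there is no need to walk back up through parent elements, which ElementTree does not support directly.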