Hi, I have been playing with this code for the last 3-4 days with no luck.
Here is the code:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences3 = ['USPS IV USA Tracking Status WTA Portal to content Home ures Plans About Us Glossary Contact Us Home parceltrack USPS IV USA Tracking Status ',
'UPS Mail Innovations USA Tracking Status to content Home Contact Us Home parceltrack UPS Mail Innovations USA Tracking Status',
'USPS USA Tracking Status | WTA Portal - WTA Portal Skip to cont About Us Glossary Contact Us Home parceltrack USPS USA Tracking Status ']
# remove non ascii characters from list
sentences3 = [x.encode('ascii', 'ignore').decode('ascii') for x in sentences3]
print(sentences3)
embeddings3 = model.encode(sentences3, convert_to_tensor=True)
print(embeddings3)
clusters = util.community_detection(embeddings3, threshold=0.2, min_community_size=1)
print(clusters)
The line clusters = util.community_detection(embeddings3, threshold=0.2, min_community_size=1)
never outputs anything. I waited about an hour to see if anything would happen, but no luck. I have a decent Mac Pro M2 with 16 GB of RAM, so I feel resources shouldn't be an issue.
Does anyone have any tips for debugging this?
Thanks
This was a known issue in SBERT that was supposed to have been fixed. You can get around it by setting a higher threshold (with 0.8 it runs in seconds) or by adding more sentences.
util.community_detection(embeddings3, threshold=0.8, min_community_size=1)
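To judge which threshold is reasonable for a given input, it can help to look at the pairwise cosine similarities directly. Below is a minimal sketch in plain Python. The toy vectors stand in for the real embeddings, and cosine and pairs_at_or_above are hypothetical helpers written for illustration; they are not part of sentence_transformers.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairs_at_or_above(vectors, threshold):
    """Return (i, j, similarity) for each pair with similarity >= threshold."""
    result = []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        if sim >= threshold:
            result.append((i, j, sim))
    return result


# Toy vectors (made up); with real data, something like
# embeddings3.tolist() could be passed instead.
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

for threshold in (0.2, 0.8):
    matches = pairs_at_or_above(toy_vectors, threshold)
    print(threshold, len(matches))
```

If nearly every pair clears the low threshold, as is likely for three near-identical tracking pages, the threshold is not separating anything, which is one more reason to try a higher value.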
Please forgive me, I am teaching myself and am pretty new to this. I have searched and found nothing that addresses my issue; there are lots of answers dealing with os and file directories, but I couldn't figure out how to apply them here. I am also not very familiar with regex; I have tried using it as well but kept getting errors.
So I have a large text file (9GB) that is actually a list of discussion board posts.
I have a list of words I want to add to the stoplist for topic modeling. (I can do that)
However I also want to add any term that ends with any of the words in my list.
A sample of my data and lists is below.
txt = ['satoshiFounderSr MemberOfflineActivity Merit Welcome to the new Bitcoin forumNovember PMMeritedbyVlad Vlad Claymore krogothmanhattan negeroy Referee Vod suchmoon alani Lesbian Cow cryptohunter hv janggernaut matt Jeremycoin MaoChao Kda roslinpl gold MicroGuy elokk notaek BitcoinFX EcuaMobi Lutpin Lincoln Echo Nomad avatar kiyoshi saugwurm BALIK anggriani teeGUMES dooglus bitbollo klarki franckuestein legendster techman Provok mrcash paxmao jeks Cent MrCryptHodl DireWolfM BarbieCasino theunbeatable mindrust fillippone Mister k LFC Bitcoin nutildah Oceat digit Woshib ubay undeadbitcoiner pushups btcrocks realdantreccia Dq Atabey limtjoehua LoyceV anonymousminer MagicByt vizique coinlocket Altcoinsintel baeva OgNasty o solo miner Janation Kalemder sujonali MoparMiningLLC Eddyc jonemil Kryptowolf green slmn TyfrTR cr mprep Searing EFS adaseb notbatman Lucius boltz layer gfx seoincorporation AGD Phinnaeus Gage tabas pawel Lafu pangu Blind Legs Parker itod Potato Chips wonko Arriemoller Coin ruletheworld Halab coupable o e l e o TheBeardedBaby MoxnatyShmel monsanto amishmanish xtraelv Husna QA madnessteat Bthd taikuri dvd rw Toxic styca WorldCoiner bubbalex xyzzy V saya jets crypto trader xzEXrP xlcus solosequenosenada VB MishaSER dragonvslinux Zocadas jahepahit risatrakib chimk Porfirii YuT Coin adrianto famososMuertos angel Financisto RareFortune jakoylantern bere kin mdayonliner sncc squallw cryptjh jazmuzika wishxy markleal BlackHatCoiner an sha ldah DEMENTOR mustangy TaShoKi Adriane Poker Player StackItUp PIOUPIOU loreRex tasadar wego Gustavo Livecoins Palmholder CryptoPravda barjan Crypto Collection collapse jukeee Cuk ng bitc in LBTC Pyrojason M BTC vanobe shortcircuit Toqo Vxv BiT pOL songsunling bitcoinokulu AlexMay Kaonashi Neo Baudrillard RussaX morkaii Welcome to the new Bitcoin forumThe old forum can still be reached here http bitcoinsourceforgenet boards indexphpI ll repost some selected threads here and add updated answers to questions where I 
canFAQhttp bitcoinsourceforgenet wiki indexphppage FAQDownloadhttp sourceforgenet projects bitcoin files satoshiFounderSr MemberOfflineActivity Merit Welcome to the new Bitcoin forumNovember PMMeritedbyVlad Vlad Claymore krogothmanhattan negeroy Referee Vod suchmoon alani Lesbian Cow cryptohunter hv janggernaut matt Jeremycoin MaoChao Kda roslinpl gold MicroGuy elokk notaek BitcoinFX EcuaMobi Lutpin Lincoln Echo Nomad avatar kiyoshi saugwurm BALIK anggriani teeGUMES dooglus bitbollo klarki franckuestein legendster techman Provok mrcash paxmao jeks Cent MrCryptHodl DireWolfM BarbieCasino theunbeatable mindrust fillippone Mister k LFC Bitcoin nutildah Oceat digit Woshib ubay undeadbitcoiner pushups btcrocks realdantreccia Dq Atabey limtjoehua LoyceV anonymousminer MagicByt vizique coinlocket Altcoinsintel baeva OgNasty o solo miner Janation Kalemder sujonali MoparMiningLLC Eddyc jonemil Kryptowolf green slmn TyfrTR cr mprep Searing EFS adaseb notbatman Lucius boltz layer gfx seoincorporation AGD Phinnaeus Gage tabas pawel Lafu pangu Blind Legs Parker itod Potato Chips wonko Arriemoller Coin ruletheworld Halab coupable o e l e o TheBeardedBaby MoxnatyShmel monsanto amishmanish xtraelv Husna QA madnessteat Bthd taikuri dvd rw Toxic styca WorldCoiner bubbalex xyzzy V saya jets crypto trader xzEXrP xlcus solosequenosenada VB MishaSER dragonvslinux Zocadas jahepahit risatrakib chimk Porfirii YuT Coin adrianto famososMuertos angel Financisto RareFortune jakoylantern bere kin mdayonliner sncc squallw cryptjh jazmuzika wishxy markleal BlackHatCoiner an sha ldah DEMENTOR mustangy TaShoKi Adriane Poker Player StackItUp PIOUPIOU loreRex tasadar wego Gustavo Livecoins Palmholder CryptoPravda barjan Crypto Collection collapse jukeee Cuk ng bitc in LBTC Pyrojason M BTC vanobe shortcircuit Toqo Vxv BiT pOL songsunling bitcoinokulu AlexMay Kaonashi Neo Baudrillard RussaX morkaii Welcome to the new Bitcoin forumThe old forum can still be reached here http bitcoinsourceforgenet boards 
indexphpI ll repost some selected threads here and add updated answers to questions where I canFAQhttp bitcoinsourceforgenet wiki indexphppage FAQDownloadhttp sourceforgenet projects bitcoin files satoshiFounderSr MemberOfflineActivity Merit Welcome to the new Bitcoin forumNovember PMMeritedbyVlad Vlad Claymore krogothmanhattan negeroy Referee Vod suchmoon alani Lesbian Cow cryptohunter hv janggernaut matt Jeremycoin MaoChao Kda roslinpl gold MicroGuy elokk notaek BitcoinFX EcuaMobi Lutpin Lincoln Echo Nomad avatar kiyoshi saugwurm BALIK anggriani teeGUMES dooglus bitbollo klarki franckuestein legendster techman Provok mrcash paxmao jeks Cent MrCryptHodl DireWolfM BarbieCasino theunbeatable mindrust fillippone Mister k LFC Bitcoin nutildah Oceat digit Woshib ubay undeadbitcoiner pushups btcrocks realdantreccia Dq Atabey limtjoehua LoyceV anonymousminer MagicByt vizique coinlocket Altcoinsintel baeva OgNasty o solo miner Janation Kalemder sujonali MoparMiningLLC Eddyc jonemil Kryptowolf green slmn TyfrTR cr mprep Searing EFS adaseb notbatman Lucius boltz layer gfx seoincorporation AGD Phinnaeus Gage tabas pawel Lafu pangu Blind Legs Parker itod Potato Chips wonko Arriemoller Coin ruletheworld Halab coupable o e l e o TheBeardedBaby MoxnatyShmel monsanto amishmanish xtraelv Husna QA madnessteat Bthd taikuri dvd rw Toxic styca WorldCoiner bubbalex xyzzy V saya jets crypto trader xzEXrP xlcus solosequenosenada VB MishaSER dragonvslinux Zocadas jahepahit risatrakib chimk Porfirii YuT Coin adrianto famososMuertos angel Financisto RareFortune jakoylantern bere kin mdayonliner sncc squallw cryptjh jazmuzika wishxy markleal BlackHatCoiner an sha ldah DEMENTOR mustangy TaShoKi Adriane Poker Player StackItUp PIOUPIOU loreRex tasadar wego Gustavo Livecoins Palmholder CryptoPravda barjan Crypto Collection collapse jukeee Cuk ng bitc in LBTC Pyrojason M BTC vanobe shortcircuit Toqo Vxv BiT pOL songsunling bitcoinokulu AlexMay Kaonashi Neo Baudrillard RussaX morkaii Welcome to 
the new Bitcoin forumThe old forum can still be reached here http bitcoinsourceforgenet boards indexphpI ll repost some selected threads here and add updated answers to questions where I canFAQhttp bitcoinsourceforgenet wiki indexphppage FAQDownloadhttp sourceforgenet projects bitcoin files Welcome to the new Bitcoin forumNovember PMMeritedbyVlad Vlad Claymore krogothmanhattan negeroy Referee Vod suchmoon alani Lesbian Cow cryptohunter hv janggernaut matt Jeremycoin MaoChao Kda roslinpl gold MicroGuy elokk notaek BitcoinFX EcuaMobi Lutpin Lincoln Echo Nomad avatar kiyoshi saugwurm BALIK anggriani teeGUMES dooglus bitbollo klarki franckuestein legendster techman Provok mrcash paxmao jeks Cent MrCryptHodl DireWolfM BarbieCasino theunbeatable mindrust fillippone Mister k LFC Bitcoin nutildah Oceat digit Woshib ubay undeadbitcoiner pushups btcrocks realdantreccia Dq Atabey limtjoehua LoyceV anonymousminer MagicByt vizique coinlocket Altcoinsintel baeva OgNasty o solo miner Janation Kalemder sujonali MoparMiningLLC Eddyc jonemil Kryptowolf green slmn TyfrTR cr mprep Searing EFS adaseb notbatman Lucius boltz layer gfx seoincorporation AGD Phinnaeus Gage tabas pawel Lafu pangu Blind Legs Parker itod Potato Chips wonko Arriemoller Coin ruletheworld Halab coupable o e l e o TheBeardedBaby MoxnatyShmel monsanto amishmanish xtraelv Husna QA madnessteat Bthd taikuri dvd rw Toxic styca WorldCoiner bubbalex xyzzy V saya jets crypto trader xzEXrP xlcus solosequenosenada VB MishaSER dragonvslinux Zocadas jahepahit risatrakib chimk Porfirii YuT Coin adrianto famososMuertos angel Financisto RareFortune jakoylantern bere kin mdayonliner sncc squallw cryptjh jazmuzika wishxy markleal BlackHatCoiner an sha ldah DEMENTOR mustangy TaShoKi Adriane Poker Player StackItUp PIOUPIOU loreRex tasadar wego Gustavo Livecoins Palmholder CryptoPravda barjan Crypto Collection collapse jukeee Cuk ng bitc in LBTC Pyrojason M BTC vanobe shortcircuit Toqo Vxv BiT pOL songsunling bitcoinokulu AlexMay 
Kaonashi Neo Baudrillard RussaX morkaii ',
'satoshiFounderSr MemberOfflineActivity Merit Repost Bitcoin MaturationNovember PMMeritedbyescrowms NeuroticFish finist x icopress jankeman bitcoinbitcoin Bitcoin MaturationPosted Thu of Oct UTC From the user s perspective the bitcoin maturation process can be broken down into stages The initial network transaction that occurs when you first click Generate Coins The time between that initial network transaction and when the bitcoin entry is ready to appear in the All Transactions list The change of the bitcoin entry from outside the All Transaction field to inside it The time between when the bitcoin appears in the All Transfers list and when the Description is ready to change to Generated matures in x more blocks The change of the Description to Generated matures in x more blocks The time between when the Description says Generated matures in x more blocks to when it is ready to change to Generated The change of the Description to Generated The time after the Description has changed to GeneratedWhich stages require network connectivity significant local CPU usage and or significant remote CPU usage Do any of these stages have names sirius m Re Bitcoin MaturationPosted Thu of Oct UTC As far as I know there s no network transaction when you click Generate Coins your computer just starts calculating the next proof of work The CPU usage is when you re generating coinsIn this example the network connection is used when you broadcast the information about the proof of work block you ve created that which entitles you to the new coin Generating coins successfully requires constant connectivity so that you can start working on the next block when someone gets the current block before yousatoshiFounderSr MemberOfflineActivity Merit Repost Bitcoin MaturationNovember PMMeritedbyescrowms NeuroticFish finist x icopress jankeman bitcoinbitcoin Bitcoin MaturationPosted Thu of Oct UTC From the user s perspective the bitcoin maturation process can be broken down into stages The 
initial network transaction that occurs when you first click Generate Coins The time between that initial network transaction and when the bitcoin entry is ready to appear in the All Transactions list The change of the bitcoin entry from outside the All Transaction field to inside it The time between when the bitcoin appears in the All Transfers list and when the Description is ready to change to Generated matures in x more blocks The change of the Description to Generated matures in x more blocks The time between when the Description says Generated matures in x more blocks to when it is ready to change to Generated The change of the Description to Generated The time after the Description has changed to GeneratedWhich stages require network connectivity significant local CPU usage and or significant remote CPU usage Do any of these stages have names sirius m Re Bitcoin MaturationPosted Thu of Oct UTC As far as I know there s no network transaction when you click Generate Coins your computer just starts calculating the next proof of work The CPU usage is when you re generating coinsIn this example the network connection is used when you broadcast the information about the proof of work block you ve created that which entitles you to the new coin Generating coins successfully requires constant connectivity so that you can start working on the next block when someone gets the current block before yousatoshiFounderSr MemberOfflineActivity Merit Repost Bitcoin MaturationNovember PMMeritedbyescrowms NeuroticFish finist x icopress jankeman bitcoinbitcoin Bitcoin MaturationPosted Thu of Oct UTC From the user s perspective the bitcoin maturation process can be broken down into stages The initial network transaction that occurs when you first click Generate Coins The time between that initial network transaction and when the bitcoin entry is ready to appear in the All Transactions list The change of the bitcoin entry from outside the All Transaction field to inside it The 
time between when the bitcoin appears in the All Transfers list and when the Description is ready to change to Generated matures in x more blocks The change of the Description to Generated matures in x more blocks The time between when the Description says Generated matures in x more blocks to when it is ready to change to Generated The change of the Description to Generated The time after the Description has changed to GeneratedWhich stages require network connectivity significant local CPU usage and or significant remote CPU usage Do any of these stages have names sirius m Re Bitcoin MaturationPosted Thu of Oct UTC As far as I know there s no network transaction when you click Generate Coins your computer just starts calculating the next proof of work The CPU usage is when you re generating coinsIn this example the network connection is used when you broadcast the information about the proof of work block you ve created that which entitles you to the new coin Generating coins successfully requires constant connectivity so that you can start working on the next block when someone gets the current block before youRepost Bitcoin MaturationNovember PMMeritedbyescrowms NeuroticFish finist x icopress jankeman ',
'satoshiFounderSr MemberOfflineActivity Merit Repost Request Make this anonymousNovember PMMeritedbyxtraelv anonguy Request Make this anonymousPosted Thu of Oct UTC Are there any plans to make this service anonymouseg Being able to route BitCoin through TorsatoshiFounderSr MemberOfflineActivity Merit Repost Request Make this anonymousNovember PMMeritedbyxtraelv anonguy Request Make this anonymousPosted Thu of Oct UTC Are there any plans to make this service anonymouseg Being able to route BitCoin through TorsatoshiFounderSr MemberOfflineActivity Merit Repost Request Make this anonymousNovember PMMeritedbyxtraelv anonguy Request Make this anonymousPosted Thu of Oct UTC Are there any plans to make this service anonymouseg Being able to route BitCoin through TorRepost Request Make this anonymousNovember PMMeritedbyxtraelv ',
'satoshiFounderSr MemberOfflineActivity Merit Re Repost Bitcoin MaturationNovember PMMeritedbyhold coins NeuroticFish It s important to have network connectivity while you re trying to generate a coin block and at the moment it is successfully generated During generation when the status bar says Generating and you re using CPU to find a proof of work you must constantly keep in contact with the network to receive the latest block If your block does not link to the latest block it may not be accepted When you successfully generate a block it is immediately broadcast to the network Other nodes must receive it and link to it for it to be accepted as the new latest blockThink of it as a cooperative effort to make a chain When you add a link you must first find the current end of the chain If you were to locate the last link then go off for an hour and forge your link come back and link it to the link that was the end an hour ago others may have added several links since then and they re not going to want to use your link that now branches off the middleAfter a block is created the maturation time of blocks is to make absolutely sure the block is part of the main chain before it can be spent Your node isn t doing anything with the block during that time just waiting for other blocks to be added after yours You don t have to be online during that timesatoshiFounderSr MemberOfflineActivity Merit Re Repost Bitcoin MaturationNovember PMMeritedbyhold coins NeuroticFish It s important to have network connectivity while you re trying to generate a coin block and at the moment it is successfully generated During generation when the status bar says Generating and you re using CPU to find a proof of work you must constantly keep in contact with the network to receive the latest block If your block does not link to the latest block it may not be accepted When you successfully generate a block it is immediately broadcast to the network Other nodes must receive it and link to it for 
it to be accepted as the new latest blockThink of it as a cooperative effort to make a chain When you add a link you must first find the current end of the chain If you were to locate the last link then go off for an hour and forge your link come back and link it to the link that was the end an hour ago others may have added several links since then and they re not going to want to use your link that now branches off the middleAfter a block is created the maturation time of blocks is to make absolutely sure the block is part of the main chain before it can be spent Your node isn t doing anything with the block during that time just waiting for other blocks to be added after yours You don t have to be online during that timesatoshiFounderSr MemberOfflineActivity Merit Re Repost Bitcoin MaturationNovember PMMeritedbyhold coins NeuroticFish It s important to have network connectivity while you re trying to generate a coin block and at the moment it is successfully generated During generation when the status bar says Generating and you re using CPU to find a proof of work you must constantly keep in contact with the network to receive the latest block If your block does not link to the latest block it may not be accepted When you successfully generate a block it is immediately broadcast to the network Other nodes must receive it and link to it for it to be accepted as the new latest blockThink of it as a cooperative effort to make a chain When you add a link you must first find the current end of the chain If you were to locate the last link then go off for an hour and forge your link come back and link it to the link that was the end an hour ago others may have added several links since then and they re not going to want to use your link that now branches off the middleAfter a block is created the maturation time of blocks is to make absolutely sure the block is part of the main chain before it can be spent Your node isn t doing anything with the block during that 
time just waiting for other blocks to be added after yours You don t have to be online during that timeRe Repost Bitcoin MaturationNovember PMMeritedbyhold coins NeuroticFish ',
'satoshiFounderSr MemberOfflineActivity Merit Re Repost Request Make this anonymousNovember PM There will be a proxy setting in version so you can connect through TOR I ve done a careful scrub to make sure it doesn t use DNS or do anything that would leak your IP while in proxy modesatoshiFounderSr MemberOfflineActivity Merit Re Repost Request Make this anonymousNovember PM There will be a proxy setting in version so you can connect through TOR I ve done a careful scrub to make sure it doesn t use DNS or do anything that would leak your IP while in proxy modesatoshiFounderSr MemberOfflineActivity Merit Re Repost Request Make this anonymousNovember PM There will be a proxy setting in version so you can connect through TOR I ve done a careful scrub to make sure it doesn t use DNS or do anything that would leak your IP while in proxy modeRe Repost Request Make this anonymousNovember PM ',
'satoshiFounderSr MemberOfflineActivity Merit Repost How anonymous are bitcoinsNovember PMMeritedbylivingfree xtraelv bitcoinbitcoin How anonymous are bitcoinsCan nodes on the network tell from which and or to which bitcoin address coins are being sent Do blocks contain a history of where bitcoins have been transfered to and from Can nodes tell which bitcoin addresses belong to which IP addresses Is there a command line option to enable the sock proxy the first time that bitcoin starts What happens if you send bitcoins to an IP address that has multiple clients connected through network address translation NAT satoshiFounderSr MemberOfflineActivity Merit Repost How anonymous are bitcoinsNovember PMMeritedbylivingfree xtraelv bitcoinbitcoin How anonymous are bitcoinsCan nodes on the network tell from which and or to which bitcoin address coins are being sent Do blocks contain a history of where bitcoins have been transfered to and from Can nodes tell which bitcoin addresses belong to which IP addresses Is there a command line option to enable the sock proxy the first time that bitcoin starts What happens if you send bitcoins to an IP address that has multiple clients connected through network address translation NAT satoshiFounderSr MemberOfflineActivity Merit Repost How anonymous are bitcoinsNovember PMMeritedbylivingfree xtraelv bitcoinbitcoin How anonymous are bitcoinsCan nodes on the network tell from which and or to which bitcoin address coins are being sent Do blocks contain a history of where bitcoins have been transfered to and from Can nodes tell which bitcoin addresses belong to which IP addresses Is there a command line option to enable the sock proxy the first time that bitcoin starts What happens if you send bitcoins to an IP address that has multiple clients connected through network address translation NAT Repost How anonymous are bitcoinsNovember PMMeritedbylivingfree xtraelv ']
from nltk.corpus import stopwords
stop = list(stopwords.words("english"))
stop.append("brand newofflineactivity")
stop.append("newbieofflineactivity")
stop.append("jr. memberofflineactivity")
stop.append("memberofflineactivity")
stop.append("full memberofflineactivity")
stop.append("sr. memberofflineactivity")
stop.append("hero memberofflineactivity")
stop.append("legendaryofflineactivity")
stop.append("vipofflineactivity")
stop.append("donaterofflineactivity")
stop.append("staffofflineactivity")
stop.append("moderatorofflineactivity")
stop.append("global moderatorofflineactivity")
stop.append("administratorofflineactivity")
stop.append("founderofflineactivity")
stop.append("merit")
stop.append("re")
stop.append("bitcoin")
stop.append("bitcoins")
stop.append("brand")
stop.append("full")
stop.append("global")
stop.append("hero")
stop.append("jr")
stop.append("newofflineactivity")
stop.append("sr")
stop.append("pm")
stop.append("com")
stop.append("www")
stop.append("http")
stop.append("")
stpexp = ['*brand newofflineactivity', '*newbieofflineactivity', '*jr. memberofflineactivity',
          '*memberofflineactivity', '*full memberofflineactivity', '*sr. memberofflineactivity',
          '*hero memberofflineactivity', '*legendaryofflineactivity', '*vipofflineactivity',
          '*donaterofflineactivity', '*staffofflineactivity', '*moderatorofflineactivity', '*global moderatorofflineactivity',
          '*administratorofflineactivity', '*founderofflineactivity']
So I want to search all of the items in 'txt' for all of the variations of the words in the list 'stpexp' and then append all of those variations to my list of stopwords 'stop'.
Any assistance would be greatly appreciated.
You can use the built-in fnmatch library.
For example, if you want to find all the words in your text that end with 'thepattern', you can do it like this:
import fnmatch
txt = ["longerthepattern is the word i want", "thisisthepattern and it works"]
pattern = '*thepattern'
to_add_to_stoplist = []
for sentence in txt:
    filters = fnmatch.filter(sentence.split(" "), pattern)
    to_add_to_stoplist += filters
print(to_add_to_stoplist)
And it outputs:
['longerthepattern', 'thisisthepattern']
You can add this list of words to the stopwords.
EDIT:
Here is a version using list comprehensions to handle multiple patterns. It no longer uses fnmatch; it uses the str.endswith method instead.
Note that it requires the patterns to be a tuple, not a list.
txt = ["longerthepattern removeme is the word i want", "thisisthepattern and it works"]
patterns = ("pattern","veme")
def my_func(sentence):
    return [x for x in sentence.split(" ") if x.lower().endswith(patterns)]
to_add_to_stop = [word for sentence in txt for word in my_func(sentence)]
It outputs:
['longerthepattern', 'removeme', 'thisisthepattern']
SECOND EDIT:
I added the .lower() call in the list comprehension to ensure that the words being compared with the patterns are all lowercase, since the patterns are lowercase as well.
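Applied to the question's setup, the same endswith idea might look like the sketch below. It assumes the patterns come from a list shaped like stpexp (a leading '*' followed by a lowercase suffix). collect_suffix_matches and the sample lines are hypothetical names and data made up for illustration.

```python
def collect_suffix_matches(lines, suffix_patterns):
    """Return the set of lowercased tokens that end with any of the suffixes."""
    # Strip the leading '*' so each pattern becomes a plain suffix, and
    # lowercase it so the comparison is case-insensitive.
    suffixes = tuple(p.lstrip("*").lower() for p in suffix_patterns)
    found = set()
    for line in lines:
        for token in line.split():
            lowered = token.lower()
            if lowered.endswith(suffixes):
                found.add(lowered)
    return found


# Made-up sample lines in the style of the forum data.
sample_lines = [
    "satoshiFounderSr MemberOfflineActivity Merit Welcome",
    "someuserHero MemberOfflineActivity Merit Re thread",
    "anotheruserLegendaryOfflineActivity Merit",
]
sample_patterns = ["*memberofflineactivity", "*legendaryofflineactivity"]

extra_stopwords = collect_suffix_matches(sample_lines, sample_patterns)
print(sorted(extra_stopwords))
```

Because the function only iterates over lines, an open file object can be passed in place of the list, so a 9 GB file is read line by line rather than loaded at once. Note that tokens are split on whitespace, so a pattern containing a space (such as '*jr. memberofflineactivity') can never match a single token; the shorter '*memberofflineactivity' already covers those cases. The result can then be added to the stop list with stop.extend(extra_stopwords).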
I am new to web scraping. I am trying to scrape the "When purchased online" text
shown on Target product pages, but I could not find it in the HTML.
Does anyone know how to locate the element in the HTML? Any help is appreciated. Thanks!
Product URLs:
https://www.target.com/c/allergy-sinus-medicines-treatments-health/-/N-4y5ny?Nao=144
https://www.target.com/p/genexa-dextromethorphan-kids-39-cough-and-chest-congestion-suppressant-4-fl-oz/-/A-80130848#lnk=sametab
I have no idea which element you want to get, but the API sends JSON data, not HTML, so you can simply convert it to a dictionary/list and use keys/indexes to get the values.
You do have to find the correct keys in the JSON data manually,
or you can write a script to search the JSON (using for-loops and recursion).
Minimal working code. I found the keys manually.
import requests
url = 'https://redsky.target.com/redsky_aggregations/v1/web/pdp_client_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&tcin=80130848&is_bot=false&member_id=0&store_id=1771&pricing_store_id=1771&has_pricing_store_id=true&scheduled_delivery_store_id=1771&has_financing_options=true&visitor_id=01819D268B380201B177CA755BCE70CC&has_size_context=true&latitude=41.9831&longitude=-91.6686&zip=52404&state=IA' # JSON
response = requests.get(url)
data = response.json()
product = data['data']['product']
print('price:', product['price']['current_retail'])
print('title:', product['item']['product_description']['title'])
print('description:', product['item']['product_description']['downstream_description'])
print('------------')
for bullet in product['item']['product_description']['bullet_descriptions']:
    print(bullet)
print('------------')
print(product['item']['product_description']['soft_bullets']['title'])
for bullet in product['item']['product_description']['soft_bullets']['bullets']:
    print('-', bullet)
print('------------')
for attribute in product['item']['wellness_merchandise_attributes']:
    print('-', attribute['value_name'])
    print(' ', attribute['wellness_description'])
Result:
price: 13.99
title: Genexa Dextromethorphan Kids' Cough and Chest Congestion Suppressant - 4 fl oz
description: Genexa Kids’ Cough & Chest Congestion is real medicine, made clean - a powerful cough suppressant and expectorant that helps control cough, relieves chest congestion and helps thin and loosen mucus. This liquid, non-drowsy medicine has the same active ingredients you need (dextromethorphan HBr and guaifenesin), but without the artificial ones you don’t (dyes, common allergens, parabens). We only use ingredients people deserve to make the first gluten-free, non-GMO, certified vegan medicines to help your little ones feel better. <br /><br />Genexa is the first clean medicine company. Founded by two dads who believe in putting People Over Everything, Genexa makes medicine with the same active ingredients people need, but without the artificial ones they don’t. It’s real medicine, made clean.
------------
<B>Suggested Age:</B> 4 Years and Up
<B>Product Form:</B> Liquid
<B>Primary Active Ingredient:</B> Dextromethorphan
<B>Package Quantity:</B> 1
<B>Net weight:</B> 4 fl oz (US)
------------
highlights
- This is an age restricted item and will require us to take a quick peek at your ID upon pick-up
- Helps relieve kids’ chest congestion and makes coughs more productive by thinning and loosening mucus
- Non-drowsy so your little ones (ages 4+) can get back to playing
- Our medicine is junk-free, with no artificial sweeteners or preservatives, no dyes, no parabens, and no common allergens
- Certified gluten-free, vegan, and non-GMO
- Flavored with real organic blueberries
- Gentle on little tummies
------------
- Dye-Free
A product that either makes an unqualified on-pack statement indicating that it does not contain dye, or carries an unqualified on-pack statement such as "no dyes" or "dye-free."
- Gluten Free
A product that has an unqualified independent third-party certification, or carries an on-pack statement relating to the finished product being gluten-free.
- Non-GMO
A product that has an independent third-party certification, or carries an unqualified on-pack statement relating to the final product being made without genetically engineered ingredients.
- Vegan
A product that carries an unqualified independent, third-party certification, or carries on-pack statement relating to the product being 100% vegan.
- HSA/FSA Eligible
Restrictions apply; contact your insurance provider about plan allowances and requirements
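The idea of searching the JSON with loops and recursion instead of finding keys by hand can be sketched as follows. find_paths is a hypothetical helper, and the sample dictionary is made up to resemble the nested structure; it is not the real API response.

```python
def find_paths(obj, needle, path=()):
    """Yield (path, value) where a key or a string value contains `needle`.

    Matching is case-insensitive; `path` is a tuple of dict keys and
    list indexes leading to the match.
    """
    needle = needle.lower()
    if isinstance(obj, dict):
        for key, value in obj.items():
            if needle in str(key).lower():
                yield path + (key,), value
            yield from find_paths(value, needle, path + (key,))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from find_paths(value, needle, path + (index,))
    elif isinstance(obj, str) and needle in obj.lower():
        yield path, obj


# Made-up sample data, only loosely shaped like the response above.
sample = {
    "data": {
        "product": {
            "price": {"current_retail": 13.99, "formatted_current_price": "$13.99"},
            "item": {"bullets": ["Non-drowsy", "Gluten free"]},
        }
    }
}

for found_path, found_value in find_paths(sample, "price"):
    print(found_path, found_value)
```

With the real response, calling find_paths(data, 'purchased online') would show whether that text appears anywhere in the JSON, and under which keys.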
EDIT:
The "When purchased online" (or "at Cedar Rapids South") information comes from a different URL.
For example
Product url:
https://www.target.com/p/genexa-kids-39-diphenhydramine-allergy-liquid-medicine-organic-agave-4-fl-oz/-/A-80130847
API product data:
https://redsky.target.com/redsky_aggregations/v1/web/pdp_client_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&tcin=80130847&is_bot=false&member_id=0&store_id=1771&pricing_store_id=1771&has_pricing_store_id=true&scheduled_delivery_store_id=1771&has_financing_options=true&visitor_id=01819D268B380201B177CA755BCE70CC&has_size_context=true&latitude=41.9831&longitude=-91.6686&zip=52404&state=IA
API "at Cedar Rapids South":
https://redsky.target.com/redsky_aggregations/v1/web_platform/product_fulfillment_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&is_bot=false&tcin=80130847&store_id=1771&zip=52404&state=IA&latitude=41.9831&longitude=-91.6686&scheduled_delivery_store_id=1771&required_store_id=1771&has_required_store_id=true
But in some situations the page may use other information in the product data to display "When purchased online" instead of "at Cedar Rapids South", and this may be hardcoded in JavaScript. For example, the product which displays "When purchased online" has formatted_current_price "$13.99", but the product which displays "at Cedar Rapids South" has formatted_current_price "See price in cart".
import requests
url = 'https://redsky.target.com/redsky_aggregations/v1/web/plp_search_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&brand_id=q643lel65ir&channel=WEB&count=24&default_purchasability_filter=true&offset=0&page=%2Fb%2Fq643lel65ir&platform=desktop&pricing_store_id=1771&store_ids=1771%2C1768%2C1113%2C3374%2C1792&useragent=Mozilla%2F5.0+%28X11%3B+Linux+x86_64%3B+rv%3A101.0%29+Gecko%2F20100101+Firefox%2F101.0&visitor_id=01819D268B380201B177CA755BCE70CC' # JSON
response = requests.get(url)
data = response.json()
for product in data['data']['search']['products']:
    print('title:', product['item']['product_description']['title'])
    print('price:', product['price']['current_retail'])
    print('formatted:', product['price']['formatted_current_price'])
    print('---')
Result:
title: Genexa Kids' Diphenhydramine Allergy Liquid Medicine - Organic Agave - 4 fl oz
price: 7.99
formatted: See price in cart
---
title: Genexa Dextromethorphan Kids' Cough and Chest Congestion Suppressant - 4 fl oz
price: 13.99
formatted: $13.99
---
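The guess that the label depends on a field in the product data can be explored offline. This is a minimal sketch, not Target's actual logic: `summarize_products` is a hypothetical helper, `sample` only imitates the response shape read by the loop above, and the rule "a formatted price that does not start with `$` means the price is hidden" is an assumption based on the two results shown.

```python
def summarize_products(data):
    """Return (title, current_retail, price_hidden) for each product."""
    rows = []
    for product in data['data']['search']['products']:
        title = product['item']['product_description']['title']
        price = product['price']
        formatted = price.get('formatted_current_price', '')
        # Assumption: anything that is not a dollar amount (for example
        # "See price in cart") means the site hides the price on the page.
        price_hidden = not formatted.startswith('$')
        rows.append((title, price.get('current_retail'), price_hidden))
    return rows


# Hand-built sample that copies the two results printed above.
sample = {'data': {'search': {'products': [
    {'item': {'product_description': {'title': "Genexa Kids' Diphenhydramine Allergy Liquid Medicine - Organic Agave - 4 fl oz"}},
     'price': {'current_retail': 7.99, 'formatted_current_price': 'See price in cart'}},
    {'item': {'product_description': {'title': "Genexa Dextromethorphan Kids' Cough and Chest Congestion Suppressant - 4 fl oz"}},
     'price': {'current_retail': 13.99, 'formatted_current_price': '$13.99'}},
]}}}

for title, retail, hidden in summarize_products(sample):
    print(retail, 'hidden' if hidden else 'shown', title)
```

With the real API, `sample` would be replaced by `response.json()`.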
I am trying to scrape the mail ID with Scrapy, Python and RegEx from this page: https://allevents.in/bangalore/project-based-summer-training-program/1851553244864163 .
For that purpose, I wrote the following commands, each of which returned an empty list:
response.xpath('//a/*[@href = "#"]/text()').extract()
response.xpath('//a/@onclick').extract()
response.xpath('//a/@onclick/text()').extract()
response.xpath('//span/*[@class = ""]/a/text()').extract()
response.xpath('//a/@onclick/text()').extract()
Apart from these, I planned to scrape the email ID from the description using RegEx. For that, I wrote a command to scrape the description, but it returned everything except the email ID at the end of the description:
response.xpath('//*[@property = "schema:description"]/text()').extract()
The output of the above command is:
[u'\n\t\t\t\t\t\t\t "Your Future is created by what you do today Let\'s shape it With Summer Training Program \u2026\u2026\u2026 ."', u'\n', u'\nWith ever changing technologies & methodologies, the competition today is much greater than ever before. The industrial scenario needs constant technical enhancements to cater to the rapid demands.', u'\nHT India Labs is presenting Summer Training Program to acquire and clear your concepts about your respective fields. ', u'\nEnroll on ', u' and avail Early bird Discounts.', u'\n', u'\nFor Registration or Enquiry call 9911330807, 7065657373 or write us at ', u'\t\t\t\t\t\t']
I don't know much about the onclick event attribute. I suppose that when it is set to return false, the request usually skips that portion. However, if you try the approach shown below, you may get a result very close to what you want.
import requests
from scrapy import Selector
res = requests.get("https://allevents.in/bangalore/project-based-summer-training-program/1851553244864163")
sel = Selector(res)
for items in sel.css("div[property='schema:description']"):
    emailid = items.css("span::text").extract_first()
    print(emailid)
Output:
htindialabsworkshops | gmail ! com
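The scraped value is still obfuscated. A minimal sketch for turning it into a usable address, assuming (based only on the single output above) that `|` stands for `@` and `!` stands for `.`; `deobfuscate_email` is a hypothetical helper name, and other pages may use a different scheme:

```python
def deobfuscate_email(text):
    """Rebuild an address from the 'user | domain ! tld' form seen above."""
    # Assumption: '|' replaces '@' and '!' replaces '.', padded with spaces.
    return text.replace('|', '@').replace('!', '.').replace(' ', '')


print(deobfuscate_email("htindialabsworkshops | gmail ! com"))
# htindialabsworkshops@gmail.com
```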
Hi, I am having a problem tokenizing Facebook comments stored in CSV format. My CSV data is ready, and reading the file works.
I am using Anaconda3 with Python 3.5. (My CSV data has about 20k rows and 1 column.)
The code is:
import csv
from nltk import sent_tokenize, word_tokenize
with open('facebook_comments_samsung.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)  # one sub-list per CSV row
print(your_list)
What comes, as a result, is something like this:
[['comment_message'], ['b"Yet again been told a pack of lies by Samsung Customer services who have lost my daughters phone and couldn\'t care less. ANYONE WHO PURCHASES ANYTHING FROM THIS COMPANY NEEDS THEIR HEAD TESTED"'], ["b'You cannot really blame an entire brand worldwide for a problem caused by a branch. It is a problem yes, but address your local problem branch'"], ["b'Haha!! Sorry if they lost your daughters phone but I will always buy Samsung products no matter what.'"], ["b'Salim Gaji BEST REPLIE EVER \\xf0\\x9f\\x98\\x8e'"], ["b'<3 Bewafa zarge <3 \\r\\n\\n \\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x93\\r\\n\\xf0\\x9f\\x8e\\xad\\xf0\\x9f\\x91\\x89 AQIB-BOT.ML \\xf0\\x9f\\x91\\x88\\xf0\\x9f\\x8e\\xadMANUAL\\xe2\\x99\\xaaKing.Bot\\xe2\\x84\\xa2 \\r\\n\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x93\\xe2\\x80\\x94'"], ["b'\\xf0\\x9f\\x8c\\x90 LATIF.ML \\xf0\\x9f\\x8c\\x90'"], ['b"I\'m just waiting here patiently for you guys to say that you\'ll be releasing the s8 and s8+ a week early, for those who pre-ordered. Wishful thinking \\xf0\\x9f\\x98\\x86. Can\'t wait!"'], ['b"That\'s some good positive thinking there sir."'], ["b'(y) #NextIsNow #DoWhatYouCant'"], ["b'looking good'"], ['b"I\'ve always thought that when I first set eyes on my first born that I\'d like it to be on the screen of a cameraphone at arms length rather than eye-to-eye while holding my child. 
Thank you Samsung for improving our species."'], ["b'cool story'"], ["b'I believe so!'"], ["b'superb'"], ["b'Nice'"], ["b'thanks for the share'"], ["b'awesome'"], ["b'How can I talk to Samsung'"], ["b'Wow'"], ["b'#DoWhatYouCant siempre grandes innovadores Samsung Mobile'"], ["b'I had a problem with my s7 edge when I first got it all fixed now. However when I went to the Samsung shop they were useless and rude they refused to help and said there is nothing they could do no wonder the shop was dead quiet'"], ["b'Zeeshan Khan Masti Khel'"], ["b'I dnt had any problem wd my phn'"], ["b'I have maybe just had a bad phone to start with until it got fixed eventually. I had to go to carphone warehouse they were very helpful'"], ["b'awesome'"], ["b'Ch Shuja Uddin'"], ["b'akhheeerrr'"], ["b'superb'"], ["b'nice story'"], ["b'thanks for the share'"], ["b'superb'"], ["b'thanks for the share'"], ['b"On February 18th 2017 I sent my phone away to with a screen issue. The lower part of the screen was flickering bright white. The phone had zero physical damage to the screen\\n\\nI receive an email from Samsung Quotations with a picture of my SIM tray. Upon phoning I was told my SIM tray was stuck inside the phone and was handed a \\xc2\\xa392.14 repair bill. There is no way that my SIM tray was stuck in the phone as I removed my SIM and memory card before sending the phone away.\\n\\nAfter numerous calls I finally gave in and agreed to pay the \\xc2\\xa392.14 on the understanding that my screen repair would also be covered in this cost. This was confirmed to me by the person on the phone.\\n\\nOn
Sorry for the inconvenience of reading that output. My bad.
To continue, I added,
tokens = [word_tokenize(i) for i in your_list]
for i in tokens:
    print(i)
print(tokens)
This is the part where I get the following error:
C:\Program Files\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text) in line 1278 TypeError: expected string or bytes-like object
What I want to do next is,
import nltk
en = nltk.Text(tokens)
print(len(en.tokens))
print(len(set(en.tokens)))
en.vocab()
en.plot(50)
en.count('galaxy s8')
And finally, I want to draw a wordcloud based on the data.
Being aware that every second of your time is precious, I am terribly sorry to ask for your help. I have been working on this for a couple of days and cannot find the right solution to my problem. Thank you for reading.
The error occurs because your CSV file is read into a list of lists, one for each row in the file. The file contains only one column, so each of these lists has one element: the string containing the message you want to tokenize. To get past the error, unpack the sublists by using this line instead:
tokens = [word_tokenize(row[0]) for row in your_list]
After that, you'll need to learn some more python and learn how to examine your program and your variables.
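The unpacking step can be sketched end to end without the real file. This is an illustration under stated assumptions, not the asker's final code: `simple_tokenize` is a regex stand-in for `nltk.word_tokenize` so the example runs without NLTK data, `clean_cell` and `tokenize_column` are hypothetical helpers, and the sample rows imitate the `b'...'` strings visible in the printed list. The result is one flat token list, which is the shape `nltk.Text` expects.

```python
import ast
import csv
import io
import re


def clean_cell(cell):
    """Undo a str(bytes) round trip such as "b'Nice'" when present."""
    if cell[:2] in ("b'", 'b"'):
        try:
            return ast.literal_eval(cell).decode('utf-8', errors='replace')
        except (ValueError, SyntaxError):
            return cell
    return cell


def simple_tokenize(text):
    # Stand-in for nltk.word_tokenize: words, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_column(csv_file, tokenizer=simple_tokenize):
    """Tokenize the first column of a CSV file into one flat token list."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row ('comment_message')
    tokens = []
    for row in reader:
        if row:  # ignore blank lines
            tokens.extend(tokenizer(clean_cell(row[0])))
    return tokens


sample = io.StringIO("comment_message\n\"b'looking good'\"\n\"b'Nice'\"\n")
print(tokenize_column(sample))
# ['looking', 'good', 'Nice']
```

Passing `tokenizer=word_tokenize` would swap the real NLTK tokenizer back in.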
This is my first attempt at converting audio to text.
import speech_recognition as sr
r = sr.Recognizer()
with sr.AudioFile("/path/to/.mp3") as source:
    audio = r.record(source)
When I execute the above code, the following error occurs,
<ipython-input-10-72e982ecb706> in <module>()
----> 1 with sr.AudioFile("/home/yogaraj/Documents/Python workouts/Python audio to text/show_me_the_meaning.mp3") as source:
2 audio = sr.record(source)
3
/usr/lib/python2.7/site-packages/speech_recognition/__init__.pyc in __enter__(self)
197 aiff_file = io.BytesIO(aiff_data)
198 try:
--> 199 self.audio_reader = aifc.open(aiff_file, "rb")
200 except aifc.Error:
201 assert False, "Audio file could not be read as WAV, AIFF, or FLAC; check if file is corrupted"
/usr/lib64/python2.7/aifc.pyc in open(f, mode)
950 mode = 'rb'
951 if mode in ('r', 'rb'):
--> 952 return Aifc_read(f)
953 elif mode in ('w', 'wb'):
954 return Aifc_write(f)
/usr/lib64/python2.7/aifc.pyc in __init__(self, f)
345 f = __builtin__.open(f, 'rb')
346 # else, assume it is an open file object already
--> 347 self.initfp(f)
348
349 #
/usr/lib64/python2.7/aifc.pyc in initfp(self, file)
296 self._soundpos = 0
297 self._file = file
--> 298 chunk = Chunk(file)
299 if chunk.getname() != 'FORM':
300 raise Error, 'file does not start with FORM id'
/usr/lib64/python2.7/chunk.py in __init__(self, file, align, bigendian, inclheader)
61 self.chunkname = file.read(4)
62 if len(self.chunkname) < 4:
---> 63 raise EOFError
64 try:
65 self.chunksize = struct.unpack(strflag+'L', file.read(4))[0]
I don't know what I'm doing wrong. Can someone tell me what is wrong with the above code?
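The speech_recognition library's AudioFile supports the WAV file format (the assertion message in the traceback also lists AIFF and FLAC), but not MP3.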
Here is a sample WAV to text program using speech_recognition:
Sample code (Python 3)
import speech_recognition as sr
r = sr.Recognizer()
with sr.AudioFile("woman1_wb.wav") as source:
    audio = r.record(source)
try:
    s = r.recognize_google(audio)
    print("Text: "+s)
except Exception as e:
    print("Exception: "+str(e))
Output:
Text: to administer medicine to animals is frequency of very difficult matter and yet sometimes it's necessary to do so
Used WAV File URL: http://www-mobile.ecs.soton.ac.uk/hth97r/links/Database/woman1_wb.wav
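Before handing a file to `sr.AudioFile`, it can help to check what the file actually is rather than trusting its extension. The sketch below uses only the standard library and looks at the well-known magic bytes of each container. `sniff_audio_format` and `is_supported` are hypothetical helper names, and the supported set is taken from the assertion message in the traceback (WAV, AIFF, or FLAC).

```python
def sniff_audio_format(header):
    """Guess the audio container from the first 12 bytes of a file."""
    if header[:4] == b'RIFF' and header[8:12] == b'WAVE':
        return 'wav'
    if header[:4] == b'FORM' and header[8:12] in (b'AIFF', b'AIFC'):
        return 'aiff'
    if header[:4] == b'fLaC':
        return 'flac'
    # MP3 files start with an ID3 tag or with an MPEG frame sync (11 set bits).
    if header[:3] == b'ID3':
        return 'mp3'
    if len(header) >= 2 and header[0] == 0xFF and header[1] & 0xE0 == 0xE0:
        return 'mp3'
    return 'unknown'


def is_supported(path):
    """Return (ok, fmt); ok is True for WAV, AIFF, or FLAC files."""
    with open(path, 'rb') as f:
        fmt = sniff_audio_format(f.read(12))
    return fmt in ('wav', 'aiff', 'flac'), fmt


print(sniff_audio_format(b'ID3\x03\x00\x00\x00\x00\x00\x00\x00\x00'))
# mp3
```

An MP3 file fails this check, which matches the EOFError and "file does not start with FORM id" path in the traceback.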
This is what was wrong:
speech_recognition's AudioFile does not read MP3; as the assertion in the traceback says, it expects WAV, AIFF, or FLAC.
But here is a more complete answer on how to get from MP3 to text:
This is a processing function that uses speech_recognition and pydub to convert MP3 into WAV and then into text using Google's Speech API. It splits the MP3 file into 60 s chunks to fit within Google's limits, which allows about 50 minutes of audio per day, because it will block you after 50 API calls.
from pydub import AudioSegment  # uses FFMPEG
import speech_recognition as sr
from pathlib import Path
#from pydub.silence import split_on_silence
#import io
#from pocketsphinx import AudioFile, Pocketsphinx
def process(filepath, chunksize=60000):
    #0: load mp3
    sound = AudioSegment.from_mp3(filepath)
    #1: split file into 60s chunks
    def divide_chunks(sound, chunksize):
        # step through the audio in chunksize-millisecond slices
        for i in range(0, len(sound), chunksize):
            yield sound[i:i + chunksize]
    chunks = list(divide_chunks(sound, chunksize))
    print(f"{len(chunks)} chunks of {chunksize/1000}s each")
    r = sr.Recognizer()
    #2: per chunk, save to wav, then read and run through recognize_google()
    string_index = {}
    for index, chunk in enumerate(chunks):
        #TODO io.BytesIO()
        chunk.export('/Users/mmaxmeister/Downloads/test.wav', format='wav')
        with sr.AudioFile('/Users/mmaxmeister/Downloads/test.wav') as source:
            audio = r.record(source)
        #s = r.recognize_google(audio, language="en-US") #, key=API_KEY) --- my key results in broken pipe
        s = r.recognize_google(audio, language="en-US")
        print(s)
        string_index[index] = s
    # no break inside the loop, so every chunk is transcribed (as in the output below)
    return string_index
text = process('/Users/mmaxmeister/Downloads/UUCM.mp3')
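The chunking arithmetic in `divide_chunks` can be checked on its own, without pydub or an audio file. pydub measures `len(sound)` and slice indices in milliseconds, so the sketch below works on plain integers. `chunk_bounds` is a hypothetical helper, and the 750000 ms length is a made-up example, not the length of the sermon file.

```python
def chunk_bounds(length_ms, chunksize=60000):
    """Return (start, end) millisecond pairs, like the slices in divide_chunks."""
    return [(start, min(start + chunksize, length_ms))
            for start in range(0, length_ms, chunksize)]


bounds = chunk_bounds(750000)  # hypothetical 12.5-minute recording
print(len(bounds), bounds[-1])
# 13 (720000, 750000)
```

Each chunk costs one `recognize_google` call, so `len(bounds)` is also the number of API calls a file will use against the 50-call limit mentioned above.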
My test MP3 file was a sermon from archive.org:
https://ia801008.us.archive.org/24/items/UUCMService20190602IfWeBuildIt/UUCM%20Service%202019-06-02%20-%20If%20We%20Build%20It.mp3
And this is the text returned (each line is 60s of audio):
13 chunks of 60.0s each
please join me in a spirit of prayer Spirit of Life known in many ways by a million names gracious Spirit of Life unfolding never known in its fullness be with us hear our cries for deliverance dance with us in exultation hold us when we fall keep before us the reality that every day is a gift to be unwrapped a gift to help discover why we live why we are cast Here and Now
Austin teaches us that the days come and go like muffled veiled figures Sent From A Distant friendly party but they say nothing and if we do not use the gifts they bring us they will carry them away as silently as they came through buying source of all Bend us towards gratitude and compassion Modern Life demands much misery and woe get created all around us but there is more much more show us that much more belongs to us to light Dawns on those who live love and sing the truth Joy John's on those who humbly toiled
do what is just so you who can shine pass on your light when it Dawns on you and let us all find the space to see Life as a gift to see our days as blessings and let us return life's gift and promise with grateful hearts and acts of kindness in the name of all the each of us teams holiest within our hearts we pray Amon
my character at least when I was younger I'm sure I don't really do this anymore the most challenging aspect of my character is that I want wisdom yesterday I don't want to have to learn something now I should have known it already right I used to drive my poor parents crazy as they tried to help me with my homework my father with my math my mother with my spelling if I didn't know the answer as soon as the problem was in front of me I would get angry frustrated with myself how come I didn't already know that I'm supposed to be wise only child I wonder if that has anything to do around the room has been throughout my life
but I still see it manifest in one particular aspects of my being I want us all to know how to love one another perfectly with wisdom already we should have learned that yesterday I want the Beloved Community right now and it frustrates me to no end that it isn't here what was that song that we saying after the prayer response how could anyone ever tell you you were anything less than beautiful how do we do that how do we tell ourselves that and others that we are not all the time how do we do that how come we haven't figured that out yesterday there's been a great Salve and corrective
to this challenge of my personality that I found in Community First the bomb the South I find the in community in this started when I was a youth in my youth group when we were ten people sitting on a floor together on pillows telling one another about what we've been through that week and how much pain we were carrying and how much we needed one another I found in that youth group with just 10 of us are sitting on the floor that we could be the Beloved Community for one another if only just for one hour a week sometimes just for 5 minutes sometimes just for a moment that was the Sal that was the bomb I realize that maybe we can't do it all the time but we can do it in moments and in spaces and that only happen for me in the space
community that community that we created with one another and the corrective to my need to have things everything done yesterday that also happens in community because Community is the slowest place on Earth We're going to have our annual meeting later let's see how slow that's going to be but the truth of the matter is that even in that slowness when you're working really hard to set up or cleanup connection Cafe when you're trying to figure out how to set up membership so that we actually do talk to everybody who comes through the doors right when you're doing that work of the Care team and that big list of all the different people that we need to reach out to and and we have to figure out how we reached out to all of them and who's done it
when you're waiting for the sermon to be over in all of these waiting times and all of these phases of process what I've learned in that slowness something amazing something remarkable is happening we are dedicating ourselves over and over again to still being together cuz it's not always easy because we're all broken and we're all whole because sometimes this is incredibly difficult and painful but when we're in those times what we're doing is we're saying it's worth it something about this matters to me and to all of us and Becca's got a great story to illustrate the this comes from
used to have a radio show in Boston maybe you heard him at some point Unitarian Universalist Minister from Boston 66 driving lessons and five road test she was 70 years old and had never driven a car but on July 25th 1975 she went to the Rockland County driving school and took her first lesson her husband had already had heart trouble and might someday be unable to drive if that happened she wanted to be the one to do the shopping and Shake him to the doctor she began the slow and painful process of learning to start stop turn into traffic back up after 5 difficult month she took the driving test
before ever she wrote in her diary and was not nervous she just a test a month later and slumped again I did everything wrong she told her diary demonic in August of 1976 she resumed the lessons with the Eaton driving school and took her third road test in October with half the world praying for me
she took a double driving lesson the next day and parallel park 6 times after three more lessons she took her Fifth and final test on January 21st 1977 and passed she had spent $860 on 66 plus 5 road test and at the age of 71 she had her license good three years later he did for several months she was the one who drove to the hospital Supermarket Pharmacy in church when we were children
someone rafter do another instructed by the spider's persistence Robert brute Robert Bruce left the Hut gathered his men and defeated the dance my mother and body of the story but it was not just persistence that moved her but love for the man who was her other self do you want to know what love is it's 66 driving lessons and five road tests and a very tough lady
who won't give up because her love is that great thank you all for bringing this to this beloved in moments community
That's pretty good for FREE. Unfortunately the google-cloud API version is prohibitively expensive if I wanted to transcribe hours of content.