So I've been working on a Reddit bot and have run into one issue. Let's say, for example, I'm trying to find all comments with the word "man" in them: the bot will find comments with this word, but it will also find comments where "man" appears inside another word, for example "woman". I only want it to find the exact word "man".
I am using the package "praw". I know this is going to be a really easy fix, but for some reason I can't figure it out. This is the code to find the word in a comment:
if "man" in comment.body:
If you need to see more of the code, just let me know. Any help would be great. Also, I am using Python to make this bot.
The word "man" appears exactly in the word "woman". You need to be more specific in your search, and that depends on the content of the comments. Perhaps you can search for " man" (but that would still count cases like " manage"), so perhaps you can do a regex search for whitespace followed by the string "man" followed by either whitespace or punctuation?
So I'm working on a project where I need to manually filter the HTML of social media comment threads with split, replace, re.sub, and the like; I wouldn't get the information I need otherwise (BeautifulSoup filters out important information too). In the end, I'm left with something like this:
Best of luck to you now that there's some real competition \xf0\x9f\x98\x8f
Thanks \xf0\x9f\x98\x82
I searched for any way to get rid of these or replace them with actual emojis, but I found nothing. I did find commands that filter out emojis when they look like this U+1F600 or like this :cowboy hat face: or like this \U0001F606, and I did find someone who filtered things like this \xe2\x80\x99, but he only did it for semicolons and quotation marks, not emojis. I also couldn't find a way to use encode and decode for this.
Short: I want "Thanks \xf0\x9f\x98\x82" to become "Thanks".
So I'm new to working with websites and maybe the answer is quite simple, but as I said, I found nothing on this on the internet. Any help is very appreciated!
If you only want ASCII characters in your text, you can encode and decode the text with ASCII:
text = """Best of luck to you now that there's some real competition \xf0\x9f\x98\x8f
Thanks \xf0\x9f\x98\x82"""
text = text.encode('ascii', 'ignore').decode()
>>> print(text)
Best of luck to you now that there's some real competition
Thanks
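As an aside on the other option mentioned in the question (turning the escapes back into real emoji rather than dropping them): sequences like \xf0\x9f\x98\x82 are UTF-8 bytes that were decoded as Latin-1, so a round trip like the one below should recover the emoji, assuming every character in the string is in the Latin-1 range.

text = "Thanks \xf0\x9f\x98\x82"
# Re-encode the mis-decoded characters back to bytes, then decode them as UTF-8.
fixed = text.encode('latin-1').decode('utf-8')
print(fixed)  # Thanks 😂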
I have a Python project in mind but I'm not too sure where to start.
I want to do some text comparison between two blocks of text: a user should be able to input two blocks of text, and the program should identify the parts that are different/not the same.
I've seen this functionality in Git - when you make a change in a repo, it shows you the changes before you commit - this makes me think that I should be able to make something with similar functionality.
Any kinda' insight would be greatly appreciated!!
EDIT:
While searching I came across this Git repo online, and it's exactly what I'm looking for! A simple GUI interface where a user can load two different files and see the similarities or differences between them!
For others looking for something similar: https://github.com/yebrahim/pydiff
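For anyone who would rather stay in the standard library: difflib (which pydiff builds on) produces the same kind of git-style diff. A small sketch with made-up inputs:

import difflib

old = "the quick brown fox\njumps over the lazy dog".splitlines()
new = "the quick brown fox\nleaps over the lazy dog".splitlines()

# unified_diff yields git-style "-"/"+" lines for the parts that changed.
for line in difflib.unified_diff(old, new, fromfile="before", tofile="after", lineterm=""):
    print(line)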
From my point of view, you can take user input and store it in two strings, say str1 and str2, then make use of the split() method, or rather word_tokenize() (from Natural Language Processing), to get all the words in each string.
If you want, you can also remove stopwords here for better comparison.
Now you can run a loop comparing each word, and for clarity you can underline the words, or the particular part of a word, that don't match.
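A rough sketch of that loop, using plain split() (word_tokenize() could be dropped in the same way):

str1 = input("First block of text: ")
str2 = input("Second block of text: ")

words1 = str1.split()
words2 = str2.split()

# Compare word by word; anything left over in the longer block
# is reported as a difference too.
for i in range(max(len(words1), len(words2))):
    w1 = words1[i] if i < len(words1) else ""
    w2 = words2[i] if i < len(words2) else ""
    if w1 != w2:
        print(f"word {i}: {w1!r} vs {w2!r}")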
I'm working with a corpus I've scraped from Twitter activist communities in order to study the modern era of community organizing. I'm trying to run these data through re.findall in order to identify the tweets focused on location. I think that using the keyword "at" may be the easiest way to accomplish this.
Basically, if the entire tweet is (for example) "all who wish 2 join, meet at city hall 3pm", my code should print out something like "meet at city hall" for that line. Is this possible, or am I fundamentally misunderstanding the utility of regex? I've only ever really used them for extracting email information previously, so I'm used to writing code like this:
match = re.findall(r'[\w\.-]+@[\w\.-]+', line)
However, attempting to exchange the '@' in the code above for an 'at' doesn't yield any results.
I'm probably not even asking the right question here. Apologies for any confusion I cause and I appreciate any and all help!
If I understand correctly, you are just trying to match a sentence with the word "at" or "@"?
This is the regex I came up with:
r'[\w\s]+(at|@)[\w\s]+\.?'
This will match any words before and after an "at" or "@".
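A small sketch of how that can look in practice, with word boundaries (\b) added so the standalone word "at" matches but words like "meat" or "later" do not:

import re

tweet = "all who wish 2 join, meet at city hall 3pm"

# \b keeps "at" from matching inside other words; the surrounding
# [\w\s]+ grabs the words on either side of it.
match = re.search(r'[\w\s]+\bat\b[\w\s]+', tweet)
if match:
    print(match.group(0).strip())  # meet at city hall 3pm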
For future reference: next time you are creating a regex, use https://regex101.com/. I find it helps a ton.
I am very new to Python, and I am currently working on a project. This project would be to create (among other things) a program to correct a text. I am having difficulty combining two separate ideas and parts of code together. First of all, I have been experimenting with a code to correct a word that is inputted by a user.
The code can be found here.
So far, I am using this exact code without any modifications.
My goal is to be able to read a text file, go through it, and find and propose corrections for the words that are wrong, as this spellchecker code does.
I would use something like:
with open('words.txt', 'r') as f:
    for line in f:
        for word in line.split():
to go through the text file and split it into individual words.
Ideally, if my text said
"Wgat is the definiton" I would want to be able to recognize wgat and correct it to what, and recognize definiton and correct to definition.
How do I combine these two ideas? Thanks
Maybe you should look at this:
https://norvig.com/spell-correct.html
It uses probability to give the best answer without being connected to a database.
Otherwise, you can use urllib to download an English word list, for example: http://www.mieliestronk.com/corncob_lowercase.txt
Then find the word in that list that is closest to the one the user typed and print it.
Hope it helps!
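A rough sketch of combining the two ideas, assuming the correction() function from Norvig's article has been saved into a module (spell.py here is just a placeholder name):

from spell import correction  # hypothetical module holding Norvig's spell-correct code

with open('words.txt', 'r') as f:
    for line in f:
        for word in line.split():
            fixed = correction(word.lower())
            if fixed != word.lower():
                print(f"{word} -> {fixed}")  # e.g. "Wgat -> what", "definiton -> definition"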
I'm building a website in django that needs to extract key words from short (twitter-like) messages.
I've looked at packages like topia.textextract and nltk, but both seem to be overkill for what I need to do. All I need to do is filter out words like "and", "or", and "not", keeping the nouns and verbs and dropping conjunctions and other minor parts of speech. Are there any "simpler" packages out there that can do this?
EDIT: This needs to be done in near real-time on a production website, so using a keyword extraction service seems out of the question, based on their response times and request throttling.
You can make a set sw of the "stop words" you want to eliminate (maybe copy it once and for all from the stop words corpus of NLTK, depending on how familiar you are with the various natural languages you need to support), then apply it very simply.
E.g., if you have a list of words sent that make up the sentence (shorn of punctuation and lowercased, for simplicity), [word for word in sent if word not in sw] is all you need to make a list of non-stopwords -- could hardly be easier, right?
To get the sent list in the first place, using the re module from the standard library, re.findall(r'\w+', sentstring) might suffice if sentstring is the string with the sentence you're dealing with -- it doesn't lowercase, but you can change the list comprehension I suggest above to [word for word in sent if word.lower() not in sw] to compensate for that and (btw) keep the word's original case, which may be useful.
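Putting those pieces together, a minimal sketch (it assumes NLTK's stopwords corpus has already been fetched once with nltk.download('stopwords')):

import re
from nltk.corpus import stopwords

# Build the stop-word set once, then filter each message against it.
sw = set(stopwords.words('english'))

sentstring = "Grab coffee and meet me at the station, not the office"
sent = re.findall(r'\w+', sentstring)
keywords = [word for word in sent if word.lower() not in sw]
print(keywords)  # ['Grab', 'coffee', 'meet', 'station', 'office']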
Abbreviations like NO for navigation officer or OR for operations room need a little care lest you cause a SNAFU ;-) One suspects that better results could be obtained from "Find the NO and send her to the OR" by tagging the words with parts of speech using the context ... hint 1: "the OR" should result in "the [noun]" not "the [conjunction]". Hint 2: if in doubt about a word, keep it as a keyword.