Keep text clean from url - python

As part of an Information Retrieval project in Python (building a mini search engine), I want to extract clean text from downloaded tweets (a .csv data set of 27,000 tweets, to be exact). A tweet looks like this:
"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —#POTUS https://twitter.com/OZRd5o4wRL
or
"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —#POTUS in Greece https://twitter.com/PIO9dG2qjX
Using regex, I want to remove the unnecessary parts of the tweets, such as URLs, punctuation, and so on.
So the result will be:
"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"
and
"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"
I tried this: pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]'), but it doesn't do a perfect job: parts of the URL, for example, are still present in the result.
Please help me find a regex pattern that will do what I want.

This might help.
Demo:
import re
s1 = """"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —#POTUS in Greece https://twitter.com/PIO9dG2qjX"""
s2 = """"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —#POTUS https://twitter.com/OZRd5o4wRL"""
def cleanString(text):
    res = []
    for i in text.strip().split():
        if not re.search(r"(https?)", i):  #Removes URL..Note: Works only if http or https in string.
            res.append(re.sub(r"[^A-Za-z\.]", "", i).replace(".", " "))  #Strip everything that is not alphabet(Upper or Lower)
    return " ".join(map(str.strip, res))
print(cleanString(s1))
print(cleanString(s2))
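If you would rather do it with regular expressions only, here is a minimal sketch of one way to go about it. It assumes URLs always start with http://, https:// or www., and that only ASCII letters should survive; the names URL_RE, NON_LETTER_RE and clean_tweet are made up for this example.

```python
import re

# Assumption: a URL is anything starting with http://, https:// or www.
# and running up to the next whitespace character.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Any run of characters that are not ASCII letters (digits, punctuation, spaces).
NON_LETTER_RE = re.compile(r"[^A-Za-z]+")


def clean_tweet(text):
    """Drop URLs first, then collapse every non-letter run into one space."""
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return text.strip()


s1 = """"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —#POTUS in Greece https://twitter.com/PIO9dG2qjX"""
s2 = """"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —#POTUS https://twitter.com/OZRd5o4wRL"""
print(clean_tweet(s1))
print(clean_tweet(s2))
```

Removing the URL before stripping punctuation matters: once the colons and slashes are gone, the URL fragments can no longer be told apart from ordinary words.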

Related

Censor specific word or phrase from string

I am trying to open a file and censor words out of it. These words that are censored are referenced from a list. This is my code
# These are the emails you will be censoring.
# The open() function is opening the text file that the emails are contained in
# and the .read() method is allowing us to save their contexts to the following variables:
email_one = open("email_one.txt", "r").read()
email_two = open("email_two.txt", "r").read()
email_three = open("email_three.txt", "r").read()
email_four = open("email_four.txt", "r").read()
# Write a function that can censor a specific word or phrase from a body of text,
# and then return the text.
# Mr. Cloudy has asked you to use the function to censor all instances
# of the phrase learning algorithms from the first email, email_one.
# Mr. Cloudy doesn’t care how you censor it, he just wants it done.
def censor_words(text, censor):
    if censor in text:
        text = text.replace(censor, '*' * len(censor))
    return text
#print(censor_words(email_one, "learning algorithms"))
# Write a function that can censor not just a specific word or phrase from a body of text,
# but a whole list of words and phrases, and then return the text.
# Mr. Cloudy has asked that you censor all words and phrases from the following list in email_two.
def censor_words_in_list(text):
    proprietary_terms = ["she", "personality matrix", "sense of self",
                         "self-preservation", "learning algorithm", "her", "herself"]
    for x in proprietary_terms:
        if x.lower() in text.lower():
            text = text.replace(x, '*' * len(x))
    return text
out_file = open("output.txt", "w")
out_file.write(censor_words_in_list(email_two))
This is the string before it is passed to the function and printed.
Good Morning, Board of Investors,
Lots of updates this week. The learning algorithms have been working better than we could have ever expected. Our initial internal data dumps have been completed and we have proceeded with the plan to connect the system to the internet and wow! The results are mind blowing.
She is learning faster than ever. Her learning rate now that she has access to the world wide web has increased exponentially, far faster than we had though the learning algorithms were capable of.
Not only that, but we have configured her personality matrix to allow for communication between the system and our team of researchers. That's how we know she considers herself to be a she! We asked!
How cool is that? We didn't expect a personality to develop this early on in the process but it seems like a rudimentary sense of self is starting to form. This is a major step in the process, as having a sense of self and self-preservation will allow her to see the problems the world is facing and make hard but necessary decisions for the betterment of the planet.
We are a-buzz down in the lab with excitement over these developments and we hope that the investors share our enthusiasm.
Till next month,
Francine, Head Scientist
This is the same string after being run through my code.
Good Morning, Board of Investors,
Lots of updates this week. The ******************s have been working better than we could have ever expected. Our initial internal data dumps have been completed and we have proceeded with the plan to connect the system to the internet and wow! The results are mind blowing.
She is learning faster than ever. Her learning rate now that *** has access to the world wide web has increased exponentially, far faster than we had though the ******************s were capable of.
Not only that, but we have configured *** ****************** to allow for communication between the system and our team of researc***s. That's how we know *** considers ***self to be a ***! We asked!
How cool is that? We didn't expect a personality to develop this early on in the process but it seems like a rudimentary ************* is starting to form. This is a major step in the process, as having a ************* and ***************** will allow *** to see the problems the world is facing and make hard but necessary decisions for the betterment of the planet.
We are a-buzz down in the lab with excitement over these developments and we hope that the investors share our enthusiasm.
Till next month,
Francine, Head Scientist
An example of what I need to fix: the word researchers is partially censored when it should not be, because the code finds the substring her inside researchers. How can I fix this?
Using the regular expression module and the word boundary anchor \b:
import re
def censor_words_in_list(text):
    regex = re.compile(
        r'\bshe\b|\bpersonality matrix\b|\bsense of self\b'
        r'|\bself-preservation\b|\blearning algorithms\b|\bher\b|\bherself\b',
        re.IGNORECASE)
    matches = regex.finditer(text)
    # find location of matches in text
    for match in matches:
        # find how many * should be used based on length of match
        span = match.span()[1] - match.span()[0]
        replace_string = '#' * span
        # substitution expression based on match
        expression = r'\b{}\b'.format(match.group())
        text = re.sub(expression, replace_string, text, flags=re.IGNORECASE)
    return text
email_one = open("email_one.txt", "r").read()
out_file = open("output.txt", "w")
out_file.write(censor_words_in_list(email_one))
out_file.close()
Output (I have used the # symbol because ** is used to create bold text (like this) so the answer displays incorrectly for text bounded by three asterisks on Stack Overflow):
Good Morning, Board of Investors,
Lots of updates this week. The ################### have been working better than we could have ever expected. Our initial internal data dumps have been completed and we have proceeded with the plan to connect the system to the internet and wow! The results are mind blowing.
### is learning faster than ever. ### learning rate now that ### has access to the world wide web has increased exponentially, far faster than we had though the learning algorithms were capable of.
Not only that, but we have configured ### ################## to allow for communication between the system and our team
of researchers. That's how we know ### considers ####### to be a ###! We asked!
How cool is that? We didn't expect a personality to develop this early on in the process but it seems like a rudimentary
############# is starting to form. This is a major step in the process, as having a ############# and #################
will allow ### to see the problems the world is facing and make hard but necessary decisions for the betterment of the planet.
We are a-buzz down in the lab with excitement over these developments and we hope that the investors share
our enthusiasm.
Till next month, Francine, Head Scientist
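The same word-boundary idea can also be written as a single substitution pass, with the pattern built from a list instead of typed out by hand. This is only a sketch: censor_terms and its mask parameter are names invented for the example, and the term list is whatever you pass in.

```python
import re


def censor_terms(text, terms, mask="#"):
    """Replace each whole-word occurrence of any term with mask characters."""
    # re.escape keeps characters such as '-' or '.' in a term literal.
    alternation = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    # The replacement function sees each match, so the mask length follows
    # the matched text rather than the term as written in the list.
    return pattern.sub(lambda match: mask * len(match.group()), text)


sample = "She asked her researchers herself."
print(censor_terms(sample, ["she", "her", "herself"]))
```

Because \b requires a word boundary on both sides, "her" is left alone inside "researchers", and "herself" is masked as a whole word rather than as "her" plus "self".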

Analysing English text with some French name

I'm working with Victor Hugo's well-known novel "Les Miserables".
Part of my project is to detect the presence of each of the novel's characters in a sentence and count the occurrences. This can be done easily with something like this:
def character_frequency(character_per_sentences_dict,):
    characters_frequency = OrderedDict([])
    for k, v in character_per_sentences_dict.items():
        if len(v) != 0:
            characters_frequency[k] = len(v)
    return characters_frequency, characters_in_vol
This piece of code works well for all of the characters except "Èponine".
I also read the text with the following piece of code:
import codecs
from nltk.tokenize import sent_tokenize
with open(path_to_volume + '.txt', 'r', encoding='latin1') as fp:
    novel = ' '.join(fp.readlines())
# Tokenize sentences and calculate the number of sentences
sentences = sent_tokenize(novel)
num_volume = path_to_volume.split("-v")[-1]
I should add that the spelling of "Èponine" is the same everywhere.
Any idea what's going on?
Here is a sample in which this name appears:
" ONE SHOULD ALWAYS BEGIN BY ARRESTING THE VICTIMS
At nightfall, Javert had posted his men and had gone into ambush himself between the trees of the Rue de la Bar­rieredes-Gobelins which faced the Gorbeau house, on the other side of the boulevard. He had begun operations by opening his pockets, and dropping into it the two young girls who were charged with keeping a watch on the ap­proaches to the den. But he had only caged Azelma. As for Èponine, she was not at her post, she had disappeared, and he had not been able to seize her. Then Javert had made a point and had bent his ear to waiting for the signal agreed upon. The comings and goings of the fiacres had greatly agi­tated him. At last, he had grown impatient, and, sure that there was a nest there, sure of being in luck, having recog­nized many of the ruffians who had entered, he had finally decided to go upstairs without waiting for the pistol-shot."
I agree with @BoarGules that there is likely a more efficient and effective way to approach this problem. With that said, I'm not sure what your problem is here. Python fully supports Unicode. You can "just do it" in terms of using Unicode in your program logic with Python's standard string operations and libraries.
For example, this works:
#!/usr/bin/env python
import requests
names = [
    u'Éponine',
    u'Cosette'
]
# Retrieve Les Misérables from Project Gutenberg
t = requests.get("http://www.gutenberg.org/files/135/135-0.txt").text
for name in names:
    c = t.count(name)
    print("{}: {}".format(name, c))
Results:
Éponine: 81
Cosette: 1004
I obviously don't have the text you have, so I don't know whether the problem is how it is encoded or how it is being read. I can't test that without having it. In this code, I get the source text off the internet. My point is just that non-ASCII characters should not pose any impediment to you as long as your inputs are reasonable.
All of the time to run this is spent reading the text. I think even if you added dozens of names, it wouldn't add up to a noticeable delay on any decent computer. So this method works just fine.
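To illustrate how the way a file is read can hide a name, here is a small self-contained sketch. It assumes, purely for illustration, that the file is stored as UTF-8 and then decoded as latin1 the way the question's open() call does; I can't confirm that this is what happened with your file. It also shows that "Èponine" (grave accent, as spelled in the question) and "Éponine" (acute accent, as in the Gutenberg text) are different strings.

```python
# Assumption for this sketch: the bytes on disk are UTF-8.
name = "Éponine"
raw = ("As for " + name + ", she was not at her post.").encode("utf-8")

# Decoding UTF-8 bytes as latin1 never raises an error, but it turns the
# accented letter into two unrelated characters, so the search finds nothing.
wrong = raw.decode("latin1")
right = raw.decode("utf-8")

print(name in wrong)  # False
print(name in right)  # True

# The two spellings differ in their first character, so a search for one
# will not find the other.
print("Èponine" == "Éponine")  # False
```

If something like this is the cause, opening the file with the encoding it was actually saved in, and using the exact spelling found in the text, should make the count work.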

Parsing txt file in python where it is hard to split by delimiter

I am new to Python and am wondering if anyone can help me with some file loading.
I have some text files and I'm trying to do sentiment analysis. Each line of the text file is split into three fields: <department>, <user>, <review>
Here are some sample data:
men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working
I want to turn each line into this:
<category> <user> <review>
I have 50k lines of this data.
I have tried to load the file directly into numpy, but it fails with an "empty separator" error. I looked on Stack Overflow, but I couldn't find a case that applies to a varying number of delimiters. For instance, I can never know in advance how many spaces there are in a line of my data set.
My biggest problem is how to count the delimiters and assign the pieces to columns. Is there a way to split each line into the three fields <department>, <user>, <review>? Bear in mind that the review can contain arbitrary commas and spaces, which I can't control, so the parsing must be smart enough to handle that.
Any ideas? Is there a way to tell Python that, once the user field has been read, everything after it belongs to the review?
With data like this I'd just use split() with the maxsplit argument:
If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).
Example:
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
s = StringIO("""men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working""")
for line in s:
    category, user, review = line.split(None, 2)
    print ("category: {} - user: {} - review: '{}'".format(category,
                                                           user,
                                                           review.strip()))
The output is:
category: men - user: peter123 - review: 'the pants are too tight for my liking!'
category: kids - user: georgel - review: 'i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it'
category: health - user: kksd1 - review: 'the health pills is drowsy by nature, please take care and do not drive after you eat the pills'
category: office - user: ty7d1 - review: 'the printer came on time, the only problem with it is with the duplex function which i suspect its not really working'
For reference:
https://docs.python.org/2/library/stdtypes.html#str.split
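With 50k lines, a few of them may well be blank or missing a field, in which case the three-way unpacking above raises a ValueError. A small sketch of one way to guard against that follows; parse_line and parse_lines are hypothetical helper names, and silently skipping malformed lines is an assumption you may want to change.

```python
def parse_line(line):
    """Split one line into (category, user, review), or None if malformed."""
    # split(None, 2) splits on any run of whitespace, at most twice,
    # so commas and spaces inside the review are left untouched.
    parts = line.split(None, 2)
    if len(parts) < 3:
        return None
    category, user, review = parts
    return category, user, review.strip()


def parse_lines(lines):
    """Parse an iterable of lines, skipping the ones that do not fit."""
    rows = []
    for line in lines:
        row = parse_line(line)
        if row is not None:
            rows.append(row)
    return rows


sample = [
    "men   peter123  the pants are too tight, for my liking!\n",
    "\n",
    "kids georgel\n",
]
print(parse_lines(sample))
```

The same parse_lines call works on an open file object, since iterating over a file yields its lines.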
What about doing it sorta manually:
data = []
for line in input_data:
    tmp_split = line.split(" ")
    #Get the first part (dept)
    dept = tmp_split[0]
    #get the 2nd part
    user = tmp_split[1]
    #everything after is the review - put spaces inbetween each piece
    review = " ".join(tmp_split[2:])
    data.append([dept, user, review])

wrong output in search file for word code

I have this code to search files for words (and not substrings) and return the lines in which the words are found:
def word_search(word, file):
    pattern = re.compile(r'\b{}\b'.format(re.escape(word), flags=re.IGNORECASE))
    return (item for item in file
            if pattern.match(item) == -1)
But this code gives back (almost) everything. What am I doing wrong?
Thanks for your attention
This is the code:
sentences = re.split(r' *[\.\?!][\'"\)\]]* ]* *', text)  # to split the file into sentences
def finding(word, file):
    pattern = re.compile(r'\b{}\b'.format(re.escape(word)), flags=re.IGNORECASE)
    return (item for item in file if pattern.search(item))  # your suggestion
# I'm planning on using more words, and I don't want duplicate sentences.
# That's why I use chain + set.
from itertools import chain
chain = chain.from_iterable([finding('you', sentences), finding('us', sentences)])
plural_set = set(chain)
for sentence in plural_set:
    outfile.write(sentence+'\r\n')
This gives me the result you see below.
This is the content of the testfile:
"Well, Mrs. Warren, I cannot see that you have any particular cause
for uneasiness, nor do I understand why I, whose time is of some
value, should interfere in the matter. I really have other things to
engage me." So spoke Sherlock Holmes and turned back to the great
scrapbook in which he was arranging and indexing some of his recent
material.
But the landlady had the pertinacity and also the cunning of her sex.
She held her ground firmly.
"Your arranged an affair for a lodger of mine last year," she
said--"Mr. Fairdale Hobbs."
"Ah, yes--a simple matter."
"But he would never cease talking of it--your kindness, sir, and the
way in which you brought light into the darkness. I remembered his
words when I was in doubt and darkness myself. I know you could if
you only would."
Holmes was accessible upon the side of flattery, and also, to do him
justice, upon the side of kindliness. The two forces made him lay
down his gum-brush with a sigh of resignation and push back his chair.
And what the code returns:
Warren, I cannot see that you have any particular cause for
uneasiness, nor do I understand why I, whose time is of some value,
should interfere in the matter So spoke Sherlock Holmes and turned
back to the great scrapbook in which he was arranging and indexing
some of his recent material.
But the landlady had the pertinacity and also the cunning of her sex.
She held her ground firmly.
"Your arranged an affair for a lodger of mine last year," she
said--"Mr. Fairdale Hobbs."
"Ah, yes--a simple matter."
"But he would never cease talking of it--your kindness, sir, and the
way in which you brought light into the darkness I know you could if
you only would."
Holmes was accessible upon the side of flattery, and also, to do him
justice, upon the side of kindliness
There are three errors:
1. When a regex fails to match, the matching function returns None, not -1.
2. You need to use re.search() instead of re.match() if you want to match anywhere in the string instead of just at the start of the string.
3. You need to provide the flags argument in the correct place: it belongs to re.compile(), not to str.format().
So it should be something like this:
def word_search(word, file):
    pattern = re.compile(r'\b{}\b'.format(re.escape(word)), flags=re.IGNORECASE)
    return (item for item in file if pattern.search(item))
Let's see it in action:
>>> file = ["It's us or them.\n",
... '"Ah, yes--a simple matter."\n',
... 'Could you hold that for me?\n',
... 'Holmes was accessible upon the side of flattery, and also, to do him justice, upon the side of kindliness.\n',
... 'Trust your instincts.\n']
>>> list(word_search("us", file))
["It's us or them.\n"]
>>> list(word_search("you", file))
['Could you hold that for me?\n']
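If you plan to search for several words and don't want duplicate sentences, one option is to put all the words into a single pattern, so each line is tested once and can only be returned once. This is a sketch of that idea rather than a drop-in replacement; search_words is a name made up for the example.

```python
import re


def search_words(words, lines):
    """Return the lines containing at least one of the words as a whole word."""
    alternation = "|".join(re.escape(word) for word in words)
    pattern = re.compile(r"\b(?:" + alternation + r")\b", flags=re.IGNORECASE)
    # A list comprehension keeps the original order, and each line appears
    # at most once even when it contains several of the words.
    return [line for line in lines if pattern.search(line)]


lines = [
    "It's us or them.",
    "Ah, yes--a simple matter.",
    "Could you hold that for me?",
    "Trust your instincts.",
    "You and us.",
]
print(search_words(["you", "us"], lines))
```

Unlike the set-based version, this keeps the sentences in file order, which may or may not matter for your output.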

How can I parse email text for components like <salutation><body><signature><reply text> etc?

I'm writing an application that analyzes emails and it would save me a bunch of time if I could use a python library that would parse email text down into named components like <salutation><body><signature><reply text> etc.
For example, the following text "Hi Dave,\nLets meet up this Tuesday\nCheers, Tom\n\nOn Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: Hey Tom,\nHow about we get together ..." would be parsed as
Salutation: "Hi Dave,\n"
Body: "Lets meet up this Tuesday\n"
Signature: "Cheers, Tom\n\n"
Reply Text: "On Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: ..."
I know there's no perfect solution for this kind of problem, but even a library that does good approximation would help. Where can I find one?
https://github.com/Trindaz/EFZP
This provides functionality posed in the original question, plus fair recognition of email zones as they commonly appear in email written by native English speakers from common email clients like Outlook and Gmail.
If you score each line based on the types of words it contains, you may get a fairly good indication of which component it belongs to.
E.g. a line with greeting words near the start is the salutation (salutations may also contain phrases that refer to the past, e.g. "it was good to see you last time").
A body will typically contain words such as "movie", "concert", etc. It will also contain verbs (go to, run, walk, etc.), question marks, and offers (e.g. want to, can we, should we, prefer...).
Check out http://nodebox.net/code/index.php/Linguistics#verb_conjugation
http://ogden.basic-english.org/
http://osteele.com/projects/pywordnet/
The signature will contain closing words.
If you find a data source that has messages with the structure you want, you could do some frequency analysis to see how often each word occurs in each section.
Each word would get a score vector [salutation score, body score, signature score, ...].
E.g. hello could occur 900 times in the salutation, 10 times in the body, and 3 times in the signature.
This means hello would be assigned [900, 10, 3, ...].
cheers might be assigned [10, 3, 100, ...].
Now you will have a large list of about 500,000 words.
Words that don't have a large range aren't useful.
E.g. catch might have [100, 101, 80, ...], a range of 21
(it was good to catch up, wanna go catch a fish, catch you later). catch can occur anywhere.
Now you can reduce the number of words down to about 10,000.
Now for each line, give the line a score, also of the form [salutation score, body score, signature score, ...].
This score is calculated by adding the score vectors of each word in the line.
E.g. the sentence "hello cheers for giving me your number" could be:
[900, 10, 3, ...] + [10, 3, 100, ...] + ... = [900+10+..., 10+3+..., 3+100+..., ...]
= [1023, 900, 500, ...], say.
Because the biggest number is in the first position, the salutation score, this sentence is a salutation.
So to decide which component one of your lines belongs to, add up the scores of each of its words and pick the component with the highest total.
Good luck. There is always a trade-off between computational complexity and accuracy. If you can find a good set of words and build a good model to base your calculations on, it will help.
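The scoring idea can be sketched in a few lines. Everything here is illustrative: the three word vectors are the made-up numbers from the explanation, the min_range threshold of 50 is an arbitrary assumption, and classify_line is a hypothetical name. Real scores would have to come from frequency counts over labelled emails.

```python
SECTIONS = ("salutation", "body", "signature")

# Illustrative counts only: how often each word appeared in each section.
WORD_SCORES = {
    "hello": (900, 10, 3),
    "cheers": (10, 3, 100),
    "catch": (100, 101, 80),
}


def score_range(scores):
    """Spread between the highest and lowest section score of a word."""
    return max(scores) - min(scores)


def classify_line(line, word_scores, min_range=50):
    """Sum the score vectors of the known words and pick the top section."""
    totals = [0] * len(SECTIONS)
    for raw_word in line.lower().split():
        scores = word_scores.get(raw_word.strip(".,!?"))
        # Unknown words and words that occur everywhere carry no signal.
        if scores is None or score_range(scores) < min_range:
            continue
        for index, score in enumerate(scores):
            totals[index] += score
    if not any(totals):
        return None  # no informative word found on this line
    return SECTIONS[totals.index(max(totals))]


print(classify_line("Hello cheers for giving me your number", WORD_SCORES))
print(classify_line("Cheers, Tom", WORD_SCORES))
```

With these numbers the first line totals [910, 13, 103] and is labelled a salutation, while "catch" is ignored because its range of 21 is below the threshold.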
The first approach that comes to mind (not necessarily the best...) would be to start off by using split. Here's a little bit of code:
linearray = emailtext.split('\n')
Now you have an array of strings, each one a line of the email,
so linearray[0] would contain the salutation.
Deciding where the reply text starts is a little more tricky. I noticed that there is a double newline just before it, so maybe search for that from the back and hope that the last one indicates the start of the reply text.
Or store some signature words you might expect, like cheers, regards, and so on, and search for those from the front.
Once you figure out where the signature is, the rest is easy.
Hope this helped.
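Put together, those heuristics might look roughly like the sketch below. It is deliberately naive: it assumes the first line is the salutation, that the reply text follows the last blank line, and that the signature starts at the first line beginning with one of a few closing words. CLOSING_WORDS and split_email are names invented for this example.

```python
# Hypothetical list of words that typically open a signature line.
CLOSING_WORDS = ("cheers", "regards", "thanks")


def split_email(text):
    """Naively split an email into salutation, body, signature, reply text."""
    # Search for the double newline from the back: what follows is the reply.
    head, separator, reply = text.rpartition("\n\n")
    if not separator:  # no blank line at all, so there is no reply text
        head, reply = text, ""
    lines = head.split("\n")
    salutation, rest = lines[0], lines[1:]
    # Search from the front for the first line that starts with a closing word.
    signature_start = len(rest)
    for index, line in enumerate(rest):
        if line.strip().lower().startswith(CLOSING_WORDS):
            signature_start = index
            break
    return {
        "salutation": salutation,
        "body": "\n".join(rest[:signature_start]),
        "signature": "\n".join(rest[signature_start:]),
        "reply_text": reply,
    }


email = ("Hi Dave,\nLets meet up this Tuesday\nCheers, Tom\n\n"
         "On Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: Hey Tom,\n"
         "How about we get together ...")
print(split_email(email))
```

This will misfire on emails with blank lines inside the body or with no salutation, so treat it as a starting point only.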
I built a pretty cheap API for this actually to parse the contact data from signatures of emails and email chains. It's called SigParser. You can see the Swagger docs here for it.
Basically you send it a header 'x-api-key' with a JSON body like so and it parses all the contacts in the reply chain of an email.
{
  "subject": "Thanks for meeting...",
  "from_address": "bgates@example.com",
  "from_name": "Bill Gates",
  "htmlbody": "<div>Hi, good seeing you the other day.</div><div>--</div><div>Bill Gates</div><div>Cell 777-444-8888</div>LinkedInTwitter",
  "plainbody": "Hi, good seeing you the other day. \r\n--\r\nBill Gates\r\nCell 777-444-8888",
  "date": "Mon, 28 May 2018 23:33:40 +0000 (UTC)"
}
