wrong output in search file for word code - python

I have this code to search files for words (and not substrings) and return the lines in which the words are found:
def word_search(word, file):
pattern = re.compile(r'\b{}\b'.format(re.escape(word), flags=re.IGNORECASE))
return (item for item in file
if pattern.match(item) == -1)
But this code gives back (almost) everything. What am I doing wrong?
Thanks for your attention
This is the code:
sentences = re.split(r' *[\.\?!][\'"\)\]]* ]* *', text) # to split the file into sentences
def finding(word, file):
pattern = re.compile(r'\b{}\b'.format(re.escape(word)), flags=re.IGNORECASE)
return (item for item in file if pattern.search(item)) # your suggestion
from itertools import chain # I'm plannig on using more words, and I dont want duplicate #sentences. Thats why i use the chain + set.
chain = chain.from_iterable([finding('you', sentences), finding('us', sentences)])
plural_set = set(chain)
for sentence in plural_set:
outfile.write(sentence+'\r\n')
This gives me the result you see below.
This is the content of the testfile:
"Well, Mrs. Warren, I cannot see that you have any particular cause
for uneasiness, nor do I understand why I, whose time is of some
value, should interfere in the matter. I really have other things to
engage me." So spoke Sherlock Holmes and turned back to the great
scrapbook in which he was arranging and indexing some of his recent
material.
But the landlady had the pertinacity and also the cunning of her sex.
She held her ground firmly.
"Your arranged an affair for a lodger of mine last year," she
said--"Mr. Fairdale Hobbs."
"Ah, yes--a simple matter."
"But he would never cease talking of it--your kindness, sir, and the
way in which you brought light into the darkness. I remembered his
words when I was in doubt and darkness myself. I know you could if
you only would."
Holmes was accessible upon the side of flattery, and also, to do him
justice, upon the side of kindliness. The two forces made him lay
down his gum-brush with a sigh of resignation and push back his chair.
And what the code returns:
Warren, I cannot see that you have any particular cause for
uneasiness, nor do I understand why I, whose time is of some value,
should interfere in the matter So spoke Sherlock Holmes and turned
back to the great scrapbook in which he was arranging and indexing
some of his recent material.
But the landlady had the pertinacity and also the cunning of her sex.
She held her ground firmly.
"Your arranged an affair for a lodger of mine last year," she
said--"Mr. Fairdale Hobbs."
"Ah, yes--a simple matter."
"But he would never cease talking of it--your kindness, sir, and the
way in which you brought light into the darkness I know you could if
you only would."
Holmes was accessible upon the side of flattery, and also, to do him
justice, upon the side of kindliness

There are three errors:
When a regex fails to match, the matching function returns None, not -1.
You need to use re.search() instead of re.match() if you want to match in the entire string instead of just at the start of a string.
You need to provide the flags argument in the correct place:
So it should be something like this:
def word_search(word, file):
pattern = re.compile(r'\b{}\b'.format(re.escape(word)), flags=re.IGNORECASE)
return (item for item in file if pattern.search(item))
Let's see it in action:
>>> file = ["It's us or them.\n",
... '"Ah, yes--a simple matter."\n',
... 'Could you hold that for me?\n',
... 'Holmes was accessible upon the side of flattery, and also, to do him justice, upon the side of kindliness.\n',
... 'Trust your instincts.\n']
>>> list(word_search("us", file))
["It's us or them.\n"]
>>> list(word_search("you", file))
['Could you hold that for me?\n']

Related

Interview data in Python: Count a certain co-occurence separated by speaker

This is a social scientist asking. I think I found the first perfect use case (for me) to use Python, but I'm not there yet. I would like to analyze hundreds of formalized interview transcripts, in fact, the basis are protocols of parliamentary debates. It could also be a random interview. I would like to count certain words -- the "cheers" every party gets, ordered by parties --, but as a newcomer to Py I'm struggling to find the correct formula to define the conditions. Simply put, I'm interested in the following research question: Who is cheering whom? There are a few "if's" that I would like to be taken care of to produce the desired outcome.
The raw data is complex, but well ordered.
Each discussion is ordered by speaker, who's formally introduced by: "The floor goes to... X". Then there's the speaker's text, there are also blank lines separating the text. When the speaker is finished, the next one is introduced with the same phrase. During the speech members of certain parties "cheer" the speaker, which is indicated in the data bei "(Cheering ... PARTY NAME)." These entries can go over 1 or 2 lines. And I would like to count these "cheers".
Here's an example of the raw data (simplified):
The floor goes to member of the parliament Muller by the
Left Party.
Bla bla
bla
(Cheers Left Party, Green Party, Blue
Party )
Bla
(Cheers Nazi Party)
Bye Bla
The floor goes to member of the parliament Tsing by the Green Party.
Bla bla
bla
(Cheers Left Party, Green Party)
How do I tell Python to loop through text that follows after a certain string in a certain line (the speaker) has occured? Plus, how do I count co-occurences in a line, thus skipping empty/irrelevant characters (counting the "cheers")? Using a basic word count code I was able to count all the cheers in a file, but that's just a simple start.
count = 0
fhand = open('3_parliament.txt')
for line in fhand:
line = line.rstrip()
if not line.startswith('(Cheers '): continue
words = line.split()
count = count + 1
print(words[1])
print("There were", count, "lines in the file with >(Cheers< as the first word")
I found additional example code where "paragraphs" were defined, but this does not work here because of the empty lines, and the complex nature of the text in general. textwrapping seemed to produce similar issues. I also find it challenging to count co-occurences in a line ("Cheers ... Party X". Do you have any tipps or hints?
I'm using the latest Python 3.8 on Win 10, via Atom.

Analysing English text with some French name

I'm dealing with the Well-known novel of Victor Hugo "Les Miserables".
A part of my project is to detect the existence of each of the novel's character in a sentence and count them. This can be done easily by something like this:
def character_frequency(character_per_sentences_dict,):
characters_frequency = OrderedDict([])
for k, v in character_per_sentences_dict.items():
if len(v) != 0:
characters_frequency[k] = len(v)
return characters_frequency, characters_in_vol
This pies of could works well for all of the characters except "Èponine".
I also read the text with the following piece code:
import codecs
import nltk.tokenize
with open(path_to_volume + '.txt', 'r', encoding='latin1') as fp:
novel = ' '.join(fp.readlines())
# Tokenize sentences and calculate the number of sentences
sentences = sent_tokenize(novel)
num_volume = path_to_volume.split("-v")[-1]
I should add that the dictation of "Èponine" is the same everywhere.
Any idea what's going on ?!
Here is a sample in which this name apears:
" ONE SHOULD ALWAYS BEGIN BY ARRESTING THE VICTIMS
At nightfall, Javert had posted his men and had gone into ambush himself between the trees of the Rue de la Bar­rieredes-Gobelins which faced the Gorbeau house, on the other side of the boulevard. He had begun operations by opening his pockets, and dropping into it the two young girls who were charged with keeping a watch on the ap­proaches to the den. But he had only caged Azelma. As for Èponine, she was not at her post, she had disappeared, and he had not been able to seize her. Then Javert had made a point and had bent his ear to waiting for the signal agreed upon. The comings and goings of the fiacres had greatly agi­tated him. At last, he had grown impatient, and, sure that there was a nest there, sure of being in luck, having recog­nized many of the ruffians who had entered, he had finally decided to go upstairs without waiting for the pistol-shot."
I agree with #BoarGules that there is likely a more efficient and effective way to approach this problem. With that said, I'm not sure what your problem is here. Python is fully Unicode supportive. You can "just do it" in terms of using Unicode in your program logic using Python's standard string ops and libraries.
For example, this works:
#!/usr/bin/env python
import requests
names = [
u'Éponine',
u'Cosette'
]
# Retrieve Les Misérables from Project Gutenberg
t = requests.get("http://www.gutenberg.org/files/135/135-0.txt").text
for name in names:
c = t.count(name)
print("{}: {}".format(name, c))
Results:
Éponine: 81
Cosette: 1004
I obviously don't have the text you have, so I don't know if how it is encoded, or how it is being read is the problem. I can't test that without having it. In this code, I get the source text off the internet. My point is just that non-ASCII characters should not pose any impediment to you as long as your inputs are reasonable.
All of the time to run this is spent reading the text. I think even if you added dozens of names, it wouldn't add up to a noticeable delay on any decent computer. So this method works just fine.

Keep text clean from url

As part of Information Retrieval project in Python (building a mini search engine), I want to keep clean text from downloaded tweets (.csv data set of tweets - 27000 tweets to be exact), a tweet will look like:
"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —#POTUS https://twitter.com/OZRd5o4wRL
or
"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —#POTUS in Greece https://twitter.com/PIO9dG2qjX
I want, using regex, to remove unnecessary parts of the tweets, like URL, punctuation and etc
So the result will be:
"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"
and
"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"
tried this: pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]'), but it doesn't do a perfect job, as parts of the URL for example is still present in the result.
Please help me find a regex pattern that will do what i want.
This might help.
Demo:
import re
s1 = """"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —#POTUS in Greece https://twitter.com/PIO9dG2qjX"""
s2 = """"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —#POTUS https://twitter.com/OZRd5o4wRL"""
def cleanString(text):
res = []
for i in text.strip().split():
if not re.search(r"(https?)", i): #Removes URL..Note: Works only if http or https in string.
res.append(re.sub(r"[^A-Za-z\.]", "", i).replace(".", " ")) #Strip everything that is not alphabet(Upper or Lower)
return " ".join(map(str.strip, res))
print(cleanString(s1))
print(cleanString(s2))

Parsing txt file in python where it is hard to split by delimiter

I am new to python, and am wondering if anyone can help me with some file loading.
Situation is I have some text files and i'm trying to do sentiment analysis. Here's the text file. It is split into three category: <department>, <user>, <review>
Here are some sample data:
men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working
I want to make into this
<category> <user> <review>
I have 50k lines of these data.
I have tried to load directly into numpy, but it says its an empty separator error. I looked up stackoverflow, but i couldn't find a situation where it applies to different number of delimiters. For instance, i will never get to know how many spaces are there in the data set that i have.
My biggest problem is, how do you count the number of delimiters and give them column. Is there a way that I can make into three categories <department>, <user>, <review>. Bear in mind that the review data can contain random commas and spaces which i can't control. So the system must be smart enough to pick up!
Any ideas? Is there a way that i can tell python that after you read the user data, then everything behind falls under review?
With data like this I'd just use split() with the maxplit argument:
If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).
Example:
from StringIO import StringIO
s = StringIO("""men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working""")
for line in s:
category, user, review = line.split(None, 2)
print ("category: {} - user: {} - review: '{}'".format(category,
user,
review.strip()))
The output is:
category: men - user: peter123 - review: 'the pants are too tight for my liking!'
category: kids - user: georgel - review: 'i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it'
category: health - user: kksd1 - review: 'the health pills is drowsy by nature, please take care and do not drive after you eat the pills'
category: office - user: ty7d1 - review: 'the printer came on time, the only problem with it is with the duplex function which i suspect its not really working'
For reference:
https://docs.python.org/2/library/stdtypes.html#str.split
What about doing it sorta manually:
data = []
for line in input_data:
tmp_split = line.split(" ")
#Get the first part (dept)
dept = tmp_split[0]
#get the 2nd part
user = tmp_split[1]
#everything after is the review - put spaces inbetween each piece
review = " ".join(tmp_split[2:])
data.append([dept, user, review])

How can I parse email text for components like <salutation><body><signature><reply text> etc?

I'm writing an application that analyzes emails and it would save me a bunch of time if I could use a python library that would parse email text down into named components like <salutation><body><signature><reply text> etc.
For example, the following text "Hi Dave,\nLets meet up this Tuesday\nCheers, Tom\n\nOn Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: Hey Tom,\nHow about we get together ..." would be parsed as
Salutation: "Hi Dave,\n"
Body: "Lets meet up this Tuesday\n"
Signature: "Cheers, Tom\n\n"
Reply Text: "On Sunday, 15 May 2011 at 5:02 PM, Dave Trindal wrote: ..."
I know there's no perfect solution for this kind of problem, but even a library that does good approximation would help. Where can I find one?
https://github.com/Trindaz/EFZP
This provides functionality posed in the original question, plus fair recognition of email zones as they commonly appear in email written by native English speakers from common email clients like Outlook and Gmail.
If you score each line based on the types of words it contains you may get a fairly good indication.
E.G. A line with greeting words near the start is the salutation (also salutations may have phrases that refer to the past tense e.g. it was good to see you last time)
A Body will typically contain words such as "movie, concert" etc. It will also contain verbs (go to, run, walk, etc) and questions marks and offerings (e.g. want to, can we, should we, prefer..).
Check out http://nodebox.net/code/index.php/Linguistics#verb_conjugation
http://ogden.basic-english.org/
http://osteele.com/projects/pywordnet/
the signature will contain closing words.
If you find a datasource that has messages of the structure you want you could do some frequency analysis to see how often each word occurs in each section.
Each word would get a score [salutation score, body score, signature score,..]
e.g. hello could occur 900 times in the salutation, 10 times in the body, and 3 times in the signature.
this means hello would get assigned [900, 10, 3, ..]
cheers might get assigned [10,3,100,..]
now you will have a large list of about 500,000 words.
words that don't have a large range aren't useful.
e.g. catch might have [100,101,80..] = range of 21
(it was good to catch up, wanna go catch a fish, catch you later). catch can occur anywhere.
Now you can reduce the number of words down to about 10,000
now for each line, give the line a score also of the form [salutation score, body score, signature score,..]
this score is calculated by adding the vector scores of each word.
e.g. a sentence "hello cheers for giving me your number" could be:
[900, 10, 3, ..] + [10,3,100,..] + .. + .. + = [900+10+..,10+3+..,3+100,..]
=[1023,900,500,..] say
then because the biggest number is at the start in the salutation score position, this sentence is a salutation.
then if you had to score one of your lines to see what component the line should be in, for each word you would add on its score
Good luck, there is always a trade-off between computation complexity and accuracy. If you can find a good set of words and make a good model to base you calculations it will help.
The first approach that comes to mind (not necessarily the best...) would be to start off by using split. here's a little bit of code and stuff
linearray=emailtext.split('\n')
now you have an array of strings, each one like a paragraph or whatever
so linearray[0] would contain the salutation
deciding where the reply text starts is a little more tricky, i noticed that there is a double newline just before it so maybe do a search for that from the back and hope that the last one indicates the start of the reply text.
Or store some signature words you might expect and search for those from the front, like cheers, regards, and whatever else.
Once you figure out where the signature is the rest is the rest is easy
hope this helped
I built a pretty cheap API for this actually to parse the contact data from signatures of emails and email chains. It's called SigParser. You can see the Swagger docs here for it.
Basically you send it a header 'x-api-key' with a JSON body like so and it parses all the contacts in the reply chain of an email.
{
"subject": "Thanks for meeting...",
"from_address": "bgates#example.com",
"from_name": "Bill Gates",
"htmlbody": "<div>Hi, good seeing you the other day.</div><div>--</div><div>Bill Gates</div><div>Cell 777-444-8888</div>LinkedInTwitter",
"plainbody": "Hi, good seeing you the other day. \r\n--\r\nBill Gates\r\nCell 777-444-8888",
"date": "Mon, 28 May 2018 23:33:40 +0000 (UTC)"
}

Categories

Resources