Interview data in Python: Count a certain co-occurence separated by speaker

Interview data in Python: Count a certain co-occurence separated by speaker - python

This is a social scientist asking. I think I found the first perfect use case (for me) to use Python, but I'm not there yet. I would like to analyze hundreds of formalized interview transcripts, in fact, the basis are protocols of parliamentary debates. It could also be a random interview. I would like to count certain words -- the "cheers" every party gets, ordered by parties --, but as a newcomer to Py I'm struggling to find the correct formula to define the conditions. Simply put, I'm interested in the following research question: Who is cheering whom? There are a few "if's" that I would like to be taken care of to produce the desired outcome.
The raw data is complex, but well ordered.
Each discussion is ordered by speaker, who's formally introduced by: "The floor goes to... X". Then there's the speaker's text, there are also blank lines separating the text. When the speaker is finished, the next one is introduced with the same phrase. During the speech members of certain parties "cheer" the speaker, which is indicated in the data bei "(Cheering ... PARTY NAME)." These entries can go over 1 or 2 lines. And I would like to count these "cheers".
Here's an example of the raw data (simplified):
The floor goes to member of the parliament Muller by the
Left Party.
Bla bla
bla
(Cheers Left Party, Green Party, Blue
Party )
Bla
(Cheers Nazi Party)
Bye Bla
The floor goes to member of the parliament Tsing by the Green Party.
Bla bla
bla
(Cheers Left Party, Green Party)
How do I tell Python to loop through text that follows after a certain string in a certain line (the speaker) has occured? Plus, how do I count co-occurences in a line, thus skipping empty/irrelevant characters (counting the "cheers")? Using a basic word count code I was able to count all the cheers in a file, but that's just a simple start.
count = 0
fhand = open('3_parliament.txt')
for line in fhand:
line = line.rstrip()
if not line.startswith('(Cheers '): continue
words = line.split()
count = count + 1
print(words[1])
print("There were", count, "lines in the file with >(Cheers< as the first word")
I found additional example code where "paragraphs" were defined, but this does not work here because of the empty lines, and the complex nature of the text in general. textwrapping seemed to produce similar issues. I also find it challenging to count co-occurences in a line ("Cheers ... Party X". Do you have any tipps or hints?
I'm using the latest Python 3.8 on Win 10, via Atom.

Related

Want to extract text from a text or pdf file as different paragraphs

Check the following text piece
IN THE HIGH COURT OF GUJARAT AT AHMEDABAD
R/CRIMINAL APPEAL NO. 251 of 2009
FOR APPROVAL AND SIGNATURE:
HONOURABLE MR.JUSTICE R.P.DHOLARIA
==========================================================
1 Whether Reporters of Local Papers may be allowed to see the judgment ?
2 To be referred to the Reporter or not ?
3 Whether their Lordships wish to see the fair copy of the judgment ?
4 Whether this case involves a substantial question of law as to the interpretation of the Constitution of India or any order made thereunder ?
========================================================== STATE OF GUJARAT,S M RAO,FOOD INSPECTOR,OFFICE OF THE Versus DHARMESHBHAI NARHARIBHAI GANDHI ========================================================== Appearance: MS HB PUNANI, APP (2) for the Appellant(s) No. 1 MR DK MODI(1317) for the Opponent(s)/Respondent(s) No. 1 ==========================================================
CORAM: HONOURABLE MR.JUSTICE R.P.DHOLARIA
Date : 12/03/2019
ORAL JUDGMENT
1. The appellant State of Gujarat has
preferred the present appeal under section 378(1)
(3) of the Code of Criminal Procedure, 1973
against the judgment and order of acquittal dated
Page 1 of 12
R/CR.A/251/2009 JUDGMENT
17.11.2008 rendered by learned 2nd Additional
Civil Judge and Judicial Magistrate, First Class,
Nadiad in Food Case No.1 of 2007.
The short facts giving rise to the
present appeal are that on 10.11.2006 at about
18.00 hours, the complainant visited the place of
the respondent accused situated at Juna
Makhanpura, Rabarivad, Nadiad along with panch
witness and the respondent was found dealing in
provisional items. The complainant identified
himself as a Food Inspector and after giving
intimation in Form No.6 has purchased muddamal
sample of mustard seeds in the presence of the
panchas for the purpose of analysis. Thereafter,
the complainant Food Inspector has divided the
said sample in equal three parts and after
completing formalities of packing and sealing
obtained signatures of the vendor and panchas and
out of the said three parts, one part was sent to
the Public Analyst, Vadodara for analysis and
remaining two parts were sent to the Local Health
Authority, Gandhinagar. Thereafter, the Public
Analyst forwarded his report. In the said report,
it is stated that the muddamal sample of mustard
seeds is misbranded which is in breach of the
provisions of the Food Adulteration Act, 1954
(for short “the Act”) and the Rules framed
thereunder. It is alleged that, therefore, the
sample of mustard seeds was misbranded and,
thereby, the accused has committed the offence.
**Page 2 of 12
R/CR.A/251/2009* JUDGMENT*
Hence, the complaint came to be lodged against
the respondent accused.
I want to be able to write a program such that it follows the given constraints. Be wary of the fact that this is only a single file i have like 40k files and it should run on all the files. All the files have some difference but the basic format for every file is the same.
Constraints.
It should start the text extraction process from after the "metadata" . Metadata is the data about the file from the starting of the file i.e " In the high court of gujarat" till Oral Judgment. In all the files i have , there are various POINTS after the string ends. So i need all these points as a separate paragraph ( see the text has 2 points , i need it in different paragraphs ).
Check the lines in italics, these are the panes in the text/pdf file. I need to remove these as these donot have any meaning to the text content i want.
These files are both available in TEXT or PDF format so i can use either. But i am new to python so i dont know how and where to start. I just have basic knowledge in python.
This data is going to be made into a "corpus" for further processes in building a huge expert system so you know what needs to be done i hope.

Read the official python docs!
Start with python's basic str type and its methods. One of its methods, find, will find substrings in your text.
Use the python slicing notation to extract the portion of text you need, e.g.
text = """YOUR TEXT HERE..."""
meta_start = 'In the high court of gujarat'
meta_end = 'ORAL JUDGMENT'
pos1 = text.find(meta_start)
pos2 = text.find(meta_end)
if pos2 > pos1 and pos1 > -1:
# text is found, extract it
text1 = text[meta_start + len(meta_start):meta_end - 1]
After that, you can go ahead and save your extracted text to a database.
Of course, a better and more complicated solution would be to use regular expressions, but that's another story -- try finding the right way for yourself!
As to italics and other text formatting, you won't ever be able to mark it out in plain text (unless you have some 'meta' markers, like e.g. [i] tags).

Matching company names in the news data using Python

I have news dataset which contains almost 10,000 news over the last 3 years.
I also have a list of companies (names of companies) which are registered in NYSE. Now I want to check whether list of company names in the list have appeared in the news dataset or not.
Example:
company Name: 'E.I. du Pont de Nemours and Company'
News: 'Monsanto and DuPont settle major disputes with broad patent-licensing deal, with DuPont agreeing to pay at least $1.75 billion over 10 years for rights to technology for herbicide-resistant soybeans.'
Now, I can find the news contains company name if the exact company name is in the news but you can see from the above example it is not the case.
I also tried another way i.e. I took the integral name in the company's full name i.e. in the above example 'Pont' is a word which should be definitely a part of the text when this company name is called. So it worked for majority of the times but then problem occurs in the following example:
Company Name: Ennis, Inc.
News: L D`ennis` Kozlowski, former chief executive convicted of looting nearly $100 million from Tyco International, has emerged into far more modest life after serving six-and-a-half year sentence and probation; Kozlowski, who became ultimate symbol of corporate greed in era that included scandals at Enron and WorldCom, describes his personal transformation and more humble pleasures that have replaced his once high-flying lifestyle.
Now you can see Ennis is matching with Dennis in the text so it giving irrelevant news results.
Can someone help in telling the right way of doing this ? Thanks.

Use a regex with boundaries for exact matches whether you choose the full name or some partial part you think is unique is up to you but using word boundaries D'ennis' won't match Ennis :
companies = ["name1", "name2",...]
companies_re = re.compile(r"|".join([r"\b{}\b".format(name) for name in companies]))
Depending on how many matches per news item, you may want to use companies_re.search(artice) or companies_re.find_all(article).
Also for case insensitive matches pass re.I to compile.
If the only line you want to check is also always the one starting with company company Name: you can narrow down the search:
for line in all_lines:
if line.startswith("company Name:"):
name = companies_re.search(line)
if name:
...
break

It sounds like you need the Aho-Corasick algorithm. There is a nice and fast implementation for python here: https://pypi.python.org/pypi/pyahocorasick/
It will only do exact matching, so you would need to index both "Du pont" and "Dupont", for example. But that's not too hard, you can use the Wikidata to help you find aliases: for example, look at the aliases of Dupont's entry: it includes both "Dupont" and "Du pont".
Ok so let's assume you have the list of company names with their aliases:
import ahocorasick
A = ahocorasick.Automaton()
companies = ["google", "apple", "tesla", "dupont", "du pont"]
for idx, key in enumerate(companies):
A.add_word(key, idx)
Next, make the automaton (see the link above for details on the algorithm):
A.make_automaton()
Great! Now you can simply search for all companies in some text:
your_text = """
I love my Apple iPhone. Do you know what a Googleplex is?
I ate some apples this morning.
"""
for end_index, idx in A.iter(your_text.lower()):
print(end_index, companies[idx])
This is the output:
15 apple
49 google
74 apple
The numbers correspond to the index of the last character of the company name in the text.
Easy, right? And super fast, this algorithm is used by some variants of GNU grep.
Saving/loading the automaton
If there are a lot of company names, creating the automaton may take some time, so you may want to create it just once, save it to disk (using pickle), then load it every time you need it:
# create_company_automaton.py
# ... create the automaton (see above)
import pickle
pickle.dump(A, open('company_automaton.pickle', 'wb'))
In the program that will use this automaton, you start by loading the automaton:
# use_company_automaton.py
import ahocorasick
import pickle
A = pickle.load(open("company_automaton.pickle", "rb"))
# ... use the automaton
Hope this helps! :)
Bonus details
If you want to match "Apple" in "Apple releases a new iPhone" but not in "I ate an apple this morning", you are going to have a hard time. But it is doable: for example, you could gather a set of articles containing the word "apple" and about the company, and a set of articles not about the company, then identify words (or n-grams) that are more likely when it's about the company (e.g. "iPhone"). Unfortunately you would need to do this for every company whose name is ambiguous.

You can try
difflib.get_close_matches
with the full company name.

Parsing txt file in python where it is hard to split by delimiter

I am new to python, and am wondering if anyone can help me with some file loading.
Situation is I have some text files and i'm trying to do sentiment analysis. Here's the text file. It is split into three category: <department>, <user>, <review>
Here are some sample data:
men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working
I want to make into this
<category> <user> <review>
I have 50k lines of these data.
I have tried to load directly into numpy, but it says its an empty separator error. I looked up stackoverflow, but i couldn't find a situation where it applies to different number of delimiters. For instance, i will never get to know how many spaces are there in the data set that i have.
My biggest problem is, how do you count the number of delimiters and give them column. Is there a way that I can make into three categories <department>, <user>, <review>. Bear in mind that the review data can contain random commas and spaces which i can't control. So the system must be smart enough to pick up!
Any ideas? Is there a way that i can tell python that after you read the user data, then everything behind falls under review?

With data like this I'd just use split() with the maxplit argument:
If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).
Example:
from StringIO import StringIO
s = StringIO("""men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working""")
for line in s:
category, user, review = line.split(None, 2)
print ("category: {} - user: {} - review: '{}'".format(category,
user,
review.strip()))
The output is:
category: men - user: peter123 - review: 'the pants are too tight for my liking!'
category: kids - user: georgel - review: 'i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it'
category: health - user: kksd1 - review: 'the health pills is drowsy by nature, please take care and do not drive after you eat the pills'
category: office - user: ty7d1 - review: 'the printer came on time, the only problem with it is with the duplex function which i suspect its not really working'
For reference:
https://docs.python.org/2/library/stdtypes.html#str.split

What about doing it sorta manually:
data = []
for line in input_data:
tmp_split = line.split(" ")
#Get the first part (dept)
dept = tmp_split[0]
#get the 2nd part
user = tmp_split[1]
#everything after is the review - put spaces inbetween each piece
review = " ".join(tmp_split[2:])
data.append([dept, user, review])

wrong output in search file for word code

I have this code to search files for words (and not substrings) and return the lines in which the words are found:
def word_search(word, file):
pattern = re.compile(r'\b{}\b'.format(re.escape(word), flags=re.IGNORECASE))
return (item for item in file
if pattern.match(item) == -1)
But this code gives back (almost) everything. What am I doing wrong?
Thanks for your attention
This is the code:
sentences = re.split(r' *[\.\?!][\'"\)\]]* ]* *', text) # to split the file into sentences
def finding(word, file):
pattern = re.compile(r'\b{}\b'.format(re.escape(word)), flags=re.IGNORECASE)
return (item for item in file if pattern.search(item)) # your suggestion
from itertools import chain # I'm plannig on using more words, and I dont want duplicate #sentences. Thats why i use the chain + set.
chain = chain.from_iterable([finding('you', sentences), finding('us', sentences)])
plural_set = set(chain)
for sentence in plural_set:
outfile.write(sentence+'\r\n')
This gives me the result you see below.
This is the content of the testfile:
"Well, Mrs. Warren, I cannot see that you have any particular cause
for uneasiness, nor do I understand why I, whose time is of some
value, should interfere in the matter. I really have other things to
engage me." So spoke Sherlock Holmes and turned back to the great
scrapbook in which he was arranging and indexing some of his recent
material.
But the landlady had the pertinacity and also the cunning of her sex.
She held her ground firmly.
"Your arranged an affair for a lodger of mine last year," she
said--"Mr. Fairdale Hobbs."
"Ah, yes--a simple matter."
"But he would never cease talking of it--your kindness, sir, and the
way in which you brought light into the darkness. I remembered his
words when I was in doubt and darkness myself. I know you could if
you only would."
Holmes was accessible upon the side of flattery, and also, to do him
justice, upon the side of kindliness. The two forces made him lay
down his gum-brush with a sigh of resignation and push back his chair.
And what the code returns:
Warren, I cannot see that you have any particular cause for
uneasiness, nor do I understand why I, whose time is of some value,
should interfere in the matter So spoke Sherlock Holmes and turned
back to the great scrapbook in which he was arranging and indexing
some of his recent material.
But the landlady had the pertinacity and also the cunning of her sex.
She held her ground firmly.
"Your arranged an affair for a lodger of mine last year," she
said--"Mr. Fairdale Hobbs."
"Ah, yes--a simple matter."
"But he would never cease talking of it--your kindness, sir, and the
way in which you brought light into the darkness I know you could if
you only would."
Holmes was accessible upon the side of flattery, and also, to do him
justice, upon the side of kindliness

There are three errors:
When a regex fails to match, the matching function returns None, not -1.
You need to use re.search() instead of re.match() if you want to match in the entire string instead of just at the start of a string.
You need to provide the flags argument in the correct place:
So it should be something like this:
def word_search(word, file):
pattern = re.compile(r'\b{}\b'.format(re.escape(word)), flags=re.IGNORECASE)
return (item for item in file if pattern.search(item))
Let's see it in action:
>>> file = ["It's us or them.\n",
... '"Ah, yes--a simple matter."\n',
... 'Could you hold that for me?\n',
... 'Holmes was accessible upon the side of flattery, and also, to do him justice, upon the side of kindliness.\n',
... 'Trust your instincts.\n']
>>> list(word_search("us", file))
["It's us or them.\n"]
>>> list(word_search("you", file))
['Could you hold that for me?\n']

How can I parse email text for components like <salutation><body><signature><reply text> etc?

I'm writing an application that analyzes emails and it would save me a bunch of time if I could use a python library that would parse email text down into named components like <salutation><body><signature><reply text> etc.
For example, the following text "Hi Dave,\nLets meet up this Tuesday\nCheers, Tom\n\nOn Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: Hey Tom,\nHow about we get together ..." would be parsed as
Salutation: "Hi Dave,\n"
Body: "Lets meet up this Tuesday\n"
Signature: "Cheers, Tom\n\n"
Reply Text: "On Sunday, 15 May 2011 at 5:02 PM, Dave Trindal wrote: ..."
I know there's no perfect solution for this kind of problem, but even a library that does good approximation would help. Where can I find one?

https://github.com/Trindaz/EFZP
This provides functionality posed in the original question, plus fair recognition of email zones as they commonly appear in email written by native English speakers from common email clients like Outlook and Gmail.

If you score each line based on the types of words it contains you may get a fairly good indication.
E.G. A line with greeting words near the start is the salutation (also salutations may have phrases that refer to the past tense e.g. it was good to see you last time)
A Body will typically contain words such as "movie, concert" etc. It will also contain verbs (go to, run, walk, etc) and questions marks and offerings (e.g. want to, can we, should we, prefer..).
Check out http://nodebox.net/code/index.php/Linguistics#verb_conjugation
http://ogden.basic-english.org/
http://osteele.com/projects/pywordnet/
the signature will contain closing words.
If you find a datasource that has messages of the structure you want you could do some frequency analysis to see how often each word occurs in each section.
Each word would get a score [salutation score, body score, signature score,..]
e.g. hello could occur 900 times in the salutation, 10 times in the body, and 3 times in the signature.
this means hello would get assigned [900, 10, 3, ..]
cheers might get assigned [10,3,100,..]
now you will have a large list of about 500,000 words.
words that don't have a large range aren't useful.
e.g. catch might have [100,101,80..] = range of 21
(it was good to catch up, wanna go catch a fish, catch you later). catch can occur anywhere.
Now you can reduce the number of words down to about 10,000
now for each line, give the line a score also of the form [salutation score, body score, signature score,..]
this score is calculated by adding the vector scores of each word.
e.g. a sentence "hello cheers for giving me your number" could be:
[900, 10, 3, ..] + [10,3,100,..] + .. + .. + = [900+10+..,10+3+..,3+100,..]
=[1023,900,500,..] say
then because the biggest number is at the start in the salutation score position, this sentence is a salutation.
then if you had to score one of your lines to see what component the line should be in, for each word you would add on its score
Good luck, there is always a trade-off between computation complexity and accuracy. If you can find a good set of words and make a good model to base you calculations it will help.

The first approach that comes to mind (not necessarily the best...) would be to start off by using split. here's a little bit of code and stuff
linearray=emailtext.split('\n')
now you have an array of strings, each one like a paragraph or whatever
so linearray[0] would contain the salutation
deciding where the reply text starts is a little more tricky, i noticed that there is a double newline just before it so maybe do a search for that from the back and hope that the last one indicates the start of the reply text.
Or store some signature words you might expect and search for those from the front, like cheers, regards, and whatever else.
Once you figure out where the signature is the rest is the rest is easy
hope this helped

I built a pretty cheap API for this actually to parse the contact data from signatures of emails and email chains. It's called SigParser. You can see the Swagger docs here for it.
Basically you send it a header 'x-api-key' with a JSON body like so and it parses all the contacts in the reply chain of an email.
{
"subject": "Thanks for meeting...",
"from_address": "bgates#example.com",
"from_name": "Bill Gates",
"htmlbody": "<div>Hi, good seeing you the other day.</div><div>--</div><div>Bill Gates</div><div>Cell 777-444-8888</div>LinkedInTwitter",
"plainbody": "Hi, good seeing you the other day. \r\n--\r\nBill Gates\r\nCell 777-444-8888",
"date": "Mon, 28 May 2018 23:33:40 +0000 (UTC)"
}

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.