Extract data from within parenthesis in python

Extract data from within parenthesis in python - python

I know there are many questions with the same title. My situation is a little different. I have a string like:
"Cat(Money(8)Points(80)Friends(Online(0)Offline(8)Total(8)))Mouse(Money(10)Points(10000)Friends(Online(10)Offline(80)Total(90)))"
(Notice that there are parenthesis nested inside another)
and I need to parse it into nested dictionaries like for example:
d["Cat"]["Money"] == 8
d["Cat"]["Points"] = 80
d["Mouse"]["Friends"]["Online"] == 10
and so on. I would like to do this without libraries and regex. If you choose to use these, please explain the code in great detail.
Thanks in advance!
Edit:
Although this code will not make any sense, this is what I have so far:
o_str = "Jake(Money(8)Points(80)Friends(Online(0)Offline(8)Total(8)))Mouse(Money(10)Points(10000)Friends(Online(10)Offline(80)Total(90)))"
spl = o_str.split("(")
def reverseIndex(str1, str2):
try:
return len(str1) - str1.rindex(str2)
except Exception:
return len(str1)
def app(arr,end):
new_arr = []
for i in range(0,len(arr)):
if i < len(arr)-1:
new_arr.append(arr[i]+end)
else:
new_arr.append(arr[i])
return new_arr
spl = app(spl,"(")
ends = []
end_words = []
op = 0
cl = 0
for i in range(0,len(spl)):
print i
cl += spl[i].count(")")
op += 1
if cl == op-1:
ends.append(i)
end_words.append(spl[i])
#break
print op
print cl
print
print end_words
The end words are the sections at the beginning of each statement. I plan on using recursive to do the rest.

Now that was interesting. You really nerd-sniped me on this one...
def parse(tokens):
""" take iterator of tokens, parse to dictionary or atom """
dictionary = {}
# iterate tokens...
for token in tokens:
if token == ")" or next(tokens) == ")":
# token is ')' -> end of dict; next is ')' -> 'leaf'
break
# add sub-parse to dictionary
dictionary[token] = parse(tokens)
# return dict, if non-empty, else token
return dictionary or int(token)
Setup and demo:
>>> s = "Cat(Money(8)Points(80)Friends(Online(0)Offline(8)Total(8)))Mouse(Money(10)Points(10000)Friends(Online(10)Offline(80)Total(90)))"
>>> tokens = iter(s.replace("(", " ( ").replace(")", " ) ").split())
>>> pprint(parse(tokens))
{'Cat': {'Friends': {'Offline': 8, 'Online': 0, 'Total': 8},
'Money': 8,
'Points': 80},
'Mouse': {'Friends': {'Offline': 80, 'Online': 10, 'Total': 90},
'Money': 10,
'Points': 10000}}
Alternatively, you could also use a series of string replacements to turn that string into an actual Python dictionary string and then evaluate that, e.g. like this:
as_dict = eval("{'" + s.replace(")", "'}, ")
.replace("(", "': {'")
.replace(", ", ", '")
.replace(", ''", "")[:-3] + "}")
This will wrap the 'leafs' in singleton sets of strings, e.g. {'8'} instead of 8, but this should be easy to fix in a post-processing step.

Related

Generate strings using translations of several characters, mapped to several others

I'm facing quite a tricky problem in my python code. I looked around and was not able to find anyone with a similar problem.
I'd like to generate strings translating some characters into several, different ones.
I'd like that original characters, meant to be replaced (translated), to be replaced by several different ones.
What I'm looking to do is something like this :
text = "hi there"
translations = {"i":["b", "c"], "r":["e","f"]}
result = magicfunctionHere(text,translations)
print(result)
> [
"hb there",
"hc there",
"hi theee",
"hi thefe",
"hb theee",
"hb thefe",
"hc theee",
"hc thefe"
]
The result contains any combination of the original text with 'i' and 'r' replaced respectively by 'b' and 'c', and 'e' and 'f'.
I don't see how to do that, using itertools and functions like permutations, product etc...
I hope I'm clear enough, it is quite a specific problem !
Thank you for your help !

def magicfunction(ret, text, alphabet_location, translations):
if len(alphabet_location) == 0:
ret.append(text)
return ret
index = alphabet_location.pop()
for w in translations[text[index]]:
ret = magicfunction(ret, text[:index] + w + text[index + 1:], alphabet_location, translations)
alphabet_location.append(index)
return ret
def magicfunctionHere(text, translations):
alphabet_location = []
for key in translations.keys():
alphabet_location.append(text.find(key))
translations[key].append(key)
ret = []
ret = magicfunction(ret, text, alphabet_location, translations)
ret.pop()
return ret
text = "hi there"
translations = {"i":["b", "c"], "r":["e","f"]}
result = magicfunctionHere(text,translations)
print(result)

One crude way to go would be to use a Nested Loop Constructin 2 steps (Functions) as depicted in the Snippet below:
def rearrange_characters(str_text, dict_translations):
tmp_result = []
for key, value in dict_translations.items():
if key in str_text:
for replacer in value:
str_temp = str_text.replace(key, replacer, 1)
if str_temp not in tmp_result:
tmp_result.append(str_temp)
return tmp_result
def get_rearranged_characters(str_text, dict_translations):
lst_result = rearrange_characters(str_text, dict_translations)
str_joined = ','.join(lst_result)
for str_part in lst_result:
str_joined = "{},{}".format(str_joined, ','.join(rearrange_characters(str_part, dict_translations)))
return set(str_joined.split(sep=","))
text = "hi there"
translations = {"i": ["b", "c"], "r":["e","f"]}
result = get_rearranged_characters(text, translations)
print(result)
## YIELDS: {
'hb theee',
'hc thefe',
'hc there',
'hi thefe',
'hb thefe',
'hi theee',
'hc theee',
'hb there'
}
See also: https://eval.in/960803
Another equally convoluted approach would be to use a single function with nested loops like so:
def process_char_replacement(str_text, dict_translations):
tmp_result = []
for key, value in dict_translations.items():
if key in str_text:
for replacer in value:
str_temp = str_text.replace(key, replacer, 1)
if str_temp not in tmp_result:
tmp_result.append(str_temp)
str_joined = ','.join(tmp_result)
for str_part in tmp_result:
tmp_result_2 = []
for key, value in dict_translations.items():
if key in str_part:
for replacer in value:
str_temp = str_part.replace(key, replacer, 1)
if str_temp not in tmp_result_2:
tmp_result_2.append(str_temp)
str_joined = "{},{}".format(str_joined, ','.join(tmp_result_2))
return set(str_joined.split(sep=","))
text = "hi there"
translations = {"i": ["b", "c"], "r":["e","f"]}
result = process_char_replacement(text, translations)
print(result)
## YIELDS: {
'hb theee',
'hc thefe',
'hc there',
'hi thefe',
'hb thefe',
'hi theee',
'hc theee',
'hb there'
}
Refer to: https://eval.in/961602

Python how convert single quotes to double quotes to format as json string

I have a file where on each line I have text like this (representing cast of a film):
[{'cast_id': 23, 'character': "Roger 'Verbal' Kint", 'credit_id': '52fe4260c3a36847f8019af7', 'gender': 2, 'id': 1979, 'name': 'Kevin Spacey', 'order': 5, 'profile_path': '/x7wF050iuCASefLLG75s2uDPFUu.jpg'}, {'cast_id': 27, 'character': 'Edie's Finneran', 'credit_id': '52fe4260c3a36847f8019b07', 'gender': 1, 'id': 2179, 'name': 'Suzy Amis', 'order': 6, 'profile_path': '/b1pjkncyLuBtMUmqD1MztD2SG80.jpg'}]
I need to convert it in a valid json string, thus converting only the necessary single quotes to double quotes (e.g. the single quotes around word Verbal must not be converted, eventual apostrophes in the text also should not be converted).
I am using python 3.x. I need to find a regular expression which will convert only the right single quotes to double quotes, thus the whole text resulting in a valid json string. Any idea?

First of all, the line you gave as example is not parsable! … 'Edie's Finneran' … contains a syntax error, not matter what.
Assuming that you have control over the input, you could simply use eval() to read in the file. (Although, in that case one would wonder why you can't produce valid JSON in the first place…)
>>> f = open('list.txt', 'r')
>>> s = f.read().strip()
>>> l = eval(s)
>>> import pprint
>>> pprint.pprint(l)
[{'cast_id': 23,
'character': "Roger 'Verbal' Kint",
...
'profile_path': '/b1pjkncyLuBtMUmqD1MztD2SG80.jpg'}]
>>> import json
>>> json.dumps(l)
'[{"cast_id": 23, "character": "Roger \'Verbal\' Kint", "credit_id": "52fe4260ca36847f8019af7", "gender": 2, "id": 1979, "name": "Kevin Spacey", "order": 5, "rofile_path": "/x7wF050iuCASefLLG75s2uDPFUu.jpg"}, {"cast_id": 27, "character":"Edie\'s Finneran", "credit_id": "52fe4260c3a36847f8019b07", "gender": 1, "id":2179, "name": "Suzy Amis", "order": 6, "profile_path": "/b1pjkncyLuBtMUmqD1MztDSG80.jpg"}]'
If you don't have control over the input, this is very dangerous, as it opens you up to code injection attacks.
I cannot emphasize enough that the best solution would be to produce valid JSON in the first place.

If you do not have control over the JSON data, do not eval() it!
I created a simple JSON correction mechanism, as that is more secure:
def correctSingleQuoteJSON(s):
rstr = ""
escaped = False
for c in s:
if c == "'" and not escaped:
c = '"' # replace single with double quote
elif c == "'" and escaped:
rstr = rstr[:-1] # remove escape character before single quotes
elif c == '"':
c = '\\' + c # escape existing double quotes
escaped = (c == "\\") # check for an escape character
rstr += c # append the correct json
return rstr
You can use the function in the following way:
import json
singleQuoteJson = "[{'cast_id': 23, 'character': 'Roger \\'Verbal\\' Kint', 'credit_id': '52fe4260c3a36847f8019af7', 'gender': 2, 'id': 1979, 'name': 'Kevin Spacey', 'order': 5, 'profile_path': '/x7wF050iuCASefLLG75s2uDPFUu.jpg'}, {'cast_id': 27, 'character': 'Edie\\'s Finneran', 'credit_id': '52fe4260c3a36847f8019b07', 'gender': 1, 'id': 2179, 'name': 'Suzy Amis', 'order': 6, 'profile_path': '/b1pjkncyLuBtMUmqD1MztD2SG80.jpg'}]"
correctJson = correctSingleQuoteJSON(singleQuoteJson)
print(json.loads(correctJson))

Here is the code to get desired output
import ast
def getJson(filepath):
fr = open(filepath, 'r')
lines = []
for line in fr.readlines():
line_split = line.split(",")
set_line_split = []
for i in line_split:
i_split = i.split(":")
i_set_split = []
for split_i in i_split:
set_split_i = ""
rev = ""
i = 0
for ch in split_i:
if ch in ['\"','\'']:
set_split_i += ch
i += 1
break
else:
set_split_i += ch
i += 1
i_rev = (split_i[i:])[::-1]
state = False
for ch in i_rev:
if ch in ['\"','\''] and state == False:
rev += ch
state = True
elif ch in ['\"','\''] and state == True:
rev += ch+"\\"
else:
rev += ch
i_rev = rev[::-1]
set_split_i += i_rev
i_set_split.append(set_split_i)
set_line_split.append(":".join(i_set_split))
line_modified = ",".join(set_line_split)
lines.append(ast.literal_eval(str(line_modified)))
return lines
lines = getJson('test.txt')
for i in lines:
print(i)

Apart from eval() (mentioned in user3850's answer), you can use ast.literal_eval
This has been discussed in the thread: Using python's eval() vs. ast.literal_eval()?
You can also look at the following discussion threads from Kaggle competition which has data similar to the one mentioned by OP:
https://www.kaggle.com/c/tmdb-box-office-prediction/discussion/89313#latest-517927
https://www.kaggle.com/c/tmdb-box-office-prediction/discussion/80045#latest-518338

Preferential parsing of sentence elements using infixnotation in Pyparsing

I have some sentences that I need to convert to regex code and I was trying to use Pyparsing for it. The sentences are basically search rules, telling us what to search for.
Examples of sentences -
LINE_CONTAINS this is a phrase
-this is an example search rule telling that the line you are searching on should have the phrase this is a phrase
LINE_STARTSWITH However we - this is an example search rule telling that the line you are searching on should start with the phrase However we
The rules can be combined too, like- LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH However we
Now, I am trying to parse these sentences and then convert them to regex code. All lines start with either of the 2 symbols mentioned above (call them line_directives). I want to be able to consider these line_directives, and parse them appropriately and do the same for the phrase that follow them, albeit differently parsed. Using help from Paul McGuire(here)and my own inputs, I have the following code-
from pyparsing import *
import re
UPTO, AND, OR, WORDS = map(Literal, "upto AND OR words".split())
keyword = UPTO | WORDS | AND | OR
LBRACE,RBRACE = map(Suppress, "{}")
integer = pyparsing_common.integer()
LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(Literal,
"""LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH""".split()) # put option for LINE_ENDSWITH. Users may use, I don't presently
BEFORE, AFTER, JOIN = map(Literal, "BEFORE AFTER JOIN".split())
word = ~keyword + Word(alphas)
phrase = Group(OneOrMore(word))
upto_expr = Group(LBRACE + UPTO + integer("numberofwords") + WORDS + RBRACE)
class Node(object):
def __init__(self, tokens):
self.tokens = tokens
def generate(self):
pass
class LiteralNode(Node):
def generate(self):
print (self.tokens[0], 20)
for el in self.tokens[0]:
print (el,type(el), 19)
print (type(self.tokens[0]), 18)
return "(%s)" %(' '.join(self.tokens[0])) # here, merged the elements, so that re.escape does not have to do an escape for the entire list
def __repr__(self):
return repr(self.tokens[0])
class AndNode(Node):
def generate(self):
tokens = self.tokens[0]
return '.*'.join(t.generate() for t in tokens[::2]) # change this to the correct form of AND in regex
def __repr__(self):
return ' AND '.join(repr(t) for t in self.tokens[0].asList()[::2])
class OrNode(Node):
def generate(self):
tokens = self.tokens[0]
return '|'.join(t.generate() for t in tokens[::2])
def __repr__(self):
return ' OR '.join(repr(t) for t in self.tokens[0].asList()[::2])
class UpToNode(Node):
def generate(self):
tokens = self.tokens[0]
ret = tokens[0].generate()
print (123123)
word_re = r"\s+\S+"
space_re = r"\s+"
for op, operand in zip(tokens[1::2], tokens[2::2]):
# op contains the parsed "upto" expression
ret += "((%s){0,%d}%s)" % (word_re, op.numberofwords, space_re) + operand.generate()
print ret
return ret
def __repr__(self):
tokens = self.tokens[0]
ret = repr(tokens[0])
for op, operand in zip(tokens[1::2], tokens[2::2]):
# op contains the parsed "upto" expression
ret += " {0-%d WORDS} " % (op.numberofwords) + repr(operand)
return ret
phrase_expr = infixNotation(phrase,
[
((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT,), # (opExpr, numTerms, rightLeftAssoc, parseAction)
(AND, 2, opAssoc.LEFT,),
(OR, 2, opAssoc.LEFT),
],
lpar=Suppress('{'), rpar=Suppress('}')
) # structure of a single phrase with its operators
line_term = Group((LINE_CONTAINS | LINE_STARTSWITH | LINE_ENDSWITH)("line_directive") +
Group(phrase_expr)("phrase")) # basically giving structure to a single sub-rule having line-term and phrase
line_contents_expr = infixNotation(line_term,
[(AND, 2, opAssoc.LEFT,),
(OR, 2, opAssoc.LEFT),
]
) # grammar for the entire rule/sentence
phrase_expr = infixNotation(line_contents_expr.setParseAction(LiteralNode),
[
(upto_expr, 2, opAssoc.LEFT, UpToNode),
(AND, 2, opAssoc.LEFT, AndNode),
(OR, 2, opAssoc.LEFT, OrNode),
])
tests1 = """LINE_CONTAINS overexpressing gene AND other things""".splitlines()
for t in tests1:
t = t.strip()
if not t:
continue
# print(t, 12)
try:
parsed = phrase_expr.parseString(t)
except ParseException as pe:
print(' '*pe.loc + '^')
print(pe)
continue
print (parsed[0], 14)
print (type(parsed[0]))
print(parsed[0].generate(), 15)
This simple code, on running gives the following error-
((['LINE_CONTAINS', ([(['overexpressing', 'gene'], {})], {})],
{'phrase': [(([(['overexpressing', 'gene'], {})], {}), 1)],
'line_directive': [('LINE_CONTAINS', 0)]}), 14)
((['LINE_CONTAINS', ([(['overexpressing', 'gene'], {})], {})],
{'phrase': [(([(['overexpressing', 'gene'], {})], {}), 1)],
'line_directive': [('LINE_CONTAINS', 0)]}), 20)
('LINE_CONTAINS', <, 19)
(([(['overexpressing', 'gene'], {})], {}), , 19)
(, 18)
TypeError: sequence item 1: expected string, ParseResults found (line
29)
(The error code is not completely correct, as angular brackets are not well supported in blockquote here)
So my question is, even though I have written the grammar (using infixnotation) such that it treats LINE_CONTAINS as a line_directive and parse the remaining the line accordingly, why is it not able to parse properly? What is a good way to parse such lines?

How to replace text in curly brackets with another text based on comparisons using Python Regex

I am quiet new to regular expressions. I have a string that looks like this:
str = "abc/def/([default], [testing])"
and a dictionary
dict = {'abc/def/[default]' : '2.7', 'abc/def/[testing]' : '2.1'}
and using Python RE, I want str in this form, after comparisons of each element in dict to str:
str = "abc/def/(2.7, 2.1)"
Any help how to do it using Python RE?
P.S. its not the part of any assignment, instead it is the part of my project at work and I have spent many hours to figure out solution but in vain.

import re
st = "abc/def/([default], [testing], [something])"
dic = {'abc/def/[default]' : '2.7',
'abc/def/[testing]' : '2.1',
'bcd/xed/[something]' : '3.1'}
prefix_regex = "^[\w*/]*"
tag_regex = "\[\w*\]"
prefix = re.findall(prefix_regex, st)[0]
tags = re.findall(tag_regex, st)
for key in dic:
key_prefix = re.findall(prefix_regex, key)[0]
key_tag = re.findall(tag_regex, key)[0]
if prefix == key_prefix:
for tag in tags:
if tag == key_tag:
st = st.replace(tag, dic[key])
print st
OUTPUT:
abc/def/(2.7, 2.1, [something])

Here is a solution using re module.
Hypotheses :
there is a dictionary whose keys are composed of a prefix and a variable part, the variable part is enclosed in brackets ([])
the values are strings by which the variable parts are to be replaced in the string
the string is composed by a prefix, a (, a list of variable parts and a )
the variable parts in the string are enclosed in []
the variable parts in the string are separated by a comma followed by optional spaces
Python code :
import re
class splitter:
pref = re.compile("[^(]+")
iden = re.compile("\[[^]]*\]")
def __init__(self, d):
self.d = d
def split(self, s):
m = self.pref.match(s)
if m is not None:
p = m.group(0)
elts = self.iden.findall(s, m.span()[1])
return p, elts
return None
def convert(self, s):
p, elts = self.split(s)
return p + "(" + ", ".join((self.d[p + elt] for elt in elts)) + ")"
Usage :
s = "abc/def/([default], [testing])"
d = {'abc/def/[default]' : '2.7', 'abc/def/[testing]' : '2.1'}
sp = splitter(d)
print(sp.convert(s))
output :
abc/def/(2.7, 2.1)

Regex is probably not required here. Hope this helps
lhs,rhs = str.split("/(")
rhs1,rhs2 = rhs.strip(")").split(", ")
lhs+="/"
print "{0}({1},{2})".format(lhs,dict[lhs+rhs1],dict[lhs+rhs2])
output
abc/def/(2.7,2.1)

How to sum up the word count for each person in a dialogue?

I'm starting to learn Python and I'm trying to write a program that would import a text file, count the total number of words, count the number of words in a specific paragraph (said by each participant, described by 'P1', 'P2' etc.), exclude these words (i.e. 'P1' etc.) from my word count, and print paragraphs separately.
Thanks to #James Hurford I got this code:
words = None
with open('data.txt') as f:
words = f.read().split()
total_words = len(words)
print 'Total words:', total_words
in_para = False
para_type = None
paragraph = list()
for word in words:
if ('P1' in word or
'P2' in word or
'P3' in word ):
if in_para == False:
in_para = True
para_type = word
else:
print 'Words in paragraph', para_type, ':', len(paragraph)
print ' '.join(paragraph)
del paragraph[:]
para_type = word
else:
paragraph.append(word)
else:
if in_para == True:
print 'Words in last paragraph', para_type, ':', len(paragraph)
print ' '.join(paragraph)
else:
print 'No words'
My text file looks like this:
P1: Bla bla bla.
P2: Bla bla bla bla.
P1: Bla bla.
P3: Bla.
The next part I need to do is summing up the words for each participant. I can only print them, but I don't know how to return/reuse them.
I would need a new variable with word count for each participant that I could manipulate later on, in addition to summing up all the words said by each participant, e.g.
P1all = sum of words in paragraph
Is there a way to count "you're" or "it's" etc. as two words?
Any ideas how to solve it?

I would need a new variable with word count for each participant that I could manipulate later on
No, you would need a Counter (Python 2.7+, else use a defaultdict(int)) mapping persons to word counts.
from collections import Counter
#from collections import defaultdict
words_per_person = Counter()
#words_per_person = defaultdict(int)
for ln in inputfile:
person, text = ln.split(':', 1)
words_per_person[person] += len(text.split())
Now words_per_person['P1'] contains the number of words of P1, assuming text.split() is a good enough tokenizer for your purposes. (Linguists disagree about the definition of word, so you're always going to get an approximation.)

Congrats on beginning your adventure with Python! Not everything in this post might make sense right now but bookmark it and comeback to it if it seems helpful later. Eventually you should try to move from scripting to software engineering, and here are a few ideas for you!
With great power comes great responsibility, and as a Python developer you need to be more disciplined than other languages which don't hold your hand and enforce "good" design.
I find it helps to start with a top-down design.
def main():
text = get_text()
p_text = process_text(text)
catalogue = process_catalogue(p_text)
BOOM! You just wrote the whole program -- now you just need to back and fill in the blanks! When you do it like this, it seems less intimidating. Personally, I don't consider myself smart enough to solve very big problems, but I'm a pro at solving small problems. So lets tackle one thing at a time. I'm going to start with 'process_text'.
def process_text(text):
b_text = bundle_dialogue_items(text)
f_text = filter_dialogue_items(b_text)
c_text = clean_dialogue_items(f_text)
I'm not really sure what those things mean yet, but I know that text problems tend to follow a pattern called "map/reduce" which means you perform and operation on something and then you clean it up and combine, so I put in some placeholder functions. I might go back and add more if necessary.
Now let's write 'process_catalogue'. I could've written "process_dict" but that sounded lame to me.
def process_catalogue(p_text):
speakers = make_catalogue(c_text)
s_speakers = sum_words_per_paragraph_items(speakers)
t_speakers = total_word_count(s_speakers)
Cool. Not too bad. You might approach this different than me, but I thought it would make sense to aggregate the items, the count the words per paragraph, and then count all the words.
So, at this point I'd probably make one or two little 'lib' (library) modules to back-fill the remaining functions. For the sake you being able to run this without worrying about imports, I'm going to stick it all in one .py file, but eventually you'll learn how to break these up so it looks nicer. So let's do this.
# ------------------ #
# == process_text == #
# ------------------ #
def bundle_dialogue_items(lines):
cur_speaker = None
paragraphs = Counter()
for line in lines:
if re.match(p, line):
cur_speaker, dialogue = line.split(':')
paragraphs[cur_speaker] += 1
else:
dialogue = line
res = cur_speaker, dialogue, paragraphs[cur_speaker]
yield res
def filter_dialogue_items(lines):
for name, dialogue, paragraph in lines:
if dialogue:
res = name, dialogue, paragraph
yield res
def clean_dialogue_items(flines):
for name, dialogue, paragraph in flines:
s_dialogue = dialogue.strip().split()
c_dialouge = [clean_word(w) for w in s_dialogue]
res = name, c_dialouge, paragraph
yield res
aaaand a little helper function
# ------------------- #
# == aux functions == #
# ------------------- #
to_clean = string.whitespace + string.punctuation
def clean_word(word):
res = ''.join(c for c in word if c not in to_clean)
return res
So it may not be obvious but this library is designed as a data processing pipeline. There several ways to process data, one is pipeline processing and another is batch processing. Let's take a look at batch processing.
# ----------------------- #
# == process_catalogue == #
# ----------------------- #
speaker_stats = 'stats'
def make_catalogue(names_with_dialogue):
speakers = {}
for name, dialogue, paragraph in names_with_dialogue:
speaker = speakers.setdefault(name, {})
stats = speaker.setdefault(speaker_stats, {})
stats.setdefault(paragraph, []).extend(dialogue)
return speakers
word_count = 'word_count'
def sum_words_per_paragraph_items(speakers):
for speaker in speakers:
word_stats = speakers[speaker][speaker_stats]
speakers[speaker][word_count] = Counter()
for paragraph in word_stats:
speakers[speaker][word_count][paragraph] += len(word_stats[paragraph])
return speakers
total = 'total'
def total_word_count(speakers):
for speaker in speakers:
wc = speakers[speaker][word_count]
speakers[speaker][total] = 0
for c in wc:
speakers[speaker][total] += wc[c]
return speakers
All these nested dictionaries are getting a little complicated. In actual production code I would replace these with some more readable classes (along with adding tests and docstrings!!), but I don't want to make this more confusing than it already is! Alright, for your convenience below is the whole thing put together.
import pprint
import re
import string
from collections import Counter
p = re.compile(r'(\w+?):')
def get_text_line_items(text):
for line in text.split('\n'):
yield line
def bundle_dialogue_items(lines):
cur_speaker = None
paragraphs = Counter()
for line in lines:
if re.match(p, line):
cur_speaker, dialogue = line.split(':')
paragraphs[cur_speaker] += 1
else:
dialogue = line
res = cur_speaker, dialogue, paragraphs[cur_speaker]
yield res
def filter_dialogue_items(lines):
for name, dialogue, paragraph in lines:
if dialogue:
res = name, dialogue, paragraph
yield res
to_clean = string.whitespace + string.punctuation
def clean_word(word):
res = ''.join(c for c in word if c not in to_clean)
return res
def clean_dialogue_items(flines):
for name, dialogue, paragraph in flines:
s_dialogue = dialogue.strip().split()
c_dialouge = [clean_word(w) for w in s_dialogue]
res = name, c_dialouge, paragraph
yield res
speaker_stats = 'stats'
def make_catalogue(names_with_dialogue):
speakers = {}
for name, dialogue, paragraph in names_with_dialogue:
speaker = speakers.setdefault(name, {})
stats = speaker.setdefault(speaker_stats, {})
stats.setdefault(paragraph, []).extend(dialogue)
return speakers
def clean_dict(speakers):
for speaker in speakers:
stats = speakers[speaker][speaker_stats]
for paragraph in stats:
stats[paragraph] = [''.join(c for c in word if c not in to_clean)
for word in stats[paragraph]]
return speakers
word_count = 'word_count'
def sum_words_per_paragraph_items(speakers):
for speaker in speakers:
word_stats = speakers[speaker][speaker_stats]
speakers[speaker][word_count] = Counter()
for paragraph in word_stats:
speakers[speaker][word_count][paragraph] += len(word_stats[paragraph])
return speakers
total = 'total'
def total_word_count(speakers):
for speaker in speakers:
wc = speakers[speaker][word_count]
speakers[speaker][total] = 0
for c in wc:
speakers[speaker][total] += wc[c]
return speakers
def get_text():
text = '''BOB: blah blah blah blah
blah hello goodbye etc.
JERRY:.............................................
...............
BOB:blah blah blah
blah blah blah
blah.
BOB: boopy doopy doop
P1: Bla bla bla.
P2: Bla bla bla bla.
P1: Bla bla.
P3: Bla.'''
text = get_text_line_items(text)
return text
def process_catalogue(c_text):
speakers = make_catalogue(c_text)
s_speakers = sum_words_per_paragraph_items(speakers)
t_speakers = total_word_count(s_speakers)
return t_speakers
def process_text(text):
b_text = bundle_dialogue_items(text)
f_text = filter_dialogue_items(b_text)
c_text = clean_dialogue_items(f_text)
return c_text
def main():
text = get_text()
c_text = process_text(text)
t_speakers = process_catalogue(c_text)
# take a look at your hard work!
pprint.pprint(t_speakers)
if __name__ == '__main__':
main()
So this script is almost certainly overkill for this application, but the point is to see what (questionably) readable, maintainable, modular Python code might look like.
Pretty sure output looks something like:
{'BOB': {'stats': {1: ['blah',
'blah',
'blah',
'blah',
'blah',
'hello',
'goodbye',
'etc'],
2: ['blah',
'blah',
'blah',
'blah',
'blah',
'blah',
'blah'],
3: ['boopy', 'doopy', 'doop']},
'total': 18,
'word_count': Counter({1: 8, 2: 7, 3: 3})},
'JERRY': {'stats': {1: ['', '']}, 'total': 2, 'word_count': Counter({1: 2})},
'P1': {'stats': {1: ['Bla', 'bla', 'bla'], 2: ['Bla', 'bla']},
'total': 5,
'word_count': Counter({1: 3, 2: 2})},
'P2': {'stats': {1: ['Bla', 'bla', 'bla', 'bla']},
'total': 4,
'word_count': Counter({1: 4})},
'P3': {'stats': {1: ['Bla']}, 'total': 1, 'word_count': Counter({1: 1})}}

You can do this with two variables. One to keep track of what person is speaking, the other to keep the paragraphs for the persons speaking. For storing the paragraphs and associating who it is that the paragraph belongs to use a dict with the person as the key and a list of paragraphs that person said associated with this key.
para_dict = dict()
para_type = None
for word in words:
if ('P1' in word or
'P2' in word or
'P3' in word ):
#extract the part we want leaving off the ':'
para_type = word[:2]
#create a dict with a list of lists
#to contain each paragraph the person uses
if para_type not in para_dict:
para_dict[para_type] = list()
para_dict[para_type].append(list())
else:
#Append the word to the last list in the list of lists
para_dict[para_type][-1].append(word)
From here you can sum up the number of words spoken thus
for person, para_list in para_dict.items():
counts_list = list()
for para in para_list:
counts_list.append(len(para))
print person, 'spoke', sum(counts_list), 'words'

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extract data from within parenthesis in python - python

Related

Generate strings using translations of several characters, mapped to several others

Python how convert single quotes to double quotes to format as json string

Preferential parsing of sentence elements using infixnotation in Pyparsing

How to replace text in curly brackets with another text based on comparisons using Python Regex

How to sum up the word count for each person in a dialogue?

Categories

Resources