Rearrange Dialogue Documents to Dataframe - python

I plan to analyze some conference documents, and before the analysis I need to rearrange these documents into a data frame. The format I expect is that for each row of the data, the first value is the speaker and the second value is that speaker's utterance. For instance, ["Jo", "I just had tacos."]. The sample document can be reached here. Below is my progress so far:
file = open('the document', 'r')
Name = []
sentence = []
for line in file:
    if line.find("Column") != -1:
        continue
    # find() returns -1 (which is truthy) when the substring is absent, so compare explicitly
    if line.find("Section") != -1 or line.find("Index") != -1 or line.find("Home Page") != -1:
        continue
    if line.find(':') != -1:
        tokens = line.split(":")
        Name.append(tokens[0])
    else:
        sentence.append(line + " ")
My first question is how I can combine the speaker and the utterance into one list and then move on to the next speaker. My second question is whether there is a better way to get rid of the content before Oral Answers to Questions and after The House divided: Ayes 240, Noes 329.Division No. 54][9.59 pm.
I appreciate any help.

Here, I have come up with a simple solution. It handles three cases:
When there is an empty line
When the line ends with :
Otherwise
Here is the code:
import re
from collections import defaultdict

def clean_speaker(sp):
    sp = re.sub(r"(\(\w+\))", "", sp)  # remove single words within parentheses
    sp = re.sub(r"(\d+\.?)", "", sp)   # remove digits such as 1. or 2.
    return sp.strip()

document = []
with open('the document', 'r') as fin:
    foundSpeaker = False
    dialogue = defaultdict(str)
    for line in fin.readlines():
        line = line.strip()  # remove white-spaces
        # ----- when line is empty -----
        if not line:
            dialogue = defaultdict(str)
            foundSpeaker = False
        # ----- when line ends with : -----
        elif line[-1] == ":":
            if dialogue:
                document.append(dialogue)
                dialogue = defaultdict(str)
            foundSpeaker = True
            dialogue["Speaker"] = clean_speaker(line[:-1])
        # ----- otherwise -----
        else:
            if foundSpeaker:
                dialogue["Sentence"] += " " + line
            else:
                if dialogue:
                    document.append(dialogue)
                    dialogue = defaultdict(str)
                foundSpeaker = False
                continue
Now the variable document has all the dialogue in the given file... it's a list of dictionaries where each dictionary has just two keys (Speaker and Sentence). So we can see what's inside document like so:
for d in document:
    for key, value in d.items():
        print(key + ":", value)
Or you can do something smarter and convert that list into a pandas.DataFrame, then write that DataFrame to CSV like so:
import pandas as pd
pd.DataFrame.from_dict(document).to_csv('document.csv')
Now, open document.csv and you will find everything in order... I hope this helps you
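As for the second question (dropping everything before Oral Answers to Questions and after the division line), the same flag idea works. A minimal sketch, assuming both marker strings appear verbatim in the file (adjust them to your document):

region = []
in_region = False
with open('the document', 'r') as fin:
    for line in fin:
        if "Oral Answers to Questions" in line:
            in_region = True   # start keeping lines from here on
        elif "The House divided" in line:
            break              # the division line marks the end; drop the rest
        elif in_region:
            region.append(line)

You can then run the dialogue-building loop above over region instead of over the raw file.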

Related

Extract name of function and arguments in separate text

I am trying to extract words only from the body of the function. Below you can see my text.
# Example of estimation
## Example of estimation
### Example of estimation
"Some calculation"
""" Note :
The data here is artificial.
Idea is to show how code will look like after estimation.
More information www.google.com
"""
#iterate_jit(nopython=True)
def fun_min_ssc(min_wage, tax_rate,calc_min_profit):
    calc_min_profit = min_wage * tax_rate + min_wage - (min_wage*2)
    return calc_min_profit
Lines starting with #, ##, ###, ", or """ are not needed.
Now I want to extract only the name and the arguments from the function, such as:
Name of the function: fun_min_ssc
Arguments of the function: min_wage, tax_rate, calc_min_profit
I tried to solve this problem with the function below :
f = open("text.txt", "w+")
f.write('''# Example of estimation
## Example of estimation
### Example of estimation
"Some calculation"
""" Note :
The data here is artificial.
Idea is to show how code will look like after estimation.
More information www.google.com
"""
#iterate_jit(nopython=True)
def cal_min_ssc(min_wage, tax_rate,min_profit):
    min_profit = min_wage * tax_rate + min_wage - (min_wage*2)
    return min_profit
''')
f.seek(0)  # rewind to the start, otherwise readlines() returns nothing after writing
for line in f.readlines():
    print(line, end='')
f.close()
import os

os.getcwd()
os.listdir()
os.chdir('C:/')  # <-- insert your path
file_reader = open('C:/text.txt')  # <-- insert your path
os.getcwd()
# Open the file in read mode
text = open("text.txt", "r")

# Create a dictionary to count word frequency
d = dict()

# Loop through each line of the file
for line in text:
    # Remove the leading spaces and newline character
    line = line.strip()
    # Convert the characters in line to
    # lowercase to avoid case mismatch
    line = line.lower()
    # Split the line into words
    # (note: each split replaces the previous result, so only the
    # split on "*" actually takes effect here)
    words = line.split(" ")
    words = line.split(",")
    words = line.split("*")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in the dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to the dictionary with count 1
            d[word] = 1

# Print the contents of the dictionary
for key in list(d.keys()):
    print(key, ":", d[key])
So can anybody help me solve this problem, or suggest some other approach?
This might get you on the right track. I have used a regex as a specific search criterion to find the lines that start with def and end with :.
x = re.search(r"^def.*:$", line)
Once I have the line in question, I split the line using def and the opening bracket of the function (. This allows me to easily grab the function name.
values = x[0].split('def ')[1].split('(')
function_name = values[0]
I then have to grab the other section, but remove the last two characters, i.e. ):
arguments = values[1][:-2].split(', ')
As the arguments are separated by a comma, I can then use that as a split separator. However, I must warn you: make sure they are consistently separated in the same way, i.e. with or without a space after the comma.
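If the spacing after the comma does vary, a small tweak (a sketch; unnecessary if the source is uniform) is to split on a regex instead of the fixed ', ':

import re

# a comma followed by any amount of whitespace
arguments = re.split(r',\s*', values[1][:-2])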
I have printed the desired output, however, you can add these items to a list or whatever structure you desire:
Here is my example code (without all the file input stuff):
import re

text = '''# Example of estimation
## Example of estimation
### Example of estimation
"Some calculation"
""" Note :
The data here is artificial.
Idea is to show how code will look like after estimation.
More information www.google.com
"""
#iterate_jit(nopython=True)
def cal_min_ssc(min_wage, tax_rate, min_profit):
    min_profit = min_wage * tax_rate + min_wage - (min_wage*2)
    return min_profit
'''

lines = text.split('\n')
for line in lines:
    x = re.search(r"^def.*:$", line)
    if x is not None:
        values = x[0].split('def ')[1].split('(')
        function_name = values[0]
        arguments = values[1][:-2].split(', ')
        print(f"Function Name: {function_name}")
        print(f"Arguments: {arguments}")
OUTPUT:
Function Name: cal_min_ssc
Arguments: ['min_wage', 'tax_rate', 'min_profit']
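Since the sample text happens to be valid Python (the unwanted lines are comments and bare string literals), another option is the standard-library ast module. A sketch, assuming the input always parses as Python; text is the same string as in the example above:

import ast

tree = ast.parse(text)  # raises SyntaxError if the input isn't valid Python
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        print(f"Function Name: {node.name}")
        print(f"Arguments: {[a.arg for a in node.args.args]}")

This also copes with inconsistent spacing or arguments split across lines, since it doesn't depend on the text layout.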

My code is missing some of the lines I'm trying to get out of a file

The basic task is to write a function, get_words_from_file(filename), that returns a list of lower case words that are within the region of interest. They share with you a regular expression: "[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", that finds all words that meet this definition. My code works well on some of the tests but fails when the line that indicates the region of interest is repeated.
Here is my code:
import re

def get_words_from_file(filename):
    """Returns a list of lower case words that are within the region of
    interest, every word in the text file, but not any of the punctuation."""
    with open(filename, 'r', encoding='utf-8') as file:
        flag = False
        words = []
        count = 0
        for line in file:
            if line.startswith("*** START OF"):
                while count < 1:
                    flag = True
                    count += 1
            elif line.startswith("*** END"):
                flag = False
                break
            elif flag:
                new_line = line.lower()
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+",
                                           new_line)
                words.extend(words_on_line)
        return words
#test code:
filename = "bee.txt"
words = get_words_from_file(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(words)))
print("Valid word list:")
for word in words:
    print(word)
The issue is that the string "*** START OF" is repeated inside the region of interest, and the words on that line aren't being included.
The test code should result in:
bee.txt loaded ok.↩
16 valid words found.↩
Valid word list:↩
yes↩
really↩
this↩
time↩
start↩
of↩
synthetic↩
test↩
case↩
end↩
synthetic↩
test↩
case↩
i'm↩
in↩
too
But I'm getting:
bee.txt loaded ok.↩
11 valid words found.↩
Valid word list:↩
yes↩
really↩
this↩
time↩
end↩
synthetic↩
test↩
case↩
i'm↩
in↩
too
Any help would be great!
Attached is a screenshot of the file
The specific problem with your code is the if .. elif .. elif statement: you're ignoring all lines that look like the line that signals the start or end of a block, even when they appear inside the test block.
You wanted something like this for your function:
import re

def get_words_from_file(filename):
    """Returns a list of lower case words that are within the region of
    interest, every word in the text file, but not any of the punctuation."""
    with open(filename, 'r', encoding='utf-8') as file:
        in_block = False
        words = []
        for line in file:
            if not in_block and line == "*** START OF A SYNTHETIC TEST CASE ***\n":
                in_block = True
            elif in_block and line == "*** END TEST CASE ***\n":
                break
            elif in_block:
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line.lower())
                words.extend(words_on_line)
        return words
This assumes you are looking for the whole line as a marker, but of course you can still use .startswith() if you accept that as the start or end of the block, as long as it's sufficiently unambiguous.
Your idea of using a flag is fine, although naming a flag after whatever it represents is always a good idea.
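Since the screenshot isn't reproduced here, below is a hypothetical miniature of bee.txt, reconstructed from the question's expected output (treat the exact contents as an assumption). Running the fixed logic over it shows the repeated marker line being kept:

import io
import re

# Hypothetical stand-in for bee.txt -- the real file may differ.
sample = (
    "Ignore this preamble.\n"
    "*** START OF A SYNTHETIC TEST CASE ***\n"
    "Yes, really this time.\n"
    "*** START OF SYNTHETIC TEST CASE ***\n"
    "*** END SYNTHETIC TEST CASE ***\n"
    "I'm in too.\n"
    "*** END TEST CASE ***\n"
    "Ignore this trailer.\n"
)

in_block = False
words = []
for line in io.StringIO(sample):
    if not in_block and line == "*** START OF A SYNTHETIC TEST CASE ***\n":
        in_block = True
    elif in_block and line == "*** END TEST CASE ***\n":
        break
    elif in_block:
        words.extend(re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line.lower()))

print(len(words), words)  # 16 words, matching the expected output above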

Extract data between two lines from text file

Say I have hundreds of text files like this example :
NAME
John Doe
DATE OF BIRTH
1992-02-16
BIO
THIS is
a PRETTY
long sentence
without ANY structure
HOBBIES
//..etc..
NAME, DATE OF BIRTH, BIO, and HOBBIES (and others) are always there, but text content and the number of lines between them can sometimes change.
I want to iterate through the file and store the string between each of these keys. For example, a variable called Name should contain the value stored between 'NAME' and 'DATE OF BIRTH'.
This is what I came up with:
lines = f.readlines()
for line_number, line in enumerate(lines):
    if "NAME" in line:
        name = lines[line_number + 1]  # In all files, Name is one line long.
    elif "DATE OF BIRTH" in line:
        date = lines[line_number + 2]  # Date is also always two lines after
    elif "BIO" in line:
        for x in range(line_number + 1, line_number + 20):  # Length of other data can be randomly bigger
            if "HOBBIES" not in lines[x]:
                bio += lines[x]
            else:
                break
    elif "HOBBIES" in line:
        # ...
This works well enough, but I feel like instead of using many double loops, there must be a smarter and less hacky way to do it.
I'm looking for a general solution where NAME would store everything until DATE OF BIRTH, and BIO would store everything until HOBBIES, etc., with the intention of cleaning up and removing extra white lines later.
Is it possible?
Edit: While I was reading through the answers, I realized I forgot a really significant detail: the keys will sometimes be repeated (in the same order).
That is, a single text file can contain more than one person. A list of persons should be created. The key NAME signals the start of a new person.
I did it by storing everything in a dictionary; see the code below.
f = open("test.txt")
lines = f.readlines()
dict_text = {"NAME":[], "DATEOFBIRTH":[], "BIO":[]}
for line_number, line in enumerate(lines):
if not ("NAME" in line or "DATE OF BIRTH" in line or "BIO" in line):
text = line.replace("\n","")
dict_text[location].append(text)
else:
location = "".join((line.split()))
You could use a regular expression:
import re

keys = """
NAME
DATE OF BIRTH
BIO
HOBBIES
""".strip().splitlines()

key_pattern = '|'.join(f'{key.strip()}' for key in keys)
pattern = re.compile(fr'^({key_pattern})', re.M)
# uncomment to see the pattern
# print(pattern)

with open(filename) as f:
    text = f.read()

parts = pattern.split(text)

... process parts ...
parts will be a list of strings. The odd indexed positions (parts[1], parts[3], ...) will be the keys ('NAME', etc) and the even indexed positions (parts[2], parts[4], ...) will be the text in between the keys. parts[0] will be whatever was before the first key.
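As a sketch of the '... process parts ...' step, here is one way to group the pieces, using the rule from the edit that a repeated NAME starts a new person (this assumes the file begins with a NAME key):

people = []
person = None
# parts[1::2] are the keys, parts[2::2] the text chunks after each key
for key, value in zip(parts[1::2], parts[2::2]):
    if key == 'NAME':
        person = {}  # a repeated NAME starts a new person
        people.append(person)
    person[key] = value.strip()

print(people)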
Instead of reading lines you could read the whole file in as one long string. Use string.index() to find the start index of your trigger words, then take everything from that index up to the next trigger word's index as a variable.
Something like:
def get_data(string, previous_end_index, current_start_index):
    usable_data = string[previous_end_index:current_start_index]
    return usable_data

string = f.read()  # read the whole file as one string (str(f) would not do this)
important_words = ['NAME', 'DATE OF BIRTH']
last_phrase = None
for phrase in important_words:
    phrase_start = string.index(phrase)
    phrase_end = phrase_start + len(phrase)
    if last_phrase is not None:
        get_data(string, last_phrase, phrase_start)
    last_phrase = phrase_end
Better/shorter variable names should probably be used
You can just read the text in as one long string and then make use of .split(). This will only work if the categories are in order and don't repeat. Like so:
Categories = ["NAME", "DATE OF BIRTH", "BIO"]  # in the order they appear in the text
Output = {}
Text = f.read()
for i in range(1, len(Categories)):
    SplitText = Text.split(Categories[i])
    Output.update({Categories[i-1]: SplitText[0]})
    Text = SplitText[1]
Output.update({Categories[-1]: Text})
You can try the following.
keys = ["NAME", "DATE OF BIRTH", "BIO", "HOBBIES"]
f = open("data.txt", "r")
result = {}
for line in f:
    line = line.strip('\n')
    if any(v in line for v in keys):
        last_key = line
    else:
        result[last_key] = result.get(last_key, "") + line
print(result)
Output
{'NAME': 'John Doe', 'DATE OF BIRTH': '1992-02-16', 'BIO ': 'THIS is a PRETTY long sentence without ANY structure ', 'HOBBIES ': '//..etc..'}
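The trailing spaces in the keys ('BIO ', 'HOBBIES ') and values come from stripping only '\n'. A small variant of the same idea strips all surrounding whitespace and joins each section's lines with single spaces (assuming, as in the sample, that the file starts with one of the keys):

keys = ["NAME", "DATE OF BIRTH", "BIO", "HOBBIES"]
result = {}
with open("data.txt") as f:
    for line in f:
        line = line.strip()
        if line in keys:
            last_key = line
            result.setdefault(last_key, [])
        elif line:
            result[last_key].append(line)

# collapse each section into one space-joined string
result = {k: " ".join(v) for k, v in result.items()}
print(result)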

Linear search to find spelling errors in Python

I'm working on learning Python with Program Arcade Games and I've gotten stuck on one of the labs.
I'm supposed to compare each word of a text file (http://programarcadegames.com/python_examples/en/AliceInWonderLand200.txt) to find if it is not in the dictionary file (http://programarcadegames.com/python_examples/en/dictionary.txt) and then print it out if it is not. I am supposed to use a linear search for this.
The problem is even words I know are not in the dictionary file aren't being printed out. Any help would be appreciated.
My code is as follows:
# Imports regular expressions
import re

# This function takes a line of text and returns
# a list of words in the line
def split_line(line):
    split = re.findall('[A-Za-z]+(?:\'\"[A-Za-z]+)?', line)
    return split

# Opens the dictionary text file and adds each line to an array, then closes the file
dictionary = open("dictionary.txt")
dict_array = []
for item in dictionary:
    dict_array.append(split_line(item))
print(dict_array)
dictionary.close()

print("---Linear Search---")

# Opens the text for the first chapter of Alice in Wonderland
chapter_1 = open("AliceInWonderland200.txt")

# Breaks down the text by line
for each_line in chapter_1:
    # Breaks down each line to a single word
    words = split_line(each_line)
    # Checks each word against the dictionary array
    for each_word in words:
        i = 0
        # Continues as long as there are more words in the dictionary and no match
        while i < len(dict_array) and each_word.upper() != dict_array[i]:
            i += 1
        # if no match was found print the word being checked
        if not i <= len(dict_array):
            print(each_word)

# Closes the first chapter file
chapter_1.close()
Something like this should do (pseudo code)
sampleDict = {}
For each word in AliceInWonderLand200.txt:
    sampleDict[word] = True

actualWords = {}
For each word in dictionary.txt:
    actualWords[word] = True

For each word in sampleDict:
    if not (word in actualWords):
        # Oh no! word isn't in the dictionary
A set may be more appropriate than a dict, since the value of the dictionary in the sample isn't important. This should get you going, though
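For completeness, a minimal sketch of the set version (the word regex here is a simplified assumption, not the book's split_line(); the file names are the ones from the question):

import re

def words_in(filename):
    with open(filename) as f:
        # upper-case everything so case differences don't matter
        return set(re.findall(r"[A-Za-z']+", f.read().upper()))

dictionary_words = words_in("dictionary.txt")
sample_words = words_in("AliceInWonderland200.txt")

# set difference: words in the sample that the dictionary doesn't contain
for word in sorted(sample_words - dictionary_words):
    print(word)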

Help parsing text file in python

I've really been struggling with this one for some time now. I have many text files with a specific format, from which I need to extract all the data and file it into different fields of a database. The struggle is tweaking the parameters for parsing, ensuring I get all the info correctly.
the format is shown below:
WHITESPACE HERE of unknown length.
K PA DETAILS
2 4565434 i need this sentace as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.
The tough parts were 1) getting rid of whitespace, and 2) separating the fields from each other. See my best attempt below:
dict = {}
XX = (open("XX.txt", "r")).readlines()
for line in XX:
    if line.isspace():
        pass
    elif line.startswith('There is'):
        pass
    elif line.startswith('Max', 2):
        pass
    elif line.startswith('K'):
        pass
    else:
        for word in line.split():
            if word.startswith('4'):
                tmp_PA = word
            elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
                tmp_K = word
            else:
                tmp_DETAILS = word
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''', (tmp_PA, tmp_K, tmp_DETAILS))
At the minute, I can pull the K & PA fields no problem using this; however, my DETAILS is only pulling one word. I need the entire sentence, or at least 25 characters of it.
Thanks very much for reading and I hope you can help! :)
You are splitting the whole line into words. You need to split into first word, second word and the rest. Like line.split(None, 2).
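For example, using one of the sample lines from the question (k, pa and details are hypothetical names for the three pieces):

line = "2 4565434 i need this sentace as one DB record"
k, pa, details = line.split(None, 2)
print(k, pa, details)  # 2 4565434 i need this sentace as one DB record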
I would probably use regular expressions, and the opposite logic: if the line starts with a number 1 through 5, use it, otherwise pass. Like:
pattern = re.compile(r'([12345])\s+(\d+)\s+(.*\S)')
f = open('XX.txt', 'r')  # no calling readlines; lazy iteration is better
for line in f:
    m = pattern.match(line)
    if m:
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (m.group(2), m.group(1), m.group(3)))
Oh, and of course, you should be using prepared statements. Parsing SQL is orders of magnitude slower than executing it.
If I understand your file format correctly, you can try this script:
filename = 'bug.txt'
f = open(filename, 'r')
foundHeaders = False
records = []
for rawline in f:
    line = rawline.strip()
    if not foundHeaders:
        tokens = line.split()
        if tokens == ['K', 'PA', 'DETAILS']:
            foundHeaders = True
        continue
    else:
        tokens = line.split(None, 2)
        if len(tokens) != 3:
            break
        try:
            K = int(tokens[0])
            PA = int(tokens[1])
        except ValueError:
            break
        records.append((K, PA, tokens[2]))
f.close()
for r in records:
    print(r)  # replace this by your DB insertion code
This will start reading the records when it encounters the header line, and stop as soon as the format of the line is no longer (K,PA,description).
Hope this helps.
Here is my attempt using re
import re
stuff = open("source", "r").readlines()
whitey = re.compile(r"^[\s]+$")
header = re.compile(r"K PA DETAILS")
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
if whitey.match(line):
pass
elif header.match(line):
pass
elif juicy_info.match(line):
result = juicy_info.search(line)
print result.group('third')
print result.group('second')
print result.group('first')
Using re I can pull the data out and manipulate it on a whim. If you only need the juicy info lines, you can actually take out all the other checks, making this a REALLY concise script.
import re

stuff = open("source", "r").readlines()

# create a regular expression using subpatterns.
# 'first', 'second' and 'third' are our own tags;
# we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")

for line in stuff:
    result = juicy_info.search(line)
    if result:  # do stuff with data here, just use the tags we declared earlier
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))
import re

reg = re.compile('K[ \t]+PA[ \t]+DETAILS[ \t]*\r?\n'
                 + 3*'([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*\r?\n')
with open('XX.txt') as f:
    mat = reg.search(f.read())
for tripl in ((2,1,3), (5,4,6), (8,7,9)):
    cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
               mat.group(*tripl))
I prefer to use [ \t] instead of \s because \s matches the following characters: blank, '\f', '\n', '\r', '\t', '\v', and I don't see any reason to use a symbol representing more than what is to be matched, with the risk of matching erratic newlines in places where they shouldn't be.
Edit
It may be sufficient to do:
import re

reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$', re.MULTILINE)
with open('XX.txt') as f:
    for mat in reg.finditer(f.read()):
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   mat.group(2, 1, 3))
