recreate multi-token strings from tokens given indices and text source


I'm preparing a script that reconstitutes multi-token strings from a tokenized text for tokens that have specific labels. My tokens are associated with their start and end indices in the original text.
This is an example piece of text:
t = "Breakfast at Tiffany's is a novella by Truman Capote."
The tokens data structure containing the original text indices and labels:
[(['Breakfast', 0, 9], 'BOOK'),
(['at', 10, 12], 'BOOK'),
(['Tiffany', 13, 20], 'BOOK'),
(["'", 20, 21], 'BOOK'),
(['s', 21, 22], 'BOOK'),
(['is', 23, 25], 'O'),
(['a', 26, 27], 'O'),
(['novella', 28, 35], 'O'),
(['by', 36, 38], 'O'),
(['Truman', 39, 45], 'PER'),
(['Capote', 46, 52], 'PER'),
(['.', 52, 53], 'O')]
This data structure was generated from t as follows:
import re
tokens = [[m.group(0), m.start(), m.end()] for m in re.finditer(r"\w+|[^\w\s]", t, re.UNICODE)]
tags = ['BOOK', 'BOOK', 'BOOK', 'BOOK', 'BOOK', 'O', 'O', 'O', 'O', 'PER', 'PER', 'O']
token_tuples = list(zip(tokens, tags))
What I would like my script to do is to iterate through token_tuples and if it encounters a non-O token, it breaks off from the main iteration and reconstitutes the tagged multi-token span until it hits the nearest token with O.
This is the current script:
for i in range(len(token_tuples)):
    if token_tuples[i][1] != 'O':
        tag = token_tuples[i][1]
        start_ix = token_tuples[i][0][1]
        slider = i+1
        while slider < len(token_tuples):
            if tag != token_tuples[slider][1]:
                end_ix = token_tuples[slider][0][2]
                print((t[start_ix:end_ix], tag))
                break
            else:
                slider+=1
This prints:
("Breakfast at Tiffany's is", 'BOOK')
("at Tiffany's is", 'BOOK')
("Tiffany's is", 'BOOK')
("'s is", 'BOOK')
('s is', 'BOOK')
('Truman Capote.', 'PER')
('Capote.', 'PER')
What needs to be modified so that the output for this example is:
> ("Breakfast at Tiffany's", "BOOK")
> ("Truman Capote", "PER")

Here's one solution. If you can come up with something less long-winded, I'd be happy to choose your answer instead!
def extract_entities(t, token_tuples):
    entities = []
    tag = ''
    for i in range(len(token_tuples)):
        if token_tuples[i][1] != 'O':
            # A new span starts whenever the tag differs from the one being tracked.
            if token_tuples[i][1] != tag:
                tag = token_tuples[i][1]
                start_ix = token_tuples[i][0][1]
            # The span ends at the last token, or when the next token has a different tag.
            if i+1 == len(token_tuples) or tag != token_tuples[i+1][1]:
                end_ix = token_tuples[i][0][2]
                entities.append((t[start_ix:end_ix], tag))
                tag = ''
    return entities
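A possibly shorter variant, offered as a sketch rather than a definitive answer: itertools.groupby collapses consecutive tokens that share a tag, so each run of non-O tokens becomes one span. The function name extract_entities_grouped is made up for illustration; the sample data is the one from the question.

```python
import re
from itertools import groupby

def extract_entities_grouped(text, token_tuples):
    """Return (span_text, tag) for each run of consecutive tokens sharing a non-O tag."""
    entities = []
    # groupby yields one group per run of adjacent tuples with the same tag
    for tag, group in groupby(token_tuples, key=lambda pair: pair[1]):
        if tag == 'O':
            continue
        group = list(group)
        start_ix = group[0][0][1]   # start index of the first token in the run
        end_ix = group[-1][0][2]    # end index of the last token in the run
        entities.append((text[start_ix:end_ix], tag))
    return entities

t = "Breakfast at Tiffany's is a novella by Truman Capote."
tokens = [[m.group(0), m.start(), m.end()] for m in re.finditer(r"\w+|[^\w\s]", t)]
tags = ['BOOK', 'BOOK', 'BOOK', 'BOOK', 'BOOK', 'O', 'O', 'O', 'O', 'PER', 'PER', 'O']
token_tuples = list(zip(tokens, tags))

entities = extract_entities_grouped(t, token_tuples)
print(entities)
# [("Breakfast at Tiffany's", 'BOOK'), ('Truman Capote', 'PER')]
```

Because the slice is taken from the original text, whitespace and punctuation inside the span come back exactly as written. A span that ends on the very last token is also handled, since no look-ahead is needed.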

Related

Instead of replacing a part of a string with the value associated with the key 22, it just replaces it with the value associated with key 2, 2 times

I am converting information about a delivery from a file given to me. It contains lines like this:
name, item_number, item_number, item_number
for example
Joe, 2, 22, 10, 17
The issue is that whenever I try to replace a number in the line with the value associated with the matching key (the key is identical to the item_number from the file), I get, for example, the value associated with 2 twice instead of the value associated with 22.
import sys
f = open("testFil.txt", "r")
list_of_lists = []
items = {
1: "Cigaretter",
2: "Snus",
3: "Kaffe",
4: "Te",
5: "Solbriller",
6: "Mørk Chokolade",
7: "Kiks",
8: "Harebo Mix",
9: "Salt Chips",
10: "Pepper Chips",
11: "Sour Cream Chips",
12: "Oreo",
13: "Ritter Sport",
14: "Chokolade Kiks",
15: "Mælk",
16: "Sukker",
17: "Brød",
18: "Kuglepen",
19: "Juice",
20: "Avis",
21: "Toilet Papir",
22: "Tandbørste",
23: "Kondomer",
24: "Tandpasta",
25: "Køkkenrulle"}
with open("testFil.txt", "r") as f:
    for line in f:
        line = line.replace("\n","")
        for i in items:
            line = line.replace(str(i), items[i])
        list_of_lists.append(line.split(", "))
for i in list_of_lists:
    for j in i:
        if i.count(j) > 1:
            i[i.index(j)] = str(i.count(j)) + "x " + j
            for k in range(i.count(j)):
                i.remove(j)
customer_count = -1
def last_customer():
    print("This is the last order")
    print(list_of_lists[next_customer()])
def luk_programmet():
    sys.exit()
def next_customer():
    global customer_count
    customer_count += 1
    return customer_count
def print_customer():
    a = input("")
    if a == "Next Order":
        if customer_count == len(list_of_lists) - 2:
            print("This is the last order")
            print(list_of_lists[next_customer()])
            luk_programmet()
        else:
            try:
                print(list_of_lists[next_customer()])
            except IndexError:
                print("This is the last order")
                luk_programmet()
    elif a == "/close the program":
        luk_programmet()
    else:
        print("You typed it wrong.")
        print_customer()
#Prints the customer list
print("Write 'Next Order' to recieve the next order")
for i in range(len(list_of_lists)):
    if print_customer() == "Done":
        sys.exit()
print_customer()
The txt file:
Joe, 1, 2, 1, 8
Micky, 19, 19, 15, 13
Berta, 4, 3, 3, 3
Frede, 24, 22, 8, 2
per, 1, 9, 18, 24
I have tried making the key into a string, in the hopes that would help, but that didn't work out.

The core problem is that you're performing string replacement, which, as you've observed, replaces every occurrence of "2" in the line, including the digits inside "22".
To illustrate:
>>> "222".replace("2", "something")
'somethingsomethingsomething'
So, what's a better approach?
Given the following file contents as a source:
Joe, 1, 2, 1, 8
Micky, 19, 19, 15, 13
Berta, 4, 3, 3, 3, 3
Frede, 24, 22, 8, 2
per, 1, 9, 18, 24, 12, 22, 1, 23
We can use Python's built-in csv library to parse the file contents, and do some funky stuff with csv.DictReader (items is the dictionary from the question):
import csv
with open("testFil.txt") as f:
    reader = csv.DictReader(f, fieldnames=["person"], restkey="items", skipinitialspace=True)
    for row in reader:
        item_names = []
        for item in row["items"]:
            item_name = items[int(item)]
            item_names.append(item_name)
        row["item_names"] = item_names
        print(row)
A few things to highlight:
We tell the DictReader that there's only one named column (the first one) with the fieldnames argument.
The restkey argument is the key under which the remaining, un-named columns are stored (conveniently as a list of values): when a row has more fields than fieldnames, DictReader collects the extra values into a list under that key.
We use skipinitialspace because the values in the input have a space after each comma.
We can then look up the exact value rather than doing a replacement.
I have chosen to build up the item names as a separate list, which I add back on to the row.
The output from the above code is this:
{'person': 'Joe', 'items': ['1', '2', '1', '8'], 'item_names': ['Cigaretter', 'Snus', 'Cigaretter', 'Harebo Mix']}
{'person': 'Micky', 'items': ['19', '19', '15', '13'], 'item_names': ['Juice', 'Juice', 'Mælk', 'Ritter Sport']}
{'person': 'Berta', 'items': ['4', '3', '3', '3', '3'], 'item_names': ['Te', 'Kaffe', 'Kaffe', 'Kaffe', 'Kaffe']}
{'person': 'Frede', 'items': ['24', '22', '8', '2'], 'item_names': ['Tandpasta', 'Tandbørste', 'Harebo Mix', 'Snus']}
{'person': 'per', 'items': ['1', '9', '18', '24', '12', '22', '1', '23'], 'item_names': ['Cigaretter', 'Salt Chips', 'Kuglepen', 'Tandpasta', 'Oreo', 'Tandbørste', 'Cigaretter', 'Kondomer']}
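The question's script also collapses repeated items into entries such as "2x Snus". As a rough sketch of how that step could sit on top of an exact lookup, collections.Counter can tally the looked-up names. The function name summarise_orders, the shortened items dictionary and the in-memory sample file are assumptions made for illustration only.

```python
import csv
import io
from collections import Counter

# Shortened version of the items dictionary from the question
items = {1: "Cigaretter", 2: "Snus", 8: "Harebo Mix", 22: "Tandbørste", 24: "Tandpasta"}

# Stand-in for testFil.txt so the example runs on its own
sample = io.StringIO("Joe, 1, 2, 1, 8\nFrede, 24, 22, 8, 2\n")

def summarise_orders(handle, items):
    """Return one list per customer: the name followed by item names, repeats shown as 'Nx name'."""
    orders = []
    for row in csv.reader(handle, skipinitialspace=True):
        name, codes = row[0], row[1:]
        # Exact dictionary lookup per field, so '22' can never be mistaken for two '2's
        counts = Counter(items[int(code)] for code in codes)
        summary = [f"{n}x {item}" if n > 1 else item for item, n in counts.items()]
        orders.append([name] + summary)
    return orders

orders = summarise_orders(sample, items)
for order in orders:
    print(order)
# ['Joe', '2x Cigaretter', 'Snus', 'Harebo Mix']
# ['Frede', 'Tandpasta', 'Tandbørste', 'Harebo Mix', 'Snus']
```

Counter keeps the order in which items were first seen (Python 3.7+), so each customer's list stays in file order.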

Row comparison and append loop by columns

I have a bunch of school data that I maintain on a master list of monthly test scores. Every time a child takes a test and there is a change in 'Age', 'Score' or 'School', I want to insert a new row with the updated data so that all changes are tracked. I am trying to write a Python script to do this, but since I am a newbie I keep running into issues.
I tried writing a loop but keep getting errors, including "False", "The truth value of a Series is ambiguous" and "tuple indices must be integers, not str".
master_df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D'],
'Age':[15,14,17,13],
'School':['AB', 'CD', 'EF', 'GH'],
'Score':[80, 75, 62, 100],
'Date': ['3/1/2019', '3/1/2019', '3/1/2019', '3/1/2019']})
updates_df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D'],
'Age':[16,14,17,13],
'School':['AB', 'ZX', 'EF', 'GH'],
'Score':[80, 90, 62, 100],
'Date': ['4/1/2019', '4/1/2019', '4/1/2019', '4/1/2019']})
# What I am trying to get is:
updated_master = pd.DataFrame({'ID': ['A', 'A', 'B', 'B', 'C','D'],
'Age':[15,16,14,14,17,13],
'School':['AB', 'AB', 'CD', 'ZX', 'EF', 'GH'],
'Score':[80, 80, 75, 90, 62, 100],
'Date': ['3/1/2019', '4/1/2019', '3/1/2019', '4/1/2019', '3/1/2019', '3/1/2019']})
temp_delta_list = []
m_score = master_df.iloc[1:, master_df.columns.get_loc('Score')]
m_age = master_df.iloc[1:, master_df.columns.get_loc('Age')]
m_school = master_df.iloc[1:, master_df.columns.get_loc('School')]
u_score = updates_df.iloc[1:, updates_df.columns.get_loc('Score')]
u_age = updates_df.iloc[1:, updates_df.columns.get_loc('Age')]
u_school = updates_df.iloc[1:, updates_df.columns.get_loc('School')]
for i in updates_df['ID'].values:
    updated_temp_score = updates_df[updates_df['ID'] == i], u_score
    updated_temp_age = updates_df[updates_df['ID'] == i], u_age
    updated_temp_school = updates_df[updates_df['ID'] == i], u_school
    master_temp_score = master_df[master_df['ID'] == i], m_score
    master_temp_age = master_df[master_df['ID'] == i], m_age
    master_temp_school = updates_df[master_df['ID'] == i], m_school
    if (updated_temp_score == master_temp_score) | (updated_temp_age == master_temp_age) | (updated_temp_school == master_temp_school):
        pass
    else:
        temp_deltas = updates_df[(updates_df['ID'] == i)]
        temp_delta_list.append(temp_deltas)
I ultimately want the loop to compare the row values for each ID, return the rows that differ in any column, and append those rows to master_df.
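One possible way to get the described result without an explicit loop, sketched under the assumption that master_df holds exactly one row per ID at the time of comparison (with a longer history you would first reduce it to the latest row per ID). The names compare_cols, paired, changed and new_rows are made up for this example.

```python
import pandas as pd

master_df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D'],
                          'Age': [15, 14, 17, 13],
                          'School': ['AB', 'CD', 'EF', 'GH'],
                          'Score': [80, 75, 62, 100],
                          'Date': ['3/1/2019'] * 4})
updates_df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D'],
                           'Age': [16, 14, 17, 13],
                           'School': ['AB', 'ZX', 'EF', 'GH'],
                           'Score': [80, 90, 62, 100],
                           'Date': ['4/1/2019'] * 4})

compare_cols = ['Age', 'School', 'Score']

# Put each update next to the master values for the same ID
paired = updates_df.merge(master_df[['ID'] + compare_cols],
                          on='ID', how='left', suffixes=('', '_old'))

# A row counts as changed if any compared column differs from the master value
changed = pd.Series(False, index=paired.index)
for col in compare_cols:
    changed |= paired[col] != paired[col + '_old']

# Keep only the changed update rows and append them to the master list
new_rows = updates_df[changed.to_numpy()]
updated_master = (pd.concat([master_df, new_rows], ignore_index=True)
                  .sort_values('ID', kind='mergesort')  # stable sort keeps old row before new
                  .reset_index(drop=True))
print(updated_master)
```

Comparing whole columns at once avoids the "truth value of a Series is ambiguous" error, which typically appears when a Series comparison is used inside a plain if statement.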

Can I use the .format feature when using screen.blit in Pygame?

Hi, I recently couldn't figure out how to blit lists onto my pygame screen at specific x and y values, for example shifting each string in my list over by 20 units (x += 20) every time it blits. I recently used .format when printing to the console; is there a similar feature for screen.blit so I can use the same formatting in a pygame window? My code is included below. Thanks in advance :D
import pygame
NAMES = ['Deanerys T.', 'Jon S.', 'Gregor C.', 'Khal D.', 'Cersei L.', 'Jamie L.',
'Tyrion L.', 'Sansa S.', 'Ayra S.', 'Ned S.']
DATE_OF_BIRTH = ['6/10/1996', '6/12/1984', '3/12/1980', '8/4/1986', '7/2/1970',
'7/2/1975', '12/24/1980', '11/30/1993', '5/18/1999', '6/27/1984']
AGE = [22, 34, 38, 32, 48, 43, 38, 25, 19, 34]
MARITAL_STATUS = ['Not Married', 'Not Married', 'Not Married', 'Not Married',
'Married', 'Not Married', 'Married', 'Married', 'Not Married', 'Married']
NUM_OF_CHILDREN = [3, 0, 0, 4, 2, 0, 1, 1, 0, 5]
for i in range(10):
    print("{:>12} was born {:>10}, is age {:>2}. They have {:>1} children, and are {:>4}".format(NAMES[i], DATE_OF_BIRTH[i], AGE[i], NUM_OF_CHILDREN[i], MARITAL_STATUS[i]))
print("\n")
for i in range(10):
    print("NAME: {:>12} DATE OF BIRTH: {}".format(NAMES[i], DATE_OF_BIRTH[i], ))
print("\n")
for i in range(10):
    print("NAME: {:>12} AGE: {}".format(NAMES[i], AGE[i], ))
print("\n")
for i in range(10):
    print("NAME: {:>12} MARRIAGE STATUS: {}".format(NAMES[i], MARITAL_STATUS[i], ))
print("\n")
for i in range(10):
    print("NAME: {:>12} NUMBER OF CHILDREN: {}".format(NAMES[i], NUM_OF_CHILDREN[i], ))
print("\n")

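format is a method of Python strings, not a pygame feature, so you can build the string with it first and then render and blit the result as usual.
For example (myfont is assumed to be an existing pygame.font.Font object):
hp = 56
player_hp_text = "Hit Points: {:>3}".format( hp )
# The second argument of render() is the antialias flag, not a font size
player_hp_bitmap = myfont.render( player_hp_text, True, (255, 255,0) )
screen.blit( player_hp_bitmap, ( 10, 10 ) )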
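To put a whole list on screen, one option is to format each row first and pair it with a position that moves down by a fixed step. The sketch below keeps the layout step free of pygame so it can be checked on its own; layout_rows and draw_rows are hypothetical helper names, and draw_rows assumes you pass in your own screen surface and a pygame.font.Font object.

```python
NAMES = ['Deanerys T.', 'Jon S.', 'Gregor C.']
AGE = [22, 34, 38]

def layout_rows(names, ages, x=10, y=10, line_height=20):
    """Format one line of text per person and pair it with an (x, y) position."""
    rows = []
    for i, (name, age) in enumerate(zip(names, ages)):
        text = "NAME: {:>12} AGE: {}".format(name, age)
        rows.append((text, (x, y + i * line_height)))  # each row sits line_height lower
    return rows

def draw_rows(screen, font, rows, colour=(255, 255, 0)):
    """Render and blit each formatted row; screen and font come from your pygame setup."""
    for text, position in rows:
        bitmap = font.render(text, True, colour)  # True enables antialiasing
        screen.blit(bitmap, position)

rows = layout_rows(NAMES, AGE)
for text, position in rows:
    print(position, text)
```

Inside the game loop you would call draw_rows(screen, myfont, rows) before updating the display. Note that padding with spaces, as {:>12} does, only lines up visually when the font is monospaced.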

reportlab dynamic data-driven header outputs wrong subtitle

I have created some fictitious, though representative, clinical trial type data using Pandas, and now come to some test reporting in ReportLab.
The data has a block (~50 rows) where the treatment column is 'Placebo' and the same amount where the treatment is 'Active'. I simply want to list the data using a sub-heading of 'Treatment Group: Placebo' for the first set and 'Treatment Group: Active' for the second.
There are some hits on a similar topic, and, indeed I've used one of the suggested techniques, namely to extend the arguments of a header functions using partial from functools.
title1 = "ACME Corp CONFIDENTIAL"
title2 = "XYZ123 / Anti-Hypertensive Draft"
title3 = "Protocol XYZ123"
title4 = "Study XYZ123"
title5 = "Listing of Demographic Data by Treatment Arm"
title6 = "All subjects"
def title(canvas, doc, bytext):
    canvas.saveState()
    canvas.setFont(styleN.fontName, styleN.fontSize)
    canvas.drawString(DOCMARGIN, PAGE_HEIGHT*.975, title1)
    canvas.drawString(DOCMARGIN, PAGE_HEIGHT*.950, title2)
    canvas.drawString(DOCMARGIN, PAGE_HEIGHT*.925, title3)
    canvas.drawCentredString(PAGE_WIDTH/2.0, PAGE_HEIGHT*.900, title4)
    canvas.drawCentredString(PAGE_WIDTH/2.0, PAGE_HEIGHT*.875, title5)
    canvas.drawCentredString(PAGE_WIDTH/2.0, PAGE_HEIGHT*.850, title6)
    canvas.drawString(DOCMARGIN, PAGE_HEIGHT*.825, "Treatment Group:" + bytext)
    canvas.restoreState()
This is then called as follows. n_groups has the value of 2 from a summary query and 0 maps to 'Placebo' and 1 maps to active.
def build_pdf(doc):
    ptemplates = []
    for armcd in range(n_groups):
        ptemplates.append(PageTemplate(id = 'PT' + str(armcd), frames = [dataFrame,],
                                       onPage = partial(title, bytext=t_dict[armcd]),
                                       onPageEnd = foot))
    doc.addPageTemplates(ptemplates)
    elements = []
    for armcd in range(n_groups):
        elements.append(NextPageTemplate('PT' + str(armcd)))
        sublist = [t for t in lista if t[0] == (armcd+1)]
        sublist.insert(0,colheads)
        data_table = Table(sublist, 6*[40*mm], len(sublist)*[DATA_CELL_HEIGHT], repeatRows=1)
        data_table.setStyle(styleC)
        elements.append(data_table)
        elements.append(PageBreak())
    doc.build(elements)
The report produces 6 pages. The first 3 pages of placebo data are correct, pages 5 & 6 of active data are correct, but page 4 - which should be the first page of the second 'active' group has the sub-title 'Treatment Group: Placebo'.
I have re-organized the order of the statements multiple times, but can't get Page 4 to sub-title correctly. Any help, suggestions or magic would be much appreciated.
[Edit 1: sample data structure]
The 'top' of the data starts as:
[
[1, 'Placebo', '000001-000015', '1976-09-20', 33, 'F', 'Black'],
[1, 'Placebo', '000001-000030', '1959-04-26', 50, 'M', 'Asian'],
[1, 'Placebo', '000001-000031', '1946-02-07', 64, 'F', 'Asian'],
[1, 'Placebo', '000001-000046', '1947-11-08', 62, 'M', 'Asian'],
etc for 50 rows, then continues with
[2, 'Active', '000001-000002', '1962-02-28', 48, 'F', 'Black'],
[2, 'Active', '000001-000008', '1975-10-20', 34, 'M', 'Black'],
[2, 'Active', '000001-000013', '1959-01-19', 51, 'M', 'White'],
[2, 'Active', '000001-000022', '1962-01-12', 48, 'F', 'Black'],
[2, 'Active', '000001-000036', '1976-10-17', 33, 'F', 'Asian'],
[2, 'Active', '000001-000045', '1980-12-31', 29, 'F', 'White'],
for another 50.
The column header inserted is:
['Treatment Arm Code',
'Treatment Arm',
'Site ID - Subject ID',
'Date of Birth',
'Age (Years)',
'Gender',
'Ethnicity'],
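[Edit 2: A solution - move the PageBreak() and make it conditional:]
NextPageTemplate only takes effect at the next page break, so it has to be queued before the PageBreak() that starts the new group rather than after it.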
def build_pdf(doc):
    ptemplates = []
    for armcd in range(n_groups):
        ptemplates.append(PageTemplate(id = 'PT' + str(armcd), frames = [dataFrame,],
                                       onPage = partial(title, bytext=t_dict[armcd]),
                                       onPageEnd = foot))
    doc.addPageTemplates(ptemplates)
    elements = []
    for armcd in range(n_groups):
        elements.append(NextPageTemplate('PT' + str(armcd)))
        if armcd > 0:
            elements.append(PageBreak())
        sublist = [t for t in lista if t[0] == (armcd+1)]
        sublist.insert(0,colheads)
        data_table = Table(sublist, 6*[40*mm], len(sublist)*[DATA_CELL_HEIGHT], repeatRows=1)
        data_table.setStyle(styleC)
        elements.append(data_table)
    doc.build(elements)

Extract lists from a list in dictionary

Suppose I have this dictionary:
self.dict = {'A':[[10, 20],[23,76,76],[23,655,54]], 'B':[30, 40, 50], 'C':[60, 100]}
The value under key 'A' is a list of lists. I want to get only the first 2 lists of 'A', i.e. [10, 20],[23,76,76]. I tried looping, but it prints every element of every value instead of just those two lists:
class T(object):
    def __init__(self):
        self.dict = {'A':[[10, 20],[23,76,76],[23,655,54]], 'B':[30, 40, 50], 'C':[60, 100]}
    def output(self):
        for i in self.dict:
            for j in self.dict[i]:
                first_two_lists = j
                print ("%s" % (first_two_lists))
if __name__ == '__main__':
    T().output()
How can I get just those two lists?

>>> d = {'A':[[10, 20],[23,76,76],[23,655,54]], 'B':[30, 40, 50], 'C':[60, 100]}
>>> d['A'][:2]
[[10, 20], [23, 76, 76]]

Using list slicing:
>>> d = {'A':[[10, 20],[23,76,76],[23,655,54]], 'B':[30, 40, 50], 'C':[60, 100]}
>>> d.get('A')[:2]
[[10, 20], [23, 76, 76]]
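Applied to the class from the question, the same slice could look like the sketch below. The method name first_lists and its count parameter are made up for illustration.

```python
class T(object):
    def __init__(self):
        self.dict = {'A': [[10, 20], [23, 76, 76], [23, 655, 54]],
                     'B': [30, 40, 50],
                     'C': [60, 100]}

    def first_lists(self, key, count=2):
        """Return the first `count` elements stored under `key`."""
        # Slicing never raises IndexError; it simply returns fewer items if the list is short
        return self.dict[key][:count]

first_two = T().first_lists('A')
print(first_two)
# [[10, 20], [23, 76, 76]]
```

Indexing with self.dict[key] raises KeyError for a missing key, whereas self.dict.get(key) returns None, which would then fail on the slice. Which behaviour is preferable depends on how you want missing keys reported.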
