Make sentence from value of dictionary - python

Link to the original text file:
https://medusa.ugent.be/en/exercises/187053144/description/wM6YaQUbWdHKPhQX/media/ICD.txt
This is what I got:
given_string = 'You are what you eat.'
dictionary ={'D89.1': 'Cryoglobulinemia', 'M87.332': 'Other secondary osteonecrosis of left radius', 'M25.57': 'Pain in ankle and joints of foot', 'H59.111': 'Intraoperative hemorrhage and hematoma of right eye and adnexa complicating an ophthalmic procedure', 'I82.5Z9': 'Chronic embolism and thrombosis of unspecified deep veins of unspecified distal lower extremity', 'T38.3X': 'Poisoning by, adverse effect of and underdosing of insulin and oral hypoglycemic [antidiabetic] drugs', 'H95.52': 'Postprocedural hematoma of ear and mastoid process following other procedure', 'Q90.1': 'Trisomy 21, mosaicism (mitotic nondisjunction)', 'X83.8': 'Intentional self-harm by other specified means', 'H02.145': 'Spastic ectropion of left lower eyelid', 'M67.341': 'Transient synovitis, right hand', 'P07.32': 'Preterm newborn, gestational age 29 completed weeks', 'R44.8': 'Other symptoms and signs involving general sensations and perceptions', 'R03.1': 'Nonspecific low blood-pressure reading', 'Q03': 'Congenital hydrocephalus', 'C11.0': 'Malignant neoplasm of superior wall of nasopharynx', 'C44.4': 'Other and unspecified malignant neoplasm of skin of scalp and neck', 'N48.5': 'Ulcer of penis', 'T50.2X1': 'Poisoning by carbonic-anhydrase inhibitors, benzothiadiazides and other diuretics, accidental (unintentional)', 'V92.13': 'Drowning and submersion due to being thrown overboard by motion of other powered watercraft', 'D30.0': 'Benign neoplasm of kidney', 'M08.06': 'Unspecified juvenile rheumatoid arthritis, knee', 'T41.5X4': 'Poisoning by therapeutic gases, undetermined', 'T59.3X2': 'Toxic effect of lacrimogenic gas, intentional self-harm', 'S84.91': 'Injury of unspecified nerve at lower leg level, right leg', 'Z80.4': 'Family history of malignant neoplasm of genital organs', 'M05.34': 'Rheumatoid heart disease with rheumatoid arthritis of hand', 'Y36.531': 'War operations involving thermal radiation effect of nuclear weapon, civilian', 'H59.88': 'Other intraoperative complications 
of eye and adnexa, not elsewhere classified', 'R29.91': 'Unspecified symptoms and signs involving the musculoskeletal system', 'M71.139': 'Other infective bursitis, unspecified wrist', 'S00.441': 'External constriction of right ear', 'V04': 'Pedestrian injured in collision with heavy transport vehicle or bus', 'C92.1': 'Chronic myeloid leukemia, BCR/ABL-positive', 'I82.60': 'Acute embolism and thrombosis of unspecified veins of upper extremity', 'I75.89': 'Atheroembolism of other site', 'S51.031': 'Puncture wound without foreign body of right elbow', 'Z01.110': 'Encounter for hearing examination following failed hearing screening', 'I06.8': 'Other rheumatic aortic valve diseases', 'Z68.25': 'Body mass index (BMI) 25.0-25.9, adult', 'A66': 'Yaws', 'S78.921': 'Partial traumatic amputation of right hip and thigh, level unspecified', 'F44': 'Dissociative and conversion disorders', 'O87.8': 'Other venous complications in the puerperium', 'K04.3': 'Abnormal hard tissue formation in pulp', 'V38.7': 'Person on outside of three-wheeled motor vehicle injured in noncollision transport accident in traffic accident', 'V36.1': 'Passenger in three-wheeled motor vehicle injured in collision with other nonmotor vehicle in nontraffic accident', 'B94.9': 'Sequelae of unspecified infectious and parasitic disease', 'K50.911': "Crohn's disease, unspecified, with rectal bleeding", 'S00.52': 'Blister (nonthermal) of lip and oral cavity', 'T43.1': 'Poisoning by, adverse effect of and underdosing of monoamine-oxidase-inhibitor antidepressants', 'B99.8': 'Other infectious disease', 'S97.12': 'Crushing injury of lesser toe(s)', 'S02.69': 'Fracture of mandible of other specified site', 'V29.10': 'Motorcycle passenger injured in collision with unspecified motor vehicles in nontraffic accident', 'Z68.35': 'Body mass index (BMI) 35.0-35.9, adult', 'A81.2': 'Progressive multifocal leukoencephalopathy', 'V44.4': 'Person boarding or alighting a car injured in collision with heavy transport vehicle 
or bus', 'M62.51': 'Muscle wasting and atrophy, not elsewhere classified, shoulder', 'M62.151': 'Other rupture of muscle (nontraumatic), right thigh', 'V52.2': 'Person on outside of pick-up truck or van injured in collision with two- or three-wheeled motor vehicle in nontraffic accident', 'E09.622': 'Drug or chemical induced diabetes mellitus with other skin ulcer', 'S43.492': 'Other sprain of left shoulder joint', 'M08.212': 'Juvenile rheumatoid arthritis with systemic onset, left shoulder', 'R00.0': 'Tachycardia, unspecified', 'G21.8': 'Other secondary parkinsonism', 'W58.01': 'Bitten by alligator', 'D46.1': 'Refractory anemia with ring sideroblasts', 'H61.32': 'Acquired stenosis of external ear canal secondary to inflammation and infection', 'H95.0': 'Recurrent cholesteatoma of postmastoidectomy cavity', 'Z72.4': 'Inappropriate diet and eating habits', 'Z68.41': 'Body mass index (BMI) 40.0-44.9, adult', 'S20.172': 'Other superficial bite of breast, left breast', 'I63.232': 'Cerebral infarction due to unspecified occlusion or stenosis of left carotid arteries', 'M14.811': 'Arthropathies in other specified diseases classified elsewhere, right shoulder', 'E13.41': 'Other specified diabetes mellitus with diabetic mononeuropathy', 'H02.53': 'Eyelid retraction', 'V95.49': 'Other spacecraft accident injuring occupant', 'D74.0': 'Congenital methemoglobinemia', 'D60.1': 'Transient acquired pure red cell aplasia', 'T52.1X2': 'Toxic effect of benzene, intentional self-harm', 'O71.2': 'Postpartum inversion of uterus', 'M08.439': 'Pauciarticular juvenile rheumatoid arthritis, unspecified wrist', 'M01.X72': 'Direct infection of left ankle and foot in infectious and parasitic diseases classified elsewhere', 'H95.3': 'Accidental puncture and laceration of ear and mastoid process during a procedure', 'C74.92': 'Malignant neoplasm of unspecified part of left adrenal gland', 'G00': 'Bacterial meningitis, not elsewhere classified', 'M19.011': 'Primary osteoarthritis, right 
shoulder', 'G72.49': 'Other inflammatory and immune myopathies, not elsewhere classified', 'Z68.34': 'Body mass index (BMI) 34.0-34.9, adult', 'V86.64': 'Passenger of military vehicle injured in nontraffic accident', 'L20.9': 'Atopic dermatitis, unspecified', 'S65.51': 'Laceration of blood vessel of other and unspecified finger', 'B67.1': 'Echinococcus granulosus infection of lung', 'S08.81': 'Traumatic amputation of nose', 'Z36.5': 'Encounter for antenatal screening for isoimmunization', 'S59.22': 'Salter-Harris Type II physeal fracture of lower end of radius', 'M66.359': 'Spontaneous rupture of flexor tendons, unspecified thigh', 'I69.919': 'Unspecified symptoms and signs involving cognitive functions following unspecified cerebrovascular disease', 'I25.700': 'Atherosclerosis of coronary artery bypass graft(s), unspecified, with unstable angina pectoris', 'V24.0': 'Motorcycle driver injured in collision with heavy transport vehicle or bus in nontraffic accident', 'S53.025': 'Posterior dislocation of left radial head', 'Q72.819': 'Congenital shortening of unspecified lower limb', 'G44.82': 'Headache associated with sexual activity', 'M93.2': 'Osteochondritis dissecans', 'V44.6': 'Car passenger injured in collision with heavy transport vehicle or bus in traffic accident', 'O90.89': 'Other complications of the puerperium, not elsewhere classified', 'T83.518': 'Infection and inflammatory reaction due to other urinary catheter', 'Z02.9': 'Encounter for administrative examinations, unspecified', 'S55.091': 'Other specified injury of ulnar artery at forearm level, right arm'}
Each character of the string must be replaced by a Hippocrates-code chosen at random among all the codes that can encode that character. A Hippocrates-code consists of an ICD code whose description contains the character, followed by a dot and the index of the character within that description.
This is the kind of answer that I am supposed to get:
A66.0 M62.51.29 V44.6.68 H95.3.70 M08.06.26 S51.031.39 V92.13.17 V95.49.25 P07.32.46 C11.0.44 V04.45 E13.41.30 G21.8.5 R00.0.4 V52.2.54 B67.1.38 V24.0.43 M01.X72.10 C74.92.35 G72.49.35 Z68.41.24
And this is the answer that I got:
F44.6.4 S78.922.3 W36.1.17 S93.121.2 E10.32.39 A00.1.12 S90.464.3 T37.1X.9 T43.2.17 W24.0.3 Q60.3.5 V59.9.14 S66.911.5 W93.42 V14.1.34 Y92.139.14 T21.06.12 T65.89.6 Q95.3.4 S85.161.16 S93.121.7 T37.1X.18 V49.60.23 T37.1X5.7 F98.29.16 J10.89.14
To get that, I wrote the following code:
import re
import random
class Hippocrates:
def __init__(self, code):
self.code = code
def description(self, x):
line_list = []
split_point = []
k = []
v = []
with open(self.code) as f:
for line in f:
for i in line:
if i == " ":
split_point.append(line.find(i))
with open(self.code) as f:
for line in f:
line_list.append(line.rstrip())
for i in line_list:
a = i.split(" ", 1)
k.append(a[0])
v.append(a[1])
d = dict(zip(k, v))
for key, value in d.items():
if x == key:
return d[key]
else:
raise ValueError('invalid ICD-code')
def character(self, numb):
line_list = []
split_point = []
k = []
v = []
with open(self.code) as f:
for line in f:
for i in line:
if i == " ":
split_point.append(line.find(i))
with open(self.code) as f:
for line in f:
line_list.append(line.rstrip())
for i in line_list:
a = i.split(" ", 1)
k.append(a[0])
v.append(a[1])
d = dict(zip(k, v))
rev = numb[::-1]
revs = rev.split('.',1)
r1 =(revs[1][::-1])
r2 = (revs[0][::-1])
for key, value in d.items():
if r1 == key:
answer = d[key]
result = answer[int(r2)]
return result
else:
raise ValueError('invalid Hippocrates-code')
def codes(self, char):
line_list = []
split_point = []
k = []
v = []
r_v = []
code_result = []
des_result = []
des_result2 = []
location = []
final = []
with open(self.code) as f:
for line in f:
for i in line:
if i == " ":
split_point.append(line.find(i))
with open(self.code) as f:
for line in f:
line_list.append(line.rstrip())
for i in line_list:
a = i.split(" ", 1)
k.append(a[0])
v.append(a[1])
d = dict(zip(k, v))
for i in v:
for x in i:
if x == char:
r_v.append(i)
for key, value in d.items():
for i in r_v:
if i == value:
code_result.append(key)
for key in d.keys():
for i in code_result:
if i == key:
des_result.append(d[i])
for i in des_result:
if i not in des_result2:
des_result2.append(i)
for i in des_result2:
regex = re.escape(char)
a = [m.start() for m in re.finditer(regex,i)]
location.append(a)
location = (sum(location,[]))
for i in range(len(code_result)):
answer = (str(code_result[i]) +'.'+ str(location[i]))
final.append(answer)
return (set(final))
def encode(self, plaintxt):
line_list = []
split_point = []
#key of dictionary
k = []
#value of dictionary
v = []
#description that contain character with index
r = []
#list of possible choice
t = []
#randomly chosen result from t
li_di = []
#description
des = []
#index of char in description
index_char = []
#answer to print
resul = []
dictlist = []
answers = []
with open(self.code) as f:
for line in f:
for i in line:
if i == " ":
split_point.append(line.find(i))
with open(self.code) as f:
for line in f:
line_list.append(line.rstrip())
for i in line_list:
a = i.split(" ", 1)
k.append(a[0])
v.append(a[1])
d = dict(zip(k, v))
print(d)
for key, value in d.items():
for i in plaintxt:
if i in value:
answer = d[key] +':'+ str(d[key].index(i))
r.append(answer)
print(r)
a = len(plaintxt)
b=0
for i in range(len(r)):
t.append(r[b::a])
b+=1
if b == len(plaintxt):
break
for i in t:
li_di.append(random.choice(i))
for i in li_di:
sep = i.split(":", 1)
des.append(sep[0])
index_char.append(sep[1])
print(index_char)
for i in des:
for key, value in d.items():
if i == value:
resul.append(key)
print(resul)
for i in range(len(resul)):
answers.append(resul[i]+'.'+index_char[i]+'')
return(" ".join(answers))
The codes that represent the characters of given_string should appear in the same order as the characters in the original string, but I messed that up. How can I fix this?

This should work for your encode function:
def encode(self, plaintxt):
    code_map = {}
    codes = []
    with open(self.code) as f:
        for line in f:
            line = line.rstrip().split(' ', 1)
            code_map[line[0]] = line[1]
    for ch in plaintxt:
        matches = []
        for key, value in code_map.items():
            pos = -1
            while True:
                pos = value.find(ch, pos + 1)
                if pos != -1:
                    matches.append((key, pos))
                else:
                    break
        if not matches:
            raise ValueError(f'Character {ch} cannot be encoded as there are no matches')
        code_tuple = random.choice(matches)
        code, idx = code_tuple
        codes.append(f'{code}.{idx}')
    return ' '.join(codes)
Edit: I updated this to make it more space-efficient, by getting rid of char_map and appending codes as it goes
First, it creates a dict of keys as codes and values as the corresponding strings. Then it iterates through the given plaintxt string, and searches all of the values of the dict for matches (including multiple matches in a single value), and adds this to a matches list of tuples, where each tuple contains a suitable code and the index of the match. If there are no matches, it raises a ValueError as soon as it runs into an issue. It chooses randomly from each list of tuples to choose some code and index pair, and appends this to a list on the fly, and then at the end it joins this list to make your encoded string.
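To check an encoding, it can help to decode it again. The following is a minimal sketch, assuming a `code_map` dict of ICD code to description like the one built above; the function name `decode` and the two sample entries are only illustrative. Because ICD codes contain dots themselves, the index is split off at the last dot with `rpartition`.

```python
def decode(encoded, code_map):
    """Turn a space-separated string of Hippocrates-codes back into text."""
    chars = []
    for token in encoded.split():
        # ICD codes contain dots themselves, so split at the LAST dot only
        code, _, idx = token.rpartition('.')
        if code not in code_map or not idx.isdigit():
            raise ValueError(f'invalid Hippocrates-code: {token}')
        description = code_map[code]
        if int(idx) >= len(description):
            raise ValueError(f'invalid Hippocrates-code: {token}')
        chars.append(description[int(idx)])
    return ''.join(chars)


# two entries taken from the dictionary in the question
code_map = {'A66': 'Yaws', 'W58.01': 'Bitten by alligator'}
decoded = decode('A66.0 W58.01.10 W58.01.1', code_map)
print(decoded)
```

Decoding the output of `encode` should give back the original `plaintxt`, in the original order.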

If memory is not a problem, I think you should build an index of possible choices of each character from the dictionary. Here is an example code:
import random

def build_char_codes(d):
    # maps each character to {code: [positions of that character in the description]}
    result = {}
    for key, val in d.items():
        for i in range(len(val)):
            ch = val[i]
            if ch not in result:
                result[ch] = {key: [i]}
            else:
                result[ch][key] = result[ch].get(key, []) + [i]
    return result

def get_code(ch, char_codes):
    # random.sample() needs a sequence (dict views raise TypeError on Python 3.11+),
    # so pick from a list of the keys instead
    key = random.choice(list(char_codes[ch]))
    char_pos = random.choice(char_codes[ch][key])
    code = '{}.{}'.format(key, char_pos)
    return code

char_codes = build_char_codes(dictionary)
given_string = 'You are what you eat.'
codes = [get_code(ch, char_codes) for ch in given_string]
print(' '.join(codes))
Notes:
char_codes indexes all possible choices for each character in the dictionary.
get_code first samples a key (uniformly at random) among the codes whose description contains the character, and then samples a position within that description (uniformly at random). It therefore does not sample uniformly among all the possible choices for a character.
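If uniform sampling over all possible choices matters, one option is to weight each code by the number of positions it offers. This is only a sketch: `get_code_uniform` is a hypothetical variant of `get_code`, and the toy index below is written by hand in the same shape that `build_char_codes` returns.

```python
import random


def get_code_uniform(ch, char_codes):
    """Pick one (code, position) pair uniformly among all occurrences of ch."""
    by_code = char_codes[ch]
    keys = list(by_code)
    # weight each code by how often ch occurs in its description
    weights = [len(by_code[k]) for k in keys]
    key = random.choices(keys, weights=weights, k=1)[0]
    pos = random.choice(by_code[key])
    return '{}.{}'.format(key, pos)


# hand-written toy index: 'a' occurs once in 'Yaws' (A66)
# and twice in 'Bitten by alligator' (W58.01)
char_codes = {'a': {'A66': [1], 'W58.01': [10, 15]}}
print(get_code_uniform('a', char_codes))
```

With these weights, each of the three occurrences of 'a' is chosen with probability 1/3.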

In preparation for the transformation, you could create a dictionary with each letter in the ICD description mapping to a list of codes that contain it at various indexes.
Then, the transformation process would simply be a matter of picking one of the code.index from the entry in the dictionary for each letter in the given string:
preparation ...
with open(fileName,'r') as f:
    # skip empty lines (e.g. after the final newline), which could not be unpacked below
    icd = [line.split(" ",1) for line in f.read().split("\n") if line.strip()]
icdLetters = dict() # list of ICD codes with index for each possible letter
for code,description in icd:
    for i,letter in enumerate(description):
        icdLetters.setdefault(letter,[]).append(f"{code}.{i}")
transformation....
import random
given_string = 'You are what you eat.'
result = [ random.choice(icdLetters.get(c,["-"])) for c in given_string ]
output:
print(result)
['A66.0', 'T80.22.35', 'S53.136.34', 'C40.90.33', 'S53.136.43', 'Z96.621.12', 'B57.30.24', 'H59.121.55', 'V14.1.43', 'S93.121.47', 'H59.121.9', 'V04.92.17', 'T80.22.80', 'O16.1.22', 'T25.61.10', 'S53.136.34', 'F44.6.32', 'M67.232.29', 'M89.771.34', 'S93.121.7', 'Z68.36.29']
If you want to save some memory, your dictionary could store indexes in the main list of icd codes and descriptions instead of the formatted values:
with open(fileName,'r') as f:
    # skip empty lines (e.g. after the final newline), which could not be unpacked below
    icd = [line.split(" ",1) for line in f.read().split("\n") if line.strip()]
icdLetters = dict()
for codeIndex,(code,description) in enumerate(icd):
    for letterIndex,letter in enumerate(description):
        icdLetters.setdefault(letter,[]).append((codeIndex,letterIndex))

import random
def letterToCode(letter):
    if letter not in icdLetters: return "-"
    codeIndex,letterIndex = random.choice(icdLetters[letter])
    return f"{icd[codeIndex][0]}.{letterIndex}"

given_string = 'You are what you eat.'
result = [ letterToCode(c) for c in given_string ]

Related

How to create a nested dictionary from a text file

So, my file looks like this :
Intestinal infectious diseases (001-003)
001 Cholera
002 Typhoid and paratyphoid fevers
003 Other salmonella infections
Tuberculosis (004-006)
004 Primary tuberculous infection
005 Pulmonary tuberculosis
006 Other respiratory tuberculosis
.
.
.
I'm supposed to make a nested dictionary with the disease group as keys and the dictionary containing the disease code and name, as value for the first dictionary. I'm having some trouble separating the disease codes into their own disease groups. Here's what I've done so far:
import json
icd9_encyclopedia={}
lines = []
f = open("icd9_info.txt", 'r')
for line in f:
    line = line.rstrip("\n")
    if line[0].isnumeric() == True:
        icd9_encyclopedia[line] = ???
f.close()
solution
import itertools
from pathlib import Path

# load text lines
lines = Path('data.txt').read_text().split('\n')

# build output dictionary
icd9_encyclopedia = {
    # build single group dictionary
    group_name: {
        int(code): disease_name
        # split each disease line into code and text name
        for disease_string in disease_strings
        for (code, _, disease_name) in [disease_string.partition(' ')]
    }
    # get groups separated by an empty line
    # isolate first item in each group as its name
    for x, (group_name, *disease_strings) in itertools.groupby(lines, bool) if x
}
result
{'Intestinal infectious diseases (001-003)': {1: 'Cholera',
                                              2: 'Typhoid and paratyphoid '
                                                 'fevers',
                                              3: 'Other salmonella infections'},
 'Tuberculosis (004-006)': {4: 'Primary tuberculous infection',
                            5: 'Pulmonary tuberculosis',
                            6: 'Other respiratory tuberculosis'}}
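The comprehension above leans on `itertools.groupby(lines, bool)`, which may be easier to follow in isolation. Here is a small sketch with made-up, shortened lines: consecutive non-empty lines share the key `True`, and a blank line forms its own `False` run, so blank lines act as group separators.

```python
import itertools

# shortened, made-up sample; the blank string stands for an empty line in the file
lines = ['Group A', '001 Cholera', '002 Typhoid', '', 'Group B', '004 Primary']

# consecutive lines with the same truthiness end up in the same chunk
chunks = [(is_text, list(chunk)) for is_text, chunk in itertools.groupby(lines, bool)]
for is_text, chunk in chunks:
    print(is_text, chunk)
```

Note that this approach only works if the file really has a blank line between groups, as the comments in the solution assume.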
Here's another take on the problem that uses just basic Python:
from pprint import pprint

icd9_encyclopedia = {}
key = None
item = {}
with open("icd9_info.txt") as f:
    for line in f:
        line = line.strip()
        if not line:
            # Skip blank lines (line[0] would raise IndexError on them)
            continue
        if not line[0].isdigit():
            # Start a new item
            if key:
                # Store the prior item in the main dictionary
                icd9_encyclopedia[key] = item
            # Initialize the new item
            key = line
            item = {}
        else:
            # A detail entry - add it to the current item
            num, rest = line.split(' ', 1)
            item[num] = rest
# Store the final item to the dictionary
if key:
    icd9_encyclopedia[key] = item
pprint(icd9_encyclopedia)
Result:
{'Intestinal infectious diseases (001-003)': {'001': 'Cholera',
                                              '002': 'Typhoid and paratyphoid '
                                                     'fevers',
                                              '003': 'Other salmonella '
                                                     'infections'},
 'Tuberculosis (004-006)': {'004': 'Primary tuberculous infection',
                            '005': 'Pulmonary tuberculosis',
                            '006': 'Other respiratory tuberculosis'}}
I used defaultdict to easily make a nested dictionary, as follows:
from collections import defaultdict

icd9_encyclopedia = defaultdict(dict)
disease_group = ""
with open("icd9_info.txt", 'r') as f:
    for line in f:
        line = line.rstrip("\n")  # remove '\n' (safe even if the last line has none)
        if line == "":  # skip if blank line
            continue
        if not line[0].isdigit():
            disease_group = line  # temporarily save current disease group name for the following lines
        else:
            code, name = line.split(maxsplit=1)
            icd9_encyclopedia[disease_group][code] = name

for key, value in icd9_encyclopedia.items():
    print(key, value)
#Intestinal infectious diseases (001-003) {'001': 'Cholera', '002': 'Typhoid and paratyphoid fevers', '003': 'Other salmonella infections'}
#Tuberculosis (004-006) {'004': 'Primary tuberculous infection', '005': 'Pulmonary tuberculosis', '006': 'Other respiratory tuberculosis'}
You can see more detail about defaultdict here: https://www.geeksforgeeks.org/defaultdict-in-python/
validInt checks whether the data is a valid integer
def validInt(data):
    try:
        int(data)
    except Exception as e:
        return False
    return True

encyclo = {}
with open("file.data",'r') as f:
    lines = f.readlines()
    for line in lines:
        if len(line.strip()) == 0:#line should not be empty
            continue
        first = line.split(' ')[0]
        if validInt(first):
            di = encyclo[list(encyclo.keys())[-1]] # returns a dictionary
            di[first] = line[len(first):] # inserting data to dictionary len(first) is used to skip the numeric part
        else:
            encyclo[line] = {}
for key, value in encyclo.items():#displaying data
    print(key, value)
$ python3 test.py
Intestinal infectious diseases (001-003)
{'001': ' Cholera\n', '002': ' Typhoid and paratyphoid fevers\n', '003': ' Other salmonella infections\n'}
Tuberculosis (004-006)
{'004': ' Primary tuberculous infection\n', '005': ' Pulmonary tuberculosis\n', '006': ' Other respiratory tuberculosis\n'}

How do I fix: "TypeError: cannot unpack non-iterable NoneType object"

I've been searching through Stack Overflow and various other sites, but I've been unable to resolve this error for about a week now.
I'm trying to get the minimum and maximum values from each country within the dictionary. The key of the dictionary is the region. I'm unsure where the type error is, but I'd appreciate it if someone could help.
Here's the error:
min_tup, max_tup = get_min_max(D,region,option)
File "proj08.py", line 107, in get_min_max
return min[0], max[0]
UnboundLocalError: local variable 'min' referenced before assignment
Here's the sample input:
Region,option: North America , 2
Here's the documentation explaining the function and .csv
https://www.cse.msu.edu/~cse231/Online/Projects/Project08/Project08.pdf
https://www.cse.msu.edu/~cse231/Online/Projects/Project08/data_short.csv
Here's the code:
import csv
from operator import itemgetter
# do NOT import sys
REGION_LIST = ['East Asia & Pacific',
'Europe & Central Asia',
'Latin America & Caribbean',
'Middle East & North Africa',
'North America',
'South Asia',
'Sub-Saharan Africa']
PROMPT = "\nSpecify a region from this list or 'q' to quit -- \nEast Asia & Pacific,Europe & Central Asia,Latin America & Caribbean,Middle East & North Africa,North America,South Asia,Sub-Saharan Africa: "
def open_file():
# Opens a file
while True:
try:
file = input("Input a file: ")
fp = open(file, "r")
return fp
except FileNotFoundError:
print("Invalid filename, please try again.")
def read_file(fp):
# Sets read Csv file to a variable
reader = csv.reader(fp)
# Skips the header
next(reader, None)
# Country List
country_list = []
# sets a dictionary
Dict = dict()
for line in reader:
try:
skipper = ""
if skipper in line:
continue
else:
region = line[6]
country = line[0].strip()
electricty = float(line[2])
fertility = float(line[3])
gdp = float(line[4])
life_expectancy = float(line[5])
country_list = [country, electricty, fertility, GDP,
life_expectancy]
if region in Dict.keys():
Dict[region].append(country_list)
elif region not in Dict.keys():
Dict[region] = [country_list]
else:
continue
except KeyError:
continue
except ValueError:
continue
return Dict
def get_min_max(Dict, region, option):
lis = []
for k, v in Dict.items():
if region in k[0]:
if option == 1:
electricity = v[1]
tup = tuple(k, electricity)
lis.append(tup)
min = sorted(lis, key=itemgetter(1))
max = sorted(lis, key=itemgetter(1), reverse=True)
if option == 2:
fertility = v[2]
tup = tuple(k, fertility)
lis.append(tup)
min = sorted(lis, key=itemgetter(1))
max = sorted(lis, key=itemgetter(1), reverse=True)
if option == 3:
gdp = v[3]
tup = tuple(k, gdp)
lis.append(tup)
min = sorted(lis, key=itemgetter(1))
max = sorted(lis, key=itemgetter(1), reverse=True)
if option == 4:
life_expectancy = v[4]
tup = tuple(k, life_expectancy)
lis.append(tup)
min = sorted(lis, key=itemgetter(1))
max = sorted(lis, key=itemgetter(1), reverse=True)
return min[0], max[0]
def display_all_countries(D, region):
if region in REGION_LIST:
if region == 'all':
print("\nDisplaying {} Region:\n".format(region))
print("{:32s}{:>20s}{:>20s}{:>17s}{:>18s}".format(
"Country", "Electricity Access", "Fertility rate", "GDP per capita", "Life expectancy"))
for k, v in D.items():
if region in v[0]:
country = v[0]
electricity = v[1]
fertility = v[2]
gdp = v[3]
life = v[4]
tup = (country, electricity, fertility, gdp, life)
sorted(tup, key=itemgetter(0), reverse=True)
print("{:32s}{:>20.2f}{:>20.2f}{:>17.2f}{:>18.2f}".format(
tup[0], tup[1], tup[2], tup[3], tup[4]))
if region not in REGION_LIST:
return None
def get_top10(D):
pass
def display_options():
"""
DO NOT CHANGE
Display menu of options for program
"""
OPTIONS = """\nMenu
1: Minimum and Maximum Countries Access to Electricity
2: Minimum and Maximum Countries Fertility Rate
3: Minimum and Maximum Countries GDP per Capita
4: Minimum and Maximum Countries Life Expectancy
5: List of countries in a region
6: Top 10 Countries in the world by GDP per Capita\n"""
print(OPTIONS)
def main():
file = open_file()
# while True:
# if user == 'East Asia & Pacific' or user == 'Europe & Central Asia' or user == 'Middle East & North Africa' or user == 'Latin America & Caribbean' or user == 'North America' or user == 'South Asia' or user == 'Sub-Saharan Africa':
# print("\nRegion: ".format(user))
# display_options()
# if user == "Q" or user == "q":
# break
# else:
# user = input(PROMPT)
region = 'North America'
option = '2'
superD = read_file(file)
mina = get_min_max(superD, region, option)
#print(mina)
if __name__ == '__main__':
main()
The error is telling you that you can't use an unpacking assignment such as
x, y = function()
because the function returned something that can't be unpacked (None, in this case).
This means that your function returned None somehow. We can't say for sure without a reproducible example, but I would guess that it's because of the first if condition in your function, which can return None.
Although it is allowed, it is generally not a great idea to have multiple different return types in a Python function. This is because the caller has to know how to handle the different things that the function might do, instead of being able to trust that the function will work and give them a good answer (assuming, of course, that they are using correct inputs).
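The following sketch reproduces the error message and shows one way to keep the return type consistent. The function names and the sample values are made up for illustration; they are not taken from the project code.

```python
def broken(pairs):
    if pairs:
        return pairs[0], pairs[-1]
    # falls off the end here, so the function implicitly returns None


def find_extremes(pairs):
    """Return the (name, value) pairs with the smallest and largest value."""
    if not pairs:
        # raising keeps the return type consistent instead of handing back None
        raise ValueError('no data to compare')
    ordered = sorted(pairs, key=lambda pair: pair[1])
    return ordered[0], ordered[-1]


try:
    low, high = broken([])
except TypeError as exc:
    message = str(exc)
    print(message)

# made-up sample values
low, high = find_extremes([('B', 2.5), ('A', 1.0), ('C', 4.0)])
print(low, high)
```

With this shape, the caller either gets a two-item tuple or an exception that names the problem, and never a None that fails later during unpacking.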

How to set contents of a file that don't start with "\t" as keys, and those who start with "\t" and end with "\n" as values to the key before them?

I want to make a dictionary that looks like this: { 'The Dorms': {'Public Policy' : 50, 'Physics Building' : 100, 'The Commons' : 120}, ...}
This is the list :
['The Dorms\n', '\tPublic Policy, 50\n', '\tPhysics Building, 100\n', '\tThe Commons, 120\n', 'Public Policy\n', '\tPhysics Building, 50\n', '\tThe Commons, 60\n', 'Physics Building\n', '\tThe Commons, 30\n', '\tThe Quad, 70\n', 'The Commons\n', '\tThe Quad, 15\n', '\tBiology Building, 20\n', 'The Quad\n', '\tBiology Building, 35\n', '\tMath Psych Building, 50\n', 'Biology Building\n', '\tMath Psych Building, 75\n', '\tUniversity Center, 125\n', 'Math Psych Building\n', '\tThe Stairs by Sherman, 50\n', '\tUniversity Center, 35\n', 'University Center\n', '\tEngineering Building, 75\n', '\tThe Stairs by Sherman, 25\n', 'Engineering Building\n', '\tITE, 30\n', 'The Stairs by Sherman\n', '\tITE, 50\n', 'ITE']
This is my code:
def load_map(map_file_name):
    # map_list = []
    map_dict = {}
    map_file = open(map_file_name, "r")
    map_list = map_file.readlines()
    for map in map_file:
        map_content = map.strip("\n").split(",")
        map_list.append(map_content)
    for map in map_list:
        map_dict[map[0]] = map[1:]
    print(map_dict)

if __name__ == "__main__":
    map_file_name = input("What is the map file? ")
    load_map(map_file_name)
Since your file's content is apparently literal Python data, you should use ast.literal_eval to parse it rather than some ad-hoc method.
Then you can just loop around your values and process them:
import ast

def load_map(mapfile):
    with open(mapfile, encoding='utf-8') as f:
        data = ast.literal_eval(f.read())
    m = {}
    current_section = None
    for item in data:
        if not item.startswith('\t'):
            # a line without a leading tab starts a new section
            current_section = m[item.strip()] = {}
        else:
            k, v = item.split(',')
            current_section[k.strip()] = int(v.strip())
    print(m)
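If the file instead holds the raw tab-indented lines (the list in the question looks like the result of `readlines()`, so this seems plausible), the same loop can run over the lines directly. A minimal sketch under that assumption; `parse_map` is a hypothetical name, and it accepts any iterable of lines, including an open file object.

```python
def parse_map(lines):
    """Build {place: {neighbour: distance}} from tab-indented lines."""
    result = {}
    current = None
    for raw in lines:
        line = raw.rstrip('\n')
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith('\t'):
            if current is None:
                raise ValueError('indented line before any heading')
            # split at the last comma, in case a place name contains one
            name, _, distance = line.strip().rpartition(',')
            result[current][name.strip()] = int(distance)
        else:
            current = line.strip()
            result[current] = {}
    return result


sample = ['The Dorms\n', '\tPublic Policy, 50\n', '\tPhysics Building, 100\n', 'ITE']
parsed = parse_map(sample)
print(parsed)
```

A heading with no indented lines after it, such as the final `ITE` entry, simply maps to an empty dictionary.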

How to find all longest common substrings that exist in multiple documents?

I have many text documents that I want to compare to one another and remove all text that is exactly the same between them. This is to find boilerplate text that is consistent across documents so it can be removed for NLP.
The best way I figured to do this is to find Longest Common Sub-strings that exist or are mostly present in all the documents. However, doing this has been incredibly slow.
Here is an example of what I am trying to accomplish:
DocA:
Title: To Kill a Mocking Bird
Author: Harper Lee
Published: July 11, 1960
DocB:
Title: 1984
Author: George Orwell
Published: June 1949
DocC:
Title: The Great Gatsby
Author: F. Scott Fitzgerald
The output would show something like:
{
'Title': 3,
'Author': 3,
'Published': 2,
}
The results would then be used to strip out the commonalities between documents.
Here is some code I have tested in Python. It's incredibly slow with any significant number of permutations:
import itertools
from difflib import SequenceMatcher

import pandas as pd

# `files` is assumed to hold the document texts as strings
file_perms = list(itertools.permutations(files, 2))
results = {}
for p in file_perms:
    doc_a = p[0]
    doc_b = p[1]
    while True:
        seq_match = SequenceMatcher(a=doc_a, b=doc_b)
        match = seq_match.find_longest_match(0, len(doc_a), 0, len(doc_b))
        if (match.size >= 5):
            doc_a_start, doc_a_stop = match.a, match.a + match.size
            doc_b_start, doc_b_stop = match.b, match.b + match.size
            match_word = doc_a[doc_a_start:doc_a_stop]
            if match_word in results:
                results[match_word] += 1
            else:
                results[match_word] = 1
            doc_a = doc_a[:doc_a_start] + doc_a[doc_a_stop:]
            doc_b = doc_b[:doc_b_start] + doc_b[doc_b_stop:]
        else:
            break
df = pd.DataFrame(
    {
        'Value': [x for x in results.keys()],
        'Count': [x for x in results.values()]
    }
)
print(df)
Create a set of words from each document.
Build a counter that records, for every word, in how many documents it appears.
Iterate over every document; when you find a word that appears in 70%-90% of the documents, add it and the word after it as a tuple to a new counter.
Then repeat the same step with the two-word tuples, and so on.
from collections import Counter

# docs is assumed to be a list of document strings
one_word = Counter()
for doc in docs:
    word_list = doc.split(" ")
    word_set = set(word_list)
    for word in word_set:
        one_word[word]+=1

two_word = Counter()
threshold = len(docs)*0.7
for doc in docs:
    word_list = doc.split(" ")
    for i in range(len(word_list)-1):
        if one_word[word_list[i]]>threshold:
            key = (word_list[i], word_list[i+1])
            two_word[key]+=1
You can play with the threshold and continue as long as the counter is not empty.
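The counting step can be written once for any phrase length. The sketch below is a simplification of the idea above: it counts every n-word sequence directly instead of extending only the frequent ones, and it uses `>=` for the threshold. The name `common_ngrams` and the parameter `min_share` are hypothetical; the sample documents are the ones from the question.

```python
from collections import Counter


def common_ngrams(docs, n, min_share=0.7):
    """Return the n-word sequences found in at least min_share of the docs."""
    doc_freq = Counter()
    for doc in docs:
        words = doc.split()
        # a set, so that each document counts a given n-gram at most once
        grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        doc_freq.update(grams)
    threshold = len(docs) * min_share
    return {gram: count for gram, count in doc_freq.items() if count >= threshold}


docs = [
    'Title: To Kill a Mocking Bird Author: Harper Lee Published: July 11, 1960',
    'Title: 1984 Author: George Orwell Published: June 1949',
    'Title: The Great Gatsby Author: F. Scott Fitzgerald',
]
frequent = common_ngrams(docs, 1, min_share=0.6)
print(frequent)
```

Calling it with n = 2, 3, ... and stopping once the result is empty mirrors the "continue as long as the counter is not empty" step.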
The docs used below are the lyrics of the songs Believer, By the River of Babylon, I Could Stay Awake, and Rattlin Bog.
from collections import Counter
import os
import glob

TR = 1  # threshold
dir = r"D:\docs"
path = os.path.join(dir,"*.txt")
files = glob.glob(path)
one_word = {}
all_docs = {}
for file in files:
    one_word[file] = set()
    all_docs[file] = []
    with open(file) as doc:
        for row in doc:
            for word in row.split():
                one_word[file].add(word)
                all_docs[file].append(word)
# now one_word is a dict where the key is the file name and the value is the set of words in it
# all_docs is a dict where the key is the file name and the value is the complete doc stored in a list, word by word
common_Frase = Counter()
for key in one_word:
    for word in one_word[key]:
        common_Frase[word]+=1
# common_Frase contains a count of each word's appearances across all files (every file can add a word once)
two_word = {}
for key in all_docs:
    two_word[key] = set()
    doc = all_docs[key]
    for index in range(len(doc)-1):
        if common_Frase[doc[index]]>TR:
            val = (doc[index], doc[index+1])
            two_word[key].add(val)
for key in two_word:
    for word in two_word[key]:
        common_Frase[word]+=1
# now common_Frase also contains a count of all two-word phrases
three_word = {}
for key in all_docs:
    three_word[key] = set()
    doc = all_docs[key]
    for index in range(len(doc)-2):
        val2 = (doc[index], doc[index+1])
        if common_Frase[val2]>TR:
            val3 = (doc[index], doc[index+1], doc[index+2])
            three_word[key].add(val3)
for key in three_word:
    for word in three_word[key]:
        common_Frase[word]+=1
for k in common_Frase:
    if common_Frase[k]>1:
        print(k)
This is the output:
when like all Don't And one the my hear and feeling Then your of I'm in me The you away I never to be what a ever thing there from By down Now words that was ('all', 'the') ('And', 'the') ('the', 'words') ('By', 'the') ('and', 'the') ('in', 'the')

How to train a sense2vec model

The documentation of sense2vec mentions 3 primary files, the first of them being merge_text.py. I have tried several types of input (txt, csv, and a bzipped file), since merge_text.py tries to open files compressed with bzip2.
The file can be found at:
https://github.com/spacy-io/sense2vec/blob/master/bin/merge_text.py
What type of input format does this script require?
Further, could anyone please suggest how to train the model?
I extended and adjusted the code samples from sense2vec.
You go from this input text:
"As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money."
To this:
as|ADV far|ADV as|ADP saudi_arabia|ENT and|CCONJ its|ADJ motif|NOUN that|ADJ is|VERB very|ADV simple|ADJ also|ADV saudis|ENT are|VERB good|ADJ at|ADP money|NOUN and|CCONJ arithmetic|NOUN faced|VERB with|ADP painful_choice|NOUN of|ADP losing|VERB money|NOUN maintaining|VERB current_production|NOUN at|ADP us$|SYM 60|MONEY per|ADP barrel|NOUN or|CCONJ taking|VERB two_million|CARDINAL barrel|NOUN per|ADP day|NOUN off|ADP market|NOUN and|CCONJ losing|VERB much_more_money|NOUN it|PRON 's|VERB easy_choice|NOUN take|VERB path|NOUN that|ADJ is|VERB less|ADV painful|ADJ if|ADP there|ADV are|VERB secondary_reason|NOUN like|ADP hurting|VERB us|ENT tight_oil_producer|NOUN or|CCONJ hurting|VERB iran|ENT and|CCONJ russia|ENT 's|VERB great|ADJ but|CCONJ it|PRON 's|VERB really|ADV just|ADV about|ADP money|NOUN
Double line breaks are interpreted as separate documents.
URLs are recognized as such, stripped down to domain.tld, and marked as |URL.
Nouns (including nouns that are part of noun phrases) are lemmatized (so "motives" becomes "motif").
Words with POS tags such as DET (determiner) and PUNCT (punctuation) are dropped.
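As a standalone illustration of the first rule only (it is not part of the answer's code), here is a minimal sketch of splitting raw text into documents on blank lines. The function name `split_documents` and the sample string are assumptions made for the example.

```python
import re

double_linebreak_re = re.compile(r'\n{2,}')
whitespace_re = re.compile(r'\s+')


def split_documents(raw):
    """Split raw text on runs of two or more newlines, then collapse the
    remaining whitespace (including single line breaks) inside each document."""
    parts = double_linebreak_re.split(raw.strip())
    return [whitespace_re.sub(' ', part).strip() for part in parts]


raw = "First document,\nwrapped over two lines.\n\nSecond document."
documents = split_documents(raw)
print(documents)
```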
Here's the code. Let me know if you have questions.
I'll probably publish it on github.com/woltob soon.
import spacy
import re
# Note: the 'en' shortcut and the Span.merge() calls below are spaCy 1.x/2.x APIs
nlp = spacy.load('en')
nlp.matcher = None
LABELS = {
    'ENT': 'ENT',
    'PERSON': 'PERSON',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}
pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')
single_linebreak_re = re.compile('\n')
double_linebreak_re = re.compile('\n{2,}')
whitespace_re = re.compile(r'[ \t]+')
quote_re = re.compile(r'"|`|´')
def strip_meta(text):
    text = text.replace('per cent', 'percent')
    # decode HTML entities for angle brackets
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = pre_format_re.sub('', text)
    text = post_format_re.sub('', text)
    # protect double line breaks (document boundaries) before joining wrapped lines
    text = double_linebreak_re.sub('{2break}', text)
    text = single_linebreak_re.sub(' ', text)
    text = text.replace('{2break}', '\n')
    text = whitespace_re.sub(' ', text)
    text = quote_re.sub('', text)
    return text
def transform_doc(doc):
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    for np in doc.noun_chunks:
        while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
            np = np[1:]
        np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for sent in doc.sents:
        sentence = []
        if sent.text.strip():
            for w in sent:
                if w.is_space:
                    continue
                w_ = represent_word(w)
                if w_:
                    sentence.append(w_)
            strings.append(' '.join(sentence))
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''
def represent_word(word):
    if word.like_url:
        x = url_re.search(word.text.strip().lower())
        if x:
            return x.group(3)+'|URL'
        else:
            return word.text.lower().strip()+'|URL?'
    text = re.sub(r'\s', '_', word.text.strip().lower())
    tag = LABELS.get(word.ent_type_)
    # Dropping PUNCTUATION such as commas and DET like the
    if tag is None and word.pos_ not in ['PUNCT', 'DET']:
        tag = word.pos_
    elif tag is None:
        return None
    # if not word.pos_:
    #     tag = '?'
    return text + '|' + tag
corpus = '''
As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money.
'''
corpus_stripped = strip_meta(corpus)
doc = nlp(corpus_stripped)
corpus_ = []
for word in doc:
    # only lemmatize NOUN and PROPN
    if word.pos_ in ['NOUN', 'PROPN'] and len(word.text) > 3 and len(word.text) != len(word.lemma_):
        # Keep the original first character, take the rest from the lemma,
        # then re-attach the trailing whitespace if there was any.
        lemma_ = str(word.text[:1]+word.lemma_[1:]+word.text_with_ws[len(word.text):])
        # print(word.text, lemma_)
        corpus_.append(lemma_)
        # print(word.text, word.text[:len(word.lemma_)]+word.text_with_ws[len(word.text):])
    # All other words are added normally.
    else:
        corpus_.append(word.text_with_ws)
result = transform_doc(nlp(''.join(corpus_)))
sense2vec_filename = 'text.txt'
# write the tagged corpus to disk; the with-block closes the file automatically
with open(sense2vec_filename, 'w') as file:
    file.write(result)
print(result)
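Before training on the resulting file, it can help to sanity-check the `text|TAG` tokens. The following is a minimal sketch using only the standard library; the function name `parse_tagged_line` is an assumption made for the example, and the sample line is taken from the output shown earlier.

```python
from collections import Counter


def parse_tagged_line(line):
    """Split a 'text|TAG text|TAG ...' line into (text, tag) pairs.

    rpartition splits on the last '|', so a '|' inside the text is preserved.
    A token without any '|' yields an empty text and the whole token as tag.
    """
    pairs = []
    for token in line.split():
        text, _, tag = token.rpartition('|')
        pairs.append((text, tag))
    return pairs


line = "saudi_arabia|ENT and|CCONJ its|ADJ motif|NOUN us$|SYM 60|MONEY"
pairs = parse_tagged_line(line)
tag_counts = Counter(tag for _, tag in pairs)
print(pairs)
print(tag_counts.most_common())
```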
You could visualise your model using Gensim in Tensorboard using this approach:
https://github.com/ArdalanM/gensim2tensorboard
I'll also adjust this code to work with the sense2vec approach (e.g. the words become lowercase in the preprocessing step; just comment that out in the code).
Happy coding,
woltob
The input file should be bzipped JSON. To use a plain text file, just edit merge_text.py as follows. Note that the file is still opened with bz2.BZ2File, so it must still be bzip2-compressed:
def iter_comments(loc):
    with bz2.BZ2File(loc) as file_:
        for i, line in enumerate(file_):
            yield line.decode('utf-8', errors='ignore')
            # yield ujson.loads(line)['body']
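If the corpus is not compressed at all, a reader along the following lines might work. This is a sketch under assumptions, not part of merge_text.py: the function name `iter_texts`, the `json_key` parameter, and the choice to detect compression by the `.bz2` extension are all made up for the example, and it uses the standard `json` module rather than `ujson`.

```python
import bz2
import json


def iter_texts(loc, json_key=None):
    """Yield one text per non-empty line of `loc`.

    Files ending in '.bz2' are read through bz2.open, anything else through
    the built-in open. If `json_key` is given, each line is parsed as JSON
    and the value under that key is yielded; otherwise the raw line is used.
    """
    opener = bz2.open if str(loc).endswith('.bz2') else open
    with opener(loc, 'rt', encoding='utf-8', errors='ignore') as file_:
        for line in file_:
            line = line.strip()
            if not line:
                continue
            yield json.loads(line)[json_key] if json_key else line


# Hypothetical usage:
#   for text in iter_texts('corpus.txt'): ...
#   for text in iter_texts('comments.json.bz2', json_key='body'): ...
```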
