I want to display all three lists side by side, with the names associated with the values, in a table format. I am doing this manually right now and it takes a while for all 20 files I have to process. Thank you for your help!
maxpreandpost = [Pre1,Pre2,Pre3,Pre4,Pre5,Pre6,Pre7,Pre8,Pre9,Pre10,Post1,Post2,Post3,Post4,Post5,Post6,Post7,Post8,Post9,Post10]
for i in maxpreandpost:
    height = max(i.Z)
    print(height)
165.387
160.214
159.118
186.685
163.744
160.717
184.026
171.25099999999995
175.73
156.512
150.339
131.528
148.52100000000004
126.738
136.389
148.334
129.855
153.599
144.595
159.32299999999995
lenpreandpost = [Pre1,Pre2,Pre3,Pre4,Pre5,Pre6,Pre7,Pre8,Pre9,Pre10,Post1,Post2,Post3,Post4,Post5,Post6,Post7,Post8,Post9,Post10]
for i in lenpreandpost:
    duration = len(i.Z)
    print(duration)
690
543
292
271
293
147
209
355
230
293
395
256
349
255
335
255
231
243
315
267
dis = [Pre1,Pre2,Pre3,Pre4,Pre5,Pre6,Pre7,Pre8,Pre9,Pre10,Post1,Post2,Post3,Post4,Post5,Post6,Post7,Post8,Post9,Post10]
for i in dis:
    p1 = [max(i.X), max(i.Y)]
    p2 = [min(i.X), min(i.Y)]
    distance = math.sqrt(((p1[0]-p2[0])**2) + ((p1[1]-p2[1])**2))
    print(distance)
2219.0546989150585
2337.434842606099
1857.1832474809803
1450.0472277998394
1512.6539831504758
1058.5635689541748
1653.517987682021
1854.670452561212
1861.8190476064021
1775.672511965326
1872.275393720069
1814.9932559772114
1852.3299779009246
1875.2281201398403
1867.1599096301313
1708.250531327712
1793.8521780715407
1862.7949271803914
1872.843665022548
1800.2239125453254
Sure, append all the values to output lists and then add them as columns to a pandas DataFrame:
import math
import pandas as pd

heightmax = []
maxpreandpost = [Pre1,Pre2,Pre3,Pre4,Pre5,Pre6,Pre7,Pre8,Pre9,Pre10,Post1,Post2,Post3,Post4,Post5,Post6,Post7,Post8,Post9,Post10]
for i in maxpreandpost:
    height = max(i.Z)
    heightmax.append(height)

duration_pre_post = []
lenpreandpost = [Pre1,Pre2,Pre3,Pre4,Pre5,Pre6,Pre7,Pre8,Pre9,Pre10,Post1,Post2,Post3,Post4,Post5,Post6,Post7,Post8,Post9,Post10]
for i in lenpreandpost:
    duration = len(i.Z)
    duration_pre_post.append(duration)

dis_p1_p2 = []
dis = [Pre1,Pre2,Pre3,Pre4,Pre5,Pre6,Pre7,Pre8,Pre9,Pre10,Post1,Post2,Post3,Post4,Post5,Post6,Post7,Post8,Post9,Post10]
for i in dis:
    p1 = [max(i.X), max(i.Y)]
    p2 = [min(i.X), min(i.Y)]
    distance = math.sqrt(((p1[0]-p2[0])**2) + ((p1[1]-p2[1])**2))
    dis_p1_p2.append(distance)

df = pd.DataFrame()  # initialize empty dataframe
# Store each list as a column in the df.
df['HeightMax'] = heightmax
df['DurationPrePost'] = duration_pre_post
df['DistanceP1P2'] = dis_p1_p2

# if you want to write this out to a tabular file:
df.to_csv('./Desktop/myDf.csv', sep='\t', index=False)
The output of this would be something like:
HeightMax DurationPrePost DistanceP1P2
0 165.387 690 2219.0546989150585
1 160.214 543 2337.434842606099
2 159.118 292 1857.1832474809803
3 186.685 271 1450.0472277998394
4 163.744 293 1512.6539831504758
... #extends to end of lists
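Since the three loops walk over the same twenty objects, you can also build the table in a single pass and keep the file names next to their values. A minimal sketch, assuming each Pre*/Post* object exposes X, Y and Z sequences as in the question; the Name column and the math.hypot shortcut are just illustrative choices:
import math
import pandas as pd

objects = [Pre1, Pre2, Pre3, Pre4, Pre5, Pre6, Pre7, Pre8, Pre9, Pre10,
           Post1, Post2, Post3, Post4, Post5, Post6, Post7, Post8, Post9, Post10]
names = ['Pre1', 'Pre2', 'Pre3', 'Pre4', 'Pre5', 'Pre6', 'Pre7', 'Pre8', 'Pre9', 'Pre10',
         'Post1', 'Post2', 'Post3', 'Post4', 'Post5', 'Post6', 'Post7', 'Post8', 'Post9', 'Post10']

rows = []
for name, obj in zip(names, objects):
    # one row per recording: max height, number of samples, XY span
    rows.append({
        'Name': name,
        'HeightMax': max(obj.Z),
        'DurationPrePost': len(obj.Z),
        'DistanceP1P2': math.hypot(max(obj.X) - min(obj.X), max(obj.Y) - min(obj.Y)),
    })

df = pd.DataFrame(rows)
df.to_csv('./Desktop/myDf.csv', sep='\t', index=False)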
I am trying to use tabulate with the zip_longest function. So I have it like this:
from __future__ import print_function
from tabulate import tabulate
from itertools import zip_longest
import itertools
import locale
import operator
import re
verdi50 = "['INGBNL2A, VAT number: NL851703884B01 i\nTel, +31 (0}1 80 61 88 \n\nrut ard wegetables\n\x0c']"
fruit_words = ['Appels', 'Ananas', 'Peen Waspeen',
               'Tomaten Cherry', 'Sinaasappels',
               'Watermeloenen', 'Rettich', 'Peren', 'Peen', 'Mandarijnen', 'Meloenen', 'Grapefruit']

def total_amount_fruit_regex(format_=re.escape):
    return r"(\d*(?:\.\d+)*)\s*~?=?\s*(" + '|'.join(
        format_(word) for word in fruit_words) + ')'

def total_fruit_per_sort():
    number_found = re.findall(total_amount_fruit_regex(), verdi50)
    fruit_dict = {}
    for n, f in number_found:
        fruit_dict[f] = fruit_dict.get(f, 0) + int(n)
    result = '\n'.join(f'{key}: {val}' for key, val in fruit_dict.items())
    return result

def fruit_list(format_=re.escape):
    return "|".join(format_(word) for word in fruit_words)

def findallfruit(regex):
    return re.findall(regex, verdi50)

def verdi_total_number_fruit_regex():
    return rf"(\d*(?:\.\d+)*)\s*\W+(?:{fruit_list()})"

def show_extracted_data_from_file():
    regexes = [
        verdi_total_number_fruit_regex(),
    ]
    matches = [findallfruit(regex) for regex in regexes]
    fruit_list = total_fruit_per_sort().split("\n")
    return "\n".join(" \t ".join(items) for items in zip_longest(tabulate(*matches, fruit_list, headers=['header','header2'], fillvalue='', )))
print(show_extracted_data_from_file())
But then I get this error:
TypeError at /controlepunt140
tabulate() got multiple values for argument 'headers'
So how to improve this?
If you remove the tabulate call, the output looks like this:
16 Watermeloenen: 466
360 Appels: 688
6 Sinaasappels: 803
75
9
688
22
80
160
320
160
61
The expected output, with headers, is:
header1 header2
------- -------
16 Watermeloenen: 466
360 Appels: 688
6 Sinaasappels: 803
75
9
688
22
80
160
320
160
61
Just like tabulate would normally render it.
You should be passing a single table to the tabulate() function; passing it multiple lists is what produces the TypeError: tabulate() got multiple values for argument 'headers' you are seeing.
Update your return statement:
def show_extracted_data_from_file():
    regexes = [
        verdi_total_number_fruit_regex(),
    ]
    matches = [findallfruit(regex) for regex in regexes]
    fruit_list = total_fruit_per_sort().split("\n")
    return tabulate(zip_longest(*matches, fruit_list), headers=['header1', 'header2'])
Output:
header1 header2
--------- ------------------
16 Watermeloenen: 466
360 Appels: 688
6 Sinaasappels: 803
75
9
688
22
80
160
320
160
61
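As a side note on the padding: zip_longest fills the shorter column with None, which tabulate already renders as an empty cell; passing fillvalue='' (as in the original attempt) makes that explicit. A small self-contained illustration using a few of the values from the output above:
from itertools import zip_longest
from tabulate import tabulate

numbers = ['16', '360', '6', '75', '9']
labels = ['Watermeloenen: 466', 'Appels: 688', 'Sinaasappels: 803']

# zip_longest pads the shorter column; fillvalue='' keeps the padded cells empty
print(tabulate(zip_longest(numbers, labels, fillvalue=''),
               headers=['header1', 'header2']))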
I want to use stanza for tokenizing, POS tagging and parsing some text I have, but it keeps giving me this error. I've tried changing the way I call it, but nothing happens. Any ideas?
My code (here I iterate through a list of lists of text and apply stanza to each one):
from time import time

t = time()
data_stanza = []
for text in data:
    stz = apply_stanza(text[0])
    data_stanza.append(stz)

print('Time to run: {} mins'.format(round((time() - t) / 60, 2)))
This is the function I use to apply_stanza to each text:
import stanza

nlp = stanza.Pipeline('pt')

def apply_stanza(text):
    doc = nlp(text)
    All = []
    for sent in doc.sentences:
        for word in sent.words:
            All.append((word.id, word.text, word.lemma, word.upos, word.feats, word.head, word.deprel))
    return All
The error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-17-7ac303eec8e8> in <module>
3 data_staza = []
4 for text in data:
----> 5 stz = apply_stanza(text[0])
6 data_stanza.append(stz)
7
<ipython-input-16-364c3ac30f32> in apply_stanza(text)
2
3 def apply_stanza(text):
----> 4 doc = nlp(text)
5 All = []
6 for sent in doc.sentences:
~\anaconda3\lib\site-packages\stanza\pipeline\core.py in __call__(self, doc)
174 assert any([isinstance(doc, str), isinstance(doc, list),
175 isinstance(doc, Document)]), 'input should be either str, list or Document'
--> 176 doc = self.process(doc)
177 return doc
178
~\anaconda3\lib\site-packages\stanza\pipeline\core.py in process(self, doc)
168 for processor_name in PIPELINE_NAMES:
169 if self.processors.get(processor_name):
--> 170 doc = self.processors[processor_name].process(doc)
171 return doc
172
~\anaconda3\lib\site-packages\stanza\pipeline\mwt_processor.py in process(self, document)
31 preds = []
32 for i, b in enumerate(batch):
---> 33 preds += self.trainer.predict(b)
34
35 if self.config.get('ensemble_dict', False):
~\anaconda3\lib\site-packages\stanza\models\mwt\trainer.py in predict(self, batch, unsort)
77 self.model.eval()
78 batch_size = src.size(0)
---> 79 preds, _ = self.model.predict(src, src_mask, self.args['beam_size'])
80 pred_seqs = [self.vocab.unmap(ids) for ids in preds] # unmap to tokens
81 pred_seqs = utils.prune_decoded_seqs(pred_seqs)
~\anaconda3\lib\site-packages\stanza\models\common\seq2seq_model.py in predict(self, src, src_mask, pos, beam_size)
259 done = []
260 for b in range(batch_size):
--> 261 is_done = beam[b].advance(log_probs.data[b])
262 if is_done:
263 done += [b]
~\anaconda3\lib\site-packages\stanza\models\common\beam.py in advance(self, wordLk, copy_indices)
82 # bestScoresId is flattened beam x word array, so calculate which
83 # word and beam each score came from
---> 84 prevK = bestScoresId / numWords
85 self.prevKs.append(prevK)
86 self.nextYs.append(bestScoresId - prevK * numWords)
RuntimeError: Integer division of tensors using div or / is no longer supported, and in a future release div will perform
true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.
ATT: It turns out, after all, that it was an error with the mwt module of the stanza pipeline, so I just specified not to use it.
Use // for division instead of /.
Try to edit your code as follows:
print('Time to run: {} mins'.format(round((time() - t) // 60, 2)))
Floor division (//) rounds the result down to the nearest whole integer.
Alternatively, use torch.true_divide(dividend, divisor) or numpy.true_divide(dividend, divisor) instead.
For example, 3/4 can be written as torch.true_divide(3, 4).
https://pytorch.org/docs/stable/generated/torch.true_divide.html
https://numpy.org/doc/stable/reference/generated/numpy.true_divide.html
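For context, the failing line in stanza's beam.py (prevK = bestScoresId / numWords) divides an integer tensor, which newer PyTorch versions reject. A minimal sketch of the two behaviours, assuming a reasonably recent torch build (the tensor values are made up for illustration):
import torch

bestScoresId = torch.tensor([7, 12, 23])   # illustrative flattened beam*word indices
numWords = 5

# true division keeps the fractional part (what / now means):
print(torch.true_divide(bestScoresId, numWords))                 # tensor([1.4000, 2.4000, 4.6000])

# floor division reproduces the old integer behaviour the beam search expects:
print(torch.div(bestScoresId, numWords, rounding_mode='floor'))  # tensor([1, 2, 4])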
I am trying to find the mean of an array created from data in a CSV file using Python. The array only includes data within a certain range of values, so it does not contain every value in the CSV column. My current code that creates the array is shown below. Several arrays are created, but I only need to find the mean of the array called "T07s". I consistently get the error "cannot perform reduce with flexible type" when calling np.mean(T07s).
import csv
import numpy as np

class dataPoint:
    def __init__(self, V, T07, T19, T27, Time):
        self.V = V
        self.T07 = T07
        self.T19 = T19
        self.T27 = T27
        self.Time = Time

dataPoints = []
with open("data_final.csv") as csvfile:
    reader = csv.reader(csvfile)
    next(reader)
    for row in reader:
        if 229 <= float(row[2]) <= 231:
            temp = dataPoint(row[1], row[12], row[24], row[32], row[0].split(" ")[1])
            dataPoints.append(temp)

T07s = np.array([x.T07 for x in dataPoints])
The data included in T07s is shown below:
for x in T07s:
    print(x)
37.2
539
435.6
717.4
587
757.9
861.8
1024.2
325
117.9
136.3
167.8
809
405.3
405.1
112.7
1317.1
1731.8
1080.2
1208.6
1212.6
1363.8
1715.3
2376.4
2563.9
2998.4
2934.7
2862.4
390.8
2332.2
2121
2237.6
2334.1
2082.2
1892.1
1888.8
1960.6
1329.1
1657.2
2042.4
1417.5
977.3
1442.8
561.2
500.3
413.3
324.1
693.7
750
865.7
434.2
635.2
815.7
171.4
829.3
815.3
774.8
1411.6
1685.1
1345.1
1193.2
1674.9
1636.4
1389.8
753.3
1102.8
908.3
1223.2
1199.4
1040.7
1040.9
824.7
620
795.7
810.4
378.8
643.2
441.8
682.8
417.8
515.6
2354.7
1938.8
1512.4
1933.5
1739.8
2281.9
1997.5
2833.4
182.8
202.4
217.3
234.2
741.9
A simpler approach is to do the filtering and the mean directly in pandas:
import pandas as pd

data = pd.read_csv('data_final.csv')
data_filtered = data[(data.iloc[:, 2] >= 229) & (data.iloc[:, 2] <= 231)]
print(data_filtered['T07'].mean())
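The error itself comes from the array holding strings (the csv module returns every field as text), so NumPy cannot reduce it with mean. If you want to keep the csv-based approach, casting the values to float when building the array should also work; a minimal sketch assuming the dataPoints list from the question:
import numpy as np

# x.T07 is a string read from the CSV; cast to float so np.mean can reduce the array
T07s = np.array([float(x.T07) for x in dataPoints])
print(T07s.mean())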
I have a data set as given below:
Timestamp = 22-05-2019 08:40 :Light = 64.00 :Temp_Soil = 20.5625 :Temp_Air = 23.1875 :Soil_Moisture_1 = 756 :Soil_Moisture_2 = 780 :Soil_Moisture_3 = 1002
Timestamp = 22-05-2019 08:42 :Light = 64.00 :Temp_Soil = 20.5625 :Temp_Air = 23.125 :Soil_Moisture_1 = 755 :Soil_Moisture_2 = 782 :Soil_Moisture_3 = 1002
And I want to reshape (rearrange) the dataset so that the header columns are [Timestamp, Light, Temp_Soil, Temp_Air, Soil_Moisture_1, Soil_Moisture_2, Soil_Moisture_3] and their values form the row entries, in Python.
One possible solution:
Instead of a "true" input file, I used a string:
inp="""Timestamp = 22-05-2019 08:40 :Light = 64.00 :TempSoil = 20.5625 :TempAir = 23.1875 :SoilMoist1 = 756 :SoilMoist2 = 780 :SoilMoist3 = 1002
Timestamp = 22-05-2019 08:42 :Light = 64.00 :TempSoil = 20.5625 :TempAir = 23.125 :SoilMoist1 = 755 :SoilMoist2 = 782 :SoilMoist3 = 1002"""
import re
import pandas as pd
from io import StringIO  # pd.compat.StringIO is not available in newer pandas versions

buf = StringIO(inp)
To avoid "folding" of output lines, I shortened field names.
Then let's create the result DataFrame and a list of "rows" to append to it.
For now, both of them are empty.
df = pd.DataFrame(columns=['Timestamp', 'Light', 'TempSoil', 'TempAir',
                           'SoilMoist1', 'SoilMoist2', 'SoilMoist3'])
src = []
Below is a loop processing input rows:
while True:
    line = buf.readline()
    if not line:                          # EOF
        break
    lst = re.split(r' :', line.rstrip())  # Field list
    if len(lst) < 2:                      # Skip empty source lines
        continue
    dct = {}                              # Source "row" (dictionary)
    for elem in lst:                      # Process fields
        k, v = re.split(r' = ', elem)
        dct[k] = v                        # Add field : value to "row"
    src.append(dct)
And the last step is to append rows from src to df :
df = df.append(src, ignore_index=True, sort=False)
When you print(df), for my test data, you will get:
Timestamp Light TempSoil TempAir SoilMoist1 SoilMoist2 SoilMoist3
0 22-05-2019 08:40 64.00 20.5625 23.1875 756 780 1002
1 22-05-2019 08:42 64.00 20.5625 23.125 755 782 1002
For now all columns are of string type, so you can change the required
columns to either float or int:
df.Light = pd.to_numeric(df.Light)
df.TempSoil = pd.to_numeric(df.TempSoil)
df.TempAir = pd.to_numeric(df.TempAir)
df.SoilMoist1 = pd.to_numeric(df.SoilMoist1)
df.SoilMoist2 = pd.to_numeric(df.SoilMoist2)
df.SoilMoist3 = pd.to_numeric(df.SoilMoist3)
Note that the to_numeric() function is clever enough to recognize the type to convert to, so the first 3 columns changed their type to float64 and the next 3 to int64.
You can check this by executing df.info().
One more possible conversion is to change the Timestamp column to datetime type:
df.Timestamp = pd.to_datetime(df.Timestamp)
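Note that DataFrame.append was removed in pandas 2.0, so the append step above may fail on a recent install. A drop-in alternative is to build the frame directly from the list of row dictionaries; below is a minimal sketch under the same assumptions (same inp string and shortened field names):
import re
import pandas as pd
from io import StringIO

inp = """Timestamp = 22-05-2019 08:40 :Light = 64.00 :TempSoil = 20.5625 :TempAir = 23.1875 :SoilMoist1 = 756 :SoilMoist2 = 780 :SoilMoist3 = 1002
Timestamp = 22-05-2019 08:42 :Light = 64.00 :TempSoil = 20.5625 :TempAir = 23.125 :SoilMoist1 = 755 :SoilMoist2 = 782 :SoilMoist3 = 1002"""

rows = []
for line in StringIO(inp):
    fields = re.split(r' :', line.rstrip())
    if len(fields) < 2:                       # skip empty lines
        continue
    # each field looks like "Name = value"; split once on " = "
    rows.append(dict(re.split(r' = ', f, maxsplit=1) for f in fields))

df = pd.DataFrame(rows)                       # one column per field name
for col in ['Light', 'TempSoil', 'TempAir', 'SoilMoist1', 'SoilMoist2', 'SoilMoist3']:
    df[col] = pd.to_numeric(df[col])          # float64 / int64 depending on the values
df['Timestamp'] = pd.to_datetime(df['Timestamp'], dayfirst=True)
print(df)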
I have a program in Python 3 that reads and compares files (with the same name) in two folders, "gold" and "prediction", but it generates this error. My files are in UTF-8 format, so the character that triggers the error is 0xE2 0x80 (in ANSI it appears as â€):
Traceback (most recent call last):
File "C:\scienceie2017_train\test.py", line 215, in <module>
calculateMeasures(folder_gold, folder_pred, remove_anno)
File "C:\scienceie2017_train\test.py", line 34, in calculateMeasures
res_full_pred, res_pred, spans_pred, rels_pred = normaliseAnnotations(f_pred, remove_anno)
File "C:\scienceie2017_train\test.py", line 132, in normaliseAnnotations
for l in file_anno:
File "C:\Users\chedi\Anaconda3\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 915-916: invalid continuation byte
the code is:
#!/usr/bin/python
# by Mattew Peters, who spotted that sklearn does macro averaging not
# micro averaging correctly and changed it

import os
from sklearn.metrics import precision_recall_fscore_support
import sys

def calculateMeasures(folder_gold="data/dev/", folder_pred="data_pred/dev/", remove_anno=""):
    '''
    Calculate P, R, F1, Macro F
    :param folder_gold: folder containing gold standard .ann files
    :param folder_pred: folder containing prediction .ann files
    :param remove_anno: if set if "rel", relations will be ignored. Use this setting to only evaluate
    keyphrase boundary recognition and keyphrase classification. If set to "types", only keyphrase boundary recognition is evaluated.
    Note that for the later, false positive
    :return:
    '''
    flist_gold = os.listdir(folder_gold)
    res_all_gold = []
    res_all_pred = []
    targets = []
    for f in flist_gold:
        # ignoring non-.ann files, should there be any
        if not str(f).endswith(".ann"):
            continue
        f_gold = open(os.path.join(folder_gold, f), "r", encoding="utf")
        try:
            f_pred = open(os.path.join(folder_pred, f), "r", encoding="utf8")
            res_full_pred, res_pred, spans_pred, rels_pred = normaliseAnnotations(f_pred, remove_anno)
        except IOError:
            print(f + " file missing in " + folder_pred + ". Assuming no predictions are available for this file.")
            res_full_pred, res_pred, spans_pred, rels_pred = [], [], [], []
        res_full_gold, res_gold, spans_gold, rels_gold = normaliseAnnotations(f_gold, remove_anno)
        spans_all = set(spans_gold + spans_pred)
        for i, r in enumerate(spans_all):
            if r in spans_gold:
                target = res_gold[spans_gold.index(r)].split(" ")[0]
                res_all_gold.append(target)
                if not target in targets:
                    targets.append(target)
            else:
                res_all_gold.append("NONE")
            if r in spans_pred:
                target_pred = res_pred[spans_pred.index(r)].split(" ")[0]
                res_all_pred.append(target_pred)
            else:
                res_all_pred.append("NONE")

    # y_true, y_pred, labels, targets
    prec, recall, f1, support = precision_recall_fscore_support(res_all_gold, res_all_pred, labels=targets, average=None)
    metrics = {}
    for k, target in enumerate(targets):
        metrics[target] = {
            'precision': prec[k],
            'recall': recall[k],
            'f1-score': f1[k],
            'support': support[k]
        }

    # now micro-averaged
    if remove_anno != 'types':
        prec, recall, f1, s = precision_recall_fscore_support(res_all_gold, res_all_pred, labels=targets, average='micro')
        metrics['overall'] = {
            'precision': prec,
            'recall': recall,
            'f1-score': f1,
            'support': sum(support)
        }
    else:
        # just binary classification, nothing to average
        metrics['overall'] = metrics['KEYPHRASE-NOTYPES']

    print_report(metrics, targets)
    return metrics
def print_report(metrics, targets, digits=2):
    def _get_line(results, target, columns):
        line = [target]
        for column in columns[:-1]:
            line.append("{0:0.{1}f}".format(results[column], digits))
        line.append("%s" % results[columns[-1]])
        return line

    columns = ['precision', 'recall', 'f1-score', 'support']
    fmt = '%11s' + '%9s' * 4 + '\n'
    report = [fmt % tuple([''] + columns)]
    report.append('\n')
    for target in targets:
        results = metrics[target]
        line = _get_line(results, target, columns)
        report.append(fmt % tuple(line))
    report.append('\n')

    # overall
    line = _get_line(metrics['overall'], 'avg / total', columns)
    report.append(fmt % tuple(line))
    report.append('\n')
    print(''.join(report))
def normaliseAnnotations(file_anno, remove_anno):
    '''
    Parse annotations from the annotation files: remove relations (if requested), convert rel IDs to entity spans
    :param file_anno:
    :param remove_anno:
    :return:
    '''
    res_full_anno = []
    res_anno = []
    spans_anno = []
    rels_anno = []
    for l in file_anno:
        print(l)
        print(l.strip('\n'))
        r_g = l.strip('\n').split("\t")
        print(r_g)
        print(len(r_g))
        r_g_offs = r_g[1].split()
        print(r_g_offs)
        if remove_anno != "" and r_g_offs[0].endswith("-of"):
            continue
        res_full_anno.append(l.strip())
        if r_g_offs[0].endswith("-of"):
            arg1 = r_g_offs[1].replace("Arg1:", "")
            arg2 = r_g_offs[2].replace("Arg2:", "")
            for l in res_full_anno:
                r_g_tmp = l.strip().split("\t")
                if r_g_tmp[0] == arg1:
                    ent1 = r_g_tmp[1].replace(" ", "_")
                if r_g_tmp[0] == arg2:
                    ent2 = r_g_tmp[1].replace(" ", "_")
            spans_anno.append(" ".join([ent1, ent2]))
            res_anno.append(" ".join([r_g_offs[0], ent1, ent2]))
            rels_anno.append(" ".join([r_g_offs[0], ent1, ent2]))
        else:
            spans_anno.append(" ".join([r_g_offs[1], r_g_offs[2]]))
            keytype = r_g[1]
            if remove_anno == "types":
                keytype = "KEYPHRASE-NOTYPES"
            res_anno.append(keytype)

    for r in rels_anno:
        r_offs = r.split(" ")
        # reorder hyponyms to start with smallest index
        # 1, 2
        if r_offs[0] == "Synonym-of" and r_offs[2].split("_")[1] < r_offs[1].split("_")[1]:
            r = " ".join([r_offs[0], r_offs[2], r_offs[1]])
        if r_offs[0] == "Synonym-of":
            for r2 in rels_anno:
                r2_offs = r2.split(" ")
                if r2_offs[0] == "Hyponym-of" and r_offs[1] == r2_offs[1]:
                    r_new = " ".join([r2_offs[0], r_offs[2], r2_offs[2]])
                    rels_anno[rels_anno.index(r2)] = r_new
                if r2_offs[0] == "Hyponym-of" and r_offs[1] == r2_offs[2]:
                    r_new = " ".join([r2_offs[0], r2_offs[1], r_offs[2]])
                    rels_anno[rels_anno.index(r2)] = r_new

    rels_anno = list(set(rels_anno))

    res_full_anno_new = []
    res_anno_new = []
    spans_anno_new = []
    for r in res_full_anno:
        r_g = r.strip().split("\t")
        if r_g[0].startswith("R") or r_g[0] == "*":
            continue
        ind = res_full_anno.index(r)
        res_full_anno_new.append(r)
        res_anno_new.append(res_anno[ind])
        spans_anno_new.append(spans_anno[ind])

    for r in rels_anno:
        res_full_anno_new.append("R\t" + r)
        res_anno_new.append(r)
        spans_anno_new.append(" ".join([r.split(" ")[1], r.split(" ")[2]]))

    return res_full_anno_new, res_anno_new, spans_anno_new, rels_anno
if __name__ == '__main__':
    folder_gold = "data/dev/"
    folder_pred = "data_pred/dev/"
    remove_anno = ""  # "", "rel" or "types"
    if len(sys.argv) >= 2:
        folder_gold = sys.argv[1]
    if len(sys.argv) >= 3:
        folder_pred = sys.argv[2]
    if len(sys.argv) == 4:
        remove_anno = sys.argv[3]
    calculateMeasures(folder_gold, folder_pred, remove_anno)
Example of a prediction file:
T1 Task 4 20 particular phase
T2 Task 4 26 particular phase field
T3 Task 15 26 phase field
T4 Task 15 32 phase field model
T5 Task 21 32 field model
T6 Task 93 118 dimensional thermal phase
T7 Task 105 118 thermal phase
T8 Task 105 124 thermal phase field
T9 Task 15 26 phase field
T10 Task 15 32 phase field model
T11 Task 21 32 field model
T12 Task 146 179 dimensional thermal-solutal phase
T13 Task 158 179 thermal-solutal phase
T14 Task 158 185 thermal-solutal phase field
T15 Task 15 26 phase field
T16 Task 15 32 phase field model
T17 Task 21 32 field model
T18 Task 219 235 physical problem
T19 Task 300 330 natural relaxational phenomena
T20 Task 308 330 relaxational phenomena
T21 Task 340 354 resulting PDEs
T22 Task 362 374 Allen–Cahn
T23 Task 383 403 Carn–Hilliard type
T24 Task 445 461 time derivatives
T25 Task 509 532 variational derivatives
T26 Task 541 554 functional â€
T27 Task 570 581 free energy
T28 Task 570 592 free energy functional
T29 Task 575 592 energy functional
T30 Task 570 581 free energy
T31 Task 702 717 domain boundary
T32 Task 780 797 difficult aspects
T33 Task 817 836 relaxational aspect
T34 Task 874 898 stable numerical schemes
T35 Task 881 898 numerical schemes
Example of a gold file:
T1 Material 2 20 fluctuating vacuum
T2 Process 45 59 quantum fields
T3 Task 45 59 quantum fields
T4 Process 74 92 free Maxwell field
T5 Process 135 151 Fermionic fields
T6 Process 195 222 undergo vacuum fluctuations
T7 Process 257 272 Casimir effects
T8 Task 396 411 nuclear physics
T9 Task 434 464 “MIT bag model” of the nucleon
T10 Task 518 577 a collection of fermionic fields describing confined quarks
T11 Process 732 804 the bag boundary condition modifies the vacuum fluctuations of the field
T12 Task 983 998 nuclear physics
T13 Material 1063 1080 bag-model nucleon
T14 Material 507 514 nucleon
T15 Task 843 856 Casimir force
T16 Process 289 300 such fields
"–".encode("cp1256").decode("utf8") = –, an en dash.
The files you are opening appear to be encoded in UTF-8, and you are not specifying the encoding that should be used when open()ing them (just add encoding="utf8" to the parameters).
Python will use the operating system's default character encoding and you appear to be using Windows, where it's always something other than UTF-8. Take a look at
import locale
locale.getpreferredencoding()
to find out what encoding Python will use by default when reading and writing files.
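For example, you can check the default and then open the annotation files with an explicit UTF-8 encoding; a minimal sketch (the file name here is hypothetical):
import locale

# On Windows this typically prints something like 'cp1252' or 'cp1256',
# which is why the UTF-8 bytes 0xE2 0x80 show up there as "â€".
print(locale.getpreferredencoding())

# Hypothetical .ann file, opened with an explicit encoding instead of the OS default:
with open("example.ann", "r", encoding="utf-8") as f:
    text = f.read()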