So far I have written this code:
class device:
    naam_device = ''
    stroomverbruik = 0

aantal_devices = int(input("geef het aantal devices op: "))
i = aantal_devices
x = 0
voorwerp = {}
while i > 0:
    voorwerp[x] = device()
    i = i - 1
    x = x + 1

i = 0
while i < aantal_devices:
    voorwerp[i].naam_device = input("Wat is device %d voor een device: " % (i+1))
    # TODO: invalid input still has to be handled here, e.g. when the user enters a string or char instead of a float
    voorwerp[i].stroomverbruik = float(input("hoeveel ampére is uw device?: "))
    i += 1

i = 0
totaal = 0.0
## test while print
while i < aantal_devices:
    print(voorwerp[i].naam_device, voorwerp[i].stroomverbruik)
    # this total still has to be written to a file so that a grand total can be computed after 256 entries
    totaal = totaal + voorwerp[i].stroomverbruik
    i = i + 1
print("totaal ampére = ", totaal)

aantal_koelbox = int(input("Hoeveel koelboxen neemt u mee?: "))
if aantal_koelbox <= 2 or aantal_koelbox > aantal_devices:
    if aantal_koelbox > aantal_devices:
        toestaan = input("Deelt u de overige koelboxen met mede-deelnemers (ja/nee)?: ")
        if toestaan == "ja":
            print("Uw gegevens worden opgeslagen! u bent succesvol geregistreerd.")
        if toestaan == "nee":
            print("Uw gegevens worden niet opgeslagen! u voldoet niet aan de eisen.")
else:
    print("Uw gegevens worden niet opgeslagen! u voldoet niet aan de eisen.")
Now I want to write the value of totaal to a file, and later, once I have saved 256 of these inputs, I want to write another program that reads those 256 values, sums them, and divides that number by 14. If someone could help me on the right track with writing the values and reading them back later, I can try to figure out the last part myself.
But I've been trying for two days now and still haven't found a good solution for the writing and reading.
The tutorial covers this very nicely, as MattDMo points out. But I'll summarize the relevant part here.
The key idea is to open a file, then write each totaal in some format, then make sure the file gets closed at the end.
What format? Well, that depends on your data. Sometimes you have fixed-shape records, which you can store as CSV rows. Sometimes you have arbitrary Python objects, which you can store as pickles. But in this case, you can get away with using the simplest format of all: a line of text. As long as your data are single values that can be unambiguously converted to text and back, and don't have any newline or other special characters in them, this works. So:
with open('thefile.txt', 'w') as f:
    while i < aantal_devices:
        print(voorwerp[i].naam_device, voorwerp[i].stroomverbruik)
        # this total still has to be written to a file so that a grand total can be computed after 256 entries
        totaal = totaal + voorwerp[i].stroomverbruik
        f.write('{}\n'.format(totaal))
        i = i + 1
That's it. The open opens the file, creating it if necessary. The with makes sure it gets closed at the end of the block. The write writes a line consisting of whatever's in totaal, formatted as a string, followed by a newline character.
To read it back later is even simpler:
with open('thefile.txt') as f:
    for line in f:
        totaal = float(line)  # totaal is a float sum, so read it back with float(), not int()
        # now do stuff with totaal
Use serialization to store the data in the files and then de-serialize it back into its original state for computation.
By serializing the data you can restore it to its original state (value and type, i.e. 1234 as an int and not as a string).
Off you go to the docs :) : https://docs.python.org/2/library/pickle.html
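For example, a minimal sketch of the pickle approach (the file name and the list of totals are placeholders for illustration):

import pickle

totals = [12.5, 3.0, 7.25]  # hypothetical totaal values collected so far

# Serialize: the values keep their type (float), not just their text representation
with open('totals.pkl', 'wb') as f:
    pickle.dump(totals, f)

# De-serialize back into the original objects
with open('totals.pkl', 'rb') as f:
    restored = pickle.load(f)

print(sum(restored) / 14)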
P.S. For people to be able to help you, the code needs to be readable. That way you can get a better answer in the future.
You can write them to a file like so:
import os

with open(os.path.join(output_dir, filename), 'w') as output_file:
    output_file.write("%s" % totaal)
And then sum them like this:
total = 0
for input_file in os.listdir(output_dir):
    path = os.path.join(output_dir, input_file)  # os.path.isfile() needs the full path, not just the name
    if os.path.isfile(path):
        with open(path, 'r') as infile:
            total += float(infile.read())

print(total / 14)
However, I would consider whether you really need to write each totaal to a separate file. There's probably a better way to solve your problem, but I think this should do what you asked for.
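For instance, a minimal sketch of that single-file idea (the file name is a placeholder; the 256 entries and the division by 14 come from the question):

# In the registration program: append one totaal per run to a shared file
with open('totalen.txt', 'a') as f:
    f.write('{}\n'.format(totaal))

# In a separate program, run once 256 entries have been collected:
with open('totalen.txt') as f:
    values = [float(line) for line in f if line.strip()]

print(sum(values) / 14)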
P.S. I would try to read your code and make a more educated attempt to help you, but I don't know Dutch!
I'm trying to lemmatize chat records in a dataframe using spaCy. My code is:
nlp = spacy.load("es_core_news_sm")
df["text_lemma"] = df["text"].apply(lambda row: " ".join([w.lemma_ for w in nlp(row)]))
I have approximately 600,000 rows and the apply takes more than two hours to execute. Is there a faster package/way to lemmatize? (I need a solution that works for Spanish.)
I have only tried the spaCy package so far.
The slow-down in processing speed is coming from the multiple calls to the spaCy pipeline via nlp(). The faster way to process large texts is to instead process them as a stream using the nlp.pipe() command. When I tested this on 5000 rows of dummy text, it offered a ~3.874x improvement in speed (~9.759sec vs ~2.519sec) over the original method. There are ways to improve this further if required, see this checklist for spaCy optimisation I made.
Solution
# Assume dataframe (df) already contains column "text" with text
# Load spaCy pipeline
nlp = spacy.load("es_core_news_sm")

# Process large text as a stream via `nlp.pipe()` and iterate over the results, extracting lemmas
lemma_text_list = []
for doc in nlp.pipe(df["text"]):
    lemma_text_list.append(" ".join(token.lemma_ for token in doc))
df["text_lemma"] = lemma_text_list
Full code for testing timings
import spacy
import pandas as pd
import time

# Random Spanish sentences
rand_es_sentences = [
    "Tus drafts influirán en la puntuación de las cartas según tu número de puntos DCI.",
    "Información facilitada por la División de Conferencias de la OMI en los cuestionarios enviados por la DCI.",
    "Oleg me ha dicho que tenías que decirme algo.",
    "Era como tú, muy buena con los ordenadores.",
    "Mas David tomó la fortaleza de Sion, que es la ciudad de David."]

# Duplicate sentences specified number of times
es_text = [sent for i in range(1000) for sent in rand_es_sentences]
# Create data-frame
df = pd.DataFrame({"text": es_text})
# Load spaCy pipeline
nlp = spacy.load("es_core_news_sm")

# Original method (very slow due to multiple calls to `nlp()`)
t0 = time.time()
df["text_lemma_1"] = df["text"].apply(lambda row: " ".join([w.lemma_ for w in nlp(row)]))
t1 = time.time()
print("Total time: {}".format(t1-t0))  # ~9.759 seconds on 5000 rows

# Faster method processing rows as stream via `nlp.pipe()`
t0 = time.time()
lemma_text_list = []
for doc in nlp.pipe(df["text"]):
    lemma_text_list.append(" ".join(token.lemma_ for token in doc))
df["text_lemma_2"] = lemma_text_list
t1 = time.time()
print("Total time: {}".format(t1-t0))  # ~2.519 seconds on 5000 rows
I am new to Python and I wonder if there is an efficient way to find the original sentence in a text file, knowing only the offset of a word. Suppose that I have a test.txt file like this:
test.txt
Ceci est une wheat phrase corn.
Ceci est une deuxième phrase barley.
This is the third wheat word.
Suppose that I know the offset of the word "wheat", which is [13, 18].
My code looks like this:
import nltk
from nltk.tokenize import word_tokenize

with open("test.txt") as f:
    list_phrase = f.readlines()
    f.seek(0)
    contents = f.read()

for index, phrase in enumerate(list_phrase):
    j = word_tokenize(phrase)
    if contents[13:18] in j:
        print(list_phrase[index])
The output of my code prints both sentences, i.e. "Ceci est une wheat phrase corn." and "This is the third wheat word."
How can I detect exactly which sentence a word belongs to, knowing only its offset?
Note that the offset I am using runs continuously across sentences (2 sentences in this case). For example, the offset of the word "barley" would be [61, 67].
The desired output of the print above should be:
Ceci est une wheat phrase corn.
This is because we know that its offset is [13, 18].
Any help for this would be much appreciated. Thank you so much!
If you are looking for raw speed then the standard library is probably the best approach to take.
# Generate a large text file with 10,000,001 lines.
with open('very-big.txt', 'w') as file:
    for _ in range(10000000):
        file.write("All work and no play makes Jack a dull boy.\n")
    file.write("Finally we get to the line containing the word 'wheat'.\n")
Given the search_word and its offset in the line we're looking for, we can calculate the limit for the string comparison.
search_word = 'wheat'
offset = 48
limit = offset + len(search_word)
The simplest approach is to iterate over the enumerated lines of text and perform a string comparison on each line.
with open('very-big.txt', 'r') as file:
    for line, text in enumerate(file, start=1):
        if (text[offset:limit] == search_word):
            print(f'Line {line}: "{text.strip()}"')
The runtime for this solution is 155 ms on a 2012 Mac mini (2.3GHz i7 CPU). That seems pretty fast for processing 10,000,001 lines but it can be improved upon by checking the length of the text before attempting the string comparison.
with open('very-big.txt', 'r') as file:
    for line, text in enumerate(file, start=1):
        if (len(text) >= limit) and (text[offset:limit] == search_word):
            print(f'Line {line}: "{text.strip()}"')
The runtime for the improved solution is 71 ms on the same computer. It's a significant improvement but of course mileage will vary depending on the text file.
Generated output:
Line 10000001: "Finally we get to the line containing the word 'wheat'."
EDIT: Including file offset information
with open('very-big.txt', 'r') as file:
    file_offset = 0
    for line, text in enumerate(file, start=1):
        line_length = len(text)
        if line_length >= limit and (text[offset:limit] == search_word):
            print(f'[{file_offset + offset}, {file_offset + limit}] Line {line}: "{text.strip()}"')
        file_offset += line_length
Sample output:
[430000048, 430000053] Line 10000001: "Finally we get to the line containing the word 'wheat'."
Once again
This code checks if the known offset of the text is between the values of the offset of the start of the current line and the end of the line. The text found at the offset is also verified.
long_string = """Ceci est une wheat phrase corn.
Ceci est une deuxième phrase barley.
This is the third wheat word.
"""

import io

search_word = 'barley'
known_offset = 61
limit = known_offset + len(search_word)

# Use the multi-line string defined above as file input
with io.StringIO(long_string) as file:
    file_offset = 0
    for line, text in enumerate(file, start=1):
        line_length = len(text)
        if file_offset < known_offset < (file_offset + line_length) \
                and (text[(known_offset-file_offset):(limit-file_offset)] == search_word):
            print(f'[{known_offset},{limit}]\nLine: {line}\n{text}')
        file_offset += line_length
Output:
[61,67]
Line: 2
Ceci est une deuxième phrase barley.
If you already know the position of the word, tokenizing is not what you want to do. By tokenizing, you change the sequence (for which you know the position) into a list of words, where you don't know which element is your word.
Therefore, you should leave it at the phrase and just compare the part of the phrase with your word:
with open("test.txt") as f:
list_phrase = f.readlines()
f.seek(0)
contents = f.read()
for index, phrase in enumerate(list_phrase):
if phrase[13:18].lower() == "wheat": ## .lower() is only necessary if the word might be in upper case.
print(list_phrase[index])
This would only return the sentences where wheat is at the position [13:18]. All other occurrences of wheat would not be recognized.
My dataset file looks like this:
__label__ita Adesso datemi le chiavi.
__label__ara ياله من طفل محبب! يييي!
__label__eng You're a really bad bartender.
__label__epo En kiu hotelo vi restados?
__label__spa Él dijo haber perdido su vigor a los cuarenta.
__label__tat Сиңа булышмакчы идем.
__label__heb את מה פותח המפתח הזה?
__label__eng I caught a glimpse of him from the bus.
__label__eng I advise you to do that today.
__label__jpn この歌の歌い方を教えてくれますか。
__label__deu Ich habe gewusst, dass ihr Tom nicht vergessen würdet.
I'm using this function to parse the first column labels
def parse_labels(path):
    with open(path, 'r') as f:
        return np.array( list(map(lambda x: x[9:], f.read().decode('utf-8').split() )) )
So I split the row and get the ita label from the prefix __label__ita, for example, but it breaks for some reason:
test_labels = parse_labels(args.test)
print("Test labels:%d (sample)\n%s" % (len(test_labels),test_labels[:1]) )
print("labels:%s" % test_labels)
and I get
Test labels:71828 (sample)
[u'ita']
labels:[u'ita' u'' u'' ... u'' u'' u'']
while I should have had
[u'ita',u'ara',u'eng',...]
The title of your question does not seem to match the content, and I am answering the question posed in the body. I made your code a little more modular and tested it. It returns the desired list that you have at the end of the question ([u'ita', u'ara', u'eng', ...]):
def parse_labels(path):
    test_labels = []
    with open(path, 'rb') as f:
        for line in f:
            test_labels.append(line.decode('utf-8').split(' ')[0][9:])  # "__label__" is 9 characters long
    return [x for x in test_labels if x]  # removes empty strings

parse_labels(args.test)
Since the language codes are at fixed offsets in each line, this can be processed more simply with a list comprehension. data.txt is the UTF-8-encoded input data. This code will work in Python 2 and 3:
from __future__ import print_function
import io

def parse_labels(path):
    with io.open(path, encoding='utf8') as f:
        return [line[9:12] for line in f]

print(parse_labels('data.txt'))
Output (Python 3):
['ita', 'ara', 'eng', 'epo', 'spa', 'tat', 'heb', 'eng', 'eng', 'jpn', 'deu']
I have a text file in which every line currently has the same end character (N), which is used to track the progress the system makes. I want to change the end character to "Y" in case the program ends via an error or other interruption, so that upon restarting, the program will search until it finds a line whose end character is "N" and begin working from there. Below is my code as well as a sample from the text file.
UPDATED CODE:
def GeoCode():
    f = open("geocodeLongLat.txt", "a")
    with open("CstoGC.txt", 'r') as file:
        print("Geocoding...")
        new_lines = []
        for line in file.readlines():
            check = line.split('~')
            print(check)
            if 'N' in check[-1]:
                geolocator = Nominatim()
                dot_number, entry_name, PHY_STREET, PHY_CITY, PHY_STATE, PHY_ZIP = check[0], check[1], check[2], check[3], check[4], check[5]
                address = PHY_STREET + " " + PHY_CITY + " " + PHY_STATE + " " + PHY_ZIP
                f.write(dot_number + '\n')
                try:
                    location = geolocator.geocode(address)
                    f.write(dot_number + "," + entry_name + "," + str(location.longitude) + "," + str(location.latitude) + "\n")
                except AttributeError:
                    try:
                        address = PHY_CITY + " " + PHY_STATE + " " + PHY_ZIP
                        location = geolocator.geocode(address)
                        f.write(dot_number + "," + entry_name + "," + str(location.longitude) + "," + str(location.latitude) + "\n")
                    except AttributeError:
                        print("Cannot Geocode")
                check[-1] = check[-1].replace('N', 'Y')
            new_lines.append('~'.join(check))
    with open('CstoGC.txt', 'r+') as file:  # IMPORTANT to open as 'r+' mode as 'w/w+' will truncate your file!
        for line in new_lines:
            file.writelines(line)
    f.close()
Output:
2967377~DARIN COLE~22112 TWP RD 209~ALVADA~OH~44802~Y
WAY 64 SUITE 100~EADS~TN~38028~N
384767~MILLER FARMS TRANS LLC~1103 COURT ST~BEDFORD~IA~50833~N
986150~R G S TRUCKING LTD~1765 LOMBARDIE DRIVE~QUESNEL~BC~V2J 4A8~N
1012987~DONALD LARRY KIVETT~4509 LANSBURY RD~GREENSBORO~NC~27406-4509~N
735308~ALZEY EXPRESS INC~2244 SOUTH GREEN STREET~HENDERSON~KY~42420~N
870337~RIES FARMS~1613 255TH AVENUE~EARLVILLE~IA~52057~N
148428~P R MASON & SON LLC~HWY 70 EAST~WILLISTON~NC~28589~N
220940~TEXAS MOVING CO INC~908 N BOWSER RD~RICHARDSON~TX~75081-2869~N
854042~ARMANDO ORTEGA~6590 CHERIMOYA AVENUE~FONTANA~CA~92337~N
940587~DIAMOND A TRUCKING INC~192285 E COUNTY ROAD 55~HARMON~OK~73832~N
1032455~INTEGRITY EXPRESS LLC~380 OLMSTEAD AVENUE~DEPEW~NY~14043~N
889931~DUNSON INC~33 CR 3581~FLORA VISTA~NM~87415~N
143608~LARRY A PETERSON & DONNA M PETERSON~W6359 450TH AVE~ELLSWORTH~WI~54011~N
635528~JAMES E WEBB~3926 GREEN ROAD~SPRINGFIELD~TN~37172~N
805496~WAYNE MLADY~22272 135TH ST~CRESCO~IA~52136~N
704996~SAVINA C MUNIZ~814 W LA QUINTA DR~PHARR~TX~78577~N
893169~BINDEWALD MAINTENANCE INC~213 CAMDEN DR~SLIDELL~LA~70459~N
948130~LOGISTICIZE LTD~861 E PERRY ST~PAULDING~OH~45879~N
438760~SMOOTH OPERATORS INC~W8861 CREEK ROAD~DARIEN~WI~53114~N
518872~A B C RELOCATION SERVICES INC~12 BOCKES ROAD~HUDSON~NH~03051~N
576143~E B D ENTERPRISES INC~29 ROY ROCHE DRIVE~WINNIPEG~MB~R3C 2E6~N
968264~BRIAN REDDEMANN~706 WESTGOR STREET~STORDEN~MN~56174-0220~N
721468~QUALITY LOGISTICS INC~645 LEONARD RD~DUNCAN~SC~29334~N
As you can see I am already keeping track of which line I am at just by using x. Should I use something like file.readlines()?
Sample of text document:
570772~CORPORATE BANK TRANSIT OF KENTUCKY INC~3157 HIGHWAY 64 SUITE 100~EADS~TN~38028~N
384767~MILLER FARMS TRANS LLC~1103 COURT ST~BEDFORD~IA~50833~N
986150~R G S TRUCKING LTD~1765 LOMBARDIE DRIVE~QUESNEL~BC~V2J 4A8~N
1012987~DONALD LARRY KIVETT~4509 LANSBURY RD~GREENSBORO~NC~27406-4509~N
735308~ALZEY EXPRESS INC~2244 SOUTH GREEN STREET~HENDERSON~KY~42420~N
870337~RIES FARMS~1613 255TH AVENUE~EARLVILLE~IA~52057~N
148428~P R MASON & SON LLC~HWY 70 EAST~WILLISTON~NC~28589~N
220940~TEXAS MOVING CO INC~908 N BOWSER RD~RICHARDSON~TX~75081-2869~N
854042~ARMANDO ORTEGA~6590 CHERIMOYA AVENUE~FONTANA~CA~92337~N
940587~DIAMOND A TRUCKING INC~192285 E COUNTY ROAD 55~HARMON~OK~73832~N
1032455~INTEGRITY EXPRESS LLC~380 OLMSTEAD AVENUE~DEPEW~NY~14043~N
889931~DUNSON INC~33 CR 3581~FLORA VISTA~NM~87415~N
Thank you!
Edit: updated code thanks to @idlehands
There are a few ways to do this.
Option #1
My original thought was to use the tell() and seek() methods to go back a few steps, but it quickly became clear that you cannot do this conveniently when you're not opening the file in bytes, and definitely not in a for loop over readlines(). You can see the reference threads here:
Is it possible to modify lines in a file in-place?
How to solve "OSError: telling position disabled by next() call"
The investigation led to this piece of code:
with open('file.txt', 'rb+') as file:
    line = file.readline()  # initiate the loop
    while line:  # continue while line is not None
        print(line)
        check = line.split(b'~')[-1]
        if check.startswith(b'N'):  # carriage return is expected for each line, strip it
            # ... do stuff ... #
            file.seek(-len(check), 1)  # place the buffer at the check point
            file.write(check.replace(b'N', b'Y'))  # replace "N" with "Y"
        line = file.readline()  # read next line
In the first referenced thread one of the answers mentioned this could lead you to potential problems, and directly modifying the bytes on the buffer while reading it is probably considered a bad idea™. A lot of pros probably will scold me for even suggesting it.
Option #2a
(if file size is not horrendously huge)
with open('file.txt', 'r') as file:
    new_lines = []
    for line in file.readlines():
        check = line.split('~')
        if 'N' in check[-1]:
            # ... do stuff ... #
            check[-1] = check[-1].replace('N', 'Y')
        new_lines.append('~'.join(check))

with open('file.txt', 'r+') as file:  # IMPORTANT to open as 'r+' mode as 'w/w+' will truncate your file!
    for line in new_lines:
        file.writelines(line)
This approach loads all the lines into memory first, so you do the modification in memory but leave the buffer alone. Then you reload the file and write the lines that were changed. The caveat is that technically you are rewriting the entire file line by line - not just the string N even though it was the only thing changed.
Option #2b
Technically you could open the file as r+ mode from the onset and then after the iterations have completed do this (still within the with block but outside of the loop):
# ... new_lines.append('~'.join(check)) #
file.seek(0)
for line in new_lines:
    file.writelines(line)
I'm not sure what distinguishes this from Option #1 since you're still reading and modifying the file in the same go. If someone more proficient in IO/buffer/memory management wants to chime in please do.
The disadvantage of Option 2a/b is that you always end up storing and rewriting the lines in the file, even if only a few lines need to be updated from 'N' to 'Y'.
Results (for all solutions):
570772~CORPORATE BANK TRANSIT OF KENTUCKY INC~3157 HIGHWAY 64 SUITE 100~EADS~TN~38028~Y
384767~MILLER FARMS TRANS LLC~1103 COURT ST~BEDFORD~IA~50833~Y
986150~R G S TRUCKING LTD~1765 LOMBARDIE DRIVE~QUESNEL~BC~V2J 4A8~Y
1012987~DONALD LARRY KIVETT~4509 LANSBURY RD~GREENSBORO~NC~27406-4509~Y
735308~ALZEY EXPRESS INC~2244 SOUTH GREEN STREET~HENDERSON~KY~42420~Y
870337~RIES FARMS~1613 255TH AVENUE~EARLVILLE~IA~52057~Y
148428~P R MASON & SON LLC~HWY 70 EAST~WILLISTON~NC~28589~Y
220940~TEXAS MOVING CO INC~908 N BOWSER RD~RICHARDSON~TX~75081-2869~Y
854042~ARMANDO ORTEGA~6590 CHERIMOYA AVENUE~FONTANA~CA~92337~Y
940587~DIAMOND A TRUCKING INC~192285 E COUNTY ROAD 55~HARMON~OK~73832~Y
1032455~INTEGRITY EXPRESS LLC~380 OLMSTEAD AVENUE~DEPEW~NY~14043~Y
889931~DUNSON INC~33 CR 3581~FLORA VISTA~NM~87415~Y
And if you, say, encountered a break at the line starting with 220940, the file would become:
570772~CORPORATE BANK TRANSIT OF KENTUCKY INC~3157 HIGHWAY 64 SUITE 100~EADS~TN~38028~Y
384767~MILLER FARMS TRANS LLC~1103 COURT ST~BEDFORD~IA~50833~Y
986150~R G S TRUCKING LTD~1765 LOMBARDIE DRIVE~QUESNEL~BC~V2J 4A8~Y
1012987~DONALD LARRY KIVETT~4509 LANSBURY RD~GREENSBORO~NC~27406-4509~Y
735308~ALZEY EXPRESS INC~2244 SOUTH GREEN STREET~HENDERSON~KY~42420~Y
870337~RIES FARMS~1613 255TH AVENUE~EARLVILLE~IA~52057~Y
148428~P R MASON & SON LLC~HWY 70 EAST~WILLISTON~NC~28589~Y
220940~TEXAS MOVING CO INC~908 N BOWSER RD~RICHARDSON~TX~75081-2869~N
854042~ARMANDO ORTEGA~6590 CHERIMOYA AVENUE~FONTANA~CA~92337~N
940587~DIAMOND A TRUCKING INC~192285 E COUNTY ROAD 55~HARMON~OK~73832~N
1032455~INTEGRITY EXPRESS LLC~380 OLMSTEAD AVENUE~DEPEW~NY~14043~N
889931~DUNSON INC~33 CR 3581~FLORA VISTA~NM~87415~N
There are pros and cons to these approaches. Try and see which one fits your use case the best.
I would read the entire input file into a list and .pop() the lines off one at a time. In case of an error, put the popped item back into the list and overwrite the input file with what remains. This way it will always be up to date and you won't need any other logic. A rough sketch of that idea follows.
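A minimal sketch, with the per-line work reduced to a hypothetical process() function and the error handling kept deliberately simple:

def process(line):
    # hypothetical per-line work (e.g. the geocoding step from the question)
    pass

with open('CstoGC.txt') as f:
    pending = f.readlines()

try:
    while pending:
        line = pending.pop()  # pop() takes lines from the end; reverse the list first if order matters
        process(line)
except Exception:
    pending.append(line)  # put the failed line back, as described above
    raise
finally:
    # overwrite the input file with whatever is still unprocessed
    with open('CstoGC.txt', 'w') as f:
        f.writelines(pending)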
I want to transform chunks of text into a database of single-line entries with regex. But I don't know why the regex group isn't recognized.
Maybe the multiline flag isn't properly set.
I am a beginner at Python.
import re

with open("a-j-0101.txt", encoding="cp1252") as f:
    start = 1
    ecx = r"(?P<entrcnt>[0-9]{1,3}) célébrités ou évènements"
    ec1 = ""
    nmx = r"(?P<ename>.+)\r\nAfficher le.*"
    nm1 = ""
    for line in f:
        if start == 1:
            out = open('AST0101.txt' + ".txt", "w", encoding="cp1252")  # utf8 cp1252
            ec1 = re.search(ecx, line)
            out.write(ec1.group("entrcnt"))
            start = 0
            out.write(r"\r\n")
        nm1 = re.search(nmx, line, re.M)
        out.write(str(nm1.group("ename")).rstrip('\r\n'))
    out.close()
But I get the error:
File "C:\work-python\transform-asth-b.py", line 16, in <module>
out.write(str(nm1.group("ename")).rstrip('\r\n'))
builtins.AttributeError: 'NoneType' object has no attribute 'group'
here is the input:
210 célébrités ou évènements ont été trouvés pour la date du 1er janvier.
Création de l'euro
Afficher le...
...
...
...
expected output:
210
Création de l'euro ;...
... ;...
... ;...
EDIT: I tried changing nmx to match \n or \r\n, but with no result:
nmx=r"(?P<ename>.+)(\n|\r\n)Afficher le"
best regards
In this statement:
nm1 = re.search(nmx,line, re.M)
you get a NoneType object (nm1 = None) because no match was found. So investigate the nmx pattern further to see why the regex finds no matches.
By the way, since it is possible to get a NoneType object here, you can guard against it by checking for None first:

if nm1 is not None:
    out.write(str(nm1.group("ename")).rstrip('\r\n'))
else:
    # handle your NoneType case
    pass
If you are reading a single line at a time, there is no way for a regex to match on a previous line you have read and then forgotten.
If you read a group of lines, you can apply a regex to the collection of lines, and the multiline flag will do something useful. But your current code should probably simply search for r'^Afficher le\.\.\.' and use the state machine (start == 0 or start == 1) to do this in the right context.
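For instance, a hedged sketch of the whole-text variant (the file names and encoding are taken from the question; the output format is a guess based on the expected output shown there):

import re

with open("a-j-0101.txt", encoding="cp1252") as f:
    contents = f.read()  # text mode translates \r\n to \n, so the patterns below use \n

# Count of entries, e.g. "210 célébrités ou évènements ..."
count = re.search(r"(?P<entrcnt>[0-9]{1,3}) célébrités ou évènements", contents)

# With the whole text in memory, re.M lets ^ anchor at each line,
# and the name is the line immediately before "Afficher le..."
names = re.findall(r"^(?P<ename>.+)\nAfficher le", contents, re.M)

with open("AST0101.txt", "w", encoding="cp1252") as out:
    if count:
        out.write(count.group("entrcnt") + "\n")
    for name in names:
        out.write(name + " ;...\n")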