Re-formatting a text file - python

I am fairly new to Python. I have a text file full of common misspellings. The correct spelling of each word is prefixed with a $ character, and all misspelled versions of that word follow it, one per line.
mispelling.txt:
$year
eyar
yera
$years
eyars
eyasr
yeasr
yeras
yersa
I want to create a new text file, based on mispelling.txt, where the format appears as this:
new_mispelling.txt:
eyar->year
yera->year
eyars->years
eyasr->years
yeasr->years
yeras->years
yersa->years
The correct spelling of the word is on the right-hand side of its misspelling, separated by ->; on the same line.
Question:
How do I read in the file, treat $ as the start of a new word (and thus a new group of lines in my output file), populate an output file and save it to disk?
The purpose of this is to have my collected data be of the same format as this open-source Wikipedia entry dataset of "all" commonly misspelled words, that doesn't contain my own entries of words and misspellings.

As you process the file line-by-line, if you find a word that starts with $, set that as the "currently active correct spelling". Then each subsequent line is a misspelling for that word, so format that into a string and write it to the output file.
current_word = ""
with open("mispelling.txt") as f_in, open("new_mispelling.txt", "w") as f_out:
    for line in f_in:
        line = line.strip()  # Remove whitespace at start and end
        if line.startswith("$"):
            # If the line starts with $
            # Slice the current line from the second character to the end
            # And save it as current_word
            current_word = line[1:]
        else:
            # If it doesn't start with $, create the string we want
            # And write it.
            f_out.write(f"{line}->{current_word}\n")
With your input file, this gives:
eyar->year
yera->year
eyars->years
eyasr->years
yeasr->years
yeras->years
yersa->years
The f"{line}->{current_word}\n" construct is called an f-string and is used for string interpolation in Python 3.6+.

A regex solution:
You can use the pattern r'\$(\w+)(.*?)(?=\$|$)' to capture each word that starts with $ together with the lines that follow it. For each match, join every misspelling to the captured correct word with ->, then join all of the resulting pairs with \n. Make sure to pass the re.DOTALL flag so that . also matches newlines, since the input is a multi-line string:
import re
txt='''$year
eyar
yera
$years
eyars
eyasr
yeasr
yeras
yersa'''
print('\n'.join('\n'.join('->'.join((v, m.group(1)))
                          for v in m.group(2).strip('\n').split('\n'))
                for m in re.finditer(r'\$(\w+)(.*?)(?=\$|$)', txt, re.DOTALL)))
OUTPUT:
eyar->year
yera->year
eyars->years
eyasr->years
yeasr->years
yeras->years
yersa->years
I'm leaving the file read/write to you, assuming that's not the part you are asking about.
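For completeness, here is a minimal sketch of how the regex approach could be combined with the file handling that was left out. The function names convert and convert_file are made up for this example, and I swapped the final $ for \Z and used split() so that a trailing newline or a $word with no misspellings under it does not produce empty entries; treat it as one possible arrangement, not the only one.

```python
import re

# Same idea as the pattern above; \Z anchors at the very end of the text.
PATTERN = re.compile(r'\$(\w+)(.*?)(?=\$|\Z)', re.DOTALL)


def convert(text):
    """Turn '$word' blocks into 'misspelling->word' lines (hypothetical helper)."""
    lines = []
    for m in PATTERN.finditer(text):
        correct = m.group(1)
        # split() with no arguments drops blank lines and stray whitespace
        for wrong in m.group(2).split():
            lines.append(f"{wrong}->{correct}")
    return "\n".join(lines)


def convert_file(src, dst):
    """Read src, convert it, and write the result to dst (hypothetical helper)."""
    with open(src) as f_in:
        text = f_in.read()
    with open(dst, "w") as f_out:
        f_out.write(convert(text) + "\n")
```

With the file names from the question, the call would be convert_file("mispelling.txt", "new_mispelling.txt").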

Related

Load a text file paragraph into a string without libraries

Sorry if this question looks a bit dumb to some of you, but I'm a total beginner at programming in Python, so I still have a lot to learn.
Basically, I have a long text file separated into paragraphs. Sometimes the newline is doubled or tripled to make the task harder, so I added a small check that seems to work fine: I have a variable called "paragraph" that tells me which paragraph I am currently in.
Now I need to scan this text file and search for certain sequences of words in it, but the newline character is the worst enemy here. For example, if I have string = "dummy text" and I'm looking at this:
"random questions about files with a dummy
text and strings
hey look a new paragraph here"
As you can see, there is a newline between dummy and text, so reading the file line by line doesn't work. So I was wondering whether I could load the entire paragraph into a string; that way I could also remove punctuation more easily and check directly whether those sequences of words are contained in it.
All this must be done without libraries.
However, my paragraph counter works while the file is being read, so if loading a whole paragraph into a string is possible, should I use something like "".join until the paragraph counter increases by 1, meaning we're in the next paragraph? Any ideas?
This should do the trick. It is very short and elegant:
with open('dummy text.txt') as file:
    # Replace each newline with a space so words on adjacent lines stay separated
    data = file.read().replace('\n', ' ')
print(data)  # prints out the file
The output is:
"random questions about files with a dummy text and strings hey look a new paragraph here"
I don't think you need to overcomplicate this. Here is a very commonly used pattern for this kind of problem.
paragraphs = []
lines = []
for line in open('text.txt'):
    if not line.strip():  # empty line
        if lines:
            paragraphs.append("".join(lines))
            lines = []
    else:
        lines.append(line)
if lines:
    paragraphs.append("".join(lines))
If a stripped line is empty, you have reached a blank line that separates two paragraphs, which means you have to join the lines collected so far into one paragraph.
If you then encounter another blank line (a 3rd \n), you must not join again, which is why the collected lines are cleared (lines = []) after each join. This way, you will not append the same paragraph twice.
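Building on that pattern, here is a rough sketch of how the phrase search from the question could work without any libraries. The function names split_paragraphs and paragraphs_containing are invented for this example, and I am assuming that a line break inside a paragraph should count as a single space.

```python
def split_paragraphs(text):
    """Split text on blank lines; line breaks inside a paragraph become spaces."""
    paragraphs = []
    lines = []
    for line in text.split("\n"):
        if not line.strip():  # blank line: paragraph boundary
            if lines:
                paragraphs.append(" ".join(lines))
                lines = []
        else:
            # " ".join(line.split()) collapses runs of whitespace in the line
            lines.append(" ".join(line.split()))
    if lines:
        paragraphs.append(" ".join(lines))
    return paragraphs


def paragraphs_containing(text, phrase):
    """Return the 1-based numbers of the paragraphs that contain phrase."""
    return [number
            for number, paragraph in enumerate(split_paragraphs(text), start=1)
            if phrase in paragraph]
```

Because each paragraph ends up as one string, punctuation can be removed from it (for example with str.replace) before the `in` check.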
To check the last line, try this pattern.
f = open('text.txt')
line0 = f.readline()
while True:
    # do what you have to do with the previous line, `line0`
    line = f.readline()
    if not line:  # `line0` was the last line
        # do what you have to do with the last line
        break
    line0 = line
You can strip the newline character. Here is an example from a different problem.
data = open('resources.txt', 'r')
book_list = []
for line in data:
    new_line = line.rstrip('\n')
    book_list.append(new_line)

Matching a simple string with regex not working?

I have a large txt-file and want to extract all strings with these patterns:
/m/meet_the_crr
/m/commune
/m/hann_2
Here is what I tried:
import re
with open("testfile.txt", "r") as text_file:
    contents = text_file.read().replace("\n", "")
print(re.match(r'^\/m\/[a-zA-Z0-9_-]+$', contents))
The result I get is a simple "None". What am I doing wrong here?
Do not remove the line endings, and use the re.MULTILINE flag so that ^ and $ match at every line and you get multiple results back from the larger text:
# write a demo file
with open("t.txt","w") as f:
    f.write("""
/m/meet_the_crr\n
/m/commune\n
/m/hann_2\n\n
# your text looks like this after .read().replace(\"\\n\",\"\")\n
/m/meet_the_crr/m/commune/m/hann_2""")
Program:
import re
regex = r"^\/m\/[a-zA-Z0-9_-]+$"
with open("t.txt","r") as f:
    contents = f.read()
found_all = re.findall(regex,contents,re.M)
print(found_all)
print("-")
print(open("t.txt").read())
Output:
['/m/meet_the_crr', '/m/commune', '/m/hann_2']
Filecontent:
/m/meet_the_crr
/m/commune
/m/hann_2
# your text looks like this after .read().replace("\n","")
/m/meet_the_crr/m/commune/m/hann_2
This is roughly what Wiktor Stribiżew told you in his comment, although he also suggested using a better pattern: r'^/m/[\w-]+$'
There is nothing logically wrong with your code, and in fact your pattern will match the inputs you describe:
result = re.match(r'^\/m\/[a-zA-Z0-9_-]+$', '/m/meet_the_crr')
if result:
    print(result.groups())  # this line is reached, as there is a match
Since you did not specify any capture groups, you will see () being printed to the console. You could capture the entire input, and then it would be available, e.g.
result = re.match(r'(^\/m\/[a-zA-Z0-9_-]+$)', '/m/meet_the_crr')
if result:
    print(result.group(1))  # text captured by the first group
/m/meet_the_crr
You are reading the whole file into a variable (into memory) using .read(). With .replace("\n", ""), you remove all newlines in the string. re.match(r'^\/m\/[a-zA-Z0-9_-]+$', contents) then tries to match the entire string against the \/m\/[a-zA-Z0-9_-]+ pattern, which is impossible after those manipulations.
There are at least two ways out. Either remove .replace("\n", "") (to prevent newline removal) and use re.findall(r'^/m/[\w-]+$', contents, re.M) (re.M option will enable matching whole lines rather than the whole text), or read the file line by line and use your re.match version to check each line for a match, and if it matches add to the final list.
Example:
import re
with open("testfile.txt", "r") as text_file:
    contents = text_file.read()
print(re.findall(r'^/m/[\w-]+$', contents, re.M))
Or
import re
with open("testfile.txt", "r") as text_file:
    for line in text_file:
        if re.match(r'/m/[\w-]+\s*$', line):
            print(line.rstrip())
Note I used \w to make the pattern somewhat shorter, but if you are working in Python 3 and only want to match ASCII letters and digits, also pass the re.ASCII option.
Also, / is not a special character in Python regex patterns, so there is no need to escape it.
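To illustrate the re.ASCII remark, here is a small sketch that runs the same pattern with and without the flag. The sample text (including the accented entry) and the function name find_paths are made up for the demonstration.

```python
import re

# Hypothetical sample; the last path contains a non-ASCII letter (é).
SAMPLE = "/m/meet_the_crr\n/m/commune\n/m/hann_2\n/m/caf\u00e9\n"


def find_paths(text, ascii_only=False):
    """Return every line that consists solely of /m/ followed by word chars or '-'."""
    flags = re.M
    if ascii_only:
        flags |= re.ASCII  # \w then matches only [a-zA-Z0-9_]
    return re.findall(r'^/m/[\w-]+$', text, flags)


unicode_matches = find_paths(SAMPLE)
ascii_matches = find_paths(SAMPLE, ascii_only=True)
```

Here unicode_matches contains all four paths, while ascii_matches leaves out the accented one, which matches the behaviour of the original [a-zA-Z0-9_-] character class.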

How to split last word before a certain char in python string and replace that same line in a txt file?

I am currently trying to repeatedly replace a word in a line, but there are two issues with my code. I can successfully locate the lines I want to change, but my current code fails to (1) store the specific part of the string that I want and (2) replace the word on that same line.
The text I want to replace appears two times and looks like this in the text file:
One_Number_ = "0"
if [ One_Number_ == "0" ]
I wish to change "One" here to something else each time I run the program. What I've tried to do is the following:
import os

with open(os.path.join('file.txt'), 'r') as file:
    lines = file.readlines()
with open(os.path.join('file.txt'), 'w') as file:
    for line in lines:
        if (line.__contains__('_Number_')):
            replaceline = line.rsplit('_', 1)[0]
            line.replace (replaceline, "NewWord")
        file.write(line)
The if-statement runs, but the line does not replace "One".
The strings also do not get split correctly, meaning that replaceline does not contain just "One".
How can I adjust my code so that it locates the two lines in the text file and replaces the word with the new string ("NewWord" is just an example)?
You need to assign the result of .replace() back to the variable:
line = line.replace(replaceline, "NewWord")
The replace method returns a new string; it does not modify the existing string.
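Note that line.rsplit('_', 1)[0] on the first sample line returns One_Number rather than One, so the assignment fix on its own may still not give the result you want. One possible way around this, sketched here with a hypothetical helper named rename_prefix, is to let a regex find the run of letters and digits directly in front of _Number_ and replace only that part. I am assuming the word to replace contains no underscores itself.

```python
import re


def rename_prefix(line, new_word):
    """Replace the letters/digits directly before '_Number_' with new_word."""
    # (?=_Number_) is a lookahead: it requires the suffix but does not consume it
    return re.sub(r'[A-Za-z0-9]+(?=_Number_)', new_word, line)


first = rename_prefix('One_Number_ = "0"\n', "NewWord")
second = rename_prefix('if [ One_Number_ == "0" ]\n', "NewWord")
```

Inside the write loop this would become file.write(rename_prefix(line, "NewWord")); lines without _Number_ pass through unchanged.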

Using Python regex to check for string in file

I'm using Python regex to check a log file that contains the output of the Windows command tasklist for anything ending with .exe. This log file contains output from multiple callings of tasklist. After I get a list of strings with .exe in them, I want to write them out to a text file after checking to see if that string already exists in the output file. Instead of the desired output, it writes out duplicates of strings already present in the text file. (svchost.exe shows up several times for example.) The goal is to have a text file with a list of each unique process enumerated by tasklist with no duplicates of processes already written in the file.
import re
file1 = open('taskinfo.txt', 'r')
strings = re.findall(r'.*.exe', file1.read())
file1.close()
file2 = open('exes.txt', 'w+')
for item in strings:
    line_to_write = re.match(item, file2.read())
    if line_to_write == None:
        file2.write(item)
        file2.write('\n')
    else:
        pass
I used print statements to debug and made sure that item is the desired output.
There are some problems with your regex. Try this:
strings = re.findall(r'\b\S*\.exe\b', file1.read())
This will only take the text connected to the .exe by starting at a word boundary (\b) and grabbing all non-space characters (\S). Additionally, when you had .exe instead of \.exe, the . was matching as a wildcard, rather than a literal period.
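The regex fix does not remove the duplicates by itself. As far as I can tell, file2.read() in the question returns an empty string every time, because the file was just truncated by 'w+' and after each write the position sits at the end of the file, so the re.match check never finds anything. A simpler route is to deduplicate in memory before writing. The sketch below uses dict.fromkeys to keep the first occurrence of each name in order; the function name unique_exes is invented for the example.

```python
import re


def unique_exes(log_text):
    """Return each .exe name found in log_text once, in first-seen order."""
    names = re.findall(r'\b\S*\.exe\b', log_text)
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(names))


def write_unique_exes(src, dst):
    """Read the tasklist log at src and write the unique names to dst."""
    with open(src) as f_in:
        names = unique_exes(f_in.read())
    with open(dst, "w") as f_out:
        for name in names:
            f_out.write(name + "\n")
```

With the file names from the question, the call would be write_unique_exes('taskinfo.txt', 'exes.txt').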

how to : multiline to oneline by removing newlines

I'm a newbie who is just starting to learn Python.
I wanted to make a script that counts occurrences of a letter pattern in a text file. The problem is that my text file has multiple lines, so I couldn't find some of my patterns because they continue on the next line.
My file and the pattern are DNA sequences.
Example:
'attctcgatcagtctctctagtgtgtgagagactctagctagatcgtccactcactgac**ga
tc**agtcagt**gatc**tctcctactacaaggtgacatgagtgtaaattagtgtgagtgagtgaa'
I'm looking for 'gatc'. The second one was counted, but the first wasn't.
So, how can I turn this file into a single line of text?
You can join the lines after you read the text from the file:
with open('dna.txt', 'r') as fd:
    text = fd.read()
dnatext = text.replace('\n', '')  # join text lines
gatc_count = dnatext.count('gatc')  # count 'gatc' occurrences
This should do the trick:
dnatext = "".join(dnatext.split("\n"))
