Python | Reformatting each line in a text file consistently

I have made my own corpus of misspelled words.
misspellings_corpus.txt:
English, enlist->Enlish
Hallowe'en, Halloween->Hallowean
I'm having an issue with my format. Thankfully, it is at least consistent.
Current format:
correct, wrong1, wrong2->wrong3
Desired format:
wrong1,wrong2,wrong3->correct
The order of wrong<N> isn't of concern,
There might be any number of wrong<N> words per line (separated by a comma: ,),
There's only 1 correct word per line (which should be to the right of ->).
Failed Attempt:
with open('misspellings_corpus.txt') as oldfile, open('new.txt', 'w') as newfile:
    for line in oldfile:
        correct = line.split(', ')[0].strip()
        print(correct)
        W = line.split(', ')[1].strip()
        print(W)
        wrong_1 = W.split('->')[0]  # however, there might be loads of wrong words
        wrong_2 = W.split('->')[1]
        newfile.write(wrong_1 + ', ' + wrong_2 + '->' + correct)
Output new.txt (isn't working):
enlist, Enlish->EnglishHalloween, Hallowean->Hallowe'en
Solution: (Inspired by #alexis)
import re

with open('misspellings_corpus.txt') as oldfile, open('new.txt', 'w') as newfile:
    for line in oldfile:
        # line = 'correct, wrong1, wrong2->wrong3'
        line = line.strip()
        terms = re.split(r", *|->", line)
        newfile.write(",".join(terms[1:]) + "->" + terms[0] + '\n')
Output new.txt:
enlist,Enlish->English
Halloween,Hallowean->Hallowe'en

Let's assume all the commas are word separators. I'll break each line on commas and arrows, for convenience:
import re
line = 'correct, wrong1, wrong2->wrong3'
terms = re.split(r", *|->", line)
new_line = ", ".join(terms[1:]) + "->" + terms[0]
print(new_line)
You can put that back in a file-reading loop, right?

I'd suggest building up a list, rather than assuming the number of elements. When you split on the comma, the first element is the correct word, elements [1:-1] are misspellings, and [-1] is going to be the one you have to split on the arrow.
I think you're also finding that write needs a newline character ("\n") appended, as suggested in the comments.
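A minimal sketch of that list-based approach (using the file names from the question, and assuming the input format shown above):
with open('misspellings_corpus.txt') as oldfile, open('new.txt', 'w') as newfile:
    for line in oldfile:
        parts = [p.strip() for p in line.split(',')]
        correct = parts[0]                # first element is the correct word
        misspellings = parts[1:-1]        # middle elements are already misspellings
        misspellings.extend(parts[-1].split('->'))  # last element still holds "wrongN->wrongM"
        newfile.write(','.join(misspellings) + '->' + correct + '\n')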

Related

How to save a string in a file as a quoted string?

I have a file file.md that I want to read in as a string.
Then I want to take that string and save it in another file, but as a string with quotes (and all). The reason is I want to transfer the content of my markdown file to a markdown string so that I can include it in html using the javascript marked library.
How can I do that using a python script?
Here's what I have tried so far:
with open('file.md', 'r') as md:
    text = ""
    lines = md.readlines()
    for line in lines:
        line = "'" + line + "'" + '+'
        text = text + line
with open('file.txt', 'w') as txt:
    txt.write(text)
Input file.md
This is one line of markdown
This is another line of markdown

This is another one
Desired output: file.txt
"This is one line of markdown" +
"This is another line of markdown" +
(what should come here by the way to encode an empty line?)
"This is another one"
There are two things you need to pay attention to here.
The first is that you should not modify your loop variable line while iterating over lines. Instead, assign the result to a new string variable (I call it new_line).
Second, if you append characters to the end of each line, they will land after the end-of-line character and thus end up on the next line when you write the result to a new file. Instead, skip the last character of each line and add the line break manually.
If I understand you correctly, this should give you the desired output:
with open('file.md', 'r') as md:
    text = ""
    lines = md.readlines()
    for line in lines:
        if line[-1] == "\n":
            text += "'" + line[:-1] + "'+\n"
        else:
            text += "'" + line + "'+"
with open('file.txt', 'w') as txt:
    txt.write(text)
Note how the last line is treated differently from the others (no end-of-line character and no + sign).
text += ... adds more characters to the existing string.
This also works and might be a bit nicer, because it avoids the if-statement. You can strip the newline character right when reading the content from file.md. At the end you skip the last two characters of the built-up string, which are the + and the \n.
with open('file.md', 'r') as md:
    text = ""
    lines = [line.rstrip('\n') for line in md]
    for line in lines:
        text += "'" + line + "' +\n"
with open('file.txt', 'w') as txt:
    txt.write(text[:-2])
...and with using a formatter:
text += "'{}' +\n".format(line)
...checking for empty lines as you asked in the comments:
for line in lines:
    if line == '':
        text += '\n'
    else:
        text += "'{}' +\n".format(line)
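On Python 3.6+, the same formatting line could also be written as an f-string; a minimal equivalent of the .format() call above:
text += f"'{line}' +\n"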
This works:
>>> a = '''This is one line of markdown
... This is another line of markdown
...
... This is another one'''
>>> lines = a.split('\n')
>>> lines = [ '"' + i + '" +' if len(i) else i for i in lines]
>>> lines[-1] = lines[-1][:-2] # drop the '+' at the end of the last line
>>> print '\n'.join( lines )
"This is one line of markdown" +
"This is another line of markdown" +
"This is another one"
You may add reading/writing to files yourself.
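If you do add the file handling, a minimal sketch of the same approach (assuming file.md and file.txt as in the question) could be:
with open('file.md') as md:
    content = md.read().rstrip('\n')  # drop the trailing newline so the last line really is last
lines = content.split('\n')
lines = ['"' + i + '" +' if len(i) else i for i in lines]
if lines and lines[-1].endswith(' +'):
    lines[-1] = lines[-1][:-2]  # drop the ' +' on the final line
with open('file.txt', 'w') as txt:
    txt.write('\n'.join(lines))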

Is there any shortcut in Python to remove all blanks at the end of each line in a file?

I've learned that we can easily remove blank lines in a file or remove blanks from a string, but what about removing all the blanks at the end of each line in a file?
One way would be to process each line of the file, like:
with open(file) as f:
    for line in f:
        store line.strip()
Is this the only way to accomplish the task?
Possibly the ugliest implementation possible, but here's what I just scratched up :0
def strip_str(string):
    last_ind = 0
    split_string = string.split(' ')
    for ind, word in enumerate(split_string):
        if word == '\n':
            return ''.join([split_string[0]] + [' {} '.format(x) for x in split_string[1:last_ind]])
        last_ind += 1
Don't know if these count as different ways of accomplishing the task. The first is really just a variation on what you have. The second does the whole file at once, rather than line-by-line.
A map that calls the rstrip method on each line of the file:
import operator
with open(filename) as f:
    # basically the same as (line.rstrip() for line in f)
    for line in map(operator.methodcaller('rstrip'), f):
        # do something with the line
        pass
read the whole file and use re.sub():
import re
with open(filename) as f:
    text = f.read()
text = re.sub(r"\s+(?=\n)", "", text)
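That only builds the cleaned string in memory; to actually update the file you would still write it back out, for example like this (which overwrites the original file, which may or may not be what you want):
with open(filename, 'w') as f:
    f.write(text)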
If you just want to remove spaces, another solution would be:
line.replace(" ", "")
Keep in mind that this removes every space in the line, not only the trailing ones.

.split() sequence doesn't work more than one time

I'm trying to solve a problem where I need to clean up text (get rid of all punctuation and extra whitespace) and convert it all to the same case.
with open("moby_01.txt") as infile, open("moby_01_clean_3.txt", "w") as outfile:
for line in infile:
line.lower
...
cleaned_words = line.split("-")
cleaned_words = "\n".join(cleaned_words)
cleaned_words = line.strip().split()
cleaned_words = "\n".join(cleaned_words)
outfile.write(cleaned_words)
I expect the output of the program to be the words of the text, one per line. But it turns out that only the last three lines inside the for loop have any effect, and the output is a list of words that still contain punctuation:
Call
me
Ishmael.
Some
years
ago--never
mind
how
long
precisely--having
...
You might want to change this; you are using line again here:
cleaned_words = line.strip().split()
Change it to:
cleaned_words = cleaned_words.strip().split()
I finally found how to solve this problem. The exercise book (The Quick Python Book, Third Edition, Naomi Ceder), the Python documentation, and StackOverflow helped me.
with open("moby_01.txt") as infile, open("moby_01_clean.txt","w") as outfile:
for line in infile:
cleaned_line = line.lower()
cleaned_line = cleaned_line.translate(str.maketrans("-", " ", ".,?!;:'\"\n"))
words = cleaned_line.split()
cleaned_words = "\n".join(words)
outfile.write(cleaned_words + "\n")
I moved the - sign from the deletion argument z of str.maketrans(x[, y[, z]]) to x, because otherwise some words joined with -- remained concatenated in the file. For the same reason I appended "\n" in outfile.write(cleaned_words + "\n").
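As a small illustration of what that translation table does (a sketch, not part of the quoted solution): dashes become spaces and the listed punctuation is deleted.
table = str.maketrans("-", " ", ".,?!;:'\"\n")
print("ago--never mind!".translate(table))  # prints: ago  never mind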

python chain a list from a tsv file

I have this tsv file containing some paths of links that I want to use; each link is separated by a ';'.
In the example below you can see how the text in the file is separated,
and I only want to read the last column, which is a path starting with '14th':
6a3701d319fc3754 1297740409 166 14th_century;15th_century;16th_century;Pacific_Ocean;Atlantic_Ocean;Accra;Africa;Atlantic_slave_trade;African_slave_trade NULL
3824310e536af032 1344753412 88 14th_century;Europe;Africa;Atlantic_slave_trade;African_slave_trade 3
415612e93584d30e 1349298640 138 14th_century;Niger;Nigeria;British_Empire;Slavery;Africa;Atlantic_slave_trade;African_slave_trade
I want to somehow split the path into a chain like this:
['14th_century', 'Niger', 'Nigeria'....]
How do I read the file and remove the first 3 columns so that I only get the last one?
UPDATE:
I have tried this now:
import re
with open('test.tsv') as f:
    lines = f.readlines()
    for line in lines[22:len(lines)]:
        re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
        e_line = line.split(' ')
        real_line = e_line[0]
        print real_line.split(';')
But the problem is that it's not deleting the first 3 columns?
If the separator between the first columns is just a single space (not a series of spaces or a tab), you could do this:
with open('file_name') as f:
    lines = f.readlines()
    for line in lines:
        e_line = line.split(' ')
        real_line = e_line[3]
        print real_line.split(';')
Answer to your updated question.
But the problem is that it's not deleting the first 3 columns?
There are several mistakes.
Your code:
import re
with open('test.tsv') as f:
    lines = f.readlines()
    for line in lines[22:len(lines)]:
        re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
        e_line = line.split(' ')
        real_line = e_line[0]
        print real_line.split(';')
This line does nothing:
re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
because the re.sub function doesn't modify your line variable; it returns the replaced string.
So you want to do this instead:
line = re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
Also, your regexp ^\s+ only matches whitespace or tabs at the start of the string, because you use ^.
But I think you just want to replace consecutive whitespace or tabs with a single space.
The code above then becomes the following (just remove the ^ in the regexp):
line = re.sub(r"\s+", " ", line, flags = re.MULTILINE)
Now the fields in line are separated by just one space, so line.split(' ') will work as you want.
Next, e_line[0] returns the first element of e_line, which is the 1st column of the line.
But you want to skip the first 3 columns and get the 4th column. You can do it like this:
e_line = line.split(' ')
real_line = e_line[3]
OK. Now the entire code looks like this:
for line in lines:  # <--- I also changed here because there is no need to skip the first 22 lines in your example.
    line = re.sub(r"\s+", " ", line)
    e_line = line.split(' ')
    real_line = e_line[3]
    print real_line
output:
14th_century;15th_century;16th_century;Pacific_Ocean;Atlantic_Ocean;Accra;Africa;Atlantic_slave_trade;African_slave_trade
14th_century;Europe;Africa;Atlantic_slave_trade;African_slave_trade
14th_century;Niger;Nigeria;British_Empire;Slavery;Africa;Atlantic_slave_trade;African_slave_trade
P.S:
This line can become more pythonic.
before:
for line in lines[22:len(lines)]:
after:
for line in lines[22:]:
And you don't need flags = re.MULTILINE, because each line is a single line inside the for loop.
You don't need to use regex for this. The csv module can handle tab-separated files too:
import csv
filereader = csv.reader(open('test.tsv', 'rb'), delimiter='\t')
path_list = [row[3].split(';') for row in filereader]
print(path_list)
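To relate this back to the chain format asked for in the question, a quick usage sketch on top of path_list (assuming the columns really are tab-separated, as the file extension suggests):
first_chain = path_list[0]
print(first_chain[:3])  # ['14th_century', '15th_century', '16th_century'] for the first sample line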

python, trying to sort out lines from a textfile

I'm trying to sort out "good numbers" from "bad" ones.
My problem is that some of the numbers I'm getting from the text file contain spaces (" "). My functions identify bad numbers by splitting on spaces, so every line that contains a space shows up as a bad number, regardless of whether it is actually good or bad.
Anyone got any idea how to sort them out? I'm using this right now:
def showGoodNumbers():
    print("all good numbers:")
    textfile = open("textfile.txt", "r")
    for line in textfile.readlines():
        split_line = line.split(' ')
        if len(split_line) == 1:
            print(split_line)  # this will print as a tuple
    textfile.close()

def showBadNumbers():
    print("all bad numbers:")
    textfile = open("textfile.txt", "r")
    for line in textfile.readlines():
        split_line = line.split(' ')
        if len(split_line) > 1:
            print(split_line)  # this will print as a tuple
    textfile.close()
The text file looks like this (all entries with a comment are "bad"):
13513 51235
235235-23523
2352352-23 - not valid
235235 - too short
324-134 3141
23452566246 - too long
This is (yet another) classic example of where the Python re module really shines:
from re import match
with open("textfile.txt", "r") as f:
    for line in f:
        if match("^[0-9- ]*$", line):
            print "Good Line:", line
        else:
            print "Bad Line:", line
Output:
Good Line: 13513 51235
Good Line: 235235-23523
Bad Line: 2352352-23 - not valid
Bad Line: 235235 - too short
Good Line: 324-134 3141
Bad Line: 23452566246 - too long
String manipulation is all you need here.
allowed_chars = ['-', '.', ' ', '\n']
with open("textfile.txt", "r") as fp:
    for line in fp:
        line_check = line
        for chars in allowed_chars:
            line_check = line_check.replace(chars, '')
        if line_check.isdigit():
            print "Good line:", line
        else:
            print "Bad line:", line
You can add any number of characters to the allowed_chars list, which makes it easy to extend. I added \n to allowed_chars so that the trailing newline character is also handled, as suggested in the comments.
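A compact variant of the same allowed-characters idea (just a sketch, not from the original answers) is to compare each line against a set of permitted characters:
allowed = set('0123456789-. \n')
with open("textfile.txt") as fp:
    for line in fp:
        if set(line) <= allowed:
            # note: unlike isdigit(), this also accepts lines with no digits at all
            print("Good line:", line.rstrip('\n'))
        else:
            print("Bad line:", line.rstrip('\n'))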
