I'm trying to sort out "good numbers" from "bad" ones.
My problem is that some of the numbers I'm reading from the text file contain spaces (" "). The functions below classify lines by splitting on spaces, so every line that contains a space gets reported as a bad number, whether it actually is or not.
Anyone got any idea how to sort them out? This is what I'm using right now.
def showGoodNumbers():
    print("all good numbers:")
    textfile = open("textfile.txt", "r")
    for line in textfile.readlines():
        split_line = line.split(' ')
        if len(split_line) == 1:
            print(split_line)  # this prints as a list
    textfile.close()

def showBadNumbers():
    print("all bad numbers:")
    textfile = open("textfile.txt", "r")
    for line in textfile.readlines():
        split_line = line.split(' ')
        if len(split_line) > 1:
            print(split_line)  # this prints as a list
    textfile.close()
The text file looks like this (all entries with a comment are "bad"):
13513 51235
235235-23523
2352352-23 - not valid
235235 - too short
324-134 3141
23452566246 - too long
This is (yet another) classic example of where the Python re module really shines:
from re import match

with open("textfile.txt", "r") as f:
    for line in f:
        if match("^[0-9- ]*$", line):
            print("Good Line:", line, end="")
        else:
            print("Bad Line:", line, end="")
Output:
Good Line: 13513 51235
Good Line: 235235-23523
Bad Line: 2352352-23 - not valid
Bad Line: 235235 - too short
Good Line: 324-134 3141
Bad Line: 23452566246 - too long
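One caveat, not from the original answer: the pattern ^[0-9- ]*$ also accepts an empty or blank line as "good", because * allows zero characters. If that matters, a slightly stricter sketch (the use of fullmatch and a required leading digit are my additions):

```python
import re

def is_good(line):
    # Require at least one digit at the start; fullmatch() checks the
    # whole string, so no ^...$ anchors are needed.
    return re.fullmatch(r"[0-9][0-9- ]*", line.strip()) is not None

print(is_good("13513 51235"))         # True
print(is_good("235235 - too short"))  # False
print(is_good(""))                    # False
```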
String manipulation is all you need here.

allowed_chars = ['-', '.', ' ', '\n']
with open("textfile.txt", "r") as fp:
    for line in fp:
        line_check = line
        for char in allowed_chars:
            line_check = line_check.replace(char, '')
        if line_check.isdigit():
            print("Good line:", line)
        else:
            print("Bad line:", line)

You can add any number of characters to the allowed_chars list, which makes it easy to extend. I added \n so that the trailing newline character is also handled, as pointed out in the comments.
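The same check can also be written without the replace() loop, using a set comparison; a minimal sketch assuming the same allowed characters:

```python
# Digits plus the separators the answer above allows.
ALLOWED = set('0123456789-. \n')

def is_good_line(line):
    # Good if every character is a digit or an allowed separator,
    # and the line contains at least one digit.
    return set(line) <= ALLOWED and any(c.isdigit() for c in line)

print(is_good_line("13513 51235\n"))         # True
print(is_good_line("235235 - too short\n"))  # False
```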
I have made my own corpus of misspelled words.
misspellings_corpus.txt:
English, enlist->Enlish
Hallowe'en, Halloween->Hallowean
I'm having an issue with my format. Thankfully, it is at least consistent.
Current format:
correct, wrong1, wrong2->wrong3
Desired format:
wrong1,wrong2,wrong3->correct
The order of wrong<N> isn't of concern,
There might be any number of wrong<N> words per line (separated by a comma: ,),
There's only 1 correct word per line (which should be to the right of ->).
Failed Attempt:
with open('misspellings_corpus.txt') as oldfile, open('new.txt', 'w') as newfile:
    for line in oldfile:
        correct = line.split(', ')[0].strip()
        print(correct)
        W = line.split(', ')[1].strip()
        print(W)
        wrong_1 = W.split('->')[0]  # however, there might be loads of wrong words
        wrong_2 = W.split('->')[1]
        newfile.write(wrong_1 + ', ' + wrong_2 + '->' + correct)
Output new.txt (isn't working):
enlist, Enlish->EnglishHalloween, Hallowean->Hallowe'en
Solution: (Inspired by #alexis)
import re

with open('misspellings_corpus.txt') as oldfile, open('new.txt', 'w') as newfile:
    for line in oldfile:
        # line = 'correct, wrong1, wrong2->wrong3'
        line = line.strip()
        terms = re.split(r", *|->", line)
        newfile.write(",".join(terms[1:]) + "->" + terms[0] + '\n')
Output new.txt:
enlist,Enlish->English
Halloween,Hallowean->Hallowe'en
Let's assume all the commas are word separators. I'll break each line on commas and arrows, for convenience:
import re
line = 'correct, wrong1, wrong2->wrong3'
terms = re.split(r", *|->", line)
new_line = ", ".join(terms[1:]) + "->" + terms[0]
print(new_line)
You can put that back in a file-reading loop, right?
I'd suggest building up a list, rather than assuming the number of elements. When you split on the comma, the first element is the correct word, elements [1:-1] are misspellings, and [-1] is going to be the one you have to split on the arrow.
As suggested in the comments, you're probably also finding that write() needs an explicit newline character ("\n").
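The list-building approach described above, sketched as a standalone function (the function name convert is mine):

```python
def convert(line):
    # 'correct, wrong1, wrong2->wrong3'  ->  'wrong1,wrong2,wrong3->correct'
    parts = line.strip().split(', ')
    correct = parts[0]               # first element is the correct word
    wrongs = parts[1:-1]             # middle elements are plain misspellings
    wrongs += parts[-1].split('->')  # the last element still contains '->'
    return ','.join(wrongs) + '->' + correct

print(convert('English, enlist->Enlish'))  # enlist,Enlish->English
```

Back in the file-reading loop that would just be newfile.write(convert(line) + '\n').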
I've learned that we can easily remove blank lines from a file, or strip the blanks from a single string, but how do I remove all the blanks at the end of each line in a file?
One way would be to process each line of the file, like:

with open(file) as f:
    for line in f:
        # store line.strip() somewhere
        ...

Is this the only way to complete the task?
Possibly the ugliest implementation possible but here's what I just scratched up :0

def strip_str(string):
    last_ind = 0
    split_string = string.split(' ')
    for ind, word in enumerate(split_string):
        if word == '\n':
            return ''.join([split_string[0]] + [' {} '.format(x) for x in split_string[1:last_ind]])
        last_ind += 1
Don't know if these count as different ways of accomplishing the task. The first is really just a variation on what you have. The second does the whole file at once, rather than line-by-line.
A map that calls the rstrip method on each line of the file:

import operator

with open(filename) as f:
    # basically the same as (line.rstrip() for line in f)
    for line in map(operator.methodcaller('rstrip'), f):
        # do something with the line
        ...
Or read the whole file and use re.sub():

import re

with open(filename) as f:
    text = f.read()

text = re.sub(r"\s+(?=\n)", "", text)
If you just want to remove spaces, another solution would be...

line.replace(" ", "")

Good for removing spaces.
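Worth noting, since the question is about blanks at the end of each line: replace(" ", "") removes every space, including interior ones, while rstrip() touches only the end of the string. A quick comparison:

```python
line = "  some words   \n"

# replace() removes all spaces, even inside the line:
print(repr(line.replace(" ", "")))  # 'somewords\n'

# rstrip() removes only trailing whitespace (spaces and the newline):
print(repr(line.rstrip()))          # '  some words'
```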
I'm trying to solve a problem where I need to clean text (get rid of all punctuation and spaces) and convert it to the same case.
with open("moby_01.txt") as infile, open("moby_01_clean_3.txt", "w") as outfile:
    for line in infile:
        line.lower
        ...
        cleaned_words = line.split("-")
        cleaned_words = "\n".join(cleaned_words)
        cleaned_words = line.strip().split()
        cleaned_words = "\n".join(cleaned_words)
        outfile.write(cleaned_words)
I expect the output of the program to be the words of the text, one per line. But it turns out only the last three lines of the for loop take effect, and the output is a list of words that still contain punctuation:
Call
me
Ishmael.
Some
years
ago--never
mind
how
long
precisely--having
...
You might want to change this, since you're reusing the original line here.
cleaned_words = line.strip().split()
to
cleaned_words = cleaned_words.strip().split()
I finally found how to solve this problem. The exercise book (The Quick Python Book, Third Edition, Naomi Ceder), the Python documentation, and StackOverflow helped me.
with open("moby_01.txt") as infile, open("moby_01_clean.txt", "w") as outfile:
    for line in infile:
        cleaned_line = line.lower()
        cleaned_line = cleaned_line.translate(str.maketrans("-", " ", ".,?!;:'\"\n"))
        words = cleaned_line.split()
        cleaned_words = "\n".join(words)
        outfile.write(cleaned_words + "\n")
I moved the - sign from the keyword argument z in str.maketrans(x[, y[, z]]) to x, because otherwise some words joined with -- remained concatenated in the file. For the same reason I added \n to outfile.write(cleaned_words).
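The three-argument str.maketrans behaviour described here is easy to see in isolation: characters in the first argument map to the corresponding character of the second, and characters in the third are deleted. A tiny demo (the sample string is mine):

```python
# '-' maps to ' '; the punctuation characters and '\n' are deleted.
table = str.maketrans("-", " ", ".,?!;:'\"\n")

print("call me--ishmael.\n".translate(table))  # call me  ishmael
```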
I've a text file with following text:
The process runs very well||
It starts at 6pm and ends at 7pm||
The user_id is 23456||
This task runs in a daily schedule!!
I'm trying to extract all the lines that contain the string "user_id". Basically I want to extract this:
The user_id is 23456
My current python code only identify if the desired string exists (or not) in the text file:
word = 'user_id'
if word in open('text.txt').read():
    print(word)
else:
    print("Not found")
How can I print all the sentences with that contains the word?
Thanks!
You'll want to iterate over the lines to find what you want
word = 'user_id'
with open('text.txt', 'r') as fh:
    for line in fh:
        if word in line:
            print(line)
You are not printing the line, only the word that you are trying to match. Note that with open() is a nicer way to handle opening and closing files, and is functionally similar (but not identical) to:
fh = open('text.txt', 'r')
# for loop here
fh.close()
Just do a for loop and iterate through every line, checking if the word is in the line.
word = 'user_id'
for line in open('text.txt'):
    if word in line:
        print(line)
output:
The user_id is 23456||
Try this.
word = 'user_id'
not_found = True
with open('text.txt', 'r') as infile:
    lines = infile.readlines()
    for line in lines:
        if word in line:
            print(line)
            not_found = False
if not_found:
    print("Not found")
This is exactly what regular expressions are built for:
import re

with open('text.txt', 'r') as f:
    text = f.read()

sentences = re.findall(r'(.*user_id.*)', text)
if len(sentences) > 0:
    for sentence in sentences:
        print(sentence)
else:
    print('Not found')
I'm trying to format a tab-delimited text file where there are negative values in the data set. I want to ignore any line of data that contains a negative value, and write to an output file only the rows where all values are positive. Is there any way to do this with a wildcard that looks for "-" in the strings? I would rather not convert the list to floats if I can get away with it.
Here's the code (without any mention of negative values yet):
import sys, os

inputFileName = sys.argv[1]
outputFileName = os.path.splitext(inputFileName)[0] + "_edited.txt"

try:
    infile = open(inputFileName, 'r')
    outfile = open(outputFileName, 'w')
    line = infile.readline()
    outfile.write(line)
    for line in infile:
        line = line.strip()
        lineList = line.split('\t')
        lineList = [item for item in lineList if item != '']
        #print(lineList)
        #print(len(lineList))
        if len(lineList) == 9:
            #print(lineList)
            line = '\t'.join(lineList)
            line = line + '\n'
            outfile.write(line)
    infile.close()
    outfile.close()
except IOError:
    print(inputFileName, "does not exist.")
I've already (with help) gotten rid of any empty values in the above data file that has nine columns. Now I'm trying to get rid of any rows with negative values.
You can use a regular expression in your script to suppress anything with a "-" in it before outputting it. Or pipe the whole output of this script into grep -v "-" and it should suppress any lines with a negative in them.
import re

# True if the current line contains any negative integer
has_negative = any(float(n) < 0 for n in re.findall(r'-?\d+', line))
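Since the question prefers not to convert to floats, a simpler check on the already-split fields may be enough; a sketch, assuming each negative value starts with a literal '-' (the helper name has_negative_field is mine):

```python
def has_negative_field(fields):
    # True if any field looks negative, i.e. begins with '-'.
    return any(field.startswith('-') for field in fields)

print(has_negative_field(['1.5', '-2.3', '4']))  # True
print(has_negative_field(['1.5', '2.3', '4']))   # False
```

In the loop above that would be: if len(lineList) == 9 and not has_negative_field(lineList):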