I'm trying to solve a problem where I need to clean text (get rid of all punctuation and spaces) and convert it all to the same case.
with open("moby_01.txt") as infile, open("moby_01_clean_3.txt", "w") as outfile:
    for line in infile:
        line.lower
        ...
        cleaned_words = line.split("-")
        cleaned_words = "\n".join(cleaned_words)
        cleaned_words = line.strip().split()
        cleaned_words = "\n".join(cleaned_words)
        outfile.write(cleaned_words)
I expect the output of the program to be a list of the words as they appear in the text, one per line. But it turns out that only the last three lines of the for loop have any effect, and the output is a list of words with punctuation:
Call
me
Ishmael.
Some
years
ago--never
mind
how
long
precisely--having
...
You might want to change this. You are using the line again here.
cleaned_words = line.strip().split()
to
cleaned_words = cleaned_words.strip().split()
I finally found how to solve this problem. The exercise book (The Quick Python Book, Third Edition, by Naomi Ceder), the Python documentation, and StackOverflow helped me.
with open("moby_01.txt") as infile, open("moby_01_clean.txt", "w") as outfile:
    for line in infile:
        cleaned_line = line.lower()
        cleaned_line = cleaned_line.translate(str.maketrans("-", " ", ".,?!;:'\"\n"))
        words = cleaned_line.split()
        cleaned_words = "\n".join(words)
        outfile.write(cleaned_words + "\n")
I moved the - sign from argument z of str.maketrans(x[, y[, z]]) to x, because otherwise some words joined by -- remained concatenated in the file. For the same reason I added "\n" in outfile.write(cleaned_words + "\n").
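To see why the hyphen has to move, here is a minimal sketch of the three-argument form of str.maketrans. The sample sentence and the names table, sample and cleaned are only for illustration. Each character in x is mapped to the character at the same position in y, and every character in z is deleted.

```python
# Map "-" to a space (x -> y) and delete the listed punctuation (z).
table = str.maketrans("-", " ", ".,?!;:'\"\n")

sample = "Some years ago--never mind how long precisely--having\n"
cleaned = sample.lower().translate(table)

# "--" becomes two spaces, so split() separates the words on either side.
print(cleaned.split())
```

If "-" were listed in z instead, "ago--never" would collapse to "agonever", because deleted characters leave nothing behind for split() to break on.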
I have made my own corpus of misspelled words.
misspellings_corpus.txt:
English, enlist->Enlish
Hallowe'en, Halloween->Hallowean
I'm having an issue with my format. Thankfully, it is at least consistent.
Current format:
correct, wrong1, wrong2->wrong3
Desired format:
wrong1,wrong2,wrong3->correct
The order of wrong<N> isn't of concern,
There might be any number of wrong<N> words per line (separated by a comma: ,),
There's only 1 correct word per line (which should be to the right of ->).
Failed Attempt:
with open('misspellings_corpus.txt') as oldfile, open('new.txt', 'w') as newfile:
    for line in oldfile:
        correct = line.split(', ')[0].strip()
        print(correct)
        W = line.split(', ')[1].strip()
        print(W)
        wrong_1 = W.split('->')[0] # however, there might be loads of wrong words
        wrong_2 = W.split('->')[1]
        newfile.write(wrong_1 + ', ' + wrong_2 + '->' + correct)
Output new.txt (isn't working):
enlist, Enlish->EnglishHalloween, Hallowean->Hallowe'en
Solution: (Inspired by #alexis)
import re

with open('misspellings_corpus.txt') as oldfile, open('new.txt', 'w') as newfile:
    for line in oldfile:
        #line = 'correct, wrong1, wrong2->wrong3'
        line = line.strip()
        terms = re.split(r", *|->", line)
        newfile.write(",".join(terms[1:]) + "->" + terms[0] + '\n')
Output new.txt:
enlist,Enlish->English
Halloween,Hallowean->Hallowe'en
Let's assume all the commas are word separators. I'll break each line on commas and arrows, for convenience:
import re
line = 'correct, wrong1, wrong2->wrong3'
terms = re.split(r", *|->", line)
new_line = ", ".join(terms[1:]) + "->" + terms[0]
print(new_line)
You can put that back in a file-reading loop, right?
I'd suggest building up a list, rather than assuming the number of elements. When you split on the comma, the first element is the correct word, elements [1:-1] are misspellings, and [-1] is going to be the one you have to split on the arrow.
I think you're also finding that write needs a newline character as in "\n" as suggested in the comments.
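The list-building idea above could be sketched roughly as follows. The function name reorder is made up for illustration, and the sketch assumes every line contains at least one comma and exactly one arrow, as in the corpus shown in the question.

```python
def reorder(line):
    # Split on commas; the first piece is the correct word.
    parts = [p.strip() for p in line.strip().split(",")]
    correct = parts[0]
    # Pieces [1:-1] are misspellings; the last piece still holds "wrongA->wrongB".
    wrong = parts[1:-1] + [w.strip() for w in parts[-1].split("->")]
    return ",".join(wrong) + "->" + correct

print(reorder("English, enlist->Enlish"))
print(reorder("correct, wrong1, wrong2->wrong3"))
```

When writing the result to a file, append "\n" to each returned string so every entry lands on its own line.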
I searched around a bit, but I couldn't find a solution that fits my needs.
I'm new to python, so I'm sorry if what I'm asking is pretty obvious.
I have a .txt file (for simplicity I will call it inputfile.txt) with a list of names of folder\files like this:
camisos\CROWDER_IMAG_1.mov
camisos\KS_HIGHENERGY.mov
camisos\KS_LOWENERGY.mov
What I need is to split the first word (the one before the \) and write it to a txt file (for simplicity I will call it outputfile.txt).
Then take the second (the one after the \) and write it in another txt file.
This is what I did so far:
with open("inputfile.txt", "r") as f:
    lines = f.readlines()
with open("outputfile.txt", "w") as new_f:
    for line in lines:
        text = input()
        print(text.split()[0])
This in my mind should print only the first word in the new txt, but I only got an empty txt file without any error.
Any advice is much appreciated, thanks in advance for any help you could give me.
You can read the file in a list of strings and split each string to create 2 separate lists.
with open("inputfile.txt", "r") as f:
    lines = f.readlines()
X = []
Y = []
for line in lines:
    X.append(line.split('\\')[0] + '\n')
    Y.append(line.split('\\')[1])
with open("outputfile1.txt", "w") as f1:
    f1.writelines(X)
with open("outputfile2.txt", "w") as f2:
    f2.writelines(Y)
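As a variation on the same idea, here is a sketch that uses str.partition, which does not raise an error when a line has no backslash. The helper names split_path and split_file are hypothetical; the file names are the ones used above.

```python
def split_path(line):
    """Return (folder, filename); filename is empty if there is no backslash."""
    # partition returns (before, separator, after) around the first backslash.
    folder, _, filename = line.rstrip("\n").partition("\\")
    return folder, filename


def split_file(in_path, folders_path, files_path):
    """Write the part before the backslash to one file and the rest to another."""
    with open(in_path) as src, \
            open(folders_path, "w") as folders, \
            open(files_path, "w") as files:
        for line in src:
            folder, filename = split_path(line)
            folders.write(folder + "\n")
            files.write(filename + "\n")


print(split_path("camisos\\CROWDER_IMAG_1.mov\n"))
```

Calling split_file("inputfile.txt", "outputfile1.txt", "outputfile2.txt") would then process the file line by line without building the two lists in memory first.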
This code lets the user input a sentence and converts the words into positions, e.g. bob is great bob becomes 0,1,2,0. It then converts those positions back into the original sentence, so 0,1,2,0 becomes bob is great bob. However, whenever I use punctuation in a sentence with a repeated word, like bob is great bob!, it prints the positions 0,1,2,0,4 and raises an error saying "list index out of range". I have no idea why this code works fine with punctuation when there are no repeated words, so I need help fixing this problem.
import re
sentence=('bob is great bob!')
punctuation=(re.findall(r"[\w'\"]+|[.,!?;:_-]", sentence))
print(punctuation)
positions = [punctuation.index(x) for x in punctuation]
print(positions)
test1=(",".join(str(i) for i in positions))
words=" ".join(sorted(set(punctuation), key=punctuation.index))
print(words)
with open('1.txt', 'w') as f:  # 'w' opens 1.txt for writing so the data can be saved; f is the file object
    f.write(str(words))
with open('2.txt', 'w') as f:  # the with block closes the file, so no f.close() is needed
    f.write(str(test1))
openfile=('1.txt')
openfile = open(openfile, "r")
print ( openfile.read())
openfile=('2.txt')
openfile = open(openfile, "r")
print ( openfile.read())
openfile = open('1.txt', "r")
test=openfile.read()
#print(test)
blankarray=[]
for i in test.split(" "):
    blankarray.append(i)
#print(blankarray)
openfile.close()
openfile = open('2.txt', "r")
test=openfile.read()
#print(test)
blankarray2=[]
for i in test.split(","):
    blankarray2.append(int(i))
#print(blankarray2)
openfile.close()
blankstring=""
for i in range(len(blankarray2)):
    #print(i)
    blankstring=blankstring+blankarray[blankarray2[i]] + " "
print(blankstring)
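One possible explanation, offered as a sketch rather than a definitive answer: the positions are indexes into the full token list (5 items for bob is great bob!), while the saved word list has duplicates removed (4 items), so position 4 points past its end. Without repeated words the two lists are identical, which is why that case works. Indexing into the deduplicated list instead avoids the mismatch. The names tokens, unique and rebuilt are made up here, and dict.fromkeys keeping insertion order assumes Python 3.7 or later.

```python
import re

sentence = "bob is great bob!"
tokens = re.findall(r"[\w'\"]+|[.,!?;:_-]", sentence)

# Unique tokens in order of first appearance.
unique = list(dict.fromkeys(tokens))

# Index into the unique list, not into the original token list.
positions = [unique.index(tok) for tok in tokens]
print(positions)

# Rebuilding from the unique list now stays within range.
rebuilt = " ".join(unique[i] for i in positions)
print(rebuilt)
```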
Here is the question:
I have a file with these words:
hey how are you
I am fine and you
Yes I am fine
The task is to find the number of words, lines and characters.
Below is my program, but the count of characters without spaces is not correct.
The number of words is correct and the number of lines is correct.
What is the mistake in the loop?
fname = input("Enter the name of the file:")
infile = open(fname, 'r')
lines = 0
words = 0
characters = 0
for line in infile:
    wordslist = line.split()
    lines = lines + 1
    words = words + len(wordslist)
    characters = characters + len(line)
print(lines)
print(words)
print(characters)
The output is:
lines=3(Correct)
words=13(correct)
characters=47
I've looked at multiple answers on the site and I am confused, because I haven't learned some of the other functions in Python yet. How do I correct the code while keeping it as simple and basic as the loop I've written?
The number of characters without spaces should be 35, and with spaces 45.
If possible, I want to find the number of characters without spaces. Even if someone only knows the loop for the number of characters with spaces, that's fine.
Sum up the length of all words in a line:
characters += sum(len(word) for word in wordslist)
The whole program:
with open('my_words.txt') as infile:
    lines=0
    words=0
    characters=0
    for line in infile:
        wordslist=line.split()
        lines=lines+1
        words=words+len(wordslist)
        characters += sum(len(word) for word in wordslist)
print(lines)
print(words)
print(characters)
Output:
3
13
35
This:
(len(word) for word in wordslist)
is a generator expression. It is essentially a loop in one line that produces the length of each word. We feed these lengths directly to sum:
sum(len(word) for word in wordslist)
Improved version
This version takes advantage of enumerate, so you save two lines of code, while keeping the readability:
with open('my_words.txt') as infile:
    words = 0
    characters = 0
    for lineno, line in enumerate(infile, 1):
        wordslist = line.split()
        words += len(wordslist)
        characters += sum(len(word) for word in wordslist)
print(lineno)
print(words)
print(characters)
This line:
with open('my_words.txt') as infile:
opens the file with the promise to close it as soon as you leave the indented block.
It is always good practice to close a file after you are done using it.
Remember that each line (except possibly the last) ends with a line separator,
i.e. "\r\n" on Windows or "\n" on Linux and Mac (when reading in text mode, Python hands both to you as "\n").
Thus, exactly two characters are added in this case, which is why you get 47 and not 45.
A nice way to overcome this could be to use:
import os
fname=input("enter the name of the file:")
infile=open(fname, 'r')
lines=0
words=0
characters=0
for line in infile:
    line = line.strip(os.linesep)
    wordslist=line.split()
    lines=lines+1
    words=words+len(wordslist)
    characters=characters+ len(line)
print(lines)
print(words)
print(characters)
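As a quick check of the numbers in this thread, the sketch below counts the sample text three ways. It uses io.StringIO as a stand-in for the file (an assumption made so the example is self-contained), and the variable names are made up for illustration.

```python
import io

# The three sample lines; the last one has no trailing newline.
text = "hey how are you\nI am fine and you\nYes I am fine"

with_newlines = 0
with_spaces = 0
without_spaces = 0
for line in io.StringIO(text):
    with_newlines += len(line)                 # what the original loop counts
    with_spaces += len(line.rstrip("\n"))      # newline removed, spaces kept
    without_spaces += sum(len(word) for word in line.split())

print(with_newlines, with_spaces, without_spaces)  # 47 45 35
```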
To count the characters, you should count each individual word. So you could have another loop that counts characters:
for word in wordslist:
    characters += len(word)
That ought to do it. The wordslist should probably take away newline characters on the right, something like wordslist = line.rstrip().split() perhaps.
I found this solution very simple and readable:
with open("filename", 'r') as file:
    text = file.read().strip().split()
len_chars = sum(len(word) for word in text)
print(len_chars)
This is too long for a comment.
Python 2 or 3? Because it really matters. Try out the following in your REPL for both:
Python 2.7.12
>>> len("taña")
5
Python 3.5.2
>>> len("taña")
4
Huh? The answer lies in Unicode. That ñ is a single character (code point U+00F1), but UTF-8 encodes it as two bytes. Python 2's str counts bytes (the 5 above assumes a UTF-8 terminal), while Python 3's str counts code points. So unless you're working with plain ASCII text, you'd better specify which version of Python your character-counting function is for.
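The difference can be reproduced in Python 3 alone by comparing the string with its UTF-8 bytes. This is a small sketch; the variable names are only illustrative. The last lines also show that the decomposed form (n followed by a combining tilde) really is two code points, so normalization matters as well.

```python
import unicodedata

word = "ta\u00f1a"  # "taña" with ñ as the single code point U+00F1

print(len(word))                   # 4 code points
print(len(word.encode("utf-8")))   # 5 bytes, because ñ takes two bytes in UTF-8

# NFD splits ñ into "n" plus the combining tilde U+0303.
decomposed = unicodedata.normalize("NFD", word)
print(len(decomposed))             # 5 code points
```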
How's this? It uses a regular expression to match all non-whitespace characters and returns the number of matches within a string.
import re
DATA="""
hey how are you
I am fine and you
Yes I am fine
"""
def get_char_count(s):
    return len(re.findall(r'\S', s))

if __name__ == '__main__':
    print(get_char_count(DATA))
Output
35
The image below shows this tested on RegExr:
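It is probably counting newline characters. Subtract the number of newlines from characters: that is (lines - 1) when the last line has no trailing newline, as here (47 - 2 = 45).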
Here is the code:
fp = open(fname, 'rb').read()  # read raw bytes
chars = fp.decode('utf8')  # decode so len() counts characters, not bytes
print(len(chars))
Check the output. I just tested it.
A more Pythonic solution than the others:
with open('foo.txt') as f:
    text = f.read().splitlines() # list of lines
lines = len(text) # length of the list = number of lines
words = sum(len(line.split()) for line in text) # split each line on spaces, sum up the lengths of the lists of words
characters = sum(len(line) for line in text) # sum up the length of each line
print(lines)
print(words)
print(characters)
The other answers here are manually doing what str.splitlines() does. There's no reason to reinvent the wheel.
You do have the correct answer, and your code is completely correct. What I think is happening is that an end-of-line character is being counted on each line, which increases your character count by two (there isn't one on the last line, as there is no new line to go to). If you want to remove this, the simple fudge would be to do as Loaf suggested
characters = characters - (lines - 1)
See csl's answer for the second part...
Simply skip unwanted characters while calling len,
characters = characters + len([c for c in line if c not in ('\n', ' ')])
or sum the count,
characters = characters + sum(1 for c in line if c not in ('\n', ' '))
or build a str from the wordslist and take len,
characters = characters + len(''.join(wordslist))
or sum the characters in the wordslist. I think this is the fastest.
characters = characters + sum(1 for word in wordslist for char in word)
You have two problems. One is the line endings and the other is the spaces in between.
Now there are many people who posted pretty good answers, but I find this method easier to understand:
characters = characters + len(line.strip()) - line.strip().count(' ')
line.strip() removes the leading and trailing whitespace, including the newline. Then I'm subtracting the number of spaces from the total length.
It's very simple:
f = open('file.txt', 'rb')
f.seek(0) # Move to the start of file
print(len(f.read()))  # number of bytes, including spaces and newlines
Here is a short program for your problem:
with open('FileName.txt') as f:
    lines = f.readlines()
data = ''.join(lines)
print('lines =', len(lines))
print('Words =', len(data.split()))
data = ''.join(data.split())
print('characters =', len(data))
lines is a list of the lines, so its length is the number of lines. Next, data holds the file contents as one string, so splitting data gives the list of words in the file, and the length of that list is the number of words. Joining the word list again gives all non-whitespace characters as a single string, so its length is the number of characters.
This takes the file name (e.g. files.txt) as input, counts the total number of characters in the file, and saves the result in the variable char.
fname = input("Enter the name of the file:")
infile = open(fname, 'r') # open the file for reading
lines = 0
words = 0
char = 0 # init as zero integer
for line in infile:
    wordslist = line.split() # split the line into words
    lines = lines + 1 # count the line
    words = words + len(wordslist) # add the number of words in this line
    char = char + len(line) # add the line length (includes spaces and the newline)
print("lines are: " + str(lines))
print("words are: " + str(words))
print("chars are: " + str(char)) # print the results
num_lines = sum(1 for line in open('filename.txt'))
num_words = sum(1 for word in open('filename.txt').read().split())
num_chars = sum(len(word) for word in open('filename.txt').read().split())
How would I do this? I want to iterate through each word and see if it fits certain parameters (for example is it longer than 4 letters..etc. not really important though).
The text file is literally a rambling of text with punctuation and white spaces, much like this posting.
Try split()ing the string.
f = open('your_file')
for line in f:
    for word in line.split():
        pass  # do something with word
If you want it without punctuation:
f = open('your_file')
for line in f:
    for word in line.split():
        word = word.strip('.,?!')
        # do something
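Putting the pieces together, a filter like the "longer than 4 letters" check from the question might look like the sketch below. The function name long_words and the sample lines are made up for illustration; string.punctuation is used so that more than the four characters above are stripped from the ends of each word.

```python
import string

def long_words(lines, min_length=5):
    """Yield words of at least min_length characters, punctuation stripped."""
    for line in lines:
        for word in line.split():
            # Remove punctuation from both ends only; "don't" keeps its apostrophe.
            word = word.strip(string.punctuation)
            if len(word) >= min_length:
                yield word

sample = [
    "The text file is literally a rambling of text,",
    "with punctuation and white spaces.",
]
print(list(long_words(sample)))
```

Because a file object is itself an iterable of lines, the same function can be called as long_words(f) inside a with open(...) as f: block.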
You can simply use content.split():
f = open(filename, "r")
lines = f.readlines()
for i in lines:
    thisline = i.split(" ")
data = open("file").read().split()
for item in data:
    if len(item) > 4:
        print("longer than 4:", item)