If the text file is \n\n Hello world!\n I like python.\n
How do I get the first word from that text?
I tried to code:
def word_file(file):
    files = open(file, 'r')
    l = files.readlines()
    for i in range(len(l)):
        a = l[i].rstrip("\n")
        line = l[0]
        word = line.strip().split(" ")[0]
    return word
There is a space in front of Hello.
The result I get is None. How should I correct it?
Can anybody help?
Assuming there is a word in the file:
def word_file(f):
    with open(f) as file:
        return file.read().split()[0]
file.read reads the entire file as a string. Do a split with no parameters on that string (i.e. sep=None). Then according to the Python manual "runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace." So the splitting will be done on consecutive white space and there will be no empty strings returned as a result of the split. Therefore the first element of the returned list will be the first word in the file.
If there is a possibility that the file is empty or contains nothing but white space, then you would need to check the return value from file.read().split() to ensure it is not an empty list.
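A minimal sketch of that check, assuming you would rather get a default value back than an IndexError. The name first_word and the default parameter are made up for illustration:

```python
def first_word(path, default=None):
    """Return the first whitespace-delimited word in the file, or `default`."""
    with open(path) as file:
        words = file.read().split()
    # split() returns an empty list for an empty or whitespace-only file
    return words[0] if words else default
```

With the file from the question, first_word('yourfile.txt') would return 'Hello'; for an empty file it returns None.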
If you need to avoid having to read the entire file into memory at once, then the following, less terse code can be used:
def word_file(f):
    with open(f) as file:
        for line in file:
            words = line.split()
            if words:
                return words[0]
    return None # No words found
Edit: @Booboo's answer is far better than mine.
This should work:
def word_file(file):
    with open(file, 'r') as f:
        for line in f:
            for index, character in enumerate(line):
                if not character.isspace():
                    line = line[index:]
                    for ind, ch in enumerate(line):
                        if ch.isspace():
                            return line[:ind]
                    return line # could not find whitespace character at end
    return None # no words found
output:
Hello
I have written this program and just want to know if it has been written properly, because I am new to this and don't know much!
def read_words(words_file):
    """ (file open for reading) -> list of str
    Return a list of all words (with newlines removed) from open file
    words_file.
    Precondition: Each line of the file contains a word in uppercase characters
    from the standard English alphabet.
    """
    line = words_file.readline()
    while line != '':
        line= words_file.readline()
        words_list.append(line.rstrip('\n'))
    words_file.close()
    return words_file
Instead of a while loop, you can use a for loop like this:
words_list = []
with open('path/to/file.txt', 'r') as words_file:
    # iterating over the file object yields one line at a time
    for line in words_file:
        words_list.append(line.rstrip('\n'))
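For reference, the function in the question has three problems: words_list is never created, the first line is read and thrown away before anything is appended (and an empty string is appended at the end), and the file object is returned instead of the list. A minimal sketch of a fixed version, assuming the caller opens and closes the file itself:

```python
def read_words(words_file):
    """Return a list of all words (with newlines removed) from an open file."""
    words_list = []                      # must exist before appending to it
    for line in words_file:              # a file object yields one line per iteration
        words_list.append(line.rstrip('\n'))
    return words_list                    # return the list, not the file object
```

You would call it as read_words(f) inside a with open(...) as f: block, so the file is closed for you afterwards.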
After defining two keywords, my goal is to:
read full contents of an unstructured text file (1000+ lines of text)
loop through contents, fetch 60 characters to the left of keyword each time it is hit
append each 60 character string in a separate line of a new text file
I have the code to read unstructured text file and write to the new text file.
I am having trouble creating code which will seek each keyword, fetch contents, then loop through end of file.
Very simply, here is what I have so far:
#read file, store in variable
content=open("demofile.txt", "r")
#seek "KW1" or "KW2", take 60 characters to the left, append to text file, loop
#open a text file, write variable contents, close file
file=open("output.txt","w")
file.writelines(content)
file.close()
I need help with the middle portion of this code. For example, if source text file says:
"some text, some text, some text, KEYWORD"
I would like to return:
"some text, some text, some text, "
In a new row for each keyword found.
Thank you.
result = []
# Open the file
with open('your_file') as f:
    # Iterate through lines
    for line in f.readlines():
        # Find the start of the word
        index = line.find('your_word')
        # If the word is inside the line
        if index != -1:
            if index < 60:
                result.append(line[:index])
            else:
                result.append(line[index-60:index])
After that, you can write result to a file.
If you have several words, you can modify your code like this:
words = ['waka1', 'waka2', 'waka3']
result = []
# Open the file
with open('your_file') as f:
    # Iterate through lines
    for line in f.readlines():
        for word in words:
            # Find the start of the word
            index = line.find(word)
            # If the word is inside the line
            if index != -1:
                if index < 60:
                    result.append(line[:index])
                else:
                    result.append(line[index-60:index])
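Note that line.find only reports the first hit in each line. If a keyword can appear more than once, or the 60 characters may span several lines, something along these lines might be closer to what the question asks for. This is only a sketch: contexts_before and write_contexts are made-up names, and it works on the whole file contents as one string rather than line by line.

```python
def contexts_before(text, keywords, width=60):
    """Collect up to `width` characters preceding every occurrence of each keyword."""
    found = []
    for keyword in keywords:
        start = text.find(keyword)
        while start != -1:
            # max() keeps the slice start at 0 when the keyword is near the beginning
            found.append(text[max(0, start - width):start])
            # continue searching just past the current hit
            start = text.find(keyword, start + 1)
    return found


def write_contexts(contexts, out_path):
    """Write one context per line (hypothetical helper for illustration)."""
    with open(out_path, "w") as out:
        for context in contexts:
            # flatten embedded newlines so each context stays on one row
            out.write(context.replace("\n", " ") + "\n")
```

The results come out grouped by keyword rather than in document order. You would read the input with open('demofile.txt').read(), pass that string to contexts_before, and hand the returned list to write_contexts.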
You could go for a regex based solution as well!
import re
# r before the string makes it a raw string so the \'s aren't used as escape chars.
# \b indicates a word border to regex. like a new line, space, tab, punctuation, etc...
kwords = [r"\bparameter\b", r"\bpointer\b", r"\bfunction\b"]
in_file = "YOUR_IN_FILE"
out_file = "YOUR_OUT_FILE"
patterns = [r"([\s\S]{{0,60}}?){}".format(i) for i in kwords]
# patterns is now a list of regex pattern strings which will match between 0-60
# characters (as many as possible) followed by a word border, followed by your
# keyword, and finally followed by another word border. If you don't care about
# the word borders then remove both the \b from each string. The actual object
# matched will only be the 0-60 characters before your parameter and not the
# actual parameter itself.
# This WILL include newlines when trying to scan backwards 60 characters.
# If you DON'T want to include newlines, change the `[\s\S]` in patterns to `.`
with open(in_file, "r") as f:
    data = f.read()
with open(out_file, "w") as f:
    for pattern in patterns:
        matches = re.findall(pattern, data)
        # The above will find all occurrences of your pattern and return a list of
        # occurrences, as strings.
        matches = [i.replace("\n", " ") for i in matches]
        # The above replaces any newlines we found with a space.
        # Now we can print the messages for you to see
        print("Matches for " + pattern + ":", end="\n\t")
        for match in matches:
            print(match, end="\n\t")
            # and write them to a file
            f.write(match + "\n")
        print("\n")
Depending on the specifics of what you need captured, you should have enough information here to adapt it to your problem. Leave a comment if you have any questions about regex.
I want to strip the \n character (.rstrip('\n')) from each line of a text file (dictionary.txt) that contains 120,000+ words, then count the lines and return the number of words in the file (each word is on its own line).
Finally, I want all the words to be stored in a list.
At the moment, the code below returns the number of lines but doesn't strip the \n character, so the words can't be stored cleanly in a list.
def lines_count():
    with open('dictionary.txt') as file:
        print (len(file.readlines()))
If you want the list of lines without the trailing newline character, you can read the file as a single string with file_obj.read() and call str.splitlines() on it. That said, there is no need to do so: the file object returned by open is already an iterator over the lines, so you can simply strip the trailing newline while processing each line, or use str.strip with map to create an iterator of stripped lines:
with open('dictionary.txt') as f:
    # map is lazy: iterate over stripped_lines inside this with block
    stripped_lines = map(str.strip, f)
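To make the difference between these options concrete, here is a small sketch on an in-memory string instead of dictionary.txt. The sample words are made up, and io.StringIO just stands in for the open file:

```python
import io

sample = "apple\n  pear  \nplum\n"   # made-up stand-in for dictionary.txt

# splitlines() drops the line endings but keeps any other whitespace
by_splitlines = sample.splitlines()

# rstrip('\n') on each line gives the same result here
by_rstrip = [line.rstrip('\n') for line in io.StringIO(sample)]

# str.strip removes all leading and trailing whitespace, not just the newline
by_strip = list(map(str.strip, io.StringIO(sample)))

print(by_splitlines)   # ['apple', '  pear  ', 'plum']
print(by_rstrip)       # ['apple', '  pear  ', 'plum']
print(by_strip)        # ['apple', 'pear', 'plum']
print(len(by_strip))   # 3
```

With one word per line, the length of any of these lists is the word count.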
But if you just want to count the words, a Pythonic way is to use a generator expression inside sum, like the following:
with open('dictionary.txt') as f:
    word_count = sum(len(line.split()) for line in f)
Note that there is no need to strip the new lines while you're splitting the line.
e.g.
In [14]: 'sd f\n'.split()
Out[14]: ['sd', 'f']
But if you still want all the words in a list you can use a list comprehension instead of a generator expression:
with open('dictionary.txt') as f:
    all_words = [word for line in f for word in line.split()]
word_count = len(all_words)
If you want to return a list of lines without \n and then print the length of this list:
def line_list(fname):
    with open(fname) as file:
        return file.read().splitlines()
word_list = line_list('dictionary.txt') # 1 word per line
print(len(word_list))
Here is the question:
I have a file with these words:
hey how are you
I am fine and you
Yes I am fine
And it is asked to find the number of words, lines and characters.
Below is my program, but the number of counts for the characters without space is not correct.
The number of words is correct and the number of line is correct.
What is the mistake in the same loop?
fname = input("Enter the name of the file:")
infile = open(fname, 'r')
lines = 0
words = 0
characters = 0
for line in infile:
    wordslist = line.split()
    lines = lines + 1
    words = words + len(wordslist)
    characters = characters + len(line)
print(lines)
print(words)
print(characters)
The output is:
lines=3(Correct)
words=13(correct)
characters=47
I've looked on the site with multiple answers and I am confused because I didn't learn some other functions in Python. How do I correct the code as simple and basic as it is in the loop I've done?
Whereas the number of characters without space is 35 and with space is 45.
If possible, I want to find the number of characters without space. Even if someone know the loop for the number of characters with space that's fine.
Sum up the length of all words in a line:
characters += sum(len(word) for word in wordslist)
The whole program:
with open('my_words.txt') as infile:
    lines=0
    words=0
    characters=0
    for line in infile:
        wordslist=line.split()
        lines=lines+1
        words=words+len(wordslist)
        characters += sum(len(word) for word in wordslist)
print(lines)
print(words)
print(characters)
Output:
3
13
35
This:
(len(word) for word in wordslist)
is a generator expression. It is essentially a loop in one line that produces the length of each word. We feed these lengths directly to sum:
sum(len(word) for word in wordslist)
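If generator expressions are new to you, this sketch shows the same sum written as an ordinary loop. The sample line is taken from the question's file; the variable names total_short and total_long are only for illustration:

```python
wordslist = "I am fine and you".split()

# Generator-expression form from the answer
total_short = sum(len(word) for word in wordslist)

# Equivalent explicit loop
total_long = 0
for word in wordslist:
    total_long += len(word)

print(total_short, total_long)  # 13 13
```

Both forms give the same number; the generator expression is just shorter.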
Improved version
This version takes advantage of enumerate, so you save two lines of code, while keeping the readability:
with open('my_words.txt') as infile:
    words = 0
    characters = 0
    for lineno, line in enumerate(infile, 1):
        wordslist = line.split()
        words += len(wordslist)
        characters += sum(len(word) for word in wordslist)
print(lineno)
print(words)
print(characters)
This line:
with open('my_words.txt') as infile:
opens the file with the promise to close it as soon as you leave the indented block.
It is always good practice to close a file after you are done using it.
Remember that each line (except possibly the last) ends with a line separator,
i.e. "\r\n" on Windows or "\n" on Linux and Mac (in text mode, Python reads either as a single "\n").
Thus exactly two extra characters are counted in this case, which is why you get 47 and not 45.
A nice way to overcome this could be to use:
import os
fname=input("enter the name of the file:")
infile=open(fname, 'r')
lines=0
words=0
characters=0
for line in infile:
    line = line.strip(os.linesep)
    wordslist=line.split()
    lines=lines+1
    words=words+len(wordslist)
    characters=characters+ len(line)
print(lines)
print(words)
print(characters)
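To see where the three numbers in the question (47, 45 and 35) come from, here is a small sketch that runs the three ways of counting side by side. It assumes the file has no newline after the last line, and uses io.StringIO in place of a real file:

```python
import io

# The three lines from the question; the last line has no trailing newline
sample = "hey how are you\nI am fine and you\nYes I am fine"

raw = with_spaces = without_spaces = 0
for line in io.StringIO(sample):
    raw += len(line)                                      # counts '\n' too
    with_spaces += len(line.rstrip('\n'))                 # newline removed
    without_spaces += sum(len(w) for w in line.split())   # all whitespace removed

print(raw, with_spaces, without_spaces)  # 47 45 35
```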
To count the characters, you should count each individual word. So you could have another loop that counts characters:
for word in wordslist:
    characters += len(word)
That ought to do it. The wordslist should probably take away newline characters on the right, something like wordslist = line.rstrip().split() perhaps.
I found this solution very simple and readable:
with open("filename", 'r') as file:
    text = file.read().strip().split()
len_chars = sum(len(word) for word in text)
print(len_chars)
This is too long for a comment.
Python 2 or 3? Because it really matters. Try out the following in your REPL for both:
Python 2.7.12
>>>len("taña")
5
Python 3.5.2
>>>len("taña")
4
Huh? The answer lies in Unicode. That ñ is one character, but not one byte: Python 2's str counts bytes (ñ takes two in UTF-8), while Python 3's str counts code points. So unless you're working with plain ASCII text, you'd better specify which version of Python your character-counting function is for.
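Here is a Python 3 sketch of the same effect. It also shows that the count changes again if the ñ is stored as a plain n followed by a combining tilde; the NFD normalization step is my own addition to illustrate that case:

```python
import unicodedata

word = "ta\u00f1a"                                # "taña" with a precomposed ñ
decomposed = unicodedata.normalize("NFD", word)   # n followed by a combining tilde

print(len(word))                   # 4 code points
print(len(word.encode("utf-8")))   # 5 bytes
print(len(decomposed))             # 5 code points
```

So "number of characters" depends on whether you mean bytes, code points, or what a reader would see as one letter.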
How's this? It uses a regular expression to match all non-whitespace characters and returns the number of matches within a string.
import re
DATA="""
hey how are you
I am fine and you
Yes I am fine
"""
def get_char_count(s):
    return len(re.findall(r'\S', s))
if __name__ == '__main__':
    print(get_char_count(DATA))
Output
35
The image below shows this tested on RegExr:
It is probably counting newline characters. Subtract (lines+1) from characters.
Here is the code:
fp = open(fname, 'r+').read()
chars = fp.decode('utf8')
print len(chars)
Check the output. I just tested it.
A more Pythonic solution than the others:
with open('foo.txt') as f:
    text = f.read().splitlines() # list of lines
lines = len(text) # length of the list = number of lines
words = sum(len(line.split()) for line in text) # split each line on spaces, sum up the lengths of the lists of words
characters = sum(len(line) for line in text) # sum up the length of each line
print(lines)
print(words)
print(characters)
The other answers here are manually doing what str.splitlines() does. There's no reason to reinvent the wheel.
You do have the correct answer, and your code is completely correct. What I think is happening is that an end-of-line character is being counted on each line, which increases your character count by two (there isn't one on the last line, as there is no new line to go to). If you want to remove this, the simple fudge would be to do as Loaf suggested:
characters = characters - (lines - 1)
See csl's answer for the second part...
Simply skip unwanted characters while calling len,
# in text mode every line ending is read as '\n', so compare against that
characters=characters+ len([c for c in line if c not in ('\n', ' ')])
or sum the count,
characters=characters+ sum(1 for c in line if c not in ('\n', ' '))
or build a str from the wordslist and take len,
characters=characters+ len(''.join(wordslist))
or sum the characters in the wordslist. I think this is the fastest.
characters=characters+ sum(1 for word in wordslist for char in word)
You have two problems. One is the line endings and the other is the spaces in between.
Now there are many people who posted pretty good answers, but I find this method easier to understand:
characters = characters + len(line.strip()) - line.strip().count(' ')
line.strip() removes the trailing and leading spaces. Then I'm subtracting the number of spaces from the total length.
It's very simple:
f = open('file.txt', 'rb')
f.seek(0) # Move to the start of file
print len(f.read())
Here is a short program for your problem:
with open('FileName.txt') as f:
    lines = f.readlines()
data = ''.join(lines)
print('lines =',len(lines))
print('Words = ',len(data.split()))
data = ''.join(data.split())
print('characters = ',len(data))
lines is a list of lines, so its length is the number of lines. Next, data contains the file contents as one string, so splitting data gives the list of words in your file, and the length of that list is the number of words. Finally, joining the list of words gives all the non-whitespace characters as a single string, so its length is the number of characters.
Take the file name (e.g. files.txt) as input, then count the total number of characters in the file and store it in the variable char:
fname = input("Enter the name of the file:")
infile = open(fname, 'r') # open the file for reading
lines = 0
words = 0
char = 0 # init as zero integer
for line in infile:
    wordslist = line.split() # split the line into words
    lines = lines + 1 # count the line
    words = words + len(wordslist) # add the number of words on this line
    char = char + len(line) # add the number of characters (newline included)
print("lines are: " + str(lines))
print("words are: " + str(words))
print("chars are: " + str(char)) # formatted output
num_lines = sum(1 for line in open('filename.txt'))
num_words = sum(1 for word in open('filename.txt').read().split())
num_chars = sum(len(word) for word in open('filename.txt').read().split())
I've got txt file with list of words, something like this:
adsorbowanie
adsorpcje
adular
adwena
adwent
adwentnio
adwentysta
adwentystka
adwersarz
adwokacjo
And I want to delete the last letter in every word, if that letter is "a" or "o".
I'm very new to this, so please explain this simply.
re.sub(r"[ao]$","",word)
This should do it for you.
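As a quick illustration of how that call behaves, here it is applied to a few of the words from the question. The list is typed in by hand rather than read from the file:

```python
import re

words = ["adsorbowanie", "adular", "adwena", "adwentnio", "adwokacjo"]

# [ao]$ matches a single 'a' or 'o' only at the very end of the string
trimmed = [re.sub(r"[ao]$", "", word) for word in words]

print(trimmed)
# ['adsorbowanie', 'adular', 'adwen', 'adwentni', 'adwokacj']
```

Words that end in any other letter are returned unchanged.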
Try this:
import re
# Read the file, split the contents into a list of lines,
# removing line separators
with open('input.txt') as infile:
    lines = infile.read().splitlines()
# Remove any whitespace around the word.
# If you are certain the list doesn't contain whitespace
# around the word, you can leave this out...
# (this is called a "list comprehension", by the way)
lines = [line.strip() for line in lines]
# Remove letters if necessary, using regular expressions.
outlines = [re.sub('[ao]$', '', line) for line in lines]
# Join the output with newlines; in text mode Python translates
# '\n' to the platform's line separator when writing
outdata = '\n'.join(outlines)
# Write the output to a file
with open('output.txt', 'w') as outfile:
    outfile.write(outdata)
First read the file and split the lines. After that cut off the last char if your condition is fulfilled and append the new string to a list containing the analysed and modified strings/lines:
#!/usr/bin/env python3
# coding: utf-8
# open file, read lines and store them in a list
with open('words.txt') as f:
    lines = f.read().splitlines()
# analyse lines read from file
new_lines = []
for s in lines:
    # analyse last char of string,
    # get rid of it if condition is fulfilled and append string to new list
    # (endswith is safe on empty lines, where s[-1] would raise IndexError)
    s = s[:-1] if s.endswith(('a', 'o')) else s
    new_lines.append(s)
print(new_lines)