I'm reading a text file with a script that identifies specific characteristics in the text. Everything works fine until it reaches the spaces count, where it reports 15 spaces instead of 6.
The text file is
Hello
Do school units regularly
Attend seminars
Study 4 tests
Bye
and the script is
def main():
    lower_case = 0
    upper_case = 0
    numbers = 0
    whitespace = 0
    with open("text.txt", "r") as in_file:
        for line in in_file:
            lower_case += sum(1 for x in line if x.islower())
            upper_case += sum(1 for x in line if x.isupper())
            numbers += sum(1 for x in line if x.isdigit())
            whitespace += sum(1 for x in line if x.isspace())
    print 'Lower case Letters: %s' % lower_case
    print 'Upper case Letters: %s' % upper_case
    print 'Numbers: %s' % numbers
    print 'Spaces: %s' % whitespace

main()
Is there anything that should be changed so the number of spaces will turn up as 6?
This happens because line breaks also count as whitespace. The file you are opening was probably created on Windows, where a line break is two characters (a carriage return followed by the actual line feed). Since you have five lines, that adds up to 10 extra whitespace characters, for a total of 16 (one goes missing somewhere; I can only guess that one of the lines ends with a different line break that lacks the carriage return).
To fix it, just strip the line when you count whitespaces.
whitespace += sum(1 for x in line.strip() if x.isspace())
This will, however, also strip out any trailing and leading spaces which are not line breaks. To only strip out linebreaks from the end, you can do
whitespace += sum(1 for x in line.rstrip("\r\n") if x.isspace())
Another possibility is to not use isspace() but rather check for the characters you want, e.g.
whitespace += line.count(' ') + line.count('\t')
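As a rough illustration of the three counting variants above, here is a small sketch that works on an in-memory copy of the sample text instead of the file. It assumes every line ends in a Windows-style "\r\n", which is a guess about the original file, and the helper names are made up for this example.

```python
# Sample text from the question; the "\r\n" endings are an assumption
sample = ("Hello\r\nDo school units regularly\r\nAttend seminars\r\n"
          "Study 4 tests\r\nBye\r\n")

# Keep the line endings, the way iterating over a file would
lines = sample.splitlines(True)

def count_isspace(lines):
    # Counts every whitespace character, including "\r" and "\n"
    return sum(1 for line in lines for x in line if x.isspace())

def count_stripped(lines):
    # Removes only the trailing line break before counting
    return sum(1 for line in lines for x in line.rstrip("\r\n") if x.isspace())

def count_explicit(lines):
    # Counts only spaces and tabs, ignoring line breaks entirely
    return sum(line.count(' ') + line.count('\t') for line in lines)

print(count_isspace(lines), count_stripped(lines), count_explicit(lines))
```

Under that assumption the first variant is inflated by two characters per line, while the other two report only the spaces inside the lines.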
Related
I just wrote a function which prints the percentage of each character in a text file. However, I have a problem: my program treats uppercase characters as different characters and also counts spaces, so the result is wrong. How can I fix this?
def count_char(text, char):
    count = 0
    for character in text:
        if character == char:
            count += 1
    return count

filename = input("Enter the file name: ")
with open(filename) as file:
    text = file.read()
for char in "abcdefghijklmnopqrstuvwxyz":
    perc = 100 * count_char(text, char) / len(text)
    print("{0} - {1}%".format(char, round(perc, 2)))
Try making the text lower case with text.lower(), and, to avoid counting spaces, split the string into a list of words with text.lower().split(). This should do:
def count_char(text, char):
    count = 0
    for word in text.lower().split():  # this iterates returning every word in the text
        for character in word:  # this iterates returning every character in each word
            if character == char:
                count += 1
    return count

filename = input("Enter the file name: ")
with open(filename) as file:
    text = file.read()
totalChars = sum([len(i) for i in text.lower().split()])
for char in "abcdefghijklmnopqrstuvwxyz":
    perc = 100 * count_char(text, char) / totalChars
    print("{0} - {1}%".format(char, round(perc, 2)))
Notice the change in the definition of perc: sum([len(i) for i in text.lower().split()]) returns the number of characters in the list of words, whereas len(text) also counts spaces.
You can use a counter and a generator expression to count all letters like so:
from collections import Counter
with open(fn) as f:
    c=Counter(c.lower() for line in f for c in line if c.isalpha())
Explanation of generator expression:
c=Counter(c.lower() for line in f # continued below
  ^                                 create a counter
          ^ ^                       each character, make lower case
                        ^           read one line from the file
# continued
for c in line if c.isalpha())
    ^                               one character from each line of the file
         ^                          iterate over line one character at a time
              ^                     only add if it is a letter (a-zA-Z)
Then get the total letter counts:
total_letters=float(sum(c.values()))
Then the total percent of any letter is c[letter] / total_letters * 100
Note that the Counter c only has letters -- not spaces. So the calculated percent of each letter is the percent of that letter of all letters.
The advantages here:
You are reading the entire file anyway to get the total count of the character in question and the total of all characters. You might as well count the frequency of all characters as you read them;
You do not need to read the entire file into memory. That is fine for smaller files but not for larger ones;
A Counter will correctly return 0 for letters not in the file;
Idiomatic Python.
So your entire program becomes:
from collections import Counter

with open(fn) as f:
    c=Counter(c.lower() for line in f for c in line if c.isalpha())
total_letters=float(sum(c.values()))
for char in "abcdefghijklmnopqrstuvwxyz":
    print("{} - {:.2%}".format(char, c[char] / total_letters))
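To try the Counter approach without a file on disk, here is a minimal sketch that feeds it an in-memory stand-in. The io.StringIO object and its sample text are assumptions made for illustration; any open text file would be iterated line by line in the same way.

```python
from collections import Counter
import io

# Hypothetical stand-in for an open file; iterating it yields lines like a file does
f = io.StringIO(u"Hello World\nAbc 123\n")

# Count only letters, folded to lower case
c = Counter(ch.lower() for line in f for ch in line if ch.isalpha())
total_letters = float(sum(c.values()))

for char in "lz":
    # A Counter returns 0 for letters it never saw, so 'z' is safe to look up
    print("{} - {:.2%}".format(char, c[char] / total_letters))
```

Digits, spaces, and line breaks never enter the Counter, so the percentages are relative to letters only.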
You want to make the text lower case before counting the char:
def count_char(text, char):
    count = 0
    for character in text.lower():
        if character == char:
            count += 1
    return count
You can use the built-in .count method to count the characters after converting everything to lowercase via .lower. Additionally, your current program doesn't work properly because len(text) also includes spaces and punctuation.
import string

filename = input("Enter the file name: ")
with open(filename) as file:
    text = file.read().lower()
chars = {char:text.count(char) for char in string.ascii_lowercase}
allLetters = float(sum(chars.values()))
for char in chars:
    print("{} - {}%".format(char, round(chars[char]/allLetters*100, 2)))
I open a dictionary file and pull specific lines from it. The lines are specified using a list, and at the end I need to print a complete sentence on one line.
I want to open a dictionary that has one word on each line,
then print a sentence on one line with a space between the words:
N = ['19','85','45','14']
file = open("DICTIONARY", "r")
my_sentence = #?????????
print my_sentence
If your DICTIONARY is not too big (i.e. it fits in memory):
N = [19,85,45,14]
with open("DICTIONARY", "r") as f:
    words = f.readlines()
my_sentence = " ".join([words[i].strip() for i in N])
EDIT: A small clarification, the original post didn't use space to join the words, I've changed the code to include it. You can also use ",".join(...) if you need to separate the words by a comma, or any other separator you might need. Also, keep in mind that this code uses zero-based line index so the first line of your DICTIONARY would be 0, the second would be 1, etc.
UPDATE:: If your dictionary is too big for your memory, or you just want to consume as little memory as possible (if that's the case, why would you go for Python in the first place? ;)) you can only 'extract' the words you're interested in:
N = [19, 85, 45, 14]
words = {}
word_indexes = set(N)
counter = 0
with open("DICTIONARY", "r") as f:
    for line in f:
        if counter in word_indexes:
            words[counter] = line.strip()
        counter += 1
my_sentence = " ".join([words[i] for i in N])
You can use linecache.getline to fetch the specific lines you want (note that linecache numbers lines starting from 1, and N is the list of line numbers):
import linecache

sentence = []
for line_number in N:
    word = linecache.getline('DICTIONARY',line_number)
    sentence.append(word.strip('\n'))
sentence = " ".join(sentence)
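For a self-contained way to try the linecache approach, the sketch below writes a tiny dictionary to a temporary file first. The five words and the temporary file are invented for the example; with a real DICTIONARY file you would pass its path instead.

```python
import linecache
import os
import tempfile

# Hypothetical five-word dictionary, one word per line, written to a temp file
words = ["apple", "is", "red", "the", "very"]
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
tmp.write("\n".join(words) + "\n")
tmp.close()

N = [4, 1, 2, 3]  # 1-based line numbers, which is what linecache.getline expects

# getline returns the line including its newline, so strip it before joining
sentence = " ".join(linecache.getline(tmp.name, n).strip() for n in N)

linecache.clearcache()  # drop the cached copy of the temp file
os.remove(tmp.name)
print(sentence)
```

Because the words are looked up in the order given by N, the sentence follows the list order rather than the order of the lines in the file.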
Here's a simple one with a more basic approach:
n = ['2','4','7','11']
file = open("DICTIONARY")
counter = 1  # 1 if you're gonna count lines in DICTIONARY
             # from 1, else 0 is used
output = ""
for line in file:
    line = line.rstrip()  # rstrip() method to delete \n character,
                          # if not used, print ends with every
                          # word from a new line
    if str(counter) in n:
        output += line + " "
    counter += 1
print output[:-1]  # slicing is used for a white space deletion
                   # after last word in string (optional)
So I'm trying to do this problem
Write a program that reads a file named text.txt and prints the following to the
screen:
The number of characters in that file
The number of letters in that file
The number of uppercase letters in that file
The number of vowels in that file
I am stuck on step 2. This is what I have so far:
file = open('text.txt', 'r')
lineC = 0
chC = 0
lowC = 0
vowC = 0
capsC = 0
for line in file:
    for ch in line:
        words = line.split()
        lineC += 1
        chC += len(ch)
for letters in file:
    for ch in line:
print("Charcter Count = " + str(chC))
print("Letter Count = " + str(num))
You can do this using regular expressions. Find all occurrences of each pattern as a list, then take the length of that list.
import re

with open('text.txt') as f:
    text = f.read()
characters = len(re.findall(r'\S', text))  # every non-whitespace character
letters = len(re.findall('[A-Za-z]', text))
uppercase = len(re.findall('[A-Z]', text))
vowels = len(re.findall('[AEIOUYaeiouy]', text))
The answer above uses regular expressions, which are very useful and worth learning about if you haven't used them before. Bunji's code is also more efficient, as looping through characters in a string in Python is relatively slow.
However, if you want to try doing this using just plain Python, take a look at the code below. A couple of points: First, wrap your open() inside a with statement, which will automatically call close() on the file when you are finished. Next, notice that Python lets you use the in keyword in all kinds of interesting ways. Anything that is a sequence can be "in-ed", including strings. You could replace all of the string.xxx constants with your own strings if you would like.
import string

chars = []
with open("notes.txt", "r") as f:
    for c in f.read():
        chars.append(c)
num_chars = len(chars)
num_upper = 0
num_vowels = 0
num_letters = 0
vowels = "aeiouAEIOU"
for c in chars:
    if c in vowels:
        num_vowels += 1
    if c in string.ascii_uppercase:
        num_upper += 1
    if c in string.ascii_letters:
        num_letters += 1
print(num_chars)
print(num_letters)
print(num_upper)
print(num_vowels)
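The same membership tests can be wrapped in a function and checked against a short string, which makes the four counts easy to verify by hand. This is only a sketch: count_stats and the sample string are made up for illustration, and, like the loop above, it counts every character (including spaces and the newline) in the character total.

```python
import string

def count_stats(text):
    """Return the four counts asked for, using plain `in` membership tests."""
    vowels = "aeiouAEIOU"
    return {
        "characters": len(text),
        "letters": sum(1 for c in text if c in string.ascii_letters),
        "uppercase": sum(1 for c in text if c in string.ascii_uppercase),
        "vowels": sum(1 for c in text if c in vowels),
    }

# Hypothetical file contents; with a real file, pass f.read() instead
stats = count_stats("Hello World 42\n")
print(stats)
```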
I'm trying to split a variable length string across different but predefined line lengths. I've thrown together some code below which fails on key error 6 when I plonk it into Python Tutor (I don't have access to a proper python IDE right now) I guess this means my while loop isn't working properly and it's trying to keep incrementing lineNum but I'm not too sure why. Is there a better way to do this? Or is this easily fixable?
The code:
import re
#Dictionary containing the line number as key and the max line length
lineLengths = {
    1:9,
    2:11,
    3:12,
    4:14,
    5:14
}
inputStr = "THIS IS A LONG DESC 7X7 NEEDS SPLITTING" #Test string, should be split on the spaces and around the "X"
splitted = re.split("(?:\s|((?<=\d)X(?=\d)))",inputStr) #splits inputStr on white space and where X is surrounded by numbers eg. dimensions
lineNum = 1 #initialises the line number at 1
lineStr1 = "" #initialises each line as a string
lineStr2 = ""
lineStr3 = ""
lineStr4 = ""
lineStr5 = ""
#Dictionary creating dynamic line variables
lineNumDict = {
    1:lineStr1,
    2:lineStr2,
    3:lineStr3,
    4:lineStr4,
    5:lineStr5
}
if len(inputStr) > 40:
    print "The short description is longer than 40 characters"
else:
    while lineNum <= 5:
        for word in splitted:
            if word != None:
                if len(lineNumDict[lineNum]+word) <= lineLengths[lineNum]:
                    lineNumDict[lineNum] += word
                else:
                    lineNum += 1
            else:
                if len(lineNumDict[lineNum])+1 <= lineLengths[lineNum]:
                    lineNumDict[lineNum] += " "
                else:
                    lineNum += 1
lineOut1 = lineStr1.strip()
lineOut2 = lineStr2.strip()
lineOut3 = lineStr3.strip()
lineOut4 = lineStr4.strip()
lineOut5 = lineStr5.strip()
I've taken a look at this answer but don't have any real understanding of C#: Split large text string into variable length strings without breaking words and keeping linebreaks and spaces
It doesn't work because you have the for word in splitted loop inside your while loop with the lineNum condition, so lineNum can run past 5 before the while condition is checked again. You have to do this:
if len(inputStr) > 40:
    print "The short description is longer than 40 characters"
else:
    for word in splitted:
        if lineNum > 5:
            break
        if word != None:
            if len(lineNumDict[lineNum]+word) <= lineLengths[lineNum]:
                lineNumDict[lineNum] += word
            else:
                lineNum += 1
        else:
            if len(lineNumDict[lineNum])+1 <= lineLengths[lineNum]:
                lineNumDict[lineNum] += " "
            else:
                lineNum += 1
Also lineStr1, lineStr2 and so on won't be changed, you have to access the dict directly (strings are immutable). I tried it and got the results working:
print("Lines: %s" % lineNumDict)
Gives:
Lines: {1: 'THIS IS A', 2: 'LONG DESC 7', 3: '7 NEEDS ', 4: '', 5: ''}
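The remark that lineStr1, lineStr2 and so on are never changed can be seen in a tiny sketch. The variable names here are shortened stand-ins for the ones in the question.

```python
line_str1 = ""
line_dict = {1: line_str1}

# += on a string builds a new string and rebinds the dict entry;
# the original variable still refers to the old, empty string
line_dict[1] += "THIS IS A"

print(repr(line_str1))
print(repr(line_dict[1]))
```

So the results have to be read back from the dict, not from the individual variables.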
for word in splitted:
    ...
    lineNum += 1
Your code increments lineNum by the number of words in splitted, i.e. 16 times.
I wonder if a properly commented regular expression wouldn't be easier to understand?
lineLengths = {1:9,2:11,3:12,4:14,5:14}
inputStr = "THIS IS A LONG DESC 7X7 NEEDS SPLITTING"
import re
pat = """
(?:                     # non-capture around the line as we want to drop leading spaces
    \s*                 # drop leading spaces
    (.{{1,{max_len}}})  # up to max_len characters, will be added through 'format'
    (?=[\b\sX]|$)       # and using word breaks, X and string ending as terminators
                        # but without capturing as we need X to go into the next match
)?                      # and ignoring missing matches if not all lines are necessary
"""
# build a pattern matching up to 5 lines with the corresponding max lengths
pattern = ''.join(pat.format(max_len=x) for x in lineLengths.values())
re.match(pattern, inputStr, re.VERBOSE).groups()
# Out: ('THIS IS A', 'LONG DESC 7', 'X7 NEEDS', 'SPLITTING', None)
Also, there is no real point in using a dict for lineLengths; a list would do nicely.
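A list-based variant of the same idea might look like the sketch below. It is a compact rewrite, not the original pattern: the names line_lengths, part and lines are made up here, and the `\b` has been left out of the character class because inside square brackets it stands for a backspace character, not a word boundary.

```python
import re

line_lengths = [9, 11, 12, 14, 14]  # maximum length of each output line
input_str = "THIS IS A LONG DESC 7X7 NEEDS SPLITTING"

# One optional group per line: skip leading spaces, capture up to max_len
# characters, and stop only where whitespace, an "X", or the end follows.
part = r"(?:\s*(.{{1,{max_len}}})(?=[\sX]|$))?"
pattern = "".join(part.format(max_len=n) for n in line_lengths)

lines = re.match(pattern, input_str).groups()
print(lines)
```

Since the lookahead does not consume the "X", it becomes the first character of the following line; lines that are not needed come back as None.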
char1= "P"
length=5
f = open("wl.txt", 'r')
for line in f:
    if len(line)==length and line.rstrip() == char1:
        z=Counter(line)
        print z
I want to output only the lines whose length is 5 and which contain the character p. So far I have:
f = open("wl.txt", 'r')
for line in f:
    if len(line)==length :#This one only works with the length
        z=Counter(line)
        print z
Any ideas?
Your problem is:
if len(line)==length and line.rstrip() == char1:
If a line is 5 characters long, then after removing trailing whitespace, you're then comparing to see if it's equal to a string of length 1... 'abcde' is never going to equal 'p' for instance, and your check will never run if your line contains 'p' as it's not 5 characters...
I'm not sure what you're trying to do with Counter
Corrected code is:
# note in capitals to indicate 'constants'
LENGTH = 5
CHAR = 'p'
with open('wl.txt') as fin:
    for line in fin:
        # Check length *after* any trailing whitespace has been removed
        # and that CHAR appears anywhere **in** the line
        if len(line.rstrip()) == LENGTH and CHAR in line:
            print 'match:', line
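The same check can be tried without a file by running it over a list of lines. In this sketch the helper matching_lines and the sample words are invented for illustration; an open file object could be passed in place of the list.

```python
LENGTH = 5
CHAR = 'p'

def matching_lines(lines, length=LENGTH, char=CHAR):
    """Return the stripped lines that have the given length and contain char."""
    result = []
    for line in lines:
        stripped = line.rstrip()  # drop the trailing newline before measuring
        if len(stripped) == length and char in stripped:
            result.append(stripped)
    return result

# Hypothetical contents of wl.txt, one word per line
sample = ["apple\n", "pear\n", "grape\n", "melon\n", "plums\n"]
found = matching_lines(sample)
print(found)
```

Note that the comparison is case-sensitive, so a line containing only an uppercase "P" would not match a lowercase CHAR.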