Analyze a text file for certain attributes in Python 3.x


For an assignment in Python 3.x, I have to create a program that reads a text file and outputs the total number of characters, lines, vowels, capital letters, numeric digits, and words. The user has to provide the file and path of the text file. Asking for the file is easy:
file = input("Please provide the file path and file name. \nFor example C:\\Users\\YourName\\Documents\\books\\book.txt \n:")
f = open(file, 'r')
text = f.read()
I tried to use simple functions like:
numberOfCharacters = len(text)
...but reading farther into the assignment reveals that I have to use a for loop to analyze each character in the string, and then use a multi-way if statement to check whether it is a vowel, digit, etc.
I know I can count the number of lines by counting the number of \n's, and I can use the .split() function for words, but I am rather lost on how to get going.
I want to format the output like this, though I think I can figure this out after I get the program to work.
------------width=35---------|--width=8----
|number of characters : #####|
|number of lines : #####|
|number of vowels : #####|
|number of capital letters : #####|
|number of numeric digits : #####|
|number of words : #####|
Any help getting going and showing me what to do would be greatly appreciated.

You can use the NLTK toolkit (http://www.nltk.org/) to get the info you want.
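The for-loop approach the assignment asks for can also be sketched in plain Python, without NLTK. This is a minimal sketch, not the assignment's reference solution: the function name analyze, the VOWELS constant, and the dictionary keys are all made up for illustration, and it assumes that "vowels" means a, e, i, o, u in either case. Capital letters are tested separately because a capital vowel belongs to two categories at once.

```python
VOWELS = "aeiouAEIOU"  # assumption: only these ten characters count as vowels


def analyze(text):
    """Count characters, lines, vowels, capitals, digits, and words in text."""
    counts = {"characters": 0, "lines": 0, "vowels": 0,
              "capital letters": 0, "numeric digits": 0}
    for ch in text:
        counts["characters"] += 1
        # A capital vowel is both a capital and a vowel, so test capitals on their own.
        if ch.isupper():
            counts["capital letters"] += 1
        # Multi-way if for the mutually exclusive categories.
        if ch in VOWELS:
            counts["vowels"] += 1
        elif ch.isdigit():
            counts["numeric digits"] += 1
        elif ch == "\n":
            counts["lines"] += 1
    # A final line without a trailing newline still counts as a line.
    if text and not text.endswith("\n"):
        counts["lines"] += 1
    # Words are whitespace-separated runs, as with the .split() idea in the question.
    counts["words"] = len(text.split())
    return counts
```

Each entry of the returned dictionary can then be printed with a format specification such as {:<35} for the label and {:>8} for the number to approximate the requested layout.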

Related

Weird behavior when writing a string to a file

I am trying to make an AutoHotKey script that removes the letter 'e' from most words you type. To do this, I am going to put a list of common words in a text file and have a python script add the proper syntax to the AHK file for each word. For testing purposes, my word list file 'words.txt' contains this:
apple
dog
tree
I want the output in the file 'wordsOut.txt' (which I will turn into the AHK script) to end up like this after I run the python script:
::apple::appl
::tree::tr
As you can see, it will exclude words without the letter 'e' and removes 'e' from everything else. But when I run my script which looks like this...
f = open('C:\\Users\\jpyth\\Desktop\\words.txt', 'r')
while True:
    word = f.readline()
    if not word: break
    if 'e' in word:
        sp_word = word.strip('e')
        outString = '::{}::{}'.format(word, sp_word)
        p = open('C:\\Users\\jpyth\\Desktop\\wordsOut.txt', 'a+')
        p.write(outString)
        p.close()
f.close()
The output text file ends up like this:
::apple
::apple
::tree::tr
The weirdest part is that, while it never gets it right, the text in the output file can change depending on the number of lines in the input file.

I'm making this an official answer and not a comment because it's worth pointing out how strip works, and to be wary of hidden characters like newline characters.
f.readline() returns each line, including the '\n'. Because strip() only removes the character from beginning and end of string, not from the middle, it's not actually removing anything from most words with 'e'. In fact even a word that ends in 'e' doesn't get that 'e' removed, since you have a new line character to the right of it. It also explains why ::apple is printed over two lines.
'hello\n'.strip('o') outputs 'hello\n'
whereas 'hello'.strip('o') outputs 'hell'
As pointed out in the comments, just do sp_word = word.strip().replace('\n', '').replace('e', '')

lhay's answer is right about the behavior of strip() but I'm not convinced that list comprehension really qualifies as "simple".
I would instead go with replace():
>>> 'elemental'.replace('e', '')
'lmntal'
(Also side note: for word in f: does the same thing as the first three lines of your code.)
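Putting both answers together, the script could look roughly like the sketch below. The function names build_hotstrings and convert are hypothetical, and the sketch assumes one word per line in the input file. It strips the trailing newline first, then uses replace() to drop every 'e'.

```python
def build_hotstrings(lines):
    """Return AutoHotKey hotstring lines for every word that contains 'e'."""
    result = []
    for line in lines:
        word = line.strip()  # removes the trailing '\n' and other surrounding whitespace
        if 'e' in word:
            result.append('::{}::{}'.format(word, word.replace('e', '')))
    return result


def convert(in_path, out_path):
    """Read words from in_path and write the hotstrings to out_path."""
    # 'with' closes both files even if an error occurs part-way through.
    with open(in_path, 'r') as src, open(out_path, 'w') as dst:
        for hotstring in build_hotstrings(src):
            dst.write(hotstring + '\n')
```

Because an open file object yields one line at a time, it can be passed straight to build_hotstrings, which is the for word in f idea from the side note.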

Filter special characters, count them, and rewrite on another csv

I am trying to see if there are special characters in a csv file. This file consists of one column with about 180,000 rows. Since my file contains Korean, English, and Chinese, I added 가-힣, A-Z, and 0-9, but I do not know what I should add so that Chinese characters are not filtered out. Or is there a better way to do this?
Special letters I am looking for are : ■ , △, ?, etc
Special letters I do not want to count are : Unit (ex : ㎍, ㎥, ℃), (), ' etc.
Searching on Stack Overflow, I found that many questions start by designating the special characters to look for. But in my case that is difficult, since I have 180,000 records and I do not know which characters are actually in there. As far as I know, there are only three languages: Korean, English, and Chinese.
This is my code so far :
with open("C:/count1.csv",'w',encoding='cp949',newline='') as testfile:
    csv_writer=csv.writer(testfile)
    with open(file,'r') as fi:
        for line in fi:
            x=not('가-힣','A-Z','0-9')
            if x in line :
                sub=re.sub(x,'*',line.rstrip())
                count=len(sub)
                lst=[fi]+[count]
                csv_writer.writerow(lst)
Using import re
regex=not'[가-힣]','[a-z]','[0-9]'
file="C:/kd/fields.csv"
with open("C:/specialcharacter.csv",'w',encoding='cp949',newline='') as testfile:
    csv_writer=csv.writer(testfile)
    with open(file,'r') as fi:
        for line in fi:
            search_target = line
            result=re.findall(regex,search_target)
            print("\n".join(result))

I do not know why you are considering not filtering Chinese characters when you are only looking for certain special characters. This library can filter Chinese.
To filter Chinese on top of your filter for Korean, English, and digits: regex = "[^가-힣a-zA-Z0-9]" result=re.findall(regex,search_target)
Filter on either 1) a list of special characters that you are looking for or 2) a list of special characters that you want to ignore.
Choose whichever fits your case better and produces the fewest exceptions, so that you do not have to add more filters every time.
Express the list as a regex.
Then loop through your 180,000 rows, using the regex to filter the rows.
Update your regex list until you have filtered everything.
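The "characters to ignore" variant of these steps can be sketched as a single negated character class. This is only an illustration under stated assumptions: it treats the CJK Unified Ideographs block (U+4E00 to U+9FFF) as "Chinese", which does not cover every Chinese character, and the names ALLOWED_SYMBOLS, SPECIAL, and find_special are made up here. The allow-list is meant to be extended as new harmless symbols turn up in the data.

```python
import re

# Assumption: symbols that should NOT be reported, taken from the question's examples.
ALLOWED_SYMBOLS = "㎍㎥℃()'"

# Anything that is not Hangul, an ASCII letter or digit, a CJK Unified Ideograph,
# whitespace, or an allowed symbol is treated as a special character.
SPECIAL = re.compile(
    "[^가-힣a-zA-Z0-9\u4e00-\u9fff\\s" + re.escape(ALLOWED_SYMBOLS) + "]"
)


def find_special(text):
    """Return the list of special characters found in text, in order."""
    return SPECIAL.findall(text)
```

Calling len(find_special(line)) on each row gives a per-row count, and collecting the results in a set shows which characters still need a decision.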

Where am I wrong? Count total words excluding header and footer in Python

This is the file I am trying to read, and I want to count the total number of words in it: test.txt
I have written code for it:
import re
import string

def create_wordlist(filename, is_Gutenberg=True):
    words = 0
    wordList = []
    data = False
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    file1 = open("temp",'w+')
    with open(filename, 'r') as file:
        if is_Gutenberg:
            for line in file:
                if line.startswith("*** START "):
                    data = True
                    continue
                if line.startswith("End of the Project Gutenberg EBook"):
                    #data = False
                    break
                if data:
                    line = line.strip().replace("-"," ")
                    line = line.replace("_"," ")
                    line = regex.sub("",line)
                    for word in line.split():
                        wordList.append(word.lower())
                    #print(wordList)
                    #words = words + len(wordList)
    return len(wordList)
    #return wordList

create_wordlist('test.txt', True)
Here are few rules to be followed:
1. Strip off whitespace, and punctuation
2. Replace hyphens with spaces
3. Skip the file header and footer. The header ends with a line that starts with "*** START OF THIS" and the footer starts with "End of the Project".
My answer is 60513, but the actual answer is 60570. That answer came with the question itself, so it may or may not be correct. Where am I going wrong?

You give a number for the actual answer -- the answer you consider correct, that you want your code to output.
You did not tell us how you got that number.
It looks to me like the two numbers come from different definitions of "word".
For example, you have in your example text several numbers in the form:
140,000,000
Is that one word or three?
You are replacing hyphens with spaces, so a hyphenated word will be counted as two. Other punctuation you are removing. That would make the above number (and there are other, similar, examples in your text) into one word. Is that what you intended? Is that what was done to get your "correct" number? I suspect this is all or part of your difference.
At a quick glance, I see three numbers in the form above (counted as either 3 or 9, difference 6)
I see 127 apostrophes (words like wife's, which could be counted as either one word or two) for a difference of 127.
Your difference is 57, so the answer is not quite so simple, but I still strongly suspect different definitions of what is a word, for specific corner cases.
By the way, I am not sure why you are collecting all the words into a huge list and then getting the length. You could skip the append loop and just accumulate a sum of len(line.split()). This would remove complexity, which lessens the possibility of bugs (and probably makes the program faster, if that matters in this case).
Also, you have a line:
if line.startswith("*** START " in"):
When I try that in my python interpreter, I get a syntax error. Are you sure the code you posted here is what you are running? I would have expected:
if line.startswith("*** START "):

Without an example text file that shows this behaviour it is difficult to guess what goes wrong. But there is one clue: your number is less than what you expect. That seems to imply that you somehow glue together separate words, and count them as a single word. And the obvious candidate for this behaviour is the statement line = regex.sub("",line): this replaces any punctuation character with an empty string. So if the text contains that's, your program changes this to thats.
If that is not the cause, you really need to provide a small sample of text that shows the behaviour you get.
Edit: if your intention is to treat punctuation as word separators, you should replace the punctuation character with a space, so: line = regex.sub(" ",line).
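To see how much the definition of "word" changes the total, here is a small sketch that keeps a running count instead of building a list and makes the punctuation replacement a parameter. The function name count_words and the replacement parameter are made up for illustration, and the header and footer handling from the question is left out to keep it short.

```python
import re
import string

# Same punctuation class as in the question's code.
PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))


def count_words(lines, replacement=""):
    """Count words, replacing each punctuation character with `replacement`."""
    total = 0
    for line in lines:
        # Hyphens and underscores become spaces first, as in the question.
        line = line.strip().replace("-", " ").replace("_", " ")
        line = PUNCTUATION.sub(replacement, line)
        total += len(line.split())  # running sum, no word list needed
    return total
```

For the line "that's 140,000,000 well-known", count_words gives 4 with the default empty replacement and 7 with replacement=" ", so corner cases like these can easily account for a difference of a few dozen words over a whole book.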

Matching words with Regex (Python 3)

I have been staring at this problem for hours, and I don't know what regex to use to solve it.
Problem:
Given the following input strings, find all possible output words 5 characters or longer.
qwertyuytresdftyuioknn
gijakjthoijerjidsdfnokg
Your program should find all possible words (5+ characters) that can be derived from the strings supplied.
Use http://norvig.com/ngrams/enable1.txt as your search dictionary.
The order of the output words doesn't matter.
queen question
gaeing garring gathering gating geeing gieing going goring
Assumptions about the input strings:
QWERTY keyboard
Lowercase a-z only, no whitespace or punctuation
The first and last characters of the input string will always match
the first and last characters of the desired output word.
Don't assume users take the most efficient path between letters
Every letter of the output word will appear in the input string
Attempted solution:
First I downloaded the words from that webpage and stored them in a file on my computer ('words.txt'):
import requests
res = requests.get('http://norvig.com/ngrams/enable1.txt')
res.raise_for_status()
fp = open('words.txt', 'wb')
for chunk in res.iter_content(100000):
    fp.write(chunk)
fp.close()
I'm then trying to match the words I need using regex. The problem is that I don't know how to format my re.compile() to achieve this.
import re
input = 'qwertyuytresdftyuioknn' #example
fp= open('words.txt')
string = fp.read()
regex = re.compile(input[0]+'\w{3,}'+input[-1]) #wrong need help here
regex.findall(string)
Obviously this is wrong, since I need to match letters from my input string going from left to right, not just any letters, which is what I'm mistakenly doing with \w{3,}. Any help with this would be greatly appreciated.

This feels a bit like a homework problem. Thus, I won't post the full answer, but will try to give some hints. Character groups to match are given between square brackets: [adfg] will match any of the letters a, d, f or g. [adfg]{3,} will match any part with at least 3 of these letters. Looking at your list of words, you only want to match whole lines. If you pass re.MULTILINE as the second argument to re.compile, ^ will match the beginning and $ the end of a line.
Addition:
If the characters can only appear in the order given and assuming that each character can appear any number of times: 'qw*e*r*t*y*u*y*t*r*e*s*d*f*t*y*u*i*o*k*n*n'. However, we will also have to have at least 5 characters in total. A positive lookbehind assertion (?<=\w{5}) added to the end will ensure that.
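The pattern from the addition can be generated from any input string rather than typed by hand. The sketch below is one possible reading of the hint, not a full solution: the names build_pattern and candidates are hypothetical, it takes an already-loaded list of words instead of reading words.txt, and it swaps the lookbehind for a plain len() check, which gives the same minimum-length guarantee.

```python
import re


def build_pattern(keys):
    """First and last key are required; every key in between may repeat or be skipped."""
    middle = "".join(re.escape(c) + "*" for c in keys[1:-1])
    return re.compile(re.escape(keys[0]) + middle + re.escape(keys[-1]))


def candidates(keys, words, min_len=5):
    """Return the words of at least min_len characters that the key sequence can spell."""
    pattern = build_pattern(keys)
    # fullmatch anchors the pattern at both ends, like ^ and $ on a single line.
    return [w for w in words if len(w) >= min_len and pattern.fullmatch(w)]
```

With the first example input, candidates('qwertyuytresdftyuioknn', ['queen', 'question', 'quit', 'green', 'quern']) keeps only 'queen' and 'question'.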

First line not capitalizing correctly in Python 3

I'm trying to capitalize the first letter of every name in a file, so I wrote the following code:
with open('C:/Users/Nishesh/Documents/updated_firstnames.txt', 'r+', encoding='utf-8') as updated_fnames_file:
    with open('C:/Users/Nishesh/Documents/capitalized.txt', 'w', encoding='utf-8') as new_fnames:
        for line in updated_fnames_file:
            new_fnames.write(line.capitalize())
I'm new to Python, so I'm well aware that this is probably poor formatting/logic (and I'd appreciate suggestions to improve it), but for my purposes, this did manage to correctly capitalize every item in the file other than the very first one, as far as I can tell. Actually, the first name in the original file was already capitalized, but after I ran this it ended up lower case in the resulting file. The other items in the first file which were already capitalized were not made lower case however - just this one. Why is this happening?

capitalize() :
It returns a copy of the string with its first character capitalized and the rest lowercased.
You probably need capwords() from the string module.
string.capwords() :
Split the argument into words using str.split(), capitalize each word using str.capitalize(), and join the capitalized words using str.join().
Or you can do the same thing by hand (split() also drops the trailing newline, so add it back when writing):
new_fnames.write(' '.join(map(str.capitalize, line.split())) + '\n')
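The difference between the two functions is easy to check on a few sample strings. A minimal sketch, assuming one name per line; the helper capitalize_names is made up for illustration.

```python
import string


def capitalize_names(lines):
    """Capitalize every word on each line; surrounding whitespace is dropped."""
    # string.capwords splits on whitespace, capitalizes each word, and rejoins
    # with single spaces, so the trailing newline disappears as well.
    return [string.capwords(line) for line in lines]


# str.capitalize only upper-cases the first character and lower-cases the rest.
first_only = "john smith".capitalize()   # 'John smith'
every_word = string.capwords("john smith")  # 'John Smith'
```

When writing the results back to a file, a newline has to be appended to each name, for example new_fnames.write(name + '\n').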
