Detect text connected

Detect text connected - python

I'm trying to detect how many times a word appears in a txt file but the word is connected with other letters.
Detecting Hello
Text: Hellooo, how are you?
Expected output: 1
Here is the code I have now:
total = 0
with open('text.txt') as f:
    for line in f:
        finded = line.find('Hello')
        if finded != -1 and finded != 0:
            total += 1
print total
Do you know how I can fix this problem?

As suggested in the comment by @SruthiV, you can use re.findall from the re module:
import re
pattern = re.compile(r"Hello")
total = 0
with open('text.txt', 'r') as fin:
    for line in fin:
        total += len(re.findall(pattern, line))
print total
re.compile creates a compiled pattern object for the regex, here "Hello". Compiling the pattern once can improve a program's performance and is recommended (by some) when the same pattern is used repeatedly.
The remaining part of the program opens the file, reads it line by line, and looks for occurrences of the pattern in every line using re.findall. Since re.findall returns a list of matches, total is updated with the length of that list, i.e. the number of matches in a given line.
Note: this program counts all occurrences of Hello, whether as separate words or as part of other words. It is also case-sensitive, so hello will not be counted.
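If the case sensitivity mentioned in the note is not wanted, one possible variation is sketched below. The function name count_occurrences, its ignore_case parameter, and the in-memory sample lines are assumptions made for illustration; they are not part of the original answer.

```python
import re

def count_occurrences(lines, word, ignore_case=False):
    """Count every occurrence of `word`, including inside longer words."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape keeps characters such as "." or "?" in `word` literal.
    pattern = re.compile(re.escape(word), flags)
    return sum(len(pattern.findall(line)) for line in lines)

# Hypothetical sample standing in for the lines of text.txt.
sample = ["Hellooo, how are you?", "hello again, Hello!"]
print(count_occurrences(sample, "Hello"))
print(count_occurrences(sample, "Hello", ignore_case=True))
```

Passing an open file object instead of the list should work the same way, since both are iterated line by line.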

For every line, you can iterate through every word by splitting the line on spaces which makes the line into a list of words. Then, iterate through the words and check if the string is in the word:
total = 0
with open('text.txt') as f:
    # Iterate through lines
    for line in f:
        # Iterate through words by splitting on spaces
        for word in line.split(' '):
            # Match string in word
            if 'Hello' in word:
                total += 1
print total

Related

Finding specific words and add them into a dictionary

I want to find the words that start with "CHAPTER" and add them to a dictionary.
I have written some code, but it gives me 0 as output every time:
def wordcount(filename, listwords):
    try:
        file = open(filename, "r")
        read = file.readlines()
        file.close()
        for word in listwords:
            lower = word.lower()
            count = 0
            for sentence in read:
                line = sentence.split()
                for each in line:
                    line2=each.lower()
                    line2=line2.strip("")
                    if lower == line2:
                        count += 1
            print(lower, ":", count)
    except FileExistError:
        print("The file is not there ")
wordcount("dad.txt", ["CHAPTER"])
the txt file is here
EDIT:
The problem was the encoding type, and I solved it. The new question is: how can I add these words to a dictionary?
Also, how can I make this code case-sensitive? When I call wordcount("dad.txt", ["CHAPTER"]), I want it to find only the upper-case CHAPTER words.

It cannot work because of this line:
if lower == line2:
Use this line instead to find the words that start with "CHAPTER":
if line2.startswith(lower):

I notice that you need to check whether a word starts with one of the words from listwords rather than testing for equality (lower == line2). Hence, you should use the startswith method.
You can have simpler code, something like this:
def wordcount(filename, listwords):
    listwords = [s.lower() for s in listwords]
    wordCount = {s:0 for s in listwords} # A dict to store the counts
    with open(filename,"r") as f:
        for line in f.readlines():
            for word in line.split():
                for s in listwords:
                    if word.lower().startswith(s):
                        wordCount[s]+=1
    return wordCount
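For the case-sensitive behaviour asked about in the edit to the question, a minimal sketch is to skip the lower() calls so that only exact-case prefixes match. The name count_prefixes and the sample lines are made up for illustration, and the function takes a list of lines rather than a file name so that it can be tried without dad.txt.

```python
def count_prefixes(lines, prefixes):
    """Case-sensitive count of words starting with each prefix."""
    counts = {prefix: 0 for prefix in prefixes}  # dict holding the results
    for line in lines:
        for word in line.split():
            for prefix in prefixes:
                # No lower() here, so "chapter" and "Chapter" are ignored.
                if word.startswith(prefix):
                    counts[prefix] += 1
    return counts

# Hypothetical sample lines standing in for the file contents.
sample = ["CHAPTER I", "This chapter begins here.", "CHAPTER II", "Chapter notes"]
print(count_prefixes(sample, ["CHAPTER"]))
```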

If the goal is to find chapters and paragraphs, don't try to count words or split any line.
Start simpler. Since chapters are in numeric order, you only need a list, not a dictionary:
chapters = [] # list of chapters
chapter = "" # store one chapter
with open(filename, encoding="UTF-8") as f:
    for line in f.readlines():
        # TODO: should skip to the first line that starts with "CHAPTER", otherwise 'chapters' variable gets extra, header information
        if line.startswith("CHAPTER"):
            print("Found chapter: " + line)
            # Save off the most recent, non empty chapter text, and reset
            if chapter:
                chapters.append(chapter)
                chapter = ""
        else:
            # up to you if you want to skip empty lines
            chapter += line # don't manipulate any data yet
# Capture the last chapter at the end of the file
if chapter:
    chapters.append(chapter)
del chapter # no longer needed
# del chapters[0] if you want to remove the header information before the first chapter header
# Done reading the file, now work with strings in your lists
print(len(chapters)) # find how many chapters there are
If you actually did want the text following "CHAPTER", then you can split that line in the first if statement, however note that the chapter numbers repeat between volumes, and this solution assumes the volume header is part of a chapter
If you want to count the paragraphs, start with finding the empty lines (for example split each element on '\n\n')
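The paragraph-counting idea could look roughly like the sketch below. It assumes paragraphs are separated by at least one empty line and that each chapter is a single string, as in the chapters list above; count_paragraphs and the sample text are hypothetical.

```python
def count_paragraphs(chapter_text):
    """Count blocks of text separated by one or more empty lines."""
    blocks = chapter_text.split("\n\n")
    # Extra blank lines produce empty or whitespace-only blocks; skip those.
    return sum(1 for block in blocks if block.strip())

# Hypothetical chapter text with three paragraphs.
sample_chapter = (
    "First paragraph, line one.\nLine two.\n\n"
    "Second paragraph.\n\n\n"
    "Third paragraph.\n"
)
print(count_paragraphs(sample_chapter))
```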

Total number of first words in a txt file

I need a program that counts the top 5 most common first words of the lines in a file, and that does not include lines where the first word is followed by a "DM" or an "RT".
I don't have any code so far because I'm completely lost.
f = open("C:/Users/Joe Simpleton/Desktop/talking.txt", "r")
?????

Read each line of your text in. For each line, split it into words using a regular expression, this will return a list of the words. If there are at least two words, test the second word to make sure it is not in your list. Then use a Counter() to keep track of all of the word counts. Store the lowercase of each word so that uppercase and lowercase versions of the same word are not counted separately:
from collections import Counter
import re
word_counts = Counter()
with open('talking.txt') as f_input:
    for line in f_input:
        words = re.findall(r'\w+', line)
        if (len(words) > 1 and words[1] not in ['DM', 'RT']) or len(words) == 1:
            word_counts.update(word.lower() for word in words)
print(word_counts.most_common(5))
The Counter() has a useful feature in being able to show the most common values.

Not tested, but should work roughly like that:
from collections import Counter
count = Counter()
with open("path") as f:
    for line in f:
        parts = line.split(" ")
        if parts[1] not in ["DM", "RT"]:
            count[parts[0]] += 1
print(count.most_common(5))
You should also add a check that ensures that parts has at least 2 elements.
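A sketch that counts only the first word of each line and includes the length check might look like this. The function name top_first_words, its parameters, and the sample lines are assumptions for illustration; it treats a one-word line as countable and skips empty lines.

```python
from collections import Counter

def top_first_words(lines, skip_after=("DM", "RT"), n=5):
    """Return the n most common first words, skipping lines whose
    second word is one of `skip_after`."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # empty line, nothing to count
        if len(parts) > 1 and parts[1] in skip_after:
            continue  # first word is followed by "DM" or "RT"
        counts[parts[0]] += 1
    return counts.most_common(n)

# Hypothetical sample standing in for the lines of talking.txt.
sample = ["hello DM there", "hello world", "hi RT again", "hello you", "", "bye"]
print(top_first_words(sample))
```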

Find the number of characters in a file using Python

Here is the question:
I have a file with these words:
hey how are you
I am fine and you
Yes I am fine
And it is asked to find the number of words, lines and characters.
Below is my program, but the number of counts for the characters without space is not correct.
The number of words is correct and the number of line is correct.
What is the mistake in the same loop?
fname = input("Enter the name of the file:")
infile = open(fname, 'r')
lines = 0
words = 0
characters = 0
for line in infile:
    wordslist = line.split()
    lines = lines + 1
    words = words + len(wordslist)
    characters = characters + len(line)
print(lines)
print(words)
print(characters)
The output is:
lines=3(Correct)
words=13(correct)
characters=47
I've looked on the site with multiple answers and I am confused because I didn't learn some other functions in Python. How do I correct the code as simple and basic as it is in the loop I've done?
Whereas the number of characters without space is 35 and with space is 45.
If possible, I want to find the number of characters without space. Even if someone know the loop for the number of characters with space that's fine.

Sum up the length of all words in a line:
characters += sum(len(word) for word in wordslist)
The whole program:
with open('my_words.txt') as infile:
    lines=0
    words=0
    characters=0
    for line in infile:
        wordslist=line.split()
        lines=lines+1
        words=words+len(wordslist)
        characters += sum(len(word) for word in wordslist)
print(lines)
print(words)
print(characters)
Output:
3
13
35
This:
(len(word) for word in wordslist)
is a generator expression. It is essentially a loop in one line that produces the length of each word. We feed these lengths directly to sum:
sum(len(word) for word in wordslist)
Improved version
This version takes advantage of enumerate, so you save two lines of code, while keeping the readability:
with open('my_words.txt') as infile:
    words = 0
    characters = 0
    for lineno, line in enumerate(infile, 1):
        wordslist = line.split()
        words += len(wordslist)
        characters += sum(len(word) for word in wordslist)
print(lineno)
print(words)
print(characters)
This line:
with open('my_words.txt') as infile:
opens the file and guarantees that it is closed as soon as you leave the indented block.
It is always good practice to close a file after you are done using it.

Remember that each line (except for the last) ends with a line separator,
i.e. "\r\n" on Windows or "\n" on Linux and Mac.
Thus, exactly two characters are added in this case, which is why you get 47 and not 45.
A nice way to overcome this could be to use:
import os
fname=input("enter the name of the file:")
infile=open(fname, 'r')
lines=0
words=0
characters=0
for line in infile:
    line = line.strip(os.linesep)
    wordslist=line.split()
    lines=lines+1
    words=words+len(wordslist)
    characters=characters+ len(line)
print(lines)
print(words)
print(characters)
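The effect of the line separators can be checked without a file. The sketch below is only an illustration: it uses the three sample lines from the question as an in-memory string, assumes the last line has no trailing newline, and the variable names are made up.

```python
# The question's sample text; the last line has no trailing newline.
sample = "hey how are you\nI am fine and you\nYes I am fine"

with_newlines = 0    # what len(line) adds up to in the original loop
with_spaces = 0      # characters after removing the line ending
without_spaces = 0   # characters inside words only

for line in sample.splitlines(keepends=True):
    with_newlines += len(line)
    stripped = line.rstrip("\r\n")
    with_spaces += len(stripped)
    without_spaces += sum(len(word) for word in stripped.split())

print(with_newlines, with_spaces, without_spaces)  # 47 45 35
```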

To count the characters, you should count each individual word. So you could have another loop that counts characters:
for word in wordslist:
    characters += len(word)
That ought to do it. The wordslist should probably take away newline characters on the right, something like wordslist = line.rstrip().split() perhaps.

I found this solution very simple and readable:
with open("filename", 'r') as file:
    text = file.read().strip().split()
    len_chars = sum(len(word) for word in text)
    print(len_chars)

This is too long for a comment.
Python 2 or 3? Because it really matters. Try out the following in your REPL for both:
Python 2.7.12
>>> len("taña")
5
Python 3.5.2
>>> len("taña")
4
Huh? The answer lies in Unicode. In Python 2 a plain string literal is a byte string, and ñ takes two bytes in UTF-8 (assuming a UTF-8 terminal), so len counts 5 bytes. In Python 3 a string is a sequence of Unicode code points, and ñ is a single one. So it is 1 character, but not 1 byte. Unless you're working with plain ASCII text, you'd better specify which version of Python your character-counting function is for.
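The difference between characters and bytes can also be seen within Python 3 alone. This is a small sketch, not part of the original answer: it writes ñ as the escape \u00f1 so the source does not depend on the editor's encoding, and it uses unicodedata.normalize to show the decomposed form (n followed by a combining tilde) for comparison.

```python
import unicodedata

word = "ta\u00f1a"  # "taña" written with the precomposed ñ (U+00F1)
decomposed = unicodedata.normalize("NFD", word)  # n + combining tilde

print(len(word))                  # number of code points
print(len(word.encode("utf-8")))  # number of bytes in UTF-8
print(len(decomposed))            # code points after decomposition
```

The three numbers differ because len on a str counts code points, while len on the encoded bytes counts bytes, and the decomposed form spends two code points on the same visible letter.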

How's this? It uses a regular expression to match all non-whitespace characters and returns the number of matches within a string.
import re
DATA="""
hey how are you
I am fine and you
Yes I am fine
"""
def get_char_count(s):
    return len(re.findall(r'\S', s))
if __name__ == '__main__':
    print(get_char_count(DATA))
Output
35

It is probably counting newline characters. Subtract (lines - 1) from characters.

Here is the code:
fp = open(fname, 'r+').read()
chars = fp.decode('utf8')
print len(chars)
Check the output. I just tested it.

A more Pythonic solution than the others:
with open('foo.txt') as f:
    text = f.read().splitlines() # list of lines
lines = len(text) # length of the list = number of lines
words = sum(len(line.split()) for line in text) # split each line on spaces, sum up the lengths of the lists of words
characters = sum(len(line) for line in text) # sum up the length of each line
print(lines)
print(words)
print(characters)
The other answers here are manually doing what str.splitlines() does. There's no reason to reinvent the wheel.

You do have the correct answer, and your code is completely correct. What I think is happening is that an end-of-line character is being counted on each line, which increases your character count by two (there isn't one on the last line, as there is no new line to go to). If you want to remove this, the simple fudge would be to do as Loaf suggested:
characters = characters - (lines - 1)
See csl's answer for the second part...

Simply skip unwanted characters while calling len,
characters=characters+ len([c for c in line if c not in '\r\n '])
or sum the count,
characters=characters+ sum(1 for c in line if c not in '\r\n ')
or build a str from the wordslist and take len,
characters=characters+ len(''.join(wordslist))
or sum the characters in the wordslist. I think this is the fastest.
characters=characters+ sum(1 for word in wordslist for char in word)

You have two problems. One is the line endings and the other is the spaces in between.
Now there are many people who posted pretty good answers, but I find this method easier to understand:
characters = characters + len(line.strip()) - line.strip().count(' ')
line.strip() removes the leading and trailing whitespace, including the line ending. Then I'm subtracting the number of spaces from the total length.

It's very simple:
f = open('file.txt', 'rb')
f.seek(0) # Move to the start of file
print len(f.read())

Here is the smallest program I came up with for your problem:
with open('FileName.txt') as f:
    lines = f.readlines()
data = ''.join(lines)
print('lines =',len(lines))
print('Words = ',len(data.split()))
data = ''.join(data.split())
print('characters = ',len(data))
lines is a list of the lines, so its length is the number of lines. Next, data holds the file contents as one string, so splitting data gives the list of words in the file, and the length of that list is the number of words. Joining the words list again gives all characters as a single string without whitespace, so its length is the number of characters.

This takes the file name (e.g. files.txt) as input, counts the total number of characters in the file, and stores it in the variable char:
fname = input("Enter the name of the file:")
infile = open(fname, 'r') # open the file for reading
lines = 0
words = 0
char = 0 # init as zero integer
for line in infile:
    wordslist = line.split() # split the line into words
    lines = lines + 1 # count the line
    words = words + len(wordslist) # add the number of words in the line
    char = char + len(line) # add the length of the line, line ending included
print("lines are: " + str(lines))
print("words are: " + str(words))
print("chars are: " + str(char)) # print the results

num_lines = sum(1 for line in open('filename.txt'))
num_words = sum(1 for word in open('filename.txt').read().split())
num_chars = sum(len(word) for word in open('filename.txt').read().split())

How to include white space when searching for a string in text file

Each user is marked in one of the following two ways: [donorStatus] => donor or [donorStatus] => notADonor.
The string notADonor is unique so I am able to use the following function to count instances of it successfully. However, donor appears within other, longer strings in the file, so I'd like to search for a more specific string like => donor.
Searching for this yields 0 each time so I'm thinking it's the white space throwing it off and I can't figure out how to work around that. Any help would be appreciated. Thanks!
from collections import Counter
count = Counter()
for line in open ('data.txt', 'r'):
    for word in line.split():
        count[word] += 1
print count['=> donor']

The problem is that split() splits at every whitespace, including the one between > and donor. To split at every whitespace EXCEPT the ones with > before it, use re.split(r'(?<!>)\s+', line):
import re
from collections import Counter
count = Counter()
for line in open ('data.txt', 'r'):
    for word in re.split(r'(?<!>)\s+', line):
        count[word] += 1
print count['=> donor']
Regular expression explained:
(?<!a)b is the expression for a negative lookbehind, matching every b not preceded by a. Therefore, (?<!>)\s+ matches every run of whitespace characters (\s+) not preceded by >.
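To see what the lookbehind split produces, here is a small self-contained sketch in Python 3. The sample lines are invented to resemble the format described in the question, and each line is stripped first so that a trailing line break does not create an empty token.

```python
import re
from collections import Counter

# Hypothetical lines in the format described in the question.
sample_lines = [
    "[donorStatus] => donor",
    "[donorStatus] => notADonor",
    "[donorStatus] => donor",
]

count = Counter()
for line in sample_lines:
    # Split on whitespace unless it directly follows ">".
    for token in re.split(r"(?<!>)\s+", line.strip()):
        count[token] += 1

print(count["=> donor"])
print(count["=> notADonor"])
```

Each line splits into two tokens, "[donorStatus]" and the "=> ..." part, so the key "=> donor" exists in the counter.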

Use regular expressions.
import re
from collections import Counter
count = Counter()
for line in open ('data.txt', 'r'):
    for word in line.split():
        if re.search('=> donor', line, re.I):
            count[word] += 1

If you're only doing this for this particular list and want to keep things fast, I'd first check if "=>" is used anywhere else in the file.
If it isn't, save yourself the time and just use donor_count = count['=>'] - count['notADonor'] for a constant-time solution.
Otherwise, you may want to change your for loop to:
for line in open ('data.txt', 'r'):
    if '=> donor' in line:
        count['=> donor'] += 1
    # split and continue counting as needed, etc.
or use a regex, if you're going to use regular expressions for other things in the parsing. Otherwise, it's not likely to be worth the import just for this check.
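If the regex route is taken, one way it might look is sketched below. The word boundary \b is an addition not mentioned in the answer: it keeps a longer value such as "=> donors" from being counted. The function name count_donors and the sample lines are hypothetical.

```python
import re

# "=> donor" followed by a word boundary, so "=> donors" does not match.
DONOR_PATTERN = re.compile(r"=> donor\b")

def count_donors(lines):
    """Count occurrences of the exact marker '=> donor' across all lines."""
    return sum(len(DONOR_PATTERN.findall(line)) for line in lines)

# Hypothetical sample lines in the format described in the question.
sample_lines = [
    "[donorStatus] => donor",
    "[donorStatus] => notADonor",
    "[group] => donors",
    "[donorStatus] => donor",
]
print(count_donors(sample_lines))
```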

This should get you the results you're after
def count(word):
    counter = 0
    for line in open ('c:\\data.txt', 'r'):
        if word in line:
            counter += 1
    return counter
print count('=> donor')

Use split, count and sum:
with open('data.txt') as f:
    lines = f.readlines()
Select only the lines that interest us:
possible = [ln.strip().split() for ln in lines if '[donorStatus]' in ln]
Now find the donors:
print sum(ln.count('donor') for ln in possible)

Python unicode search not giving correct answer

I am trying to search for Hindi words, stored one per line in file-1, within the lines of file-2. I have to print the line numbers with the number of words found.
This is the code:
import codecs
hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8").readlines()
words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()
count_arr = []
for counter, line in enumerate(hypernyms):
    count_arr.append(0)
    for word in words:
        if line.find(word) >=0:
            count_arr[counter] +=1
for iterator, count in enumerate(count_arr):
    if count>0:
        print iterator, ' ', count
This is finding some words, but ignoring some others
The input files are:
File-1:
पौधा
वनस्पति
File-2:
वनस्पति, पेड़-पौधा
वस्तु-भाग, वस्तु-अंग, वस्तु_भाग, वस्तु_अंग
पादप_समूह, पेड़-पौधे, वनस्पति_समूह
पेड़-पौधा
This gives output:
0 1
3 1
Clearly, it is ignoring वनस्पति and searching for पौधा only. I have tried with other inputs as well. It only searches for one word. Any idea how to correct this?

I think the problem is here:
words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()
.readlines() will leave the line break at the end, so you're not searching for पौधा, you're searching for पौधा\n, and you'll only match at the end of a line. If I use .read().split() instead, I get
0 2
2 1
3 1
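The effect of the trailing line break can be reproduced in memory, without the two files. The sketch below is written for Python 3 and copies the sample data from the question; it assumes every line ends in "\n", as readlines() would typically return, and count_matches is a hypothetical helper name.

```python
def count_matches(lines, words):
    """For each line, count how many of the words occur in it."""
    return [sum(1 for word in words if word in line) for line in lines]

# Sample data from the question, as readlines() would return it.
words_raw = ["पौधा\n", "वनस्पति\n"]
lines = [
    "वनस्पति, पेड़-पौधा\n",
    "वस्तु-भाग, वस्तु-अंग, वस्तु_भाग, वस्तु_अंग\n",
    "पादप_समूह, पेड़-पौधे, वनस्पति_समूह\n",
    "पेड़-पौधा\n",
]

# With the line break kept, a word only matches at the end of a line.
print(count_matches(lines, words_raw))

# After stripping, every occurrence is found.
words_clean = [word.strip() for word in words_raw]
print(count_matches(lines, words_clean))
```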

That's because you don't remove the "\n" character at the end of the lines.
So you search for "some_pattern\n", not "some_pattern".
Use the strip() function to chop them off like this:
import codecs
words = [word.strip() for word in codecs.open("hypernyms_en2hi.txt", "r", "utf-8")]
hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8")
count_arr = []
for line in hypernyms:
    count_arr.append(0)
    for word in words:
        count_arr[-1] += (word in line)
for iterator, count in enumerate(count_arr):
    if count:
        print iterator, ' ', count

Add this code after count_arr = [] and before for counter, line... and you will see why that happens. It is because of the spaces:
in file 1 the first word is पौधा[space]....
for i in hypernyms:
    print "file1",i
for i in words:
    print "file2",i
