unique words in lines of text read from standard input - python

I am trying to see how many unique words there are in standard input.
import sys
s = sys.stdin.readlines()
seen = []
for lines in s:
    if lines not in seen:
        seen = seen + (lines.split())
        seen.append(lines)
print(len(seen))
I know I am on the right track, but Tree and tree should not be counted as separate unique words. Also, Monday and 1 are words, but – is not.

seen = []
for line in s:
    for word in line.strip().split():
        if word.isalnum() and word.lower() not in (x.lower() for x in seen):
            seen.append(word)
print(len(seen))
Or better (if you want only the length, but not the words themselves):
print(len(set(word.lower() for line in s for word in line.strip().split() if word.isalnum())))

I guess this code snippet can help you in a few lines. Basically, the idea is to use a set.
st = set()
for lines in s.split('\n'):  # here s is the whole input as a single string (e.g. sys.stdin.read())
    print(lines)
    st = set(lines.split()).union(st)
print(st)
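Putting the two ideas together, a minimal sketch that reads from standard input, as in the question, and applies both the case folding and the alphanumeric filter:
import sys

# The set discards duplicates automatically; lowering the case first
# makes Tree and tree count as one word, and isalnum() keeps Monday
# and 1 while dropping tokens like the dash.
unique = set()
for line in sys.stdin:
    for word in line.split():
        if word.isalnum():
            unique.add(word.lower())
print(len(unique))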

Related

How can I merge two snippets of text that both contain a desired keyword?

I have a program that pulls out text around a specific keyword. I'm trying to modify it so that if two keywords are close enough together, it just shows one longer snippet of text instead of two individual snippets.
My current code, below, adds words after the keyword to a list and resets the counter if another keyword is found. However, I've found two problems with this. The first is that the data rate limit in my Spyder notebook is exceeded, and I haven't been able to deal with that. The second is that though this would make a longer snippet, it wouldn't get rid of the duplicate.
Does anyone know a way to get rid of the duplicate snippet, or know how to merge the snippets in a way that doesn't exceed the data rate limit (or know how to change the Spyder rate limit)? Thank you!!
def occurs(word1, word2, file, filewrite):
    import os
    infile = open(file, 'r')  # opens file, reads, splits into lines
    lines = infile.read().splitlines()
    infile.close()
    wordlist = [word1, word2]  # this list allows for multiple words
    wordsString = ' '.join(lines)  # joins the lines back into one string
    words = wordsString.split()  # splits the text into individual words
    f = open(file, 'w')
    f.write("start")
    f.write(os.linesep)
    g = open(filewrite, 'w')
    g.write("start")
    g.write(os.linesep)
    for item in wordlist:  # multiple words
        matches = [i for i, w in enumerate(words) if w.lower().find(item) != -1]
        # above line goes through the words, finds the ones containing a keyword
        for m in matches:  # next lines find each instance of the word, print out surrounding words
            snippet = []
            s = ""
            l = " ".join(words[m-20:m+1])
            j = 0
            while j < 20:
                snippet.append(words[m+j])
                j = j + 1
                if words[m+j] == word1 or words[m+j] == word2:
                    j = 0
            print(snippet)
            k = " ".join(snippet)
            f.write(f"{s}...{l}{k}...")  # writes the data to the external file
            f.write(os.linesep)
            g.write(str(m))
            g.write(os.linesep)
    f.close()
    g.close()
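No answer is shown here, but one way to avoid duplicate snippets is to collect the match positions for all keywords into a single matches list first, and merge overlapping windows before writing anything. A rough sketch, assuming words and matches are built as in the code above:
# Collect a window of word indices around every match, then merge
# windows that overlap so each region is printed only once.
windows = sorted((max(0, m - 20), m + 21) for m in matches)
merged = []
for start, end in windows:
    if merged and start <= merged[-1][1]:  # overlaps the previous window
        merged[-1] = (merged[-1][0], max(merged[-1][1], end))
    else:
        merged.append((start, end))
for start, end in merged:
    f.write("..." + " ".join(words[start:end]) + "...")
    f.write(os.linesep)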

Total number of first words in a txt file

I need a program that counts the top 5 most common first words of the lines in a file, skipping lines where the first word is followed by "DM" or "RT".
I don't have any code so far because I'm completely lost.
f = open("C:/Users/Joe Simpleton/Desktop/talking.txt", "r")
?????
Read each line of your text in. For each line, split it into words using a regular expression, which returns a list of the words. If there are at least two words, test the second word to make sure it is not in your exclusion list. Then use a Counter() to keep track of the first-word counts. Store the lowercase of each word so that uppercase and lowercase versions of the same word are not counted separately:
from collections import Counter
import re

word_counts = Counter()
with open('talking.txt') as f_input:
    for line in f_input:
        words = re.findall(r'\w+', line)
        if (len(words) > 1 and words[1] not in ['DM', 'RT']) or len(words) == 1:
            word_counts[words[0].lower()] += 1  # count only the first word of the line
print(word_counts.most_common(5))
The Counter() has a useful feature in being able to show the most common values.
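For instance, a quick illustration of most_common():
from collections import Counter

c = Counter(['to', 'be', 'or', 'not', 'to', 'be'])
print(c.most_common(2))  # [('to', 2), ('be', 2)]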
Not tested, but should work roughly like that:
from collections import Counter

count = Counter()
with open("path") as f:
    for line in f:
        parts = line.split(" ")
        if parts[1] not in ["DM", "RT"]:
            count[parts[0]] += 1
print(count.most_common(5))
You should also add a check that ensures that parts has at least two elements.
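For instance, the loop body could be guarded like this (a sketch of the same logic with the length check added):
parts = line.split()
# Count the first word only when the line has one, and the second
# word (if any) is not DM or RT
if parts and (len(parts) == 1 or parts[1] not in ["DM", "RT"]):
    count[parts[0]] += 1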

Detect text connected

I'm trying to detect how many times a word appears in a txt file, even when the word is attached to other letters.
Detecting Hello
Text: Hellooo, how are you?
Expected output: 1
Here is the code I have now:
total = 0
with open('text.txt') as f:
    for line in f:
        finded = line.find('Hello')
        if finded != -1 and finded != 0:
            total += 1
print total
Do you know how I can fix this problem?
As suggested in the comment by @SruthiV, you can use re.findall from the re module:
import re

pattern = re.compile(r"Hello")
total = 0
with open('text.txt', 'r') as fin:
    for line in fin:
        total += len(re.findall(pattern, line))
print total
re.compile creates a pattern object for the regex to use, here "Hello". Using re.compile improves the program's performance and is (by some) recommended for repeated use of the same pattern.
The remaining part of the program opens the file, reads it line by line, and looks for occurrences of the pattern in every line using re.findall. Since re.findall returns a list of matches, total is updated with the length of that list, i.e. the number of matches in a given line.
Note: this program will count all occurrences of Hello, whether they appear as separate words or as parts of other words. Also, it is case sensitive, so hello will not be counted.
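If a case-insensitive count is wanted instead, one small variation is to compile the pattern with the re.IGNORECASE flag:
import re

# Matches Hello, hello, HELLO, ... including inside longer words
pattern = re.compile(r"hello", re.IGNORECASE)
total = 0
with open('text.txt', 'r') as fin:
    for line in fin:
        total += len(pattern.findall(line))
print(total)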
For every line, you can iterate through every word by splitting the line on spaces, which turns the line into a list of words. Then iterate through the words and check whether the string is in the word:
total = 0
with open('text.txt') as f:
    # Iterate through lines
    for line in f:
        # Iterate through words by splitting on spaces
        for word in line.split(' '):
            # Match string in word
            if 'Hello' in word:
                total += 1
print total

printing 5 words before and after a specific word in a file in python

I have a folder which contains some other folders, and these folders contain some text files. (The language is Persian.) I want to print the 5 words before and after a keyword, with the keyword in the middle of them. I wrote the code, but it gives the 5 words at the start and the end of the line, not the words around the keyword. How can I fix it?
Note: I am only showing the end of the code, which relates to the question above. The start of the code opens and normalizes the files.
def c():
    y = "آرامش"
    text = normal_text(folder_path)  # the first function to open and normalize the files
    for i in text:
        for line in i:
            if y in line:
                z = line.split()
                print(z[-6], z[-5],
                      z[-4], z[-3],
                      z[-2], z[-1], y,
                      z[+1], z[+2],
                      z[+3], z[+4],
                      z[+5], z[+6])
What I expect is something like this:
word word word word word keyword word word word word word
Each sentence on a new line.
Try this. It splits the words. Then it calculates the amount to show before and after (with a minimum of however much is left, and a maximum of 5) and shows it.
words = line.split()
if y in words:
    index = words.index(y)
    before = index - min(index, 5)
    after = index + min(len(words) - 1 - index, 5) + 1
    print(words[before:after])
You need to get the word indices based on your keyword's index. You can use the list.index() method to get the intended index, then use simple slicing to get the expected words:
for f in normal_text(folder_path):
    for line in f:
        if keyword in line:
            words = line.split()
            ind = words.index(keyword)
            print words[max(0, ind-5):min(ind+6, len(words))]
Or, as a more optimized approach, you can use a generator function to produce the words lazily, which is more memory-efficient.
def get_words(keyword):
    for f in normal_text(folder_path):
        for line in f:
            if keyword in line:
                words = line.split()
                ind = words.index(keyword)
                yield words[max(0, ind-5):min(ind+6, len(words))]
Then you can simply loop over the result to print the words, etc.
y = "آرامش"
for words in get_words(y):
    print(words)  # or do other stuff with the words
def c():
    y = "آرامش"
    text = normal_text(folder_path)  # the first function to open and normalize the files
    for i in text:
        for line in i:
            split_line = line.split()
            if y in split_line:
                index = split_line.index(y)
                print(' '.join(split_line[max(0, index-5):min(index+6, len(split_line))]))
Assuming the keyword must be an exact word.

python file manipulation

I have a file with entries such as:
26 1
33 2
.
.
.
and another file with sentences in English.
I have to write a script to print the 1st word in sentence number 26
and the 2nd word in sentence 33.
How do I do it?
The following code should do the task, assuming the files are not too large. You may have to modify it to deal with edge cases (like double spaces, etc.).
# Get (sentence number, word number) pairs from the first file
pairs = []
with open('1.txt') as file:
    for line in file:
        sent_num, word_num = map(int, line.split())
        pairs.append((sent_num, word_num))

# Get text from the second file
with open('2.txt') as file:
    text = file.read()

# Parse the text into a list of sentences, each a list of words
data = []
for sentence in text.replace('\n', ' ').split('.'):  # split into sentences
    words = sentence.split()  # split each sentence into a word list
    if len(words) > 0:
        data.append(words)

# Print the requested word of each requested sentence (numbers are 1-based)
for sent_num, word_num in pairs:
    print data[sent_num - 1][word_num - 1]
Here's a general sketch:
Read the first file into a list (a numeric entry in each element)
Read the second file into a list (a sentence in each element)
Iterate over the entry list, for each number find the sentence and print its relevant word
Now, if you show some effort of how you tried to implement this in Python, you will probably get more help.
The big issue is that you have to decide what separates "sentences". For example, is a '.' the end of a sentence? Or maybe part of an abbreviation, e.g. the one I've just used? Secondarily, and less difficult, what separates "words", e.g., is "TCP/IP" one word, or two?
Once you have sharply defined these rules, you can easily read the file of text into a list of "sentences", each of which is a list of "words". Then you read the other file as a sequence of pairs of numbers and use them as indices into the overall list and inside the sublist thus identified. But the problem of sentence and word separation is really the hard part.
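As an illustration, a sentence splitter that is slightly less naive than splitting on '.' alone might break on sentence-ending punctuation followed by whitespace; it is still imperfect around abbreviations, as noted above (the file name 2.txt is just an assumed example):
import re

with open('2.txt') as f:
    text = f.read()
# Split on '.', '!' or '?' followed by whitespace, then split each
# sentence into words; empty pieces are dropped.
sentences = [s.split() for s in re.split(r'(?<=[.!?])\s+', text) if s.split()]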
In the following code, I am assuming that sentences end with '. '. You can modify it easily to accommodate other sentence delimiters as well. Note that abbreviations will therefore be a source of bugs.
Also, I am going to assume that words are delimited by spaces.
sentences = []
queries = []

# Read the text file into one string, then split it into sentences
english = ""
for line in file2:
    english += line
while english:
    period = english.find('.')
    if period == -1:
        break
    sentences.append(english[: period+1].split())
    english = english[period+1 :]

# Read the number file into a flat list of numbers
q = ""
for line in file1:
    q += " " + line.strip()
q = q.split()
for i in range(0, len(q)-1, 2):
    sentence = int(q[i])
    word = int(q[i+1])
    queries.append((sentence, word))

for s, w in queries:
    print sentences[s-1][w-1]
I haven't tested this, so please let me know (preferably with the case that broke it) if it doesn't work and I will look into bugs
Hope this helps
