Finding specific words and adding them to a dictionary - Python

I want to find the words that start with "CHAPTER" and add them to a dictionary.
I have written some code, but it gives me 0 as output all the time:
def wordcount(filename, listwords):
    try:
        file = open(filename, "r")
        read = file.readlines()
        file.close()
        for word in listwords:
            lower = word.lower()
            count = 0
            for sentence in read:
                line = sentence.split()
                for each in line:
                    line2 = each.lower()
                    line2 = line2.strip()
                    if lower == line2:
                        count += 1
            print(lower, ":", count)
    except FileNotFoundError:
        print("The file is not there")

wordcount("dad.txt", ["CHAPTER"])
The txt file is here.
EDIT: The problem was the encoding type, and I solved it. But the new question is: how can I add these words to a dictionary?
And how can I make this code case sensitive? I mean, when I type wordcount("dad.txt", ["CHAPTER"]), I want it to find only CHAPTER in upper case.

It cannot work because of this line:
if lower == line2:
You can use this line instead to find the words that start with "CHAPTER":
if line2.startswith(lower):

I notice that you need to check whether a word starts with one of the words from listwords, rather than testing equality (lower == line2). Hence, you should use the startswith method.
You can have simpler code, something like this:
def wordcount(filename, listwords):
    listwords = [s.lower() for s in listwords]
    wordCount = {s: 0 for s in listwords}  # a dict to store the counts
    with open(filename, "r") as f:
        for line in f.readlines():
            for word in line.split():
                for s in listwords:
                    if word.lower().startswith(s):
                        wordCount[s] += 1
    return wordCount
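Note that the code above lower-cases everything, so it is case insensitive. To address the second question in the edit, a minimal case-sensitive sketch is to drop the .lower() calls entirely (the function name and the utf-8 encoding are my assumptions, the latter based on the encoding fix mentioned in the edit):

def wordcount_case_sensitive(filename, listwords):
    # hypothetical variant: counts exact-case prefix matches only,
    # e.g. "CHAPTER" but not "Chapter" or "chapter"
    word_count = {s: 0 for s in listwords}
    with open(filename, "r", encoding="utf-8") as f:  # encoding is an assumption
        for line in f:
            for word in line.split():
                for s in listwords:
                    if word.startswith(s):
                        word_count[s] += 1
    return word_count

print(wordcount_case_sensitive("dad.txt", ["CHAPTER"]))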

If the goal is to find chapters and paragraphs, don't try to count words or split any line.
For example, start simpler. Since chapters are in numeric order, you only need a list, not a dictionary:
chapters = []  # list of chapters
chapter = ""   # store one chapter
with open(filename, encoding="UTF-8") as f:
    for line in f.readlines():
        # TODO: should skip to the first line that starts with "CHAPTER",
        # otherwise 'chapters' gets extra, header information
        if line.startswith("CHAPTER"):
            print("Found chapter: " + line)
            # Save off the most recent, non-empty chapter text, and reset
            if chapter:
                chapters.append(chapter)
                chapter = ""
        else:
            # up to you if you want to skip empty lines
            chapter += line  # don't manipulate any data yet

# Capture the last chapter at the end of the file
if chapter:
    chapters.append(chapter)
del chapter  # no longer needed
# del chapters[0] if you want to remove the header information before the first chapter header

# Done reading the file, now work with the strings in your list
print(len(chapters))  # find how many chapters there are
If you actually did want the text following "CHAPTER", you can split that line in the first if statement. Note, however, that the chapter numbers repeat between volumes, and this solution assumes the volume header is part of a chapter.
If you want to count the paragraphs, start by finding the empty lines (for example, split each chapter on '\n\n'), as in the sketch below.
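As a minimal sketch of that paragraph count, assuming paragraphs within a chapter are separated by blank lines (which show up as '\n\n' in the accumulated text):

# 'chapters' is the list built above
for i, chapter in enumerate(chapters, start=1):
    paragraphs = [p for p in chapter.split("\n\n") if p.strip()]
    print("Chapter", i, "has", len(paragraphs), "paragraphs")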

Compound if condition inside list comprehension doesn't seem to work

I have this code:
jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
with open(ticket_file, 'r') as f:
    tickets = [word for line in f for word in line.split() if jira_regex.match(word) and word not in tickets]
ticket_file contains this:
PRJ1-2333
PRJ1-2333
PRJ1-2333
PRJ2-2333
PRJ2-2333
MISC-5002
After the code runs, the tickets list contains these:
['PRJ1-2333', 'PRJ1-2333', 'PRJ1-2333', 'PRJ2-2333', 'PRJ2-2333', 'MISC-5002']
I expected this:
['PRJ1-2333', 'PRJ2-2333', 'MISC-5002']
Why is the word not in tickets condition not eliminating duplicates? The regex filter is working fine, however.
You can use a set:
Sets can only contain unique values.
I've used set(...) to be explicit, but set(...) can be replaced with {...}.
This implementation builds a generator inside set().
Don't use a list comprehension inside (e.g. set([...])), because the list can potentially use a lot of memory.
word not in tickets causes NameError: name 'tickets' is not defined because, from the perspective of the list comprehension, tickets does not exist yet.
If you're not getting a NameError, it's because tickets already exists in memory, or tickets is assigned elsewhere in your code, just not in this example.
Given the example code, if you clear the environment and run the code, you'll get that error.
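A minimal demonstration, with hypothetical names, run in a fresh interpreter where result has never been assigned:

words = ["a", "b", "a"]
# raises NameError: name 'result' is not defined, because the comprehension
# is evaluated before the assignment to 'result' takes place
result = [w for w in words if w not in result]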
.match returns something like <re.Match object; span=(0, 9), match='PRJ1-2333'> or None
Where match = jira_regex.match(t), if there's a match, get the value with match[0].
word for line in f for word in line.split() if jira_regex.match(word) assumes that if jira_regex.match(word) isn't None, the match is always equal to word. Based on the sample data this is the case, but I don't know whether that holds for the real data.
jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
with open('test.txt', 'r') as f:
    tickets = set(word for line in f for word in line.split() if jira_regex.match(word))

print(tickets)
# {'MISC-5002', 'PRJ1-2333', 'PRJ2-2333'}
Without .split():
It seems as if line.split() is being used to get rid of the newline, which can be accomplished with line.strip()
Option 1:
jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
with open('test.txt', 'r') as f:
    tickets = set(jira_regex.match(word.strip())[0] for word in f)  # assumes .match will never be None

print(tickets)
# {'MISC-5002', 'PRJ1-2333', 'PRJ2-2333'}
Option 2:
jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
with open('test.txt', 'r') as f:
    tickets = set(word.strip() for word in f if jira_regex.match(word.strip()))

print(tickets)
# {'MISC-5002', 'PRJ1-2333', 'PRJ2-2333'}
For the code to be explicit:
jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
tickets = list()
with open('test.txt', 'r') as f:
    for t in f:
        t = t.strip()  # remove whitespace from beginning and end, including newlines
        match = jira_regex.match(t)  # assign .match to a variable
        if match is not None:  # check if a match was found
            match = match[0]  # extract the match value; depending on the data, this may not be the same as 't'
            if match not in tickets:  # check if match is already in tickets
                tickets.append(match)  # if not, add it to tickets

print(tickets)
# ['PRJ1-2333', 'PRJ2-2333', 'MISC-5002']
Why is the word not in tickets condition not eliminating duplicates?
It is because the variable tickets does not exist until the list comprehension has finished.
You can do a set comprehension like this (not tested):
tickets = {word for line in f for word in line.split() if jira_regex.match(word)}
I'm assuming you predefined tickets in your code. The reason the if statement is not working is that although you are adding more and more values into tickets, the tickets in your if statement always refers to the original empty list, so word not in tickets is always true.
I believe this is what you are trying to do:
jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
with open(ticket_file, 'r') as f:
    [tickets.append(word) for line in f for word in line.split() if jira_regex.match(word) and word not in tickets]
Why is the word not in tickets condition not eliminating duplicates? It is not working because
tickets = [word for line in f for word in line.split() if jira_regex.match(word) and word not in tickets]
is a list comprehension, and hence assigns a value to the variable 'tickets' only after reading all the content from your file. In short, the condition word not in tickets adds literally nothing to the code, since 'tickets' won't be assigned until all the text has been read. What you can do instead is:
jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
with open(ticket_file, 'r') as f:
    tickets = [word for line in f for word in line.split() if jira_regex.match(word)]
tickets = set(tickets)
This will remove all your duplicate values
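One caveat, as an addition: a set is unordered, while the expected output in the question happens to preserve first-seen order. If order matters, a sketch using dict.fromkeys (dicts preserve insertion order in Python 3.7+) dedupes while keeping the first occurrence:

with open(ticket_file, 'r') as f:
    tickets = list(dict.fromkeys(
        word for line in f for word in line.split() if jira_regex.match(word)
    ))
print(tickets)  # ['PRJ1-2333', 'PRJ2-2333', 'MISC-5002'] with the sample data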

Detect text connected

I'm trying to detect how many times a word appears in a txt file, but the word may be connected with other letters.
Detecting Hello
Text: Hellooo, how are you?
Expected output: 1
Here is the code I have now:
total = 0
with open('text.txt') as f:
    for line in f:
        finded = line.find('Hello')
        if finded != -1 and finded != 0:
            total += 1
print(total)
Do you know how I can fix this problem?
As suggested in the comment by @SruthiV, you can use re.findall from the re module:
import re

pattern = re.compile(r"Hello")
total = 0
with open('text.txt', 'r') as fin:
    for line in fin:
        total += len(re.findall(pattern, line))
print(total)
re.compile creates a pattern for the regex engine to use, here "Hello". Using re.compile improves the program's performance and is (by some) recommended for repeated usage of the same pattern.
The remaining part of the program opens the file, reads it line by line, and looks for occurrences of the pattern in every line using re.findall. Since re.findall returns a list of matches, total is updated with the length of that list, i.e. the number of matches in a given line.
Note: this program will count all occurrences of Hello, whether as separate words or as parts of other words. Also, it is case sensitive, so hello will not be counted.
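If you do want lower-case variants counted as well (an assumption on my part, since the question only mentions Hello), a minimal tweak is to compile the pattern with the re.IGNORECASE flag:

pattern = re.compile(r"Hello", re.IGNORECASE)  # now also matches hello, HELLO, HeLLo, ...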
For every line, you can iterate through every word by splitting the line on spaces, which turns the line into a list of words. Then iterate through the words and check whether the string is in each word:
total = 0
with open('text.txt') as f:
    # Iterate through lines
    for line in f:
        # Iterate through words by splitting on spaces
        for word in line.split(' '):
            # Match string in word
            if 'Hello' in word:
                total += 1
print(total)

Replace words of a long document in Python

I have a dictionary dict with some words (2000), and I have a huge text, like a Wikipedia corpus, in text format. For each word that is both in the dictionary and in the text file, I would like to replace it with word_1.
with open("wiki.txt",'r') as original, open("new.txt",'w') as mod:
for line in original:
new_line = line
for word in line.split():
if (dict.get(word.lower()) is not None):
new_line = new_line.replace(word,word+"_1")
mod.write(new_line)
This code creates a new file called new.txt with the words that appear in the dictionary replaced as I want.
This works for short files, but for the longer ones that I am using as input, it "freezes" my computer.
Is there a more efficient way to do that?
Edit for Adi219:
Your code seems to work, but there is a problem:
if a line is like Albert is a friend of Albert, and my dictionary contains Albert, then after the for loop the line will be: Albert_1_1 is a friend of Albert_1. How can I replace only the exact word that I want, to avoid repetitions like _1_1_1_1?
Edit2:
To solve the previous problem, I changed your code:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
for line in original:
words = line.split()
for word in words:
if dict.get(word.lower()) is not None:
mod.write(word+"_1 ")
else:
mod.write(word+" ")
mod.write("\n")
Now everything should work
A few things:
You could remove the declaration of new_line. Then, change the new_line = new_line.replace(...) line to line = line.replace(...). You would also have to write(line) afterwards.
You could add words = line.split() and use for word in words: for the for loop, as this removes a call to .split() for every iteration through the words.
You could (manually(?)) split your large .txt file into multiple smaller files and have multiple instances of your program running on each file, and then you could combine the multiple outputs into one file. Note: You would have to remember to change the filename for each file you're reading/writing to.
So, your code would look like:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
for line in original:
words = line.split()
for word in words:
if dict.get(word.lower()) is not None:
line = line.replace(word, word + "_1")
mod.write(line)
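Note that line.replace replaces every occurrence in the line, including words that were already tagged, which is exactly what produces the _1_1 repetitions described in the edit. A sketch of an alternative using re.sub with \b word-boundary anchors (my suggestion, not part of the answer above; words_to_tag is a hypothetical name standing in for your dictionary's keys) tags each whole word exactly once and also copes with punctuation that split() leaves attached:

import re

pattern = re.compile(r"\b(" + "|".join(map(re.escape, words_to_tag)) + r")\b",
                     re.IGNORECASE)  # IGNORECASE mirrors the word.lower() lookup

with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        mod.write(pattern.sub(r"\1_1", line))  # each whole word becomes word_1, exactly once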

How do you check for the presence of a string in a text file based on the elements of an array?

I have an array containing strings.
I have a text file.
I want to loop through the text file line by line.
And check whether each element of my array is present or not.
(they must be whole words and not substrings)
I am stuck because my script only checks for the presence of the first array element.
However, I would like it to return results with each array element and a note as to whether this array element is present in the entire file or not.
#!/usr/bin/python
with open("/home/all_genera.txt") as file:
    generaA = []
    for line in file:
        line = line.strip('\n')
        generaA.append(line)

with open("/home/config/config2.cnf") as config_file:
    counter = 0
    for line in config_file:
        line = line.strip('\n')
        for part in line.split():
            if generaA[counter] in part:
                print(generaA[counter], "is -----> PRESENT")
            else:
                continue
    counter += 1
If I understand correctly, you want a sequence of words that are in both files. If yes, set is your friend:
def parse(f):
    return set(word for line in f for word in line.strip().split())

with open("path/to/genera/file") as f:
    source = parse(f)
with open("path/to/conf/file") as f:
    conf = parse(f)

# elements that are common to both sets
common = conf & source
print(common)

# elements that are in `source` but not in `conf`
print(source - conf)

# elements that are in `conf` but not in `source`
print(conf - source)
So, to answer "I would like it to return results with each array element and a note as to whether this array element is present in the entire file or not": you can use either the common elements or the source - conf difference to annotate your source list:
# using common elements
common = conf & source
result = [(word, word in common) for word in source]
print(result)

# using difference
diff = source - conf
result = [(word, word not in diff) for word in source]
Both will yield the same result, and since set lookup is O(1), performance should be similar too, so I suggest the first solution (positive assertions are easier on the brain than negative ones).
You can of course apply further cleaning / normalisation when building the sets, e.g. if you want case-insensitive search:
def parse(f):
    return set(word.lower() for line in f for word in line.strip().split())
from collections import Counter
import re

# First, normalize the text: lowercase everything and remove punctuation (anything not alphanumeric).
# Note that this normalization is subject to the rules of the language/alphabet/dialect
# you are using, and English ASCII may not cover it.
# \s keeps whitespace, so words on different lines don't get glued together.
normalized_text = re.sub(r"[^a-z0-9\s]", "", open("some.txt", "r").read().lower())

# Counter will collect all the words into a dictionary of [word]:count
words = Counter(normalized_text.split())

# create a new set of all the words that appear in both the text and our word_list_array
set(my_word_list_array).intersection(words.keys())
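If you also want the counts for those common words, not just their presence (an extra step beyond the answer above), the Counter can be indexed directly:

common = set(my_word_list_array).intersection(words.keys())
print({w: words[w] for w in common})  # how often each listed word appears in the text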
The counter is not increasing because it's outside the for loops.
with open("/home/all_genera.txt") as myfile: # don't use 'file' as variable, is a reserved word! use myfile instead
generaA=[]
for line in myfile: # use .readlines() if you want a list of lines!
generaA.append(line)
# if you just need to know if string are present in your file, you can use .read():
with open("/home/config/config2.cnf") as config_file:
mytext = config_file.read()
for mystring in generaA:
if mystring in mytext:
print mystring, "is -----> PRESENT"
# if you want to check if your string in line N is present in your file in the same line, you can go with:
with open("/home/config/config2.cnf") as config_file:
for N, line in enumerate(config):
if generaA[N] in line:
print "{0} is -----> PRESENT in line {1}".format(generaA[N], N)
I hope that everything is clear.
This code could be improved in many ways, but I tried to keep it as similar to yours as possible, so it will be easier to understand.

Python file manipulation

I have a file with entries such as:
26 1
33 2
.
.
.
and another file with sentences in English.
I have to write a script to print the 1st word in sentence number 26
and the 2nd word in sentence 33.
How do I do it?
The following code should do the task, with the assumption that the files are not too large. You may have to do some modification to deal with edge cases (double spaces, etc.).
# Get numbers from file
num = []
with open('1.txt') as file:
    num = file.readlines()

# Get text from file
text = []
with open('2.txt') as file:
    text = file.readlines()

# Parse text into a list of sentences, each a list of words.
data = []
for line in text:  # for each paragraph in the text
    sentences = line.strip().split('.')  # split it into sentences
    for sentence in sentences:  # for each sentence in the paragraph
        words = sentence.split()  # split it into a word list
        if words:
            data.append(words)

# get the desired result (each entry is "sentence_number word_number", 1-based)
for entry in num:
    sentence_no, word_no = (int(x) for x in entry.split())
    print(data[sentence_no - 1][word_no - 1])
Here's a general sketch:
Read the first file into a list (a numeric entry in each element)
Read the second file into a list (a sentence in each element)
Iterate over the entry list, for each number find the sentence and print its relevant word
Now, if you show some effort of how you tried to implement this in Python, you will probably get more help.
The big issue is that you have to decide what separates "sentences". For example, is a '.' the end of a sentence? Or maybe part of an abbreviation, e.g. the one I've just used?-) Secondarily, and less difficult, what separates "words", e.g., is "TCP/IP" one word, or two?
Once you have sharply defined these rules, you can easily read the file of text into a list of "sentences", each of which is a list of "words". Then, you read the other file as a sequence of pairs of numbers, and use them as indices into the overall list and into the sublist thus identified. But the problem of sentence and word separation is really the hard part.
In the following code, I am assuming that sentences end with '. '. You can modify it easily to accommodate other sentence delimiters as well. Note that abbreviations will therefore be a source of bugs.
Also, I am going to assume that words are delimited by spaces.
sentences = []
queries = []

english = ""
for line in file2:
    english += line

# split the text into sentences on '.'; each sentence becomes a list of words
while english:
    period = english.find('.')
    if period == -1:  # no period left; treat any remaining text as the last sentence
        if english.strip():
            sentences.append(english.split())
        break
    sentences.append(english[: period + 1].split())
    english = english[period + 1:]

q = ""
for line in file1:
    q += " " + line.strip()
q = q.split()

# pair up the numbers: sentence number, then word number
for i in range(0, len(q) - 1, 2):
    sentence = int(q[i])
    word = int(q[i + 1])
    queries.append((sentence, word))

for s, w in queries:
    print(sentences[s - 1][w - 1])
I haven't tested this, so please let me know (preferably with the case that broke it) if it doesn't work, and I will look into the bugs.
Hope this helps.
