How to find a required word in a novel in Python?

I have a text and I have got a task in python with reading module:
Find the names of people who are referred to as Mr. XXX. Save the result in a dictionary with the name as key and number of times it is used as value. For example:
If Mr. Churchill is in the novel, then include {'Churchill' : 2}
If Mr. Frank Churchill is in the novel, then include {'Frank Churchill' : 4}
The file is .txt and it contains around 10-15 paragraphs.
Do you have any ideas about how this can be improved? (It gives me an error after some words; I guess the error happens because one of the Mr. occurrences is at the end of a line.)
orig_text = open('emma.txt', encoding='UTF-8')
lines = orig_text.readlines()[32:16267]
counts = dict()
for line in lines:
    wordsdirty = line.split()
    try:
        print(wordsdirty[wordsdirty.index('Mr.') + 1])
    except ValueError:
        continue

Try this:
import re
text = "When did Mr. Churchill told Mr. James Brown about the fish"
m = [x[0] for x in re.findall(r'(Mr\.( [A-Z][a-z]*)+)', text)]
You get:
['Mr. Churchill', 'Mr. James Brown']
To solve the line issue simply read the entire file:
text = file.read()
Then, to count the occurrences, simply run:
from collections import Counter
Counter(m)
Finally, if you'd like to drop 'Mr. ' from all your dictionary entries, use x[0][4:] instead of x[0].
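Putting the pieces of this answer together, here is a minimal sketch that counts the names in one pass. The function name count_mr_names and the sample string are made up for illustration, and the pattern is a slight variant of the one above: it uses \s+ instead of a single space, on the assumption that a name may continue on the next line of the file.

```python
import re
from collections import Counter

def count_mr_names(text):
    """Count names that follow 'Mr.' (hypothetical helper, not from the question)."""
    # "Mr." followed by one or more capitalised words; \s+ also matches a newline,
    # so "Mr." at the end of a line still picks up the name on the next line.
    pattern = re.compile(r"Mr\.((?:\s+[A-Z][a-z]+)+)")
    # Collapse any whitespace run (including newlines) into single spaces.
    names = (" ".join(match.split()) for match in pattern.findall(text))
    return dict(Counter(names))

# Made-up sample text; in the real task this would be the result of file.read().
sample = (
    "Mr. Churchill met Mr. Frank Churchill. Later Mr.\n"
    "Churchill left with Mr. Frank\nChurchill."
)
counts = count_mr_names(sample)
print(counts)
```

Because the capturing group excludes "Mr." itself, there is no need to slice off the prefix afterwards.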

This can be easily done using regex and capturing group.
Take a look here for reference, in this scenario you might want to do something like
# retrieve a list of strings that match your regex
matches = re.findall(r"Mr\. ([a-zA-Z]+)", your_entire_file) # not sure about the regex
# then create a dictionary and count the occurrences of each match
# if you are allowed to use modules, this can be done using Counter
Counter(matches)
To access the entire file like that, you might want to map it to memory, take a look at this question

Related

How can I split concatenated strings that contain no delimiters in python?

Let's say I have a list of concatenated firstname + lastname combinations like this:
["samsmith","sallyfrank","jamesandrews"]
I also have lists possible_firstnames and possible_lastnames.
If I want to split those full name strings based on values that appear in possible_firstnames and possible_lastnames, what is the best way of doing so?
My initial strategy was to compare characters between full name strings and each possible_firstnames/possible_lastnames value one by one, where I would split the full name string on discovery of a match. However, I realize that I would encounter a problem if, for example, "Sal" was included as a possible first name (my code would try to turn "sallyfrank" into "Sal Lyfrank" etc).
My next step would be to crosscheck what remains in the string after "sal" to values in possible_lastnames before finalizing the split, but this is starting to approach the convoluted and so I am left wondering if there is perhaps a much simpler option that I have been overlooking from the very beginning?
The language that I am working in is Python.
If you are getting similar names, like sam, samantha and saman, order the list so that longer names come before their shorter prefixes (the shortest last):
full_names = ["samsmith","sallyfrank","jamesandrews", "samanthasang", "samantorres"]
first_name = ["sally","james", "samantha", "saman", "sam"]
matches = []
for name in full_names:
    for first in first_name:
        if name.startswith(first):
            matches.append(f'{first} {name[len(first):]}')
            break
print(*matches, sep='\n')
Result
sam smith
sally frank
james andrews
samantha sang
saman torres
This won't pick out a name like Sam Antony. It would show this as "Saman Tony", in which case your last name idea would work.
It also won't pick out Sam Anthanei. This could be Samantha Nei, Saman Thanei or Sam Anthanei if all three surnames were in your surname list.
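The last name idea mentioned above could look roughly like this. It is only a sketch: split_full_name and the sample name lists are invented for illustration. It accepts a split only when the remainder after a known first name is itself a known last name, and it returns every valid split so that ambiguous cases stay visible.

```python
def split_full_name(full_name, first_names, last_names):
    """Return every (first, last) split where both halves are known names."""
    last_set = set(last_names)  # set lookup keeps the membership test cheap
    return [
        (first, full_name[len(first):])
        for first in first_names
        if full_name.startswith(first) and full_name[len(first):] in last_set
    ]

# Hypothetical name lists, including the troublesome "sal" prefix.
possible_firstnames = ["sal", "sally", "sam", "saman", "samantha", "james"]
possible_lastnames = ["smith", "frank", "andrews", "antony", "tony"]

print(split_full_name("sallyfrank", possible_firstnames, possible_lastnames))
print(split_full_name("samantony", possible_firstnames, possible_lastnames))
```

A name with more than one result (such as "samantony" here) is genuinely ambiguous given these lists, so the caller has to decide how to break the tie.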
Is this what you wanted?
names = ["samsmith","sallyfrank","jamesandrews"]
pos_fname = ["sally","james"]
pos_lname = ["smith","frank"]
matches = []
for i in names:
    # skip the name unless it starts with a known first name
    for n in pos_fname:
        if i.startswith(n):
            break
    else:
        continue
    # keep the name only if it also ends with a known last name
    for n in pos_lname:
        if i.endswith(n):
            matches.append(f"{i[:-len(n)].upper()} {n.upper()}")
            break
    else:
        continue
print(matches)

Create a list of alphabetically sorted UNIQUE words and display the first N words in python

I am new to Python, so apologies for a simple question. My task is the following:
Create a list of alphabetically sorted unique words and display the first 5 words.
I have a text variable, which contains a lot of text.
I did:
test = text.split()
sorted(test)
As a result, I receive a list that starts with symbols like $ and numbers.
How do I get to the words and print the first N of them?
I'm assuming that by "word" you mean strings that consist only of alphabetic characters. In that case, you can use filter to first get rid of the unwanted strings, turn the result into a set, sort it, and then print the first few entries.
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words that consist of alphabets
words = filter(lambda x: x.isalpha(), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', 'of', 'peak']
But the problem with this is that it will still ignore words like mountain's, because of that pesky '. A regex solution might actually be far better in such a case-
For now, we'll go with this regex - ^[A-Za-z']+$, which means the string must contain only letters and '. You may add more to this regex according to what you deem to be "words". Read more on regexes here.
We'll be using re.match instead of .isalpha this time.
import re
WORD_PATTERN = re.compile(r"^[A-Za-z']+$")
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words that consist of alphabets
words = filter(lambda x: bool(WORD_PATTERN.match(x)), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', "mountain's", 'of']
Keep in mind, however, that this gets tricky when you have a string like "hi! What's your name?". Here hi! and name? are words, except that they are not fully alphabetic. The trick is to split the text in such a way that you get hi instead of hi! and name instead of name? in the first place.
Unfortunately, a true word split is far outside the scope of this question. I suggest taking a look at this question
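As a rough middle ground, one option is to pull the words out with re.findall instead of splitting on spaces and filtering. The sketch below is not a true word tokenizer: first_unique_words is a made-up helper, the pattern only knows ASCII letters and apostrophes, and lowercasing the text is an assumption so that "Hi" and "hi" count as the same word.

```python
import re

def first_unique_words(text, n=5):
    """Return the first n unique words in alphabetical order (illustrative helper)."""
    # Runs of letters, optionally joined by apostrophes (e.g. "what's").
    # Surrounding punctuation such as "!" or "?" is simply never matched.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text.lower())
    return sorted(set(words))[:n]

result = first_unique_words("hi! What's your name? Hi again, $5 please.")
print(result)
```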
I am a newbie here, so apologies for any mistakes. Thank you.
test = '''The coronavirus outbreak has hit hard the cattle farmers in Pabna and Sirajganj as they are now getting hardly any customer for the animals they prepared for the last year targeting the Eid-ul-Azha this year.
Normally, cattle traders flock in large numbers to the belt -- one of the biggest cattle producing areas of the country -- one month ahead of the festival, when Muslims slaughter animals as part of their efforts to honour Prophet Ibrahim's spirit of sacrifice.
But the scene is different this year.'''
test = test.lower().split()
test2 = sorted({j for j in test if j.isalpha()})  # a set keeps each word only once
print(test2[:5])
You can slice the sorted list to get its first 5 elements:
sorted(test)[:5]
or, if you are looking only for words:
sorted([i for i in test if i.isalpha()])[:5]
or with a regex:
sorted([i for i in test if re.search(r"[a-zA-Z]", i)])[:5]
Slicing a list returns all of its elements up to a specific index, in this case 5.

Replace a specific word given its position in a text file (Python)

I have a list of tuples, each one containing a word to be replaced and its line and column position in a given text file (e.g. [('word1', 1, 1), ('word2', 1, 9), ... ]). I want to go through the text file and replace the specific word at each specific position with a character.
In other words, given a specific word, its line and column numbers inside a text file I am trying to find and replace that word with a character, for example:
given that the text file contains the following (assuming its position is as it is displayed -not written- here)
Excited him now natural saw passage offices you minuter. At by stack
being court hopes. Farther so friends am to detract. Forbade concern
do private be. Offending residence but men engrossed shy. Pretend am
stack earnest arrived company so on. Felicity informed yet had to is
admitted strictly how stack you.
and given that the word to replace is stack with position in the text to be line 3 and column 16, to replace it with the character *,
so, after the replace takes place, the text file would now have the contents:
Excited him now natural saw passage offices you minuter. At by stack
being court hopes. Farther so friends am to detract. Forbade concern
do private be. Offending residence but men engrossed shy. Pretend am
* earnest arrived company so on. Felicity informed yet had to is
admitted strictly how stack you.
I have considered linecache but it seems very inefficient for large text files. Also, given the fact that I already have the line and column numbers, I hoped there was a way to go directly to that position and perform the replace.
Does anyone know a way to do this in Python?
EDIT
The initial solution proposed using numpy's genfromtxt is (most likely) not suitable following the discussion in the follow-up issue since there is a need for every line of the text file to be present and not skipped (e.g. empty lines, strings beginning with 'w' and strings inside '/*.. /').
Try a recipe like this:
import numpy as np
import os
def changethis(pos):
    # Notice file is in global scope
    appex = file[pos[1]-1][:pos[2]] + '*' + file[pos[1]-1][pos[2]+len(pos[0]):]
    file[pos[1]-1] = appex
pos = ('stack', 3, 16)
file = np.array([i for i in open('in.txt','r')]) #BEFORE EDIT: np.genfromtxt('in.txt',dtype='str',delimiter=os.linesep)
changethis(pos)
print(file)
The result is this:
[ 'Excited him now natural saw passage offices you minuter. At by stack being court hopes. Farther'
'so friends am to detract. Forbade concern do private be. Offending residence but men engrossed'
'shy. Pretend am * earnest arrived company so on. Felicity informed yet had to is admitted'
'strictly how stack you.']
Notice this is a bit of a hack: it puts a bunch of long strings into a numpy array and then changes them, but it should be efficient when run in a longer loop over position tuples.
EDIT: As @user2357112 made me realize, the choice of file reader was not the most appropriate (although it worked for the exercise in question), so I've edited this answer to provide the same solution given in the follow-up question.
Consider a single line:
word1 a word2 a word3 a word4
If you have these changes:
[('word1', 1, 1), ('word2', 1, 9), ... ]
And you process them in order:
* a word2 a word3 a word4
You will fail, because you are changing the positions of the words when you replace 'word1' with '*', a shorter string.
Instead, you will have to sort the list of changes by line, reversed by column:
changes = sorted(changes, key=lambda t: (t[1], -t[2]))
You can then process the changes as you iterate through the file, as shown in the link referenced by @JRajan:
with open("file", "r") as fp:
    fpline_text = enumerate(fp)
    fpline,text = next(fpline_text)
    for edit in changes:
        word,line,offset = edit
        line -=1 # 0 based
        while fpline < line:
            print(text)
            fpline,text = next(fpline_text)
        offset -= 1 # 0-based
        cand = text[offset:offset+len(word)]
        if cand != word:
            print("OOPS! Word '{}' not found at ({}, {})".format(*edit))
        else:
            text = text[0:offset]+'*'+text[offset+len(word):]
    # Rest of file
    try:
        while True:
            print(text)
            fpline,text = next(fpline_text)
    except StopIteration:
        pass
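To see the ordering trick from this answer in isolation, here is a small in-memory sketch. apply_replacements is a hypothetical helper (not part of either answer), it works on a list of lines rather than a file, and it assumes 1-based line and column numbers as in the question.

```python
def apply_replacements(lines, changes, replacement="*"):
    """Replace each (word, line, column) target; line and column are 1-based."""
    result = list(lines)
    # Sort by line, then right-to-left within a line, so that replacing a word
    # never shifts the columns of targets that are still waiting to be processed.
    for word, line_no, col in sorted(changes, key=lambda t: (t[1], -t[2])):
        text = result[line_no - 1]
        start = col - 1
        if text[start:start + len(word)] != word:
            raise ValueError(f"{word!r} not found at line {line_no}, column {col}")
        result[line_no - 1] = text[:start] + replacement + text[start + len(word):]
    return result

lines = ["word1 a word2 a word3 a word4"]
changes = [("word1", 1, 1), ("word2", 1, 9)]
replaced = apply_replacements(lines, changes)
print(replaced)
```

For a large file the same loop can be run per line while streaming, as in the code above; the in-memory list here just keeps the example short.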

Regex to help split up list into two-tuples

Given a list of actors, with their character names in brackets, separated by either a semicolon (;) or a comma (,):
Shelley Winters [Ruby]; Millicent Martin [Siddie]; Julia Foster [Gilda];
Jane Asher [Annie]; Shirley Ann Field [Carla]; Vivien Merchant [Lily];
Eleanor Bron [Woman Doctor], Denholm Elliott [Mr. Smith; abortionist];
Alfie Bass [Harry]
How would I parse this into a list of two-tuples in the form of [(actor, character),...]?
--> [('Shelley Winters', 'Ruby'), ('Millicent Martin', 'Siddie'),
('Denholm Elliott', 'Mr. Smith; abortionist')]
I originally had:
actors = [item.strip().rstrip(']') for item in re.split('\[|,|;',data['actors'])]
data['actors'] = [(actors[i], actors[i + 1]) for i in range(0, len(actors), 2)]
But this doesn't quite work, as it also splits up items within brackets.
You can go with something like:
>>> re.findall(r'(\w[\w\s\.]+?)\s*\[([\w\s;\.,]+)\][,;\s$]*', s)
[('Shelley Winters', 'Ruby'),
('Millicent Martin', 'Siddie'),
('Julia Foster', 'Gilda'),
('Jane Asher', 'Annie'),
('Shirley Ann Field', 'Carla'),
('Vivien Merchant', 'Lily'),
('Eleanor Bron', 'Woman Doctor'),
('Denholm Elliott', 'Mr. Smith; abortionist'),
('Alfie Bass', 'Harry')]
One can also simplify some things with .*?:
re.findall(r'(\w.*?)\s*\[(.*?)\][,;\s$]*', s)
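For completeness, here is the simplified pattern applied to the sample data from the question as a self-contained script. The variable names s and pairs are just placeholders.

```python
import re

# Sample data copied from the question, including the line breaks.
s = """Shelley Winters [Ruby]; Millicent Martin [Siddie]; Julia Foster [Gilda];
Jane Asher [Annie]; Shirley Ann Field [Carla]; Vivien Merchant [Lily];
Eleanor Bron [Woman Doctor], Denholm Elliott [Mr. Smith; abortionist];
Alfie Bass [Harry]"""

# Group 1: actor name (lazy, up to the bracket); group 2: everything inside the brackets.
pairs = re.findall(r'(\w.*?)\s*\[(.*?)\][,;\s$]*', s)

for actor, character in pairs:
    print(actor, '->', character)
```

Because the character is captured as "everything up to the closing bracket", the semicolon inside [Mr. Smith; abortionist] does not split the entry.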
actorList = []
dataList = []
inputData = inputData.replace("];", "\n")
inputData = inputData.replace("],", "\n")
inputData = inputData[:-1]
for line in inputData.split("\n"):
    actorList.append(line.partition("[")[0])
    dataList.append(line.partition("[")[2])
togetherList = zip(actorList, dataList)
This is a bit of a hack, and I'm sure you can clean it up from here. I'll walk through the approach to make sure it is clear what I'm doing.
I replace both ]; and ], with a newline, which I later use to split every pair onto its own line. Assuming your content isn't filled with erroneous ]; or ], sequences, this should work. However, you'll notice that the last line still has a ] at the end, because it didn't need a comma or semicolon. Thus, I slice it off with the third line.
Then, using the partition function on each line created from your input string, we assign the left part to the actor list and the right part to the data list, and ignore the bracket (which is at position 1).
After that, Python's very useful zip function finishes the job by pairing the ith element of each list into matched tuples.

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print(repr(item.title[0:-1]))
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently repr()s? Is there a better way? I was thinking I'd use a loop like this (this is my pseudoPython, just notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)'
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print(info.groups())
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, call group() on the info object (strip() removes the trailing space that the pattern captures):
print(info.group(1)) # or info.groups()[0]
print('"%s","%s","%s"' % tuple(g.strip() for g in info.groups()))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
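The same breakdown can be kept next to the pattern itself with re.VERBOSE, which lets you spread a regex over several lines and comment each piece. This is only a sketch: the name TITLE is made up, and the sample title is the one from the question with the closing parenthesis put back.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows "#" comments.
TITLE = re.compile(r"""
    ([\w\s]+)    # group 1: band name (word characters and spaces)
    \(           # a literal opening parenthesis
    ([\w\s]+)    # group 2: venue name
    (\d+/\d+)    # group 3: date such as 3/26
    \)           # a literal closing parenthesis
""", re.VERBOSE)

match = TITLE.match("Michael Schenker Group (House of Blues Dallas 3/26)")
# strip() drops the trailing space that groups 1 and 2 pick up.
band, venue, date = (g.strip() for g in match.groups())
print(band, venue, date, sep=" | ")
```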
The Python intro to regex is quite good, and you might want to spend an evening going over it: http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import re
import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
import feedparser # from www.feedparser.org
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:  # band is the name the user selected
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            # lat and lng are numbers, so convert them before joining
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
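If you want to try the parsing and the .csv output without the feed or the geocoder, a minimal sketch could look like this. The titles list, parse_title and TITLE_PATTERN are made up for illustration, the lat/long columns are left out because they depend on the geocoding service, and the csv module is used so that any commas inside a venue name get quoted properly.

```python
import csv
import io
import re

# Band, then "(venue date"; the closing parenthesis is optional so that
# titles with the last character already removed still match.
TITLE_PATTERN = re.compile(r"(.*) \((.*) (\d+/\d+)\)?$")

def parse_title(title):
    """Return (band, venue, date), or None if the title has a different shape."""
    m = TITLE_PATTERN.match(title)
    return m.groups() if m else None

# Hypothetical titles standing in for entry.title values from the feed.
titles = [
    "randy travis (Billy Bobs 3/21)",
    "Michael Schenker Group (House of Blues Dallas 3/26)",
    "not a concert title",
]

buffer = io.StringIO()  # stands in for an open .csv file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["band", "venue", "date"])
for title in titles:
    parsed = parse_title(title)
    if parsed:
        writer.writerow(parsed)

csv_text = buffer.getvalue()
print(csv_text)
```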
