Find Substring Matches Between Two Files - python

I have a list of movie titles and a list of names.
Movies:
Independence Day
Who Framed Roger Rabbit
Rosemary's Baby
Ghostbusters
There's Something About Mary
Names:
Roger
Kyle
Mary
Sam
I want to make a new list of all the movies that match a name from the names list.
Who Framed Roger Rabbit (matched "roger")
Rosemary's Baby (matched "mary")
There's Something About Mary (matched "mary")
I've tried to do this in Python, but for some reason it isn't working. The resulting file is empty.
with open("movies.csv", "r") as movieList:
    movies = movieList.readlines()
with open("names.txt", "r") as namesToCheck:
    names = namesToCheck.readlines()
with open("matches.csv", "w") as matches:
    matches.truncate(0)
    for i in range(len(movies)):
        for j in range(len(names)):
            if names[j].lower() in movies[i].lower():
                matches.write(movies[i])
                break
    matches.close();
What am I missing here?

You are most likely getting no results because readlines() returns each line with its trailing newline character, \n, still attached. Your program therefore checks whether "roger\n" is in a movie line rather than just "roger".
To fix this, strip the newline from the name inside your if statement. Slicing with [:-1] works as long as every line ends in a newline, but it cuts off a real character if the last line of the file does not, so rstrip("\n") is the safer choice:
if names[j].lower().rstrip("\n") in movies[i].lower():
You could also change the way you read the names file, using read().splitlines() to drop the newline characters:
names = namesToCheck.read().splitlines()
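A minimal sketch of the newline issue and the fix, using io.StringIO in place of movies.csv and names.txt so that it runs on its own. The find_matches helper and its return format are hypothetical, not part of the original code:

```python
from io import StringIO

# Stand-ins for the real files (assumption: one title or name per line).
movie_file = StringIO(
    "Independence Day\nWho Framed Roger Rabbit\nRosemary's Baby\n"
    "Ghostbusters\nThere's Something About Mary\n"
)
name_file = StringIO("Roger\nKyle\nMary\nSam\n")

# The original problem: the newline is part of the string being searched for.
print("roger\n" in "who framed roger rabbit\n")  # False


def find_matches(movie_lines, name_lines):
    """Return (title, matched_name) pairs, one per matching movie."""
    # strip() removes the trailing newline; blank lines are skipped.
    names = [line.strip().lower() for line in name_lines if line.strip()]
    matches = []
    for line in movie_lines:
        title = line.strip()
        for name in names:
            if name in title.lower():
                matches.append((title, name))
                break  # one match per movie is enough
    return matches


matches = find_matches(movie_file, name_file)
for title, name in matches:
    print(f'{title} (matched "{name}")')
```

Iterating over the lines directly also avoids the index-based range(len(...)) loops.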

This works:
from io import StringIO
Movies = """Independence Day
Who Framed Roger Rabbit
Rosemary's Baby
Ghostbusters
There's Something About Mary
"""
Names = """Roger
Kyle
Mary
Sam"""
# StringIO stands in for the real files so the example is self-contained
with StringIO(Movies) as movie_file:
    movies = [n.strip().lower() for n in movie_file.readlines()]
with StringIO(Names) as name_file:
    names = [n.strip().lower() for n in name_file.readlines()]
for name in names:
    for film in movies:
        # find() returns -1 when the name is absent; compare with !=, not "is not"
        if film.find(name) != -1:
            print("{:20s} {:40s}".format(name, film))
Output:
roger                who framed roger rabbit
mary                 rosemary's baby
mary                 there's something about mary

Related

New to Python: How to keep the first letter of each word capitalized?

I was practicing with this tiny program with the hopes to capitalize the first letter of each word in: john Smith.
I wanted to capitalize the j in john so I would have an end result of John Smith and this is the code I used:
name = "john Smith"
if (name[0].islower()):
    name = name.capitalize()
print(name)
However, capitalizing the first letter produced John smith, with the S converted to lowercase. How can I capitalize the letter j without changing the rest of the name?
I thank you all for your time and future responses!
I appreciate it very much!!!
As @j1-lee pointed out, what you are looking for is the title method, which capitalizes each word (as opposed to capitalize, which capitalizes only the first character of the string and lowercases the rest, as if it were a sentence).
So your code becomes
name = "john smith"
name = name.title()
print(name) #> John Smith
Of course you should be using str.title(). However, if you want to reinvent that functionality then you could do this:
name = 'john paul smith'
r = ' '.join(w[0].upper()+w[1:].lower() for w in name.split())
print(r)
Output:
John Paul Smith
Note:
This is not strictly equivalent to str.title(), because split() followed by ' '.join() collapses every run of whitespace in the original string into a single space.
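If the goal is only to uppercase the first letter without touching the rest of the name, as the question asks, slicing is enough. A minimal sketch; upper_first and upper_first_each_word are hypothetical helper names. It also shows how str.title() and string.capwords() differ on apostrophes:

```python
import string


def upper_first(text):
    """Uppercase only the first character; leave the rest unchanged."""
    # text[:1] is safe on an empty string, unlike text[0].
    return text[:1].upper() + text[1:]


def upper_first_each_word(text):
    """Apply upper_first to every space-separated word."""
    # split(" ") keeps empty strings for repeated spaces, so spacing survives.
    return " ".join(upper_first(word) for word in text.split(" "))


print(upper_first("john Smith"))                 # John Smith
print(upper_first_each_word("john mcDonald"))    # John McDonald

# str.title() treats the apostrophe as a word boundary ...
print("rosemary's baby".title())                 # Rosemary'S Baby
# ... while string.capwords() splits on whitespace only.
print(string.capwords("rosemary's baby"))        # Rosemary's Baby
```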

Python - Applying a function to separate string in column every two words

I want to add a separator (,) every two words to better delineate the full names in each row.
For example df['Names'] is currently:
John Smith David Smith Golden Brown Austin James
and I would like to be:
John Smith, David Smith, Golden Brown, Austin James
I was able to find some code which splits the string every x words which would be perfect for my purposes shown below:
def splitText(string):
    words = string.split()
    grouped_words = [' '.join(words[i: i + 2]) for i in range(0, len(words), 2)]
    return grouped_words
However I'm not sure how to apply this to the column of choice.
I tried the following:
df['Names'].apply(splitText())
This gives me a missing positional argument.
Asking for any advice on either modifying the function or my application of it to a column dataframe. I'm pretty new to this stuff so any advice would be great!
Cheers
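Pass the function itself, without the (). Writing splitText() calls the function immediately with no argument, which is what raises the missing positional argument error: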
df['Names'].apply(splitText)
This works the same way as using a lambda function:
df['Names'].apply(lambda x: splitText(x))
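Note that splitText returns a list of names rather than the comma-separated string shown in the question. A minimal sketch, assuming that string is the goal; join_pairs is a hypothetical helper, and a plain list with map stands in for the DataFrame column so the example has no dependencies:

```python
def join_pairs(text, size=2):
    """Group words `size` at a time and join the groups with ', '."""
    words = text.split()
    groups = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
    return ", ".join(groups)


# Stand-in for the values of df['Names'] (assumption: one string per row).
rows = ["John Smith David Smith Golden Brown Austin James"]

# The function object is passed without parentheses, just as with apply.
result = list(map(join_pairs, rows))
print(result[0])  # John Smith, David Smith, Golden Brown, Austin James
```

With pandas the function would be passed the same way, df['Names'].apply(join_pairs), and the result assigned back to the column.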

Store a string as a single line in a text file

I have many big strings with many characters (about 1000-1500 characters) and I want to write the string to a text file using python. However, I need the strings to occupy only a single line in a text file.
For example, consider two strings:
string_1 = "Mary had a little lamb
which was as white
as snow"
string_2 = "Jack and jill
went up a hill
to fetch a pail of
water"
When I write them to a text file, I want the strings to occupy only one line and not multiple lines.
text file eg:
Mary had a little lamb which was as white as snow
Jack and Jill went up a hill to fetch a pail of water
How can this be done?
If you want all the strings to be written out on one line in a file without a newline separator between them there are a number of ways as others here have shown.
The interesting issue is how you get them back into a program again if that is needed, and getting them back into appropriate variables.
I like to use json for this kind of thing, and you can get it to output everything onto one line. This:
import json
string_1 = "Mary had a little lamb which was as white as snow"
string_2 = "Jack and jill went up a hill to fetch a pail of water"
strs_d = {"string_1": string_1, "string_2": string_2}
with open("foo.txt","w") as fh:
    json.dump(strs_d, fh)
would write out the following into a file:
{"string_1": "Mary had a little lamb which was as white as snow", "string_2": "Jack and jill went up a hill to fetch a pail of water"}
This can be easily reloaded back into a dictionary and the original strings pulled back out.
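A minimal sketch of that round trip. The temporary directory is an assumption made only so the example does not overwrite a real foo.txt:

```python
import json
import os
import tempfile

strs_d = {
    "string_1": "Mary had a little lamb which was as white as snow",
    "string_2": "Jack and jill went up a hill to fetch a pail of water",
}

# Write to a scratch location instead of the working directory.
path = os.path.join(tempfile.mkdtemp(), "foo.txt")
with open(path, "w") as fh:
    json.dump(strs_d, fh)  # default settings emit a single line

with open(path) as fh:
    raw = fh.read()

reloaded = json.loads(raw)
print("\n" in raw)             # False: the file holds one line
print(reloaded["string_1"])    # Mary had a little lamb which was as white as snow
```

json.dump also escapes any newline inside a string as \n, so even multi-line strings stay on one line in the file and come back unchanged on reload.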
If you do not care about the names of the original string variable, then you can use a list like this:
import json
string_1 = "Mary had a little lamb which was as white as snow"
string_2 = "Jack and jill went up a hill to fetch a pail of water"
strs_l = [string_1, string_2]
with open("foo.txt","w") as fh:
    json.dump(strs_l, fh)
and it outputs this:
["Mary had a little lamb which was as white as snow", "Jack and jill went up a hill to fetch a pail of water"]
which when reloaded from the file will get the strings all back into a list which can then be split into individual strings.
This all assumes that you want to reload the strings (and so do not mind the extra json info in the output to allow for the reloading) as opposed to just wanting them output to a file for some other need and cannot have the extra json formatting in the output.
Your example output does not have this, but your example output also is on more than one line and the question wanted it all on one line, so your needs are not entirely clear.
In [36]: string_1 = "Mary had a little lamb which was as white as snow"
    ...:
    ...: string_2 = "Jack and jill went up a hill to fetch a pail of water"
In [37]: s = [string_1, string_2]
In [38]: with open("a.txt","w") as f:
    ...:     f.write(" ".join(s))
    ...:
Construct a single line from the multi-line string and then write it to the file as normal. Your example really should use triple quotes to allow for multi-line strings:
string_1 = """Mary had a little lamb
which was as white
as snow"""
string_2 = """Jack and jill
went up a hill
to fetch a pail of
water"""
with open("myfile.txt", "w") as f:
    f.write(" ".join(string_1.split("\n")) + "\n")
    f.write(" ".join(string_2.split("\n")) + "\n")
with open("myfile.txt") as f:
    print(f.read())
Output:
Mary had a little lamb which was as white as snow
Jack and jill went up a hill to fetch a pail of water
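A variation on the same idea, as a sketch: calling split() with no argument splits on any run of whitespace, so tabs, Windows line endings and doubled spaces are collapsed too. The to_single_line helper is a hypothetical name, and the temporary directory is only there so the example does not touch real files:

```python
import os
import tempfile


def to_single_line(text):
    """Collapse every run of whitespace (newlines included) to one space."""
    return " ".join(text.split())


string_1 = """Mary had a little lamb
which was as white
as snow"""
string_2 = """Jack and jill
went up a hill
to fetch a pail of
water"""

path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with open(path, "w") as f:
    for text in (string_1, string_2):
        # One string per line in the output file.
        f.write(to_single_line(text) + "\n")

with open(path) as f:
    lines = f.read().splitlines()
print(lines)
```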
You can split the string literal across several source lines using parentheses:
s = (
"First line "
"second line "
"third line"
)
You can also use triple quotes and remove the newline characters using strip and replace:
s = """
First line
Second line
Third line
""".strip().replace("\n", " ")
total_str = [string_1,string_2]
with open(file_path+"file_name.txt","w") as fp:
    for i in total_str:
        fp.write(i+'\n')
# no fp.close() needed: the with block closes the file

How to replace every second space with a comma the Pythonic way

I have a string with first and last names all separated with a space.
For example:
installers = "Joe Bloggs John Murphy Peter Smith"
I now need to replace every second space with ', ' (comma followed by a space) and output this as string.
The desired output is
print installers
Joe Bloggs, John Murphy, Peter Smith
You should be able to do this with a regex that finds pairs of spaces and replaces the second one:
import re
installers = "Joe Bloggs John Murphy Peter Smith"
re.sub(r'(\s\S*?)\s', r'\1, ',installers)
# 'Joe Bloggs, John Murphy, Peter Smith'
This says, find a space followed by some non-spaces followed by a space and replace it with the found space followed by some non-spaces and ", ". You could add installers.strip() if there's a possibility of trailing spaces on the string.
One way to do this is to split the string into a list of names, get an iterator for the list, then loop over the iterator in a for loop, collecting the first name and then advancing the iterator to get the second name too.
names = installers.split()
it = iter(names)
out = []
for name in it:
    next_name = next(it)  # raises StopIteration if the word count is odd
    full_name = '{} {}'.format(name, next_name)
    out.append(full_name)
fixed = ', '.join(out)
print(fixed)
Joe Bloggs, John Murphy, Peter Smith
The one line version of this would be
>>> ', '.join(' '.join(s) for s in zip(*[iter(installers.split())]*2))
'Joe Bloggs, John Murphy, Peter Smith'
This works by creating a list that contains the same iterator twice, so each tuple that zip produces takes two consecutive words: both parts of one name. See also the grouper recipe in the itertools recipes.
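The zip version silently drops a trailing word when the number of words is odd. A minimal sketch in the spirit of the grouper recipe that keeps it, using itertools.zip_longest; join_name_pairs is a hypothetical helper name:

```python
from itertools import zip_longest


def join_name_pairs(text, group_size=2):
    """Join words in groups of `group_size`, separating groups with ', '."""
    words = text.split()
    # The same iterator repeated group_size times yields consecutive words;
    # fillvalue pads the last group when the word count is not a multiple.
    groups = zip_longest(*[iter(words)] * group_size, fillvalue="")
    return ", ".join(" ".join(group).strip() for group in groups)


installers = "Joe Bloggs John Murphy Peter Smith"
print(join_name_pairs(installers))
# Joe Bloggs, John Murphy, Peter Smith

print(join_name_pairs("Joe Bloggs John Murphy Peter"))
# Joe Bloggs, John Murphy, Peter
```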

Regex in Python: How to match a word pattern, if not preceded by another word of variable length?

I would like to reconstruct full names from photo captions using regex in Python, by appending the last name to the first name in patterns like "FirstName1 and FirstName2 LastName". We can rely on names starting with a capital letter.
For example,
'John and Albert McDonald' becomes 'John McDonald' and 'Albert McDonald'
'Stephen Stewart, John and Albert Diamond' becomes 'John Diamond' and 'Albert Diamond'
I would need to avoid matching patterns like this: 'Jay Smith and Albert Diamond' and generate a non-existent name 'Smith Diamond'
The photo captions may or may not have more words before this pattern, for example, 'It was a great day hanging out with John and Stephen Diamond.'
This is the code I have so far:
s = 'John and Albert McDonald'
import re
so = re.search(r'([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?)', s)
if so:
    print(so.group(1) + ' ' + so.group(2).split()[1])
    print(so.group(2))
This returns 'John McDonald' and 'Albert McDonald', but 'Jay Smith and Albert Diamond' will result in a non-existent name 'Smith Diamond'.
An idea would be to check whether the pattern is preceded by a capitalized word, something like (?<![A-Z][a-z\-]+)\s([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?) but unfortunately negative lookbehind only works if we know the exact length of the preceding word, which I don't.
Could you please let me know how I can correct my regex expression? Or is there a better way to do what I want? Thanks!
As you can rely on names starting with a capital letter, then you could do something like:
((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)
Swapping your current pattern for this one should work with your current Python code, although you need to strip() the captured groups because they can include trailing whitespace.
For your examples and current code this would yield:
Input | First print | Second print
John and Albert McDonald | John McDonald | Albert McDonald
Stephen Stewart, John and Albert Diamond | John Diamond | Albert Diamond
It was a great day hanging out with John and Stephen Diamond. | John Diamond | Stephen Diamond
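With this pattern the first group captures every capitalized word before "and", so for 'Jay Smith and Albert Diamond' it holds 'Jay Smith'. A minimal sketch of how the two groups could be combined so that a last name is only borrowed when the first group is a lone first name. The expand_names helper is a hypothetical name, and the one-word test is an assumption about the captions, not something the pattern guarantees:

```python
import re

# The pattern from the answer above, compiled once.
PATTERN = re.compile(r'((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)')


def expand_names(caption):
    """Return the two full names around 'and', or [] if nothing matches."""
    match = PATTERN.search(caption)
    if not match:
        return []
    # split() also discards the trailing whitespace the groups may capture.
    first = match.group(1).split()
    second = match.group(2).split()
    if len(first) == 1 and len(second) >= 2:
        # Lone first name: borrow everything after the second first name.
        first = first + second[1:]
    return [" ".join(first), " ".join(second)]


print(expand_names('John and Albert McDonald'))
# ['John McDonald', 'Albert McDonald']
print(expand_names('Jay Smith and Albert Diamond'))
# ['Jay Smith', 'Albert Diamond']
```

A capitalized non-name word directly before the first name (for example at the start of a sentence) would still be captured as part of it, so real captions may need extra filtering.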
