Cut a field and LIKE search the parts - python

I have a list of names (actually, authors) stored in a sqlite database. Here is an example:
João Neres, Ruben C. Hartkoorn, Laurent R. Chiarelli, Ramakrishna Gadupudi, Maria Rosalia Pasca, Giorgia Mori, Alberto Venturelli, Svetlana Savina, Vadim Makarov, Gaelle S. Kolly, Elisabetta Molteni, Claudia Binda, Neeraj Dhar, Stefania Ferrari, Priscille Brodin, Vincent Delorme, Valérie Landry, Ana Luisa de Jesus Lopes Ribeiro, Davide Farina, Puneet Saxena, Florence Pojer, Antonio Carta, Rosaria Luciani, Alessio Porta, Giuseppe Zanoni, Edda De Rossi, Maria Paola Costi, Giovanna Riccardi, Stewart T. Cole
It's a string. My goal is to write an efficient "analyser" of names, so I basically perform a LIKE query:
' ' || replace(authors, ',', ' ') || ' ' LIKE '{0}'.format(my_string)
I basically replace all the commas with a space, and insert a space at the end and at the beginning of the string. So if I look for:
% Rossi %
I'll get all the items, where one of the authors has "Rossi" as a family name. "Rossi", not "Rossignol" or "Trossi".
It's an efficient way to look for an author with his family name, because I'm sure the string stored in the database contains the family names of the authors, unaltered.
But the main problem lies here: "Rossi" is, for example, a very common family name. So if I want to look for a very particular person, I will add his first name. Let's assume it is "Jean-Philippe". "Jean-Philippe" can be stored in the database under many forms: "J.P Rossi", "Jean-Philippe Rossi", "J. Rossi", "Jean P. Rossi", etc.
So I tried this:
% J%P Rossi %
But of course, it matches everything containing a J, then a P, and finally Rossi. It matches the string I gave as an example (Edda De Rossi).
So I wonder if there is a way to cut the string in the query, on a delimiter, and then match each piece against the search pattern.
Of course I'm open to any other solution. My goal is to match the search pattern against each author name.
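One possible direction, sketched under assumptions rather than taken from the post: keep the LIKE query as a coarse filter on the family name, then split each authors string on commas in Python and test every author on its own. The table name papers, the helpers author_matches and find_rows, and the rule "same family name and same first initial" are all hypothetical choices made for illustration.

```python
import sqlite3


def author_matches(author, first_name, family_name):
    """Hypothetical rule: last word equals the family name and the
    first word starts with the same letter as the searched first name."""
    parts = author.split()
    if len(parts) < 2:
        return False
    return (parts[-1].lower() == family_name.lower()
            and parts[0][0].lower() == first_name[0].lower())


def find_rows(conn, first_name, family_name):
    """Filter on the family name in SQL, then check each author in Python."""
    pattern = "% {} %".format(family_name)
    sql = ("SELECT authors FROM papers "  # 'papers' is an assumed table name
           "WHERE ' ' || replace(authors, ',', ' ') || ' ' LIKE ?")
    hits = []
    for (authors,) in conn.execute(sql, (pattern,)):
        # Split on the delimiter and match every piece separately.
        if any(author_matches(a.strip(), first_name, family_name)
               for a in authors.split(",")):
            hits.append(authors)
    return hits
```

Under these assumptions "J.P Rossi", "J. Rossi" and "Jean P. Rossi" are accepted for a search on Jean-Philippe Rossi, while "Edda De Rossi" is rejected because its first word starts with an E. A stricter rule could compare every initial instead of only the first one.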

Related

Regular expressions: can't understand how to get the entire name if it starts with Mr / Mrs / Ms

I can't understand how the re module works. I made many attempts to get the entire name, whether it consists of a single name or of multiple names (surnames).
This is the re.compile() pattern I'm using to get the name when the string optionally contains a surname:
the_formmat = re.compile(r"Mr?s?\.?\s[A-Z][a-z]+\s[A-Z][a-z]+")
the_string = "this is Mr Samantha Rajapaksa and his wife Mrs. Chalani Rajapaksa. his fathers name is Mr Prabath and his mothers name is Mrs Karunarathnage Dayawathi Bandara Peiris "
print(the_formmat.findall(the_string))
I know the use case of the ? modifier but I don't know where to put it to get the surname if there is one or more.
From the above example I get this output:
['Mr Samantha Rajapaksa', 'Mrs. Chalani Rajapaksa', 'Mrs Karunarathnage Dayawathi']
The output that I want is:
['Mr Samantha Rajapaksa', 'Mrs. Chalani Rajapaksa', 'Mr Prabath', 'Mrs Karunarathnage Dayawathi Bandara Peiris']
Try this Regex:
/(?:Mr|Ms|Mrs)\.?(?: [A-Z][a-z]+)+/
Edited thanks to #treuss.
So change your the_formmat variable to:
the_formmat = re.compile(r"(?:Mr|Ms|Mrs)\.?(?: [A-Z][a-z]+)+")
What it does is check for Mr/Ms/Mrs (with an optional period), and then keep matching a space followed by a word that starts with an uppercase letter, for as long as such words follow.
You could check this RegExr link to learn more.
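As a quick check of the difference, here is a small sketch that runs both patterns on the string from the question. The variable names two_words and one_or_more are made up for this comparison.

```python
import re

the_string = ("this is Mr Samantha Rajapaksa and his wife Mrs. Chalani Rajapaksa. "
              "his fathers name is Mr Prabath and his mothers name is "
              "Mrs Karunarathnage Dayawathi Bandara Peiris ")

# Pattern from the question: exactly two capitalised words must follow the title.
two_words = re.compile(r"Mr?s?\.?\s[A-Z][a-z]+\s[A-Z][a-z]+")

# Pattern from the answer: the group "(?: [A-Z][a-z]+)" repeats one or more times.
one_or_more = re.compile(r"(?:Mr|Ms|Mrs)\.?(?: [A-Z][a-z]+)+")

print(two_words.findall(the_string))
print(one_or_more.findall(the_string))
```

The first call misses 'Mr Prabath' and truncates the four-word name, because the pattern demands exactly two words. The second call returns all four names, because the + after the group accepts any number of capitalised words.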

Insert quotes in the string using index python

I want to insert quotes("") around the date and text in the string (which is in the file input.txt). Here is my input file:
created_at : October 9, article : ISTANBUL — Turkey is playing a risky game of chicken in its negotiations with NATO partners who want it to join combat operations against the Islamic State group — and it’s blowing back with violence in Turkish cities. As the Islamic militants rampage through Kurdish-held Syrian territory on Turkey’s border, Turkey says it won’t join the fight unless the U.S.-led coalition also goes after the government of Syrian President Bashar Assad.
created_at : October 9, article : President Obama chairs a special meeting of the U.N. Security Council last month. (Timothy A. Clary/AFP/Getty Images) When it comes to President Obama’s domestic agenda and his maneuvers to (try to) get things done, I get it. I understand what he’s up to, what he’s trying to accomplish, his ultimate endgame. But when it comes to his foreign policy, I have to admit to sometimes thinking “whut?” and agreeing with my colleague Ed Rogers’s assessment on the spate of books criticizing Obama’s foreign policy stewardship.
I want to put quotes around the date and text as follows:
created_at : "October 9", article : "ISTANBUL — Turkey is playing a risky game of chicken in its negotiations with NATO partners who want it to join combat operations against the Islamic State group — and it’s blowing back with violence in Turkish cities. As the Islamic militants rampage through Kurdish-held Syrian territory on Turkey’s border, Turkey says it won’t join the fight unless the U.S.-led coalition also goes after the government of Syrian President Bashar Assad".
created_at : "October 9", article : "President Obama chairs a special meeting of the U.N. Security Council last month. (Timothy A. Clary/AFP/Getty Images) When it comes to President Obama’s domestic agenda and his maneuvers to (try to) get things done, I get it. I understand what he’s up to, what he’s trying to accomplish, his ultimate endgame. But when it comes to his foreign policy, I have to admit to sometimes thinking “whut?” and agreeing with my colleague Ed Rogers’s assessment on the spate of books criticizing Obama’s foreign policy stewardship".
Here is my code which finds the index for comma(, after the date) and index for the article and then by using these, I want to insert quotes around the date. Also I want to insert quotes around the text, but how to do this?
f = open("input.txt", "r")
for line in f:
    article_pos = line.find("article")
    print(article_pos)
    comma_pos = line.find(",")
    print(comma_pos)
While you can do this with low-level operations like find and slicing, that's really not the easy or idiomatic way to do it.
First, I'll show you how to do it your way:
comma_pos = line.find(", ")
first_colon_pos = line.find(" : ")
second_colon_pos = line.find(" : ", comma_pos)
line = (line[:first_colon_pos+3] +
'"' + line[first_colon_pos+3:comma_pos] + '"' +
line[comma_pos:second_colon_pos+3] +
'"' + line[second_colon_pos+3:] + '"')
But you can more easily just split the line into bits, munge those bits, and join them back together:
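# strip the newline first so the closing quote lands on the same line
dateline, article = line.rstrip('\n').split(', ', 1)
key, value = dateline.split(' : ', 1)
dateline = '{} : "{}"'.format(key, value)
# maxsplit=1 keeps any later " : " inside the article text intact
key, value = article.split(' : ', 1)
article = '{} : "{}"'.format(key, value)
line = '{}, {}'.format(dateline, article)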
And then you can take the repeated parts and refactor them into a simple function so you don't have to write the same thing twice (which may come in handy if you later need to write it four times).
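A minimal sketch of that refactoring, assuming every field has the form "key : value". The function names quote_value and quote_line are invented here, not part of the original answer.

```python
def quote_value(field):
    """Turn 'key : value' into 'key : "value"'."""
    # maxsplit=1 so a later " : " inside the value is left alone
    key, value = field.split(' : ', 1)
    return '{} : "{}"'.format(key, value)


def quote_line(line):
    """Quote the date and the article text of one input line."""
    # Only the first ", " separates the date from the article field.
    dateline, article = line.rstrip('\n').split(', ', 1)
    return '{}, {}'.format(quote_value(dateline), quote_value(article))
```

For example, quote_line('created_at : October 9, article : Some text') gives 'created_at : "October 9", article : "Some text"'.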
It's even easier using a regular expression, but that might not be as easy to understand for a novice:
line = re.sub(r'(.*?:\s*)(.*?)(\s*,.*?:\s*)(.*)', r'\1"\2"\3"\4"', line)
This works by capturing everything up to the first : (and any spaces after it) in one group, then everything from there to the first comma in a second group, and so on:
(.*?:\s*)(.*?)(\s*,.*?:\s*)(.*)
Notice that the regex has the advantage that I can say "any spaces after it" very simply, while with find or split I had to explicitly specify that there was exactly one space on either side of the colon and one after the comma, because searching for "0 or more spaces" is a lot harder without some way to express it like \s*.
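To illustrate the point about flexible spacing, here is a small sketch that wraps the same regular expression in a function (the names PATTERN and quote_fields are made up) and feeds it a line with irregular spacing.

```python
import re

# Same expression as above: key, first value, separator plus second key, rest.
PATTERN = re.compile(r'(.*?:\s*)(.*?)(\s*,.*?:\s*)(.*)')


def quote_fields(line):
    """Wrap the two captured values in double quotes."""
    return PATTERN.sub(r'\1"\2"\3"\4"', line)


print(quote_fields('created_at : October 9, article : ISTANBUL'))
print(quote_fields('created_at:October 9 ,article:   text'))
```

Both lines are handled by the one pattern, and the original spacing around the colons and the comma is preserved because it is captured in groups 1 and 3. Since . does not match a newline, a trailing newline stays after the closing quote.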
You could also take a look at the regex library re.
E.g.
>>> import re
>>> print(re.sub(r'created_at:\s(.*), article:\s(.*)',
... r'created_at: "\1", article: "\2"',
... 'created_at: October 9, article: ...'))
created_at: "October 9", article: "..."
The first param to re.sub is the pattern you are trying to match. The parens () capture the matches and can be used in the second argument with \1. The third argument is the line of text.

Python - Split String with Parenthesis based off a Pattern

I have a problem in python where I have a pattern, that can repeat anywhere from 1 to XXX times.
The pattern is I have a string of format
Author (Affiliation) Author (Affiliation) etc etc etc as many authors/affiliations that there are.
What is the best way in Python to go about splitting a string up like this when you don't know if you'll have 1 instance of Author (Affiliation) or 100?
EDIT - Viktor Leis* (Technische Universität München) Alfons Kemper (Technische Universität München) Thomas Neumann (Technische Universität München, Germany)
That is a sample string I am working with. I have tried re.split / re.findall and am having no luck. I'm assuming I am doing something with regex's wrong.
EDIT 2 - '\w+{1,3}(\w{1,10})' Is the pattern I was attempting to use.
My logic was a name is 1-3 words, then (. Then an affiliation is between 1-10 words, and a closing ).
Here is a sample. Looks like you are wanting to match text with no ) or ( and the text in between ( and ). Below is one way to do it assuming it is exactly like above.
import re
text = 'Viktor Leis* (Technische Universität München) Alfons Kemper (Technische Universität München) Thomas Neumann (Technische Universität München, Germany)'
pattern = r'[^\(\)]* \([^\(]+\)'
result = re.findall(pattern, text)
print(result)
output:
['Viktor Leis* (Technische Universität München)', ' Alfons Kemper (Technische Universität München)', ' Thomas Neumann (Technische Universität München, Germany)']
You may want to remove leading and trailing spaces using strip.
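Building on that, a hedged variation that captures author and affiliation as separate groups and strips the stray spaces. It assumes, like the answer above, that names and affiliations never contain parentheses themselves; split_authors is a made-up helper name.

```python
import re

text = ('Viktor Leis* (Technische Universität München) '
        'Alfons Kemper (Technische Universität München) '
        'Thomas Neumann (Technische Universität München, Germany)')


def split_authors(s):
    """Return a list of (author, affiliation) tuples."""
    # Group 1: everything up to "(", group 2: everything inside "(...)".
    pairs = re.findall(r'([^()]+)\(([^()]*)\)', s)
    return [(author.strip(), affiliation.strip())
            for author, affiliation in pairs]


for author, affiliation in split_authors(text):
    print(author, '|', affiliation)
```

Because findall returns one tuple per match, the same code works whether the string holds one Author (Affiliation) pair or a hundred.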
This is the first thing that comes to mind
>>> import re
>>> s = 'Bob (ABC) Steve (XYZ) Mike (ALPHA)'
>>> pattern = r'\w+ \(\w+\)'
>>> re.findall(pattern, s)
['Bob (ABC)', 'Steve (XYZ)', 'Mike (ALPHA)']
You could do it like this:
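thing = "Author1 (Affiliation) Author2 (Affiliation) Author3 (Affiliation)"
pieces = thing.split(') ')
result = []  # avoid shadowing the built-in list
for piece in pieces:
    # split() removed the ") " separator, so restore the closing parenthesis
    if not piece.endswith(')'):
        result.append(piece + ')')
    else:
        result.append(piece)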

Regex to help split up list into two-tuples

Given a list of actors, with their character name in brackets, separated by either a semi-colon (;) or comma (,):
Shelley Winters [Ruby]; Millicent Martin [Siddie]; Julia Foster [Gilda];
Jane Asher [Annie]; Shirley Ann Field [Carla]; Vivien Merchant [Lily];
Eleanor Bron [Woman Doctor], Denholm Elliott [Mr. Smith; abortionist];
Alfie Bass [Harry]
How would I parse this into a list of two-tuples in the form of [(actor, character),...]
--> [('Shelley Winters', 'Ruby'), ('Millicent Martin', 'Siddie'),
('Denholm Elliott', 'Mr. Smith; abortionist')]
I originally had:
actors = [item.strip().rstrip(']') for item in re.split('\[|,|;',data['actors'])]
data['actors'] = [(actors[i], actors[i + 1]) for i in range(0, len(actors), 2)]
But this doesn't quite work, as it also splits up items within brackets.
You can go with something like:
>>> re.findall(r'(\w[\w\s\.]+?)\s*\[([\w\s;\.,]+)\][,;\s$]*', s)
[('Shelley Winters', 'Ruby'),
('Millicent Martin', 'Siddie'),
('Julia Foster', 'Gilda'),
('Jane Asher', 'Annie'),
('Shirley Ann Field', 'Carla'),
('Vivien Merchant', 'Lily'),
('Eleanor Bron', 'Woman Doctor'),
('Denholm Elliott', 'Mr. Smith; abortionist'),
('Alfie Bass', 'Harry')]
One can also simplify some things with .*?:
re.findall(r'(\w.*?)\s*\[(.*?)\][,;\s$]*', s)
actorList, dataList = [], []
inputData = inputData.replace("];", "\n")
inputData = inputData.replace("],", "\n")
inputData = inputData[:-1]  # drop the trailing ] left on the last entry
for line in inputData.split("\n"):
    actorList.append(line.partition("[")[0])
    dataList.append(line.partition("[")[2])
togetherList = list(zip(actorList, dataList))
This is a bit of a hack, and I'm sure you can clean it up from here. I'll walk through this approach just to make sure you understand what I'm doing.
I am replacing both the ]; and the ], with a newline, which I will later use to split up every pair into its own line. Assuming your content isn't filled with erroneous ]; or ], sequences, this should work. However, you'll notice the last line will have a ] at the end because it didn't need a comma or semi-colon. Thus, I slice it off with the third line.
Then, just using the partition function on each line that we created within your input string, we assign the left part to the actor list, the right part to the data list and ignore the bracket (which is at position 1).
After that, Python's very useful zip function should finish the job for us by associating the ith element of each list together into a list of matched tuples.
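A self-contained variation on this idea, offered as a sketch. It swaps the two replace calls for a single re.split on the closing bracket plus an optional separator, and strips the surrounding whitespace, which the version above leaves in place. The function name parse_cast is invented for the example.

```python
import re


def parse_cast(data):
    """Return [(actor, character), ...] from 'Actor [Role]; Actor [Role], ...'."""
    pairs = []
    # Split after each "]", swallowing an optional ";" or "," and whitespace.
    # A ";" inside the brackets is not preceded by "]", so it is untouched.
    for chunk in re.split(r'\]\s*[;,]?\s*', data):
        actor, bracket, character = chunk.partition('[')
        if bracket:  # skip the empty chunk left after the final "]"
            pairs.append((actor.strip(), character.strip()))
    return pairs


sample = ('Eleanor Bron [Woman Doctor], Denholm Elliott [Mr. Smith; abortionist];\n'
          'Alfie Bass [Harry]')
print(parse_cast(sample))
```

Collecting the tuples directly also removes the need for two parallel lists and the final zip.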

pywikipedia (python) regex to add string if lacking

I have a set of records like:
Name
Name Paul Berry: present
Address George Necky: not present
Name Bob van Basten: present
Name Richard Von Rumpy: not present
Name Daddy Badge: not present
Name Paul Berry: present
Street George Necky: not present
Street Bob van Basten: present
Name Richard Von Rumpy: not present
City Daddy Badge: not present
and I want that all the records beginning with Name be in the form
Name Name Surname: not present
leaving untouched the records beginning with any other word.
i.e. I want to add the string "not" to the records beginning with Name where it isn't. I'm working with python (pywikipediabot)
Trying
python replace.py -dotall -regex 'Name ((?!not ).*?)present' 'Name \1not present'
but it adds the "not" even where it is already present.
Perhaps I haven't understood the negative lookahead syntax?
Just look for : present and replace it with : not present.
Edit: Improved answer:
for line in lines:
    m = re.match('^Name[^:]*: present', line)
    if m:
        print(re.sub(': present', ': not present', line))
    else:
        print(line)
You need a "negative look-behind" expression. This substitution will work:
'Name (.*)(?<!not )present' -> 'Name \1not present'
The .* matches everything between "Name" and "present", but the whole regexp matches only if "present" is not preceded by "not".
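The difference between the two assertions can be seen in plain Python with re.sub, outside of replace.py. This is a sketch: the function names are made up, and the ^ and $ anchors with re.MULTILINE are an added assumption so that each record is treated as one line. In the question's pattern, the lookahead (?!not ) is tested only once, right after "Name ", so it merely checks that the name itself does not start with "not ".

```python
import re


def add_not_lookahead(text):
    """Pattern from the question: the lookahead sits right after 'Name '."""
    return re.sub(r'Name ((?!not ).*?)present', r'Name \1not present', text)


def add_not_lookbehind(text):
    """Look-behind variant: 'present' must not be preceded by 'not '."""
    return re.sub(r'^Name (.*)(?<!not )present$', r'Name \1not present',
                  text, flags=re.MULTILINE)


records = "\n".join([
    "Name Paul Berry: present",
    "Address George Necky: not present",
    "Name Richard Von Rumpy: not present",
    "Street Bob van Basten: present",
])

print(add_not_lookahead("Name Richard Von Rumpy: not present"))
print(add_not_lookbehind(records))
```

The first call produces "not not present", which is the behaviour described in the question. The second changes only the "Name Paul Berry" record and leaves the other three lines as they were.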
And are you sure you need -dotall? It looks like you want .* to match within a line only.
The following will do it:
re.sub(r'(Name.*?)(not )?present$', r'\1not present', s)
