Python extract information from paragraph - python

I am new to Python and right now I am trying to extract information from a set of paragraphs containing employee-related statistics.
For example, the paragraph might look like:
Name Rakesh Rao Age 34 Gender Male Marital Status Single
The text is not separated by commas, so I am having a hard time separating this information.
Also sometimes there might be a colon after the name of the variable and sometimes there might not be. For example in row 1, it's "Name Rakesh Rao" but in row 2 it's "Name: Ramachandra Deshpande".
There are around 1400 records of this information so it would be really great if I don't have to manually separate the information.
Can anyone help with this? I would be super grateful!

Well, I suppose you could try and do that using a regular expression.
If your text is exactly this:
paragraph = 'Name Rakesh Rao Age 34 Gender Male Marital Status Single'
You could use this regular expression (you would have to import re first):
m = re.fullmatch(
(
r'Name(?:\:)? (?P<name>\D+) ' # pay attention to the space at the end
r'Age(?:\:)? (?P<age>\d+) '
r'Gender(?:\:)? (?P<gender>\D+) '
r'Marital Status(?:\:)? (?P<status>\D+)' # no space here, since the string ends
),
paragraph
)
Then you could use the names of the groups defined within the regular expression, like this:
>>> m.group('name')
'Rakesh Rao'
>>> m.group('age')
'34'
>>> m.group('gender')
'Male'
>>> m.group('status')
'Single'
This assumes all the fields are on a single line separated by single spaces, as in the example. If each field is on its own line instead, replace the single space at the end of each part with \n.
Note that this will support a single colon immediately after the field name, like this:
Name: Rakesh Rao
but it will not support a different order of the fields. If you would like that as well, I could try to write a different expression.
Explanation of the expression
Let's take the first "line" of the expression:
r'Name(?:\:)? (?P<name>\D+) '
First, why the r'…' string syntax? This is just to avoid double backslashes. In the "typical" string, we would need to write the expression like this:
'Name(?:\\:)? (?P<name>\\D+) '
Now, to the actual expression. The first part, Name, is pretty obvious.
(?:\:)?
This part creates a non-capturing group ((?:…)) with a colon inside. The backslash in \: is not strictly necessary, since a colon on its own has no special meaning in a regex, but escaping it is harmless. The group is non-capturing because this colon really doesn't matter to us.
Then, after a single space, we have this:
(?P<name>\D+)
This creates a named group, the syntax is (?P<name_of_the_group>…). I use a named group just to make it easier and nicer to extract the information later, using m.group('name'), where m is a match object.
The \D+ means "at least one non-digit character". This captures all letters, underscores, but also white spaces. That is why the order of the fields is so important to this particular expression. If you were to change the order and put Gender field between Name and Age, it would capture it as well, because the + modifier is greedy.
On the other hand, the \d+ in the next "line" means "at least one digit character", so between 0 and 9.
I hope that explanation is enough, but it might be useful to you to play with that expression here, on this very useful site:
https://regex101.com/r/N5ZJU9/2
I've already entered the regex and the test string for you.
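To run the expression over all of the records, one option is to compile it once and collect the named groups as dictionaries. The following is only a minimal sketch under some assumptions: the names RECORD, records and parsed are made up for illustration, the second sample record is hypothetical (its age and marital status are invented), and the \D+ groups are made lazy (\D+?) so that each one stops at the next field name.

```python
import re

# Same idea as the expression above, with an optional colon after each field name.
RECORD = re.compile(
    r"Name:? (?P<name>\D+?) "
    r"Age:? (?P<age>\d+) "
    r"Gender:? (?P<gender>\D+?) "
    r"Marital Status:? (?P<status>\D+)"
)

# Hypothetical sample data; the second record is invented for illustration.
records = [
    "Name Rakesh Rao Age 34 Gender Male Marital Status Single",
    "Name: Ramachandra Deshpande Age: 41 Gender: Male Marital Status: Married",
]

parsed = []
for line in records:
    m = RECORD.fullmatch(line.strip())
    if m is None:
        continue  # skip (or log) records that do not follow the expected layout
    row = m.groupdict()
    row["age"] = int(row["age"])  # convert the captured digits to a number
    parsed.append(row)

for row in parsed:
    print(row)
```

Records that do not match are skipped silently here; with real data it is probably worth collecting them in a separate list to inspect by hand.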

You can match an optional character (in your case, the colon) with the expression [:]?.
According to the provided information, this regex should extract the required information:
^Name[:]?\s([A-Z][-'a-zA-Z]+)\s([A-Z][-'a-zA-Z]+)$
You can check it here.
This regular expression matches two-word names, including names that contain - or '.
In Python this may look like this:
import re
regex = r"^Name[:]?\s([A-Z][-'a-zA-Z]+)\s([A-Z][-'a-zA-Z]+)$"
test_str = ("Name Rakesh Rao\n"
"Name: Ramachandra Deshpande")
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    print(match.group(1), match.group(2))  # first name, last name
You can also check this example by the link provided above.
Hope this helps.

If the field names are always in the string, you can split the string on those field names. For example:
str_to_split = "Name Rakesh Rao Age 34 Gender Male Marital Status Single"
splitted = str_to_split.split("Age")
# drop the field name and the surrounding spaces
name = splitted[0].replace("Name", "").strip()
If your text still contains other characters, you can remove them with replace(":", ""), for instance. Alternatively, you can use the NLTK toolkit to remove all kinds of special characters from your text. Be careful, because names can also contain special characters.
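The same splitting idea can be applied to all field names at once with re.split. This is a rough sketch, assuming the field names are exactly the four from the question and never occur inside a value; FIELDS, splitter and parse_record are hypothetical names chosen for the example.

```python
import re

# Assumed list of field names, taken from the example in the question.
FIELDS = ["Name", "Age", "Gender", "Marital Status"]

# A capturing group makes re.split keep the field names in the result;
# the optional colon and following whitespace are consumed as part of the separator.
splitter = re.compile(r"\b(" + "|".join(map(re.escape, FIELDS)) + r")\b:?\s*")

def parse_record(text):
    parts = splitter.split(text.strip())
    # parts looks like ['', 'Name', 'Rakesh Rao ', 'Age', '34 ', ...]:
    # field names sit at odd positions, their values right after them.
    return {key: value.strip() for key, value in zip(parts[1::2], parts[2::2])}

print(parse_record("Name Rakesh Rao Age 34 Gender Male Marital Status Single"))
```

Because each value is simply the text up to the next field name, this variant does not depend on the order of the fields.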

Related

Regex Substring preceded by a specific number of any character

I have a list from which I want to extract a part of text from those elements which have the following pattern:
<Start of string><Less than 30 characters> advocate. versus
I only want the <Start of string><Less than 30 characters> part
The code which I think should have worked but didn't:
a = re.search('^.{,30}advocate. versus', text).group(1)
and
a = re.search('^(.{,30})advocate. versus', text).group(1)
Apart from these, I also tried
a = re.search('^(.*)advocate. versus', text).group(1)
which worked, but I only want less than 30 characters, not just any number of characters.
Examples:
Consider the list with two items:
['Mr. Rajesh Bhardwaj, Advocate ..... Appellant Through Ms. Prem Lata Bansal, Sr. Standing Counsel with Mr.Vishnu Sharma, Advocate. versus PRADEEP KUMAR SAHNI ..... Respondent Through None', 'Mr.Vishnu Sharma, Advocate. versus JYOTI APPARELS']
I want to extract the text from second element which has less than 30 characters before "advocate. versus" but not text from the first one which has more than 30 characters. Basically, I want this from the second item:
Mr.Vishnu Sharma,
Ignore the case of the text in the list, assume everything is in lowercase.
Any help would be really appreciated.
This is what you are looking for. Write the zero explicitly in the quantifier: {0,30} (Python's re also accepts {,30}, but some other regex flavors treat it as literal text). As I understand it, you don't want to capture the "advocate. versus" part; you can use a lookahead for that. It checks that "advocate. versus" follows, but does not include it in the match. Don't use ^ at the start, because it means "start of the line" and with this approach the match does not begin at the start of the line. Also keep in mind that regexes are case sensitive: "advocate" and "Advocate" are two different patterns. I made a regex that matches regardless of the case of the first letter.
Since, as I understand it, the match you want has a comma before it, we can use that to extract exactly the value you want: basically everything after the comma and before "advocate. versus".
(?<=,)[^,]{0,30},(?= [Aa]dvocate\. versus)
demo
https://regex101.com/r/cO5wcg/3
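For comparison, here is a sketch of the anchored approach from the question, applied to the two sample items. It assumes that "at most 30 characters counted from the start of the string" is the intended rule (use {0,29} for strictly fewer than 30) and that case should be ignored; PREFIX, items and prefixes are names made up for the example.

```python
import re

# Up to 30 characters from the start of the string, optional whitespace,
# then "advocate. versus" in any letter case.
PREFIX = re.compile(r"^(.{0,30}?)\s*advocate\. versus", re.IGNORECASE)

items = [
    "Mr. Rajesh Bhardwaj, Advocate ..... Appellant Through Ms. Prem Lata Bansal, "
    "Sr. Standing Counsel with Mr.Vishnu Sharma, Advocate. versus "
    "PRADEEP KUMAR SAHNI ..... Respondent Through None",
    "Mr.Vishnu Sharma, Advocate. versus JYOTI APPARELS",
]

prefixes = []
for item in items:
    m = PREFIX.match(item)
    if m:  # only items with a short enough prefix match
        prefixes.append(m.group(1))

print(prefixes)  # ['Mr.Vishnu Sharma,']
```

Note that group(1) only exists because the prefix is wrapped in parentheses; the first attempt in the question has no group, so .group(1) raises an error there.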

Need a python regular expression that can verify names with special characters(Hyphens, apostrophes, etc...)

I am trying to create a python regular expression that can match any name. I am scraping a web page and looking for the <h1> tag and grabbing the name in between it. The names can include James Dean, James-Dean, Brian O'Quin, Jame Joe-Harden, etc...
This was the first regular expression I have been working with but it is not catching all the names
<h1>[A-Z]{1}[a-z]+\s[A-Z]{1}[']?[A-Z]?[-]?[A-Z]?[a-z]+
Maybe this:
<h1>(([-'\w]+\s?)+)</h1>
Explaining:
Inside the character class, - and ' match themselves and \w matches letters, digits and underscores; the + captures one or more of these characters. A space character after them is optional, to support compound names.
Finally, the last + ensures that the structure I've just described can repeat.
Hope this helps.
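A quick way to try this on the names from the question is re.findall. The sketch below uses a slightly different but equivalent-in-spirit pattern (words joined by single whitespace characters, without the nested optional space); NAME_IN_H1 and the html string are assumptions made up for the example.

```python
import re

# One or more "words" made of letters, digits, underscores, hyphens or apostrophes,
# separated by single whitespace characters, between <h1> and </h1>.
NAME_IN_H1 = re.compile(r"<h1>([-'\w]+(?:\s[-'\w]+)*)</h1>")

# Hypothetical page fragment built from the names in the question.
html = (
    "<h1>James Dean</h1><h1>James-Dean</h1>"
    "<h1>Brian O'Quin</h1><h1>Jame Joe-Harden</h1>"
)

names = NAME_IN_H1.findall(html)
print(names)  # ['James Dean', 'James-Dean', "Brian O'Quin", 'Jame Joe-Harden']
```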

Python get arguments from string

I wanted to grab a argument from a string in python...
I wanted to grab the city of this string: weather in <city>
How do I get the city? Into a new variable?
Use Regular Expressions!
If you haven't heard of them, it's quite simple. Simply import the re module, and away you go!
>>> import re
Ok, maybe that wasn't so exciting. But now you can use pattern matching. Simply define your pattern:
>>> pattern = r"^(?P<thing>.*?) in (?P<city>.*?)$"
and away you go!
>>> re.match(pattern, "weather in my city")
<_sre.SRE_Match object; span=(0, 18), match='weather in my city'>
Don't worry! This is actually something useful. Let's store this in a variable so we can use it:
>>> match = re.match(pattern, "weather in my city")
>>> match.group("city")
'my city'
Hooray!
Now, what was that crazy pattern thing about? It worked, but it just seems like magic. Let me explain:
r"" just makes Python treat (most) backslashes as literal backslashes. So, r"\n" will be an actual \ followed by an actual n, as opposed to a new-line character. This is because regular expressions have special meanings for \ characters, and it's awkward to have to write \\ all the time.
^ means "start of the string".
(?P<name>...) is a named group. Normal groups are represented by (...), and can be referenced by their number (e.g. match.group(1); group 0 is the whole match). Named groups can also be referenced by number, but they can also be referenced by their name. The P stands for Python, because that's where the syntax originally came from. Neat!
. means "any character".
* means "repeated 0 or more times".
? means a few things, but when it's after a * or + it means "match as little as possible". This means that it will make the thing group have as few "any character"s as possible.
" in " means exactly what it looks like: a space followed by an i followed by an n followed by a space.
.*? again means "match as few of any character as possible", but... I'm not really sure why I wrote that, considering that
$ means "end of the string".
And yeah, they never really stop seeming like magic. (Unless you use Perl.) If you want to make your own regular expression or learn some more, have a look at the documentation for the re module.
If you have constant spaces in your string and your strings are not going to change, it's relatively easy. Just use split on your string.
x = "weather in <city>"
split_x = x.split(" ")
# will return you
["weather", "in", "<city>"]
city = split_x[2]
Look at split's docs. But suppose your city is something like "New York", then you'll have to look for some alternative because in that case, the list will be -
x = "weather in New York"
# O/P
["weather", "in", "New", "York"]
And then if you do this-
city = split_x[2]
you will get the wrong city name ("New" instead of "New York").
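One way around this, still using split, is to limit the number of splits so that everything after the second space stays in one piece. This is a small sketch that assumes the string always starts with exactly the two words "weather in".

```python
x = "weather in New York"

# maxsplit=2: split on the first two spaces only.
parts = x.split(" ", 2)
city = parts[2]

print(parts)  # ['weather', 'in', 'New York']
print(city)   # New York
```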
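With str.removeprefix() (Python 3.9 or later). Avoid s.lstrip('weather in ') here: lstrip() removes any leading characters found in the given set rather than a prefix, so it would also eat the start of a city name such as "new york".
s = "weather in Las Vegas"
city_name = s.removeprefix('weather in ')
print(city_name)
Prints:
Las Vegas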

Python Regex Skipping Optional Groups

I am trying to extract a doctor's name and title from a string. If "dr" is in the string, I want it to use that as the title and then use the next word as the doctor's name. However, I also want the regex to be compatible with strings that do not have "dr" in them. In that case, it should just match the first word as the doctor's name and assume no title.
I have come up with the following regex pattern:
pattern = re.compile(r'(DR\.? )?([A-Z]*)', re.IGNORECASE)
As I understand it, this should optionally match the letters "dr" (with or without a following period) and then a space, followed by a series of letters, case-insensitive. The problem is, it seems to only pick up the optional "dr" title if it is at the beginning of the string.
import re
pattern = re.compile(r'(DR\.? )?([A-Z]*)', re.IGNORECASE)
test1 = "Dr Joseph Fox"
test2 = "Joseph Fox"
test3 = "Optometry by Dr Joseph Fox"
print(pattern.search(test1).groups())
print(pattern.search(test2).groups())
print(pattern.search(test3).groups())
The code returns this:
('Dr ', 'Joseph')
(None, 'Joseph')
(None, 'Optometry')
The first two scenarios make sense to me, but why does the third not find the optional "Dr"? Is there a way to make this work?
You're seeing this behavior because regexes tend to be greedy and accept the first possible match. As a result, your regex is accepting only the first word of your third string, with no characters matching the first group, which is optional. You can see this by using the findall regex function:
>>> print(pattern.findall(test3))
[('', 'Optometry'), ('', ''), ('', 'by'), ('', ''), ('Dr ', 'Joseph'), ('', ''), ('', 'Fox'), ('', '')]
It's immediately obvious that 'Dr Joseph' was successfully found, but just wasn't the first matching part of your string.
In my experience, trying to coerce regexes to express/capture multiple cases is often asking for inscrutable regexes. Specifically answering your question, I'd prefer to run the string through one regex requiring the 'Dr' title, and if I fail to get any matches, just split on spaces and take the first word (or however you want to go about getting the first word).
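The two-step approach described above could be sketched as follows. This is only an illustration under a few assumptions: the title is always some form of "dr" followed by whitespace, the name is a single run of letters, and TITLED and extract_doctor are hypothetical names.

```python
import re

# Step 1 pattern: requires the title, anywhere in the string.
TITLED = re.compile(r"\b(dr)\.?\s+([a-z]+)", re.IGNORECASE)

def extract_doctor(text):
    """Return (title, name); title is None when no 'dr' is present."""
    m = TITLED.search(text)
    if m:
        return m.group(1), m.group(2)
    # Step 2 fallback: no title found, take the first word as the name.
    words = text.split()
    return None, (words[0] if words else None)

for sample in ["Dr Joseph Fox", "Joseph Fox", "Optometry by Dr Joseph Fox"]:
    print(extract_doctor(sample))
```

The word boundary before "dr" and the required whitespace after it keep names such as "Drake" from being read as a title.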
Regular expression engines match greedily from left to right. In other words: there is no "best" match and the first match will always be returned. You can do a global search, though...check out re.findall().
Your regex basically accepts any word, therefore it will be difficult to choose which one is the name of the doctor even after using findall if the dr is not present.
Is the re.IGNORECASE really important? Are you only interested in the name of the doctor or both name and surname?
I would recommend using a regex that matches two words starting with an uppercase letter with only one space in between, keeping the optional dr before them.
If re.IGNORECASE is really important, it may be better to first search for dr and, if that is unsuccessful, store the first word as the name, as proposed before.
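Look at the (?<=...) look-behind syntax in the Python regex documentation.
Python's re module only accepts fixed-width look-behinds, so the optional period needs two alternatives. Your pattern will look roughly like this:
(?:(?<=DR )|(?<=DR\. ))([A-Z]+)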
You are only looking for Dr when the string starts with it, you aren't searching for a string containing Dr.
Try:
pattern = re.compile(r'(.*DR\.? )?([A-Z]*)', re.IGNORECASE)

Regex for extracting name starting with Mr.|Mrs

I was trying to write regex for identifying name starting with
Mr.|Mrs.
for example
Mr. A, Mrs. B.
I tried several expressions. These regular expressions were checked on online tool at pythonregex.com. The test string used is:
"hey where is Mr A how are u Mrs. B tt`"
Outputs mentioned are of findall() function of Python, i.e.
regex.findall(string)
Their respective outputs with regex are below.
Mr.|Mrs. [a-zA-Z]+ o/p-[u'Mr ', u'Mrs']
why A and B are not appearing with Mr. and Mrs.?
[Mr.|Mrs.]+ [a-zA-Z]+ o/p-[u's Mr', u'. B']
Why s is coming with Mr. instead of A?
I tried many more combinations but these are confusing so here are they. For name part I know regex has to cover more conditions but was starting from basic.
Change your regex like below,
(?:Mr\.|Mrs\.) [a-zA-Z]+
You need to put Mr\. and Mrs\. inside a capturing or non-capturing group so that the | (OR) applies only within the group; otherwise the alternation splits the whole pattern into Mr. on one side and Mrs. [a-zA-Z]+ on the other.
You also need to escape the dot to match a literal dot; otherwise it matches any character. . is a special metacharacter in regex that matches any character except line breaks.
OR
Even shorter one,
Mrs?\. [a-zA-Z]+
The ? quantifier above makes the preceding character, s, optional.
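Run against the test string from the question, the shorter pattern looks like this. Note that the test string has "Mr A" without a dot, so a pattern that requires the dot only finds "Mrs. B"; the second pattern, which also makes the dot optional, is an assumption about what is wanted and not part of the answer above.

```python
import re

text = "hey where is Mr A how are u Mrs. B tt`"

# Dot required, as in the pattern above.
strict = re.findall(r"Mrs?\. [a-zA-Z]+", text)

# Assumed variant: the dot is optional too, and \b keeps "Mr" from matching mid-word.
lenient = re.findall(r"\bMrs?\.? [a-zA-Z]+", text)

print(strict)   # ['Mrs. B']
print(lenient)  # ['Mr A', 'Mrs. B']
```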
There's a python library for parsing human names :
https://github.com/derek73/python-nameparser
Much better than writing your own regex.
