I have been creating a few regex patterns to search a file. I basically need to search each line of a text file as a string of values. The issue I am having is that the regexes I have created work when used against a list of values; however, the same regex does not work when I search a string. I'm not sure what I am missing. My test code is below. The regex works against list_primary, but when I change it to string2, the regex does not find the date value I'm looking for.
import re
list_primary = ["Wi-Fi", "goat", "Access Point", "(683A1E320680)", "detected", "Access Point detected", "2/5/2021", "10:44:45 PM", "Local", "41.289227", "-72.958748"]
string1 = "Wi-Fi Access Point (683A1E320680) detected puppy Access Point detected 2/5/2021 10:44:45 PM Local 41.289227 -72.958748"
#Lattitude = re.findall("[0-9][0-9][.][0-9][0-9][0-9][0-9][0-9][0-9]")
#Longitude = re.findall("[-][0-9][0-9][.][0-9][0-9][0-9][0-9][0-9][0-9]")
string2 = string1.split('"')
# print(string2)
list1 = []
for item in string2:
    data_dict = {}
    date_field = re.search(r"(\d{1})[/.-](\d{1})[/.-](\d{4})$",item)
    print(date_field)
    if date_field is not None:
        date = date_field.group()
    else:
        date = None
For your current expression to work on the string, you need to delete the dollar sign from the end: $ anchors the match to the end of the string, which works for the list item "2/5/2021" but not for a date in the middle of a line. Also, to find two-digit days and months (for example 11/20/2018), you need to change your repetitions, since your regex can only match single-digit values such as 2/5/2011:
import re
list_primary = ["Wi-Fi", "goat", "Access Point", "(683A1E320680)", "detected", "Access Point detected", "2/5/2021", "10:44:45 PM", "Local", "41.289227", "-72.958748"]
string1 = "Wi-Fi Access Point (683A1E320680) detected puppy Access Point detected 2/5/2021 10:44:45 PM Local 41.289227 -72.958748"
#Lattitude = re.findall("[0-9][0-9][.][0-9][0-9][0-9][0-9][0-9][0-9]")
#Longitude = re.findall("[-][0-9][0-9][.][0-9][0-9][0-9][0-9][0-9][0-9]")
string2 = string1.split('"')
# print(string2)
list1 = []
for item in string2:
    data_dict = {}
    date_field = re.search(r"(\d{1,2})[/.-](\d{1,2})[/.-](\d{4})",item)
    print(date_field)
    if date_field is not None:
        date = date_field.group()
    else:
        date = None
Output:
<re.Match object; span=(71, 79), match='2/5/2021'>
If you want to extract the date from your string (rather than just check whether it exists), use re.findall with a single capturing group around the whole expression, so that the date comes back as one string and not as 3 separate numbers:
date_field = re.findall(r"(\d{1,2}[/.-]\d{1,2}[/.-]\d{4})",string1)
print(date_field)
Output:
['2/5/2021']
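To make the effect of the trailing $ concrete, here is a minimal sketch. The variable names and the shortened sample line are only illustrative; the patterns are the ones discussed above, with and without the anchor.

```python
import re

# Same date pattern twice: once anchored to the end of the input, once not.
anchored = re.compile(r"\d{1,2}[/.-]\d{1,2}[/.-]\d{4}$")
unanchored = re.compile(r"\d{1,2}[/.-]\d{1,2}[/.-]\d{4}")

item = "2/5/2021"                                       # one element of the list
line = "Access Point detected 2/5/2021 10:44:45 PM Local"  # date inside a longer line

# The list element ends with the date, so the anchored pattern matches it.
item_match = anchored.search(item)
# In the full line the date is followed by more text, so "$" prevents a match.
line_match_anchored = anchored.search(line)
# Without the anchor the date is found anywhere in the line.
line_match = unanchored.search(line)

print(item_match.group(), line_match_anchored, line_match.group())
```

This is why the same expression behaves differently on the list and on the string: each list element is the whole input, while in the string the date sits in the middle.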
Related
I have a function which extracts the company register number (German: Handelsregisternummer) from a given text. Although my regex for this particular problem matches the correct format (please see demo), I cannot extract the correct company register number.
I want to extract HRB 142663 B but I get HRB 142663.
Most numbers are in the format HRB 123456 but sometimes there is the letter B attached to the end.
import re
def get_handelsregisternummer(string, keyword):
    # https://regex101.com/r/k6AGmq/10
    reg_1 = fr'\b{keyword}[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*)(?: B)?'
    match = re.compile(reg_1)
    handelsregisternummer = match.findall(string) # list of matched words
    if handelsregisternummer: # not empty
        return handelsregisternummer[0]
    else: # no match found
        handelsregisternummer = ""
    return handelsregisternummer
Example text scraped from a website. Missing line breaks make the words run together:
text_impressum = """"Berlin, HRB 142663 BVAT-ID.: DE283580648Tax Reference Number:"""
Apply function:
for keyword in ['HRB', 'HRA', 'HR B', 'HR A']:
    handelsregisternummer = get_handelsregisternummer(text_impressum, keyword=keyword)
    if handelsregisternummer: # if list is not empty anymore, then do...
        handelsregisternummer = keyword + " " + handelsregisternummer
        break
if not handelsregisternummer: # if list is empty
    handelsregisternummer = 'not specified'
handelsregisternummer_dict = {'handelsregisternummer':handelsregisternummer}
Afterwards I get:
handelsregisternummer_dict ={'handelsregisternummer': 'HRB 142663'}
But I want this:
handelsregisternummer_dict ={'handelsregisternummer': 'HRB 142663 B'}
You need to use two capturing groups in the regex, one for the keyword and one for the number including the optional B suffix, and just match the rest:
reg_1 = fr'\b({keyword})[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*(?: B)?)'
#            |_________|                                   |___________________|
Then, you need to join all the capturing groups that findall returns for the match:
    if handelsregisternummer: # if list is not empty anymore, then do...
        handelsregisternummer = " ".join(handelsregisternummer)
        break
See the Python demo.
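Putting the two changes together, a self-contained sketch could look like the following. Note the assumptions: it uses re.search and match.groups() instead of findall, and wraps the keyword in re.escape; neither is part of the original code, they are just one way to write the same idea. The sample text is the one from the question, without the extra leading quote.

```python
import re

def get_handelsregisternummer(text, keyword):
    # Group 1: the keyword. Group 2: the number plus the optional " B" suffix.
    pattern = re.compile(
        rf'\b({re.escape(keyword)})[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*(?: B)?)'
    )
    match = pattern.search(text)
    # Join both captured groups, or return an empty string when nothing matched.
    return " ".join(match.groups()) if match else ""

text_impressum = "Berlin, HRB 142663 BVAT-ID.: DE283580648Tax Reference Number:"

result = "not specified"
for keyword in ['HRB', 'HRA', 'HR B', 'HR A']:
    found = get_handelsregisternummer(text_impressum, keyword)
    if found:
        result = found
        break

handelsregisternummer_dict = {'handelsregisternummer': result}
print(handelsregisternummer_dict)
```

Because the text has "B" glued to "VAT-ID", the optional (?: B)? will also pick up that B; that is what the question asks for, but it is worth keeping in mind for other inputs.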
It's my first time with regex and I have some issues that I hope you can help me with. Here is an example of the data:
chartData.push({
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
2007,
10,
1 );
What I want to retrieve is the date, which is in the last pair of brackets, and the corresponding description. I have no idea how to do it with one regex, so I decided to split it into two.
First part:
I retrieve the value after description:. This was managed with the following pattern: [\n\r].*description:\s*([^\n\r]*) The output includes the quotes ("9710"), but that is fine and no changes are required.
Second part:
Here it gets tricky. I want to retrieve the values in brackets after the text newDate.setFullYear. Unfortunately, what I have managed so far only gets the values inside any pair of brackets. For that, I used the following pattern: \(([^)]*)\) The result is that it picks up all 3 bracket pairs in the example:
"{
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
}",
"()",
"2007,
10,
1 "
What I am missing is an AND operator for regex, which would allow me to construct a pattern that retrieves the data in brackets only after a specific text.
I could, of course, pick every 3rd result, but unfortunately that doesn't work for the whole dataset.
Does anyone know how to resolve the second part?
Thanks in advance.
You can use the following expression:
res = re.search(r'description: "([^"]+)".*newDate.setFullYear\((.*)\);', text, re.DOTALL)
This will return a regex match object with two groups, that you can fetch using:
res.groups()
The result is then:
('9710', '\n2007,\n10,\n1 ')
You can of course parse these groups in any way you want. For example:
date = res.groups()[1]
[s.strip() for s in date.split(",")]
==>
['2007', '10', '1']
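For completeness, here is a runnable sketch of the same approach on the sample data from the question. The variable names are illustrative, and the dot in newDate.setFullYear is escaped here, which the expression above does not do; the result is the same for this input.

```python
import re

text = '''chartData.push({
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
2007,
10,
1 );'''

# Group 1: the description value. Group 2: everything between the
# parentheses of newDate.setFullYear(...). re.DOTALL lets "." span newlines.
res = re.search(
    r'description: "([^"]+)".*newDate\.setFullYear\((.*)\);',
    text,
    re.DOTALL,
)
description, raw_date = res.groups()

# int() ignores surrounding whitespace, so no explicit strip() is needed.
year, month, day = (int(part) for part in raw_date.split(","))
print(description, year, month, day)
```

Keep in mind that JavaScript months are zero-based, so the 10 passed to setFullYear means November if you later build a Python date from these numbers.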
import re
test = r"""
chartData.push({
date: 'newDate',
visits: 9710,
color: "#016b92",
description: "9710"
})
var newDate = new Date()
newDate.setFullYear(
2007,
10,
1);"""
m = re.search(r".*newDate\.setFullYear(\(\n.*\n.*\n.*\));", test, re.DOTALL)
print(m.group(1).rstrip("\n").replace("\n", "").replace(" ", ""))
The result:
(2007,10,1)
The AND part that you are referring to is not really an operator. The pattern matches characters from left to right, so after capturing the value in group 1 you could match everything that comes before the values you want to capture in group 2.
What you could do is repeat matching all following lines that do not start with newDate.setFullYear(
Then, when you do encounter that value, match it and capture in group 2 all characters except parentheses.
\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);
Regex demo | Python demo
Example code
import re
regex = r"\r?\ndescription: \"([^\"]+)\"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);"
test_str = ("chartData.push({\n"
"date: newDate,\n"
"visits: 9710,\n"
"color: \"#016b92\",\n"
"description: \"9710\"\n"
"});\n"
"var newDate = new Date();\n"
"newDate.setFullYear(\n"
"2007,\n"
"10,\n"
"1 );")
print (re.findall(regex, test_str))
Output
[('9710', '\n2007,\n10,\n1 ')]
There is another option to get group 1 and the separate digits in group 2, using the regex module from PyPI, which supports the \G anchor:
(?:\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(|\G)\r?\n(\d+),?(?=[^()]*\);)
Regex demo
Intro
Hello, I'm working on a project that requires me to replace dictionary keys within a pandas column of text with values - but with potential misspellings. Specifically I am matching names within a pandas column of text and replacing them with "First Name". For example, I would be replacing "tommy" with "First Name".
However, I realize there's the issue of misspelled names and text within the column of strings that won't be replaced by my dictionary. For example, "tommmy" has an extra m and is not a first name within my dictionary.
import pandas as pd
#Create df
d = {'message' : pd.Series(['awesome', 'my name is tommmy , please help with...', 'hi tommy , we understand your quest...'])}
names = ["tommy", "zelda", "marcon"]
#create dict
namesdict = {r'(^|\s){}($|\s)'.format(el): r'\1FirstName\2' for el in names}
#replace
d['message'].replace(namesdict, regex = True)
#output
Out:
0 awesome
1 my name is tommmy , please help with...
2 hi FirstName , we understand your quest...
dtype: object
So "tommmy" doesn't match "tommy" in the dictionary, and I need to deal with misspellings. I thought about doing this prior to the actual dictionary key and value replacement: scan through the pandas data frame and replace the misspelled words within the column of strings ("message") with the appropriate name. I've seen a similar approach using an index on specific strings,
but how do you match and replace words within the sentences within a pandas df, using a list of correct spellings? Can I do this within the Series.replace argument? Should I stick with a regex string replace?
Any suggestions appreciated.
Update , trying Yannis's answer
I'm trying Yannis's answer but I need to use a list from an outside source, specifically the US census of first names for matching. But it's not matching on the whole names with the string I download.
import re
import pandas as pd
d = {'message' : pd.Series(['awesome', 'my name is tommy , please help with...', 'hi tommy , we understand your quest...'])}
import requests
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')
#US Census first names (5000 +)
firstnamelist = re.findall(r'\n(.*?)\s', r.text, re.DOTALL)
#turn list to string, force lower case
fnstring = ', '.join('"{0}"'.format(w) for w in firstnamelist )
fnstring = ','.join(firstnamelist)
fnstring = (fnstring.lower())
##turn to list, prepare it so it matches the name preceded by either the beginning of the string or whitespace.
names = [x.strip() for x in fnstring.split(',')]
#import jellyfish
import difflib
def best_match(tokens, names):
    for i,t in enumerate(tokens):
        closest = difflib.get_close_matches(t, names, n=1)
        if len(closest) > 0:
            return i, closest[0]
    return None
def fuzzy_replace(x, y):
    names = y # just a simple replacement list
    tokens = x.split()
    res = best_match(tokens, y)
    if res is not None:
        pos, replacement = res
        tokens[pos] = "FirstName"
        return u" ".join(tokens)
    return x
d["message"].apply(lambda x: fuzzy_replace(x, names))
Results in:
Out:
0 FirstName
1 FirstName name is tommy , please help with...
2 FirstName tommy , we understand your quest...
But if I use a smaller list like this it works:
names = ["tommy", "caitlyn", "kat", "al", "hope"]
d["message"].apply(lambda x: fuzzy_replace(x, names))
Is it something with the longer list of names that's causing a problem?
Edit:
Changed my solution to use difflib. The core idea is to tokenize your input text and match each token against a list of names. If best_match finds a match then it reports the position (and the best matching string), so then you can replace the token with "FirstName" or anything you want. See the complete example below:
import pandas as pd
import difflib
df = pd.DataFrame(data=[(0,"my name is tommmy , please help with"), (1, "hi FirstName , we understand your quest")], columns=["A", "message"])
def best_match(tokens, names):
    for i,t in enumerate(tokens):
        closest = difflib.get_close_matches(t, names, n=1)
        if len(closest) > 0:
            return i, closest[0]
    return None
def fuzzy_replace(x):
    names = ["tommy", "john"] # just a simple replacement list
    tokens = x.split()
    res = best_match(tokens, names)
    if res is not None:
        pos, replacement = res
        tokens[pos] = "FirstName"
        return u" ".join(tokens)
    return x
df.message.apply(lambda x: fuzzy_replace(x))
And the output you should get is the following
0 my name is FirstName , please help with
1 hi FirstName , we understand your quest
Name: message, dtype: object
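One plausible reason a very long name list replaces the wrong token is that get_close_matches accepts anything with a similarity ratio of at least 0.6 by default, and short ordinary words are often that close to some short name. The sketch below illustrates this with a made-up three-name list (the list and the closest_name helper are assumptions for illustration, not part of the code above); raising the cutoff parameter is one possible mitigation, not a guaranteed fix.

```python
import difflib

names = ["tommy", "amy", "john"]  # hypothetical name list for illustration

def closest_name(token, cutoff=0.6):
    # Return the best matching name, or None if nothing reaches the cutoff.
    matches = difflib.get_close_matches(token.lower(), names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# With the default cutoff, the ordinary word "my" is close enough to "amy".
loose = [closest_name(t) for t in ["my", "tommmy"]]

# A stricter cutoff rejects "my" but still accepts the misspelling "tommmy".
strict = [closest_name(t, cutoff=0.85) for t in ["my", "tommmy"]]

print(loose, strict)
```

Since best_match returns at the first token with any close match, an early false positive such as "my" also hides the real name later in the sentence.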
Edit 2
After the discussion, I decided to have another go, using NLTK for part-of-speech tagging and running the fuzzy matching only for the NNP tags (proper nouns) against the name list. The problem is that sometimes the tagger doesn't get the tag right, e.g. "Hi" might also be tagged as a proper noun. However, if the list of names is lowercased, then get_close_matches doesn't match Hi against a name but matches all other names. I recommend that df["message"] is not lowercased, to increase the chances that NLTK tags the names properly. One can also play with StanfordNER, but nothing will work 100%. Here is the code:
import pandas as pd
import difflib
from nltk import pos_tag, wordpunct_tokenize
import requests
import re
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')
# US Census first names (5000 +)
firstnamelist = re.findall(r'\n(.*?)\s', r.text, re.DOTALL)
# turn list to string, force lower case
# simplified things here
names = [w.lower() for w in firstnamelist]
df = pd.DataFrame(data=[(0,"My name is Tommmy, please help with"),
                        (1, "Hi Tommy , we understand your question"),
                        (2, "I don't talk to Johhn any longer"),
                        (3, 'Michale says this is stupid')
                        ], columns=["A", "message"])
def match_names(token, tag):
    print(token, tag)
    if tag == "NNP":
        best_match = difflib.get_close_matches(token, names, n=1)
        if len(best_match) > 0:
            return "FirstName" # or best_match[0] if you want to return the name found
        else:
            return token
    else:
        return token
def fuzzy_replace(x):
    tokens = wordpunct_tokenize(x)
    pos_tokens = pos_tag(tokens)
    # Every token is a tuple (token, tag)
    result = [match_names(token, tag) for token, tag in pos_tokens]
    x = u" ".join(result)
    return x
df['message'].apply(lambda x: fuzzy_replace(x))
And I get in the output:
0 My name is FirstName , please help with
1 Hi FirstName , we understand your question
2 I don ' t talk to FirstName any longer
3 FirstName says this is stupid
Name: message, dtype: object
I'm trying to scrape dates from URLs of blogs and the like.
Since there's no universal way to get a date, I am, for now, relying
on the date being in the URL of the resource.
The dates come for the most part, in these formats:
url1 = "foo/bar/baz/2014/01/01/more/text"
url2 = "foo/bar/baz/2014/01/more/text"
url3 = "foo/bar/baz/20140101/more/text"
url4 = "foo/bar/baz/2014-01-01/more/text"
url5 = "foo/bar/baz/2014-01more/text"
url6 = "foo/bar/baz/2014_01_01/more/text"
url7 = "foo/bar/baz/2014_01/more/text"
# forgot one
url8 = "foo/bar/baz20140101more/text"
I've written a brute force code to get what I want.
It's explicit, but not elegant and probably not very robust.
I tried to cover the cases where I match "/" or "-" or "_" with a single pattern, with no luck.
So I'm curious as to how one does that.
Although my main question is:
What's the best robust way to capture dates in a URL with the intention of converting them to datetime objects.
I don't think it's common for time elements to be in the format.
Cheers !
UPDATE
I believe I have the solution from Casimer. I'd like to add one more
url-date format that I missed before and might add a little trouble:
# this one may not have a regex solution. Maybe machine learning.
# and it's not that big a deal if I get the wrong day for this application.
# I think it's safe to assume that a legit date with Y/m/d will have
# a trailing "/": /Y/m/d/
http://www.nakedcapitalism.com/2014/03/17-million-reasons-rent-control-efficient.html
2014/03/17 # group captured
2014-03-17 00:00:00 # date time object
http://www.nakedcapitalism.com/2014/11/200pm-water-cooler-11514.html
2014/11/20
2014-11-20 00:00:00
# i put more restrictions on the number matching, but perhaps there's a better way...?
pat = r'(20[0-1][0-5]([-_/]?)[0-1][0-9]\2[0-3][0-9])'
Existing ugly solution:
NOTE: I've restricted the year info, because I was capturing strings of numbers that do not represent a date. Plus I figured it was more robust that way.
def get_date_from_url(self, url):
    #pat = "(20[0-14]{2}\w+[0-9]{2}(?!\w+[0-9]{2}))"
    pat = "(20[0-1][0-5]/[0-9]{2}/[0-9]{2})"
    ob1 = re.compile(pat)
    pat = "(20[0-1][0-5]-[0-9]{2}-[0-9]{2})"
    ob2 = re.compile(pat)
    pat = "(20[0-1][0-5]_[0-9]{2}_[0-9]{2})"
    ob3 = re.compile(pat)
    pat = "(20[0-1][0-5]/[0-9]{2})"
    ob4 = re.compile(pat)
    pat = "(20[0-1][0-5]-[0-9]{2})"
    ob5 = re.compile(pat)
    pat = "(20[0-1][0-5]_[0-9]{2})"
    ob6 = re.compile(pat)
    if ob1.search(url):
        grp = ob1.search(url).group()
    elif ob2.search(url):
        grp = ob2.search(url).group()
    elif ob3.search(url):
        grp = ob3.search(url).group()
    elif ob4.search(url):
        grp = ob4.search(url).group()
    elif ob5.search(url):
        grp = ob5.search(url).group()
    elif ob6.search(url):
        grp = ob6.search(url).group()
    else:
        return None
    print(url)
    print(grp)
    grp = re.sub('_', '/', grp) # fail to match return orig string
    date = to_datetime(grp)
    if isinstance(date, datetime.datetime):
        print(date)
    else:
        return None
You can use this:
pat = r'(20[0-1][0-5]([-_/]?)[0-9]{2}(?:\2[0-9]{2})?)'
The delimiter is captured in group 2, so I use a backreference \2 for the second delimiter. The delimiter can be -, _ or /, but is optional too (because of the ? quantifier).
This makes the day optional too by putting it in an optional non-capturing group: (?:\2[0-9]{2})?
Note that you can add slashes at the beginning and at the end to ensure that the date is enclosed between path separators.
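As a rough sketch of how this pattern could feed a datetime conversion: the helper below strips the delimiters from the matched text and parses the remaining digits with datetime.strptime. The function name date_from_url and the choice of strptime are assumptions made for this illustration; the pattern itself is the one given above.

```python
import re
from datetime import datetime

DATE_PAT = re.compile(r'(20[0-1][0-5]([-_/]?)[0-9]{2}(?:\2[0-9]{2})?)')

def date_from_url(url):
    """Return a datetime for the first date-like part of url, or None."""
    m = DATE_PAT.search(url)
    if m is None:
        return None
    # Drop the delimiters so every variant becomes YYYYMMDD or YYYYMM.
    digits = re.sub(r'[-_/]', '', m.group(1))
    fmt = '%Y%m%d' if len(digits) == 8 else '%Y%m'
    try:
        return datetime.strptime(digits, fmt)
    except ValueError:
        # The digits looked like a date but are not a valid one.
        return None

urls = [
    "foo/bar/baz/2014/01/01/more/text",
    "foo/bar/baz/2014/01/more/text",
    "foo/bar/baz/20140101/more/text",
    "foo/bar/baz/2014-01more/text",
    "foo/bar/baz/2014_01_01/more/text",
    "foo/bar/baz20140101more/text",
]
dates = [date_from_url(u) for u in urls]
print(dates)
```

When the day is missing, strptime falls back to the first of the month. A URL such as the "17-million" example would still be read as a full date, so this does not solve that ambiguity.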
I'm trying to execute a bunch of code only if the string I'm searching contains a comma.
Here's an example set of rows that I would need to parse (name is a column header for this tab-delimited file, and the column (annoyingly) contains the name, degree, and area of practice):
name
Sam da Man J.D.,CEP
Green Eggs Jr. Ed.M.,CEP
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.
My issue is that some of the rows contain a comma, which is followed by an acronym which represents an "area of practice" for the professional and some do not.
My code relies on the principle that each line contains a comma, and I will now have to modify the code in order to account for lines where there is no comma.
def parse_ieca_gc(s):
    ########################## HANDLE NAME ELEMENT ###############################
    degrees = ['M.A.T.','Ph.D.','MA','J.D.','Ed.M.', 'M.A.', 'M.B.A.', 'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.', 'M.D.']
    degrees_list = []
    # separate area of practice from name and degree and bind this to var 'area'
    split_area_nmdeg = s['name'].split(',')
    area = split_area_nmdeg.pop() # when there is no area of practice and hence no comma, this pops out the name + deg and leaves an empty list, that's why 'print split_area_nmdeg' returns nothing and 'area' returns the name and deg when there's no comma
    print('split area nmdeg')
    print(area)
    print(split_area_nmdeg)
    # Split the name and deg by spaces. If there's a deg, it will match with one of elements and will be stored deg list. The deg is removed name_deg list and all that's left is the name.
    split_name_deg = re.split(r'\s',split_area_nmdeg[0])
    for word in split_name_deg:
        for deg in degrees:
            if deg == word:
                degrees_list.append(split_name_deg.pop())
    name = ' '.join(split_name_deg)
    # area of practice
    category = area
re.search() and re.match() both do not work, it appears, because they return instances and not a boolean, so what should I use to tell if there's a comma?
The easiest way in python to see if a string contains a character is to use in. For example:
if ',' in s['name']:
if re.match(...) is not None:
Use this instead of looking for a boolean: re.match returns a match object on success and None on failure.
You are already searching for a comma. Just use the results of that search (splitting a string that contains one comma gives two parts, so check for more than one):
split_area_nmdeg = s['name'].split(',')
if len(split_area_nmdeg) > 1:
    print("Your old code goes here")
else:
    print("Your new code goes here")
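If the goal is simply to handle rows with and without a comma in one code path, str.partition may be worth a look, since it always returns three parts and leaves the separator empty when no comma is found. This is only a sketch; the function name split_name_and_area is made up for illustration and is not part of the original code.

```python
def split_name_and_area(raw):
    """Split 'name degree,AREA' into (name_and_degree, area_or_None)."""
    # partition returns (before, separator, after); the separator is ''
    # when the string contains no comma at all.
    name_deg, sep, area = raw.partition(',')
    return name_deg.strip(), (area.strip() if sep else None)

rows = [
    "Sam da Man J.D.,CEP",
    "Green Eggs Jr. Ed.M.,CEP",
    "Argle Bargle Sr. MA",
    "Cersei Lannister M.A. Ph.D.",
]
parsed = [split_name_and_area(row) for row in rows]
print(parsed)
```

The rest of the parsing (matching degrees against the list) can then work on the first element regardless of whether an area of practice was present.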