I'm trying to scrape dates from the URLs of blogs and the like.
Since there's no universal way to get a date, for now I am relying
on the date being in the URL of the resource.
The dates come, for the most part, in these formats:
url1 = "foo/bar/baz/2014/01/01/more/text"
url2 = "foo/bar/baz/2014/01/more/text"
url3 = "foo/bar/baz/20140101/more/text"
url4 = "foo/bar/baz/2014-01-01/more/text"
url5 = "foo/bar/baz/2014-01more/text"
url6 = "foo/bar/baz/2014_01_01/more/text"
url7 = "foo/bar/baz/2014_01/more/text"
# forgot one
url8 = "foo/bar/baz20140101more/text"
I've written some brute-force code to get what I want.
It's explicit, but not elegant and probably not very robust.
I tried to cover the cases where I match "/" or "-" or "_", with no luck,
so I'm curious as to how one does that.
My main question, though, is:
What's the most robust way to capture dates in a URL with the intention of converting them to datetime objects?
I don't think it's common for time elements to be in the format.
Cheers!
UPDATE
I believe I have the solution from Casimer. I'd like to add one more
url-date format that I missed before and might add a little trouble:
# this one may not have a regex solution. Maybe machine learning.
# and it's not that big a deal if I get the wrong day for this application.
# I think it's safe to assume that a legit date with Y/m/d will have
# a trailing "/", as in /Y/m/d/
http://www.nakedcapitalism.com/2014/03/17-million-reasons-rent-control-efficient.html
2014/03/17 # group captured
2014-03-17 00:00:00 # date time object
http://www.nakedcapitalism.com/2014/11/200pm-water-cooler-11514.html
2014/11/20
2014-11-20 00:00:00
# i put more restrictions on the number matching, but perhaps there's a better way...?
pat = r'(20[0-1][0-5]([-_/]?)[0-1][0-9]\2[0-3][0-9])'
Existing ugly solution:
NOTE: I've restricted the year info, because I was capturing strings of numbers that do not represent a date. Plus I figured it was more robust that way.
def get_date_from_url(self, url):
    #pat = "(20[0-14]{2}\w+[0-9]{2}(?!\w+[0-9]{2}))"
    # year/month/day patterns, one per delimiter
    pat = "(20[0-1][0-5]/[0-9]{2}/[0-9]{2})"
    ob1 = re.compile(pat)
    pat = "(20[0-1][0-5]-[0-9]{2}-[0-9]{2})"
    ob2 = re.compile(pat)
    pat = "(20[0-1][0-5]_[0-9]{2}_[0-9]{2})"
    ob3 = re.compile(pat)
    # year/month patterns, one per delimiter
    pat = "(20[0-1][0-5]/[0-9]{2})"
    ob4 = re.compile(pat)
    pat = "(20[0-1][0-5]-[0-9]{2})"
    ob5 = re.compile(pat)
    pat = "(20[0-1][0-5]_[0-9]{2})"
    ob6 = re.compile(pat)
    if ob1.search(url):
        grp = ob1.search(url).group()
    elif ob2.search(url):
        grp = ob2.search(url).group()
    elif ob3.search(url):
        grp = ob3.search(url).group()
    elif ob4.search(url):
        grp = ob4.search(url).group()
    elif ob5.search(url):
        grp = ob5.search(url).group()
    elif ob6.search(url):
        grp = ob6.search(url).group()
    else:
        return None
    print(url)
    print(grp)
    grp = re.sub('_', '/', grp) # fail to match return orig string
    date = to_datetime(grp)
    if isinstance(date, datetime.datetime):
        print(date)
    else:
        return None
You can use this:
pat = r'(20[0-1][0-5]([-_/]?)[0-9]{2}(?:\2[0-9]{2})?)'
The delimiter is captured in group 2, so I use the backreference \2 for the second delimiter. The delimiter can be -, _ or /, but it is optional too (because of the ? quantifier).
The day is made optional as well by putting it in an optional non-capturing group: (?:\2[0-9]{2})?
Note that you can add slashes at the beginning and at the end to ensure that the date is enclosed between path separators.
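As a rough sketch of how that pattern could feed a datetime conversion: the function name date_from_url and the choice of datetime.strptime are assumptions of this example, not part of the answer. Stripping the delimiter first means only two formats need handling, and strptime rejects impossible values such as month 13.

```python
import re
from datetime import datetime

# Pattern from the answer: year, optional delimiter (group 2), month, optional day.
DATE_RE = re.compile(r'(20[0-1][0-5]([-_/]?)[0-9]{2}(?:\2[0-9]{2})?)')

def date_from_url(url):
    """Return a datetime for the first date-like run in url, or None (hypothetical helper)."""
    m = DATE_RE.search(url)
    if m is None:
        return None
    # Drop the delimiters so only YYYYMMDD or YYYYMM remains.
    digits = re.sub(r'[-_/]', '', m.group(1))
    fmt = '%Y%m%d' if len(digits) == 8 else '%Y%m'
    try:
        return datetime.strptime(digits, fmt)
    except ValueError:
        # The digits looked like a date but are not a valid one.
        return None

for url in ("foo/bar/baz/2014/01/01/more/text",
            "foo/bar/baz/2014_01/more/text",
            "foo/bar/baz20140101more/text"):
    print(url, date_from_url(url))
```

When the day is missing, strptime defaults it to the first of the month. Titles that begin with digits (the "17-million" case from the update) are still read as a day by this sketch.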
I wasn't sure what to call this title, feel free to edit it if you think there is a better name.
What I am trying to do is find cases that match certain search criteria.
Specifically, I am trying to find sentences that contain the word "where". Once I have identified one, I want to check whether the word "SqlCommand" also appears within that same tag.
Let's say I have a dataframe that looks like this:
search_criteria = ['where']
df4
Q R
0 file.sql <sentence>dave likes stuff</sentence><properties>version = "2", description = "example" type="SqlCommand">select id, name, from table where criteria = '5'</property><sentence>dave hates stuff>
0 file.sql <sentence>dave likes stuff</sentence><properties>version = "2", description = "example">select id, name, from table where criteria = '5'</properties><sentence>dave hates stuff>
I am trying to return this:
Q R
0 file.sql <properties>version = "2", description = "example">select id, name, from table</properties>
This record should get returned because it contains both "where" and "sqlcommand".
Here is my current process:
regex_stuff = df_all_xml_mfiles_tgther[cc:cc+1].R.str.findall('(<[^<]*?' + 'where' + '[^>]*?>)', re.IGNORECASE)
sql_command_regex_stuff = df_all_xml_mfiles_tgther[cc:cc+1].R.str.findall('(<property[^<]*?' + 'sqlcommand' + '[^>]*?<\/property>)', re.IGNORECASE)
if not regex_stuff.empty: #if one of the search criteria is found
    if not sql_command_regex_stuff.empty: #check to see if the phrase "sqlcommand" is found anywhere as well
        (insert rest of code)
This does not return anything.
What am I doing wrong?
Edit #1:
It seems like I need to do something at the end, to make the regex look something like this:
<property[^<]*?SqlCommand[^(<\/property>)]*
This doesn't work yet, but I feel like it is a step in the right direction.
You could just filter with str.contains:
df[(df['R'].str.contains('where', flags=re.IGNORECASE) & df['R'].str.contains('sqlcommand', flags=re.IGNORECASE))]
Q R
0 file.sql <sentence>dave likes stuff</sentence><properti...
or use ~ to return the opposite: rows that do not contain both 'sqlcommand' and 'where'
df[~(df['R'].str.contains('where', flags=re.IGNORECASE) & df['R'].str.contains('sqlcommand', flags=re.IGNORECASE))]
Q R
1 file.sql <sentence>dave likes stuff</sentence><properti...
First of all, you need well-formed XML and valid SQL content, so you should
make the following corrections:
As the opening tag is <properties>, the closing tag must also be
</properties>, not </property>.
version, description and type are attributes (the > that closes the
opening tag comes after them), so after properties there
should be a space, not >.
Remove , after version="2".
Remove , after name.
Remove ( before <properties and ) after </properties>.
To find the required rows, use str.contains as the filtering
expression.
Below you have an example program:
import pandas as pd
import re
df4 = pd.DataFrame({
    'Q' : 'file.sql',
    'R' : [
        '<s>dave</s><properties type="SqlCommand">select id, name '
        'from table where criteria=\'5\'</properties><s>dave</s>',
        '<s>dave</s><properties>select id, name from table '
        'where criteria=\'6\'</properties><s>dave</s>',
        '<s>mike</s><properties type="SqlCommand">drop table "Xyz"'
        '</properties><s>mike</s>' ]})
df5 = df4[df4.R.str.contains(
    '<properties[^<>]+?sqlcommand[^<>]+?>[^<>]+?where',
    flags=re.IGNORECASE)]
print(df5)
Note that the regex takes care of the proper sequence of
strings:
First match <properties.
Then a sequence of chars other than < and > ([^<>]+?).
so we are still within the just opened XML tag.
Then match sqlcommand (ignoring case).
Then another sequence of chars other than < and >
([^<>]+?).
Then >, closing the tag.
Then another sequence of chars other than < and >
([^<>]+?).
And finally where (also ignoring case).
An attempt to check for sqlcommand and where in two separate
regexes is wrong, as these words can be at other locations,
which do not meet your requirement.
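To make that last point concrete, here is a small sketch using only the re module (no pandas). The two sample strings and the helper names separate_checks and combined_check are made up for illustration; the combined pattern is the one from the program above.

```python
import re

# Combined pattern from the answer: sqlcommand inside the opening tag,
# then "where" inside that same element's text.
COMBINED = re.compile(
    r'<properties[^<>]+?sqlcommand[^<>]+?>[^<>]+?where', re.IGNORECASE)

def separate_checks(text):
    """Two independent searches: true if both words occur anywhere."""
    return (re.search('where', text, re.IGNORECASE) is not None
            and re.search('sqlcommand', text, re.IGNORECASE) is not None)

def combined_check(text):
    """Single search that also enforces where the two words sit."""
    return COMBINED.search(text) is not None

# Hypothetical rows: in the first, "where" lives in a different tag.
misleading = ('<s>where is dave</s>'
              '<properties type="SqlCommand">drop table "Xyz"</properties>')
wanted = ('<properties type="SqlCommand">select id from t '
          'where criteria=5</properties>')

print(separate_checks(misleading), combined_check(misleading))
print(separate_checks(wanted), combined_check(wanted))
```

The first row passes the two separate checks even though its SQL command has no where clause, which is exactly the false positive the single regex avoids.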
I have a grammar for parsing some log files using pyparsing but am running into an issue where only the first match is being returned. Is there a way to ensure that I get exhaustive matches? Here's some code:
from pyparsing import Literal, Optional, oneOf, OneOrMore, ParserElement, Regex, restOfLine, Suppress, ZeroOrMore
ParserElement.setDefaultWhitespaceChars(' ')
dt = Regex(r'''\d{2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) 20\d\d \d\d:\d\d:\d\d\,\d{3}''')
# TODO maybe add a parse action to make a datetime object out of the dt capture group
log_level = Suppress('[') + oneOf("INFO DEBUG ERROR WARN TRACE") + Suppress(']')
package_name = Regex(r'''(com|org|net)\.(\w+\.)+\w+''')
junk_data = Optional(Regex(r'\(.*?\)'))
guid = Regex('[A-Za-z0-9]{8}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{12}')
first_log_line = dt.setResultsName('datetime') + \
    log_level('log_level') + \
    guid('guid') + \
    junk_data('junk') + \
    package_name('package_name') + \
    Suppress(':') + \
    restOfLine('message') + \
    Suppress('\n')
additional_log_lines = Suppress('\t') + package_name + restOfLine
log_entry = (first_log_line + Optional(ZeroOrMore(additional_log_lines)))
log_batch = OneOrMore(log_entry)
In my mind, the last two lines are sort of equivalent to
log_entry := first_log_line | first_log_line additional_log_lines
additional_log_lines := additional_log_line | additional_log_line additional_log_lines
log_batch := log_entry | log_entry log_batch
Or something of the sort. Am I thinking about this wrong? I only see a single match with all of the expected tokens when I do print(log_batch.parseString(data).dump()).
Your scanString behavior is a strong clue. Suppose I wrote an expression to match one or more items, and erroneously defined my expression such that the second item in my list did not match. Then OneOrMore(expr) would fail, while expr.scanString would "succeed", in that it would give me more matches, but would still overlook the match I might have wanted, but just mis-parsed.
import pyparsing as pp
data = "AAA _AB BBB CCC"
expr = pp.Word(pp.alphas)
print(pp.OneOrMore(expr).parseString(data))
Gives:
['AAA']
At first glance, this looks like the OneOrMore is failing, whereas scanString shows more matches:
['AAA']
['AB'] <- really wanted '_AB' here
['BBB']
['CCC']
Here is a loop using scanString which prints not the matches, but the gaps between the matches, and where they start:
# loop to find non-matching parts in data
last_end = 0
for t,s,e in expr.scanString(data):
    gap = data[last_end:s]
    print(s, ':', repr(gap))
    last_end = e
Giving:
0 : ''
5 : ' _' <-- AHA!!
8 : ' '
12 : ' '
Here's another way to visualize this.
# print markers where each match begins in input string
markers = [' ']*len(data)
for t,s,e in expr.scanString(data):
    markers[s] = '^'
print(data)
print(''.join(markers))
Prints:
AAA _AB BBB CCC
^    ^  ^   ^
Your code would be a little more complex since your data spans many lines, but using pyparsing's line, lineno and col methods, you could do something similar.
So, there's a workaround that seems to do the trick. For whatever reason, scanString does iterate through them all appropriately, so I can very simply get my matches in a generator with:
matches = (m for m, _, _ in log_batch.scanString(data))
Still not sure why parseString isn't working exhaustively, though, and still a bit worried that I've misunderstood something about pyparsing, so more pointers are welcome here.
I'm writing a class RecurringInterval which - based on the dateutil.rrule object - represents a recurring interval in time. I have defined a custom, human-readable __str__ method for it and would like to also define a parse method which (similar to the rrulestr() function) parses the string back into an object.
Here is the parse method and some test cases to go with it:
import re
from dateutil.rrule import FREQNAMES
import pytest
class RecurringInterval(object):
    freq_fmt = "{freq}"
    start_fmt = "from {start}"
    end_fmt = "till {end}"
    byweekday_fmt = "by weekday {byweekday}"
    bymonth_fmt = "by month {bymonth}"
    @classmethod
    def match_pattern(cls, string):
        SPACES = r'\s*'
        freq_names = [freq.lower() for freq in FREQNAMES] + [freq.title() for freq in FREQNAMES] # The frequencies may be either lowercase or start with a capital letter
        FREQ_PATTERN = '(?P<freq>{})?'.format("|".join(freq_names))
        # Start and end are required (their regular expressions match 1 repetition)
        START_PATTERN = cls.start_fmt.format(start=SPACES + r'(?P<start>.+?)')
        END_PATTERN = cls.end_fmt.format(end=SPACES + r'(?P<end>.+?)')
        # The remaining tokens are optional (their regular expressions match 0 or 1 repetitions)
        BYWEEKDAY_PATTERN = cls.optional(cls.byweekday_fmt.format(byweekday=SPACES + r'(?P<byweekday>.+?)'))
        BYMONTH_PATTERN = cls.optional(cls.bymonth_fmt.format(bymonth=SPACES + r'(?P<bymonth>.+?)'))
        PATTERN = SPACES + FREQ_PATTERN \
            + SPACES + START_PATTERN \
            + SPACES + END_PATTERN \
            + SPACES + BYWEEKDAY_PATTERN \
            + SPACES + BYMONTH_PATTERN \
            + SPACES + "$" # The character '$' is needed to make the non-greedy regular expressions parse till the end of the string
        return re.match(PATTERN, string).groupdict()
    @staticmethod
    def optional(pattern):
        '''Encloses the given regular expression in an optional group (i.e., one that matches 0 or 1 repetitions of the original regular expression).'''
        return '({})?'.format(pattern)
'''Tests'''
def test_match_pattern_with_byweekday_and_bymonth():
    string = "Weekly from 2017-11-03 15:00:00 till 2017-11-03 16:00:00 by weekday Monday, Tuesday by month January, February"
    groups = RecurringInterval.match_pattern(string)
    assert groups['freq'] == "Weekly"
    assert groups['start'].strip() == "2017-11-03 15:00:00"
    assert groups['end'].strip() == "2017-11-03 16:00:00"
    assert groups['byweekday'].strip() == "Monday, Tuesday"
    assert groups['bymonth'].strip() == "January, February"
def test_match_pattern_with_bymonth_and_byweekday():
    string = "Weekly from 2017-11-03 15:00:00 till 2017-11-03 16:00:00 by month January, February by weekday Monday, Tuesday "
    groups = RecurringInterval.match_pattern(string)
    assert groups['freq'] == "Weekly"
    assert groups['start'].strip() == "2017-11-03 15:00:00"
    assert groups['end'].strip() == "2017-11-03 16:00:00"
    assert groups['byweekday'].strip() == "Monday, Tuesday"
    assert groups['bymonth'].strip() == "January, February"
if __name__ == "__main__":
    # pytest.main([__file__])
    pytest.main([__file__+"::test_match_pattern_with_byweekday_and_bymonth"]) # This passes
    # pytest.main([__file__+"::test_match_pattern_with_bymonth_and_byweekday"]) # This fails
Although the parser works if you specify the arguments in the 'right' order, it is 'inflexible' in that it doesn't allow the optional arguments to be given in arbitrary order. This is why the second test fails.
What would be a way to make the parser parse the 'optional' fields in any order, such that both tests pass? (I was thinking of making an iterator with all permutations of the regular expressions and trying re.match on each one, but this does not seem like an elegant solution).
At this point, your language is getting complex enough that it's time to ditch regular expressions and learn how to use a proper parsing library. I threw this together using pyparsing, and I've annotated it heavily to try and explain what's going on, but if anything's unclear do ask and I'll try to explain.
from pyparsing import Regex, oneOf, OneOrMore
# Boring old constants, I'm sure you know how to fill these out...
months = ['January', 'February']
weekdays = ['Monday', 'Tuesday']
frequencies = ['Daily', 'Weekly']
# A datetime expression is anything matching this regex. We could split it down
# even further to get day, month, year attributes in our results object if we felt
# like it
datetime_expr = Regex(r'(\d{4})-(\d\d?)-(\d\d?) (\d{2}):(\d{2}):(\d{2})')
# A from or till expression is the word "from" or "till" followed by any valid datetime
from_expr = 'from' + datetime_expr.setResultsName('from_')
till_expr = 'till' + datetime_expr.setResultsName('till')
# A range expression is a from expression followed by a till expression
range_expr = from_expr + till_expr
# A weekday is any old weekday
weekday_expr = oneOf(weekdays)
month_expr = oneOf(months)
frequency_expr = oneOf(frequencies)
# A by weekday expression is the words "by weekday" followed by one or more weekdays
by_weekday_expr = 'by weekday' + OneOrMore(weekday_expr).setResultsName('weekdays')
by_month_expr = 'by month' + OneOrMore(month_expr).setResultsName('months')
# A recurring interval, then, is a frequency, followed by a range, followed by
# a weekday and a month, in any order
recurring_interval = frequency_expr + range_expr + (by_weekday_expr & by_month_expr)
# Let's parse!
if __name__ == '__main__':
    res = recurring_interval.parseString('Daily from 1111-11-11 11:00:00 till 1111-11-11 12:00:00 by weekday Monday by month January February')
    # Note that setResultsName causes everything to get packed neatly into
    # attributes for us, so we can pluck all the bits and pieces out with no
    # difficulty at all
    print(res)
    print(res.from_)
    print(res.till)
    print(res.weekdays)
    print(res.months)
You have many options here, each with different downsides.
One approach would be to use a repeated alternation, like (by weekday|by month)*:
(?P<freq>Weekly)?\s+from (?P<start>.+?)\s+till (?P<end>.+?)(?:\s+by weekday (?P<byweekday>.+?)|\s+by month (?P<bymonth>.+?))*$
This will match strings of the form week month and month week, but also week week or month week month etc.
Another option would be to use lookaheads, like (?=.*by weekday)?(?=.*by month)?:
(?P<freq>Weekly)?\s+from (?P<start>.+?)\s+till (?P<end>.+?(?=$| by))(?=.*\s+by weekday (?P<byweekday>.+?(?=$| by))|)(?=.*\s+by month (?P<bymonth>.+?(?=$| by))|)
However, this requires a known delimiter (I used " by") to know how far to match. Also, it'll silently ignore any extra characters (meaning it'll match strings of the form by weekday [some garbage] by month).
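For reference, a minimal sketch of the first option (the repeated alternation) run against both orderings from the question. The wrapper name match_any_order and the stripping of whitespace are assumptions of this example; the pattern itself is the one shown above, with only "Weekly" accepted as a frequency.

```python
import re

# Repeated alternation: each "by ..." clause may appear in any order.
PATTERN = re.compile(
    r'(?P<freq>Weekly)?\s+from (?P<start>.+?)\s+till (?P<end>.+?)'
    r'(?:\s+by weekday (?P<byweekday>.+?)|\s+by month (?P<bymonth>.+?))*$'
)

def match_any_order(string):
    """Return the named groups with surrounding whitespace stripped, or None."""
    m = PATTERN.match(string)
    if m is None:
        return None
    return {name: value.strip() if value else value
            for name, value in m.groupdict().items()}

weekday_first = ("Weekly from 2017-11-03 15:00:00 till 2017-11-03 16:00:00 "
                 "by weekday Monday, Tuesday by month January, February")
month_first = ("Weekly from 2017-11-03 15:00:00 till 2017-11-03 16:00:00 "
               "by month January, February by weekday Monday, Tuesday ")

print(match_any_order(weekday_first))
print(match_any_order(month_first))
```

As noted above, this form also accepts repeated clauses (for example two "by month" parts), in which case the group keeps only the last one matched.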
I have a list of tweets, from which I have to choose tweets that have terms like "sale", "discount", or "offer". Also, I need to find tweets that advertise certain deals, like a discount, by recognizing things like "%", "Rs.", "$" amongst others. I have absolutely no idea about regular expressions and the documentation isn't getting me anywhere. Here is my code. It's rather lousy, but please excuse that
import pymongo
import re
import datetime
client = pymongo.MongoClient()
db = client.PWSocial
fourteen_days_ago = datetime.datetime.utcnow() - datetime.timedelta(days=14)
id_list = [57947109, 183093247, 89443197, 431336956]
ar1 = [" deal "," deals ", " offer "," offers " "discount", "promotion", " sale ", " inr", " rs", "%", "inr ", "rs ", " rs."]
def func(ac_id):
    mylist = []
    newlist = []
    tweets = list(db.tweets.find({'user_id' : ac_id, 'created_at': { '$gte': fourteen_days_ago }}))
    for item in tweets:
        data = item.get('text')
        data = data.lower()
        data = data.split()
        flag = 0
        if set(ar1).intersection(data):
            flag = 1
        abc = []
        for x in ar1:
            for y in data:
                if re.search(x,y):
                    abc.append(x)
                    flag = 1
                    break
        if flag == 1:
            mylist.append(item.get('id'))
            newlist.append(abc)
    print(mylist)
    print(newlist)
for i in id_list:
    func(i)
This code doesn't give me any correct results, and being a noob to regexes, I cannot figure out what's wrong with it. Can anyone suggest a better way to do this job? Any help is appreciated.
My first advice: learn regular expressions, they give you enormous power for text processing.
But to give you a working solution (and a starting point for further exploration), try this:
import re
re_offers = re.compile(r'''
    \b              # Word boundary
    (?:             # Non capturing parenthesis
        deals?      # Deal or deals
        |           # or ...
        offers?     # Offer or offers
        |
        discount
        |
        promotion
        |
        sale
        |
        rs\.?       # rs or rs.
        |
        inr\d+      # INR then digits
        |
        \d+inr      # Digits then INR
    )               # End group
    \b              # Word boundary
    |               # or ...
    \b\d+%          # Digits (1 or more) then percent
    |
    \$\d+\b         # Dollar then digits (didn't care of thousand separator yet)
    ''',
    re.I|re.X)      # Ignore case, verbose format - for you :)
abc = re_offers.findall("e misio $1 is inr123 discount 1INR a 1% and deal")
print(abc)
You don't need to use a regular expression for this, you can use any:
if any(term in tweet for term in search_terms):
In your array of things to search for you don't have a comma between " offers " and "discount" which is causing them to be joined together.
Also when you use split you are getting rid of the whitespace in your input text. "I have a deal" will become ["I","have","a","deal"] but your search terms almost all contain whitespace. So remove the spaces from your search terms in array ar1.
However, you might want to avoid using regular expressions and just use in instead (you will still need the changes I suggest above though):
if x in y:
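Putting those suggestions together, a minimal sketch might look like the following. The list search_terms is the question's ar1 with the missing comma added and the padding spaces removed; the function names are hypothetical, and the symbol tokens such as "%" are left out because they are usually attached to digits rather than standing alone as words.

```python
# Cleaned-up version of ar1: comma restored, surrounding spaces removed.
search_terms = ["deal", "deals", "offer", "offers", "discount",
                "promotion", "sale", "inr", "rs", "rs."]

def matched_terms(tweet_text):
    """Return the search terms that appear as whole words in the tweet."""
    words = tweet_text.lower().split()
    return [term for term in search_terms if term in words]

def is_promotional(tweet_text):
    """True if at least one search term appears as a whole word."""
    words = tweet_text.lower().split()
    return any(term in words for term in search_terms)

print(matched_terms("Big SALE and a new offer"))
print(is_promotional("Wholesale news"))
```

Because the comparison is against the list of words, "wholesale" does not count as "sale". Words with trailing punctuation ("sale!") are not caught either; that is where a regular expression with word boundaries becomes useful.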
You might want to consider starting with find instead of a regex. You don't have complex expressions, and since you're handling a line of text you don't need to call split; just use find:
for token in ar1:
    if data.find(token) != -1:
        abc.append(data)
Your for item in tweets loop becomes:
for item in tweets:
    data = item.get('text')
    data = data.lower()
    for x in ar1:
        if data.find(x) != -1:
            newlist.append(data)
            mylist.append(item.get('id'))
            break
Re: your comment on jonsharpe's post, to avoid including substrings, surround your tokens by spaces, e.g. " rs ", " INR "
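One catch with space-padded tokens is that they miss a word at the very start or end of the tweet. A small sketch of one way around that, padding the text as well; the helper name contains_token is made up for this example.

```python
def contains_token(text, token):
    """Return True if token occurs in text as a whole space-delimited word."""
    # Normalise whitespace and pad both ends so edge words are matched too.
    padded_text = " " + " ".join(text.lower().split()) + " "
    padded_token = " " + token.lower().strip() + " "
    return padded_token in padded_text

print(contains_token("Rs 500 off today", "rs"))
print(contains_token("great offers here", "offer"))
```

This still treats punctuation as part of a word, so "rs." and "rs" are different tokens here.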
I've got a DB chock full o' phone numbers as strings, they're all formatted like 1112223333, I'd like to display it as 111-222-3333 in my django template
I know I can do
n = contacts.objects.get(name=name)
n.phone = n.phone[:3] + '-' + n.phone[3:6] + '-' + n.phone[6:]
but is there a better / more pythonic way?
It may be overkill for your use case if all your numbers are formatted the same way, but you might consider using the phonenumbers module. It would allow you to add functionality (e.g. international phone numbers, different formatting, etc) very easily.
You can parse your numbers like this:
>>> import phonenumbers
>>> parsed_number = phonenumbers.parse('1112223333', 'US')
>>> parsed_number
PhoneNumber(country_code=1, national_number=1112223333L, extension=None, italian_leading_zero=False, country_code_source=None, preferred_domestic_carrier_code=None)
Then, to format it the way you want, you could do this:
>>> phonenumbers.format_number(parsed_number, phonenumbers.PhoneNumber())
u'111-222-3333'
Note that you could easily use other formats:
>>> phonenumbers.format_number(parsed_number, phonenumbers.PhoneNumberFormat.NATIONAL)
u'(111) 222-3333'
>>> phonenumbers.format_number(parsed_number, phonenumbers.PhoneNumberFormat.INTERNATIONAL)
u'+1 111-222-3333'
>>> phonenumbers.format_number(parsed_number, phonenumbers.PhoneNumberFormat.E164)
u'+11112223333'
Just one other solution:
n.phone = "%c%c%c-%c%c%c-%c%c%c%c" % tuple(map(ord, n.phone))
or
n.phone = "%s%s%s-%s%s%s-%s%s%s%s" % tuple(n.phone)
This is quite a bit belated, but I figured I'd post my solution anyway. It's super simple and takes advantage of creating your own template tags (for use throughout your project). The other part of this is using parentheses around the area code.
from django import template
register = template.Library()
def phonenumber(value):
    phone = '(%s) %s - %s' %(value[0:3],value[3:6],value[6:10])
    return phone
register.filter('phonenumber', phonenumber)
For the rest of your project, all you need to do is {{ var|phonenumber }}
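The filter above assumes that value is always a 10-character string of digits. If that is not guaranteed, the body could be made a little more defensive. The sketch below is plain Python without the Django registration, and the name format_us_phone is hypothetical; it returns anything that is not exactly ten digits unchanged.

```python
def format_us_phone(value):
    """Format ten digits as '(111) 222 - 3333'; return other input unchanged."""
    digits = ''.join(ch for ch in str(value) if ch.isdigit())
    if len(digits) != 10:
        # Not a 10-digit number: leave it alone rather than guess.
        return value
    return '(%s) %s - %s' % (digits[0:3], digits[3:6], digits[6:10])

print(format_us_phone('1112223333'))
print(format_us_phone('12345'))
```

It could be registered with register.filter in the same way as the phonenumber function above.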
Since we're speaking Pythonic :), it's a good habit to always use join instead of addition (+) to join strings:
phone = n.phone
n.phone = '-'.join((phone[:3],phone[3:6],phone[6:]))
def formatPhone(phone):
    formatted = ''
    i = 0
    # clean phone. skip not digits
    phone = ''.join(x for x in phone if x.isdigit())
    # set pattern
    if len(phone) > 10:
        pattern = 'X (XXX) XXX-XX-XX'
    else:
        pattern = 'XXX-XXX-XX-XX'
    # reverse
    phone = phone[::-1]
    pattern = pattern[::-1]
    # scan pattern
    for p in pattern:
        if i >= len(phone):
            break
        # skip non X
        if p != 'X':
            formatted += p
            continue
        # add phone digit
        formatted += phone[i]
        i += 1
    # reverse again
    formatted = formatted[::-1]
    return formatted
print(formatPhone('+7-111-222-33-44'))
7 (111) 222-33-44
print(formatPhone('222-33-44'))
222-33-44
print(formatPhone('23344'))
2-33-44