Xpath extract dates between certain characters AND use as dates

UPDATE: Regarding my second question (how to convert the string to a date format in MySQL), I found a way and want to share it:
1) Save the "string date" data as VARCHAR (don't use TEXT).
2) When reading the MySQL data from PHP or elsewhere, use the function STR_TO_DATE(string_date_column, date_format), as in the following example:
$sql = "SELECT * FROM yourtablename ORDER BY STR_TO_DATE(string_date_column, '%d %M %Y')";
I am using Scrapy to collect data and write it to a database. On the website, the post date of each item is listed as follows:
<p> #This is the last <p> within each <div>
<br>
[15 May 2015, #9789]
<br>
</p>
So the date always comes after a "[" and before a ",". I used the following XPath expression to extract it:
sel.xpath("p[last()]/text()[contains(., '[')]").extract()
But this returns the whole line:
[15 May 2015, #9789]
So, how can I get only the "15 May 2015" part? And if that can be done, how can I convert the scraped string (15 May 2015) into real DATE data, so it can be used for sorting? Thanks a bunch!

Regarding the first question, assuming that there is at most one date per text node, you can combine the XPath functions substring-after() and substring-before() to get the "15 May 2015" part of the text node:
substring-before(substring-after(p[last()]/text()[contains(., '[')], '['), ',')
Regarding the second question, you can use datetime.strptime() to convert the string to a datetime object:
import datetime
result = datetime.datetime.strptime("15 May 2015", "%d %b %Y")
print(result)
print(type(result))
Output:
2015-05-15 00:00:00
<type 'datetime.datetime'>
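If the XPath functions are not convenient, the same "after the bracket, before the comma" logic can be done on the extracted string in Python. The following is only a sketch: the helper name parse_post_date and the second and third sample strings are made up for illustration, and %b assumes English month abbreviations (the default C locale).

```python
from datetime import datetime

def parse_post_date(raw):
    """Return a date from a text node such as '[15 May 2015, #9789]'."""
    # Roughly mirrors substring-before(substring-after(raw, '['), ',')
    _, _, after_bracket = raw.partition("[")
    date_text, _, _ = after_bracket.partition(",")
    # strptime raises ValueError if the text is not in 'DD Mon YYYY' form
    return datetime.strptime(date_text.strip(), "%d %b %Y").date()

# Only the first string comes from the question; the others are illustrative.
scraped = ["[15 May 2015, #9789]", "[02 Jan 2014, #1200]", "[30 Nov 2015, #9999]"]

# Real date objects compare chronologically, so they can be used as a sort key.
ordered = sorted(scraped, key=parse_post_date)
print(ordered)
```

Sorting on the parsed date rather than on the raw string is what makes the order chronological instead of alphabetical.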

A more "scrapic" approach would involve using the built-in regular expression support in the XPath expressions and/or .re().
This is with both applied:
In [1]: response.xpath("p[last()]/text()[re:test(., '\[\d+ \w+ \d{4}\, #\d+\]')]").re(r"\d+ \w+ \d{4}")
Out[1]: [u'15 May 2015']
Or you can locate the element as you did before and use .re() only to extract the date:
In [2]: response.xpath("p[last()]/text()[contains(., '[')]").re(r"\d+ \w+ \d{4}")
Out[2]: [u'15 May 2015']

Related

Regex for date of birth with maximum age

I am looking for a regular expression in Python which matches a date of birth in the given format: dd.mm.YYYY
For example:
31.12.1999 (31st of December)
02.07.2021
I have provided more examples in the demo.
Dates older than 01.01.1920 should not match!

Try:
^(?:0[1-9]|[12]\d|3[01])\.(?:0[1-9]|1[0-2])\.(?:19[2-9]\d|[2-9]\d{3})$
See Regex Demo
Be aware that this will not reject dates like 31.02.2021: the pattern does not know how many days each month has. Encoding that in a regex is impractical, mainly because the length of February depends on whether the year is a leap year.
This will also allow future dates such as 01.01.3099 (you do want this to keep working in the future, no?).
Update
You really should use the datetime class from the datetime module and, if you want to insist that the day and month fields contain two digits, a regex just to check the format:
import re
from datetime import datetime, date
validated = False  # assume not validated
s = '31.03.2019'
m = re.fullmatch(r'\d{2}\.\d{2}\.\d{4}', s)
if m:
    # we have ensured the correct number of digits:
    try:
        d = datetime.strptime(s, '%d.%m.%Y').date()
        if d >= date(1920, 1, 1):
            validated = True
    except ValueError:
        pass
print(validated)
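The same check can be wrapped in a function so it is easy to try against several inputs. This is a minimal sketch; the name is_valid_birth_date and the earliest parameter are illustrative choices, not part of the answer above.

```python
import re
from datetime import date, datetime

def is_valid_birth_date(text, earliest=date(1920, 1, 1)):
    """True if text is a real calendar date in dd.mm.YYYY form, not before `earliest`."""
    # The regex only enforces the shape: two digits, two digits, four digits.
    if re.fullmatch(r"[0-9]{2}\.[0-9]{2}\.[0-9]{4}", text) is None:
        return False
    try:
        # strptime rejects impossible dates such as 31.02.2021
        parsed = datetime.strptime(text, "%d.%m.%Y").date()
    except ValueError:
        return False
    return parsed >= earliest

for sample in ("31.12.1999", "02.07.2021", "31.12.1919", "31.02.2021", "1.1.2000"):
    print(sample, is_valid_birth_date(sample))
```

Leaving the calendar rules to datetime means leap years are handled without any extra logic in the pattern.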

As I said, it can be done with a very convoluted regex. However, I do not actually recommend using it; I just had fun writing it as a challenge. In practice you should use a very permissive regex and validate the ranges in code.
Demo.
# Easy dates, those <= 28th, valid for all months/years.
(0[1-9]|1[0-9]|2[0-8])\.(0[1-9]|1[0-2])\.(19[2-9][0-9]|2[0-9][0-9][0-9])
|
# Validate the 29th of February for 1920-1999.
29\.02\.19([3579][26]|[2468][048])
|
# Validate the 29th of February for 2000-2999.
29\.02\.((2[0-9])(0[48]|[13579][26]|[2468][048])|2000|2400|2800)
|
# Validate 29th and 30th.
(29|30)\.(01|0[3-9]|1[0-2])\.(19[2-9][0-9]|2[0-9][0-9][0-9])
|
# Validate 31st.
31\.(01|03|05|07|08|10|12)\.(19[2-9][0-9]|2[0-9][0-9][0-9])
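Because the pattern above is written with comments and line breaks, it has to be compiled with re.VERBOSE to be usable from Python. The sketch below assumes that is how it is meant to be run; the groups were made non-capturing and the name matches_birth_date is made up for illustration.

```python
import re

YEAR = r"(?:19[2-9][0-9]|2[0-9][0-9][0-9])"

BIRTH_DATE_RE = re.compile(rf"""
    (?:
        # Easy dates, those <= 28th, valid for all months/years.
        (?:0[1-9]|1[0-9]|2[0-8])\.(?:0[1-9]|1[0-2])\.{YEAR}
      |
        # 29th of February for 1920-1999.
        29\.02\.19(?:[3579][26]|[2468][048])
      |
        # 29th of February for 2000-2999.
        29\.02\.(?:2[0-9](?:0[48]|[13579][26]|[2468][048])|2000|2400|2800)
      |
        # 29th and 30th of every month except February.
        (?:29|30)\.(?:01|0[3-9]|1[0-2])\.{YEAR}
      |
        # 31st of the months that have one.
        31\.(?:01|03|05|07|08|10|12)\.{YEAR}
    )
""", re.VERBOSE)

def matches_birth_date(text):
    # fullmatch anchors the pattern at both ends of the string
    return BIRTH_DATE_RE.fullmatch(text) is not None

for sample in ("29.02.2020", "29.02.2100", "31.04.2021", "31.12.1919"):
    print(sample, matches_birth_date(sample))
```

In verbose mode whitespace and # comments inside the pattern are ignored, which is what lets the alternatives stay on separate, commented lines.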

\d{2}\.\d{2}\.\d{4}
Validating the value of the dates should be done at the application level.

Find values using regex (includes brackets)

It's my first time with regex and I have some issues that I hope you can help me with. Here is an example of the data:
chartData.push({
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
2007,
10,
1 );
What I want to retrieve is the date, which is in the last pair of brackets, and the corresponding description. I have no idea how to do it with one regex, so I decided to split it into two.
First part:
I retrieve the value after description:. I managed this with the following pattern: [\n\r].*description:\s*([^\n\r]*) The output gives me the result with quotes ("9710"), but that is fine and needs no changes.
Second part:
Here it gets tricky. I want to retrieve the values in brackets after the text newDate.setFullYear. Unfortunately, so far I have only managed to get the values inside any pair of brackets. For that, I used the following pattern: \(([^)]*)\) The result is that it picks up all 3 brackets in the example:
"{
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
}",
"()",
"2007,
10,
1 "
What I am missing is an AND operator for regex which would allow me to construct a pattern that retrieves the data in brackets after a specific text.
I could, of course, pick every 3rd result, but unfortunately that doesn't work for the whole dataset.
Does anyone know how to resolve the second part?
Thanks in advance.

You can use the following expression:
res = re.search(r'description: "([^"]+)".*newDate.setFullYear\((.*)\);', text, re.DOTALL)
This will return a regex match object with two groups, that you can fetch using:
res.groups()
The result is then:
('9710', '\n2007,\n10,\n1 ')
You can of course parse these groups in any way you want. For example:
date = res.groups()[1]
[s.strip() for s in date.split(",")]
==>
['2007', '10', '1']
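If the goal is a real date rather than three strings, the captured arguments can be converted directly. The sketch below assumes the arguments follow JavaScript's setFullYear(year, month, day), where the month is zero-based, so 10 means November. It also uses a lazy .*? and an escaped dot, which is a slight variation on the pattern above; the variable names are illustrative.

```python
import re
from datetime import date

text = '''chartData.push({
    date: newDate,
    visits: 9710,
    color: "#016b92",
    description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
    2007,
    10,
    1 );'''

match = re.search(
    r'description: "([^"]+)".*?newDate\.setFullYear\(([^()]*)\);',
    text,
    re.DOTALL,
)
description = match.group(1)

# int() ignores the surrounding whitespace and newlines in each argument.
year, js_month, day = (int(part) for part in match.group(2).split(","))

# JavaScript months run 0-11, Python's date() expects 1-12.
point_date = date(year, js_month + 1, day)
print(description, point_date)
```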

import re
test = r"""
chartData.push({
date: 'newDate',
visits: 9710,
color: "#016b92",
description: "9710"
})
var newDate = new Date()
newDate.setFullYear(
2007,
10,
1);"""
m = re.search(r".*newDate\.setFullYear(\(\n.*\n.*\n.*\));", test, re.DOTALL)
print(m.group(1).rstrip("\n").replace("\n", "").replace(" ", ""))
The result:
(2007,10,1)

The AND part that you are referring to is not really an operator. The pattern matches characters from left to right, so after capturing the value in group 1 you could match everything that comes before the values you want to capture in group 2.
What you could do is repeatedly match all following lines that do not start with newDate.setFullYear(
Then, when you do encounter that text, match it and capture in group 2 all characters except parentheses.
\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);
Regex demo | Python demo
Example code
import re
regex = r"\r?\ndescription: \"([^\"]+)\"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);"
test_str = ("chartData.push({\n"
"date: newDate,\n"
"visits: 9710,\n"
"color: \"#016b92\",\n"
"description: \"9710\"\n"
"});\n"
"var newDate = new Date();\n"
"newDate.setFullYear(\n"
"2007,\n"
"10,\n"
"1 );")
print (re.findall(regex, test_str))
Output
[('9710', '\n2007,\n10,\n1 ')]
There is another option that gets group 1 and the separate digits in group 2, using the regex module from PyPI (it supports the \G anchor):
(?:\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(|\G)\r?\n(\d+),?(?=[^()]*\);)
Regex demo

retrieving a date from a string

I'm trying to retrieve a date from a string. The problem is that the pattern of this date varies a lot (the string comes from an OCR reading). These are the patterns I need to identify:
11/11/1111 (i can get this one already)
11-11-1111 (i can get this one already)
11 11 1111 (i can get this one already)
11- 11- 1111
11 11 1111
11-11 1111
23- 10-17
9 06- 17
So far, the regex I have is a slight adaptation of a Stack Overflow answer (it now also allows spaces, instead of just - or /, between the numbers):
match_date=re.search(r'(?:(?:31(\/|-|\.| )(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.| )(?:0?[1,3-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.| )(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})',line)
Is there a way of building a regex for such a "fluid" date structure?

Regex: \b(?:\d{1,2}[- /]\s?){2}(?:\d{4}|\d{2})\b or ^(?:\d{1,2}[- /]\s?){2}(?:\d{4}|\d{2})$
Regex demo

You could go for
\b\d{1,2}[- /]+\d{1,2}[- /]+\d{2,4}\b
See a demo on regex101.com.
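To go from a match to an actual date, the three numbers can be captured and passed to date(). This is only a sketch built on the pattern above with capture groups added; it assumes day-first order and that two-digit years belong to the 2000s, neither of which is stated in the question. The name find_date is made up.

```python
import re
from datetime import date

# Same idea as the pattern above, with groups; the year is 4 or 2 digits.
DATE_RE = re.compile(r"\b(\d{1,2})[- /]+(\d{1,2})[- /]+(\d{4}|\d{2})\b")

def find_date(line):
    """Return the first plausible date in an OCR line, or None."""
    match = DATE_RE.search(line)
    if match is None:
        return None
    day, month, year = (int(group) for group in match.groups())
    if year < 100:
        year += 2000  # assumption: '17' means 2017
    try:
        return date(year, month, day)
    except ValueError:
        # Looks like a date but is not one, e.g. 31 02 2020
        return None

for sample in ("11/11/1111", "11- 11- 1111", "11-11 1111", "23- 10-17", "9 06- 17"):
    print(repr(sample), find_date(sample))
```

Letting date() reject impossible values keeps the pattern short, which matters when the separators themselves are unreliable.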

I know a regex is arguably the better answer, because one line can match all the possibilities, but I prefer converting to datetime:
from datetime import datetime
string = "11- 11- 1111"
for fmt in ('%Y-%m-%d', '%d- %m- %Y', '%d %m %Y', '%d- %m- %y'):
    try:
        # try each candidate format in turn rather than a hard-coded one
        datetime_object = datetime.strptime(string, fmt)
        ...
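One way the format loop could be written out in full is shown below. It is a sketch, not the answer's original code: the list of formats is an assumption derived from the patterns in the question, and parse_ocr_date is a made-up name.

```python
from datetime import datetime

# Assumed formats, one per separator layout listed in the question.
CANDIDATE_FORMATS = (
    "%d/%m/%Y",
    "%d-%m-%Y",
    "%d %m %Y",
    "%d- %m- %Y",
    "%d-%m %Y",
    "%d- %m-%y",
    "%d %m- %y",
)

def parse_ocr_date(text, formats=CANDIDATE_FORMATS):
    """Try each format in order; return a datetime or None if none fits."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            # This format did not match, move on to the next one.
            continue
    return None

for sample in ("11/11/1111", "11- 11- 1111", "23- 10-17", "9 06- 17", "not a date"):
    print(repr(sample), parse_ocr_date(sample))
```

Unlike the regex approach, this needs the whole string to be the date, so it fits best after the date has already been isolated from the surrounding OCR text.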

How to Extract Date From String Python

I have a string like
"Originally Posted on 09 May, 2016. By query 3 j...."
How can I extract the date using Python?
I tried this code:
dStr = "Originally Posted on 09 May, 2016. By query 3 j...."
date_st = re.findall("(\d+\ \w+,)", dStr)
Printing date_st, I get:
['09 May,']
What should I do to get the year?

You forgot to match the year after the ','. You only need to append ' \d+\.' to the pattern:
re.findall(r"(\d+\ \w+, \d+\.)", dStr)
This gives:
['09 May, 2016.']

You were almost there. Just add 4 digits for the year after the ','. It is better to use [a-z]+ instead of \w+ to match the month names, as \w also matches _ and 0-9 in addition to letters.
re.findall(r'\d+\s[a-z]+,\s\d{4}',s,re.I)
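Once the text is extracted it can be turned into a date object with strptime. A minimal sketch, assuming English month names (%B depends on the locale) and using a made-up helper name:

```python
import re
from datetime import datetime

def extract_posted_date(text):
    """Return the first 'DD Month, YYYY' date in text as a date, or None."""
    match = re.search(r"\d{1,2}\s[a-z]+,\s\d{4}", text, re.I)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(0), "%d %B, %Y").date()
    except ValueError:
        # The word before the comma was not a month name.
        return None

dStr = "Originally Posted on 09 May, 2016. By query 3 j...."
print(extract_posted_date(dStr))
```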

Find and replace logic in Python

In Python I need logic for the scenario below; I am currently using the split function for this.
I have strings which contain input as shown below.
"ID674021384 25/01/1986 heloo hi thanks 5 minutes and 25-01-1988."
"ID909900000 25-01-1986 hello 10 minutes."
The output should be as shown below, with each date replaced by "date" and each time by "time".
"ID674021384 date hello hi thanks time date."
"ID909900000 date hello time."
I also need a count of dates and times for each ID, as shown below:
ID674021384 DATE:2 TIME:1
ID909900000 DATE:1 TIME:1

>>> import re
>>> from collections import defaultdict
>>> lines = ["ID674021384 25/01/1986 heloo hi thanks 5 minutes and 25-01-1988.", "ID909900000 25-01-1986 hello 10 minutes."]
>>> pattern = r'(?P<date>\d{1,2}[/-]\d{1,2}[/-]\d{4})|(?P<time>\d+ minutes)'
>>> num_occurences = {line: defaultdict(int) for line in lines}
>>> def repl(matchobj):
...     # count the match under its group name ("date" or "time") and substitute that name
...     num_occurences[matchobj.string][matchobj.lastgroup] += 1
...     return matchobj.lastgroup
...
>>> for line in lines:
...     text_id = line.split(' ')[0]
...     new_text = re.sub(pattern, repl, line)
...     print(new_text)
...     print('{0} DATE:{1[date]} Time:{1[time]}'.format(text_id, num_occurences[line]))
...     print('')
...
ID674021384 date heloo hi thanks time and date.
ID674021384 DATE:2 Time:1
ID909900000 date hello time.
ID909900000 DATE:1 Time:1

For parsing similar lines of text, like log files, I often use regular expressions from the re module. split() would also work well for separating fields which don't contain spaces and for the parts of the date, but regular expressions also let you make sure the format matches what you expect and, if need be, warn you about a weird-looking input line.
Using regular expressions, you could get the individual fields of the date and time and construct date or datetime objects from them (both from the datetime module). Once you have those objects, you can compare them to other similar objects and write new entries, formatting the dates as you like. I would recommend parsing the whole input file (assuming you're reading a file) and writing a whole new output file instead of trying to alter it in place.
As for keeping track of the date and time counts, when your input isn't too large, a dictionary is normally the easiest way to do it. When you encounter a line with a certain ID, find the entry corresponding to this ID in your dictionary, or add a new one if there is none. This entry could itself be a dictionary using dates and times as keys, whose values are the counts of each one encountered.
I hope this answer will guide you on the way to a solution even though it contains no code.
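The approach described above could be sketched roughly as follows. The names TOKEN_RE and mask_lines are made up, and counting per ID in a Counter is one possible reading of the dictionary idea, not the answerer's own code.

```python
import re
from collections import Counter, defaultdict

# One alternative per kind of token; the group name doubles as the replacement.
TOKEN_RE = re.compile(
    r"(?P<date>\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b)|(?P<time>\b\d+ minutes\b)"
)

def mask_lines(lines):
    """Replace dates/times by 'date'/'time' and count them per leading ID."""
    counts = defaultdict(Counter)
    masked = []
    for line in lines:
        record_id = line.split(" ", 1)[0]

        def repl(match, record_id=record_id):
            # lastgroup is 'date' or 'time', whichever alternative matched
            counts[record_id][match.lastgroup] += 1
            return match.lastgroup

        masked.append(TOKEN_RE.sub(repl, line))
    return masked, counts

lines = [
    "ID674021384 25/01/1986 heloo hi thanks 5 minutes and 25-01-1988.",
    "ID909900000 25-01-1986 hello 10 minutes.",
]
masked, counts = mask_lines(lines)
for text in masked:
    print(text)
for record_id, counter in counts.items():
    print(f"{record_id} DATE:{counter['date']} TIME:{counter['time']}")
```

Doing the replacement and the counting in the same re.sub callback avoids scanning each line twice.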

You could use a couple of regular expressions:
import re
txt = 'ID674021384 25/01/1986 heloo hi thanks 5 minutes and 25-01-1988.'
retime = re.compile(r'([0-9]+) *minutes')
redate = re.compile(r'([0-9]+[/-][0-9]+[/-][0-9]{4})')
# find all dates in 'txt'
dates = redate.findall(txt)
print(dates)
# find all times in 'txt'
times = retime.findall(txt)
print(times)
# replace dates and times in the original string
# (str.replace swaps every occurrence of the text, so a time of '5'
# would also change any other '5' left in the string)
newtxt = txt
for adate in dates:
    newtxt = newtxt.replace(adate, 'date')
for atime in times:
    newtxt = newtxt.replace(atime, 'time')
The output looks like this:
Original string:
ID674021384 25/01/1986 heloo hi thanks 5 minutes and 25-01-1988.
Found dates:['25/01/1986', '25-01-1988']
Found times: ['5']
New string:
ID674021384 date heloo hi thanks time minutes and date.
Dates and times found:
ID674021384 DATE:2 TIME:1
Chris
