I need to profile some data in a bucket, and have come across a bit of a dilemma.
This is the type of line in each file:
"2018-09-08 10:34:49 10.0 MiB path/of/a/directory"
What's required is to capture everything after the date and time (`10.0 MiB path/of/a/directory` in the example), keeping in mind that some of the separators are tabs and others are spaces.
To rephrase: I need everything from the point where the date and time end, excluding the tab or space that separates them from the rest.
I tried something like this:
p = re.compile(r'^[\d\d\d\d.\d\d.\d\d\s\d\d:\d\d:\d\d].*')
for line in lines:
    print(p.findall(line))
How do I solve this problem?
EDIT:
What if I also wanted to create new groups out of the newly matched string? Say I wanted to reformat it to:
10MiB engagementName/folder/file/something.xlsx engagementName extensionType something.xlsx
RE-EDIT:
The path/to/directory generally points to a file (and all files have extensions). Building on the reformatted string you've been helping me with, is there a way to keep extending the regex pattern so I can "create" a new group by filtering on the file extension type (I suppose by searching the end of the string for something along the lines of .anything) and add that result into the formatted string?
Don't bother with a regular expression. You know the format of the line. Just split it:
from datetime import datetime

for l in lines:
    line_date, line_time, rest_of_line = l.split(maxsplit=2)
    print([line_date, line_time, rest_of_line])
    # ['2018-09-08', '10:34:49', '10.0 MiB path/of/a/directory']
Take special note of the maxsplit argument. It prevents the size or the path from being split. We can use it because we know the timestamp has exactly one space in the middle and one space after it.
If the size will always have one space in the middle and one space following it, we can increase it to 4 splits to separate the size, too:
for l in lines:
    line_date, line_time, size_quantity, size_units, line_path = l.split(maxsplit=4)
    print([line_date, line_time, size_quantity, size_units, line_path])
    # ['2018-09-08', '10:34:49', '10.0', 'MiB', 'path/of/a/directory']
Note that extra contiguous spaces and spaces in the path don't break it:
l = "2018-09-08 10:34:49 10.0 MiB path/of/a/direct ory"
line_date, line_time, size_quantity, size_units, line_path = l.split(maxsplit=4)
print([line_date, line_time, size_quantity, size_units, line_path])
# ['2018-09-08', '10:34:49', '10.0', 'MiB', 'path/of/a/direct ory']
You can concatenate parts back together if needed:
line_size = size_quantity + ' ' + size_units
If you want the timestamp for something, you can parse it:
# 'T' could be anything, but 'T' is standard for the ISO 8601 format
timestamp = datetime.strptime(line_date + 'T' + line_time, '%Y-%m-%dT%H:%M:%S')
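Putting the pieces above together, a minimal line parser might look like this (just a sketch; `parse_listing_line` is an illustrative name, not something from the question):

```python
from datetime import datetime

def parse_listing_line(line):
    # Timestamp, size quantity, size units, then the path (which may contain spaces).
    line_date, line_time, size_quantity, size_units, line_path = line.split(maxsplit=4)
    timestamp = datetime.strptime(line_date + 'T' + line_time, '%Y-%m-%dT%H:%M:%S')
    return timestamp, size_quantity + ' ' + size_units, line_path

ts, size, path = parse_listing_line("2018-09-08 10:34:49 10.0 MiB path/of/a/directory")
```

Because split() with no separator treats runs of spaces and tabs alike, the mixed separators from the question are handled too.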
You might not need an expression to do this; a string split would suffice. However, if you wish to use one, you don't need to anchor your expression at the very beginning. You can simply use this expression:
(:[0-9]+\s+)(.*)$
You can even slightly modify it to this expression which is just a bit faster:
:([0-9]+\s+)(.*)$
Example Test:
# -*- coding: UTF-8 -*-
import re

string = "2018-09-08 10:34:49 10.0 MiB path/of/a/directory"
expression = r'(:[0-9]+\s+)(.*)$'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(2) + "\" is a match 💚 ")
else:
    print('🙀 Sorry! No matches! Something is not right! Call 911 👮')
Output
YAAAY! "10.0 MiB path/of/a/directory" is a match 💚
JavaScript Performance Benchmark
This snippet is a JavaScript performance test that repeats the match 10 million times on your input string:
var repeat = 10000000;
var start = Date.now();
for (var i = repeat; i >= 0; i--) {
    var string = "2018-09-08 10:34:49 10.0 MiB path/of/a/directory";
    var regex = /(.*)(:[0-9]+\s+)(.*)/g;
    var match = string.replace(regex, "$3");
}
var end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");
Edit:
You might want to capture only the end of the timestamp: with fewer boundaries the expression stays simple and fast, and if unexpected variants show up, it still works:
2019/12/15 10:00:00 **desired output**
2019-12-15 10:00:00 **desired output**
2019-12-15, 10:00:00 **desired output**
2019-12 15 10:00:00 **desired output**
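A quick check (sketch) that the colon-anchored pattern copes with all four variants above:

```python
import re

pattern = re.compile(r':([0-9]+\s+)(.*)$')
samples = [
    "2019/12/15 10:00:00 desired output",
    "2019-12-15 10:00:00 desired output",
    "2019-12-15, 10:00:00 desired output",
    "2019-12 15 10:00:00 desired output",
]
# group(2) is everything after the final ":SS " of the timestamp
results = [pattern.search(s).group(2) for s in samples]
```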
Related
I have a list of titles with combined dates and descriptions, but I have to reduce this to just a list of dates. Some examples of these titles are stuff like this:
1/16 Stories of Time
5/18 Cock'a'doodle'do
However, some people are really bad at typing and have forgotten the spaces between the dates and the rest of the title. I need to remove everything except the numbers and the slash between them. Using any method, but preferably regex, is there a simple way to do this? For the record, I do understand how to split and recompile the list for any method that works on a single string.
You're thinking about this backwards. If you want to extract the date at the start of a line, do that instead of trying to get rid of everything else.
You can use a regex like this: ^\d{1,2}/\d{1,2} which means:
^ start of line
\d digit
{1,2} repeated one or two times
For example:
import re

lines = [
    '1/16 Stories of Time',
    "5/18 Cock'a'doodle'do",
    '6/22Bible']

for line in lines:
    match = re.match(r'^\d{1,2}/\d{1,2}', line)
    if match:
        print(match.group(0))
Output:
1/16
5/18
6/22
(Note that re.match always starts matching from the start of the string, so the ^ is redundant here.)
This is more rigorous against titles containing numbers and slashes, like say, 4/5 The 39 Steps / The Thirty-Nine Steps -> 4/5.
However, you'll have a problem if someone forgot the space in a title that starts with a number, like say, 7/8100 Years of Solitude -> 7/81.
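A quick demonstration of that failure mode, using the hypothetical title above:

```python
import re

# The missing space makes the first two digits of "8100" look like part of the date
match = re.match(r'\d{1,2}/\d{1,2}', '7/8100 Years of Solitude')
print(match.group(0))
```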
You can import string to get easy access to a string of all digits, add the slash to it, and then compare your date string against that set, dropping any character that's not in it (building a separate allowed set avoids mutating the string module's attribute):

import string

allowed = string.digits + "/"
for character in date_string:
    if character not in allowed:
        date_string = date_string.replace(character, "")
This will convert the date_string 5/18 Cock'a'doodle'do to just 5/18 without using regex at all.
Barmar, in a comment on the original question, had the best answer. To remove all but the numbers and slashes from the string you can use this one line of code:
string = re.sub(r'[^\d/]', '', string)
This removes every character that isn't a digit or a slash. Thank you Barmar; if you want to post this as an answer I can take this down and flag that instead.
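For example, on one of the sample titles from the question:

```python
import re

title = "5/18 Cock'a'doodle'do"
# Strip every character that is not a digit or a slash
date_only = re.sub(r'[^\d/]', '', title)
print(date_only)
```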
string = "rk3k3rr3kk____"
print("".join([letter for letter in string if not letter.isalpha()]))
But this is what you actually want, since your data always seems to be in a specific format:
string.split(" ")[0]
okay, okay, okay ... this is what you want:
string[:4]
And for completeness' sake:
string = " 2/24 4/12 333333 effee24/22"
for i, x in enumerate(string):
    if len(string) <= i + 4:
        break
    if i > 0 and x != " " and not x.isalpha():
        continue
    if not string[i+1].isnumeric():
        continue
    if string[i+2] != "/":
        continue
    if not string[i+3].isnumeric():
        continue
    if not string[i+4].isnumeric():
        continue
    if len(string) == i + 6 and string[i+5] != " " and not string[i+5].isalpha():
        continue
    print(string[i+1:i+5])
I have 4.5 million rows to process, so I badly need to speed this up.
I don't understand regex very well, so the other answers are hard for me to follow.
I have a column that contains IDs, e.g.:
ELI-123456789
The numeric part of this ID is contained in a string in a column bk_name, preceded by a "#":
AAAAA#123456789_BBBBB;CCCCC;
Now my goal is to change that string by moving the ID to the end, still preceded by a "#", and saving the result in new_name:
AAAAA_BBBBB;CCCCC;#123456789
Here's what I tried:
Take the ID and replace "ELI-" with "#", save as ID2:
df["ID2"] = df["ID"].str.replace("ELI-", "#")
Find ID2 in the string of bk_name and replace it with "":
df["new_name"] = df["bk_name"].replace(regex = r'(?i)' + df["ID2"].astype(str), value = "")
Take the ID again, replace "ELI-" with "#", add it at the end of the string of new_name:
df["new_name"] = df["new_name"] + df["ID"].str.replace("ELI-", "#")
Now the problem is that step 2 is the line that takes most of the time: it took 6.5 seconds to process 6700 rows.
But now I have 4.6 million rows to process; it's been 7 hours, it's still running, and I have no idea why.
In my opinion, regex is what slows down my code, but I don't have a deep enough understanding to be sure.
So thanks for your help in advance; any suggestions would be appreciated :)
So, your approach wasn't that bad.
I recommend you look into the function docs before using them: str.replace takes no keyword arguments.
Regarding step 2, don't use a regex just for replacing a string with another string ("" in this case).
df = {}
df['ID'] = 'ELI-123456789'
df['bk_name'] = 'AAAAA#123456789_BBBBB;CCCCC;'
print(df)

# Take the ID and replace "ELI-" with "#", save as ID2:
# df["ID2"] = df["ID"].str.replace("ELI-", "#")
df["ID2"] = df["ID"].replace("ELI-", "#")
print(df)

# Find ID2 in the string of bk_name and replace it with "":
# df["new_name"] = df["bk_name"].replace(regex = r'(?i)' + df["ID2"].astype(str), value = "")
df["new_name"] = df["bk_name"].replace(df["ID2"], "")
print(df)

# Take the ID again, replace "ELI-" with "#", add it at the end of new_name:
# df["new_name"] = df["new_name"] + df["ID"].str.replace("ELI-", "#")
df["new_name"] = df["new_name"] + df["ID"].replace("ELI-", "#")
print(df)
Hope it helps
EDIT:
Are you sure this piece of code is the one slowing down your project? Have you tried isolating it and seeing how long it actually takes to execute?
Try this:
import timeit

code_to_test = """
df = {}
df['ID'] = 'ELI-123456789'
df['bk_name'] = 'AAAAA#123456789_BBBBB;CCCCC;'
df["new_name"] = df['bk_name'].replace('#' + df['ID'][4:], '') + '#' + df['ID'][4:]  # <----- ONE LINER
# print(df['new_name'])
"""

elapsed_time = timeit.timeit(code_to_test, number=4600000)
print(elapsed_time)
I shortened it to a one-liner which ONLY works if:
ELI- (or some other 4-character prefix) is always the string to be replaced
# is always the character to be removed and # is always the one to be added
At least on the laptop I'm writing this on, it always executes the code 4600000 times in under 4 secs.
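Outside the benchmark harness, the one-liner itself does this with the question's sample values:

```python
df = {"ID": "ELI-123456789", "bk_name": "AAAAA#123456789_BBBBB;CCCCC;"}

# df['ID'][4:] drops the 4-character "ELI-" prefix, leaving the bare number
df["new_name"] = df["bk_name"].replace("#" + df["ID"][4:], "") + "#" + df["ID"][4:]
print(df["new_name"])
```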
Well, if you know that the ID part is always ELI- followed by a number, and that this number always appears after the "#" near the beginning, then I would skip reading the ID in step 1 entirely. Work directly at step 2 with this regular expression and replacement:
https://regex101.com/r/QBQB7E/1
You should be able to create the content of your new field with one single assignment from the result of the substitution on the old field. The way to speed things up is to avoid having 4-5 lines of code for getting the id, searching for it, and then recomposing the result; the regular expression and the substitution pattern can do all of that in one single operation.
You can generate some Python code directly from Regex101 so that you can integrate it to create your new_name field's content.
Explanation
^(.*?)#(\d+)(.*)$ is the regular expression
^ means "starting with".
$ means "ending with".
() captures a group, creating a variable that we can then use in the replacement pattern: \1 for the first matching group, \2 for the second, etc.
.*? matches any character between zero and unlimited times, as few times as possible, expanding as needed (lazy).
# matches the character # literally (case sensitive).
\d+ matches a digit (equal to [0-9]) one or more times.
.* matches any character between zero and unlimited times, as many times as possible, giving back as needed (greedy).
The replacement string is \1\3#\2: it takes the first matching group followed by the last one, then adds a # followed by the second matching group, which is your id.
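As a concrete sketch of that substitution on the question's sample string:

```python
import re

bk_name = "AAAAA#123456789_BBBBB;CCCCC;"
# \1 = "AAAAA", \2 = "123456789", \3 = "_BBBBB;CCCCC;"
new_name = re.sub(r"^(.*?)#(\d+)(.*)$", r"\1\3#\2", bk_name)
print(new_name)
```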
In terms of speed, the regex itself could also be changed a little to find a faster version, depending on how it's written.
Second version:
^([^#]+)#(\d+)(.*)$, where I replaced .*? with [^#]+, meaning: find any character which is not #, one or more times.
Solution here: https://regex101.com/r/QBQB7E/3
Tested the code and I got it in about 4 seconds...
import timeit

code_to_test = r"""
import re

regex = r"^([^#]+)#(\d+)(.*)$"
subst = "\\1\\3#\\2"
df = {
    "ID": "ELI-123456789",
    "bk_name": "AAAAA#123456789_BBBBB;CCCCC;"
}
df["new_name"] = re.sub(regex, subst, df["bk_name"])
"""

elapsed_time = timeit.timeit(code_to_test, number=4600000)
print(elapsed_time)

(Note the raw outer string: without it, "\\1" inside the triple-quoted snippet would be compiled as the octal escape "\1" when timeit executes it, and the substitution would insert control characters instead of group references.)
I really think that if your code runs for 7 hours, the problem is probably somewhere else. The regex engine doesn't seem much slower than doing manual search/replace operations.
I don't understand what caused the problem, but I solved it.
The only thing I changed is step 2: I used an apply/lambda function and it suddenly works, and I have no idea why.
Take the ID and replace "ELI-" with "#", save as ID2:
df["ID2"] = df["ID"].str.replace("ELI-", "#")
Find ID2 in the string of bk_name and replace it with "":
df["new_name"] = df[["bk_name", "ID2"]].astype(str).apply(lambda x: x["bk_name"].replace(x["ID2"], ""), axis=1)
Take the ID again, replace "ELI-" with "#", add it at the end of the string of new_name:
df["new_name"] = df["new_name"] + df["ID"].str.replace("ELI-", "#")
Probably an easy one, but I haven't found a solution so far. The text file that I import into Python has a space-delimited structure such as
20.06.2009 05:00:00 2.6
20.06.2009 06:00:00 21.5
I want to split this into a time and a value variable. Slicing the time component is straightforward
time = ""
value = ""
for i in lines:
    time += i[0:20]
But I can't find a solution for the value component, as it mostly contains 3 digits but sometimes 4, so the number of space delimiters between time and value changes (which is why my attempts with the re package didn't work). Any solutions?
You can use rsplit(' ', 1) on your string to split based on the last occurrence of a whitespace in your string:
So you could do:
x = '20.06.2009 05:00:00 2.6'
y = '20.06.2009 06:00:00 21.5'
items = [x, y]

value = 0
for item in items:
    value += float(item.rsplit(' ', 1)[1])

print(value)
Output
24.1
You can use the strip function, which removes leading and trailing whitespace:

number += float(i[20:].strip())

(Everything after your i[0:20] time slice is the value, so the slice starts at index 20.) This also works if you have spaces at the end of the line.
There is also the .split() function, which splits a line at every whitespace character (or at whatever separator you need).
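For instance, slicing past the 20-character timestamp prefix and stripping (a sketch using the question's sample lines, assuming the values are right-aligned, which is what the varying number of separators suggests):

```python
lines = [
    "20.06.2009 05:00:00  2.6",
    "20.06.2009 06:00:00 21.5",
]
times = [i[0:20].strip() for i in lines]         # timestamp part
values = [float(i[20:].strip()) for i in lines]  # numeric part, stray spaces removed
```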
You can also .split() the entire line with list1 = x.split(' '). What you get at the end is a list; with the extra space before a short value it looks like

list1 = ['20.06.2009', '05:00:00', '', '2.6']

So you can take the last element and get rid of any stray spaces with .replace().
I have an Access table with a bunch of coordinate values in degrees, minutes, and seconds, formatted like this:
90-12-28.15
I want to reformat it like this:
90° 12' 28.15"
essentially replacing the dashes with the degree, minute, and second characters, with a space between the degrees and minutes and another between the minutes and seconds.
I'm thinking about using the Replace function, but I'm not sure how to replace the first instance of the dash with a degree character (°) plus a space, then detect the second instance of the dash and insert the minute character plus a space, and finally add the seconds character at the end.
Any help is appreciated.
Mike
While regular expressions and split() are fine solutions, doing this with replace() is rather easy.
lat = "90-12-28.15"
lat = lat.replace("-", "° ", 1)
lat = lat.replace("-", "' ", 1)
lat = lat + '"'
Or you can do it all on one line:
lat = lat.replace("-", "° ", 1).replace("-", "' ", 1) + '"'
I would just split your first string:

# -*- coding: utf-8 -*-
s = '90-12-28.15'
arr = s.split('-')
s2 = arr[0] + '° ' + arr[1] + "' " + arr[2] + '"'
print(s2)
You might want to use Python's regular expressions module re, particularly re.sub(). Check the Python docs here for more information.
If you're not familiar with regular expressions, check out this tutorial here, also from the Python documentation.
import re
text = 'replace "-" in 90-12-28.15'
print(re.sub(r'(\d\d)-(\d\d)-(\d\d)\.(\d\d)', r'''\1° \2' \3.\4"''', text))
# use \d{1,2} instead of \d\d if single digits are allowed
The python "replace" string method should be easy to use. You can find the documentation here.
In your case, you can do something like this:
my_str = "90-12-28.15"
my_str = my_str.replace("-", "° ", 1)  # the 1 means only the first occurrence is replaced
my_str = my_str.replace("-", "' ", 1)
my_str = my_str + "\""
I've searched but didn't quite find something for my case. Basically, I'm trying to split the following line:
(CU!DIVD:WEXP:DIVD-:DIVD+:RWEXP:RDIVD:RECL:RLOSS:MISCDI:WEXP-:INT:RGAIN:DIVOP:RRGAIN:DIVOP-:RDIVOP:RRECL:RBRECL:INT -:RRLOSS:INT +:RINT:RDIVD-:RECL-:RWXPOR:WEXPOR:MISCRE:WEXP+:RWEXP-:RBWEXP:RECL+:RRECL-:RBDIVD)
You can read this as CU is NOT DIVD or WEXP or DIVD- and so on. What I'd like to do is split this line, if it's over 65 characters, into something more manageable like this:
(CU!DIVD:WEXP:DIVD-:DIVD+:RWEXP:RDIVD:RECL:RLOSS:MISCDI:WEXP-)
(CU!INT:RGAIN:DIVOP:RRGAIN:DIVOP-:RDIVOP:RRECL:RBRECL:INT-)
(CU!RRLOSS:INT +:RINT:RDIVD-:RECL-:RWXPOR:WEXPOR:MISCRE:WEXP+)
(CU!RWEXP-:RBWEXP:RECL+:RRECL-:RBDIVD)
They're all less than 65 characters. This can be stored in a list and I can take care of the rest. I'm starting to work on this with RegEx but I'm having a bit of trouble.
Additionally, it can also have the following conditionals:
!
<
>
=
!=
!<
!>
As of now, I have this:
def FilterParser(iteratorIn, headerIn):
    listOfStrings = []
    for eachItem in iteratorIn:
        if len(str(eachItem.text)) > 65:
            exmlLogger.error('The length of filter ' + eachItem.text + ' exceeds the limit and will be dropped')
        else:
            listOfStrings.append(rightSpaceFill(headerIn + EXUTIL.intToString(eachItem), 80))
    return ''.join(listOfStrings)
Here is a solution using regex, edited to include the CU! prefix (or any other prefix) to the beginning of each new line:
import re
s = '(CU!DIVD:WEXP:DIVD-:DIVD+:RWEXP:RDIVD:RECL:RLOSS:MISCDI:WEXP-:INT:RGAIN:DIVOP:RRGAIN:DIVOP-:RDIVOP:RRECL:RBRECL:INT -:RRLOSS:INT +:RINT:RDIVD-:RECL-:RWXPOR:WEXPOR:MISCRE:WEXP+:RWEXP-:RBWEXP:RECL+:RRECL-:RBDIVD)'
prefix = '(' + re.search(r'\w+(!?[=<>]|!)', s).group(0)
maxlen = 64 - len(prefix) # max line length of 65, prefix and ')' will be added
regex = re.compile(r'(.{1,%d})(?:$|:)' % maxlen)
lines = [prefix + line + ')' for line in regex.findall(s[len(prefix):-1])]
>>> print('\n'.join(lines))
(CU!DIVD:WEXP:DIVD-:DIVD+:RWEXP:RDIVD:RECL:RLOSS:MISCDI:WEXP-)
(CU!INT:RGAIN:DIVOP:RRGAIN:DIVOP-:RDIVOP:RRECL:RBRECL:INT -)
(CU!RRLOSS:INT +:RINT:RDIVD-:RECL-:RWXPOR:WEXPOR:MISCRE:WEXP+)
(CU!RWEXP-:RBWEXP:RECL+:RRECL-:RBDIVD)
First we need to grab the prefix; we do this using re.search().group(0), which returns the entire match. Each of the final lines should be at most 65 characters. The regex that we will use to get these lines will not include the prefix or the closing parenthesis, which is why maxlen is 64 - len(prefix).
Now that we know the maximum number of characters we can match, the first part of the regex (.{1,<maxlen>}) will match at most that many characters. The portion at the end, (?:$|:), makes sure that we only split the string on colons or at the end of the string. Since there is only one capturing group, regex.findall() will return only that group, leaving off the trailing colon. Here is what it looks like for your sample string:
>>> pprint.pprint(regex.findall(s[len(prefix):-1]))
['DIVD:WEXP:DIVD-:DIVD+:RWEXP:RDIVD:RECL:RLOSS:MISCDI:WEXP-',
'INT:RGAIN:DIVOP:RRGAIN:DIVOP-:RDIVOP:RRECL:RBRECL:INT -',
'RRLOSS:INT +:RINT:RDIVD-:RECL-:RWXPOR:WEXPOR:MISCRE:WEXP+',
'RWEXP-:RBWEXP:RECL+:RRECL-:RBDIVD']
The list comprehension is used to construct a list of all of the lines by adding the prefix and the trailing ) to each result. The slicing of s is done so that the prefix and the trailing ) are stripped off of the original string before regex.findall(). Hope this helps!
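A quick sanity check (a sketch reusing the snippet above) that every produced line respects the 65-character limit and that no filter items were lost in the split:

```python
import re

s = '(CU!DIVD:WEXP:DIVD-:DIVD+:RWEXP:RDIVD:RECL:RLOSS:MISCDI:WEXP-:INT:RGAIN:DIVOP:RRGAIN:DIVOP-:RDIVOP:RRECL:RBRECL:INT -:RRLOSS:INT +:RINT:RDIVD-:RECL-:RWXPOR:WEXPOR:MISCRE:WEXP+:RWEXP-:RBWEXP:RECL+:RRECL-:RBDIVD)'
prefix = '(' + re.search(r'\w+(!?[=<>]|!)', s).group(0)  # "(CU!"
maxlen = 64 - len(prefix)
regex = re.compile(r'(.{1,%d})(?:$|:)' % maxlen)
chunks = regex.findall(s[len(prefix):-1])
lines = [prefix + chunk + ')' for chunk in chunks]
```

Rejoining the chunks with ':' should reproduce the inner part of the original string exactly, since each match consumes one separating colon.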