Python re.sub with regex - python

I need help with a regex inside re.sub. In this case I am replacing the match with nothing ("").
My current code:
file_list = ['F_5500_SF_PART7_[0-9][0-9][0-9][0-9]_all.zip',
'F_5500_SF_[0-9][0-9][0-9][0-9]_All.zip',
'F_5500_[0-9][0-9][0-9][0-9]_All.zip',
'F_SCH_A_PART1_[0-9][0-9][0-9][0-9]_All.zip']
foldernames = [re.sub('(\d{4})_All.zip', '', i) for i in file_list]
The Result I am trying to achieve is:
foldernames = ['F_5500_SF_PART7','F_5500_SF','F_5500','F_SCH_A_PART1']
I think part of the complexity is the fact that there is already regex in my file_list. Hoping someone smarter could help.

You don't need a regular expression, you're removing fixed strings. So you can just use the str.replace() method.
foldernames = [i.replace('_[0-9][0-9][0-9][0-9]_All.zip', '').replace('_[0-9][0-9][0-9][0-9]_all.zip', '') for i in file_list]
The two calls to replace() are needed to handle both All and all. Or if the rest of the filename is always uppercase, you could use:
foldernames = [i.upper().replace('_[0-9][0-9][0-9][0-9]_ALL.ZIP', '') for i in file_list]

Barmar's answer is the most appropriate for your problem. But if you actually need a regex (say, not all the files have the same fixed "[0-9][0-9][0-9][0-9]" string), you can use:
r'_(\[[-\d]*\]){4}_[aA]ll\.zip'
(the [aA]ll at the end is for matching the lower-case "all" in your first case; the dot before zip is escaped so that it only matches a literal dot)
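To see that pattern applied with re.sub, here is a minimal sketch. It assumes the bracketed placeholders appear literally in every name, as in the question; the trailing `$` anchor and the variable name `suffix` are additions for illustration, not part of the answer above.

```python
import re

file_list = [
    'F_5500_SF_PART7_[0-9][0-9][0-9][0-9]_all.zip',
    'F_5500_SF_[0-9][0-9][0-9][0-9]_All.zip',
    'F_5500_[0-9][0-9][0-9][0-9]_All.zip',
    'F_SCH_A_PART1_[0-9][0-9][0-9][0-9]_All.zip',
]

# Four literal "[...]" groups, then "_All.zip" or "_all.zip" at the end of the name.
suffix = re.compile(r'_(\[[-\d]*\]){4}_[aA]ll\.zip$')

foldernames = [suffix.sub('', name) for name in file_list]
print(foldernames)
# ['F_5500_SF_PART7', 'F_5500_SF', 'F_5500', 'F_SCH_A_PART1']
```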

Related

Python regex to check if a substring is at the beginning or at the end of a bigger path to look for

I have a string containing words in the form word1_word2, word3_word4, word5_word1 (so a word can appear at the left or at the right). I want a regex that looks for all the occurrences of a specific word, and returns the "super word" containing it. So if I'm looking for word1, I expect my regex to return word1_word2, word5_word1. Since the word can appear on the left or on the right, I wrote this:
re.findall("( {}_)?[\u0061-\u007a\u00e0-\u00e1\u00e8-\u00e9\u00ec\u00ed\u00f2-\u00f3\u00f9\u00fa]*(_{} )?".format("w1", "w1"), string)
With the optional blocks at the beginning or at the end of the pattern. However, it takes forever to execute and I think something is not correct because I tried removing the optional blocks and writing two separate regex for looking at the beginning and at the end and they are much faster (but I don't want to use two regex). Am I missing something or is it normal?
This would be the regex solution to your problem:
re.findall(rf'\b({yourWord}_\w+?|\w+?_{yourWord})\b', yourString)
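A runnable sketch of that pattern follows. The helper name `find_super_words` is hypothetical, and passing the word through re.escape is an extra precaution in case it contains regex metacharacters. The group is made non-capturing so findall returns whole matches.

```python
import re

def find_super_words(word, text):
    """Return every token in text that starts with 'word_' or ends with '_word'."""
    w = re.escape(word)  # treat the search word literally
    pattern = rf'\b(?:{w}_\w+?|\w+?_{w})\b'
    return re.findall(pattern, text)

text = 'word1_word2, word3_word4, word5_word1'
print(find_super_words('word1', text))  # ['word1_word2', 'word5_word1']
```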
Python provides some methods to do this
a=['word1_word2', 'word3_word4', 'word5_word1']
b = [x for x in a if x.startswith("word1") or x.endswith('word1')]
print(b) # ['word1_word2', 'word5_word1']
s = 'word1_word2, word3_word4, word5_word1'
matches = re.finditer(r'(\w+_word1)|(word1_\w+)', s)
result = list(map(lambda x: x.group(), matches))
# ['word1_word2', 'word5_word1']
This is one method, but having seen @Carl's answer, I voted for his. That is a faster and cleaner method. I will just leave this here as one of many regex options.
This regex will do the job for word1:
regex = r'(?:word\d_)*word1(?:_word\d)*'
re.findall(regex, string)
you can also use this:
re.findall(rf'\b(word{number}_\w+?|\w+?_word{number})\b', string)
Try the following regex.
In the following, replace word1 with the word you're looking for. This is assuming that the word you are looking for consists of only alphanumeric characters.
([a-zA-Z0-9]+_word1)|(word1_[a-zA-Z0-9]+)

in python find index in list if combination of strings exist

I'm writing my first script and trying to learn python.
But I'm stuck and can't get out of this one.
I'm writing a script to change file names.
Lets say I have a string = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
I want the result to be string = "This Is Test3 E00"
this is what I have so far:
l = list(string)
# Transform the string into a list
for i in l:
    if "E" in l:
        p = l.index("E")
        if isinstance((p+1), int) is True:
            if isinstance((p+2), int) is True:
                delp = p+3
                a = p-3
                del l[delp:]
new = "".join(l)
new = new.replace("."," ")
print (new)
The idea is to get the index of "E" and check whether the two characters after it are integers,
then delete everything after the second integer.
However, this will not work if there is an "E" anywhere else.
at the moment the result I get is:
this is tEst
because it is finding index for the first "E" on the list and deleting everything after index+3
I guess my question is how do I get the index in the list if a combination of strings exists.
but I can't seem to find how.
Thanks for everyone's answers.
I was going in another direction, but it is also not working.
If someone could see why, it would be awesome. It is much better to learn by doing than by just copying what others write :)
This is what I came up with:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
Can anyone tell me why this isn't working? I get an error.
Thank you so much
Have you ever heard of a Regular Expression?
Check out python's re module. Link to the Docs.
Basically, you can define a "regex" that would match "E and then two integers" and give you the index of it.
After that, I'd just use python's "Slice Notation" to choose the piece of the string that you want to keep.
Then, check out the string methods for str.replace to swap the periods for spaces, and str.title to put them in Title Case
An easy way is to use a regex to find up until the E followed by 2 digits criteria, with s as your string:
import re
up_until = re.match(r'(.*?E\d{2})', s).group(1)
# this.is.tEst3.E00
Then, we replace the . with a space and then title case it:
output = up_until.replace('.', ' ').title()
# This Is Test3 E00
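The two steps can be wrapped into one function. This is only a sketch: the name `short_title` is made up for illustration, and returning None when no "E plus two digits" marker is found is an assumption about what the script should do in that case.

```python
import re

def short_title(filename):
    """Keep everything up to the first 'E' + two digits, then prettify."""
    match = re.match(r'(.*?E\d{2})', filename)
    if match is None:
        return None  # no "E" followed by two digits in this name
    return match.group(1).replace('.', ' ').title()

print(short_title('this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'))  # This Is Test3 E00
print(short_title('no.marker.here'))  # None
```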
The technique to consider using is Regular Expressions. They allow you to search for a pattern of text in a string, rather than a specific character or substring. Regular Expressions have a bit of a tough learning curve, but are invaluable to learn and you can use them in many languages, not just in Python. Here is the Python resource for how Regular Expressions are implemented:
http://docs.python.org/2/library/re.html
The pattern you are looking to match in your case is an "E" followed by two digits. In Regular Expressions (usually shortened to "regex" or "regexp"), that pattern looks like this:
E\d\d # ('\d' is the specifier for any digit 0-9)
In Python, you create a string of the regex pattern you want to match, and pass that and your file name string into the search() method of the re module. Regex patterns tend to use a lot of special characters, so it's common in Python to prepend the regex pattern string with 'r', which tells the Python interpreter not to interpret backslashes as escape characters. All of this together looks like this:
import re
filename = 'this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'
match_object = re.search(r'E\d\d', filename)
if match_object:
    # end(0) is the index just past group 0, i.e. the whole match
    index_of_Exx = match_object.end(0)
    truncated_filename = filename[:index_of_Exx]
    # Now take care of any more processing
Regular expressions can get very detailed (and complex). In fact, you can probably accomplish your entire task of fully changing the file name using a single regex that's correctly put together. But since I don't know the full details about what sorts of weird file names might come into your program, I can't go any further than this. I will add one more piece of information: if the 'E' could possibly be lower-case, then you want to add a flag as a third argument to your pattern search which indicates case-insensitive matching. That flag is 're.I' and your search() method would look like this:
match_object = re.search(r'E\d\d', filename, re.I)
Read the documentation on Python's 're' module for more information, and you can find many great tutorials online, such as this one:
http://www.zytrax.com/tech/web/regex.htm
And before you know it you'll be a superhero. :-)
The reason why this isn't working:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
...is because 'i' contains a character from the list 'l', not an integer. You compare it with 'E' (which works), but then try to add 1 to it, which raises a TypeError.
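If you want to stay with the loop approach rather than a regex, one possible fix is to iterate with enumerate, so you have the position as well as the character, and to test the next two characters with str.isdigit. This is a sketch, and the helper name `index_of_marker` is made up for illustration.

```python
def index_of_marker(chars):
    """Return the index of the first 'E' followed by two digit characters, or -1."""
    for pos, ch in enumerate(chars):
        following = chars[pos + 1:pos + 3]
        if ch == 'E' and len(following) == 2 and all(c.isdigit() for c in following):
            return pos
    return -1

name = 'this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'
l = list(name)
p = index_of_marker(l)
if p != -1:
    del l[p + 3:]  # drop everything after the second digit
print(''.join(l).replace('.', ' ').title())  # This Is Test3 E00
```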

Using variable in re.match in python

I am trying to create an array of things to match in a description line. So I can ignore them later on in my script. Below is a sample script that I have been working on, on the side.
Basically I am trying to take a bunch of strings and match it against a bunch of other strings.
AKA:
asdf or asfs or wrtw in string = true continue with script
if not print this.
import re
ignorelist = ['^test', '(.*)set']
def guess(a):
    for ignore in ignorelist:
        if re.match(ignore, a):
            return('LOSE!')
        else:
            return('WIN!')
a = raw_input('Take a guess: ')
print guess(a)
Thanks
You have a bit of logic/flow problem.
You test the first term in the list. If it doesn't match, you go to the else and return "WIN!" without testing any of the other terms in the list.
(Also, ignorelist is outside the function.)
[EDIT: I see you edited the question to include regular expressions, so I will edit the answer back to a re context...] Note that you should use re.search instead of re.match if you want to give it actual regexes, since re.match only matches at the beginning of the string.
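A small illustration of that difference, using a made-up input string:

```python
import re

text = 'subset'

# re.match only tries the pattern at position 0 of the string.
m1 = re.match(r'[abc]set', text)
print(m1)          # None

# re.search scans forward until the pattern fits somewhere.
m2 = re.search(r'[abc]set', text)
print(m2.group())  # bset
```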
There are innumerable ways to change this, depending on how you want your program to work.
I would re-write guess along these lines. (You can also put ignorelist inside the function instead of passing it.):
ignorelist = [r'^test', r'[abc]set']
def guess(a,il):
    for reg in il:
        if re.search(reg,a):
            return "LOSE"
    return "WIN"
a = raw_input()
print guess(a,ignorelist)
In this case, it will loop through each word, exiting if it finds a match, but if it doesn't (completes the loop without returning anything) then it will finally return "WIN".
I think it would be far better to use a single regex, or a set of them if a single one would be too big to compile. Something like:
GUESSER = re.compile('|'.join(ignorelist))
def guess(a):
    if GUESSER.search(a):
        return('LOSE!')
    else:
        return('WIN!')
Note: Each pattern in "ignorelist" should be enclosed in a pair of parentheses if it uses the or "|" operator.
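One way to follow that note without editing the list by hand is to wrap each pattern in a non-capturing group while joining. This is a sketch only; the third pattern, `foo|bar`, is a hypothetical entry added to show a pattern that contains its own "|".

```python
import re

ignorelist = [r'^test', r'[abc]set', r'foo|bar']  # last entry is hypothetical

# Wrap each pattern in (?:...) so it stays a self-contained unit in the big regex.
GUESSER = re.compile('|'.join(f'(?:{p})' for p in ignorelist))

def guess(a):
    # A hit on any ignore pattern loses, as in the original question.
    return 'LOSE!' if GUESSER.search(a) else 'WIN!'

print(guess('testing'))  # LOSE!
print(guess('hello'))    # WIN!
```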

How to check if a string contains an element from a list in Python

I have something like this:
extensionsToCheck = ['.pdf', '.doc', '.xls']
for extension in extensionsToCheck:
    if extension in url_string:
        print(url_string)
I am wondering what would be the more elegant way to do this in Python (without using the for loop)? I was thinking of something like this (like from C/C++), but it didn't work:
if ('.pdf' or '.doc' or '.xls') in url_string:
    print(url_string)
Edit: I'm kinda forced to explain how this is different to the question below which is marked as potential duplicate (so it doesn't get closed I guess).
The difference is, I wanted to check if a string is part of some list of strings whereas the other question is checking whether a string from a list of strings is a substring of another string. Similar, but not quite the same and semantics matter when you're looking for an answer online IMHO. These two questions are actually looking to solve the opposite problem of one another. The solution for both turns out to be the same though.
Use a generator together with any, which short-circuits on the first True:
if any(ext in url_string for ext in extensionsToCheck):
    print(url_string)
EDIT: I see this answer has been accepted by the OP. Though my solution may be a "good enough" solution to his particular problem, and is a good general way to check if any strings in a list are found in another string, keep in mind that this is all that this solution does. It does not care WHERE the string is found, e.g. at the end of the string. If this is important, as is often the case with URLs, you should look to the answer of @Wladimir Palant, or you risk getting false positives.
extensionsToCheck = ('.pdf', '.doc', '.xls')
'test.doc'.endswith(extensionsToCheck) # returns True
'test.jpg'.endswith(extensionsToCheck) # returns False
It is better to parse the URL properly - this way you can handle http://.../file.doc?foo and http://.../foo.doc/file.exe correctly.
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse
import os
path = urlparse(url_string).path
ext = os.path.splitext(path)[1]
if ext in extensionsToCheck:
    print(url_string)
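The same idea as a small function, sketched with Python 3's urllib.parse. The function name and the example.com URLs are hypothetical, chosen only to exercise the two cases mentioned above.

```python
import os
from urllib.parse import urlparse

extensionsToCheck = ('.pdf', '.doc', '.xls')

def has_wanted_extension(url_string):
    """True if the path component of the URL ends in one of the extensions."""
    path = urlparse(url_string).path  # drops the query string and fragment
    ext = os.path.splitext(path)[1]
    return ext in extensionsToCheck

# Hypothetical URLs, for illustration only.
print(has_wanted_extension('http://example.com/file.doc?foo'))      # True
print(has_wanted_extension('http://example.com/foo.doc/file.exe'))  # False
```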
Just in case if anyone will face this task again, here is another solution:
extensionsToCheck = ['.pdf', '.doc', '.xls']
url_string = 'file.doc'
res = [ele for ele in extensionsToCheck if(ele in url_string)]
print(bool(res))
> True
Use a list comprehension if you want a single-line solution. The following code returns a list containing url_string when it contains one of the extensions .doc, .pdf or .xls, and an empty list when it contains none of them.
print([url_string for extension in extensionsToCheck if(extension in url_string)])
NOTE: This only checks whether the string contains an extension or not; it is not useful when one wants to extract the exact word matching the extensions.
This is a variant of the list comprehension answer given by @psun.
By switching the output value, you can actually extract the matching pattern from the list comprehension (something not possible with the any() approach by @Lauritz-v-Thaulow)
extensionsToCheck = ['.pdf', '.doc', '.xls']
url_string = 'http://.../foo.doc'
print([extension for extension in extensionsToCheck if(extension in url_string)])
['.doc']
You can furthermore insert a regular expression (with re imported) if you want to collect additional information once the matched pattern is known (this could be useful when the list of allowed patterns is too long to write into a single regex pattern)
print([re.search(r'(\w+)' + re.escape(extension), url_string).group(0) for extension in extensionsToCheck if(extension in url_string)])
['foo.doc']
Check if it matches this regex:
'(\.pdf$|\.doc$|\.xls$)'
Note: if your extensions are not at the end of the URL, remove the $ characters, but that does weaken the check slightly
This is the easiest way I could imagine :)
list_ = ('.doc', '.txt', '.pdf')
string = 'file.txt'
func = lambda list_, string: any(filter(lambda x: x in string, list_))
func(list_, string)
# Output: True
Also, if someone needs to save elements that are in a string, they can use something like this:
list_ = ('.doc', '.txt', '.pdf')
string = 'file.txt'
func = lambda list_, string: tuple(filter(lambda x: x in string, list_))
func(list_, string)
# Output: ('.txt',)

De-greedifying a regular expression in python

I'm trying to write a regular expression that will convert a full path filename to a short filename for a given filetype, minus the file extension.
For example, I'm trying to get just the name of the .bar file from a string using
re.search('/(.*?)\.bar$', '/def_params/param_1M56/param/foo.bar')
According to the Python re docs, *? is the ungreedy version of *, so I was expecting to get
'foo'
returned for match.group(1) but instead I got
'def_params/param_1M56/param/foo'
What am I missing here about greediness?
What you're missing isn't so much about greediness as about regular expression engines: they work from left to right, so the / matches as early as possible and the .*? is then forced to work from there. In this case, the best regex doesn't involve greediness at all (you need backtracking for that to work; it will, but could take a really long time to run if there are a lot of slashes), but a more explicit pattern:
'/([^/]*)\.bar$'
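A short comparison of the two patterns on the path from the question:

```python
import re

path = '/def_params/param_1M56/param/foo.bar'

# Lazy version: the first "/" matches at index 0, and .*? must stretch from there.
lazy = re.search(r'/(.*?)\.bar$', path)
# Explicit version: [^/]* cannot cross a slash, so only the last segment fits.
explicit = re.search(r'/([^/]*)\.bar$', path)

print(lazy.group(1))      # def_params/param_1M56/param/foo
print(explicit.group(1))  # foo
```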
I would suggest changing your regex so that it doesn't rely on greediness.
You want only the filename before the extension .bar and everything after the final /. This should do:
re.search(r'/([^/]*)\.bar$', '/def_params/param_1M56/param/foo.bar')
What this does is it matches /, then zero or more characters (as much as possible) that are not / and then .bar.
I don't claim to understand the non-greedy operators all that well, but a solution for that particular problem would be to use ([^/]*?)
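The regular expression engine starts matching from the left, so the first / it finds anchors the match. Put a greedy .* at the start and it should work, since that consumes everything up to the last /.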
I like regex but there is no need of one here.
path = '/def_params/param_1M56/param/foo.bar'
print path.rsplit('/',1)[1].rsplit('.')[0]
path = '/def_params/param_1M56/param/fululu'
print path.rsplit('/',1)[1].rsplit('.')[0]
path = '/def_params/param_1M56/param/one.before.two.dat'
print path.rsplit('/',1)[1].rsplit('.',1)[0]
result
foo
fululu
one.before.two
Other people have answered the regex question, but in this case there's a more efficient way than regex:
file_name = path[path.rindex('/')+1 : path.rindex('.')]
try this one on for size:
match = re.search(r'.*/(.*?)\.bar$', '/def_params/param_1M56/param/foo.bar')
