Python regex: Turn "ThisFileName.txt" into "This File Name.txt" - python

I'm trying to add a space before every capital letter, except the first one.
Here's what I have so far, and the output I'm getting:
>>> tex = "ThisFileName.txt"
>>> re.sub('[A-Z].', ' ', tex)
' his ile ame.txt'
I want:
'This File Name.txt'
(It'd be nice if I could also get rid of .txt, but I can do that in a separate operation.)

Key concept here is backreferences in regular expressions:
import re
text = "ThisFileName.txt"
print(re.sub(r'([a-z])([A-Z])', r'\1 \2', text))
# Prints: "This File Name.txt"
For pulling off the '.txt' in a reliable way, I recommend os.path.splitext():
import os
filename = "ThisFileName.txt"
print(os.path.splitext(filename))
# Prints: ('ThisFileName', '.txt')
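The two steps can be combined. This is only a minimal sketch: the helper name prettify_filename is made up for illustration, and it assumes the extension should be dropped before the spaces are inserted.

```python
import os
import re

def prettify_filename(filename):
    """Hypothetical helper: drop the extension, then space out the camel case."""
    stem, _ext = os.path.splitext(filename)
    # Insert a space wherever a lowercase letter is directly followed by a capital.
    return re.sub(r'([a-z])([A-Z])', r'\1 \2', stem)

print(prettify_filename("ThisFileName.txt"))  # This File Name
```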

Another possible regular expression using a look behind:
(?<!^)([A-Z])
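As a rough illustration of how that pattern might be used with re.sub (the variable name spaced is just for this example):

```python
import re

# (?<!^) is a negative lookbehind: it rejects a capital at the very start
# of the string, so no leading space is added before the first letter.
spaced = re.sub(r'(?<!^)([A-Z])', r' \1', 'ThisFileName.txt')
print(spaced)  # This File Name.txt
```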

re.sub('([a-z])([A-Z])', '\\1 \\2', 'TheFileName.txt')

It is not clear what you want to do if the filename is Hello123There.txt. So, if you want a space before all capital letters regardless of what precedes them, you can:
import re
def add_space_before_caps(text):
    "Add a space before all caps except at start of text"
    return re.sub(r"(?<!^)(?=[A-Z])", " ", text)
>>> add_space_before_caps("Hello123ThereIBM.txt")
'Hello123 There I B M.txt'

Related

A way to remove all occurrences of words within brackets in a string?

I'm trying to find a way to delete all mentions of references in a text file.
I haven't tried much, as I am new to Python but thought that this is something that Python could do.
def remove_bracketed_words(text_from_file: string) -> string:
"""Remove all occurrences of words with brackets surrounding them,
including the brackets.
>>> remove_bracketed_words("nonsense (nonsense, 2015)")
"nonsense "
>>> remove_bracketed_words("qwerty (qwerty) dkjah (Smith, 2018)")
"qwerty dkjah "
"""
with open('random_text.txt') as file:
wholefile = f.read()
for '(' in
I have no idea where to go from here or if what I've done is right. Any suggestions would be helpful!
You'll have an easier time with a text editing program that handles regular expressions, like Notepad++, than learning Python for this one task (reading in a file, correcting fundamental errors like for '(' in..., etc.). You can even use tools available online for this, such as RegExr (a regular expression tester). In RegExr, write an appropriate expression into the "expression" field and paste your text into the "text" field. Then, in the "tools" area below the text, choose the "replace" option and remove the placeholder expression. Your cleaned-up text will appear there.
You're looking for a space, then a literal opening parenthesis, then some characters, then a comma, then a year (let's just call that 3 or 4 digits), then a literal closing parenthesis, so I'd suggest the following expression (note that it begins with a space):
 \(.*?, \d{3,4}\)
This removes each citation together with the space before it and preserves parenthesized text that is not a citation, as long as no other opening parenthesis precedes the citation on the same line.
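If you do end up in Python, the same idea might look like the sketch below. It is not the expression above verbatim: the lazy .*? is swapped for [^()]* so that a match cannot start at an earlier, unrelated "(" on the same line. The names CITATION and strip_citations are made up, and the pattern assumes citations look like "(Author, 2015)" with no nested parentheses.

```python
import re

# Assumption: a citation is " (text, YEAR)" with a 3- or 4-digit year
# and no parentheses inside the brackets.
CITATION = re.compile(r' \([^()]*, \d{3,4}\)')

def strip_citations(text):
    """Remove citations and the space before them; keep other bracketed text."""
    return CITATION.sub('', text)

print(strip_citations('nonsense (nonsense, 2015)'))            # nonsense
print(strip_citations('qwerty (qwerty) dkjah (Smith, 2018)'))  # qwerty (qwerty) dkjah
```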
Try re
>>> import re
>>> re.sub(r'\(.*?\)', '', 'nonsense (nonsense, 2015)')
'nonsense '
>>> re.sub(r'\(.*?\)', '', 'qwerty (qwerty) dkjah (Smith, 2018)')
'qwerty  dkjah '
import re
def remove_bracketed_words(text_from_file: str) -> str:
    """Remove all occurrences of words with brackets surrounding them,
    including the brackets.
    >>> remove_bracketed_words("nonsense (nonsense, 2015)")
    'nonsense '
    >>> remove_bracketed_words("qwerty (qwerty) dkjah (Smith, 2018)")
    'qwerty  dkjah '
    """
    return re.sub(r'\(.*?\)', '', text_from_file)

with open('random_text.txt', 'r') as file:
    wholefile = file.read()
# Be careful with 'w': it truncates the file, so the original data is lost.
with open('random_text.txt', 'w') as file:
    file.write(remove_bracketed_words(wholefile))

How to partial search for words using regex python

I want to get all 'xlsx' files that somewhere have 'feedback report' in them. I want to make this filter very strong. So any partial matches like 'feedback_report', 'feedback report', 'Feedback Report' should all return true.
Example file names :
ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx
ZL-SA_feedback report_012844.xlsx
ASARanem-SA_Feedback Report_012844.xlsx
A futile attempt below.
regex = re.compile(r"[a-zA-Z0-0]*[fF][eE][eE][dD][bB][aA][cC][kK]\s[rR][eE][pP][oO][rR][tT][a-zA-Z0-0]*.xlsx")
This will work:
re.search(r"(feedback)(.*?|\s)(report)", string, re.IGNORECASE)
I tested it on the following input list with this code:
import re
a=["ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx",
   "ZL-SA_feedback report_012844.xlsx",
   "ASARanem-SA_Feedback Report_012844.xlsx",
   "some report",
   "feedback-report"]
for i in a:
    print(re.search(r"(feedback)(.*?|\s)(report)", i, re.IGNORECASE))
The output, which is what the OP expects, is:
<_sre.SRE_Match object; span=(21, 36), match='FEEDBACK_REPORT'>
<_sre.SRE_Match object; span=(6, 21), match='feedback report'>
<_sre.SRE_Match object; span=(12, 27), match='Feedback Report'>
None
<_sre.SRE_Match object; span=(0, 15), match='feedback-report'>
Your regex is nearly acceptable, but the beginning and ending portions will not match correctly because you have underscores in your examples (and the character class 0-0 should be 0-9). I'm not sure how representative these are of your actual data, but to match what you have here you would need:
regex = re.compile(r"[a-zA-Z0-9\_\-\s]*(feedback)[\s\_\-](report)[a-zA-Z0-9\_\-\s]*\.xlsx",
                   flags = re.IGNORECASE)
Another thing you should probably be careful of is to make sure you're actually working with just the file name and not the file path, because in that case you'd have to worry about \ and / characters. Also note that I'm only matching the exact characters I noticed you were missing. You may want to try
regex = re.compile(r".*(feedback).*(report).*\.xlsx", flags = re.IGNORECASE)
but, again, I'm not sure what your data actually looks like. Hope this helps
First of all, lowercase the file names (or ignore case) in order to minimize the number of possible options
regex = re.compile(r'feedback.{0,3}report.*\.xlsx?', flags=re.IGNORECASE)
This looks for 'feedback', then up to 3 arbitrary characters, then 'report', then anything, followed by a dot and an xls or xlsx extension
or just
filename = 'ZL-SA_feedback report_012844.xlsx'
matched = re.search(r'feedback.{0,3}report.*\.xlsx?', filename.lower())
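To apply a pattern like this to a whole list of names, something along these lines may work. It is a sketch under assumptions: FEEDBACK_XLSX and feedback_reports are invented names, and the pattern is tightened to end in .xlsx only (with a $ anchor), since the question asks for xlsx files.

```python
import re

# Assumption: only names ending in ".xlsx" should pass, so the pattern is
# anchored with "$" instead of also accepting ".xls".
FEEDBACK_XLSX = re.compile(r'feedback.{0,3}report.*\.xlsx$', re.IGNORECASE)

def feedback_reports(names):
    """Return the names that contain 'feedback', up to 3 characters, then 'report'."""
    return [name for name in names if FEEDBACK_XLSX.search(name)]

names = [
    'ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx',
    'ZL-SA_feedback report_012844.xlsx',
    'ASARanem-SA_Feedback Report_012844.xlsx',
    'some report.xlsx',
    'feedback-report.docx',
]
print(feedback_reports(names))  # keeps only the first three names
```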
You can also use Python's glob module to search for files with shell-style wildcards:
import glob
glob.glob('*[fF][eE][eE][dD][bB][aA][cC][kK]*[rR][eE][pP][oO][rR][tT]*.xlsx')
Could you use just string methods like the following?
'feedbackreport' in name.replace('_', '').replace(' ', '').lower()
And also
name.endswith('.xlsx')
Giving you something like:
fileList = [
'ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx',
'ZL-SA_feedback report_012844.xlsx',
'ASARanem-SA_Feedback Report_012844.xlsx'
]
fileNames = [name for name in fileList
if ('feedbackreport' in name.replace('_', '').replace(' ', '').lower()
and name.endswith('.xlsx'))]
If there are more characters that could cause problems such as - then you could also make a quick function to remove bad characters:
def remove_bad_chars(string, chars):
    for char in chars:
        string = string.replace(char, '')
    return string
Amending the appropriate portion of the if statement to:
if 'feedbackreport' in remove_bad_chars(name, '.,?!\'-/:;()"\\~ ').lower()
# included a white space in the string of bad characters
I used this for my string based on all your suggestions. This works for me in 99% of the cases.
regex = re.compile(r"[a-zA-Z0-9\_\-\s]*(feedback)(\s|\_)(report)s?[a-zA-Z0-9\_\-\s]*.xlsx",flags = re.IGNORECASE)

Python regex for string up to character or end of line

I want a regex that stops at a certain character or end of the line. I currently have:
x = re.findall(r'Food: (.*)\|', text)
which selects anything between "Food:" and "|". For adding end of the line, I tried:
x = re.findall(r'Food: (.*)\||$', text)
but this would return empty if the text was 'Food: is great'. How do I make this regex stop at "|" or end of line?
You can use negation based regex [^|]* which means anything but pipe:
>>> re.findall(r'Food: ([^|]*)', 'Food: is great|foo')
['is great']
>>> re.findall(r'Food: ([^|]*)', 'Food: is great')
['is great']
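For comparison, here is a small sketch of why the attempt in the question returns an empty match, plus one other way to write "up to a pipe or the end of the line". The variable names are illustrative only.

```python
import re

no_pipe = 'Food: is great'
with_pipe = 'Food: is great|foo'

# In the original attempt the top-level "|" splits the whole pattern into
# "Food: (.*)\|" OR "$". Without a pipe in the text only "$" matches,
# and group 1 did not take part, so findall reports an empty string.
broken = re.findall(r'Food: (.*)\||$', no_pipe)
print(broken)  # ['']

# Grouping the alternatives keeps "Food: " in both branches; the lazy
# (.*?) then stops at the first pipe or at the end of the line.
fixed_no_pipe = re.findall(r'Food: (.*?)(?:\||$)', no_pipe)
fixed_with_pipe = re.findall(r'Food: (.*?)(?:\||$)', with_pipe)
print(fixed_no_pipe, fixed_with_pipe)  # ['is great'] ['is great']
```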
A simpler alternative solution:
def text_selector(string):
    remove_pipe = string.split('|')[0]
    remove_food_prefix = remove_pipe.split(':')[1].strip()
    return remove_food_prefix

Having trouble adding a space after a period in a python string

I have to write a code to do 2 things:
Compress more than one occurrence of the space character into one.
Add a space after a period, if there isn't one.
For example:
input> This is weird.Indeed
output>This is weird. Indeed.
This is the code I wrote:
def correction(string):
list=[]
for i in string:
if i!=" ":
list.append(i)
elif i==" ":
k=i+1
if k==" ":
k=""
list.append(i)
s=' '.join(list)
return s
strn=input("Enter the string: ").split()
print (correction(strn))
This code takes any input from the user and removes all the extra spaces, but it doesn't add the space after the period. (I know why not: because of the split function, the period and the word after it are treated as one word. I just can't figure out how to fix it.)
This is a code I found online:
import re
def correction2(string):
    corstr = re.sub('\ +',' ',string)
    final = re.sub('\.','. ',corstr)
    return final
strn= ("This is as .Indeed")
print (correction2(strn))
The problem with this code is I can't take any input from the user. It is predefined in the program.
So can anyone suggest how to improve any of the two codes to do both the functions on ANY input by the user?
Is this what you desire?
import re
def corr(s):
    return re.sub(r'\.(?! )', '. ', re.sub(r' +', ' ', s))
s = input("> ")
print(corr(s))
I've changed the regex to use a negative lookahead pattern.
Edit: explain Regex as requested in comment
re.sub() takes (at least) three arguments: The Regex search pattern, the replacement the matched pattern should be replaced with, and the string in which the replacement should be done.
What I'm doing here is two steps at once, I've been using the output of one function as input of another.
First, the inner re.sub(r' +', ' ', s) searches for multiple spaces (r' +') in s to replace them with single spaces. Then the outer re.sub(r'\.(?! )', '. ', ...) looks for periods without following space character to replace them with '. '. I'm using a negative lookahead pattern to match only sections, that don't match the specified lookahead pattern (a normal space character in this case). You may want to play around with this pattern, this may help understanding it better.
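The r string prefix makes the string a raw string, in which backslashes are not treated as escape characters. Python would pass '\.' through unchanged even without the prefix, but recent versions warn about such invalid escape sequences, so it's a habit of mine to use raw strings with regular expressions.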
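One edge case worth trying: the end of the string also counts as "not followed by a space", so a final period gets a trailing space. The sketch below shows this next to a variant, corr_no_trailing, which is my own illustration rather than part of the answer and uses (?=\S) to require a visible character after the period.

```python
import re

def corr(s):
    # Same function as in the answer above.
    return re.sub(r'\.(?! )', '. ', re.sub(r' +', ' ', s))

def corr_no_trailing(s):
    # Variant (an assumption about what is wanted): (?=\S) only matches a
    # period that is directly followed by a non-whitespace character.
    return re.sub(r'\.(?=\S)', '. ', re.sub(r' +', ' ', s))

sample = 'This' + ' ' * 3 + 'is weird.Indeed.'
print(repr(corr(sample)))              # 'This is weird. Indeed. '
print(repr(corr_no_trailing(sample)))  # 'This is weird. Indeed.'
```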
For a more basic answer, without regex:
>>> def remove_doublespace(string):
...     if '  ' not in string:
...         return string
...     return remove_doublespace(string.replace('  ', ' '))
...
>>> remove_doublespace('hi there how are you.i am fine. '.replace('.', '. '))
'hi there how are you. i am fine. '
You can try the following code:
>>> import re
>>> s = 'This is weird.Indeed'
>>> def correction(s):
...     res = re.sub(r'\s+$', '', re.sub(r'\s+', ' ', re.sub(r'\.', '. ', s)))
...     if res[-1] != '.':
...         res += '.'
...     return res
...
>>> print(correction(s))
This is weird. Indeed.
>>> s = input()
hee ss.dk
>>> s
'hee ss.dk'
>>> correction(s)
'hee ss. dk.'

A pythonic way to insert a space before capital letters

I've got a file whose format I'm altering via a python script. I have several camel cased strings in this file where I just want to insert a single space before the capital letter - so "WordWordWord" becomes "Word Word Word".
My limited regex experience just stalled out on me - can someone think of a decent regex to do this, or (better yet) is there a more pythonic way to do this that I'm missing?
You could try:
>>> re.sub(r"(\w)([A-Z])", r"\1 \2", "WordWordWord")
'Word Word Word'
If there are consecutive capitals, then Greg's result might not be what you are looking for, since the \w consumes the character in front of the capital letter to be replaced.
>>> re.sub(r"(\w)([A-Z])", r"\1 \2", "WordWordWWWWWWWord")
'Word Word WW WW WW Word'
A look-behind would solve this:
>>> re.sub(r"(?<=\w)([A-Z])", r" \1", "WordWordWWWWWWWord")
'Word Word W W W W W W Word'
Perhaps shorter:
>>> re.sub(r"\B([A-Z])", r" \1", "DoIThinkThisIsABetterAnswer?")
Have a look at my answer on .NET - How can you split a “caps” delimited string into an array?
Edit: Maybe better to include it here.
re.sub(r'([a-z](?=[A-Z])|[A-Z](?=[A-Z][a-z]))', r'\1 ', text)
For example:
"SimpleHTTPServer" => ["Simple", "HTTP", "Server"]
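In Python, the list form of that example can be obtained by splitting the substituted string on whitespace. A minimal sketch, with split_caps as an illustrative name and assuming the input itself contains no spaces you need to preserve:

```python
import re

def split_caps(text):
    """Split camel case into words, keeping runs of capitals (acronyms) together."""
    # Add a space after a lowercase letter that precedes a capital, and after
    # a capital that precedes a capital-plus-lowercase pair (end of an acronym).
    spaced = re.sub(r'([a-z](?=[A-Z])|[A-Z](?=[A-Z][a-z]))', r'\1 ', text)
    return spaced.split()

print(split_caps("SimpleHTTPServer"))  # ['Simple', 'HTTP', 'Server']
```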
Maybe you would be interested in one-liner implementation without using regexp:
''.join(' ' + char if char.isupper() else char.strip() for char in text).strip()
With regexes you can do this:
re.sub('([A-Z])', r' \1', text)
Of course, that will only work for ASCII characters, if you want to do Unicode it's a whole new can of worms :-)
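For non-ASCII capitals, one option that avoids the regex character class entirely is str.isupper(), which follows Unicode case rules. This is a sketch with a made-up function name, not a full answer to the Unicode question (it ignores things like combining characters and title-case letters).

```python
def space_before_caps(text):
    """Insert a space before every uppercase character except the first one."""
    return ''.join(
        ' ' + ch if index and ch.isupper() else ch
        for index, ch in enumerate(text)
    )

print(space_before_caps('ÉcoleÉtéÜber'))  # École Été Über
```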
If you have acronyms, you probably do not want spaces between them. This two-stage regex will keep acronyms intact (and also treat punctuation and other non-uppercase letters as something to add a space on):
re_outer = re.compile(r'([^A-Z ])([A-Z])')
re_inner = re.compile(r'(?<!^)([A-Z])([^A-Z])')
re_outer.sub(r'\1 \2', re_inner.sub(r' \1\2', 'DaveIsAFKRightNow!Cool'))
The output will be: 'Dave Is AFK Right Now! Cool'
I agree that the regex solution is the easiest, but I wouldn't say it's the most pythonic.
How about:
text = 'WordWordWord'
new_text = ''
for i, letter in enumerate(text):
    if i and letter.isupper():
        new_text += ' '
    new_text += letter
I think regexes are the way to go here, but just to give a pure python version without (hopefully) any of the problems ΤΖΩΤΖΙΟΥ has pointed out:
def splitCaps(s):
    result = []
    for ch, nxt in window(s+" ", 2):
        result.append(ch)
        if nxt.isupper() and not ch.isspace():
            result.append(' ')
    return ''.join(result)
window() is a utility function I use to operate on a sliding window of items, defined as:
import collections, itertools
def window(it, winsize, step=1):
    it=iter(it) # Ensure we have an iterator
    l=collections.deque(itertools.islice(it, winsize))
    while True: # Continue until the input is exhausted.
        yield tuple(l)
        for i in range(step):
            try:
                l.append(next(it))
            except StopIteration:
                return # In Python 3.7+ a generator must not leak StopIteration.
            l.popleft()
Adding to this old thread, since I wanted to try another option for one of my own requirements. re.sub() is of course the neat solution, but here is a one-liner for when the re module isn't (or shouldn't be) imported.
st = 'ThisIsTextStringToSplitWithSpace'
print(''.join([' '+ s if s.isupper() else s for s in st]).lstrip())
