How to partial search for words using regex python

How to partial search for words using regex python - python

I want to get all 'xlsx' files that somewhere have 'feedback report' in them. I want to make this filter very strong. So any partial matches like 'feedback_report', 'feedback report', 'Feedback Report' should all return true.
Example file names :
ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx
ZL-SA_feedback report_012844.xlsx
ASARanem-SA_Feedback Report_012844.xlsx
A futile attempt below.
regex = re.compile(r"[a-zA-Z0-0]*[fF][eE][eE][dD][bB][aA][cC][kK]\s[rR][eE][pP][oO][rR][tT][a-zA-Z0-0]*.xlsx")

This will work:
re.search("(feedback)(.*?|\s)(report)",string,re.IGNORECASE)
Tested it on the following input list with the code
import re
a=["ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx",
"ZL-SA_feedback report_012844.xlsx",
"ASARanem-SA_Feedback Report_012844.xlsx",
"some report",
"feedback-report"]
for i in a:
print(re.search("(feedback)(.*?|\s)(report)",i,re.IGNORECASE))
the output as expected by OP from the same is:
<_sre.SRE_Match object; span=(21, 36), match='FEEDBACK_REPORT'>
<_sre.SRE_Match object; span=(6, 21), match='feedback report'>
<_sre.SRE_Match object; span=(12, 27), match='Feedback Report'>
None
<_sre.SRE_Match object; span=(0, 15), match='feedback-report'>

Your regex is nearly acceptable, but the beginning and ending portions will not match correctly because you have underscores in your examples. I'm not sure how representative these are of your actual data but to match what you have here you would need:
regex = re.compile(r"[a-zA-Z0-0\_\-\s]*(feedback)[\s\_\-](report)[a-zA-Z0-0\_\-\s]*.xlsx",
flags = re.IGNORECASE)
Another thing you should probably be careful of is to make sure you're actually working with just the file name and not the file path because in that case you'd have to worry about \ and / characters. Also note that I'm only matching for the exact characters I noticed you were missing. You may want to try
regex = re.compile(r"*(feedback)*(report)*.xlsx", flags = re.IGNORECASE)
but, again, I'm not sure what your data actually looks like. Hope this helps

First of all, lowercase file names in order to minimize the number of possible options
regex = re.compile('feedback.{0,3}report.*\.xlsx?', flags=re.IGNORECASE)
looks for 'feedback', next up to 3 whatever characters, next 'report', and whatever again, ending with a dot and xls or xlsx extension
or just
filename = 'ZL-SA_feedback report_012844.xlsx'
matched = re.search('feedback.{0,3}report.*\.xlsx?', filename.lower())
Also you can use python glob module to search files in linux fashion:
import glob
glob.glob('*[fF][eE][dD][bB][aA][cC][kK]*[rR][eE][pP][oO][rR][tT]*.xlsx')

Could you use just string methods like the following?
'feedbackreport' in name.replace('_', '').replace(' ', '').lower()
And also
name.endswith('.xlsx')
Giving you something like:
fileList = [
'ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx',
'ZL-SA_feedback report_012844.xlsx',
'ASARanem-SA_Feedback Report_012844.xlsx'
]
fileNames = [name for name in fileList
if ('feedbackreport' in name.replace('_', '').replace(' ', '').lower()
and name.endswith('.xlsx'))]
If there are more characters that could cause problems such as - then you could also make a quick function to remove bad characters:
def remove_bad_chars(string, chars):
for char in chars:
string = string.replace(char, '')
return string
Amending the appropriate portion of the if statement to:
if 'feedbackreport' in remove_bad_chars(name, '.,?!\'-/:;()"\\~ ').lower()
# included a white space in the string of bad characters

I used this for my string based on all your suggestions. This works for me in 99% of the cases.
regex = re.compile(r"[a-zA-Z0-9\_\-\s]*(feedback)(\s|\_)(report)s?[a-zA-Z0-9\_\-\s]*.xlsx",flags = re.IGNORECASE)

Related

Replacing when a word is in another word but with special circumstances

My program replaces tokens with values when they are in a file. When reading in a certain line it gets stuck here is an example:
1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a
The two tokens in the example are Token100 and Token100a. I need a way to only replace Token100 with its data and not replace Token100a with Token100's data with an a afterwards. I can't look for spaces before and after because sometimes they are in the middle of lines. Any thoughts are appreciated. Thanks.

You can use regex:
import re
line = "1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a"
match = re.sub("Token100a", "data", line)
print(match)
Outputs:
1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1data
More about regex here:
https://www.w3schools.com/python/python_regex.asp

You can use a regular expression with a negative lookahead to ensure that the following character is not an "a":
>>> import re
>>> test = '1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a'
>>> re.sub(r'Token100(?!a)', 'data', test)
'1.1.1.1.1.1.1.1.1.1 data.1 1.1.1.1.1.1.1Token100a'

Why regex is not working?

I need to replace all occurences of normal whitespaces in «статья 1», «статьи 2» etc. with non-breaking spaces.
The construction below works fine:
re.sub('(стат.{0,4}) (\d+)', r'\1 \2', text) # 'r' in repl is important, otherwise the word is not replaced correctly, at least for texts in Russian.
however, I do not want to repeatedly use re.sub for «статья», then for «пункт», then for the names of months, I want to have a dictionary with regex expressions and replacements. Here's my code, but it does not work as expected: 'статья 1 статьи 2' should look like 'статья(non-breaking space here)1 статьи(non-breaking space here)2':
import re
text = 'статья 1 статьи 2'
dic = {'(cтат.{0,4}) (\d+)' : r'\1 \2'}
def replace():
global text
final_text = ''
for i in dic:
new_text = re.sub(str(i), str(dic[i]), text)
text = new_text
return text
print (replace())

The problem is that you copied and pasted wrong.
This pattern works:
'(стат.{0,4}) (\d+)'
This one doesn't:
'(cтат.{0,4}) (\d+)'
Why? Because in the first one, and in your search string, that first character is a U+0441, a Cyrillic small Es. But in the second one, it's a U+0063, a Latin small C. Of course the two look identical in most fonts, but they're not the same character.
So, how can you tell? Well, when I suspected this problem, here's what I did:
>>> a = '(стат.{0,4}) (\d+)' # copied and pasted from your working code
>>> b = '(cтат.{0,4}) (\d+)' # copied and pasted from your broken code
>>> print(a.encode('unicode-escape').decode('ascii'))
(\u0441\u0442\u0430\u0442.{0,4}) (\\d+)
>>> print(b.encode('unicode-escape').decode('ascii'))
(c\u0442\u0430\u0442.{0,4}) (\\d+)
And the difference is obvious: the first one has a \u0441 escape sequence where the second one has a plain ASCII c.

Parsing and reformatting CSV/text data using Python

sorry if this a bit of a beginner's question, but I haven't had much experience with python, and could really use some help in figuring this out. If there is a better programming language for tackling this, I'd be more than open to hearing it
I'm working on a small project, and I have two blocks of data, formatted differently from each other. They're all spreadsheets saved as CSV files, and I'd really like to make one group match the other without having to manually edit all the data.
What I need to do is go through a CSV, and format any data saved like this:
10W
20E
15-16N
17-18S
To a format like this (respective line to respective format):
10,W
20,E
,,15,16,N
,,17,18,S
So that they can just be copied over when opened as spreadsheets
I'm able to get the files into a string in python, but I'm unsure of how to properly write something to search for a number-hyphen-number-letter format.
I'd be immensely grateful for any help I can get. Thanks

This sounds like a good use-case for regular expressions. Once you've split the lines up into individual strings and stripped the whitespace (using s.strip()) these should work (I'm assuming those are cardinal directions; you'll need to change [NESW] to something else if that assumption is incorrect.):
>>> import re
>>> re.findall('\A(\d+)([NESW])', '16N')
[('16', 'N')]
>>> re.findall('\A(\d+)([NESW])', '15-16N')
[]
>>> re.findall('\A(\d+)-(\d+)([NESW])', '15-16N')
[('15', '16', 'N')]
>>> re.findall('\A(\d+)-(\d+)([NESW])', '16N')
[]
The first regex '\A(\d+)([NESW])' matches only a string that begins with a sequence of digits followed by a capital letter N, E, S, or W. The second matches only a string that begins with a sequence of digits followed by a hyphen, followed by another sequence of digits, followed by a capital letter N, E, S, or W. Forcing it to match at the beginning ensures that these regexes don't match a suffix of a longer string.
Then you can do something like this:
>>> vals = re.findall('\A(\d+)([NESW])', '16N')[0]
>>> ','.join(vals)
'16,N'
>>> vals = re.findall('(\d+)-(\d+)([NESW])', '15-16N')[0]
>>> ',,' + ','.join(vals)
',,15,16,N'

This is a whole solution that uses regexs. #senderle has beat me to the answer, so feel free to tick his response. This is just added here as I know how difficult it was to wrap my head around re in my code at first.
import re
dash = re.compile('(\d{2})-(\d{2})([WENS])')
no_dash = re.compile( '(\d{2})([WENS])' )
raw = '''10W
20E
15-16N
17-18S'''
lines = raw.split('\n')
data = []
for l in lines:
if '-' in l:
match = re.search(dash, l).groups()
data.append( ',,%s,%s,%s' % (match[0], match[1], match[2] ) )
else:
match = re.search(no_dash, l).groups()
data.append( '%s,%s' % (match[0], match[1] ) )
print '\n'.join(data)

In your case, I think the quick solution would involve regexps
You can either use the match method to extract your different tokens when they match a given regular expression, or the split method to split your string into tokens given a separator.
However, in your case, the separator would be a single character, so you can use the split method from the str class.

Python : How to ignore a delimited part of a sentence?

I have the following line :
CommonSettingsMandatory = #<Import Project="[\\.]*Shared(\\vc10\\|\\)CommonSettings\.targets," />#,true
and i want the following output:
['commonsettingsmandatory', '<Import Project="[\\\\.]*Shared(\\\\vc10\\\\|\\\\)CommonSettings\\.targets," />', 'true'
If i do a simple regex with the comma, it will split the value if there's a value in it, like i wrote a comma after targets, it will split here.
So i want to ignore the text between the ## to make sure there's no splitting there.
I really don't know how to do!

http://docs.python.org/library/re.html#re.split
import re
string = 'CommonSettingsMandatory = #toto,tata#, true'
splitlist = re.split('\s?=\s?#(.*?)#,\s?', string)
Then splitlist contains ['CommonSettingsMandatory', 'toto,tata', 'true'].

While you might be able to use split with a lookbehind, I would use the groups captured by this expression.
(\S+)\s*=\s*##([^#]+)##,\s*(.*)
m = re.Search(expression, myString). use m.group(1) for the first string, m.group(2) for the second, etc.

If I understand you correctly, you're trying to split the string using spaces as delimiters, but you want to also remove any text between pound signs?
If that's correct, why not simply remove the pound sign-delimited text before splitting the string?
import re
myString = re.sub(r'#.*?#', '', myString)
myArray = myString.split(' ')
EDIT: (based on revised question)
import re
myArray = re.findall(r'^(.*?) = #(.*?)#,(.*?)$', myString)
That will actually return an array of tuples including your matches, in the form of:
[
(
'commonsettingsmandatory',
'<Import Project="[\\\\.]*Shared(\\\\vc10\\\\|\\\\)CommonSettings\\.targets," />',
'true'
)
]
(spacing added to illustrate the format better)

Python regex: Turn "ThisFileName.txt" into "This File Name.txt"

I'm trying to add a space before every capital letter, except the first one.
Here's what I have so far, and the output I'm getting:
>>> tex = "ThisFileName.txt"
>>> re.sub('[A-Z].', ' ', tex)
' his ile ame.txt'
I want:
'This File Name.txt'
(It'd be nice if I could also get rid of .txt, but I can do that in a separate operation.)

Key concept here is backreferences in regular expressions:
import re
text = "ThisFileName.txt"
print re.sub('([a-z])([A-Z])', r'\1 \2', text)
# Prints: "This File Name.txt"
For pulling off the '.txt' in a reliable way, I recommend os.path.splitext()
import os
filename = "ThisFileName.txt"
print os.path.splitext(filename)
# Prints: ('ThisFileName', '.txt')

Another possible regular expression using a look behind:
(?<!^)([A-Z])

re.sub('([a-z])([A-Z])', '\\1 \\2', 'TheFileName.txt')
EDIT: StackOverflow eats some \s, when not in 'code mode'... Because I forgot to add a newline after the code above, it was not interpreted in 'code mode' :-((. Since I added that text here I didn't have to change anything and it's correct now.

It is not clear what you want to do if the filename is Hello123There.txt. So, if you want a space before all capital letters regardless of what precedes them, you can:
import re
def add_space_before_caps(text):
"Add a space before all caps except at start of text"
return re.sub(r"(?<!^)(?=[A-Z])", " ", text)
>>> add_space_before_caps("Hello123ThereIBM.txt")
'Hello123 There I B M.txt'

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to partial search for words using regex python - python

I used this for my string based on all your suggestions. This works for me in 99% of the cases. regex = re.compile(r"[a-zA-Z0-9\_\-\s](feedback)(\s|\_)(report)s?[a-zA-Z0-9\_\-\s].xlsx",flags = re.IGNORECASE)

Related

Replacing when a word is in another word but with special circumstances

Why regex is not working?

Parsing and reformatting CSV/text data using Python

Python : How to ignore a delimited part of a sentence?

Python regex: Turn "ThisFileName.txt" into "This File Name.txt"

Categories

Resources

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to partial search for words using regex python - python

I used this for my string based on all your suggestions. This works for me in 99% of the cases. regex = re.compile(r"[a-zA-Z0-9\_\-\s]*(feedback)(\s|\_)(report)s?[a-zA-Z0-9\_\-\s]*.xlsx",flags = re.IGNORECASE)

Related

Replacing when a word is in another word but with special circumstances

Why regex is not working?

Parsing and reformatting CSV/text data using Python

Python : How to ignore a delimited part of a sentence?

Python regex: Turn "ThisFileName.txt" into "This File Name.txt"

Categories

Resources

I used this for my string based on all your suggestions. This works for me in 99% of the cases. regex = re.compile(r"[a-zA-Z0-9\_\-\s](feedback)(\s|\_)(report)s?[a-zA-Z0-9\_\-\s].xlsx",flags = re.IGNORECASE)