automating regex to process multiple files


I'm trying to process some data. Specifically, I have to:
Delete any decimals from all numbers in the file, e.g. 4.0 -> 4
Add a dash between any dates and any times, e.g. 2014-01-01 23:45:52 -> 2014-01-01-23:45:52
I've written some regexes in Sublime Text to do this using the find and replace function:
Find : "\.\d", Replace : ""
Find : "(\d{2})\s(\d)", Replace : "$1-$2"
This all works fine and gives me the right results. The problem is that I have to process hundreds of CSV files in this way. I've tried to do it in Python, but it isn't working the way I'd expect. Here's the code I used:
for file in csv_list:  # csv_list is the list of all the files I need to process
    with open(file, "r") as infile:
        with open("{}EDIT.csv".format(file.split(".")[0]), "w", newline="") as outfile:  # Save the processed version
            writer = csv.writer(outfile, delimiter=",")
            reader = csv.reader(infile)
            for line in reader:
                writer.writerow([re.sub("(\d{2})\s(\d)",
                                        "$1-$2", re.sub("\.\d", "", string)) for string in line])
I'm not too confident with regex, so I can't see why this isn't working the way I'd expect. If anyone could help me out that'd be great. Thanks in advance!
As requested, here is an input row, what output I was expecting, and what the actual output is:
input : 0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active
desired output : 0,2013-01-01-20:59:39,5737,english,2013-01-01-21:01:07,active
actual output : 0, 2013-01-$1-$20:59:39,5737,english,2013-01-$1-$21:01:07

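You can solve your issue by changing the replacement string of the outer re.sub call from "$1-$2" to r"\1-\2". Python's re module uses \1-style backreferences in replacements, not the $1 syntax used by Sublime Text: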
import re
rx = r"(\d{2})\s(\d)"
s = "0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active"
result = re.sub(rx, r"\1-\2", re.sub(r"\.\d", "", s))
print(result)
From the re.sub reference:
Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.
Or, to avoid that fuss with string replacement backreferences, use a single regex for that task and modify the matches inside a lambda expression:
import re
pat = r"\.\d|(\d{2})\s(\d)"
s = "0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active"
result = re.sub(pat, lambda m: r"{}-{}".format(m.group(1),m.group(2)) if m.group(1) else "", s)
print (result)
Note that perhaps, for better safety, you could use r'\.\d+\b' as the pattern to remove decimal parts (\d+ matches one or more digits, and \b requires a char other than letter, digit or _ after it, or the end of string). The second pattern can be spelled out for the same purpose as r'(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})'.
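Putting these pieces together for the original goal of processing many files, a minimal sketch might look like the following. The helper names clean_field and clean_csv are hypothetical and chosen only for illustration; the two patterns are the stricter ones suggested above.

```python
import csv
import re

# Stricter patterns from the note above, compiled once and reused.
DECIMAL = re.compile(r"\.\d+\b")
DATE_TIME = re.compile(r"(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})")


def clean_field(value):
    """Drop decimal parts, then join date and time with a dash."""
    value = DECIMAL.sub("", value)
    return DATE_TIME.sub(r"\1-\2", value)


def clean_csv(src, dst):
    """Copy the CSV at src to dst, cleaning every field on the way."""
    with open(src, newline="") as infile, open(dst, "w", newline="") as outfile:
        writer = csv.writer(outfile)
        for row in csv.reader(infile):
            writer.writerow([clean_field(field) for field in row])


# Quick check on the sample row from the question.
row = "0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active".split(",")
print(",".join(clean_field(field) for field in row))
```

To process a whole list of files, one could call clean_csv(name, name.split(".")[0] + "EDIT.csv") for each name in csv_list, mirroring the naming scheme used in the question.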

Related

Search for sentences containing characters using Python regular expressions

I am searching for sentences containing characters using Python regular expressions.
But I can't find the sentence I want.
Please help me
regex.py
opfile = open('file.txt', 'r')
contents = opfile.read()
opfile.close()
index = re.findall(r'\[start file\](?:.|\n)*\[end file\]', contents)
item = re.search(r'age.*', str(index))
file.txt(example)
[start file]
name: steve
age: 23
[end file]
result
<re.Match object; span=(94, 738), match='age: >
The age is not printed

There are several issues here:
str(index) returns the string representation of the whole list (brackets, quotes and escaped \n sequences included), which makes the result hard to process further.
(?:.|\n)* is a very resource-consuming construct; use a plain . together with the re.S (re.DOTALL) flag instead.
If you plan to find a single match, use re.search, not re.findall.
Here is a possible solution:
match = re.search(r'\[start file].*\[end file]', contents, re.S)
if match:
    match2 = re.search(r"\bage:\s*(\d+)", match.group())
    if match2:
        print(match2.group(1))
Output:
23
If you want the age: label included in the output, use match2.group(), which returns the whole match.
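To see why the str(index) step in the question misbehaves, the following sketch may help. It uses an inline sample string in place of file.txt (an assumption made so the snippet is self-contained) and compares the question's approach with the two-step search above.

```python
import re

# Inline stand-in for the contents of file.txt.
contents = "[start file]\nname: steve\nage: 23\n[end file]\n"

# Question's approach: findall returns a list, str() turns it into its repr.
index = re.findall(r"\[start file\](?:.|\n)*\[end file\]", contents)
as_text = str(index)

# In the repr, each newline became the two characters backslash + n,
# so "." is free to run all the way to the end of the string.
greedy = re.search(r"age.*", as_text).group()
print(greedy)

# Two-step approach: search the block, then search inside the block.
block = re.search(r"\[start file].*\[end file]", contents, re.S).group()
age = re.search(r"\bage:\s*(\d+)", block).group(1)
print(age)
```

Because the real newlines are gone after str(), age.* is no longer stopped at the end of the line, which is consistent with the oversized match span reported in the question.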

If you want to match the age only once between the start and end file markers, you could use a single pattern with a capture group and in between match all lines that do not start with age: or the start or end marker.
^\[start file](?:\n(?!age:|\[(?:start|end) file]).*)*\nage: (\d+)(?:\n(?!\[(?:start|end) file]).*)*\n\[end file]
Example
import re
regex = r"^\[start file](?:\n(?!age:|\[(?:start|end) file]).*)*\nage: (\d+)(?:\n(?!\[(?:start|end) file]).*)*\n\[end file]"
s = ("[start file]\n" "name: steve \n" "age: 23\n" "[end file]")
m = re.search(regex, s)
if m:
    print(m.group(1))
Output
23

The example input looks like a list of key, value pairs enclosed between some start/end markers. For this use-case, it might be more efficient and readable to write the parsing stage as:
re.search to locate the document
splitlines() to isolate individual records
split() to extract the key and value of each record
Then, in a second step, access the extracted records.
This separates the parsing step from the code that uses the data, which makes the code easier to maintain.
Additionally, a good practice is to wrap access to a file in a "context manager" (the with statement) to guarantee all resources are correctly cleaned on error.
Here is a full standalone example:
import re

# 1: Load the raw data from disk, in a context manager
with open('/tmp/file.txt') as f:
    contents = f.read()

# 2: Parse the raw data
fields = {}
if match := re.search(r'\[start file\]\n(.*)\[end file\]', contents, re.S):
    for line in match.group(1).splitlines():
        k, v = line.split(':', 1)
        fields[k.strip()] = v.strip()

# 3: Actual data exploitation
print(fields['age'])

How can I find some words in files with regex?

I have many files and need to categorize them by the words that appear in them.
ex) [..murder..murderAttempted..] or [murder, murderAttempted] etc.
I tried this code, but not all matches came out. I want to find "murder" and "murderAttempted" in files when they are surrounded by "[ ]".
def func(root_dir):
    for files in os.listdir(root_dir):
        pattern = r'\[.+murder.+murderAttempted.+'
        if "txt" in files:
            f = open(root_dir + files, 'rt', encoding='UTF8')
            for i, line in enumerate(f):
                for match in re.finditer(pattern, line):
                    print(match.group())

This appears to work for me: use pattern = r'\[.*murder.*murderAttempted.*\]' instead of pattern = r'\[.+murder.+murderAttempted.+'. I believe it returns every bracketed string that contains both "murder" and "murderAttempted". The + quantifier requires one or more characters, whereas * also allows zero, so [murder, murderAttempted] (with nothing between [ and murder) now matches. Also note the added \] at the end, which ensures you only capture strings that are enclosed in brackets.
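A quick way to compare the two patterns is to run both against a few sample lines. The lines below are made up for illustration and are not taken from the asker's files.

```python
import re

old = re.compile(r"\[.+murder.+murderAttempted.+")
new = re.compile(r"\[.*murder.*murderAttempted.*\]")

# Hypothetical sample lines.
lines = [
    "tags: [murder, murderAttempted]",
    "tags: [..murder..murderAttempted..]",
    "[murder] and murderAttempted outside",
]

for line in lines:
    old_hit = old.search(line)
    new_hit = new.search(line)
    print(repr(line),
          "old:", old_hit.group() if old_hit else None,
          "new:", new_hit.group() if new_hit else None)
```

The first line is the case the original pattern misses, because .+ needs at least one character between [ and murder. The third line shows the effect of the closing \]: without a bracket after murderAttempted, the new pattern does not match.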

How to find/replace non printable / non-ascii characters using Python 3?

Some lines in a .csv file are jamming up a database import because of funky characters in one of the fields.
I have searched, found articles on how to replace non-ascii characters in Python 3, but nothing works.
When I open the file in vi and do :set list, there is a $ at the end of a line where there should not be, and ^I^I at the beginning of the next line. The two lines should be one joined line and no ^I there. I know that $ is end of line '\n' and have tried to replace those, but nothing works.
I don't know what the ^I represents, possibly a tab.
I have tried this function to no avail:
def remove_non_ascii(text):
    new_text = re.sub(r"[\n\t\r]", "", text)
    new_text = ''.join(new_text.split("\n"))
    new_text = ''.join([i if ord(i) < 128 else ' ' for i in new_text])
    new_text = "".join([x for x in new_text if ord(x) < 128])
    new_text = re.sub(r'[^\x00-\x7F]+', ' ', new_text)
    new_text = new_text.rstrip('\r\n')
    new_text = new_text.strip('\n')
    new_text = new_text.strip('\r')
    new_text = new_text.strip('\t')
    new_text = new_text.replace('\n', '')
    new_text = new_text.replace('\r', '')
    new_text = new_text.replace('\t', '')
    new_text = filter(lambda x: x in string.printable, new_text)
    new_text = "".join(list(new_text))
    return new_text
Is there some tool that will show me exactly what this offending character is, and then a method to replace it?
I am opening the file like so (the .csv was saved as UTF-8)
f_csv_in = open(csv_in, "r", encoding="utf-8")
Below are two lines that should be one with the problem non-ascii characters visible.
These two lines should be one line. Notice the $ at the end of line 37, and line 38 begins with ^I^I.
Part of the problem, that vi is showing, is that there is a new line $ on line 37 where I don't want it to be. This should be one line.
37 Cancelled,01-19-17,,basket,00-00-00,00-00-00,,,,98533,SingleSource,,,17035 Cherry Hill Dr,"L/o 1-19-17 # 11:45am$
38 ^I^IVictorville",SAN BERNARDINO,CA,92395,,,,,0,,,,,Lock:6111 ,,,No,No,,0.00,0.00,No,01-19-17,0.00,0.00,,01-19-17,00-00-00,,provider,,,Unread,00-00-00,,$

A simple way to remove non-ASCII characters (str.isascii() requires Python 3.7 or later) is:
new_text = "".join([c for c in text if c.isascii()])
NB: If you are reading this text from a file, make sure you read it with the correct encoding

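For non-printable characters, Python offers a few built-in ways to filter: str.isprintable() tells you whether a string contains only printable characters, and string.printable (from the string module) lists the printable ASCII characters (including whitespace such as \t and \n, so filtering against it keeps those).
A concise way of filtering the whole string at once is presented below: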
>>> import string
>>>
>>> str1 = '\nsomestring'
>>> str1.isprintable()
False
>>> str2 = 'otherstring'
>>> str2.isprintable()
True
>>>
>>> res = filter(lambda x: x in string.printable, '\x01mystring')
>>> "".join(list(res))
'mystring'
This question has had some discussion on SO in the past, but there are many ways to do things, so I understand it may be confusing, since you can use anything from Regular Expressions to str.translate()
Another thing one could do is to take a look at Unicode Categories, and filter out your data based on the set of symbols you need.
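As a rough illustration of the Unicode-category idea, and of a way to see exactly which characters are causing trouble, the standard unicodedata module can be used. The helper names describe_suspects and strip_control are hypothetical, and the sample string is made up to resemble the broken field in the question.

```python
import unicodedata


def describe_suspects(text):
    """List (index, repr, category) for characters in the 'C' (other) categories."""
    return [
        (i, repr(ch), unicodedata.category(ch))
        for i, ch in enumerate(text)
        if unicodedata.category(ch).startswith("C")
    ]


def strip_control(text):
    """Drop control (Cc), format (Cf) and other 'C' category characters."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("C")
    )


# Made-up sample: a newline, two tabs and a zero-width space.
sample = "11:45am\n\t\tVictorville\u200b"

for suspect in describe_suspects(sample):
    print(suspect)
print(strip_control(sample))
```

Filtering on the category prefix "C" is one possible choice; a different set of categories may suit data where, for example, tabs should be preserved.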

It looks as if you have a CSV file that contains quoted values, that is, values containing embedded commas or newlines, which have to be surrounded with quotes so that CSV readers handle them correctly.
If you look at the example data you can see there's an opening doublequote but no closing doublequote at the end of the first line, and a closing doublequote with no opening doublequote on the second line, indicating that the quotes contain a value with an embedded newline.
The fact that the lines are broken in two may be an artefact of the application used to view them, or the code that's processing them: if the software doesn't understand csv quoting it will assume each newline character denotes a new line.
It's not clear exactly what problem this is causing in the database, but it's quite likely that quote characters - especially unmatched quotes - could be causing a problem, particularly if the data isn't being properly escaped before insertion.
This snippet rewrites the file, removing embedded commas, newlines and tabs, and instructs the writer not to quote any values. It will fail with the error message _csv.Error: need to escape, but no escapechar set if it finds a value that needs to be escaped. Depending on your data, you may need to adjust the regex pattern.
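import csv
import re

# newline='' lets the csv module handle line endings itself
with open('lines.csv', newline='') as f, open('fixed.csv', 'w', newline='') as out:
    reader = csv.reader(f)
    writer = csv.writer(out, quoting=csv.QUOTE_NONE)
    for line in reader:
        # Replace embedded tabs, newlines and commas with a space
        new_row = [re.sub(r'[\t\r\n,]', ' ', x) for x in line]
        writer.writerow(new_row)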
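The effect of CSV quoting on the two physical lines can be checked without touching any files. The sketch below feeds a shortened, made-up version of the row from the question to csv.reader through io.StringIO, then flattens the embedded line break.

```python
import csv
import io
import re

# Shortened, made-up version of the two physical lines from the question.
raw = 'Cancelled,01-19-17,"L/o 1-19-17 # 11:45am\n\t\tVictorville",SAN BERNARDINO,CA\n'

# A quoting-aware reader sees a single record with five fields.
rows = list(csv.reader(io.StringIO(raw)))
print(len(rows), len(rows[0]))

# Collapse the embedded newline and tabs into one space.
cleaned = [re.sub(r"[\t\r\n]+", " ", field) for field in rows[0]]

# Write the record back; nothing needs quoting any more.
out = io.StringIO()
csv.writer(out).writerow(cleaned)
print(out.getvalue())
```

This suggests the file itself may be valid CSV, and that the problem lies in a tool that splits on newlines without honouring the quotes.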

Another approach uses re to filter out everything that is not a printable ASCII character:
import re
import string
string_with_printable = re.sub(f'[^{re.escape(string.printable)}]', '', original_string)
re.escape escapes special characters in the given pattern.

Regular expression, trimming after a particular sign and neglecting the list terms which do not have that sign

import re

file = open('SMSm.txt', 'r')
file2 = open('SMSw.txt', 'w')
debited = []
for line in file.readlines():
    if 'debited with' in line:
        a = re.findall(r'[INR]\S*', line)
        debited.append(a)
        file2.write(line)
print(re.findall(r'^(.*?)(=)?$', debited))
My output is [['INR 2,000=2E00'], ['INR 12,000=2E400', 'NFS*Cash'], ['INR 2,000=2E0d0']]
I only want the digits after INR. For example ['INR 2,000','INR 12000','INR 2000']. What changes shall I make in the regular expression?
I have tried using str(debited) but it didn't work out.

You can use a simple regex matching INR + whitespace if any + any digits with , as separator:
import re
s = "[['INR 2,000=2E00']['INR 12,000=2E400', 'NFS*Cash']['INR 2,000=2E0d0']]"
t = re.findall(r"INR\s*(\d+(?:,\d+)*)", s)
print(t)
# Result: ['2,000', '12,000', '2,000']
With findall, all captured texts will be output as a list.
If you want INR as part of the output, just remove the capturing round brackets from the pattern: r"INR\s*\d+(?:,\d+)*".
UPDATE
Just tried out a non-regex approach (a bit error prone if there are entries with no =), here it is:
t = [x[0:x.find("=")].strip("'") for x in s.strip("[]").replace("][", "?").split("?")]
print(t)

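Given the code you already have, the simplest change is to make the extracted string end just before the equals sign. Just replace this line
a= re.findall(r'[INR]\S*', line)
with this:
a= re.findall(r'[INR][^\s=]*', line)
Note that [INR] is a character class matching a single I, N or R rather than the literal text INR, so this pattern still picks up tokens such as NFS*Cash. Write INR without the brackets if you only want those entries.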
re.sub in Python 3.3

I am trying to change the text string from the form of file1 to file01. I am really new to python and can't figure out what should go in 'repl' location when trying to use a pattern. Can anyone give me a hand?
text = 'file1 file2 file3'
x = re.sub(r'file[1-9]',r'file\0\w',text) #I'm not sure what should go in repl.

You could try this:
>>> import re
>>> text = 'file1 file2 file3'
>>> x = re.sub(r'file([1-9])',r'file0\1',text)
>>> x
'file01 file02 file03'
The parentheses around [1-9] capture the digit as group 1. The replacement refers to it with \1, meaning the first captured group.
Also, if you don't want to add the zero for files with 2 digits or more, you could require whitespace or the end of the string after the digit:
x = re.sub(r'file([1-9](\s|$))',r'file0\1',text)
A bit more of a generic solution now that I'm revisiting this answer using str.format() and a lambda expression:
import re
fmt = '{:03d}' # Let's say we want 3 digits with leading zeroes
s = 'file1 file2 file3 text40'
result = re.sub(r"([A-Za-z_]+)([0-9]+)",
                lambda x: x.group(1) + fmt.format(int(x.group(2))),
                s)
print(result)
# 'file001 file002 file003 text040'
A bit of details about the lambda expression:
lambda x: x.group(1) + fmt.format(int(x.group(2)))
#         ^--------^   ^-^        ^-------------^
#         filename     format     file number ([0-9]+) converted to int
#         ([A-Za-z_]+)            so format() can work with our format
I am using the expression [A-Za-z_]+ on the assumption that the filename contains only letters and underscores besides the trailing digits. Pick a more appropriate expression if required.

To match files with single digit on the end, use a word boundary \b:
>>> text = ' '.join('file{}'.format(i) for i in range(12))
>>> text
'file0 file1 file2 file3 file4 file5 file6 file7 file8 file9 file10 file11'
>>> import re
>>> re.sub(r'file(\d)\b',r'file0\1',text)
'file00 file01 file02 file03 file04 file05 file06 file07 file08 file09 file10 file11'
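One caveat with \b that may matter for real filenames: an underscore counts as a word character, so there is no boundary between a digit and a following underscore. The sketch below compares \b with a negative lookahead (?!\d), which is offered here as an alternative and is not part of the original answer; the sample names are made up.

```python
import re

# Hypothetical names, including one with an underscore after the digit.
text = "file1 file2_old file10 file3.txt"

# \b needs a non-word character (or the end) after the digit.
with_boundary = re.sub(r"file(\d)\b", r"file0\1", text)
print(with_boundary)

# (?!\d) only requires that the next character is not another digit.
with_lookahead = re.sub(r"file(\d)(?!\d)", r"file0\1", text)
print(with_lookahead)
```

Both variants leave file10 alone; they differ only on file2_old, which the lookahead version pads and the \b version skips.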

It's also possible to use \D|$ to check what follows the first digit after file: the zero is inserted only when the digit is followed by a non-digit or the end of the string.
The following code achieves this:
import re
text = 'file1 file2 file3 file4 file11 file22 file33 file1'
x = re.sub(r'file([0-9](\D|$))',r'file0\1',text)
print(x)

You could use groups to capture the parts that you wish to keep, then use those groups in the replacement text.
x = re.sub(r'file([1-9])',r'file0\1',text)
The matching group is created by including ( ) in the regex search. You can then refer to it in the replacement by its number, \1 in this case since we want the first group inserted.

I believe the following will help you. It only inserts a '0' where there is a single digit after 'file' (thanks to the word boundary \b):
import re
text = 'file1 file2 file3'
textwithzeros = re.sub(r'file(\d)\b', r'file0\1', text)
'textwithzeros' should now be a new version of the 'text' string with '0' before each single-digit number. Try it out!
