I am trying to change the text string from the form of file1 to file01. I am really new to python and can't figure out what should go in 'repl' location when trying to use a pattern. Can anyone give me a hand?
text = 'file1 file2 file3'
x = re.sub(r'file[1-9]',r'file\0\w',text) #I'm not sure what should go in repl.
You could try this:
>>> import re
>>> text = 'file1 file2 file3'
>>> x = re.sub(r'file([1-9])',r'file0\1',text)
'file01 file02 file03'
The brackets wrapped around the [1-9] captures the match, and it is the first match. You will see I used it in the replace using \1 meaning the first catch in the match.
Also, if you don't want to add the zero for files with 2 digits or more, you could add [^\d] in the regexp:
x = re.sub(r'file([1-9](\s|$))',r'file0\1',text)
A bit more of a generic solution now that I'm revisiting this answer using str.format() and a lambda expression:
import re
fmt = '{:03d}' # Let's say we want 3 digits with leading zeroes
s = 'file1 file2 file3 text40'
result = re.sub(r"([A-Za-z_]+)([0-9]+)", \
lambda x: x.group(1) + fmt.format(int(x.group(2))), \
s)
print(result)
# 'file001 file002 file003 text040'
A bit of details about the lambda expression:
lambda x: x.group(1) + fmt.format(int(x.group(2)))
# ^--------^ ^-^ ^-------------^
# filename format file number ([0-9]+) converted to int
# ([A-Za-z_]+) so format() can work with our format
I am using the expression [A-Za-z_]+ assuming the filename contains letters and underscores only besides the training digits. Do pick a more appropriate expression if required.
To match files with single digit on the end, use a word boundary \b:
>>> text = ' '.join('file{}'.format(i) for i in range(12))
>>> text
'file0 file1 file2 file3 file4 file5 file6 file7 file8 file9 file10 file11'
>>> import re
>>> re.sub(r'file(\d)\b',r'file0\1',text)
'file00 file01 file02 file03 file04 file05 file06 file07 file08 file09 file10 file11'
its also possible to use \D|$ while checking for two digits presence with file, which decides whether to replace file to file0 or not
the following code will also help to achieve the required.
import re
text = 'file1 file2 file3 file4 file11 file22 file33 file1'
x = re.sub(r'file([0-9] (\D|$))',r'file0\1',text)
print(x)
You could use groups to capture the parts that you wish to keep, then use those groups in the replacement text.
x = re.sub(r'file([1-9])',r'file0\1',text)
The matching group is created by including ( ) in the regex search. You can then use it with \group, or \1 in this case since we want the first group inserted.
I believe the following will help you. It is beneficial in that it will only insert a '0' where there is a single digit after 'file' (via boundary ['\b'] special character inclusion):
text = 'file1 file2 file3'
findallfile = re.findall(r'file\d\b', text)
for instance in findallfile:
textwithzeros = re.sub('file', 'file0', text)
'textwithzeros' should now be a new version of the 'text' string with '0' before each number. Try it out!
Related
I have the following code that tries to retrieve the name of a file from a directory based on a double \ character:
import re
string = 'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
pattern = r'(?<=*\\\\)*'
re.findall(pattern,string)
The reasoning behind is that the name of the file is always after a double \ , so I try to look any string which is preceeded by any text that finishes with \ .
Neverthless, when I apply this code I get the following error:
error: nothing to repeat at position 4
What am I doing wrong?
Edit: The concrete output I am looking for is getting the string 'abo_st_gas_dtd_csv' as a match.
There's a couple of things going on:
You need to declare your string definition using the same r'string' notation as for the pattern; right now your string only has a single backslash, since the first one of the two is escaped.
I'm not sure you're using * correctly. It means "repeat immediately preceding group", and not just "any string" (as, e.g., in the usual shell patterns). The first * in parentheses does not have anything preceding it, meaning that the regex is invalid. Hence the error you see. I think, what you want is .*, i.e., repeating any character 0 or more times. Furthermore, it is not needed in the parentheses. A more correct regexp would be r'(?<=\\\\).*':
import re
string = r'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
pattern = r'(?<=\\\\).*'
re.findall(pattern,string)
Your pattern is just a lookabehind, which, by itself, can't match anything. I would use this re.findall approach:
string = 'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
filename = re.findall(r'\\([^.]+\.\w+)$', string)[0]
print(filename) # abo_st_gas_dtd.csv
files = 'I:E\\trm.csvest/PZMALIo4\ETRM841_FX_.csvDeals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
counter = -1
my_files = []
for f in files:
counter += 1
if ord(f) == 92:#'\'
temp = files[counter+1:len(files)]
temp_file = ""
for f1 in temp:
temp_file += f1
# [0-len(temp_file)] => if [char after . to num index of type file]== csv
if f1 == '.' and temp[len(temp_file):len(temp_file)+3] == "csv":
my_files.append(temp_file + "csv")
break
print(my_files)#['trm.csv', 'ETRM841_FX_.csv', 'abo_st_gas_dtd.csv']
I've written the following script to count the number of sentences in a text file:
import re
filepath = 'sample_text_with_ellipsis.txt'
with open(filepath, 'r') as f:
read_data = f.read()
sentences = re.split(r'[.{1}!?]+', read_data.replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)
However, if I run it on a sample_text_with_ellipsis.txt with the following content:
Wait for it... awesome!
I get sentence_count = 2 instead of 1, because it does not ignore the ellipsis (i.e., the "...").
What I tried to do in the regex is to make it match only one occurrence of a period through .{1}, but this apparently doesn't work the way I intended it. How can I get the regex to ignore ellipses?
Splitting sentences with a regex like this is not enough. See Python split text on sentences to see how NLTK can be leveraged for this.
Answering your question, you call 3 dot sequence an ellipsis. Thus, you need to use
[!?]+|(?<!\.)\.(?!\.)
See the regex demo. The . is moved from the character class since you can't use quantifiers inside them, and only that . is matched that is not enclosed with other dots.
[!?]+ - 1 or more ! or ?
| - or
(?<!\.)\.(?!\.) - a dot that is neither preceded ((?<!\.)), nor followed ((?!\.)) with a dot.
See Python demo:
import re
sentences = re.split(r'[!?]+|(?<!\.)\.(?!\.)', "Wait for it... awesome!".replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)
print(sentence_count) # => 1
Following Wiktor's suggestion to use NLTK, I also came up with the following alternative solution:
import nltk
read_data="Wait for it... awesome!"
sentence_count = len(nltk.tokenize.sent_tokenize(read_data))
This yields a sentence count of 1 as expected.
I'm trying to process some data - specifically I have to
Delete any decimals from all numbers in the file, eg 4.0 -> 4
Add a dash between any dates and any times, eg 2014-01-01 23:45:52 -> 2014-01-01-23:45:52
I've wrote some regexes in sublime text to do this using the find and replace function:
Find : "\.\d", Replace : ""
Find : "(\d{2})\s(\d)", Replace : "$1-$2"
This all works fine and gives me the right results. The problem is that I have to process hundreds of csv files in this way, I've tried to do it in python but it isn't working the way I'd expect. Here's the code used:
for file in csv_list: # csv_list is the list of all the files I need to process
with open(file, "r") as infile:
with open("{}EDIT.csv".format(file.split(".")[0]), "w", newline="") as outfile: # Save the processed version
writer = csv.writer(outfile, delimiter=",")
reader = csv.reader(infile)
for line in reader:
writer.writerow([re.sub("(\d{2})\s(\d)",
"$1-$2", re.sub("\.\d", "", string)) for string in line])
I'm not too confident with regex, so I can't see why this isn't working the way I'd expect. If anyone could help me out that'd be great. Thanks in advance!
As requested, here is an input row, what output I was expecting, and what the actual output is:
input : 0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active
desired output : 0,2013-01-01-20:59:39,5737,english,2013-01-01-21:01:07,active
actual output : 0, 2013-01-$1-$20:59:39,5737,english,2013-01-$1-$21:01:07
You could solve your issue by replacing the first regex pattern with r"\1-\2":
import re
rx = r"(\d{2})\s(\d)"
s = "0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active"
result = re.sub("(\d{2})\s(\d)", r"\1-\2", re.sub(r"\.\d", "", s))
print (result)
See the Python demo. See the re.sub reference:
Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.
Or, to avoid that fuss with string replacement backreferences, use a single regex for that task and modify the matches inside a lambda expression:
import re
pat = r"\.\d|(\d{2})\s(\d)"
s = "0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active"
result = re.sub(pat, lambda m: r"{}-{}".format(m.group(1),m.group(2)) if m.group(1) else "", s)
print (result)
See another Python demo.
Note that perhaps, for better safety, you could use r'\.\d+\b' as the pattern to remove decimal parts (\d+ matches one or more digits, and \b requires a char other than letter, digit or _ after it, or the end of string). The second pattern can be spelled out for the same purpose as r'(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})'.
I have a input text like this (actual text file contains tons of garbage characters surrounding these 2 string too.)
(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)
I am trying to parse the text to store something like this:
value1="xxx" and value2="yyy".
I wrote python code as follows:
value1_start = content.find('value')
value1_end = content.find(';', value1_start)
value2_start = content.find('value')
value2_end = content.find(';', value2_start)
print "%s" %(content[value1_start:value1_end])
print "%s" %(content[value2_start:value2_end])
But it always returns:
value=xxx
value=xxx
Could anyone tell me how can I parse the text so that the output is:
value=xxx
value=yyy
Use a regex approach:
re.findall(r'\bvalue=[^;]*', s)
Or - if value can be any 1+ word (letter/digit/underscore) chars:
re.findall(r'\b\w+=[^;]*', s)
See the regex demo
Details:
\b - word boundary
value= - a literal char sequence value=
[^;]* - zero or more chars other than ;.
See the Python demo:
import re
rx = re.compile(r"\bvalue=[^;]*")
s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"
res = rx.findall(s)
print(res)
Use regex to filter the data you want from the "junk characters":
>>> import re
>>> _input = '#4#5%value=xxx38u952035983049;3^&^*(^%$3value=yyy#%$#^&*^%;$#%$#^'
>>> matches = re.findall(r'[a-zA-Z0-9]+=[a-zA-Z0-9]+', _input)
>>> matches
['value=xxx', 'value=yyy']
>>> for match in matches:
print(match)
value=xxx
value=yyy
>>>
Summary or the regular expression:
[a-zA-Z0-9]+: One or more alphanumeric characters
=: literal equal sign
[a-zA-Z0-9]+: One or more alphanumeric characters
For this input:
content = '(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)'
use a simple regex and manually strip off the first and last two characters:
import re
values = [x[2:-2] for x in re.findall(r'\*\*value=.*?\*\*', content)]
for value in values:
print(value)
Output:
value=xxx
value=yyy
Here the assumption is that there are always two leading and two trailing * as in **value=xxx**.
You already have good answers based on the re module. That would certainly be the simplest way.
If for any reason (perfs?) you prefere to use str methods, it is indeed possible. But you must search the second string past the end of the first one :
value2_start = content.find('value', value1_end)
value2_end = content.find(';', value2_start)
file = open('SMSm.txt', 'r')
file2 = open('SMSw.txt', 'w')
debited=[]
for line in file.readlines():
if 'debited with' in line:
import re
a= re.findall(r'[INR]\S*', line)
debited.append(a)
file2.write(line)
print re.findall(r'^(.*?)(=)?$', (debited)
My output is [['INR 2,000=2E00'], ['INR 12,000=2E400', 'NFS*Cash'], ['INR 2,000=2E0d0']]
I only want the digits after INR. For example ['INR 2,000','INR 12000','INR 2000']. What changes shall I make in the regular expression?
I have tried using str(debited) but it didn't work out.
You can use a simple regex matching INR + whitespace if any + any digits with , as separator:
import re
s = "[['INR 2,000=2E00']['INR 12,000=2E400', 'NFS*Cash']['INR 2,000=2E0d0']]"
t = re.findall(r"INR\s*(\d+(?:,\d+)*)", s)
print(t)
# Result: ['2,000', '12,000', '2,000']
With findall, all captured texts will be output as a list.
See IDEONE demo
If you want INR as part of the output, just remove the capturing round brackets from the pattern: r"INR\s*\d+(?:,\d+)*".
UPDATE
Just tried out a non-regex approach (a bit error prone if there are entries with no =), here it is:
t = [x[0:x.find("=")].strip("'") for x in s.strip("[]").replace("][", "?").split("?")]
print(t)
Given the code you already have, the simplest solution is to make the extracted string start with INR (it already does) and end just before the equals sign. Just replace this line
a= re.findall(r'[INR]\S*', line)
with this:
a= re.findall(r'[INR][^\s=]*', line)