Regex + Python - Remove all lines beginning with a * - python

I want to remove all lines from a given file that begin with a *. So for example, the following:
* This needs to be gone
But this line should stay
*remove
* this too
End
Should generate this:
But this line should stay
End
What I ultimately need to do is the following:
1. Remove all text inside parentheses and brackets (parentheses/brackets included),
2. As mentioned above, remove lines starting with *.
So far I was able to address #1 with the following: re.sub(r'\[.*?\]|\(.*?\)', '', fileString). I tried several things for #2 but always end up removing things I don't want to.
Solution 1 (no regex)
>>> f = open('path/to/file.txt', 'r')
>>> [n for n in f.readlines() if not n.startswith('*')]
Solution 2 (regex)
>>> s = re.sub(r'(?m)^\*.*\n?', '', s)
Thanks everyone for the help.

Using regex >>
s = re.sub(r'(?m)^\*.*\n?', '', s)
Check this demo.
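As a quick check, here is a minimal sketch that runs the pattern above against the sample text from the question. The function name drop_star_lines is only illustrative and not part of the original answer.

```python
import re

sample = (
    "* This needs to be gone\n"
    "But this line should stay\n"
    "*remove\n"
    "* this too\n"
    "End\n"
)

def drop_star_lines(text):
    # (?m) makes ^ match at the start of every line, not just the string.
    # The trailing \n? also consumes the line break, so no blank line is left.
    return re.sub(r'(?m)^\*.*\n?', '', text)

cleaned = drop_star_lines(sample)
print(cleaned)
```

Without the (?m) flag, ^ would only match at the very start of the string, so only a leading starred line would be removed.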

You don't need regex for this.
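text = file.split('\n') # split everything into lines (file is the file's contents as one string)
for line in text:
    if line.startswith('*'):
        continue # skip lines that begin with an asterisk
    # do something with the remaining lines here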
Let us know if you need any more help.

You should really give more information here. At the minimum, what version of python you are using and a code snippet. But, that said, why do you need a regular expression? I don't see why you can't just use startswith.
The following works for me with Python 2.7.3
s = '* this line gotta go!!!'
print s.startswith('*')
>>>True

>>> f = open('path/to/file.txt', 'r')
>>> [n for n in f.readlines() if not n.startswith('*')]
['But this line should stay\n', 'End\n']
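For completeness, both steps from the question can be combined in one helper. This is only a sketch: it assumes the bracket/parenthesis pattern is \[.*?\]|\(.*?\) (no nesting, no span crossing a line break), and the name clean_text is made up for illustration.

```python
import re

def clean_text(text):
    # Drop every line whose first character is '*'.
    kept = [line for line in text.splitlines(keepends=True)
            if not line.startswith('*')]
    # Remove [...] and (...) spans, delimiters included.
    # Assumption: spans are not nested and stay on one line.
    return re.sub(r'\[.*?\]|\(.*?\)', '', ''.join(kept))

result = clean_text("* gone\nkeep(this) me[that]\nEnd\n")
print(result)
```

Filtering the lines first means a starred line is removed even if it also contains brackets.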

Related

Removing specific set of characters?

I know a code like this
translated1 = str(''.join( c for c in translated2 if c not in "[']" ))
will remove any instances of [ or ' or ] but how would I code it so that it removes "---" exactly this.
I don't want it to remove any instances of "-", only if those 3 are together at once.
How can I do that? Thank you!
This can be done very easily by regular expression.
You can read more about python regular expression here.
You can use it like this-
import re
str1 = 'there is three --- and now single - and now two --'
str2 = re.sub('---', '', str1)
print(str2)
Output-
there is three and now single - and now two --
DEMO
You can use str.replace(patt, repl):
s = "---test--test-test----"
print(s.replace("---", ""))
# "test--test-test-"
You could also do it using re if you want something more extensible.
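One way the re route can be "more extensible": if a run of exactly three dashes should go but longer runs should stay, lookarounds can express that, which str.replace cannot. This is a sketch under that assumption; the helper name remove_exact_run and its parameters are hypothetical.

```python
import re

def remove_exact_run(text, char="-", count=3):
    # Match `count` copies of `char` that are neither preceded
    # nor followed by another copy of the same character.
    c = re.escape(char)
    pattern = r"(?<!{0}){0}{{{1}}}(?!{0})".format(c, count)
    return re.sub(pattern, "", text)

print(remove_exact_run("---test--test-test----"))
```

Note the difference from s.replace("---", ""): here the final run of four dashes is left untouched instead of being reduced to a single dash.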

automating regex to process multiple files

I'm trying to process some data - specifically I have to
Delete any decimals from all numbers in the file, eg 4.0 -> 4
Add a dash between any dates and any times, eg 2014-01-01 23:45:52 -> 2014-01-01-23:45:52
I've written some regexes in Sublime Text to do this using the find and replace function:
Find : "\.\d", Replace : ""
Find : "(\d{2})\s(\d)", Replace : "$1-$2"
This all works fine and gives me the right results. The problem is that I have to process hundreds of csv files in this way, I've tried to do it in python but it isn't working the way I'd expect. Here's the code used:
for file in csv_list: # csv_list is the list of all the files I need to process
    with open(file, "r") as infile:
        with open("{}EDIT.csv".format(file.split(".")[0]), "w", newline="") as outfile: # Save the processed version
            writer = csv.writer(outfile, delimiter=",")
            reader = csv.reader(infile)
            for line in reader:
                writer.writerow([re.sub("(\d{2})\s(\d)",
                                        "$1-$2", re.sub("\.\d", "", string)) for string in line])
I'm not too confident with regex, so I can't see why this isn't working the way I'd expect. If anyone could help me out that'd be great. Thanks in advance!
As requested, here is an input row, what output I was expecting, and what the actual output is:
input : 0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active
desired output : 0,2013-01-01-20:59:39,5737,english,2013-01-01-21:01:07,active
actual output : 0, 2013-01-$1-$20:59:39,5737,english,2013-01-$1-$21:01:07
You could solve your issue by changing the replacement string from "$1-$2" to r"\1-\2". Python's re module uses \1, not $1, for backreferences in the replacement:
import re
rx = r"(\d{2})\s(\d)"
s = "0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active"
result = re.sub(rx, r"\1-\2", re.sub(r"\.\d", "", s))
print(result)
See the Python demo. See the re.sub reference:
Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.
Or, to avoid that fuss with string replacement backreferences, use a single regex for that task and modify the matches inside a lambda expression:
import re
pat = r"\.\d|(\d{2})\s(\d)"
s = "0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active"
result = re.sub(pat, lambda m: r"{}-{}".format(m.group(1),m.group(2)) if m.group(1) else "", s)
print (result)
See another Python demo.
Note that perhaps, for better safety, you could use r'\.\d+\b' as the pattern to remove decimal parts (\d+ matches one or more digits, and \b requires a char other than letter, digit or _ after it, or the end of string). The second pattern can be spelled out for the same purpose as r'(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})'.
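The two stricter patterns from the note above could be wrapped in a small per-field helper, roughly like this. It is a sketch only; the names DECIMAL_PART, DATE_TIME and clean_field are illustrative.

```python
import re

# Stricter patterns suggested above, compiled once for reuse across files.
DECIMAL_PART = re.compile(r'\.\d+\b')
DATE_TIME = re.compile(r'(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})')

def clean_field(field):
    # Drop the decimal part, then join date and time with a dash.
    field = DECIMAL_PART.sub('', field)
    return DATE_TIME.sub(r'\1-\2', field)

row = ["0.0", "2013-01-01 20:59:39", "5737.0", "english",
       "2013-01-01 21:01:07", "active"]
cleaned_row = [clean_field(f) for f in row]
print(cleaned_row)
```

In the loop from the question, the list passed to writer.writerow would then be [clean_field(string) for string in line].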

Regex to find links in one row

I have this string:
http://pastebin.com/XXXXXXXhttp://pastebin.com/XXXXXX\r
I need to extract all links in one line which ends with \r. It can contain one link or even five links. I got something like this :
(http[s]*:.*)[\\r|h]
but it returns whole row as one match,
any ideas ?
You can use this lookahead based regex in findall:
>>> s='http://pastebin.com/XXXXXXXhttp://pastebin.com/XXXXXX\r'
>>> re.findall(r'https?://.+?(?=https?://|[\r\n]|$)', s)
['http://pastebin.com/XXXXXXX', 'http://pastebin.com/XXXXXX']
(?=https?://|[\r\n]|$) is a positive lookahead that asserts that the next position is followed by http:// or https://, by \r or \n, or by the end of the string.
RegEx Demo
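The same pattern can be wrapped in a function to make it easy to try on other rows. This is a sketch; the name extract_links and the example.com URLs are made up for illustration.

```python
import re

# Lazy match from each scheme up to (but not including) the next
# scheme, a line break character, or the end of the string.
LINK = re.compile(r'https?://.+?(?=https?://|[\r\n]|$)')

def extract_links(row):
    return LINK.findall(row)

links = extract_links('http://pastebin.com/XXXXXXXhttp://pastebin.com/XXXXXX\r')
print(links)
print(extract_links('https://a.example.com/1http://b.example.com/2\r'))
```

Because the lookahead does not consume anything, the next link starts exactly where the previous one stopped.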
Try this
va = 'http://pastebin.com/XXXXXXXhttp://pastebin.com/XXXXXX\r'
import re
vac = re.findall(r"(?:https?:\/+)([^\r|h]+)",va)
print vac
Give this a try: (https?:\/\/[^\\r|h]+)
You don't need regex for this. Try this:
mylinks = []
with open('yourfile', 'r') as f:
    for line in f:
        # split() yields an empty first piece when the line starts with 'http'
        for link in line.strip().split('http'):
            if link:
                mylinks.append('http' + link)
EDIT: Looks like you just need one string not the whole file. Just run:
mylinks = []
for link in mystring.strip().split('http'):
    if link: # skip the empty piece before the first 'http'
        mylinks.append('http' + link)

Regular expression, trimming after a particular sign and neglecting the list terms which do not have that sign

file = open('SMSm.txt', 'r')
file2 = open('SMSw.txt', 'w')
debited=[]
for line in file.readlines():
    if 'debited with' in line:
        import re
        a= re.findall(r'[INR]\S*', line)
        debited.append(a)
        file2.write(line)
print re.findall(r'^(.*?)(=)?$', (debited))
My output is [['INR 2,000=2E00'], ['INR 12,000=2E400', 'NFS*Cash'], ['INR 2,000=2E0d0']]
I only want the digits after INR. For example ['INR 2,000','INR 12000','INR 2000']. What changes shall I make in the regular expression?
I have tried using str(debited) but it didn't work out.
You can use a simple regex matching INR + whitespace if any + any digits with , as separator:
import re
s = "[['INR 2,000=2E00']['INR 12,000=2E400', 'NFS*Cash']['INR 2,000=2E0d0']]"
t = re.findall(r"INR\s*(\d+(?:,\d+)*)", s)
print(t)
# Result: ['2,000', '12,000', '2,000']
With findall, all captured texts will be output as a list.
See IDEONE demo
If you want INR as part of the output, just remove the capturing round brackets from the pattern: r"INR\s*\d+(?:,\d+)*".
UPDATE
Just tried out a non-regex approach (a bit error prone if there are entries with no =), here it is:
t = [x[0:x.find("=")].strip("'") for x in s.strip("[]").replace("][", "?").split("?")]
print(t)
Given the code you already have, the simplest solution is to make the extracted string start with the literal text INR and end just before the equals sign. Note that [INR] is a character class that matches a single I, N or R, not the word INR. Just replace this line
a= re.findall(r'[INR]\S*', line)
with this:
a = re.findall(r'INR\s*[^\s=]*', line)
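Put together with the loop from the question, the extraction might look like the sketch below. It uses the INR\s*\d+(?:,\d+)* pattern from the earlier answer; the function name debited_amounts and the sample messages are invented for illustration, since the real contents of SMSm.txt are not shown.

```python
import re

# "INR", optional whitespace, then digits with optional comma groups.
AMOUNT = re.compile(r'INR\s*\d+(?:,\d+)*')

def debited_amounts(lines):
    amounts = []
    for line in lines:
        if 'debited with' in line:
            # extend() keeps one flat list instead of a list of lists
            amounts.extend(AMOUNT.findall(line))
    return amounts

# Hypothetical messages, not taken from the asker's file.
messages = [
    "Your account is debited with INR 2,000=2E00 on 01-Jan",
    "Your account is credited with INR 500",
    "Your account is debited with INR 12,000=2E400 NFS*Cash",
]
found = debited_amounts(messages)
print(found)
```

Using extend instead of append avoids the nested lists seen in the question's output.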

In Python how to strip dollar signs and commas from dollar related fields only

I'm reading in a large text file with lots of columns, dollar related and not, and I'm trying to figure out how to strip the dollar fields ONLY of $ and , characters.
so say I have:
a|b|c
$1,000|hi,you|$45.43
$300.03|$MS2|$55,000
where a and c are dollar-fields and b is not.
The output needs to be:
a|b|c
1000|hi,you|45.43
300.03|$MS2|55000
I was thinking that regex would be the way to go, but I can't figure out how to express the replacement:
f=open('sample1_fixed.txt','wb')
for line in open('sample1.txt', 'rb'):
    new_line = re.sub(r'(\$\d+([,\.]\d+)?k?)',????, line)
    f.write(new_line)
f.close()
Anyone have an idea?
Thanks in advance.
Unless you are really tied to the idea of using a regex, I would suggest doing something simple, straight-forward, and generally easy to read:
def convert_money(inval):
    # Only touch fields that start with a dollar sign.
    if inval.startswith('$'):
        test_val = inval[1:].replace(",", "")
        try:
            float(test_val) # keep the stripped value only if it is numeric
        except ValueError:
            pass
        else:
            inval = test_val
    return inval
def convert_string(s):
    return "|".join(map(convert_money, s.split("|")))
a = '$1,000|hi,you|$45.43'
b = '$300.03|$MS2|$55,000'
print(convert_string(a))
print(convert_string(b))
OUTPUT
1000|hi,you|45.43
300.03|$MS2|55000
A simple approach:
>>> import re
>>> exp = r'\$\d+(,|\.)?\d+'
>>> s = '$1,000|hi,you|$45.43'
>>> '|'.join(i.translate(None, '$,') if re.match(exp, i) else i for i in s.split('|'))
'1000|hi,you|45.43'
It sounds like you are addressing the entire line of text at once. I think your first task would be to break up your string by columns into an array or some other variables. Once you've done that, your solution for converting strings of currency into numbers doesn't have to worry about the other fields.
Once you've done that, I think there is probably an easier way to do this task than with regular expressions. You could start with this SO question.
If you really want to use regex though, then this pattern should work for you:
/[$,]/g
Demo on regex101
Replace matches with empty strings. The pattern gets a little more complicated if you have other kinds of currency present.
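The column-first idea from this answer could be sketched as follows. It assumes you already know which column positions hold dollar amounts (0 and 2 for the sample in the question); the function name strip_dollar_fields and its parameters are hypothetical.

```python
def strip_dollar_fields(line, dollar_columns, sep="|"):
    # Split into columns, dropping the trailing newline if present.
    fields = line.rstrip("\n").split(sep)
    for i in dollar_columns:
        if i < len(fields):
            # Only the listed columns lose their '$' and ',' characters.
            fields[i] = fields[i].replace("$", "").replace(",", "")
    return sep.join(fields)

for raw in ["a|b|c", "$1,000|hi,you|$45.43", "$300.03|$MS2|$55,000"]:
    print(strip_dollar_fields(raw, dollar_columns=(0, 2)))
```

Since the non-dollar column is never touched, values such as hi,you and $MS2 pass through unchanged without any regex.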
Try this regex and adapt it if necessary.
\$(\d+)[\,]*([\.]*\d*)
SEE DEMO : http://regex101.com/r/wM0zB6/2
Use the regex
((?<=\d),(?=\d))|(\$(?=\d))
e.g.
>>> import re
>>> x="$1,000|hi,you|$45.43"
>>> re.sub( r'((?<=\d),(?=\d))|(\$(?=\d))', r'', x)
'1000|hi,you|45.43'
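To see how this lookaround pattern behaves on all three sample lines from the question, including the $MS2 field, a short sketch (the name MONEY_NOISE is illustrative):

```python
import re

# A comma between two digits, or a dollar sign directly followed by a digit.
MONEY_NOISE = re.compile(r'(?<=\d),(?=\d)|\$(?=\d)')

lines = ["a|b|c", "$1,000|hi,you|$45.43", "$300.03|$MS2|$55,000"]
cleaned = [MONEY_NOISE.sub('', line) for line in lines]
for line in cleaned:
    print(line)
```

One caveat: the pattern does not know about columns, so a non-dollar field that happens to contain something like 1,000 would be changed as well.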
Try the below regex and then replace the matched strings with \1\2\3
\$(\d+(?:\.\d+)?)(?:(?:,(\d{2}))*(?:,(\d{3})))?
DEMO
Defining a black list and checking if the characters are in it, is an easy way to do this:
blacklist = ("$", ",") # define characters to remove
with open('sample1_fixed.txt','wb') as f:
    for line in open('sample1.txt', 'rb'):
        clean_line = "".join(c for c in line if c not in blacklist)
        f.write(clean_line)
\$(?=(?:[^|]+,)|(?:[^|]+\.))
Try this. Replace the matches with an empty string and use the re.M option. See the demo:
http://regex101.com/r/gT6kI4/6
