Regex to find links in one row - python

I have this string:
http://pastebin.com/XXXXXXXhttp://pastebin.com/XXXXXX\r
I need to extract all the links from a single line that ends with \r. The line can contain anywhere from one to five links. So far I have this:
(http[s]*:.*)[\\r|h]
but it returns the whole row as one match.
Any ideas?

You can use this lookahead-based regex with findall:
>>> import re
>>> s='http://pastebin.com/XXXXXXXhttp://pastebin.com/XXXXXX\r'
>>> re.findall(r'https?://.+?(?=https?://|[\r\n]|$)', s)
['http://pastebin.com/XXXXXXX', 'http://pastebin.com/XXXXXX']
(?=https?://|[\r\n]|$) is a positive lookahead that asserts that the current position is followed by http:// or https://, by \r or \n, or by the end of the string.
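If the input has several such rows, the same pattern can be applied line by line. This is only a minimal sketch, assuming the text is already in memory; the name links_per_line and the sample URLs are made up for illustration.

```python
import re

# Same pattern as in the answer above: a link runs until the next
# "http://" / "https://", a line break, or the end of the string.
LINK = re.compile(r'https?://.+?(?=https?://|[\r\n]|$)')

def links_per_line(text):
    """Return one list of links for each line of text (hypothetical helper)."""
    return [LINK.findall(line) for line in text.splitlines()]

sample = 'http://pastebin.com/AAAhttp://pastebin.com/BBB\r\nhttps://example.com/x\r\n'
for links in links_per_line(sample):
    print(links)
```

Note that str.splitlines() already removes the trailing \r, so the lookahead mostly matters for separating links that are glued together.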

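Try this:
import re
va = 'http://pastebin.com/XXXXXXXhttp://pastebin.com/XXXXXX\r'
# Captures each link without its scheme. The class [^\r|h] stops at any 'h',
# so this only works while the part after the scheme contains no 'h'.
vac = re.findall(r"(?:https?:\/+)([^\r|h]+)", va)
print(vac)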

Give this a try: (https?:\/\/[^\\r|h]+)

You don't need regex for this. Try this:
mylinks = []
with open('yourfile', 'r') as f:
    for line in f:
        for link in line.strip().split('http'):
            if link:  # skip the empty piece before the first 'http'
                mylinks.append('http' + link)
EDIT: It looks like you only need to handle one string, not the whole file. In that case just run:
mylinks = []
for link in mystring.strip().split('http'):
    if link:  # skip the empty piece before the first 'http'
        mylinks.append('http' + link)

Related

Replacing when a word is in another word but with special circumstances

My program replaces tokens with values when they appear in a file. It gets stuck when reading certain lines. Here is an example:
1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a
The two tokens in the example are Token100 and Token100a. I need a way to only replace Token100 with its data and not replace Token100a with Token100's data with an a afterwards. I can't look for spaces before and after because sometimes they are in the middle of lines. Any thoughts are appreciated. Thanks.
You can use regex:
import re
line = "1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a"
match = re.sub("Token100a", "data", line)
print(match)
Outputs:
1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1data
More about regex here:
https://www.w3schools.com/python/python_regex.asp
You can use a regular expression with a negative lookahead to ensure that the following character is not an "a":
>>> import re
>>> test = '1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a'
>>> re.sub(r'Token100(?!a)', 'data', test)
'1.1.1.1.1.1.1.1.1.1 data.1 1.1.1.1.1.1.1Token100a'
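The lookahead idea can be generalised so that a token is never matched as a prefix of a longer token. The following is a sketch only, assuming token names consist of letters and digits; replace_token is a hypothetical helper, not something from the question.

```python
import re

def replace_token(line, token, value):
    """Replace token unless it is immediately followed by a letter or digit."""
    # re.escape guards against regex metacharacters in the token name.
    pattern = re.escape(token) + r'(?![A-Za-z0-9])'
    # A function replacement keeps backslashes in value from being interpreted.
    return re.sub(pattern, lambda m: value, line)

line = '1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a'
print(replace_token(line, 'Token100', 'data'))
```

With this lookahead, Token100 followed by "." is replaced, while Token100a and a hypothetical Token1000 are left alone.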

Regular expression, trimming after a particular sign and neglecting the list terms which do not have that sign

import re
file = open('SMSm.txt', 'r')
file2 = open('SMSw.txt', 'w')
debited=[]
for line in file.readlines():
    if 'debited with' in line:
        a= re.findall(r'[INR]\S*', line)
        debited.append(a)
        file2.write(line)
print(re.findall(r'^(.*?)(=)?$', debited))
My output is [['INR 2,000=2E00'], ['INR 12,000=2E400', 'NFS*Cash'], ['INR 2,000=2E0d0']]
I only want the digits after INR. For example ['INR 2,000','INR 12000','INR 2000']. What changes shall I make in the regular expression?
I have tried using str(debited) but it didn't work out.
You can use a simple regex matching INR + whitespace if any + any digits with , as separator:
import re
s = "[['INR 2,000=2E00']['INR 12,000=2E400', 'NFS*Cash']['INR 2,000=2E0d0']]"
t = re.findall(r"INR\s*(\d+(?:,\d+)*)", s)
print(t)
# Result: ['2,000', '12,000', '2,000']
With findall, all captured texts will be output as a list.
If you want INR as part of the output, just remove the capturing round brackets from the pattern: r"INR\s*\d+(?:,\d+)*".
UPDATE
Just tried out a non-regex approach (a bit error prone if there are entries with no =), here it is:
t = [x[0:x.find("=")].strip("'") for x in s.strip("[]").replace("][", "?").split("?")]
print(t)
Given the code you already have, the simplest solution is to make the extracted string start with INR (it already does) and end just before the equals sign. Just replace this line
a= re.findall(r'[INR]\S*', line)
with this:
a= re.findall(r'[INR][^\s=]*', line)
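To tie the answers back to the original loop, here is a rough sketch that applies the INR pattern to each matching line and collects the amounts in one flat list. The function name debited_amounts and the sample messages are invented for illustration, and the sketch assumes an ordinary space (or no space) between INR and the digits.

```python
import re

# "INR", optional whitespace, then digits with optional comma-separated groups.
AMOUNT = re.compile(r'INR\s*\d+(?:,\d+)*')

def debited_amounts(lines):
    """Collect 'INR <amount>' strings from lines that mention a debit."""
    found = []
    for line in lines:
        if 'debited with' in line:
            found.extend(AMOUNT.findall(line))
    return found

# Hypothetical sample messages, loosely modelled on the output in the question.
messages = [
    'Your a/c is debited with INR 2,000=2E00 on 01-Jan',
    'Your a/c is credited with INR 500',
    'Your a/c is debited with INR 12,000=2E400 at NFS*Cash',
]
print(debited_amounts(messages))
```

Using extend rather than append keeps the result as a flat list of strings instead of a list of lists.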

In Python how to strip dollar signs and commas from dollar related fields only

I'm reading in a large text file with lots of columns, dollar related and not, and I'm trying to figure out how to strip the dollar fields ONLY of $ and , characters.
so say I have:
a|b|c
$1,000|hi,you|$45.43
$300.03|$MS2|$55,000
where a and c are dollar-fields and b is not.
The output needs to be:
a|b|c
1000|hi,you|45.43
300.03|$MS2|55000
I was thinking that regex would be the way to go, but I can't figure out how to express the replacement:
f = open('sample1_fixed.txt', 'wb')
for line in open('sample1.txt', 'rb'):
    new_line = re.sub(r'(\$\d+([,\.]\d+)?k?)',????, line)
    f.write(new_line)
f.close()
Anyone have an idea?
Thanks in advance.
Unless you are really tied to the idea of using a regex, I would suggest doing something simple, straightforward, and generally easy to read:
def convert_money(inval):
    # Only change fields that start with '$' and are numeric once '$' and ',' are removed.
    if inval.startswith('$'):
        test_val = inval[1:].replace(",", "")
        try:
            float(test_val)
        except ValueError:
            pass  # not a number (e.g. '$MS2'), leave the field untouched
        else:
            inval = test_val
    return inval

def convert_string(s):
    return "|".join(map(convert_money, s.split("|")))

a = '$1,000|hi,you|$45.43'
b = '$300.03|$MS2|$55,000'
print(convert_string(a))
print(convert_string(b))
OUTPUT
1000|hi,you|45.43
300.03|$MS2|55000
A simple approach:
>>> import re
>>> exp = r'\$\d+(,|\.)?\d+'
>>> s = '$1,000|hi,you|$45.43'
>>> '|'.join(i.replace('$', '').replace(',', '') if re.match(exp, i) else i for i in s.split('|'))
'1000|hi,you|45.43'
It sounds like you are addressing the entire line of text at once. I think your first task would be to break up your string by columns into an array or some other variables. Once you've done that, your solution for converting strings of currency into numbers doesn't have to worry about the other fields.
Once you've done that, I think there is probably an easier way to do this task than with regular expressions. You could start with this SO question.
If you really want to use regex though, then this pattern should work for you:
/[$,]/g
Replace matches with empty strings. The pattern gets a little more complicated if you have other kinds of currency present.
Try this regex and adapt it if necessary.
\$(\d+)[\,]*([\.]*\d*)
SEE DEMO : http://regex101.com/r/wM0zB6/2
Use the regex
((?<=\d),(?=\d))|(\$(?=\d))
e.g.
>>> import re
>>> x = "$1,000|hi,you|$45.43"
>>> re.sub(r'((?<=\d),(?=\d))|(\$(?=\d))', r'', x)
'1000|hi,you|45.43'
Try the below regex and then replace the matched strings with \1\2\3
\$(\d+(?:\.\d+)?)(?:(?:,(\d{2}))*(?:,(\d{3})))?
Defining a blacklist and checking whether each character is in it is an easy way to do this (note that it strips $ and , from every field, not only the dollar fields):
blacklist = ("$", ",")  # characters to remove
with open('sample1_fixed.txt', 'w') as f:
    for line in open('sample1.txt', 'r'):
        clean_line = "".join(c for c in line if c not in blacklist)
        f.write(clean_line)
\$(?=(?:[^|]+,)|(?:[^|]+\.))
Try this. Replace with an empty string. Use the re.M option. See demo:
http://regex101.com/r/gT6kI4/6
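On the question of how to express the replacement itself: re.sub accepts a function as its second argument, which receives each match and returns the replacement text. The sketch below is one possible way to use that, assuming fields are separated by | and that a dollar field is a $ followed only by digits, commas, and an optional decimal part. The names FIELD_MONEY and clean_line are made up for this example.

```python
import re

# A '$' amount that fills a whole field: preceded by '|' or the start of the
# line, and followed by '|', a line break, or the end of the line.
FIELD_MONEY = re.compile(r'(?<![^|])\$(\d[\d,]*(?:\.\d+)?)(?![^|\r\n])')

def clean_line(line):
    """Strip '$' and ',' from dollar fields only (hypothetical helper)."""
    # The function replacement gets the match object and returns the new text.
    return FIELD_MONEY.sub(lambda m: m.group(1).replace(',', ''), line)

for row in ['a|b|c', '$1,000|hi,you|$45.43', '$300.03|$MS2|$55,000']:
    print(clean_line(row))
```

Fields such as hi,you and $MS2 are left untouched because they do not match the whole-field amount pattern.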

Extracting text from a line: Regex in Python

I'm working with regular expressions in Python and I'm struggling with this.
I have data in a file of lines like this one:
|person=[[Old McDonald]]
and I just want to be able to extract Old McDonald from this line.
I have been trying with this regular expression:
matchLine = re.match(r"\|[a-z]+=(\[\[)?[A-Z][a-z]*(\]\])", line)
print matchLine
but it doesn't work; None is the result each time.
The construct [A-Z][a-z]* does not match Old McDonald: it allows only one capital letter followed by lowercase letters, with no spaces or further capitals. You should probably use something like [A-Z][A-Za-z ]*. Here is a code example:
import re
line = '|person=[[Old McDonald]]'
matchLine = re.match(r'\|[a-z]+=(?:\[\[)?([A-Z][A-Za-z ]*)\]\]', line)
print(matchLine.group(1))
The output is Old McDonald for me. If you need to search in the middle of the string, use re.search instead of re.match:
import re
line = 'blahblahblah|person=[[Old McDonald]]blahblahblah'
matchLine = re.search(r'\|[a-z]+=(?:\[\[)?([A-Z][A-Za-z ]*)\]\]', line)
print(matchLine.group(1))
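Since the question mentions that None came back each time, it may help to guard against a failed match before calling group. A small sketch under the assumption that every field looks like |key=[[value]]; the helper name parse_field and the looser .*? value pattern are choices made for this example, not part of the answer above.

```python
import re

# Field name in group 1, everything between [[ and ]] in group 2.
FIELD = re.compile(r'\|([a-z]+)=\[\[(.*?)\]\]')

def parse_field(line):
    """Return (name, value) for a |name=[[value]] field, or None if absent."""
    m = FIELD.search(line)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(parse_field('|person=[[Old McDonald]]'))
print(parse_field('no field here'))
```

The lazy .*? accepts any characters inside the brackets, which is broader than [A-Z][A-Za-z ]* and may or may not be what the data needs.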

Regex + Python - Remove all lines beginning with a *

I want to remove all lines from a given file that begin with a *. So for example, the following:
* This needs to be gone
But this line should stay
*remove
* this too
End
Should generate this:
But this line should stay
End
What I ultimately need to do is the following:
1. Remove all text inside parentheses and brackets (parentheses/brackets included).
2. As mentioned above, remove lines starting with '*'.
So far I was able to address #1 with the following: re.sub(r'\[.*?\]|\(.*?\)', '', fileString). I tried several things for #2 but always end up removing things I don't want to.
Solution 1 (no regex)
>>> f = open('path/to/file.txt', 'r')
>>> [n for n in f.readlines() if not n.startswith('*')]
Solution 2 (regex)
>>> s = re.sub(r'(?m)^\*.*\n?', '', s)
Thanks everyone for the help.
Using regex:
s = re.sub(r'(?m)^\*.*\n?', '', s)
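You don't need regex for this.
text = file.split('\n')  # 'file' is the file contents as one string; split it into lines
kept = []
for line in text:
    if not line.startswith('*'):  # keep only lines that do not begin with '*'
        kept.append(line)
Let us know if you need any more help.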
You should really give more information here. At a minimum, say which version of Python you are using and include a code snippet. That said, why do you need a regular expression? I don't see why you can't just use startswith.
The following works for me with Python 2.7.3
>>> s = '* this line gotta go!!!'
>>> print(s.startswith('*'))
True
>>> f = open('path/to/file.txt', 'r')
>>> [n for n in f.readlines() if not n.startswith('*')]
['But this line should stay\n', 'End\n']
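Both requirements from the question can be combined in one small function. This is a sketch rather than a definitive solution: the function name clean is made up, starred lines are removed first as an arbitrary ordering choice, and nested brackets are not handled.

```python
import re

def clean(text):
    """Drop lines starting with '*', then remove (...) and [...] spans."""
    # (?m) makes ^ match at the start of every line; \n? also eats the line break.
    text = re.sub(r'(?m)^\*.*\n?', '', text)
    # Lazy matches keep each removal to a single bracketed span.
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    return text

sample = (
    '* This needs to be gone\n'
    'But this line should stay\n'
    '*remove\n'
    '* this too\n'
    'End\n'
)
print(clean(sample))
```

Removing a bracketed span leaves the surrounding whitespace in place, so double spaces can remain where text was cut out.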
