I'm new to Python. I have a text file that contains punctuation and other words. How can I use re.compile to extract the text that matches a specific pattern?
The text file looks like the sample below; the actual file has more than 100 such sentences.
file.txt
copy() {
foundation.d.k("cloud control")
this.is.a(context),reality, new point {"copy.control.ZOOM_CONTROL", "copy.control.ACTIVITY_CONTROL"},
context control
I just want output like this:
copy.control.ZOOM_CONTROL
copy.control.ACTIVITY_CONTROL
I coded something like this:
file=(./data/.txt)
data=re.compile('copy.control. (.*?)', re.DOTALL | re.IGNORECASE).findall(file)
res= str("|".join(data))
The above regex doesn't produce my required output. Please help me with this issue. Thanks in advance.
You need to open and read the file first, then apply the re.findall method:
import re

data = []
with open('./data/.txt', 'r') as file:
    data = re.findall(r'\bcopy\.control\.(\w+)', file.read())
The \bcopy\.control\.(\w+) regex matches
\bcopy\.control\. - a copy.control. string as a whole word (\b is a word boundary)
(\w+) - Capturing group 1 (the output of re.findall): 1 or more letters, digits or _
Then, you may print the matches:
for m in data:
    print(m)
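Note that with a capturing group, re.findall returns only the group contents (ZOOM_CONTROL), not the full copy.control.ZOOM_CONTROL string shown in the question. A minimal sketch of both variants follows; the sample text is copied from the question and inlined as a string (an assumption, so the snippet runs without the file).

```python
import re

# Sample text from the question, inlined here instead of reading the file
text = '''copy() {
foundation.d.k("cloud control")
this.is.a(context),reality, new point {"copy.control.ZOOM_CONTROL", "copy.control.ACTIVITY_CONTROL"},
context control'''

# With a capturing group, findall returns only what the group captured
suffixes = re.findall(r'\bcopy\.control\.(\w+)', text)
print(suffixes)  # ['ZOOM_CONTROL', 'ACTIVITY_CONTROL']

# Without a group, findall returns the whole match
full = re.findall(r'\bcopy\.control\.\w+', text)
print("\n".join(full))
```

If the full dotted name is what you need, the second variant is probably the one to use.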
I want to extract gene boundaries (like 1..234, 234..456) from a file using regex in Python, but every time I run this code it returns an empty list.
Below is an example file:
Below is what I have so far:
import re
#with open('boundaries.txt','a') as wf:
with open('sequence.gb','r') as rf:
    for line in rf:
        x= re.findall(r"^\s+\w+\s+\d+\W\d+",line)
        print(x)
The pattern does not match because \W matches only a single non-word character after the first digits, while the boundaries contain two dots (..).
You can repeat it 1 or more times with \W+.
As you want a single match from the start of the string, you can also use re.match without the anchor ^.
^\s+\w+\s+\d+\W+\d+
               ^
import re
s=" gene 1..3256"
pattern = r"\s+\w+\s+\d+\W+\d+"
m = re.match(pattern, s)
if m:
    print(m.group())
Output
gene 1..3256
Maybe you used the wrong regex. Try the code below.
for line in rf:
    x = re.findall(r"g+.*\s*\d+",line)
    print(x)
You can also use regex101 to test your regex pattern online.
A more suitable pattern would be: ^\s*gene\s*(\d+\.\.\d+)
Explanation:
^ - match beginning of a line
\s* - match zero or more whitespaces
gene - match gene literally
(...) - capturing group
\d+ - match one or more digits
\.\. - match .. literally
Then it is enough to take the first capturing group of each match to get the gene boundaries.
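As a rough sketch of how that pattern could be applied line by line: the sample lines below are hypothetical (the question's example file is not shown), so the exact spacing in a real file may differ.

```python
import re

# Hypothetical GenBank-style lines; the real file layout is an assumption
lines = [
    "     gene            1..3256",
    "     CDS             1..3256",
    "     gene            3300..4500",
]

pattern = re.compile(r"^\s*gene\s*(\d+\.\.\d+)")

boundaries = []
for line in lines:
    m = pattern.match(line)
    if m:
        boundaries.append(m.group(1))  # first capturing group only

print(boundaries)  # ['1..3256', '3300..4500']
```

Lines that do not start with gene, such as the CDS line here, are simply skipped.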
I am fairly new to python. I am trying to use regular expressions to match specific text in a file.
I can extract the data, but only with one regular expression at a time, since the two values are on different lines and I am struggling to put them together. These several lines repeat throughout the file.
[06/05/2020 08:30:16]
othertext <000.000.000.000> xx s
example <000.000.000.000> xx s
I managed to print one or the other regular expressions:
[06/05/2020 08:30:16]
or
example <000.000.000.000> xx s
But not combined into something like this:
(timestamp) (text)
[06/05/2020 08:30:16] example <000.000.000.000> xx s
These are the regular expressions
regex = r"^\[\d\d\/\d\d\/\d\d\d\d\s\d\d\:\d\d\:\d\d\]" #Timestamp
regex = r"(^example\s+.*\<000\.000\.000\.000\>\s+.*$)" # line that contain the text
This is the code so far. I have tried a second for loop with another condition, but it seems to match only one of the regular expressions at a time.
Any pointers will be greatly appreciated.
import re
filename = input("Enter the file: ")
regex = r"^\[\d\d\/\d\d\/\d\d\d\d\s\d\d\:\d\d\:\d\d\]" #Timestamp
with open (filename, "r") as file:
    list = []
    for line in file:
        for match in re.finditer(regex, line, re.S):
            match_text = match.group()
            list.append(match_text)
            print (match_text)
You can match blocks of text similar to this in one go with a regex of this type:
(^\[\d\d\/\d\d\/\d\d\d\d[ ]+\d\d:\d\d:\d\d\])\s+[\s\S]*?(^example.*)
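To do so, however, the whole file's text needs to be 'gulped' into a single string, and the pattern needs the re.M flag so that ^ matches at the start of each line.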
The key elements of the regex:
[\s\S]*?
 ^         idiomatically, this matches all characters in regex
      ^    zero or more
       ^   not greedily or the rest of the text will match skipping
           the (^example.*) part
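A minimal sketch of how this could be used, assuming the text has already been read into one string. The sample text is modeled on the question; the second timestamp is made up for illustration.

```python
import re

# Sample text modeled on the question (second timestamp is hypothetical)
text = """[06/05/2020 08:30:16]
othertext <000.000.000.000> xx s
example <000.000.000.000> xx s
[06/05/2020 08:31:20]
othertext <000.000.000.000> xx s
example <000.000.000.000> xx s
"""

pattern = re.compile(
    r"(^\[\d\d\/\d\d\/\d\d\d\d[ ]+\d\d:\d\d:\d\d\])\s+[\s\S]*?(^example.*)",
    re.MULTILINE,  # lets ^ match at the start of every line
)

# Group 1 is the timestamp, group 2 is the line starting with "example"
combined = [f"{m.group(1)} {m.group(2)}" for m in pattern.finditer(text)]
for entry in combined:
    print(entry)
```

One caveat: if a block has no example line, the lazy [\s\S]*? will keep going into the following block until it finds one, so the timestamp may be paired with a later line.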
I'm currently having a hard time separating the words in a txt document into a list with regex. I have tried ".split" and ".readlines". My document consists of words like "HelloPleaseHelpMeUnderstand": the words are capitalized but not separated by spaces, so I'm at a loss on how to get them into a list.
This is what I have currently, but it only returns a single word.
import re
file1 = open("file.txt","r")
strData = file1.readline()
listWords = re.findall(r"[A-Za-z]+", strData)
print(listWords)
One of my goals is to search for another word within the elements of the list, but for now I just want to know how to build the list so I can continue my work.
If anyone can guide me to a solution, I would be grateful.
A regex based on lookarounds that inserts spaces between glued words is:
import re
text = "HelloPleaseHelpMeUnderstand"
print( re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[a-z])(?=[A-Z])", " ", text) )
# => Hello Please Help Me Understand
Note that adjustments will be necessary to account for numbers, or for single-letter uppercase words like I, A, etc.
Regarding your current code, you need to make sure you read the whole file into a variable (using file1.read(), you are reading just the first line with readline()) and use a [A-Z]+[a-z]* regex to match all the words glued the way you show:
import re
with open("file.txt","r") as file1:
    strData = file1.read()
listWords = re.findall(r"[A-Z]+[a-z]*", strData)
print(listWords)
Pattern details
[A-Z]+ - one or more uppercase letters
[a-z]* - zero or more lowercase letters.
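A short sketch of this pattern on the sample string from the question, inlined here instead of being read from file.txt. The last call uses a made-up string to illustrate the single-letter-word limitation.

```python
import re

text = "HelloPleaseHelpMeUnderstand"  # sample string from the question

words = re.findall(r"[A-Z]+[a-z]*", text)
print(words)  # ['Hello', 'Please', 'Help', 'Me', 'Understand']

# Searching for a word among the list elements
print("Help" in words)  # True

# Limitation: consecutive capitals stay glued together
glued = re.findall(r"[A-Z]+[a-z]*", "DoIThink")
print(glued)  # ['Do', 'IThink']
```

Because [A-Z]+ is greedy, a single-letter word such as I is merged with the word that follows it.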
How about this:
import re
strData = """HelloPleaseHelpMeUnderstand
And here not in
HereIn"""
listWords = re.findall(r"(([A-Z][a-z]+){2,})", strData)
result = [i[0] for i in listWords]
print(result)
# ['HelloPleaseHelpMeUnderstand', 'HereIn']
print(re.sub(r"\B([A-Z])", r" \1", "DoIThinkThisIsABetterAnswer?"))
Do I Think This Is A Better Answer?
I am trying to write a Python script that reads a text file input.txt, finds all phone numbers in that file, and writes all matching phone numbers to output.txt.
Let's say the text file looks like this:
Hey my number is 1234567890 and another number is +91-1234567890. but if none of these is available you can call me on +91 5645454545 (or) mail me at abc#xyz.com
It should match 1234567890, +91-1234567890 and +91 5645454545.
import re
no = '^(\+[1-9]\d{0,2}[- ]?)?[1-9][0-9]{9}' #i think problem is here
f2 = open('output.txt','w+')
for line in open('input.txt'):
    out = re.findall(no,line)
    for i in out :
        f2.write(i + '\n')
The regex in no works like this: an optional country code of up to 3 digits, followed by an optional - or space, and then a 10-digit number.
Yes, the problem is with your regex. Fortunately, it's a small one. You just need to remove the ^ character:
'(\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}'
The ^ signifies that you want to match only at the beginning of the string. You want to match multiple times throughout the string.
For Python, you'll also need to make the group non-capturing with ?:. Otherwise, re.findall does not return the complete match:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups.
Bold emphasis mine. Here's a relevant question.
This is what you get when you specify non-capturing groups for your problem:
In [485]: re.findall('(?:\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}', text)
Out[485]: ['1234567890', '+91-1234567890', '+91 5645454545']
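To make the difference concrete, here is a small sketch comparing the two patterns on the sample sentence from the question (shortened slightly; the variable names are illustrative).

```python
import re

text = ("Hey my number is 1234567890 and another number is +91-1234567890. "
        "but if none of these is available you can call me on +91 5645454545")

capturing = r'(\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}'
non_capturing = r'(?:\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}'

# With a capturing group, findall returns only the group (the country code part)
groups_only = re.findall(capturing, text)
print(groups_only)   # ['', '+91-', '+91 ']

# With a non-capturing group, findall returns each complete match
full_matches = re.findall(non_capturing, text)
print(full_matches)  # ['1234567890', '+91-1234567890', '+91 5645454545']
```

The empty string in the first result comes from the number without a country code, where the optional group did not take part in the match.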
This code will work:
import re
no = r'(?:\+[1-9]\d{0,2}[- ]?)?[1-9][0-9]{9}'  # non-capturing group, no ^ anchor
f2 = open('output.txt','w+')
for line in open('input.txt'):
    out = re.findall(no,line)
    for i in out :
        f2.write(i + '\n')
f2.close()  # flush the results to disk
The output will be:
1234567890
+91-1234567890
+91 5645454545
You can use:
(?:\+[1-9]\d{1,2}-?)?\s?[1-9][0-9]{9}
pattern = r'\d{10}|\+\d{2}[- ]+\d{10}'
matches = re.findall(pattern,text)
Output: ['1234567890', '+91-1234567890', '+91 5645454545']
I am trying to write a regex in Python to detect 7-digit numbers and update contacts in a .vcf file. It should then change each number to an 8-digit number (just adding 5 before the number). The problem is that the regex does not work.
I am getting the error message "EOL while scanning string literal".
regex=re.compile(r'^(25|29|42[1-3]|42[8-9]|44|47[1-9]|49|7[0-9]|82|85|86|871|87[5-8]|9[0-8])/I s/^/5/')
#Open file for scanning
f = open("sample.vcf")
#scan each line in file
for line in f:
    #find all results corresponding to regex and store in pattern
    pattern=regex.findall(line)
    #isolate results
    for word in pattern:
        print word
        count = count+1 #display number of occurrences
        wordprefix = '5{}'.format(word)
        s=open("sample.vcf").read()
        s=s.replace(word,wordprefix)
        f=open("sample.vcf",'w')
        print wordprefix
        f.write(s)
        f.close()
I suspect that my regex is not in the correct format for detecting numbers that start with a particular 2-digit prefix, such as 25x and 29x, followed by 5 digits that can be anything (7 digits in total).
Can anyone help me with the correct format for such a case?
/I is not how you give modifiers for a regex in Python. Nor do you do substitution with s///.
You should use re.sub() for substitution, and give the modifier as re.I, as the 2nd argument to re.compile:
reg = re.compile(regexPattern, re.I)
And then for a string s, the substitution would look like:
re.sub(reg, replacement, s)
As such, your regex looks weird to me. If you want to match 7 digits numbers, starting with 25 or 29, then you should use:
r'(2[59][0-9]{5})'
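And for the replacement, use r"5\1" (a raw string, so that \1 is a backreference to group 1 rather than the control character \x01). In all, for a string s, your code would look like:
reg = re.compile(r'(2[59][0-9]{5})', re.I)
new_s = re.sub(reg, r"5\1", s)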
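A runnable sketch of the substitution follows. The sample string is hypothetical (real .vcf content may look different), and the \b word boundaries are an added assumption so that numbers that already have 8 digits are left alone.

```python
import re

# Hypothetical vCard-like lines; the last number already has 8 digits
s = "TEL:2512345\nTEL:2998765\nTEL:52512345\n"

# \b on both sides restricts the match to standalone 7-digit numbers
reg = re.compile(r'\b(2[59][0-9]{5})\b')

# Raw string so that \1 is a backreference, not the control character \x01
new_s = reg.sub(r'5\1', s)
print(new_s)
```

With the boundaries in place, running the substitution a second time on new_s should change nothing, which is useful if the script may be run on the same file more than once.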