Regular expression doesn't extract the whole id from a log file? - python

My log file contains the following input. I want to capture each ID in full, but my regular expression returns only part of it:
id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤
id:A2uhasan30hamwix160212145302428
id:A2uhasan30hamwix١٦٠٢٠٩١٣٠١٥٠٠١١
id:A2uhasan30hamwix١٦٠٢٠٩١٦٤٧٣٩٧٣٢
id:A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧
id:A2uhasan30hamwix160207145023750
I am using the following regular expression with Python 2.7. I originally had:
RE_SID = re.compile(r'sid:(<<")?(?P<sid>([A-Za-z0-9._+]*))', re.U)
and, after changing sid to id, I now have:
>>> RE_SID = re.compile(ur'id:(<<")?(?P<sid>[A-Za-z\d._+]*)', re.U)
>>> sid = RE_SID.search('id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤').group('sid')
>>> sid
'A2uhasan30hamwix'
and this is my result:
is: A2uhasan30hamwix
After edit:
This is how I am reading the log file:
with open(cfg.log_file) as input_file: ...
fields = line.strip().split(' ')
and here is an example line from the log:
2015-11-30T23:58:13.760950+00:00 calxxx enexxxxce[10476]: INFO consume_essor: user:<<"ailxxxied">> callee_num:<<"+144442567413">> id:<<"A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧">> credits:0.0 result:ok provider:sipovvvv1.yv.vs
I would appreciate help fixing my regular expression.

Based on what we discussed in the chat, posting the solution:
import codecs
import re
RE_SID = re.compile(ur'id:(<<")?(?P<sid>[A-Za-z\d._+]*)', re.U)  # \d used to match non-ASCII digits, too
input_file = codecs.open(cfg.log_file, encoding='utf-8')  # Read the file with UTF8 encoding
for line in input_file:
    fields = line.strip().split(u' ')  # u prefix is important!
    if len(fields) >= 11:
        try:
            # ......
            sid = RE_SID.search(fields[7]).group('sid')  # Or check if there is a match first

Three things to fix:
use id instead of sid
use \d instead of 0-9 so that, with the re.U flag, the Arabic-Indic digits are matched as well
there is no need for an extra capturing group inside the sid named group
Fixed version:
id:(<<")?(?P<sid>[A-Za-z\d_.+]+)
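For readers on Python 3, here is a minimal sketch of the same idea. It assumes Python 3, where str patterns are Unicode-aware by default, so neither the ur prefix nor re.U is needed. The helper name extract_sid and the samples list are made up for illustration; the sample values are copied from the question.

```python
import re

# Python 3: str patterns are Unicode-aware by default, so \d matches any
# Unicode decimal digit, including the Arabic-Indic digits in the log.
RE_SID = re.compile(r'id:(?:<<")?(?P<sid>[A-Za-z\d._+]+)')

# Sample fields copied from the question (bare and <<"...">>-wrapped forms).
samples = [
    'id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤',
    'id:A2uhasan30hamwix160212145302428',
    'id:<<"A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧">>',
]


def extract_sid(field):
    """Return the captured id, or None when the field does not match."""
    match = RE_SID.search(field)
    return match.group('sid') if match else None


sids = [extract_sid(s) for s in samples]
for sid in sids:
    print(sid)
```

Checking for a match before calling group() avoids an AttributeError on lines that carry no id field.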

string = '''
id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤
id:A2uhasan30hamwix160212145302428
id:A2uhasan30hamwix١٦٠٢٠٩١٣٠١٥٠٠١١
id:A2uhasan30hamwix١٦٠٢٠٩١٦٤٧٣٩٧٣٢
id:A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧
id:A2uhasan30hamwix160207145023750
'''
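import re
reObj = re.compile(r'id:.*')
# A compiled pattern's findall() takes (string, pos, endpos), not flags.
# Passing re.DOTALL there is read as pos=16, which skips the first id.
ans = reObj.findall(string)
print(ans)
Output :
['id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤',
'id:A2uhasan30hamwix160212145302428',
'id:A2uhasan30hamwix١٦٠٢٠٩١٣٠١٥٠٠١١',
'id:A2uhasan30hamwix١٦٠٢٠٩١٦٤٧٣٩٧٣٢',
'id:A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧',
'id:A2uhasan30hamwix160207145023750']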

Related

Print the title from a txt file from the given word

I have a txt file with the following info:
545524---Python foundation---Course---https://img-c.udemycdn.com/course/100x100/647442_5c1f.jpg---Outsourcing Development Work: Learn My Proven System To Hire Freelance Developers
Another line with the same format but different info and continue....
Here on line 1, Python foundation is the course title. If a user has input "foundation" how do I print out Python foundation? It's basically printing the whole title of a course based on the given word.
I can use something like:
input_text = 'foundation'
file1 = open("file.txt", "r")
readfile = file1.read()
if input_text in readfile:
    #This prints only foundation keyword not the whole title
I assume that your input file has multiple lines, separated by newlines, in this format:
<Course-id>---<Course-name>---Course---<Course-image-link>---<Desc>
import re
input_text = 'foundation'
book_title_pattern = r'---([\w\d\s_\.,;:()]+)---'
with open('file.txt', 'r') as file1:
    for line in file1:
        match = re.search(book_title_pattern, line)
        if match:
            matched_title = match.group(1)  # text captured between the first pair of ---
            if input_text in matched_title:
                print(matched_title)
Get the key value that you're searching for. This could come from user input; here we hard-code it for demo purposes.
Open the file and read one line at a time. Use RE to parse the line looking for a specific pattern. Check that we've actually found a token matching the RE criterion then check if it contains the 'key' value. Print result as appropriate.
import re
key = 'foundation'
with open('input.txt') as infile:
    for line in map(str.strip, infile):
        # raw string so that \s reaches the regex engine unchanged
        if (t := re.findall(r'---([a-zA-Z\s]+)---', line)) and key in t[0]:
            print(t[0])
You can use a regex to match ---Course name--- using ---([a-zA-Z ]+)---. This will give you all the course names. Then you can check for the user_input in each course and print the course name if you find user_input in it:
import re
user_input = 'foundation'
file1 = open("file.txt", "r")
readfile = file1.read()
course_name = re.findall('---([a-zA-Z ]+)---', readfile)
for course in course_name:
    if user_input in course:  #Then check for 'foundation' in course_name
        print(course)
Output:
Python foundation
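Because every line follows the ---delimited layout described above, the title can also be taken without a regex by splitting on the delimiter. The sketch below assumes the title is always the second field; the function name find_titles is hypothetical, and the case-insensitive comparison is an added assumption, not something the question asked for.

```python
def find_titles(lines, keyword):
    """Return titles (the second '---' field) that contain keyword."""
    titles = []
    for line in lines:
        fields = line.strip().split('---')
        # Assumption: field 0 is the course id and field 1 is the title.
        if len(fields) >= 2 and keyword.lower() in fields[1].lower():
            titles.append(fields[1])
    return titles


# Sample line copied from the question.
sample = [
    "545524---Python foundation---Course---"
    "https://img-c.udemycdn.com/course/100x100/647442_5c1f.jpg---"
    "Outsourcing Development Work: Learn My Proven System To Hire Freelance Developers",
]
matches = find_titles(sample, 'foundation')
print(matches)
```

Splitting avoids having to list every character a title may contain, which is where the character-class regexes above can miss titles with digits or punctuation.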

Python remove multiple line

My file has something like this
#email = "abc";
%area = (
"abc" => 10,
"xyz" => 10,
);
Is there a regex I can use to match from the line beginning with %area = ( through the following lines until ); is found? I want to remove those lines from the file.
The regex I tried, ^%area = \(.*|\n\), does not continue matching onto the next lines.
So my final file will only have
#email = "abc";
Assuming a file named file contains:
#email = "abc";
%area = (
"abc" => 10,
"xyz" => 10,
);
Would you please try the following:
import re
with open("file") as f:
    s = f.read()
print(re.sub(r'^%area =.*?\);', '', s, flags=(re.DOTALL|re.MULTILINE)))
Output:
#email = "abc";
If you want to clean-up the remaining empty lines, please try instead:
print(re.sub(r'\n*^%area =.*?\);\n*', '\n', s, flags=(re.DOTALL|re.MULTILINE))
Then the result looks like:
#email = "abc";
The re.DOTALL flag makes . match any character including a newline.
The re.MULTILINE flag allows ^ and $ to match, respectively,
just after and just before newlines within the string.
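To see why both flags are needed, the short sketch below runs the same pattern with each flag on its own. The string s is an in-memory copy of the sample file, and the variable names are chosen only for illustration.

```python
import re

# In-memory copy of the sample file from the question.
s = '#email = "abc";\n%area = (\n"abc" => 10,\n"xyz" => 10,\n);\n'
pattern = r'^%area =.*?\);'

# Both flags: ^ can match at the start of line 2, and . can cross newlines.
both = re.sub(pattern, '', s, flags=re.DOTALL | re.MULTILINE)

# MULTILINE only: ^ matches at line 2, but . stops at the end of that line,
# so the closing ");" is never reached and nothing is replaced.
multiline_only = re.sub(pattern, '', s, flags=re.MULTILINE)

# DOTALL only: ^ matches only at the very start of the string, which is
# "#email", so again nothing is replaced.
dotall_only = re.sub(pattern, '', s, flags=re.DOTALL)

print(repr(both))
print(multiline_only == s, dotall_only == s)
```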
[EDIT]
If you want to overwrite the original file, please try:
import re
with open("file") as f:
    s = f.read()
with open("file", "w") as f:
    f.write(re.sub(r'\n*^%area =.*?\);\n*', '\n', s, flags=(re.DOTALL|re.MULTILINE)))
To capture and remove your area group, you can use:
re.sub(r'%area = \((.|\n)*\);', '', string)
#'#email = "abc";\n\n'
However, this leaves two newlines after your #email line. You could add \n\n to the regex to capture those as well:
re.sub(r'\n\n%area = \((.|\n)*\);', '', string)
#'#email = "abc";'
However, if the email line always follows the same format, you would be better off searching for that line only:
re.search(r'(#email = ).*(?=\n)', string).group()
#'#email = "abc";'

How to read a specific part of a text file (Py 3x)

Other questions don't seem to be getting answered or are not getting answered for Python. I'm trying to get it to find the keyword "name", set the position to there, then set a variable to that specific line, and then have it use only that piece of text as a variable. In shorter terms, I'm trying to locate a variable in the .txt file based on "name" or "HP" which will always be there.
I hope that makes sense...
I've tried to use different variables like currentplace instead of namePlace but neither works.
import os
def savetest():
    save = open("nametest_text.txt", "r")
    print("Do you have a save?")
    conf = input(": ")
    if conf == "y" or conf == "Y" or conf == "Yes" or conf == "yes":
        text = save.read()
        namePlace = text.find("name")
        currentText = namePlace + 7
        save.seek(namePlace)
        nameLine = save.readline()
        username = nameLine[currentText:len(nameLine)]
        print(username)
        hpPlace = text.find("HP")
        currentText = hpPlace + 5
        save.seek(hpPlace)
        hpLine = save.readline()
        playerHP = hpLine[currentText:len(hpLine)]
        print(playerHP)
        os.system("pause")
    save.close()
savetest()
My text file is simply:
name = Wubzy
HP = 100
I want it to print out whatever is put after the equals sign at name and the same for HP, but not name and HP itself.
So it should just print
Wubzy
100
Press any key to continue . . .
But it instead prints
Wubzy
Press any key to continue . . .
This looks like a good job for a regex. Regexes can match and capture patterns in text, which seems to be exactly what you are trying to do.
For example, the regex ^name\s*=\s*(\w+)$ will match lines that have the exact text "name", followed by 0 or more whitespace characters, an '=', another 0 or more whitespace characters, and then one or more word characters (letters, digits or underscores). It will capture the word group at the end.
The regex ^HP\s*=\s*(\d+)$ will match lines that have the exact text "HP", followed by 0 or more whitespace characters, an '=', and then another 0 or more whitespace characters then one or more digits. It will capture the number group at the end.
# This is the regex library
import re
# This might be easier to use if you're getting more information in the future.
# re.MULTILINE is needed because the whole file is searched at once: it lets
# ^ and $ match at the start and end of each line, not only of the whole text.
reg_dict = {
    "name": re.compile(r"^name\s*=\s*(\w+)$", re.MULTILINE),
    "HP": re.compile(r"^HP\s*=\s*(\d+)$", re.MULTILINE)
}
def savetest():
    save = open("nametest_text.txt", "r")
    print("Do you have a save?")
    conf = input(": ")
    # instead of checking each one individually, you can check if conf is
    # within a much smaller set of valid answers
    if conf.lower() in ["y", "yes"]:
        text = save.read()
        # Find the name
        match = reg_dict["name"].search(text)
        # .search will return the first match of the text, or if there are
        # no occurrences, None
        if(match):
            # With match groups, group(0) is the entire match, group(1) is
            # What was captured in the first set of parenthesis
            username = match.group(1)
        else:
            print("The text file does not contain a username.")
            return
        print(username)
        # Find the HP
        match = reg_dict["HP"].search(text)
        if(match):
            player_hp = match.group(1)
        else:
            print("The text file does not contain a HP.")
            return
        print(player_hp)
        # Using system calls to pause output is not a great idea for a
        # variety of reasons, such as cross OS compatibility
        # Instead of os.system("pause") try
        input("Press enter to continue...")
    save.close()
savetest()
Use a regex to extract based on a pattern:
'(?:name|HP) = (.*)'
This captures anything that follows an equal to sign preceded by either name or HP.
Code:
import re
with open("nametest_text.txt", "r") as f:
    for line in f:
        m = re.search(r'(?:name|HP) = (.*)', line.strip())
        if m:
            print(m.group(1))
Simplest way may be to use str.split() and then print everything after the '=' character:
with open("nametest_text.txt", "r") as f:
    for line in f:
        if line.strip():
            print(line.strip().split(' = ')[1])
output:
Wubzy
100
Instead of trying to create and parse a proprietary format (you will most likely hit limitations at some point and will need to change your logic and/or file format), better stick to a well-known and well-defined file format that comes with the required writers and parsers, like yaml, json, cfg, xml, and many more.
This saves a lot of pain; consider the following quick example of a class that holds a state and that can be serialized to a key-value-mapped file format (I'm using yaml here, but you can easily exchange it for json, or others):
#!/usr/bin/python
import os
import yaml
class GameState:
    def __init__(self, name, **kwargs):
        self.name = name
        self.health = 100
        self.__dict__.update(kwargs)
    @staticmethod
    def from_savegame(path):
        with open(path, 'r') as savegame:
            args = yaml.safe_load(savegame)
        return GameState(**args)
    def save(self, path, overwrite=False):
        if os.path.exists(path) and os.path.isfile(path) and not overwrite:
            raise IOError('Savegame exists; refusing to overwrite.')
        with open(path, 'w') as savegame:
            savegame.write(yaml.dump(self.__dict__))
    def __str__(self):
        return (
            'GameState(\n{}\n)'
            .format(
                '\n'.join([
                    ' {}: {}'.format(k, v)
                    # .items() works on Python 3; sorted() keeps the order stable
                    for k, v in sorted(self.__dict__.items())
                ]))
        )
Using this simple class as an example:
SAVEGAMEFILE = 'savegame_01.yml'
new_gs = GameState(name='jbndlr')
print(new_gs)
new_gs.health = 98
print(new_gs)
new_gs.save(SAVEGAMEFILE, overwrite=True)
old_gs = GameState.from_savegame(SAVEGAMEFILE)
print(old_gs)
... yields:
GameState(
health: 100
name: jbndlr
)
GameState(
health: 98
name: jbndlr
)
GameState(
health: 98
name: jbndlr
)
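As mentioned, the yaml calls can be exchanged for json. The following is only a sketch of that swap using the standard-library json module, so it needs no third-party package. The file name savegame_01.json and the temporary directory are assumptions made to keep the example self-contained.

```python
import json
import os
import tempfile


class GameState:
    def __init__(self, name, **kwargs):
        self.name = name
        self.health = 100
        self.__dict__.update(kwargs)

    @staticmethod
    def from_savegame(path):
        # json.load returns a dict that maps directly onto the constructor.
        with open(path, 'r') as savegame:
            return GameState(**json.load(savegame))

    def save(self, path, overwrite=False):
        if os.path.isfile(path) and not overwrite:
            raise IOError('Savegame exists; refusing to overwrite.')
        with open(path, 'w') as savegame:
            json.dump(self.__dict__, savegame, indent=2, sort_keys=True)


# Round trip through a temporary directory (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'savegame_01.json')
    gs = GameState(name='Wubzy')
    gs.health = 98
    gs.save(path)
    restored = GameState.from_savegame(path)

print(restored.name, restored.health)
```

json only handles basic types (strings, numbers, lists, dicts), which is enough for a name and a health value; richer state would need custom encoding.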

Optimization of Parsing of Python Scripts

I want to apply a regex to every line in my txt file.
For example
comments={ts=2010-02-09T04:05:20.777+0000,comment_id=529590|2886|LOL|Baoping Wu|529360}
comments={ts=2010-02-09T04:20:53.281+0000, comment_id=529589|2886|cool|Baoping Wu|529360}
comments={ts=2010-02-09T05:19:19.802+0000,comment_id=529591|2886|ok|Baoping Wu|529360}
My Python Code is:
import re
p = re.compile(ur'(comment_id=)(\d+)\|(\d+)\|([^|]+)\|([^|]+)\|(\d+)', re.MULTILINE|re.DOTALL)
#open =
test_str = r"comments={ts=2010-02-09T04:05:20.777+0000, comment_id=529590|2886|LOL|Baoping Wu|529360}"
subst = ur"\1\2, user_id = \3, comment='\4', user= '\5', post_commented=\6"
result = re.sub(p, subst, test_str)
print result
I wanted to solve it with the help of MULTILINE, but it doesn't work.
Can anyone help me?
The output for the first line should be
comments={ts=2010-02-09T04:05:20.777+0000, comment_id=529590, user_id = 2886, comment='LOL', user= 'Baoping Wu', post_commented=529360}
My only issue is applying the regex to every line and writing the result to a txt file.
Your regex works without MULTILINE or DOTALL. You can run the replacement over the entire document at once:
import re
with open('file.txt', 'r') as f:
    txt = f.read()
pattern = r'(comment_id=)(\d+)\|(\d+)\|([^|]+)\|([^|]+)\|(\d+)'
repl = r"\1\2, user_id = \3, comment='\4', user= '\5', post_commented=\6"
result = re.sub(pattern, repl, txt)
with open('file2.txt', 'w') as f:
    f.write(result)
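The pattern has no ^ or $ anchors and no dot, which is why neither MULTILINE nor DOTALL changes anything: re.sub simply replaces every non-overlapping match in the text. A small in-memory check, using two of the sample lines from the question in place of the file, might look like this:

```python
import re

pattern = re.compile(r'(comment_id=)(\d+)\|(\d+)\|([^|]+)\|([^|]+)\|(\d+)')
repl = r"\1\2, user_id = \3, comment='\4', user= '\5', post_commented=\6"

# Two sample lines copied from the question, joined as they would be in a file.
text = (
    "comments={ts=2010-02-09T04:05:20.777+0000,comment_id=529590|2886|LOL|Baoping Wu|529360}\n"
    "comments={ts=2010-02-09T04:20:53.281+0000, comment_id=529589|2886|cool|Baoping Wu|529360}\n"
)

# One call rewrites the matching part of every line.
result = pattern.sub(repl, text)
print(result)
```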

Replace newlines and XML tags with ','

I have an XML document that looks like this:
<file>
<name>NAME_OF_FILE</name>
</file>
<file>
<name>NAME_OF_FILE</name>
</file>
I’m trying to write a Python script that will replace all newlines, tags and whitespace between tags (i.e. not the elements themselves) with ','.
The output for the file above should look like this:
NAME_OF_FILE','NAME_OF_FILE','NAME_OF_FILE','
Here's what I've got so far. I'm having trouble understanding exactly how Python handles newlines:
import sys
import os
import re
source = r'c:\A\grepper.txt'
f = open(source,'r')
out = open(r'c:\A\bout.txt', 'a')
for line in f:
    one = re.sub(r"\n", '', line)
    two = re.sub(r"\r", '', one)
    three = re.sub(r'</name>.*<name>', '\',\'', two)
    out.write(three)
out.close()
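You can remove the r prefixes, although it is not strictly required: with the prefix, the regex engine receives a backslash followed by n and interprets that as a newline itself, so both spellings match. Without the prefixes: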
one = re.sub("\n", '', line)
two = re.sub("\r", '', one)
You can also use str.replace() for these simple replacements, as well as combine them into one line.
line = re.sub(r'</name>.*<name>', "','", line.replace('\n', '').replace('\r', ''))
out.write(line)
However, that still doesn't solve the problem of getting your desired output. I'd suggest doing the following for that:
results = []
for line in f:
    match = re.search(r'<name>(.*)</name>', line)
    if match:
        results.append(match.group(1))
print >>out, "','".join(results)
Here's it working: http://ideone.com/ik48G
Instead of replacing you might want to consider matching what you want:
tag_re = re.compile('''
<(?P<tag>[a-z]+)> # First match the tag, must be a-z enclosed in <>
(?P<value>[^<>]+) # Match the value, anything but <>
</(?P=tag)> # Match the same tag we got earlier, but the closing version
''', re.VERBOSE)
print "','".join(m.group('value') for m in tag_re.finditer(data))
Regular expressions are wrong for this. Use the xml.sax.handler module.
Untested:
import xml.sax
from xml.sax.handler import ContentHandler
class CharactersOnlyContentHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.text = ""
        self.texts = []
    def characters(self, content):
        self.text += content
    def endElement(self, name):
        # Strip first, so whitespace between tags is not recorded as a value
        text = self.text.strip()
        if text:
            self.texts.append(text)
        self.text = ""
handler = CharactersOnlyContentHandler()
xml.sax.parse(xml_file_name, handler)
print ",".join("'%s'" % s for s in handler.texts)
import lxml.etree
myxml = """
<filelist>
<file>
<name>FIRST FILE NAME</name>
</file>
<file>
<name>SECOND FILE NAME</name>
</file>
</filelist>
"""
root = lxml.etree.fromstring(myxml)
filenames = root.xpath('//file/name/text()')
print ', '.join(filenames)
results in
FIRST FILE NAME, SECOND FILE NAME
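If lxml is not installed, the standard-library xml.etree.ElementTree module can do the same extraction. This is a sketch for Python 3 and, like the lxml example, it assumes the file elements are wrapped in a single root element (called filelist here), since a document with several top-level elements is not well-formed XML.

```python
import xml.etree.ElementTree as ET

# Same sample document as the lxml example, with a single root element.
myxml = """
<filelist>
  <file>
    <name>FIRST FILE NAME</name>
  </file>
  <file>
    <name>SECOND FILE NAME</name>
  </file>
</filelist>
"""

root = ET.fromstring(myxml)
# iter('name') walks the tree and yields every <name> element in document order.
filenames = [name.text for name in root.iter('name')]
print("','".join(filenames))
```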
