Find if a string contains substring python - python

I'm trying to figure out how to find a substring with regex in python inside an input.
What I mean is that I'm getting an input string from the user, and I have JSON file I load, inside every block in my JSON file I have 'alert_regex', and I want to check it the string inside my input contains my regex.
this is what I have tried so far:
import json
from pprint import pprint
import re
# Load json file
json_data=open('alerts.json')
jdata = json.load(json_data)
json_data.close()
# Input for users
input = 'Check Liveness in dsadakjnflkds.server'
# Search in json function
def searchInJson(input, jdata):
for i in jdata:
# checks if the input is similiar for the alert name in the json
print(i["alert_regex"])
regexCheck = re.search(i["alert_regex"], input)
if(regexCheck):
# saves and prints the confluence's related link
alert = i["alert_confluence"]
print(alert)
return print('Alert successfully found in `alerts.json`.')
print('Alert was not found!')
searchInJson(input,jdata)
what I want my regex to check is only if the string contains 'Check flink liveness'
There are 2 optional problems:
1. maybe my regex is not correct inside i["alert_regex"] (I've tried to same one with javascript and it worked)
2. my code is not correct.
An example of my JSON file:
[
{
"id": 0,
"alert_regex": "check (.*) Liveness (.*)",
"alert_confluence": "link goes here"
}
]

You have two problems. All of your code can be reduced down to:
import re
re.search("check (.*) Liveness (.*)", 'Check Liveness in dsadakjnflkds.server')
This will not match because:
You need to set the "case insensitivity" on the search because check will not match with Check otherwise.
check (.*) Liveness ends up with two spaces between check and Liveness if (.) matches the empty string.
You need:
re.search("check (.*)Liveness (.*)", 'Check Liveness in dsadakjnflkds.server', flags=re.I)

Related

Google Document Ai giving different outputs for the same file

I was using Document OCR API to extract text from a pdf file, but part of it is not accurate. I found that the reason may be due to the existence of some Chinese characters.
The following is a made-up example in which I cropped part of the region that the extracted text is wrong and add some Chinese characters to reproduce the problem.
When I use the website version, I cannot get the Chinese characters but the remaining characters are correct.
When I use Python to extract the text, I can get the Chinese characters correctly but part of the remaining characters are wrong.
The actual string that I got.
Are the versions of Document AI in the website and API different? How can I get all the characters correctly?
Update:
When I print the detected_languages (don't know why for lines = page.lines, the detected_languages for both lines are empty list, need to change to page.blocks or page.paragraphs first) after printing the text, I get the following output.
Code:
from google.cloud import documentai_v1beta3 as documentai
project_id= 'secret-medium-xxxxxx'
location = 'us' # Format is 'us' or 'eu'
processor_id = 'abcdefg123456' # Create processor in Cloud Console
opts = {}
if location == "eu":
opts = {"api_endpoint": "eu-documentai.googleapis.com"}
client = documentai.DocumentProcessorServiceClient(client_options=opts)
def get_text(doc_element: dict, document: dict):
"""
Document AI identifies form fields by their offsets
in document text. This function converts offsets
to text snippets.
"""
response = ""
# If a text segment spans several lines, it will
# be stored in different text segments.
for segment in doc_element.text_anchor.text_segments:
start_index = (
int(segment.start_index)
if segment in doc_element.text_anchor.text_segments
else 0
)
end_index = int(segment.end_index)
response += document.text[start_index:end_index]
return response
def get_lines_of_text(file_path: str, location: str = location, processor_id: str = processor_id, project_id: str = project_id):
# You must set the api_endpoint if you use a location other than 'us', e.g.:
# opts = {}
# if location == "eu":
# opts = {"api_endpoint": "eu-documentai.googleapis.com"}
# The full resource name of the processor, e.g.:
# projects/project-id/locations/location/processor/processor-id
# You must create new processors in the Cloud Console first
name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
# Read the file into memory
with open(file_path, "rb") as image:
image_content = image.read()
document = {"content": image_content, "mime_type": "application/pdf"}
# Configure the process request
request = {"name": name, "raw_document": document}
result = client.process_document(request=request)
document = result.document
document_pages = document.pages
response_text = []
# For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document
# Read the text recognition output from the processor
print("The document contains the following paragraphs:")
for page in document_pages:
lines = page.blocks
for line in lines:
block_text = get_text(line.layout, document)
confidence = line.layout.confidence
response_text.append((block_text[:-1] if block_text[-1:] == '\n' else block_text, confidence))
print(f"Text: {block_text}")
print("Detected Language", line.detected_languages)
return response_text
if __name__ == '__main__':
print(get_lines_of_text('/pdf path'))
It seems the language code is wrong, will this affect the result?
Posting this Community Wiki for better visibility.
One of features of DocumentAI is OCR - Optical Character Recognition which allows recognizing text from various files.
OP in this scenario received difference outputs using Try it function and Client Libraries - Python.
Why are there discrepancies between Try it and Python library?
It's hard to say as both methods use the same API documentai_v1beta3. It might be related to some files modifications when pdf is uploading to Try it Demo, different endpoints, language alphabet recognition or some random stuff.
When you are using Python Client you also get accuracy % of text identification. Below examples from my testes:
However, OP's identification is about 0,73 so it might get wrong results and in this situation is a visible issue. I guess it cannot be anyhow improved using code. Maybe if there would be different quality of PDF (in shown OPs example there are some dots which might affect identification).

regex to grep string from config file in python

I have config file which contains network configurations something like given below.
LISTEN=192.168.180.1 #the network which listen the traffic
NETMASK=255.255.0.0
DOMAIN =test.com
Need to grep the values from the config. the following is my current code.
import re
with open('config.txt') as f:
data = f.read()
listen = re.findall('LISTEN=(.*)',data)
print listen
the variable listen contains
192.168.180.1 #the network which listen the traffic
but I no need the commented information but sometimes comments may not exist like other "NETMASK"
If you really want to this using regular expressions I would suggest changing it to LISTEN=([^#$]+)
Which should match anything up to the pound sign opening the comment or a newline character.
I come up with solution which will have common regex and replace "#".
import re
data = '''
LISTEN=192.168.180.1 #the network which listen the traffic
NETMASK=255.255.0.0
DOMAIN =test.com
'''
#Common regex to get all values
match = re.findall(r'.*=(.*)#*',data)
print "Total match found"
print match
#Remove # part if any
for index,val in enumerate(match):
if "#" in val:
val = (val.split("#")[0]).strip()
match[index] = val
print "Match after removing #"
print match
Output :
Total match found
['192.168.180.1 #the network which listen the traffic', '255.255.0.0', 'test.com']
Match after removing #
['192.168.180.1', '255.255.0.0', 'test.com']
data = """LISTEN=192.168.180.1 #the network which listen the traffic"""
import re
print(re.search(r'\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}', data).group())
>>>192.168.180.1
print(re.search(r'[0-9]+(?:\.[0-9]+){3}', data).group())
>>>192.168.180.1
In my experience regex is slow runtime and not very readable. I would do:
with open('config.txt') as f:
for line in f:
if not line.startswith("LISTEN="):
continue
rest = line.split("=", 1)[1]
nocomment = rest.split("#", 1)[0]
print nocomment
I think the better approach is to read the whole file as the format it is given in. I wrote a couple of tutorials, e.g. for YAML, CSV, JSON.
It looks as if this is an INI file.
Example Code
Example INI file
INI files need a header. I assume it is network:
[network]
LISTEN=192.168.180.1 #the network which listen the traffic
NETMASK=255.255.0.0
DOMAIN =test.com
Python 2
#!/usr/bin/env python
import ConfigParser
import io
# Load the configuration file
with open("config.ini") as f:
sample_config = f.read()
config = ConfigParser.RawConfigParser(allow_no_value=True)
config.readfp(io.BytesIO(sample_config))
# List all contents
print("List all contents")
for section in config.sections():
print("Section: %s" % section)
for options in config.options(section):
print("x %s:::%s:::%s" % (options,
config.get(section, options),
str(type(options))))
# Print some contents
print("\nPrint some contents")
print(config.get('other', 'use_anonymous')) # Just get the value
Python 3
Look at configparser:
#!/usr/bin/env python
import configparser
# Load the configuration file
config = configparser.RawConfigParser(allow_no_value=True)
with open("config.ini") as f:
config.readfp(f)
# Print some contents
print(config.get('network', 'LISTEN'))
gives:
192.168.180.1 #the network which listen the traffic
Hence you need to parse that value as well, as INI seems not to know #-comments.

How to check the Emoji property of a character in Python?

In unicode a character can have an Emoji property.
Is there a standard way in Python to determine if a character is an Emoji?
I know of unicodedata, but it doesn't appear to expose all these extra character details.
Note: I'm asking about the specific attribute called "Emoji" in the unicdoe standard, as provided in the link. I don't want to have an arbitrary list of pattern ranges, and preferably use a standard library.
This is the code I ended up creating to load the Emoji information. The get_emoji function gets the data file, parses it, and calls the enumeraton callback. The rest of the code uses this to produce a JSON file of the information I needed.
#!/usr/bin/env python3
# Generates a list of emoji characters and names in JS format
import urllib.request
import unicodedata
import re, json
'''
Enumerates the Emoji characters that match an attributes from the Unicode standard (the Emoji list).
#param on_emoji A callback that is called with each found character. Signature `on_emoji( code_point_value )`
#param attribute The attribute that is desired, such as `Emoji` or `Emoji_Presentation`
'''
def get_emoji(on_emoji, attribute):
with urllib.request.urlopen('http://www.unicode.org/Public/emoji/5.0/emoji-data.txt') as f:
content = f.read().decode(f.headers.get_content_charset())
cldr = re.compile('^([0-9A-F]+)(..([0-9A-F]+))?([^;]*);([^#]*)#(.*)$')
for line in content.splitlines():
m = cldr.match(line)
if m == None:
continue
line_attribute = m.group(5).strip()
if line_attribute != attribute:
continue
code_point = int(m.group(1),16)
if m.group(3) == None:
on_emoji(code_point)
else:
to_code_point = int(m.group(3),16)
for i in range(code_point,to_code_point+1):
on_emoji(i)
# Dumps the values into a JSON format
def print_emoji(value):
c = chr(value)
try:
obj = {
'code': value,
'name': unicodedata.name(c).lower(),
}
print(json.dumps(obj),',')
except:
# Unicode DB is likely outdated in installed Python
pass
print( "module.exports = [" )
get_emoji(print_emoji, "Emoji_Presentation")
print( "]" )
That solved my original problem. To answer the question itself it'd just be a matter of sticking the results into a dictionary and doing a lookup.
I have used the following regex pattern successfully before
import re
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
"]+", flags=re.UNICODE)
Also check out this question: removing emojis from a string in Python

Match an arbitrary path, or the empty string, without adding multiple Flask route decorators

I want to capture all urls beginning with the prefix /stuff, so that the following examples match: /users, /users/, and /users/604511/edit. Currently I write multiple rules to match everything. Is there a way to write one rule to match what I want?
#blueprint.route('/users')
#blueprint.route('/users/')
#blueprint.route('/users/<path:path>')
def users(path=None):
return str(path)
It's reasonable to assign multiple rules to the same endpoint. That's the most straightforward solution.
If you want one rule, you can write a custom converter to capture either the empty string or arbitrary data beginning with a slash.
from flask import Flask
from werkzeug.routing import BaseConverter
class WildcardConverter(BaseConverter):
regex = r'(|/.*?)'
weight = 200
app = Flask(__name__)
app.url_map.converters['wildcard'] = WildcardConverter
#app.route('/users<wildcard:path>')
def users(path):
return path
c = app.test_client()
print(c.get('/users').data) # b''
print(c.get('/users-no-prefix').data) # (404 NOT FOUND)
print(c.get('/users/').data) # b'/'
print(c.get('/users/400617/edit').data) # b'/400617/edit'
If you actually want to match anything prefixed with /users, for example /users-no-slash/test, change the rule to be more permissive: regex = r'.*?'.

How to extract a word from text in Python

I have this string "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate." in a log file. What I need to do is look for this message and extract the IP address (1.2.3.4) from the log file.
import os
import shutil
import optparse
import sys
def main():
file = open("messages", "r")
log_data = file.read()
file.close()
search_str = "is currently trusted in the white list, but it is now using a new trusted certificate."
index = log_data.find(search_str)
print index
return
if __name__ == '__main__':
main()
How do I extract the IP address? Your response is appreciated.
Really simple answer:
msg = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
parts = msg.split(' ', 2)
print parts[1]
results in:
1.2.3.4
You could also do REs if you wanted, but for something this simple...
There will be dozens of possible approaches, pros and cons depend on the details of your log file. One example, using the re module:
import re
x = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
pattern = "IP ([0-9\.]+) is currently trusted in the white list"
m = re.match(pattern, x)
for ip in m.groups():
print ip
If you want to print out every instance of that string in your log file, you'd do something like this:
import re
pattern = "(IP [9-0\.]+ is currently trusted in the white list, but it is now using a new trusted certificate.)"
m = re.match(pattern, log_data)
for match in m.groups():
print match
Use regular expressions.
Code like this:
import re
compiled = re.compile(r"""
.*? # Leading junk
(?P<ipaddress>\d+\.\d+\.\d+\.\d+) # IP address
.*? # Trailing junk
""", re.VERBOSE)
str = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
m = compiled.match(str)
print m.group("ipaddress")
And you get this:
>>> import re
>>>
>>> compiled = re.compile(r"""
... .*? # Leading junk
... (?P<ipaddress>\d+\.\d+\.\d+\.\d+) # IP address
... .*? # Trailing junk
... """, re.VERBOSE)
>>> str = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
>>> m = compiled.match(str)
>>> print m.group("ipaddress")
1.2.3.4
Also, I learned there there is a dictionary of matches, groupdict():
>>>> str = "Peer 10.11.6.224 is currently trusted in the white list, but it is now using a new trusted certificate. Consider removing its likely outdated white list entry."
>>>> m = compiled.match(str)
>>>> print m.groupdict()
{'ipaddress': '10.11.6.224'}
Later: fixed that. The initial '.*' was eating your first character match. Changed it to be non-greedy. For consistency (but not necessity), I changed the trailing match, too.
Regular expression is the way to go. But if you fill uncomfortably writing them, you can try a small parser that I wrote (https://github.com/hgrecco/stringparser). It translates a string format to a regular expression. In your case, you will do the following:
from stringparser import Parser
parser = Parser("IP {} is currently trusted in the white list, but it is now using a new trusted certificate.")
ip = parser(text)
If you have a file with multiple lines you can replace the last line by:
with open("log.txt", "r") as fp:
ips = [parser(line) for line in fp]
Good luck.

Categories

Resources