How to extract a word from text in Python

I have this string "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate." in a log file. What I need to do is look for this message and extract the IP address (1.2.3.4) from the log file.
import os
import shutil
import optparse
import sys

def main():
    file = open("messages", "r")
    log_data = file.read()
    file.close()
    search_str = "is currently trusted in the white list, but it is now using a new trusted certificate."
    index = log_data.find(search_str)
    print index
    return

if __name__ == '__main__':
    main()
How do I extract the IP address? Your response is appreciated.

Really simple answer:
msg = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
parts = msg.split(' ', 2)
print parts[1]
results in:
1.2.3.4
You could also do REs if you wanted, but for something this simple...

There are dozens of possible approaches; the pros and cons depend on the details of your log file. One example, using the re module:
import re
x = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
pattern = r"IP ([0-9.]+) is currently trusted in the white list"
m = re.match(pattern, x)
for ip in m.groups():
    print ip
If you want to print out every instance of that string in your log file, use re.findall, because re.match only matches at the very start of the string:
import re
pattern = r"(IP [0-9.]+ is currently trusted in the white list, but it is now using a new trusted certificate\.)"
for match in re.findall(pattern, log_data):
    print match
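As a rough sketch of the same idea for a whole log file, the capture group can wrap only the address so that re.findall returns the IPs directly. This is written for Python 3; the function name extract_trusted_ips and the sample log lines (including their timestamp prefix) are made up for illustration.

```python
import re

# Message format assumed from the example line in the question.
TRUSTED_RE = re.compile(
    r"IP ([0-9.]+) is currently trusted in the white list, "
    r"but it is now using a new trusted certificate\."
)

def extract_trusted_ips(log_text):
    """Return every IP address captured from matching messages, in order."""
    return TRUSTED_RE.findall(log_text)

# Made-up sample log; only the message text comes from the question.
sample_log = (
    "Jan 1 00:00:01 host daemon: IP 1.2.3.4 is currently trusted in the white list, "
    "but it is now using a new trusted certificate.\n"
    "Jan 1 00:00:02 host daemon: some unrelated message\n"
    "Jan 1 00:00:03 host daemon: IP 10.11.6.224 is currently trusted in the white list, "
    "but it is now using a new trusted certificate.\n"
)

print(extract_trusted_ips(sample_log))
```

Because findall scans the whole text, the message does not have to start at the beginning of a line.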

Use regular expressions.
Code like this:
import re
compiled = re.compile(r"""
.*? # Leading junk
(?P<ipaddress>\d+\.\d+\.\d+\.\d+) # IP address
.*? # Trailing junk
""", re.VERBOSE)
str = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
m = compiled.match(str)
print m.group("ipaddress")
And you get this:
>>> import re
>>>
>>> compiled = re.compile(r"""
... .*? # Leading junk
... (?P<ipaddress>\d+\.\d+\.\d+\.\d+) # IP address
... .*? # Trailing junk
... """, re.VERBOSE)
>>> str = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
>>> m = compiled.match(str)
>>> print m.group("ipaddress")
1.2.3.4
Also, I learned that there is a dictionary of matches, groupdict():
>>> str = "Peer 10.11.6.224 is currently trusted in the white list, but it is now using a new trusted certificate. Consider removing its likely outdated white list entry."
>>> m = compiled.match(str)
>>> print m.groupdict()
{'ipaddress': '10.11.6.224'}
Later: fixed that. The initial '.*' was eating your first character match. Changed it to be non-greedy. For consistency (but not necessity), I changed the trailing match, too.
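One thing to keep in mind is that \d+\.\d+\.\d+\.\d+ also accepts strings such as 999.1.1.1. A possible way to filter those out, sketched here for Python 3, is to pass each candidate through the standard ipaddress module. The helper name find_valid_ipv4 is an assumption for this example.

```python
import ipaddress
import re

# Same dotted-quad shape as the pattern in the answer above.
CANDIDATE_RE = re.compile(r"\d+\.\d+\.\d+\.\d+")

def find_valid_ipv4(text):
    """Return the dotted-quad candidates in text that parse as IPv4 addresses."""
    valid = []
    for candidate in CANDIDATE_RE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            # Raised for out-of-range octets such as 999.
            continue
        valid.append(candidate)
    return valid

print(find_valid_ipv4("IP 1.2.3.4 is trusted, 999.1.1.1 is not an address"))
```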

Regular expressions are the way to go. But if you feel uncomfortable writing them, you can try a small parser that I wrote (https://github.com/hgrecco/stringparser). It translates a format string into a regular expression. In your case, you would do the following:
from stringparser import Parser
parser = Parser("IP {} is currently trusted in the white list, but it is now using a new trusted certificate.")
ip = parser(text)
If you have a file with multiple lines you can replace the last line by:
with open("log.txt", "r") as fp:
    ips = [parser(line) for line in fp]
Good luck.

Related

How to check if 2 urls in same domain? [duplicate]

How would you extract the domain name from a URL, excluding any subdomains?
My initial simplistic attempt was:
'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])
This works for http://www.foo.com, but not http://www.foo.com.au.
Is there a way to do this properly without using special knowledge about valid TLDs (Top Level Domains) or country codes (because they change)?
Thanks

Here's a great python module someone wrote to solve this problem after seeing this question:
https://github.com/john-kurkowski/tldextract
The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers.
Quote:
tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains]
and ccTLDs [Country Code Top-Level Domains] look like
by looking up the currently living ones according to the Public Suffix
List. So, given a URL, it knows its subdomain from its domain, and its
domain from its country code.

No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, but only like zap.co.uk).
You'll just have to use an auxiliary table (or online source) to tell you which TLD's behave peculiarly like UK's and Australia's -- there's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source that source will also change accordingly, one hopes!-).
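To make the "auxiliary table" idea concrete, here is a minimal Python 3 sketch. The SAMPLE_SUFFIXES set is a tiny hand-picked stand-in, not the real Public Suffix List, and registered_domain and same_domain are hypothetical helper names. It ignores the wildcard and exception rules that the real list contains, and it expects URLs that include a scheme.

```python
from urllib.parse import urlparse

# Tiny illustrative table; a real one would be loaded from the Public Suffix List.
SAMPLE_SUFFIXES = {"com", "uk", "co.uk", "au", "com.au"}

def registered_domain(url, suffixes=SAMPLE_SUFFIXES):
    """Return the longest known suffix plus the one label in front of it."""
    labels = urlparse(url).hostname.split(".")
    # Try the longest candidate suffix first so "co.uk" wins over "uk".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[i - 1:])
    raise ValueError("no known suffix in %r" % url)

def same_domain(url_a, url_b):
    """True if both URLs share the same registered domain."""
    return registered_domain(url_a) == registered_domain(url_b)

print(registered_domain("http://www.foo.com.au"))
print(same_domain("http://www.foo.com", "http://mail.foo.com/inbox"))
```

The lookup only works as well as the table behind it, which is why the answers in this thread point at maintained lists and libraries.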

Using this file of effective tlds which someone else found on Mozilla's website:
from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde","co","uk"]
    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc
        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate
        # match tlds:
        if (exception_candidate in tlds):
            return ".".join(url_elements[i:])
        if (candidate in tlds or wildcard_candidate in tlds):
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"
    raise ValueError("Domain not in global list of TLDs")

print get_domain("http://abcde.co.uk", tlds)
results in:
abcde.co.uk
I'd appreciate it if someone let me know which bits of the above could be rewritten in a more pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know if ValueError is the best thing to raise. Comments?

Using python tld
https://pypi.python.org/pypi/tld
Install
pip install tld
Get the TLD name as string from the URL given
from tld import get_tld
print get_tld("http://www.google.co.uk")
co.uk
or without protocol
from tld import get_tld
get_tld("www.google.co.uk", fix_protocol=True)
co.uk
Get the TLD as an object
from tld import get_tld
res = get_tld("http://some.subdomain.google.co.uk", as_object=True)
res
# 'co.uk'
res.subdomain
# 'some.subdomain'
res.domain
# 'google'
res.tld
# 'co.uk'
res.fld
# 'google.co.uk'
res.parsed_url
# SplitResult(
# scheme='http',
# netloc='some.subdomain.google.co.uk',
# path='',
# query='',
# fragment=''
# )
Get the first level domain name as string from the URL given
from tld import get_fld
get_fld("http://www.google.co.uk")
# 'google.co.uk'

There are many, many TLDs. Here's the list:
http://data.iana.org/TLD/tlds-alpha-by-domain.txt
Here's another list
http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains
Here's another list
http://www.iana.org/domains/root/db/

Until get_tld is updated for all the new ones, I pull the TLD from the error message. Sure, it's bad code, but it works.
# A method on a class that has a content_url attribute; the inner call
# resolves to the module-level get_tld imported from the tld package.
def get_tld(self):
    try:
        return get_tld(self.content_url)
    except Exception as e:
        re_domain = re.compile("Domain ([^ ]+) didn't match any existing TLD name!")
        matchObj = re_domain.findall(str(e))
        if matchObj:
            for m in matchObj:
                return m
        raise e

Here's how I handle it:
if not url.startswith('http'):
    url = 'http://'+url
website = urlparse.urlparse(url)[1]
domain = ('.').join(website.split('.')[-2:])
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
if not match:
    sys.exit(2)
elif not match.group(0):
    sys.exit(2)

In Python I used to use tldextract until it failed with a url like www.mybrand.sa.com parsing it as subdomain='order.mybrand', domain='sa', suffix='com'!!
So finally, I decided to write this method
IMPORTANT NOTE: this only works with urls that have a subdomain in them. This isn't meant to replace more advanced libraries like tldextract
def urlextract(url):
    url_split=url.split(".")
    if len(url_split) <= 2:
        raise Exception("Full url required with subdomain:",url)
    return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}

Find if a string contains substring python

I'm trying to figure out how to search an input string with a regex in Python.
What I mean is that I get an input string from the user and load a JSON file. Every block in the JSON file has an 'alert_regex', and I want to check whether the input string matches that regex.
This is what I have tried so far:
import json
from pprint import pprint
import re
# Load json file
json_data=open('alerts.json')
jdata = json.load(json_data)
json_data.close()
# Input for users
input = 'Check Liveness in dsadakjnflkds.server'
# Search in json function
def searchInJson(input, jdata):
    for i in jdata:
        # checks if the input is similar to the alert name in the json
        print(i["alert_regex"])
        regexCheck = re.search(i["alert_regex"], input)
        if(regexCheck):
            # saves and prints the confluence's related link
            alert = i["alert_confluence"]
            print(alert)
            return print('Alert successfully found in `alerts.json`.')
    print('Alert was not found!')
searchInJson(input,jdata)
What I want my regex to check is only whether the string contains 'Check flink liveness'.
There are two possible problems:
1. Maybe my regex inside i["alert_regex"] is not correct (I've tried the same one in JavaScript and it worked).
2. My code is not correct.
An example of my JSON file:
[
    {
        "id": 0,
        "alert_regex": "check (.*) Liveness (.*)",
        "alert_confluence": "link goes here"
    }
]

You have two problems. All of your code can be reduced down to:
import re
re.search("check (.*) Liveness (.*)", 'Check Liveness in dsadakjnflkds.server')
This will not match because:
1. You need to make the search case-insensitive, because check will not match Check otherwise.
2. check (.*) Liveness requires two spaces between check and Liveness when (.*) matches the empty string, and the input has only one.
You need:
re.search("check (.*)Liveness (.*)", 'Check Liveness in dsadakjnflkds.server', flags=re.I)
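Put back into the shape of the original loop, the fix might look like the sketch below. The JSON is inlined as a string instead of being read from alerts.json so the example runs on its own, and find_alert is a hypothetical name.

```python
import json
import re

# Inline stand-in for alerts.json, using the structure from the question
# with the corrected pattern (no space between "(.*)" and "Liveness").
ALERTS_JSON = """
[
    {
        "id": 0,
        "alert_regex": "check (.*)Liveness (.*)",
        "alert_confluence": "link goes here"
    }
]
"""

def find_alert(text, alerts):
    """Return the first alert whose regex matches text, ignoring case, else None."""
    for alert in alerts:
        if re.search(alert["alert_regex"], text, flags=re.IGNORECASE):
            return alert
    return None

alerts = json.loads(ALERTS_JSON)
match = find_alert("Check Liveness in dsadakjnflkds.server", alerts)
print(match["alert_confluence"] if match else "Alert was not found!")
```

Returning the matching alert (or None) rather than printing inside the loop keeps the search reusable; the caller decides what to report.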

regex to grep string from config file in python

I have a config file which contains network configuration, something like the one given below.
LISTEN=192.168.180.1 #the network which listen the traffic
NETMASK=255.255.0.0
DOMAIN =test.com
I need to grep the values from the config. The following is my current code:
import re
with open('config.txt') as f:
    data = f.read()
listen = re.findall('LISTEN=(.*)',data)
print listen
The variable listen contains
192.168.180.1 #the network which listen the traffic
but I don't need the comment. Also, a comment may not always be present, as with NETMASK.

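If you really want to do this using regular expressions, I would suggest changing it to LISTEN=([^#\n]+)
which matches anything up to the pound sign opening the comment or a newline character. (Inside a character class, $ is a literal dollar sign, so [^#$] would not stop at the end of the line.) Strip the result to drop the whitespace left before the comment.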
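Extending that idea to every key in the file, one possible sketch (Python 3) uses a single multi-line pattern that skips an optional trailing comment. The pattern and the parse_config name are assumptions for illustration; the sample data is the config from the question.

```python
import re

# One KEY=VALUE pair per line, with optional spaces around "=" and an
# optional trailing "#" comment. [ \t] is used instead of \s so that a
# match never runs across a line break.
LINE_RE = re.compile(
    r"^[ \t]*(\w+)[ \t]*=[ \t]*([^#\n]*?)[ \t]*(?:#.*)?$",
    re.MULTILINE,
)

def parse_config(text):
    """Return a dict of option name -> value with inline comments removed."""
    return dict(LINE_RE.findall(text))

data = """
LISTEN=192.168.180.1 #the network which listen the traffic
NETMASK=255.255.0.0
DOMAIN =test.com
"""

settings = parse_config(data)
print(settings["LISTEN"])
```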

I came up with a solution that uses one common regex and then strips the "#" part.
import re
data = '''
LISTEN=192.168.180.1 #the network which listen the traffic
NETMASK=255.255.0.0
DOMAIN =test.com
'''
#Common regex to get all values
match = re.findall(r'.*=(.*)#*',data)
print "Total match found"
print match
#Remove # part if any
for index,val in enumerate(match):
    if "#" in val:
        val = (val.split("#")[0]).strip()
        match[index] = val
print "Match after removing #"
print match
Output :
Total match found
['192.168.180.1 #the network which listen the traffic', '255.255.0.0', 'test.com']
Match after removing #
['192.168.180.1', '255.255.0.0', 'test.com']

data = """LISTEN=192.168.180.1 #the network which listen the traffic"""
import re
print(re.search(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', data).group())
>>>192.168.180.1
print(re.search(r'[0-9]+(?:\.[0-9]+){3}', data).group())
>>>192.168.180.1

In my experience, regexes are slow at runtime and not very readable. I would do:
with open('config.txt') as f:
    for line in f:
        if not line.startswith("LISTEN="):
            continue
        rest = line.split("=", 1)[1]
        nocomment = rest.split("#", 1)[0].strip()
        print nocomment

I think the better approach is to read the whole file as the format it is given in. I wrote a couple of tutorials, e.g. for YAML, CSV, JSON.
It looks as if this is an INI file.
Example Code
Example INI file
INI files need a header. I assume it is network:
[network]
LISTEN=192.168.180.1 #the network which listen the traffic
NETMASK=255.255.0.0
DOMAIN =test.com
Python 2
#!/usr/bin/env python
import ConfigParser
import io
# Load the configuration file
with open("config.ini") as f:
    sample_config = f.read()
config = ConfigParser.RawConfigParser(allow_no_value=True)
config.readfp(io.BytesIO(sample_config))
# List all contents
print("List all contents")
for section in config.sections():
    print("Section: %s" % section)
    for options in config.options(section):
        print("x %s:::%s:::%s" % (options,
                                  config.get(section, options),
                                  str(type(options))))
# Print some contents
print("\nPrint some contents")
print(config.get('network', 'LISTEN'))  # Just get the value
Python 3
Look at configparser:
#!/usr/bin/env python
import configparser
# Load the configuration file
config = configparser.RawConfigParser(allow_no_value=True)
with open("config.ini") as f:
    config.read_file(f)  # readfp() is deprecated and was removed in Python 3.12
# Print some contents
print(config.get('network', 'LISTEN'))
gives:
192.168.180.1 #the network which listen the traffic
Hence you need to parse that value as well, because configparser does not strip inline # comments by default.
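In Python 3, configparser can be told to treat "#" as an inline comment marker through the inline_comment_prefixes argument, which may save the extra parsing step. A small sketch, with the INI text from above inlined as a string so it runs without a file:

```python
import configparser

# The example INI from above, with the assumed [network] header.
SAMPLE_INI = """\
[network]
LISTEN=192.168.180.1 #the network which listen the traffic
NETMASK=255.255.0.0
DOMAIN =test.com
"""

# inline_comment_prefixes makes the parser drop a "#..." that follows a
# value, provided the "#" is preceded by whitespace.
config = configparser.ConfigParser(inline_comment_prefixes=("#",))
config.read_string(SAMPLE_INI)

listen = config.get("network", "LISTEN")
print(listen)
```

Note that a value which legitimately contains " #" would be truncated with this setting, so it is only appropriate when "#" always starts a comment.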

Python: Passing a list of IP addresses as a list of strings

My code is designed to geo-locate IP addresses from a text file. I'm having trouble on the last section. When I run the code, I get a complaint from the map_ip.update line: socket.error: illegal IP address string passed to inet_pton
When I troubleshoot with a print statement, I get the following format:
['$ ip address']
['$ ip address']
['$ ip address']
How do I get country_name_by_addr() to read each IP address in the proper format? It appears my IP addresses are being formatted as a list of strings in individual lists.
# script that geo-locates IP addresses from a consolidated dictionary
import pygeoip
import itertools
import re
# initialize dictionary for IP addresses
count = {}
"""
This loop reads text file line-by-line and
returns one-to-one key:value pairs of IP addresses.
"""
with open('$short_logins.txt path') as f:
    for cnt, line in enumerate(f):
        ip = re.findall(r'[0-9]+(?:\.[0-9]+){3}', line)
        count.update({cnt: ip})
        cnt += 1
"""
This line consolidates unique IP addresses. Keys represent how
many times each unique IP address occurs in the text file.
"""
con_count = [(k, len(list(v))) for k, v in itertools.groupby(sorted(count.values()))]
"""
Country lookup:
This section passes each unique IP address from con_count
through country name database. These IP address are not required
to come from con_count.
"""
map_ip = {}
gi = pygeoip.GeoIP('$GeoIP.dat path')
for i in count.itervalues():
    map_ip.update({i: gi.country_name_by_addr(i)})
print map_ip

So I solved this dilemma yesterday by doing away with the regular expression:
ip = re.findall(r'[0-9]+(?:\.[0-9]+){3}', line)
I found a much simpler solution by stripping the whitespace in the file and checking to see if the IP address was accounted for. IP addresses are all in the third column hence the [2]:
ip = line.split()[2]
if ip in count:
    count[ip] += 1
else:
    count.update({ip: 1})
I removed the con_count line as well. With this change each entry is a plain string rather than the one-element list that re.findall returns for each line, which is what country_name_by_addr needs.
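The counting step can also be written with collections.Counter. The sketch below is Python 3 and keeps the lookup as a plain callable so it runs without pygeoip installed; count_ips, map_countries and the sample lines are made up for illustration.

```python
from collections import Counter

def count_ips(lines, column=2):
    """Count how often each IP appears in the given whitespace-separated column."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) > column:  # skip short or blank lines
            counts[fields[column]] += 1
    return counts

def map_countries(ips, lookup):
    """Map each IP string to lookup(ip); lookup is any callable taking a string."""
    return {ip: lookup(ip) for ip in ips}

# Made-up sample lines with the IP address in the third column.
sample = [
    "alice pts/0 1.2.3.4 Mon",
    "bob pts/1 10.11.6.224 Mon",
    "alice pts/2 1.2.3.4 Tue",
]

counts = count_ips(sample)
print(counts)
# With pygeoip available, lookup could be gi.country_name_by_addr from the question.
print(map_countries(counts, lambda ip: "unknown"))
```

Each key handed to the lookup is a single string, which avoids the list-in-a-list shape from the original code.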

How to check the Emoji property of a character in Python?

In Unicode a character can have an Emoji property.
Is there a standard way in Python to determine if a character is an Emoji?
I know of unicodedata, but it doesn't appear to expose all these extra character details.
Note: I'm asking about the specific attribute called "Emoji" in the Unicode standard, as provided in the link. I don't want to have an arbitrary list of pattern ranges, and would prefer to use a standard library.

This is the code I ended up creating to load the Emoji information. The get_emoji function gets the data file, parses it, and calls the enumeration callback. The rest of the code uses this to produce a JSON file of the information I needed.
#!/usr/bin/env python3
# Generates a list of emoji characters and names in JS format
import urllib.request
import unicodedata
import re, json

'''
Enumerates the Emoji characters that match an attribute from the Unicode standard (the Emoji list).
#param on_emoji A callback that is called with each found character. Signature `on_emoji( code_point_value )`
#param attribute The attribute that is desired, such as `Emoji` or `Emoji_Presentation`
'''
def get_emoji(on_emoji, attribute):
    with urllib.request.urlopen('http://www.unicode.org/Public/emoji/5.0/emoji-data.txt') as f:
        content = f.read().decode(f.headers.get_content_charset())
    # code point or range, then ";", the property name, "#", and a comment
    cldr = re.compile(r'^([0-9A-F]+)(\.\.([0-9A-F]+))?([^;]*);([^#]*)#(.*)$')
    for line in content.splitlines():
        m = cldr.match(line)
        if m is None:
            continue
        line_attribute = m.group(5).strip()
        if line_attribute != attribute:
            continue
        code_point = int(m.group(1),16)
        if m.group(3) is None:
            on_emoji(code_point)
        else:
            to_code_point = int(m.group(3),16)
            for i in range(code_point,to_code_point+1):
                on_emoji(i)

# Dumps the values into a JSON format
def print_emoji(value):
    c = chr(value)
    try:
        obj = {
            'code': value,
            'name': unicodedata.name(c).lower(),
        }
        print(json.dumps(obj),',')
    except ValueError:
        # unicodedata.name() found no name: Unicode DB is likely outdated in installed Python
        pass

print( "module.exports = [" )
get_emoji(print_emoji, "Emoji_Presentation")
print( "]" )
That solved my original problem. To answer the question itself it'd just be a matter of sticking the results into a dictionary and doing a lookup.
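A rough sketch of that dictionary lookup follows. To stay runnable offline it parses a short sample written in the layout of emoji-data.txt instead of downloading the file; the sample lines are abbreviated for illustration and real use would feed in the full file. build_property_table and has_property are hypothetical names.

```python
import re

# Abbreviated sample in the "code points ; property # comment" layout.
SAMPLE_DATA = """\
# comment lines like this one are skipped
0023          ; Emoji                # number sign
1F600..1F64F  ; Emoji                # emoticons (range)
1F600..1F64F  ; Emoji_Presentation   # emoticons (range)
"""

# A single code point or a "first..last" range, then the property name.
LINE_RE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)")

def build_property_table(data):
    """Map each property name to the set of code points that carry it."""
    table = {}
    for line in data.splitlines():
        m = LINE_RE.match(line)
        if m is None:  # comment or blank line
            continue
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)
        table.setdefault(m.group(3), set()).update(range(first, last + 1))
    return table

def has_property(char, prop, table):
    """True if the single character char has the given property in table."""
    return ord(char) in table.get(prop, set())

table = build_property_table(SAMPLE_DATA)
print(has_property("\U0001F600", "Emoji", table))
print(has_property("a", "Emoji", table))
```

Sets of integers keep the membership test cheap; the answer is only as current as the data file that was parsed.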

I have used the following regex pattern successfully before
import re
emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+", flags=re.UNICODE)
Also check out this question: removing emojis from a string in Python
