I have a column called Description in my Dataframe. I have text in that column as below.
Description
Summary: SD1: Low free LOG space in database saptempdb: 2.99% Date: 01/01/2017 Severity: Major Reso
Summary: SD1: Low free DATA space in database 10:101:101:1 2.99% Date: 01/01/2017 Severity: Major Res
Summary: SAP SolMan Sys=SM1_SNG01AMMSOL04,MO=AGEEPM40,Alert=Columnstore Unloads,Desc= ,Cat=Exception
How can I extract the server name or IP from the above descriptions? I have around 10,000 rows.
I have written the following to split the sentences on spaces. Now I need to filter out the server names or IPs:
df['sentsplit'] = df["Description"].str.split(" ")
print(df)
The general case of what you're asking is "How do I parse this input?". The question then becomes: what knowledge of your input can you exploit? Do all the lines follow one or a few forms? Can you place any restrictions on where the hostname or IP address will appear on each line?
Given your input, here's a regex I might apply. Quick and dirty -- not elegant -- but if it's only for 10,000 lines, and a one-off job, who cares? It's functional:
database (\d+:\d+:\d+:\d+)|database (\w+)|Sys=([^, ]+),
This regex assumes that the IP address will always be after the word database and preceded by a space, OR that the hostname will be after the word database, OR that the hostname will be preceded by Sys= and followed by a comma or a space.
Obviously, test for your purposes, and fine tune as appropriate. In the Python API:
import re

host_or_ip_re = re.compile(r'database (\d+:\d+:\d+:\d+)|database (\w+)|Sys=([^, ]+),')
for line in log:
    m = host_or_ip_re.search(line)
    if m:
        print(m.groups())
The detail that always trips me up is the difference between match and search: match only matches at the beginning of the string, while search scans for a match anywhere in it.
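Since the data lives in a DataFrame, the same regex can be applied to the whole column at once with Series.str.extract; a sketch, with illustrative rows modeled on the question (the bfill trick just picks whichever capture group matched):

```python
import pandas as pd

# Illustrative data modeled on the question's Description column.
df = pd.DataFrame({"Description": [
    "Summary: SD1: Low free LOG space in database saptempdb: 2.99%",
    "Summary: SD1: Low free DATA space in database 10:101:101:1 2.99%",
    "Summary: SAP SolMan Sys=SM1_SNG01AMMSOL04,MO=AGEEPM40",
]})

# str.extract returns one column per capture group; back-fill across the
# columns and keep the first, i.e. whichever alternative actually matched.
pat = r"database (\d+:\d+:\d+:\d+)|database (\w+)|Sys=([^, ]+),"
df["host_or_ip"] = df["Description"].str.extract(pat).bfill(axis=1).iloc[:, 0]
print(df["host_or_ip"].tolist())
```

For 10,000 rows this vectorized form is also noticeably faster than a Python-level loop.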
I'm trying to use a regex statement to extract a specific block of text between two known phrases that will be repeated in other documents, and remove everything else. These few sentences will then be passed into other functions.
My problem seems to be that when the words I'm searching for are on the same line, my regex statement works. If they're on different lines I get:
print(match.group(1).strip())
AttributeError: 'NoneType' object has no attribute 'group'
I'm expecting future reports to have line breaks at different points depending on what was written before - is there a way to prepare the text first by removing all line breaks, or to make my regex statement ignore those when searching?
Any help would be great, thanks!
import fitz
import re

doc = fitz.open(r'file.pdf')
text_list = []
for page in doc:
    text_list.append(page.getText())
    #print(text_list[-1])
text_string = ' '.join(text_list)

test_string = "Observations of Client Behavior: THIS IS THE DESIRED TEXT. Observations of Client's response to skill acquisition"  # works for this test

pat = r".*?Observations of Client Behavior: (.*) Observations of Client's response to skill acquisition*"
match = re.search(pat, text_string)
print(match.group(1).strip())
When the phrases in my pat are on the same line in the long text file, the search works. But as soon as they are on different lines, it no longer does.
Here is a sample of the input text giving me an issue:
Observations of Client Behavior: Overall interfering behavior data trends are as followed: Aggression frequency
has been low and stable at 0 occurrences for the past two consecutive sessions. Elopement frequency is on an
overall decreasing trend. Property destruction frequency is on an overall decreasing trend. Non-compliance
frequency has been stagnant at 2 occurrences for the past two consecutive sessions, but overall on a
decreasing trend. Tantrum duration data are variable; data were at 89 minutes on 9/27/21, but have starkly
decreased to 0 minutes for the past two consecutive sessions. Observations of Client's response to skill
acquisition: Overall skill acquisition data trends are as followed: Frequency of excessive mands
Note that . matches any character other than a newline, so you could use (.|\n) to capture everything. Also, it seems that a line break can fall inside your fixed phrases, which is why the patterns below use \s+ instead of literal spaces. First define the prefix and suffix of the pattern:
prefix=r"Observations\s+of\s+Client\s+Behavior:"
suffix=r"Observations\s+of\s+Client's\s+response\s+to\s+skill\s+acquisition:"
and then create pattern and find all occurrences:
pattern=prefix+r"((?:.|\n)*?)"+suffix
f=re.findall(pattern,text_string)
By using *? at the end of r"((?:.|\n)*?)" we match as few characters as possible (a non-greedy match).
Example of multi-line multi-pattern:
text_string = '''any thing Observations of Client Behavior: patern1 Observations of Client's
response to skill acquisition: any thing
any thing Observations of Client Behavior: patern2 Observations of
Client's response to skill acquisition: any thing Observations of Client
Behavior: patern3 Observations of Client's response to skill acquisition: any thing any thing'''
result = re.findall(pattern, text_string)
# result == [' patern1 ', ' patern2 ', ' patern3 ']
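As an aside, the same effect as (?:.|\n) can be had by keeping a plain .*? and passing the re.DOTALL flag, which makes . match newlines as well; a small self-contained sketch:

```python
import re

prefix = r"Observations\s+of\s+Client\s+Behavior:"
suffix = r"Observations\s+of\s+Client's\s+response\s+to\s+skill\s+acquisition:"

# Sample text with a line break in the middle of the suffix phrase.
text = ("any thing Observations of Client Behavior: patern1 Observations of Client's\n"
        "response to skill acquisition: any thing")

# re.DOTALL lets '.' cross line breaks, so no (?:.|\n) workaround is needed.
pattern = prefix + r"(.*?)" + suffix
matches = re.findall(pattern, text, flags=re.DOTALL)
print(matches)
```

Both spellings are equivalent here; DOTALL just reads a little more clearly.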
I am trying to grab a list of messages that have specific content, e.g. billing emails, and work on the data in them.
In order to get these messages, I run the following
service.users().messages().list(userId=user_id, page_token=page_token, q=query).execute()
which returns all the messages.
I want to limit the messages that I get to conform to the following criteria:
Sent in the last two days
Definitely deny if the from: address is in a blacklist of email addresses, e.g. notifications, facebook
Definitely accept if the from: address is in a whitelist of email addresses
Check whether the subject: matches a set of strings
I understand that I can create a query that matches an email address and subject (from:bill#pge.com AND subject:"Your bill for this month"), but the blacklist and whitelist mentioned above can become significantly large as the scope and the number of vendors I accept increase, and the same goes for subjects. So my question is:
Is there a limit on the number of query terms?
Is there a way to achieve this other than generating a very long query string combining the blacklist, whitelist, and subjects (from:abc#this.com AND NOT from:xyz#that.com AND subject:"Your bill" AND subject:"This month's bill")?
Note: For project settings I mostly conform to https://developers.google.com/gmail/api/quickstart/python
There's no documented limit on the number of query terms you can use. Yes, you would have to programmatically create a long query string combining all the emails from the lists. Here [1] you can check the operators you can use; the best approach would be something like this:
1) Use "after" or "newer" operators with a timestamp from 2 days before the current date.
2) -from:{xxx#xxx.com xxx#xxx.com ...}
3) from:{xxx#xxx.com xxx#xxx.com ...}
4) subject:{xxx xxx ...}
[1] https://support.google.com/mail/answer/7190
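The four steps above can be sketched as a small query builder; the list contents below are placeholders, and only the final string would be passed as q to messages().list():

```python
from datetime import datetime, timedelta

# Placeholder lists; fill with your actual senders and subjects.
blacklist = ["notifications@facebook.com", "noreply@twitter.com"]
whitelist = ["bill@pge.com"]
subjects = ["Your bill", "This month's bill"]

# 'after:' accepts a Unix timestamp; here, two days before now.
after_ts = int((datetime.now() - timedelta(days=2)).timestamp())

query = " ".join([
    f"after:{after_ts}",                                        # step 1
    "-from:{" + " ".join(blacklist) + "}",                      # step 2
    "from:{" + " ".join(whitelist) + "}",                       # step 3
    "subject:{" + " ".join(f'"{s}"' for s in subjects) + "}",   # step 4
])
print(query)
```

The {a b c} grouping is Gmail's OR syntax, so each list becomes a single term regardless of how many entries it holds.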
I am using the docx library to read files from a Word doc, and I am trying to extract only the questions using regex search and match. I have found countless ways of doing it, but I keep getting a "TypeError".
The data I am trying to extract is this:
Will my financial aid pay for housing?
Off Campus Housing - After financial aid applies toward your tuition and fees, any remaining funds will be sent to you as a refund that will either be directly deposited (which can be set up through your account) or mailed to you as a paper check. You can then use the refund to pay your rent. It is important to note that financial aid may not be available when rent is due, so make sure to have a plan in place to pay your rent. Will my financial aid pay for housing?
"financial" "help" "house"
funds "univ oak"
"money" "chisho"
"pay" "chap"
"grant" "laurel"
What are the requirements to receive a room and grant?
How do I pay for my housing?
How do I pay for housing?
If there's also an easier method of exporting the Word doc into a different type of file, that would be great to know. Thank you
I am using regex101; I've tried the following regex expressions to match only the sentences that end in a question mark:
".*[?=?]$"
"^(W|w).*[?=?]$"
"^[A-Za-z].*[?=?]$"
import re
import sys
from docx import Document

wordDoc = Document('botDoc.docx')
result = re.search('.*[?=?]$', wordDoc)
print(result)
if result:
    print(result.group(0))

for table in wordDoc.tables:
    for row in table.rows:
        for cell in row.cells:
            print("test")
I expect to save the matching patterns into dictionaries so I can export the data to a CSV file.
Your error:
result = re.search('.*[?=?]$', wordDoc)
I believe that this line is the cause of the problem. search() is expecting a string as a second parameter, but is receiving a Document object.
What you should do is use the findall() function. search() only finds the first match for a pattern; findall() finds all the matches and returns them as a list of strings, with each string representing one match.
Since you are working with docx, you would have to extract the contents of the docx and use that as the second parameter of findall(). If I remember correctly, you do this by first extracting all the paragraphs and then the text of each individual paragraph. Refer to this question.
FYI, the way you would do this for a simple text file is the following:
# Open file
with open('test.txt', 'r') as f:
    # Feed the file text into findall(); it returns a list of all the found strings
    strings = re.findall(r'your pattern', f.read())
Your Regex:
Unfortunately, your regex is not quite correct: although logically it makes sense to match any sentence that ends in a ?, one of your matches would be place to pay your rent. Will my financial aid pay for housing?, for example, and only the second part of that match is an actual question. So require the match to start with an upper-case letter rather than a lower-case one. Your regex should be something like:
[A-Z].*\?$
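Putting the pieces together, a sketch of that pattern applied per line with findall() and re.MULTILINE. The literal sample below stands in for the document text; with python-docx you would build it as shown in the comment:

```python
import re

# With python-docx you would build the text from the document, e.g.:
#   text = "\n".join(p.text for p in Document('botDoc.docx').paragraphs)
# Here a literal sample (modeled on the question's data) stands in for it.
text = """Will my financial aid pay for housing?
Off Campus Housing - After financial aid applies toward your tuition...
What are the requirements to receive a room and grant?
How do I pay for my housing?"""

# re.MULTILINE makes ^ and $ anchor at each line, so we get one match per
# line that starts with a capital letter and ends in a question mark.
questions = re.findall(r'^[A-Z].*\?$', text, flags=re.MULTILINE)
print(questions)
```

The resulting list of strings can then be written out with the csv module.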
I'm trying to parse a WhatsApp chat log using regex. I have a solution that works for most cases, but I'm looking to improve it and don't know how, since I am quite new to regex.
The chat.txt file looks like this:
[06.12.16, 16:46:19] Person One: Wow thats amazing
[06.12.16, 16:47:13] Person Two: Good morning and this goes over multiple
lines as it is a very long message
[06.12.16, 16:47:22] Person Two: ::
My solution so far parses most of these messages correctly; however, I have a few hundred cases where the message starts with a colon, like the last example above. This leads to an unwanted value of Person Two: : as the sender.
Here is the regex I am working with so far:
pattern = re.compile(r'\[(?P<date>\d{2}\.\d{2}\.\d{2}),\s(?P<time>\d{2}:\d{2}:\d{2})]\s(?P<sender>(?<=\s).*(?::\s*\w+)*(?=:)):\s(?P<message>(?:.+|\n+(?!\[\d{2}\.\d{2}\.\d{2}))+)')
Any advice on how I could go around this bug would be appreciated!
I would pre-process the list to remove the consecutive colons before applying the regex. So for each line, e.g.:
line = "[06.12.16, 16:47:22] Person Two: ::"
line = line.replace("::", "")
which would give:
[06.12.16, 16:47:22] Person Two:
You can then call your regex function on the pre-processed data.
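That pre-processing pass can be sketched as follows (the sample lines are illustrative; note the timestamps are untouched because they never contain two adjacent colons):

```python
# Sketch of the suggested clean-up pass over the raw chat lines.
raw_lines = [
    "[06.12.16, 16:46:19] Person One: Wow thats amazing",
    "[06.12.16, 16:47:22] Person Two: ::",
]

# Remove consecutive colons so the sender-matching regex can't overrun.
cleaned = [line.replace("::", "") for line in raw_lines]
print(cleaned)
```

One caveat worth keeping in mind: this also strips a legitimate "::" appearing anywhere in a message body, not just at the start.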
I encountered similar issues when building a tool to analyze WhatsApp chats.
The main issue is that the format of the chat.txt depends on your system language. In German you will get 24-hour times like 16:47, but in English it might use AM/PM, and the month format changes for American users.
The library I used has the 4 regexes below. So far they have covered all occurring cases (Latin-script languages).
Filtering general:
const regexParser = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? (.+?): ([^]*)/i;
Filter System Messages:
const regexParserSystem = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? ([^]+)/i;
Date:
const regexSplitDate = /[-/.] ?/;
Handle attachments, which are passed in "< >" even when you export the chat without attachments (e.g. <Media omitted>):
const regexAttachment = /<.+:(.+)>/;
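For anyone wanting to reuse these in Python: re rejects the JavaScript [^] idiom ("any character, including newlines"), so it has to be rewritten, e.g. as [\s\S]. A sketch of porting the general parser, tried only on a sample line:

```python
import re

# Port of the JS regexParser above; `[^]` is replaced by `[\s\S]` because
# Python's re does not accept an empty negated character class.
regex_parser = re.compile(
    r"^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?"
    r"(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? "
    r"(.+?): ([\s\S]*)",
    re.IGNORECASE,
)

m = regex_parser.match("[06.12.16, 16:46:19] Person One: Wow thats amazing")
print(m.groups())
```

The groups come out as (date, time, am/pm or None, sender, message), mirroring the JavaScript capture order.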
I have a text file, populated with the following:
"": "hello1(10.0.0.1)"
},
{
"": "hello2 (10.0.0.2)"
},
{
"": "hello3(10.0.0.3)"
},
{
It's not properly structured, as it was scraped and dumped into a text file.
There are over 100 such segments.
Despite how it looks, the page was not plain HTML, which is why I couldn't simply extract the data in structured form.
Now I would like to use Python to extract just the hostname, Model number and IP address in an orderly list.
So it would look something like the following in new lines:
hostname: hello1 Model No: 2901 IP address: 10.0.0.1
hostname: hello2 Model No: 2911 IP address: 10.0.0.2
hostname: hello3 Model No: 2911 IP address: 10.0.0.3
But I'm struggling to work out how to do this: first extracting the necessary info from the first segment, then the next, and so on.
Any suggestions would be greatly appreciated.
I'm not gonna fully answer this as you didn't show us any code. Rather I'll give you some hints that will help:
The way I'd do it:
strip() away any newline characters and spaces from your file
use a regex to match the groups you need
Regex101 also has a nice way of generating the corresponding code in different languages, so you'll be done after a bit of post-processing (however, for learning purposes, I don't recommend it)
Look into the re module and implement the regex; you can read the docs for it
Of course, you'll have to work out yourself how to handle opening the file, reading its contents, applying all of the above, and ordering the data however you like. Good luck.
Here's a starting point
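A minimal sketch of that approach, hedged: it only covers the '"": "name(ip)"' shape shown in the sample, and since the model numbers in the desired output don't appear anywhere in that sample, only hostname and IP are extracted here:

```python
import re

# Raw text modeled on the question's dump.
raw = '''
"": "hello1(10.0.0.1)"
},
{
"": "hello2 (10.0.0.2)"
},
{
"": "hello3(10.0.0.3)"
'''

# A word, an optional space, then a dotted-quad IP in parentheses.
pat = re.compile(r'"(\w+) ?\((\d+\.\d+\.\d+\.\d+)\)"')
pairs = pat.findall(raw)
for host, ip in pairs:
    print(f"hostname: {host} IP address: {ip}")
```

From here the pairs can be reordered or reformatted however the final list needs to look.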