I have never used procmail before but I believe (from my R&D) that it is likely my best choice to crack my riddle. Our system receives an email, out of which I need 3 values, which are:
either a 4-digit or 5-digit integer from the SUBJECT line. (we will refer to as "N")
email alias from REPLY-TO line (we will refer to as "R")
determine the type of email it is, by which I mean to say a "case" or a "project". (we will refer to as "T") This value would be parsed out of the SUBJECT line.
If anyone could help me with that recipe, I would be most appreciative.
The next thing I need to do is:
send these 3 values to a Python script (can I do this directly from procmail? pipe? something else?)
delete the email messages
I need to accept these emails from only 4 domain names, such as:
(@sjobeck.com|@cases.example.com|@messages.example.com|@bounces.example.com)
Last, I need to pipe these 3 values into the second script, and I would like some advice on the best syntax for doing so. Any advice here is most appreciated. Would it be something like this:
this-recipe $N $T $R | second-script.py
Or exactly how would that look? Or is this not a procmail issue and a Python issue? (if it is, that's fine, I'll handle it over there.)
Thanks so much!
Jason
Procmail can extract those values, or you can just pass the whole message to Python on stdin.
Assuming you want the final digits and you require there to be 4 or 5, something like this:
R=`formail -zxReply-to: | sed 's/.*<//;s/>.*//'`
:0
* ^From:.*@(helpicantfindgoogle\.com|searchengineshateme\.net|disabled\.org)\>
* ^Subject:(.*[^0-9])?\/[0-9][0-9][0-9][0-9][0-9]?$
| scriptname.py --reply-to "$R" --number "$MATCH"
This illustrates two different techniques for extracting a header value: the Reply-To header is extracted by invoking formail (this will extract just the email terminus, as per your comment; if you mean something else by "alias" then please define it properly), while the trailing 4- or 5-digit integer from the Subject is grabbed by matching it in the condition with the special operator \/.
Update: Added an additional condition to only process email where the From: header indicates a sender in one of the domains helpicantfindgoogle.com, searchengineshateme.net, or disabled.org.
As implied by the pipe action, your script will be able to read the triggering message on its standard input, but if you don't need it, just don't read standard input.
If delivery is successful, Procmail will stop processing when this recipe finishes. Thus you should not need to explicitly discard a matching message. (If you want to keep going, use :0c instead of just :0.)
As an efficiency tweak (if you receive a lot of email, and only a small fraction of it needs to be passed to this script, for example) you might want to refactor to only extract the Reply-To: when the conditions match.
:0
* ^From:.*@(helpicantfindgoogle\.com|searchengineshateme\.net|disabled\.org)\>
* ^Subject:(.*[^0-9])?\/[0-9][0-9][0-9][0-9][0-9]?$
{
R=`formail -zxReply-To: | sed 's/.*<//;s/>.*//'`
:0
| scriptname.py --reply-to "$R" --number "$MATCH"
}
The block (the stuff between { and }) will only be entered when both the conditions are met. The extraction of the number from the Subject: header into $MATCH works as before; if the From: condition matched and the Subject: condition matched, the extracted number will be in $MATCH.
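On the Python side, a minimal sketch of what scriptname.py might look like is shown below. It takes the --reply-to and --number options used in the recipe and reads the message from standard input to work out the type. How "case" versus "project" appears in the Subject was not specified, so the word search in message_type is purely an assumption, and the function names are hypothetical.

```python
import argparse
import email
import re
import sys


def parse_args(argv):
    """Parse the options that the Procmail recipe passes on the command line."""
    parser = argparse.ArgumentParser(description="Handle values extracted by Procmail")
    parser.add_argument("--reply-to", required=True, help="address from the Reply-To: header")
    parser.add_argument("--number", required=True, type=int, help="4- or 5-digit number from the Subject:")
    return parser.parse_args(argv)


def message_type(subject):
    """Assumption: the Subject contains the literal word 'case' or 'project'."""
    match = re.search(r"\b(case|project)\b", subject, re.IGNORECASE)
    return match.group(1).lower() if match else None


def main(argv=None, stdin=None):
    args = parse_args(sys.argv[1:] if argv is None else argv)
    # Procmail's pipe action delivers the whole message on standard input.
    msg = email.message_from_file(sys.stdin if stdin is None else stdin)
    subject = msg.get("Subject", "")
    return args.number, args.reply_to, message_type(subject)

# In the deployed script, finish by calling main() and acting on the
# (number, reply_to, type) tuple it returns.
```

If the type turns out to be easy to match in the recipe itself, it could instead be passed as a third option, and the script would not need to read standard input at all.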
I am trying to grab a list of messages that have a specific content e.g. billing emails and work on data in there.
In order to get these messages, I run the following
service.users().messages().list(userId=user_id, page_token=page_token, q=query).execute()
which returns all the messages.
I want to limit the messages that I get to those that conform to the following criteria:
Sent in the last two days
Definitely deny if the from: address is in a list of email addresses, i.e. a blacklist (e.g. notifications, facebook)
Definitely accept if from: address in a list of email addresses i.e. whitelist
Look if the subject: matches a set of strings
I understand that I can create a query that would match the email address and subject (from:bill@pge.com AND subject:"Your bill for this month"), but the blacklist and whitelist, as mentioned above, can become significantly large as the scope and the number of vendors I can accept increases, and the same goes for the subjects. So my question is:
Is there a limit on the number of query terms?
Is there a way to achieve this other than generating a very long query string combining the blacklist, whitelist and subjects (from:abc@this.com AND NOT from:xyz@that.com AND subject:"Your bill" AND subject:"This month's bill")?
Note: For project settings I mostly conform to https://developers.google.com/gmail/api/quickstart/python
No limit is documented for the number of query terms you can use. And yes, you would have to build a long query string programmatically, combining all the addresses from the lists. You can check the available operators here [1]. The best approach would be something like this:
1) Use "after" or "newer" operators with a timestamp from 2 days before the current date.
2) -from:{xxx@xxx.com xxx@xxx.com ...}
3) from:{xxx@xxx.com xxx@xxx.com ...}
4) subject:{xxx xxx ...}
[1] https://support.google.com/mail/answer/7190
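The four steps above could be assembled programmatically along these lines. This is only a sketch: build_query, brace_group and the today parameter are hypothetical names, the terms are simply juxtaposed (which Gmail treats as AND), and whether quoted phrases are accepted inside the braces is an assumption worth checking against a real mailbox.

```python
from datetime import date, timedelta


def brace_group(terms):
    """Join terms into a {a b c} group, quoting any term that contains a space."""
    quoted = ['"%s"' % t if " " in t else t for t in terms]
    return "{" + " ".join(quoted) + "}"


def build_query(whitelist, blacklist, subjects, today=None, days=2):
    """Build one query string from the lists, mirroring steps 1) to 4)."""
    today = today or date.today()
    since = today - timedelta(days=days)
    parts = ["after:%s" % since.strftime("%Y/%m/%d")]
    if blacklist:
        parts.append("-from:" + brace_group(blacklist))
    if whitelist:
        parts.append("from:" + brace_group(whitelist))
    if subjects:
        parts.append("subject:" + brace_group(subjects))
    return " ".join(parts)
```

The resulting string would then be passed as the q argument of the messages().list() call shown in the question.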
I'm trying to parse a WhatsApp chat log using regex. I have a solution that works for most cases but I'm looking to improve it but don't know how to since I am quite new to regex.
The chat.txt file looks like this:
[06.12.16, 16:46:19] Person One: Wow thats amazing
[06.12.16, 16:47:13] Person Two: Good morning and this goes over multiple
lines as it is a very long message
[06.12.16, 16:47:22] Person Two: ::
My solution so far parses most of these messages correctly, but I have a few hundred cases where the message starts with a colon, like the last example above. This leads to an unwanted value of Person Two: : as the sender.
Here is the regex I am working with so far:
pattern = re.compile(r'\[(?P<date>\d{2}\.\d{2}\.\d{2}),\s(?P<time>\d{2}:\d{2}:\d{2})]\s(?P<sender>(?<=\s).*(?::\s*\w+)*(?=:)):\s(?P<message>(?:.+|\n+(?!\[\d{2}\.\d{2}\.\d{2}))+)')
Any advice on how I could go around this bug would be appreciated!
I would pre-process the list to remove the consecutive colons before applying the regex. So for each line, e.g.
line = "[06.12.16, 16:47:22] Person Two: ::"
line = line.replace("::", "")
which would give :
[06.12.16, 16:47:22] Person Two:
You can then call your regex function on the pre-processed data.
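An alternative that keeps the message text intact is to stop the sender group at the first colon instead of letting it match greedily. The sketch below is a simplified variant of the pattern from the question, not the original author's regex. It assumes that sender names never contain a colon, and parse_chat is a hypothetical helper name.

```python
import re

PATTERN = re.compile(
    r'\[(?P<date>\d{2}\.\d{2}\.\d{2}),\s(?P<time>\d{2}:\d{2}:\d{2})\]\s'
    # The sender ends at the first colon, so a message of "::" cannot leak into it.
    r'(?P<sender>[^:\n]+):\s'
    # The message runs until a newline that is followed by the next "[dd.dd.dd," stamp.
    r'(?P<message>(?:.|\n(?!\[\d{2}\.\d{2}\.\d{2},))+)'
)


def parse_chat(text):
    """Return (sender, message) pairs for every message found in text."""
    return [(m.group('sender'), m.group('message').strip())
            for m in PATTERN.finditer(text)]
```

With the sample chat.txt from the question, the last entry comes out with Person Two as the sender and :: as the message.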
I encountered similar issues when building a tool to analyze WhatsApp chats.
The main issue is that the format of chat.txt depends on your system language. In German you will get 16:47, but in English the time might carry AM/PM, and the date format changes for American users.
The library I used has the 4 regexes below. So far they have covered all occurring cases (Latin Languages).
Filtering general:
const regexParser = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? (.+?): ([^]*)/i;
Filter System Messages:
const regexParserSystem = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? ([^]+)/i;
Date:
const regexSplitDate = /[-/.] ?/;
Handle attachments, which are passed in "< >" even when you export the chat without attachments (e.g. <media omitted>):
const regexAttachment = /<.+:(.+)>/;
I am trying to make a custom right click command for nautilus.
I managed to find some useful content here.
What I don't understand is what these two lines essentially mean:
IFS_BAK=$IFS
IFS="
"
They are present at the bottom too. What do they mean?
Please help.
IFS_BAK=$IFS essentially creates a backup of the existing value of the IFS variable.
The next line then assigns a new value to IFS, i.e. the one required by the script.
More info on Internal Field Separator (IFS) can be found here: https://unix.stackexchange.com/questions/16192/what-is-ifs-in-context-of-for-looping
https://unix.stackexchange.com/questions/184863/what-is-the-meaning-of-ifs-n-in-bash-scripting
https://unix.stackexchange.com/questions/26784/understanding-ifs
Okay, I got it.
It is called an 'Internal Field Separator', a special variable in shell.
If you set IFS to | (i.e. IFS='|'), | will be treated as the delimiter between words/fields when splitting a line of input.
In the first line:
IFS_BAK=$IFS
the initial 'IFS' value is stored in the variable 'IFS_BAK' and the value of IFS is set to 'new line' by
IFS="
"
so that the entire line is treated as a 'single input'.
Later, at the end of the program, the IFS value is restored to what it was originally.
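The effect of the two settings can be illustrated with a loose Python analogy. This is not how the shell implements word splitting, and the file names are made up for the example; it only shows why a newline-only separator keeps names with spaces in one piece.

```python
# Hypothetical input: two file names, one per line, the first containing a space.
selected = "my file.txt\nnotes.txt"

# Like the default IFS (space, tab, newline): any whitespace separates fields.
default_split = selected.split()

# Like IFS set to a newline only: each line stays a single field.
newline_split = selected.split("\n")
```

Here default_split breaks "my file.txt" into two fields, while newline_split keeps it whole.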
I am trying to write some basic Python for my Kolab email server. For the primary_mail, I want it to be first initial plus last name, such as jdoe. The default is first name (dot) last name: john.doe@domain.com
I have come up with the following:
primary_mail ='%(givenname)s'[0:1]%(surname)s@%(domain)s
Which I want to basically produce jdoe@domain.com
givenname would be someone's first name (e.g. John).
surname would be someone's last name (e.g. Doe).
domain is the email domain, e.g. domain.com
When python goes to canonify it, it comes up with some mumbo jumbo like so:
'john[0:1]'doe@domain.com
Can someone help me out with correcting this? I am so close.
EDIT:
According to the Kolab documentation, it looks like it is something like:
"{0}@{1}": "format('%(uid)s', '%(domain)s')"
This of course doesn't work for me though....
EDIT 2:
I am getting the following in my error logs:
imaps[1916]: ptload completely failed: unable to canonify identifier: 'john'[0:1]doe@domain.com
String formatting is by far the easiest, most readable and preferred way of accomplishing this:
first_name = 'John'
surname = 'Smith'
domain = 'company.com'
primary_mail = '{initial}{surname}@{domain}'.format(initial=first_name[0].lower(), surname=surname.lower(), domain=domain)
primary_mail now equals 'jsmith@company.com'. You define a string containing named placeholders in braces, then call the format method to have those placeholders replaced at runtime with the appropriate values. Here, we take the first character of first_name and convert it to lower case, convert the entirety of surname also, and leave domain unchanged.
You can read more on string formatting at the Python 2.7 docs.
James Scholes is right that format is a better way of doing it. However, reading the Kolab documentation, it seems that you can only supply the format string, and they use the % style formatter internally, which you can't change. From the Kolab 'primary_mail' documentation:
primary_mail = %(givenname)s.%(surname)s@%(domain)s
The equivalent of the following Python is then executed:
primary_mail = "%(givenname)s.%(surname)s@%(domain)s" % {
"givenname": "Maria",
"surname": "Moller",
"preferredlanguage": "en_US"
}
In this case, we need a modifier to the format conversion. We have %(givenname)s, which formats givenname as a string. We can also specify a minimum width, followed by a . and then a precision. This is normally only used with numbers, but we can use it for strings, too, where the precision acts as a maximum length. Here is a format string with no minimum width, but a maximum length (precision) of 1 character:
"%(givenname).1s"
So you probably want a string like this:
"%(givenname).1s%(surname)s@%(domain)s"
I have a bunch of files (TV episodes, although that is fairly arbitrary) that I want to check against a specific naming/organisation scheme.
Currently: I have three arrays of regexes, one for valid filenames, one for files missing an episode name, and one for valid paths.
Then, I loop through each valid-filename regex. If a file matches, I append it to a "valid" dict. If not, I do the same with the missing-ep-name regexes; if it matches one of these, I append it to an "invalid" dict with an error code (2:'missing episode name'). If it matches neither, it gets added to invalid with the 'malformed name' error code.
The current code can be found here
I want to add a rule that checks for the presence of a folder.jpg file in each directory, but adding this would make the code substantially messier in its current state.
How could I write this system in a more expandable way?
The rules it needs to check would be..
File is in the format Show Name - [01x23] - Episode Name.avi or Show Name - [01xSpecial02] - Special Name.avi or Show Name - [01xExtra01] - Extra Name.avi
If the filename is in the format Show Name - [01x23].avi, display it in a 'missing episode name' section of the output
The path should be in the format Show Name/season 2/the_file.avi (where season 2 should be the correct season number in the filename)
each Show Name/season 1/ folder should contain "folder.jpg"
Any ideas? While I'm trying to check TV episodes, this concept/code should be able to apply to many things.
The only thought I had was a list of dicts in the format:
checker = [
    {
        'name': 'valid files',
        'type': 'file',
        'function': check_valid,  # pass the function itself; it is run on all files
        'status': 0  # if it returns True, this is the status the file gets
    }
]
"I want to add a rule that checks for the presence of a folder.jpg file in each directory, but to add this would make the code substantially more messy in its current state.."
This doesn't look bad. In fact your current code does it very nicely, and Sven mentioned a good way to do it as well:
Get a list of all the files
Check for "required" files
You would just have to add a list of required files to your dictionary:
checker = {
...
'required': ['file', 'list', 'for_required']
}
As far as there being a better/extensible way to do this? I am not exactly sure. I could only really think of a way to possibly drop the "multiple" regular expressions and build off of Sven's idea of using a delimiter. So my strategy would be to define a dictionary as follows (I'm sorry, I don't know Python syntax and I'm a tad too lazy to look it up, but it should make sense. The /regex/ is shorthand for a regex):
check_dict = {
'delim' : /\-/,
'parts' : [ 'Show Name', 'Episode Name', 'Episode Number' ],
'patterns' : [/valid name/, /valid episode name/, /valid number/ ],
'required' : ['list', 'of', 'files'],
'ignored' : ['.*', 'hidden.txt'],
'start_dir': '/path/to/dir/to/test/'
}
Split the filename based on the delimiter.
Check each of the parts.
Because it's an ordered list, you can determine which parts are missing, and if a section doesn't match any pattern it is malformed. Here the parts and patterns have a 1 to 1 ratio. Two arrays instead of a dictionary enforces the order.
Ignored and required files can be listed. The . and .. files should probably be ignored automatically. The user should be allowed to input "globs" which can be shell expanded. I'm thinking here of svn:ignore properties, but globbing is natural for listing files.
Here start_dir would be default to the current directory but if you wanted a single file to run automated testing of a bunch of directories this would be useful.
The real loose end here is the path template and along the same lines what path is required for "valid files". I really couldn't come up with a solid idea without writing one large regular expression and taking groups from it... to build a template. It felt a lot like writing a TextMate language grammar. But that starts to stray on the ease of use. The real problem was that the path template was not composed of parts, which makes sense but adds complexity.
Is this strategy in tune with what you were thinking of?
Maybe you should take the approach of defaulting to "the filename is correct" and work from there to disprove that statement:
Given that you only allow filenames with 'show name', 'season number x episode number' and 'episode name', you know for certain that these items should be separated by a "-" (dash), so you have to have 2 of those for a filename to be correct.
If that checks out, you can use your code to check that the show name matches the show name as seen in the parent's parent folder (case insensitive, I assume), and that the season number matches the parent folder's numeric value (with or without an extra 0 prepended).
If however you don't see the correct number of dashes, you instantly know that there is something wrong and can stop before the rest of the tests etc.
Separately, you can check whether the file folder.jpg exists and take the necessary actions. Or do that first and filter that file out from the rest of the files in that folder.
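One way to keep such a checker expandable is to make every rule a small function that returns an error string (or None) and to keep the rules in a list, so that adding a rule means adding one function. The sketch below works on path strings rather than the real filesystem. All names in it are hypothetical, and the regexes are only a guess at the naming scheme described in the question.

```python
import re
from collections import defaultdict

EP = r"\[(?P<season>\d{2})x(?:\d{2}|Special\d{2}|Extra\d{2})\]"
VALID = re.compile(r"^(?P<show>.+) - " + EP + r" - .+\.avi$")
MISSING_NAME = re.compile(r"^.+ - " + EP + r"\.avi$")


def check_name(path):
    """File rule: the name must follow 'Show - [01x23] - Episode.avi'."""
    name = path.rsplit("/", 1)[-1]
    if VALID.match(name):
        return None
    return "missing episode name" if MISSING_NAME.match(name) else "malformed name"


def check_path(path):
    """File rule: the file must sit in 'Show/season N/' matching its own name."""
    parts = path.split("/")
    m = VALID.match(parts[-1])
    if not m:
        return None  # name problems are reported by check_name
    expected = "season %d" % int(m.group("season"))
    if len(parts) < 3 or parts[-3] != m.group("show") or parts[-2].lower() != expected:
        return "wrong path"
    return None


FILE_RULES = [check_name, check_path]  # add new per-file rules here


def run_file_rules(paths):
    """Apply every file rule to every .avi path; return {path: [errors]}."""
    errors = {}
    for path in paths:
        if path.endswith(".avi"):
            problems = [e for e in (rule(path) for rule in FILE_RULES) if e]
            if problems:
                errors[path] = problems
    return errors


def check_required(paths, required=("folder.jpg",)):
    """Directory rule: report required files missing from each directory."""
    by_dir = defaultdict(set)
    for path in paths:
        directory, _, name = path.rpartition("/")
        by_dir[directory].add(name)
    missing = {}
    for directory, names in by_dir.items():
        absent = [r for r in required if r not in names]
        if absent:
            missing[directory] = absent
    return missing
```

The folder.jpg check then lives in its own directory-level function and does not touch the per-file rules at all. The list of paths could come from os.walk in a real run.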