Apache NiFi: Processing multiple CSVs using the ExecuteScript Processor - Python

I have a csv with 70 columns. The 60th column contains a value which decides whether the record is valid or invalid. If the 60th column has 0, 1, 6 or 7, the record is valid; if it contains any other value, it is invalid.
I realised that this functionality wasn't possible relying completely on changing the properties of processors in Apache NiFi. Therefore I decided to use the ExecuteScript processor and added this Python code as the script body.
import csv

valid = 0
invalid = 0
total = 0

file1 = open("valid.csv", "w")
file2 = open("invalid.csv", "w")
writer_valid = csv.writer(file1)
writer_invalid = csv.writer(file2)

with open('/Users/himsaragallage/Desktop/redder/Regexo_2019101812750.dat.csv') as f:
    r = csv.reader(f)
    for row in r:  # iterate over the csv reader, not the raw file object
        # print(row[1])
        total += 1
        if row[59] in ("0", "1", "6", "7"):
            valid += 1
            writer_valid.writerow(row)
        else:
            invalid += 1
            writer_invalid.writerow(row)

file1.close()
file2.close()

print("Total : " + str(total))
print("Valid : " + str(valid))
print("Invalid : " + str(invalid))
I have no idea how to use a session and code within the ExecuteScript processor as shown in this question. So I just wrote a simple Python script and directed the valid and invalid data to different files. This approach has many limitations:
I want to be able to dynamically process CSVs with different filenames.
The CSV which the invalid data is sent to must have the same filename as the input CSV.
There will be around 20 CSVs in my redder folder. All of them must be processed in one go.
I hope you can suggest a method for me to do the above. Feel free to provide a solution by editing the Python code I have used, or even by using a completely different set of processors and excluding the ExecuteScript processor altogether.

Here are complete step-by-step instructions on how to use the QueryRecord processor.
Basically, you need to set up the highlighted properties.
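For this question the routing queries might look like the following (a sketch; with a header-less file the reader's field names are assumptions, with column_60 standing in for the 60th column). Each user-defined property becomes a relationship of the processor:

valid:   SELECT * FROM FLOWFILE WHERE column_60 IN ('0', '1', '6', '7')
invalid: SELECT * FROM FLOWFILE WHERE column_60 NOT IN ('0', '1', '6', '7')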

You want to route records based on values from one column. There are various ways to make this happen in NiFi. I can think of the following:
Use QueryRecord processor to partition records by column values
Use RouteOnContent processor to route using a regular expression
Use ExecuteScript processor to create custom routing logic (see the sketch after this list)
Use PartitionRecord processor to route based on RecordPaths
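For the ExecuteScript option, a rough sketch of a session-based script (untested; it assumes the Jython engine, ignores csv quoting when rows are re-joined, and relies on session.clone() keeping the original filename attribute on both outgoing flowfiles):

import csv
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

class FilterRows(StreamCallback):
    # Keeps either the valid or the invalid rows of the incoming CSV
    def __init__(self, keep_valid):
        self.keep_valid = keep_valid
    def process(self, inputStream, outputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        kept = []
        for row in csv.reader(text.splitlines()):
            is_valid = len(row) > 59 and row[59] in ("0", "1", "6", "7")
            if is_valid == self.keep_valid:
                kept.append(",".join(row))
        outputStream.write(bytearray("\n".join(kept).encode("utf-8")))

flowFile = session.get()
if flowFile is not None:
    invalid_ff = session.clone(flowFile)               # same content and filename attribute
    flowFile = session.write(flowFile, FilterRows(True))
    invalid_ff = session.write(invalid_ff, FilterRows(False))
    session.transfer(flowFile, REL_SUCCESS)            # valid rows
    session.transfer(invalid_ff, REL_FAILURE)          # invalid rows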
I will show you how to solve your problem using the PartitionRecord processor. Since you did not provide any example data, I created an example use case: I want to distinguish cities in Europe from cities elsewhere. The following data is given:
id,city,country
1,Berlin,Germany
2,Paris,France
3,New York,USA
4,Frankfurt,Germany
Flow:
GenerateFlowFile:
PartitionRecord:
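A sketch of the PartitionRecord properties used here (the controller-service names and the RecordPath are assumptions); the user-defined property country makes the processor evaluate the RecordPath /country against each record:

Record Reader: CSVReader
Record Writer: CSVRecordSetWriter
country: /country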
CSVReader should be set up to infer the schema, and CSVRecordSetWriter to inherit the schema. PartitionRecord will group records by country and pass each group on together with a country attribute that holds the country value. You will see the following groups of records:
id,city,country
1,Berlin,Germany
4,Frankfurt,Germany
id,city,country
2,Paris,France
id,city,country
3,New York,USA
Each group is a flowfile and will have the country attribute, which you will use to route the groups.
RouteOnAttribute:
All countries from Europe will be routed to the is_europe relationship. Now you can apply the same strategy to your use case.
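For the example above, the is_europe property could be defined with NiFi Expression Language along these lines (a sketch; extend with further countries as needed):

is_europe: ${country:equals('Germany'):or(${country:equals('France')})}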


Python script takes a very long time to run

I've managed to write a piece of code (composed from multiple sources along the web and adapted to my needs) which should do the following:
Read an Excel file.
For each cell in column A, search for its value within the subjects of the mails in a specific folder.
If it matches (the cell value equals the first 9 characters of the subject), save the attachment (each mail has exactly one attachment, no more, no less) under the value of the cell in an "output" folder.
If it doesn't match, go to the next mail, and then to the next cell value.
At the end, display the run time (not very important, only for my own information).
The code actually works (tested with an email folder with only 9 emails). My problem is the run time.
The actual scope of the script is to look for 2539 values in a folder with 32700 emails and save the attachments.
I've done 2 runs as follows:
2539 values in 32700 emails (stopped after ~1 hour)
10 values in 32700 emails (stopped after ~40 minutes; in this time the script processed 4 values)
I would like to know/learn whether there is a way to make the script faster, or whether it's slow because it's badly written, etc.
Below is my code:
from pathlib import Path
import time
import win32com.client
import openpyxl

start = time.time()  # start the clock before the work so the measurement is meaningful

# name of the folder created for output
output_dir = Path.cwd() / "Orders"

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
folder = outlook.Folders.Item("Shared Mailbox Name")
inbox = folder.Folders.Item("Inbox")
messages = inbox.Items

wb = openpyxl.load_workbook(r"C:\Users\TEST\Path-to-excel\FolderName\ExcelName.xlsx")
sheet = wb['Sheet1']
names = sheet['A']

for cellObj in names:
    ordno = str(cellObj.value)
    print(ordno)
    for message in messages:
        subject = message.Subject
        attachments = message.Attachments
        if str(subject)[:9] == ordno:
            output_dir.mkdir(parents=True, exist_ok=True)
            for attachment in attachments:
                # SaveAsFile expects a full path as a string
                attachment.SaveAsFile(str(output_dir / attachment.FileName))

print(f'Time taken to run: {time.time() - start} seconds')
I need to mention that I am a complete rookie in Python, so any help from the community is welcome, especially with some clarification of what I did wrong and why.
I've also read some similar questions, but nothing helped, or at least I didn't know how to adapt the methods.
Thank you!
It seems to me the main problem with your program is that you have two nested loops (one over the values and one over the mails) when you only need to loop over the mails and check whether each subject is in the list of values.
First you need to construct your list of values, with something like:
ordno_values = [str(cellObj.value) for cellObj in names]
Then, in your loop over the mails, you just need to adapt the condition to:
if str(subject)[:9] in ordno_values:
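Putting both changes together, the whole search collapses to one pass over the mails (a sketch reusing the names from the question; a set is used instead of a list so the membership test is constant-time):

ordno_values = {str(cellObj.value) for cellObj in names}

for message in messages:
    if str(message.Subject)[:9] in ordno_values:
        output_dir.mkdir(parents=True, exist_ok=True)
        for attachment in message.Attachments:
            attachment.SaveAsFile(str(output_dir / attachment.FileName))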
Your use case is too specific for anyone to be able to recreate, and hints about performance can only be generic, but your main problem is a combination of "M x N" iteration (every value against every message) and synchronous processing: currently you are processing one value and one message at a time, which includes disk I/O to fetch each e-mail.
You can certainly improve things by creating a single list of values from the workbook. You can then use this list with a processing pool (see the Python documentation) to read multiple e-mails at once.
But things might be even better if you can use the subject to query the mail server.
If you have follow-up questions, please break them down to specific parts of the task.
First of all, instead of iterating over all items in the folder:
for message in messages:
    subject = message.Subject
And then checking whether the subject starts with the specified string or includes such a string:
if str(subject)[:9] == ordno:
Instead, you need to use the Find/FindNext or Restrict methods of the Items class, which give you a collection of items that correspond to your search criteria. Read more about these methods in the following articles:
How To: Use Find and FindNext methods to retrieve Outlook mail items from a folder (C#, VB.NET)
How To: Use Restrict method to retrieve Outlook mail items from a folder
For example, you could use the following restriction on the collection (taken from the VBA sample):
criteria = "#SQL=" & Chr(34) & "urn:schemas:httpmail:subject" & Chr(34) & " ci_phrasematch 'question'"
See Filtering Items Using a String Comparison for more information.
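Applied to the Python code from the question, the restriction could look like this (a sketch, untested; % is the DASL wildcard, and messages, ordno and output_dir are the names used in the question):

# Ask Outlook/Exchange to filter server-side instead of scanning every item
criteria = '@SQL="urn:schemas:httpmail:subject" LIKE \'' + ordno + "%'"
for message in messages.Restrict(criteria):
    for attachment in message.Attachments:
        attachment.SaveAsFile(str(output_dir / attachment.FileName))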
Also you may find the AdvancedSearch method of the Application class helpful. The key benefits of using the AdvancedSearch method in Outlook are:
The search is performed in another thread. You don’t need to run another thread manually since the AdvancedSearch method runs it automatically in the background.
Possibility to search for any item types: mail, appointment, calendar, notes etc. in any location, i.e. beyond the scope of a certain folder. The Restrict and Find/FindNext methods can be applied to a particular Items collection (see the Items property of the Folder class in Outlook).
Full support for DASL queries (custom properties can be used for searching too). To improve the search performance, Instant Search keywords can be used if Instant Search is enabled for the store (see the IsInstantSearchEnabled property of the Store class).
You can stop the search process at any moment using the Stop method of the Search class.
See Advanced search in Outlook programmatically: C#, VB.NET for more information on that.

Python loop error in SPSS syntax only if I run the same code twice

I'm quite new to Python programming.
I'm trying to automate some tabulations in SPSS using Python (and I kind of managed it...) with a loop and some Python code, but it works fine only the first time I run the syntax; the second time it tabulates only once.
I have an SPSS file with different projects merged together (i.e. different countries), so first I try to extract a list of projects using a built-in function.
Once I have my list of projects, I run a loop and change the SPSS syntax for the case selection and tabulation.
This is the code:
begin program.
import spss

# Function that extracts the data from SPSS
def DatiDaSPSS(vars, num):
    if num == 0:
        num = spss.GetCaseCount()
    if vars == None:
        varNums = range(spss.GetVariableCount())
    else:
        allvars = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
        varNums = [allvars.index(i) for i in vars]
    data = spss.Cursor(varNums)
    pydata = data.fetchmany(num)
    data.close()
    return pydata

# store the result of the function into a list:
all_prj = DatiDaSPSS(vars=["Project"], num=0)
# remove duplicates and keep only the countries that I need:
prj_list = list(set([i[0] for i in all_prj]))
# loop for the tabulation:
for i in range(len(prj_list)):
    prj_now = str(prj_list[i])
    spss.Submit("""
compute filter_$=Project='%s'.
filter by filter_$.
exe.
TEXT "Country"
  /OUTLINE HEADING="%s" TITLE="Country".
CTABLES
  /VLABELS VARIABLES=HisInterviewer HisResult DISPLAY=DEFAULT
  /TABLE HisInterviewer [C][COUNT F40.0, ROWPCT.COUNT PCT40.1] BY HisResult [C]
  /CATEGORIES VARIABLES=HisInterviewer HisResult ORDER=A KEY=VALUE EMPTY=EXCLUDE TOTAL=YES
   POSITION=AFTER
  /CRITERIA CILEVEL=95.
""" % (prj_now, prj_now))
end program.
When I run it the second time, it shows only the last value of the list (and only one tabulation). If I restart SPSS it works fine the first time.
Is it because of the function?
I'm using SPSS 25.
Can I reply to myself? Should I edit the discussion or maybe delete it? I think I found out the reason: I guess the function picks up only the cases that are already selected. I tried adding this SPSS code before the begin and it seems to be working:
use all.
exe.
begin program.
...
At the last loop there is a filter on the data, and I removed it before running the script. Please let me know if you want me to edit or remove the message.
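For reference, the same reset can also be issued from inside the program block, before the cursor reads the data (a sketch; use all and filter off are plain SPSS syntax):

# clear any filter left over from a previous run, so that
# spss.Cursor sees all cases, not just the filtered ones
spss.Submit("use all.\nfilter off.\nexe.")
all_prj = DatiDaSPSS(vars=["Project"], num=0)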

How to merge a few lines while filtering some text

I have a text file with the following format.
The first line includes "USERID"=12345678 and the other lines include the user groups for each application:
For example:
A user with T-number T12345 has WRITE access to APP1 and APP2 and READ-ONLY access to APP1.
The T-number is just some other kind of ID.
00001, 00002 and so on are sequence numbers and can be ignored.
T12345;;USERID;00001;12345678;
T12345;APPLICATION;WRITE;00001;APP1
T12345;APPLICATION;WRITE;00002;APP2
T12345;APPLICATION;READ-ONLY;00001;APP1
I need to do some filtering and merge the line containing USERID with all the lines containing user groups, matching the T-number with the USERID (T12345 = 12345678).
So the output should look like this:
12345678;APPLICATION;WRITE;APP1
12345678;APPLICATION;WRITE;APP2
12345678;APPLICATION;READ-ONLY;APP1
Should I use the csv Python module to accomplish this?
I do not see any advantage in using the csv module for reading and parsing the input text file. The number of fields varies: 6 fields in the USERID line, with 2 of them empty, but 5 non-empty fields in the other lines. The fields look very simple, so there is no need for csv's handling of the separator character hidden away in quotes and the like. There is no header line as in a csv file, but rather many headers sprinkled in among the data lines.
A simple routine that reads each line, splits it on the semicolon character, parses the fields, and combines related lines would suffice.
The output file is another matter. The lines have the same format, with the same number of fields. So creating that output may be a good use for csv. However, the format is so simple that the file could also be created without csv.
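Such a routine might look like this (a sketch; the file names are placeholders, and the sequence-number field is dropped as the question allows):

userid = None
access_rows = []

with open("input.txt") as f:                  # placeholder file name
    for line in f:
        fields = line.strip().split(";")
        if fields[2] == "USERID":
            userid = fields[4]                # the numeric id for this T-number
        elif fields[1] == "APPLICATION":
            access_rows.append((fields[2], fields[4]))

with open("output.txt", "w") as out:          # placeholder file name
    for right, app in access_rows:
        out.write(";".join([userid, "APPLICATION", right, app]) + "\n")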
I am not so sure you should use the csv module here - it has mixed data, possibly more than just users and user group rights. In the case of a user declaration, you only need to retrieve its group and id, while for the application rights you need to extract the group, the app name and the right. The more differing data you have, the more issues you will encounter - with manual parsing of the data you can always just continue when a line meets certain criteria.
All in all, I would say you are better off with manual, line-by-line parsing of the lines; structure them into something meaningful, then output the data. For instance:
from io import StringIO      # Python 3; on Python 2 use: from StringIO import StringIO
from pprint import pprint

feed = """T12345;;USERID;00001;12345678;
T12345;;USERID;00001;2345678;
T12345;;USERID;00002;345678;
T12345;;USERID;00002;45678;
T12345;APPLICATION;WRITE;00001;APP1
T12345;APPLICATION;WRITE;00002;APP2
T12345;APPLICATION;READ-ONLY;00001;APP1
T12345;APPLICATION;WRITE;00002;APP1
T12345;APPLICATION;WRITE;00002;APP2"""

buf = StringIO(feed)

groups = {}

# Read all data into a dict of dicts
for line in buf:
    values = line.strip().split(";")
    if values[3] not in groups:
        groups[values[3]] = {"users": [], "apps": {}}
    if values[2] == "USERID":
        groups[values[3]]['users'].append(values[4])
        continue
    if values[1] == "APPLICATION":
        if values[4] not in groups[values[3]]["apps"]:
            groups[values[3]]["apps"][values[4]] = []
        groups[values[3]]["apps"][values[4]].append(values[2])

print("Structured data with group as root")
pprint(groups)

print("Output data")
for group_id, group in groups.items():       # iteritems() on Python 2
    # Order by user, app
    for user in group["users"]:
        for app_name, rights in group["apps"].items():
            for right in rights:
                print(";".join([user, "APPLICATION", right, app_name]))
Online demo here

Checking if A follows B on twitter using Tweepy/Python

I have a list of a few thousand twitter ids and I would like to check who follows whom in this network.
I used Tweepy to get the accounts using something like:
ids = {}
for i in list_of_accounts:
    for page in tweepy.Cursor(api.followers_ids, screen_name=i).pages():
        ids[i] = page
        time.sleep(60)
The values in the dictionary ids form the network I would like to analyze. If I try to get the complete list of followers for each id (to compare to the list of users in the network) I run into two problems.
The first is that I may not have permission to see the user's followers - that's okay and I can skip those - but they stop my program. This is the case with the following code:
connections = {}
for x in user_ids:
    l = []
    for page in tweepy.Cursor(api.followers_ids, user_id=x).pages():
        l.append(page)
    connections[x] = l
The second is that I have no way of telling when my program will need to sleep to avoid the rate limit. If I put a 60-second wait after every page in this query, my program would take too long to run.
I tried to find a simple 'exists_friendship' command that might get around these issues in a simpler way, but I only found things that became obsolete with the change to API 1.1. I am open to using other packages for Python. Thanks.
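As for the rate limit, Tweepy can sleep automatically whenever a window is exhausted; a sketch (flag names as in Tweepy 3.x, credential variables are placeholders):

import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# wait_on_rate_limit makes Cursor pagination block until the window resets
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)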
if api.exists_friendship(userid_a, userid_b):
    print("a follows b")
else:
    print("a doesn't follow b; check separately if b follows a")

Compare two files and make a list

I have two files that I want to compare with each other to form a list. Each file has its own class: Book and Person. These have different attributes. The ones I want to compare are: person.personalcode == book.borrowed. From this I want a list of all the borrowed books. I have started like this:
for person in person_list:
    for book in booklibrary_list:
        if person.personalcode == book.borrowed:
            person.books.append(book, person)

for person in person_list:
    if len(person.books) > 0:
        print(person.personalcode + "," + person.firstname + person.lastname + " have borrowed the following books: ")
        for book in person.books:
            print(book)

for person in person_list:
    person.books = []
But it does not work. What have I missed or done wrong?
Posting as an answer as this is too long for a comment.
First: improve your question. Show how you construct the Person and Book classes and how you populate them. Describe what the personalcode is and how it can be the same as a book code. Some sample data and a bit more code would make this easier to answer.
Second: reading your other question, you seem to be storing your data in a text file, loading and querying, modifying and saving the data directly. This will lead you to problems and instead you should consider going down one of two lines:
Use an SQL database, possibly the easiest to start with is SQLite as it does not need a server to be set up and there is a module in the standard library that is very easy to use. Store your data there and you will find it easier in the long run.
Use Python objects (e.g. three classes: Person, Book, and BorrowedBook), manage lists of them within the program, and use shelve from the standard library to store and retrieve these lists of objects between queries.
The use of shelve would be easier if you have not used SQL before, and I hope you will forgive the pun when I say that it might be very appropriate for a book-related application!
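A minimal sketch of the shelve approach (file and key names are placeholders):

import shelve

# save the lists between runs
with shelve.open("library") as db:
    db["persons"] = person_list
    db["books"] = booklibrary_list

# ...and load them again at the next start
with shelve.open("library") as db:
    person_list = db["persons"]
    booklibrary_list = db["books"]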
