Apache NiFi: ExecuteStreamCommand generating two flow files - python

I'm currently running into a problem with the Apache NiFi ExecuteStreamCommand processor using Python. I have a script which reads a CSV, converts it into pandas DataFrames, and then into JSON. The script splits the CSV file into several DataFrames due to inconsistent naming of the columns. My current script looks as follows:
import pandas as pd
import sys

input_lines = sys.stdin.readlines()
# searching for subfiles and saving them to a list called files ...
appendDataFrames = []
for i in range(len(files)):
    df = pd.DataFrame(files[i])
    # several improvements of the DataFrame ...
    appendDataFrames.append(df)
output = pd.concat(appendDataFrames)
JSONOutPut = output.to_json(orient='records', date_format='iso', date_unit='s')
sys.stdout.write(JSONOutPut)
In the queue to my next processor I can now see one FlowFile containing the JSON (as expected).
My question is: is it possible to write each JSON to a separate FlowFile, so that my next processor is able to work on them separately? I need to do this because the next processor is an InferAvroSchema, and since the JSONs all have different schemas, a single combined FlowFile is not an option. Am I mistaken? Or is there a possible way to solve this?
The code below won't work, since everything still ends up in the same FlowFile and my InferAvroSchema is not able to handle that.
import pandas as pd
import sys

input_lines = sys.stdin.readlines()
# searching for subfiles and saving them to a list called files ...
for i in range(len(files)):
    df = pd.DataFrame(files[i])
    # several improvements of the DataFrame ...
    JSONOutPut = df.to_json(orient='records', date_format='iso', date_unit='s')
    sys.stdout.write(JSONOutPut)
Thanks in advance!

With ExecuteStreamCommand you can't split the output, because all you can do is write to stdout. However, you could write a delimiter between the JSON documents and use a SplitContent processor with that same delimiter as the next processor.

I just modified my code as follows:
import pandas as pd
import sys

input_lines = sys.stdin.readlines()
# searching for subfiles and saving them to a list called files ...
for i in range(len(files)):
    df = pd.DataFrame(files[i])
    # several improvements of the DataFrame ...
    JSONOutPut = df.to_json(orient='records', date_format='iso', date_unit='s')
    sys.stdout.write(JSONOutPut)
    sys.stdout.write("#;#")  # delimiter for SplitContent
And added a SplitContent processor configured to split on that delimiter.
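For reference, a rough SplitContent configuration for this (property names taken from the standard NiFi processor; double-check them against your NiFi version): Byte Sequence Format = Text, Byte Sequence = #;#, and Keep Byte Sequence = false, so the delimiter itself is dropped from the resulting FlowFiles.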

Related

Is there an easier way to filter out rows from multiple CSV files and paste them into a new CSV file? I'm having issues doing that using a for loop

# Purpose: read every CSV file in the directory, filter the rows whose
# 'result' column says "fail", then copy those rows into a new CSV file.

# import necessary libraries
import pandas as pd
import os
import glob
import csv

# the path to your csv file directory
mycsvdir = 'C:\\Users\\'  # this is where all the csv files will be housed.

# use glob to get all the csv files in that directory
# (assuming they have the extension .csv)
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))

# loop through the files and read them in with pandas
dataframes = []  # a list to hold all the individual pandas DataFrames
for csvfile in csv_files:
    # read the csv file
    df = pd.read_csv(csvfile)
    dataframes.append(df)

# roi_id is the column label of the first cell in the csv, result is the Jth column label
dataframes = dataframes[dataframes['result'].str.contains('fail')]

# print out to a new csv file
dataframes.to_csv('ROI_Fail.csv')  # rewrite this to mirror where you want to save the failed rows.
I tried running this script but I'm getting a couple of errors. First off, I know my indentation is off (newbie over here), and I'm getting a big error under my for loop saying that "csv_files" is not defined. Any help would be greatly appreciated.
There are two issues here:
The first one is the easy one: the variable in the for loop should be csvfiles, not csv_files.
The second one (which will show up once you fix the one above) is that you are treating a list of DataFrames as a DataFrame.
The object dataframes in your script is a list to which you are appending the DataFrames created from the CSV files. As such, you cannot index it by column name as you are trying to do.
If your DataFrames have the same layout, I'd recommend using pd.concat to join all of them into a single DataFrame, and then filtering the rows as you did here:
full_dataframe = pd.concat(dataframes, axis=0)
full_dataframe = full_dataframe[full_dataframe['result'].str.contains('fail')]
As a tip for future posts, I'd recommend you also include the full traceback from your program. It helps us understand exactly what error you had when executing your code.
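Putting both fixes together, a minimal corrected sketch of the whole loop (assuming every CSV really has a 'result' column) could look like this:

import glob
import os
import pandas as pd

mycsvdir = 'C:\\Users\\'  # directory holding the csv files

# collect and read every csv into its own DataFrame
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))
dataframes = [pd.read_csv(csvfile) for csvfile in csvfiles]

# join them into one DataFrame, then keep only the 'fail' rows
full_dataframe = pd.concat(dataframes, axis=0)
failed = full_dataframe[full_dataframe['result'].str.contains('fail')]

# write the failed rows to a new csv file
failed.to_csv('ROI_Fail.csv', index=False)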

Converting .CIF files to a dataset (csv, xls, etc)

How are you all? Hope you're doing well!
So, get this: I need to convert some .CIF files (found here: https://www.ccdc.cam.ac.uk/support-and-resources/downloads/ - MOF Collection) to a format that I can use with pandas, such as CSV or XLS. I'm researching the use of MOFs for hydrogen storage, and this collection from Cambridge's Structural Database would do wonders for me.
So far, I was able to convert them using ToposPro, but not to a format that I can read with pandas.
So, do any of you know of a way to do this? I've also read about pymatgen and matminer, but I've never used them before.
Also, sorry for any mishap with my writing; English isn't my main language. And thanks for your help!
To read a .CIF file as a pandas DataFrame, you can use the Bio.PDB.MMCIF2Dict module from biopython to first parse the .CIF file and return a dictionary. Then you need pandas.DataFrame.from_dict to create a DataFrame from the bio-dictionary. Finally, you have to use pandas.DataFrame.transpose to turn the rows into columns (since we'll build the frame with orient='index', which deals with "missing" values in the dict).
You need to install biopython by executing this line in your (Windows) terminal:
pip install biopython
Then, you can use the code below to read a specific .CIF file:
import pandas as pd
from Bio.PDB.MMCIF2Dict import MMCIF2Dict
dico = MMCIF2Dict(r"path_to_the_MOF_collection\abavij_P1.cif")
df = pd.DataFrame.from_dict(dico, orient='index')
df = df.transpose()
>>> display(df)
Now, if you need to read the whole MOF collection (~10k files) as a DataFrame, you can use this:
from pathlib import Path
import pandas as pd
from Bio.PDB.MMCIF2Dict import MMCIF2Dict
from time import time
mof_collection = r"path_to_the_MOF_collection"
start = time()
list_of_cif = []
for file in Path(mof_collection).glob('*.cif'):
    dico = MMCIF2Dict(file)
    temp = pd.DataFrame.from_dict(dico, orient='index')
    temp = temp.transpose()
    temp.insert(0, 'Filename', Path(file).stem)  # to get the .CIF filename
    list_of_cif.append(temp)
df = pd.concat(list_of_cif)
end = time()
print(f'The DataFrame of the MOF Collection was created in {end-start} seconds.')
df
>>> output
I'm sure you're aware that the .CIF files may have different numbers of columns, so feel free to concat (or not) the MOF collection. And last but not least, if you want to get a .csv and/or an .xlsx file of your DataFrame, you can use either pandas.DataFrame.to_csv or pandas.DataFrame.to_excel:
df.to_csv('your_output_filename.csv', index=False)
df.to_excel('your_output_filename.xlsx', index=False)
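A side note on those differing columns: pd.concat aligns frames on their column names, so any column missing from a given file simply shows up as NaN in the combined DataFrame. A tiny sketch of that behaviour:

import pandas as pd

a = pd.DataFrame({'x': [1], 'y': [2]})
b = pd.DataFrame({'x': [3], 'z': [4]})

# rows coming from a get NaN in 'z'; rows coming from b get NaN in 'y'
print(pd.concat([a, b], ignore_index=True))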
EDIT:
To read the structure of a .CIF file as a DataFrame, you can use the as_dataframe() method from pymatgen:
from pymatgen.io.cif import CifParser
parser = CifParser("abavij_P1.cif")
structure = parser.get_structures()[0]
structure.as_dataframe()
>>> output
In case you need to check whether a .CIF file has a valid structure, you can use:
if len(structure) == 0:
    print('The .CIF file has no structure')
Or:
try:
    structure = parser.get_structures()[0]
except:
    print('The .CIF file has no structure')

How to automate the process of converting a list of .dat files, with their dictionaries (in separate .dct format), to pandas data frames?

The following code converts .dat files into data frames with the use of their dictionary files in .dct format. It works well. But my problem is that I was unable to automate this process; creating a loop that takes the pairs of these files from lists is a little bit tricky, at least for me. I could really use some help with that.
try:
    from statadict import parse_stata_dict
except ImportError:
    !pip install statadict  # notebook-only syntax (Jupyter/IPython)

import pandas as pd
from statadict import parse_stata_dict

dict_file = '2015_2017_FemPregSetup.dct'
data_file = '2015_2017_FemPregData.dat'

stata_dict = parse_stata_dict(dict_file)
nsfg = pd.read_fwf(data_file,
                   names=stata_dict.names,
                   colspecs=stata_dict.colspecs)
# nsfg is now a pandas DataFrame
These are the lists of files that I would like to convert into data frames. Every .dat file has its own dictionary file:
dat_name = ['2002FemResp.dat',
            '2002Male.dat', ...]
dct_name = ['2002FemResp.dct',
            '2002Male.dct', ...]
Assuming both lists have the same length and that you want to save each DataFrame as a CSV, you could try:
c = 0
for dat, dct in zip(dat_name, dct_name):
    c += 1
    stata_dict = parse_stata_dict(dct)
    pd.read_fwf(dat,
                names=stata_dict.names,
                colspecs=stata_dict.colspecs).to_csv(r'path_name\file_name_{}.csv'.format(c))
    # don't forget the '.csv'!
Also consider that if you are not using Windows you need to use '/' rather than '\' in your paths (or you can use os.path.join() to avoid the issue altogether).
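A slightly tidier variant of the same loop, using enumerate instead of a manual counter and os.path.join for the output path ('path_name' and the file naming pattern are placeholders):

import os
import pandas as pd
from statadict import parse_stata_dict

for c, (dat, dct) in enumerate(zip(dat_name, dct_name), start=1):
    stata_dict = parse_stata_dict(dct)  # parse the matching .dct dictionary
    df = pd.read_fwf(dat,
                     names=stata_dict.names,
                     colspecs=stata_dict.colspecs)
    df.to_csv(os.path.join('path_name', 'file_name_{}.csv'.format(c)))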

Storing and accessing dataframe objects

I'm trying to read some Excel files as pandas DataFrames. The problem is they are quite large (about 2500 rows, columns up to the 'CYK' label in the Excel sheet, and there are 14 of them).
Every time I run my program, it has to import the files from Excel again. This makes the runtime grow a lot: it's currently a bit more than 15 minutes, and as of now the program doesn't even do anything significant except import the files.
I would like to be able to import the files just once, then save the dataframe objects somewhere and make my program work only on those dataframes.
Any suggestions?
This is the code I've developed so far:
import pandas as pd
import os

path = r'C:/Users/damia/Dropbox/Tesi/WIOD'
dirs = os.listdir(path)
complete_dirs = []
for f in dirs:
    complete_dirs.append(path + r"/" + f)

data = []
for el in complete_dirs:
    wiod = pd.read_excel(el, engine='pyxlsb')
    data.append(wiod)
If anyone is interested, you can find the files I'm trying to read at this link:
http://www.wiod.org/database/wiots16
You could use the to_pickle and read_pickle methods provided by pandas to serialize the DataFrames and store them in files (see the pandas docs for both methods).
Example pickling:
data = []
pickle_paths = []
for el in complete_dirs:
    wiod = pd.read_excel(el, engine='pyxlsb')
    # here's where you store it
    pickle_loc = 'your_unique_path_to_save_this_frame'
    wiod.to_pickle(pickle_loc)
    pickle_paths.append(pickle_loc)
    data.append(wiod)
Example depickling:
data = []
for el in pickle_paths:
    data.append(pd.read_pickle(el))
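To really pay the Excel import cost only once across runs, you could also wrap this in a small cache check and read Excel only when no pickle exists yet (a sketch; the 'pickles' directory name is just an assumption):

import os
import pandas as pd

pickle_dir = 'pickles'  # hypothetical cache directory
os.makedirs(pickle_dir, exist_ok=True)

data = []
for el in complete_dirs:
    # one pickle per source file, keyed by the Excel file name
    pickle_loc = os.path.join(pickle_dir, os.path.basename(el) + '.pkl')
    if os.path.exists(pickle_loc):
        wiod = pd.read_pickle(pickle_loc)  # fast path: reuse the cached frame
    else:
        wiod = pd.read_excel(el, engine='pyxlsb')  # slow path: first run only
        wiod.to_pickle(pickle_loc)
    data.append(wiod)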
Another solution using to_pickle and read_pickle.
As an aside, you can read Excel files directly from URLs if you don't want to save to your drive first.
import pandas as pd

# read each file from the URL and save it to disk as a pickle
for year in range(2000, 2015):
    pd.read_excel(f"http://www.wiod.org/protected3/data16/wiot_ROW/WIOT{year}_Nov16_ROW.xlsb",
                  engine='pyxlsb').to_pickle(f"{year}.pkl")

# read the pickle files from disk into a single dataframe
data = list()
for year in range(2000, 2015):
    data.append(pd.read_pickle(f"{year}.pkl"))
data = pd.concat(data)

Pandas Dataframe.to_csv - insert value of variable into beginning of csv file

Python 3.8.5 Pandas 1.1.3
I'm using the following to loop through json files and create csv files:
import glob
import json
import pandas as pd

def stuff():
    results_list = []
    for filepath in glob.iglob('/Users/me/data/*.json'):
        filename = str(filepath)
        with open(filepath, "r") as file:
            data = json.load(file)  # json_normalize needs parsed JSON, not the raw string
        df = pd.json_normalize(data, 'main')
        df.to_csv(filename + '.csv')
        results_list.append(data)
    return results_list
The format of the resulting CSV files fits my requirements exactly, without having to pass any additional params to the to_csv method: when viewing the CSV file in Excel, row 1 holds the keys as headers, and column 1 holds the index numbers. Exactly what I need. Cell A1 is blank.
One final step I need to accomplish is to write the value of the filename variable to the CSV file. Ideally I'd like to put it in cell A1, if possible. Can I accomplish this solely with to_csv, or am I going to need to get into csv.writer world?
You can exploit the index name for that purpose:
df.rename_axis('somename').to_csv()
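Applied to the loop above, that would look something like this (a sketch using the question's own variables); since to_csv writes the index name as the first field of the header row, it lands in cell A1 when the file is opened in Excel:

# inside the loop, instead of df.to_csv(filename + '.csv'):
df.rename_axis(filename).to_csv(filename + '.csv')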
