I have a Grid Variable defined at the job level (using Manage Grid Variables). I then have a Python script that creates a dataframe. I append the contents of the dataframe to an array and then update my grid variable. However, when I use the grid variable in a Transformation Job, the contents are not updated.
This is a snippet of my code:
for obj in objs:
    s3file = s3_client.get_object(Bucket='somebucket', Key=obj.key)
    tbl = pd.read_csv(s3file['Body'])
    row_count = len(tbl.index)
    for i in range(row_count):
        # collect columns 'a' through 'y' for this row
        record = [tbl.at[i, col] for col in 'abcdefghijklmnopqrstuvwxy']
        record.append('something')
        array.append(record)

context.updateGridVariable('somegridvar', array)
print(len(array))
arr2 = context.getGridVariable('somegridvar')
print(len(arr2))
print(arr2)
The print(len(arr2)) line prints the correct number of records, and print(arr2) prints the array correctly.
But when I use the grid variable in a Transformation Job, it doesn't have the records loaded in the Python script.
Looks good so far: in the Orchestration Job, you have got the Grid Variable to contain the values you need.
Check how you are passing the Grid Variable into the Transformation Job.
The best way is to follow the Python Script component with a Run Transformation component, and use its Set Grid Variables property to map somegridvar onto whatever you named the columns in the Transformation Job's own Grid Variable.
You have to do that from the same instance of the Orchestration Job run. If you run the Orchestration Job on its own, and then sometime later run the Transformation Job on its own, all variables will just have reverted to their default values.
Related
I am working with a large data set containing portfolio holdings of clients per date (i.e. in each time period, I have a number of stock investments for each person). My goal is to try and identify 'buys' and 'sells'. A buy happens when a new stock appears in a person's portfolio (compared to the previous period). A sell happens when a stock disappears in a person's portfolio (compared to the previous period). Is there an easy/efficient way to do this in Python? I can only think of a cumbersome way via for-loops.
Suppose we have the following dataframe, which can be created with the following code:
df = pd.DataFrame({'Date_ID':[1,1,1,1,2,2,2,2,2,2,3,3,3,3], 'Person':['a', 'a', 'b', 'b', 'a', 'a', 'a', 'a', 'b', 'b', 'a', 'a', 'a', 'b'], 'Stock':['x1', 'x2', 'x2', 'x3', 'x1', 'x2', 'x3', 'x4', 'x2', 'x3', 'x1', 'x2', 'x3', 'x3']})
I would like to create the 'bought' and 'sell' columns, which identify stocks that have been added to or are about to be removed from the portfolio. The buy column equals True if the stock newly appears in the person's portfolio (compared to the previous date). The sell column equals True if the stock disappears from the person's portfolio on the next date.
How to accomplish this (or something similar to identify trades efficiently) in Python?
You can group your dataframe by 'Person' first, because people are completely independent from each other. After that, for each person, group by 'Date_ID', and for each stock in a group determine whether it was present in the previous group (bought) and whether it is present in the next group (sold):
def get_person_indicators(df):
    """`df` here contains info for one person only."""
    g = df.groupby('Date_ID')['Stock']

    # set of stocks held on each date, shifted to give the previous date's holdings
    prev_stocks = g.agg(set).shift()
    was_bought = g.transform(lambda s: ~s.isin(prev_stocks[s.name])
                             if not pd.isnull(prev_stocks[s.name])
                             else False)

    # shifted the other way to give the next date's holdings
    next_stocks = g.agg(set).shift(-1)
    will_sell = g.transform(lambda s: ~s.isin(next_stocks[s.name])
                            if not pd.isnull(next_stocks[s.name])
                            else False)

    return pd.DataFrame({'was_bought': was_bought, 'will_sell': will_sell})

# group_keys=False keeps the original row index so the concat lines up
result = pd.concat([df, df.groupby('Person', group_keys=False)
                          .apply(get_person_indicators)],
                   axis=1)
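For the sample frame above, the concatenated result should come out roughly like this (x3 and x4 newly appear for person a on date 2; x4 then disappears by date 3, as does person b's x2):

    Date_ID Person Stock  was_bought  will_sell
0         1      a    x1       False      False
1         1      a    x2       False      False
2         1      b    x2       False      False
3         1      b    x3       False      False
4         2      a    x1       False      False
5         2      a    x2       False      False
6         2      a    x3        True      False
7         2      a    x4        True       True
8         2      b    x2       False       True
9         2      b    x3       False      False
10        3      a    x1       False      False
11        3      a    x2       False      False
12        3      a    x3       False      False
13        3      b    x3       False      False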
Note:
For better memory usage you can change the dtype of the 'Stock' column from str to Categorical:
df['Stock'] = df['Stock'].astype('category')
I have output in the below format:
values = {'A': (node:A connections:{B:[0.9565217391304348], C:[0.5], D:[0], E:[0], F:[0], I:[0]}),
'B': (node:B connections:{F:[0.7], D:[0.631578947368421], J:[0]}),
'C': (node:C connections:{D:[0.5]})}
When I run print(type(values)), the output is pm4py.objects.heuristics_net.obj.HeuristicsNet.
I want to take each NODE and its CONNECTIONS, then create two columns which hold all connections to the individual nodes, as seen below:
import pandas as pd
df = pd.DataFrame({'Nodes': ['A','A','A','A','A','A','B','B','B','C'],
                   'Connection': ['B','C','D','E','F','I','F','D','J','D']})
It is simply a combination of each node with each of its connections. I have worked with simple dictionaries before, but I am unsure how to extract the info required here.
How do I proceed?
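One possible sketch, assuming the usual pm4py object model where values.nodes is a dict mapping each node name to a Node object, and each Node exposes an output_connections dict keyed by target Node objects (both attribute names are assumptions about pm4py's internals, worth verifying against your version):

import pandas as pd

rows = []
for name, node in values.nodes.items():       # iterate over source nodes
    for target in node.output_connections:    # each outgoing connection
        rows.append({'Nodes': name, 'Connection': target.node_name})

df = pd.DataFrame(rows)
print(df)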
I'm having trouble changing the type of my variable to a categorical data type.
My variable is called "Energy class" and contains the following values:
A++, A+, A, B, C, D, E, F, G.
I want to change the type to a category and order the categories in that same order.
Hence: A++ = 1, A+ = 2, A = 3, B = 4 , etc.
I will also have to perform the same manipulation on another variable, "Condition of the building", which contains the following values: "Very good", "Good", "To be restored".
I tried using the pandas set_categories() method, but it didn't work. There is very little information on how to use it in the documentation.
Anyone knows how to deal with this?
Thank you
You can use map:
energy_class = {'A++': 1, 'A+': 2, ...}
df['Energy class'] = df['Energy class'].map(energy_class)
A bit fancier, when you have an ordered list of the classes:

energy_classes = ['A++', 'A+', ...]
df['Energy class'] = df['Energy class'].map(
    {c: i for i, c in enumerate(energy_classes, 1)})
You can use an ordered pd.Categorical:

df['energy_class'] = pd.Categorical(
    df['energy_class'],
    categories=['A++', 'A+', 'A', 'B', 'C', 'D', 'E', 'F', 'G'],
    ordered=True)
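Once the column is an ordered Categorical, the numeric ranks the question asks for (A++ = 1, A+ = 2, ...) fall out of the category codes, which are 0-based. A small sketch, assuming the second variable lives in a column named 'condition' (the column names here are placeholders):

# codes are 0-based, so add 1 to get A++ = 1, A+ = 2, ...
df['energy_rank'] = df['energy_class'].cat.codes + 1

# the same pattern works for the second variable
df['condition'] = pd.Categorical(
    df['condition'],
    categories=['Very good', 'Good', 'To be restored'],
    ordered=True)
df['condition_rank'] = df['condition'].cat.codes + 1

Being ordered also means comparisons and sorting respect this ranking directly, e.g. df.sort_values('energy_class').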
I have this code, as written by this gentleman in https://github.com/venky14/Machine-Learning-with-Iris-Dataset/blob/master/Machine%20Learning%20with%20Iris%20Dataset.ipynb
After splitting the data into training and testing, I am unable to select the features for the training and testing data. The error is thrown at In[92], and it gives me:
KeyError: "['A' 'B' 'C' 'D' 'E' 'F' 'H' 'I'] not in index"
Below is an image of what my CSV file looks like.
It seems that you are using column names as indexes. Please provide sample code, because the referenced ipynb seems to be correct.
Probably you are looking for this:
import pandas as pd
df = pd.read_csv('sample-table.csv')
df_selected_columns = df[['A', 'B', 'C', 'D', 'E', 'F', 'H', 'I']]
np_ndarray = df_selected_columns.values
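If the KeyError persists even with this selection, the CSV headers may simply not match the names in the list; stray whitespace in headers is a common culprit. Inspecting and normalising them is a quick check:

print(df.columns.tolist())           # see the exact header names
df.columns = df.columns.str.strip()  # drop any accidental whitespace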
I am trying to join recarrays in Python such that the same value joins to many elements. The following code works when it is a 1:1 ratio, but when I try many:1, it only joins one instance:
import numpy as np
import matplotlib.mlab
# First data structure
sex = np.array(['M', 'F', 'M', 'F', 'M', 'F'])
causes = np.array(['c1', 'c1', 'c2', 'c2', 'c3', 'c3'])
data1 = np.core.records.fromarrays([sex, causes], names='sex, causes')
# Second data structure
causes2 = np.array(['c1', 'c2', 'c3'])
analyst = np.array(['p1', 'p2', 'p3'])
data2 = np.core.records.fromarrays([causes2, analyst], names='causes, analyst')
# Join on Cause
all_data = matplotlib.mlab.rec_join('causes', data1, data2, jointype='leftouter')
What I would like the all_data recarray to contain is all of the data from data1 with the corresponding analyst indicated in data2.
There might be a good use for record arrays here, but I thought a Python dict should be just as good... I'd want to know the numpy way of doing this myself, too, if there is a good one.
import matplotlib.mlab as mlab  # old matplotlib API; removed in recent versions

dct = dict(zip(data2['causes'], data2['analyst']))
all_data = mlab.rec_append_fields(data1, 'analyst',
                                  [dct[x] for x in data1['causes']])
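Since the mlab.rec_* helpers were removed from modern matplotlib, here is a sketch of the same dict-based lookup using numpy.lib.recfunctions instead (same data1 and data2 as above):

import numpy as np
from numpy.lib import recfunctions as rfn

dct = dict(zip(data2['causes'], data2['analyst']))
# append_fields adds the looked-up analyst values as a new field on data1
all_data = rfn.append_fields(data1, 'analyst',
                             [dct[x] for x in data1['causes']],
                             usemask=False, asrecarray=True)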