I have the code from this notebook: https://github.com/venky14/Machine-Learning-with-Iris-Dataset/blob/master/Machine%20Learning%20with%20Iris%20Dataset.ipynb
After splitting the data into training and testing sets, I am unable to select the features for the training and testing data. The error is thrown at In[92]. It gives me:
KeyError: "['A' 'B' 'C' 'D' 'E' 'F' 'H' 'I'] not in index"
Below is an image of how my CSV file looks.
It seems that you are selecting column names that are not in the DataFrame's index. Please provide sample code, because the referenced ipynb seems to be correct.
You are probably looking for this:
import pandas as pd

# Load the CSV, then select the feature columns by name
df = pd.read_csv('sample-table.csv')
df_selected_columns = df[['A', 'B', 'C', 'D', 'E', 'F', 'H', 'I']]

# .values returns the selection as a NumPy ndarray
np_ndarray = df_selected_columns.values
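If the KeyError persists, the usual cause is that the CSV header does not match the names being selected. A quick check (using the same df as above) makes any mismatch visible:

# Print the columns pandas actually parsed from the CSV
# and compare them against ['A', 'B', 'C', 'D', 'E', 'F', 'H', 'I']
print(df.columns.tolist())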
Is it possible in jupyter to display tabular data in some interactive format?
so that, for example, the following data
A,x,y
a,0,0
b,5,2
c,5,3
d,0,3
will be scrollable and sortable by the A, x, and y columns?
Yes, it is possible.
First, install itables:
!pip install itables
In the next step, import the module and turn on interactive mode:
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)
Let's load your data into a pandas DataFrame:
import pandas as pd

# Keep x and y numeric so that column sorting behaves numerically
data = {
    'A': ['a', 'b', 'c', 'd'],
    'x': [0, 5, 5, 0],
    'y': [0, 2, 3, 3],
}
df = pd.DataFrame(data)
df
See the result: the table renders as an interactive, scrollable, sortable grid.
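As an aside, if you would rather not make every DataFrame interactive, itables also provides a show() function for rendering a single table on demand, for example:

from itables import show

# Display only this DataFrame as an interactive, sortable table
show(df)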
I want to impute the missing data using multivariate imputation. In the data set attached below, column A has some missing values, and columns A and B have a correlation factor of 0.70. So I want to use a regression-style relationship to model column A from column B and impute the missing values in Python.
N.B.: I can do it using the mean, median, or mode, but I want to use the relationship with another column to fill the missing values.
How should I deal with this problem? Your solution, please.
import pandas as pd
import numpy as np
# Assign the data as lists
data = {
    'Date': ['9/19/14', '9/20/14', '9/21/14', '9/21/14', '9/19/14', '9/20/14', '9/21/14', '9/21/14', '9/19/14', '9/20/14', '9/21/14', '9/21/14', '9/21/14'],
    'A': [77.13, 39.58, 33.70, np.nan, np.nan, 39.66, 64.625, 80.04, np.nan, np.nan, 19.43, 54.375, 38.41],
    'B': [19.5, 21.61, 22.25, 25.05, 24.20, 23.55, 5.70, 2.675, 2.05, 4.06, -0.80, 0.45, -0.90],
    'C': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'c', 'c'],
}
# Create DataFrame
df = pd.DataFrame(data)
df["Date"]= pd.to_datetime(df["Date"])
# Print the output.
print(df)
Use:
from sklearn.neural_network import MLPRegressor

# Split the rows: those with A present are used for fitting,
# those with A missing are the ones to impute
dfreg = df[df['A'].notna()]
dfimp = df[df['A'].isna()]

# Fit a small neural-network regressor that predicts A from B
regr = MLPRegressor(random_state=1, max_iter=200).fit(dfreg['B'].values.reshape(-1, 1), dfreg['A'])
regr.score(dfreg['B'].values.reshape(-1, 1), dfreg['A'])
regr.predict(dfimp['B'].values.reshape(-1, 1))
Note that in the provided data the correlation between the A and B columns is very low (less than 0.05).
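You can verify that directly on the non-missing rows:

# Series.corr ignores rows where either value is NaN
print(df['A'].corr(df['B']))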
To fill the empty cells with the imputed values:
s = df[df['A'].isna()]['A'].index
df.loc[s, 'A'] = regr.predict(dfimp['B'].values.reshape(-1, 1))
Output:
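As an alternative, scikit-learn has a dedicated multivariate imputer that models each feature with missing values as a function of the other features. A minimal sketch on the same df (the estimator is still experimental, so the enabling import is required):

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Impute missing values in A using the relationship with B
imp = IterativeImputer(random_state=1)
df[['A', 'B']] = imp.fit_transform(df[['A', 'B']])
print(df)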
I have output in the below format:
values = {'A': (node:A connections:{B:[0.9565217391304348], C:[0.5], D:[0], E:[0], F:[0], I:[0]}),
'B': (node:B connections:{F:[0.7], D:[0.631578947368421], J:[0]}),
'C': (node:C connections:{D:[0.5]})}
When I run print(type(values)), the output is pm4py.objects.heuristics_net.obj.HeuristicsNet.
I want to take each NODE and its CONNECTIONs and create two columns that hold all the connections of the individual nodes, as seen below:
import pandas as pd
df = pd.DataFrame({'Nodes': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'], 'Connection': ['B', 'C', 'D', 'E', 'F', 'I', 'F', 'D', 'J', 'D']})
It is simply a combination of each node with each of its connections. I have worked with simple dictionaries before, but I do not know how to extract the info required here.
How do I proceed?
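Assuming the HeuristicsNet exposes a nodes dict mapping activity names to Node objects, and each Node has an output_connections dict keyed by the target Node (this matches recent pm4py versions, but verify the attribute names against yours), you can flatten it like this:

import pandas as pd

rows = []
# Walk every node and each of its outgoing connections
for name, node in values.nodes.items():
    for target in node.output_connections:
        rows.append({'Nodes': name, 'Connection': target.node_name})

df = pd.DataFrame(rows)
print(df)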
I have a grid variable defined at the job level (using Manage Grid Variables). I then have Python code that creates a dataframe. I append the contents of the dataframe to an array and then update my grid variable. However, when I use the grid variable in a transformation job, the contents are not updated.
This is a snippet of my code:
array = []
for obj in objs:
    s3file = s3_client.get_object(Bucket='somebucket', Key=obj.key)
    tbl = pd.read_csv(s3file['Body'])
    row_count = len(tbl.index)
    for i in range(row_count):
        # Collect columns 'a' through 'y' for this row, then append a constant
        record = [tbl.at[i, col] for col in 'abcdefghijklmnopqrstuvwxy']
        record.append('something')
        array.append(record)
context.updateGridVariable('somegridvar', array)
print(len(array))
arr2 = context.getGridVariable('somegridvar')
print(len(arr2))
print(arr2)
The print(len(arr2)) line prints the correct number of records, and print(arr2) prints the array correctly.
But when I use the grid variable in a transformation, it doesn't get the records loaded in the Python script.
Looks good so far: in the Orchestration Job, you have got the Grid Variable to contain the values you need.
Check how you are passing the Grid Variable into the Transformation Job.
The best way is to follow the Python Script component with a Run Transformation, and use the Set Grid Variables property to map somegridvar onto whatever you named the columns in the Transformation Job's own Grid Variable.
You have to do that from the same instance of the Orchestration Job run. If you run the Orchestration Job on its own, and then sometime later run the Transformation Job on its own, all variables will just have reverted to their default values.
I'm having trouble changing the type of my variable to a categorical data type.
My variable is called "Energy class" and contains the following values:
A++, A+, A, B, C, D, E, F, G.
I want to change the type to a category and order the categories in that same order.
Hence: A++ = 1, A+ = 2, A = 3, B = 4, etc.
I will also have to perform the same manipulation on another variable, "Condition of the building", which contains the following values: "Very good", "Good", "To be restored".
I tried using the pandas set_categories() method, but it didn't work, and there is very little information on how to use it in the documentation.
Anyone knows how to deal with this?
Thank you
You can use map:
energy_class = {'A++':1, 'A+':2,...}
df['Energy class'] = df['Energy class'].map(energy_class)
A bit fancier, when you have an ordered list of the classes:
energy_classes = ['A++', 'A+', ...]
df['Energy class'] = df['Energy class'].map({c: i for i, c in enumerate(energy_classes, 1)})
You can use an ordered pd.Categorical:
df['Energy class'] = pd.Categorical(
    df['Energy class'],
    categories=['A++', 'A+', 'A', 'B', 'C', 'D', 'E', 'F', 'G'],
    ordered=True)
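If you also want the numeric ranks from the question (A++ = 1, A+ = 2, ...), the ordered categorical's codes are 0-based, so add 1 ('Energy rank' here is just an illustrative column name):

# .cat.codes is 0-based (and -1 for missing values), hence the + 1
df['Energy rank'] = df['Energy class'].cat.codes + 1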