extract value from object to column python

I have output in the format below:
values = {'A': (node:A connections:{B:[0.9565217391304348], C:[0.5], D:[0], E:[0], F:[0], I:[0]}),
'B': (node:B connections:{F:[0.7], D:[0.631578947368421], J:[0]}),
'C': (node:C connections:{D:[0.5]})}
When I print type(values), the output is pm4py.objects.heuristics_net.obj.HeuristicsNet.
I want to take each NODE and its CONNECTIONS and create two columns that pair every node with each of its connections, as seen below:
import pandas as pd
df = pd.DataFrame({'Nodes':['A','A','A','A','A','A','B','B','B','C'], 'Connection':['B','C','D','E','F','I', 'F', 'D', 'J', 'D']})
It is simply a combination of each node with each of its connections. I have worked with simple dictionaries before, but I don't know how to extract the information required here.
How to proceed further?
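One hedged way to proceed (a sketch, not the definitive pm4py API): iterate over the net's nodes and their outgoing connections, collect one (node, connection) pair per edge, and build the DataFrame from that list. The attribute names nodes, output_connections, and node_name below are assumptions about the HeuristicsNet object, so check them against your pm4py version.
import pandas as pd

# Minimal sketch, assuming `values.nodes` is a dict of {activity name: Node}
# and each Node keeps its outgoing edges in an `output_connections` dict of
# {Node: [edge values]} (attribute names assumed; adjust to your version).
rows = []
for name, node in values.nodes.items():
    for target in node.output_connections:
        rows.append({'Nodes': name, 'Connection': target.node_name})

df = pd.DataFrame(rows)
print(df)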

Related

Is it possible to "explode" an array that contains multiple dictionaries using pandas or python?
I am developing code that returns these two arrays (simplified version):
data_for_dataframe = ["single nucleotide variant",
[{'assembly': 'GRCh38',
'start': '11016874',
'end': '11016874',
'ref': 'C',
'alt': 'T',
'risk_allele': 'T'},
{'assembly': 'GRCh37',
'start': '11076931',
'end': '11076931',
'ref': 'C',
'alt': 'T',
'risk_allele': 'T'}]]
columns = ["variant_type", "assemblies"]
So I created a pandas DataFrame using these two arrays, "data_for_dataframe" and "columns":
import pandas as pd
df = pd.DataFrame(data_for_dataframe, columns).transpose()
And the output was:
The type of the "variant_type" column is string and the type of the "assemblies" column is array. My question is whether it is possible, and if so, how, to "explode" the "assemblies" column and create a dataframe as shown in the following image:
Could you help me?
It's possible with a combination of apply() and explode().
# Expand each list of dicts into one row per dict, then each dict into columns
exploded = df['assemblies'].explode().apply(pd.Series)
# Re-attach variant_type by aligning on the original index
exploded['variant_type'] = df['variant_type']
Output:
assembly start end ref alt risk_allele variant_type
0 GRCh38 11016874 11016874 C T T single nucleotide variant
0 GRCh37 11076931 11076931 C T T single nucleotide variant
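A hedged variant of the same idea: pd.json_normalize is usually faster than .apply(pd.Series) for turning the exploded dicts into columns. This sketch assumes df is the two-column frame built above.
import pandas as pd

# Explode the list of dicts into one dict per row, normalize those dicts
# into columns, then carry variant_type along by index alignment.
exploded = df['assemblies'].explode()
result = pd.json_normalize(exploded.tolist()).set_index(exploded.index)
result['variant_type'] = df['variant_type']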

How to specify data type in PyExasol export_to_pandas

How can I pass data type parameters to the export_to_pandas API, and can I change the column names to lower case?
from pyexasol import ExaConnection
con = ExaConnection(dsn=dns, user=user, password=password)
con.execute('OPEN SCHEMA SCHEMATEST1')
data = con.export_to_pandas('select * from TABLETEST1')
You may specify any parameters accepted by pandas.read_csv and pass them using the callback_params argument.
For example:
import numpy

callback_params = {
    'names': ['A', 'B', 'C'],
    'header': 0,
    'dtype': {'A': numpy.float64, 'B': numpy.int32, 'C': 'Int64'}
}
data = con.export_to_pandas('select * from TABLETEST1', callback_params=callback_params)
Please note that column names are actually stored upper-cased in Exasol. You may use the connection option lower_ident=True to get lower-case identifiers for normal .execute() calls, but it does not apply to .export_to_pandas. The only options are to specify the column names manually or to rename each column afterwards.
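If renaming afterwards is acceptable, a minimal sketch (assuming data is the DataFrame returned by export_to_pandas above) is to lower-case the labels on the pandas side:
# Lower-case all column labels after the export; this only changes the
# local DataFrame, not anything stored in Exasol.
data.columns = data.columns.str.lower()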

Unable to train or test data

I have this code, as shown in the notebook at "https://github.com/venky14/Machine-Learning-with-Iris-Dataset/blob/master/Machine%20Learning%20with%20Iris%20Dataset.ipynb"
After splitting the data into training and testing sets, I am unable to take the features for the training and testing data. The error is thrown at In[92].
It gives me the error
"KeyError: "['A' 'B' 'C' 'D' 'E' 'F' 'H' 'I'] not in index""
Below is an image of what my CSV file looks like
It seems that you are using column names as indexes.
Please provide sample code, because the referenced notebook seems to be correct.
You are probably looking for this:
import pandas as pd
df = pd.read_csv('sample-table.csv')
# Select the feature columns by name (double brackets return a DataFrame)
df_selected_columns = df[['A', 'B', 'C', 'D', 'E', 'F', 'H', 'I']]
np_ndarray = df_selected_columns.values
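If the KeyError persists, a quick check (a sketch, not part of the original answer) is to print the column labels pandas actually read, since this error usually means the requested labels are missing from, or differ slightly from, the CSV header, for example by stray whitespace:
# Show the column labels pandas parsed from the CSV header
print(df.columns.tolist())
# Strip accidental whitespace around header names before selecting columns
df.columns = df.columns.str.strip()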

What is the most efficient way to create a DataFrame from a JSON file in Python?

I have a JSON file that I want to convert into a DataFrame object in Python. I found a way to do the conversion, but unfortunately it takes ages, so I'm asking if there is a more efficient and elegant way to do it.
I use json library to open the JSON file as a dictionary which works fine:
import json
with open('path/file.json') as d:
    file = json.load(d)
Here's some mock data that mimics the structure of the real data set:
dict1 = {'first_level':[{'A': 'abc',
'B': 123,
'C': [{'D' :[{'E': 'zyx'}]}]},
{'A': 'bcd',
'B': 234,
'C': [{'D' :[{'E': 'yxw'}]}]},
{'A': 'cde',
'B': 345},
{'A': 'def',
'B': 456,
'C': [{'D' :[{'E': 'xwv'}]}]}]}
Then I create an empty DataFrame and append the data that I'm interested in to it with a for loop:
import pandas as pd

df = pd.DataFrame(columns=['A', 'B', 'C'])
for i in range(len(dict1['first_level'])):
    try:
        data = {'A': dict1['first_level'][i]['A'],
                'B': dict1['first_level'][i]['B'],
                'C': dict1['first_level'][i]['C'][0]['D'][0]['E']}
        df = df.append(data, ignore_index=True)
    except KeyError:
        data = {'A': dict1['first_level'][i]['A'],
                'B': dict1['first_level'][i]['B']}
        df = df.append(data, ignore_index=True)
Is there a way to get the data straight from the JSON more efficiently or can I write the for loop more elegantly?
(Running through the dataset (~150k elements) takes over an hour. I'm using Python 3.6.3, 64-bit.)
You could use pandas.read_json: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
Or use Spark and PySpark to convert to a dataframe pretty easily and manage your data that way, but that might be more than you need.
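Separately, a hedged rewrite of the loop itself (a sketch using the mock data above): collecting the rows in a plain list and building the DataFrame once avoids the repeated df.append calls, which copy the whole frame on every iteration and are usually the slow part.
import pandas as pd

# Build all rows first, then create the DataFrame in a single call.
rows = []
for item in dict1['first_level']:
    rows.append({
        'A': item['A'],
        'B': item['B'],
        # 'C' is optional in the source data, so fall back to None
        'C': item['C'][0]['D'][0]['E'] if 'C' in item else None,
    })
df = pd.DataFrame(rows, columns=['A', 'B', 'C'])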

Join Recarrays by attributes in Python

I am trying to join recarrays in Python such that the same value joins to many elements. The following code works when it is a 1:1 ratio, but when I try many:1, it only joins one instance:
import numpy as np
import matplotlib.mlab
# First data structure
sex = np.array(['M', 'F', 'M', 'F', 'M', 'F'])
causes = np.array(['c1', 'c1', 'c2', 'c2', 'c3', 'c3'])
data1 = np.core.records.fromarrays([sex, causes], names='sex, causes')
# Second data structure
causes2 = np.array(['c1', 'c2', 'c3'])
analyst = np.array(['p1', 'p2', 'p3'])
data2 = np.core.records.fromarrays([causes2, analyst], names='causes, analyst')
# Join on Cause
all_data = matplotlib.mlab.rec_join(('causes'), data1, data2, jointype='leftouter')
What I would like the all_data recarray to contain is all of the data from data1 with the corresponding analyst indicated in data2.
There might be a good way to do this with record arrays, but I think a Python dict works just as well here... I'd like to know the numpy way of doing this myself, too, if there is a good one.
from matplotlib import mlab

# Map each cause to its analyst, then append the looked-up analyst for
# every row of data1 as a new field.
dct = dict(zip(data2['causes'], data2['analyst']))
all_data = mlab.rec_append_fields(data1, 'analyst',
                                  [dct[x] for x in data1['causes']])
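As an aside, recent matplotlib releases have removed the mlab.rec_* helpers, so if that import fails, a hedged pandas alternative (a sketch, not the original answer) is a left merge, which gives the same many-to-one join:
import pandas as pd

# Join the two record arrays on 'causes' and convert back to a recarray.
df1 = pd.DataFrame.from_records(data1)
df2 = pd.DataFrame.from_records(data2)
all_data = df1.merge(df2, on='causes', how='left').to_records(index=False)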
