Infer multivalent features with tfdv from pandas dataframe - python

I want to infer a schema with tensorflow data validation (tfdv) based on a pandas dataframe of the training data. The dataframe contains a column with a multivalent feature, where multiple values (or None) of the feature can be present at the same time.
Given the following dataframe:
df = pd.DataFrame([{'feat_1': 13, 'feat_2': 'AA, BB', 'feat_3': 'X'},
                   {'feat_1': 7, 'feat_2': 'AA', 'feat_3': 'Y'},
                   {'feat_1': 7, 'feat_2': None, 'feat_3': None}])
inferring and displaying the schema results in:
Thus, tfdv treats the 'feat_2' values as a single string instead of splitting them at the ',' to produce a domain of 'AA', 'BB':
If I save the values of the feature as a list, e.g. ['AA', 'BB'], the schema inference throws an error:
ArrowTypeError: ("Expected bytes, got a 'list' object", 'Conversion failed for column feat_2 with type object')
Is there any way to achieve this with tfdv?

A plain string will be interpreted as a single value; tfdv does not split it on ','. Regarding your issue with the list, it might be related to this issue:
Currently only pandas columns of primitive types are supported.
I could not find anything more recent. Here is a workaround that splits and explodes the multivalent column before generating statistics:
import pandas as pd
import tensorflow_data_validation as tfdv
df = pd.DataFrame([{'feat_1': 13, 'feat_2': 'AA, BB', 'feat_3': 'X'},
                   {'feat_1': 7, 'feat_2': 'AA', 'feat_3': 'Y'},
                   {'feat_1': 7, 'feat_2': None, 'feat_3': None}])
# Split the multivalent column into lists, explode to one value per row,
# and strip the whitespace left over from splitting on ','
df['feat_2'] = df['feat_2'].str.split(',')
df = df.explode('feat_2').reset_index(drop=True)
df['feat_2'] = df['feat_2'].str.strip()
train_stats = tfdv.generate_statistics_from_dataframe(df)
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)
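If you have several multivalent columns, the same preprocessing can be factored into a small helper. This is just a sketch (the function name, the separator default, and the whitespace stripping are my own additions, not part of tfdv):
import pandas as pd
def explode_multivalent(df, column, sep=','):
    # Split a delimited string column and explode it to one value per row
    out = df.copy()
    out[column] = out[column].str.split(sep)
    out = out.explode(column)
    out[column] = out[column].str.strip()
    return out.reset_index(drop=True)
# df = explode_multivalent(df, 'feat_2')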

Related

Is it possible to "explode" an array that contains multiple dictionaries using pandas or python?

Is it possible to "explode" an array that contains multiple dictionaries using pandas or python?
I am developing code that returns these two arrays (simplified version):
data_for_dataframe = ["single nucleotide variant",
                      [{'assembly': 'GRCh38',
                        'start': '11016874',
                        'end': '11016874',
                        'ref': 'C',
                        'alt': 'T',
                        'risk_allele': 'T'},
                       {'assembly': 'GRCh37',
                        'start': '11076931',
                        'end': '11076931',
                        'ref': 'C',
                        'alt': 'T',
                        'risk_allele': 'T'}]]
columns = ["variant_type", "assemblies"]
So I created a pandas dataframe using these two arrays, "data_for_dataframe" and "columns":
import pandas as pd
df = pd.DataFrame(data_for_dataframe, columns).transpose()
And the output was:
The type of the "variant_type" column is string and the type of the "assemblies" column is array. My question is whether it is possible, and if so, how, to "explode" the "assemblies" column and create a dataframe as shown in the following image:
Could you help me?
It's possible with a combination of apply() and explode().
exploded = df['assemblies'].explode().apply(pd.Series)
exploded['variant_type'] = df['variant_type']
Output:
  assembly     start       end ref alt risk_allele               variant_type
0   GRCh38  11016874  11016874   C   T           T  single nucleotide variant
0   GRCh37  11076931  11076931   C   T           T  single nucleotide variant
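Another option, assuming df is built exactly as above and you are on a pandas version that has pd.json_normalize, is to explode the list column on the frame itself and then expand the dicts:
exploded = df.explode('assemblies').reset_index(drop=True)
expanded = pd.json_normalize(exploded['assemblies'].tolist())
result = expanded.join(exploded['variant_type'])
This avoids apply(pd.Series), which can be slow on larger frames.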

Convert a multi-valued dict into a pandas dataframe

I want to convert this dict into a pandas dataframe where each key becomes a column and values in the list become the rows:
my_dict:
{'Last updated': ['2021-05-18T15:24:19.000Z', '2021-05-18T15:24:19.000Z'],
 'Symbol': ['BTC', 'BNB', 'XRP', 'ADA', 'BUSD'],
 'Name': ['Bitcoin', 'Binance Coin', 'XRP', 'Cardano', 'Binance USD'],
 'Rank': [1, 3, 7, 4, 25],
}
The lists in my_dict can also have some missing values, which should appear as NaNs in dataframe.
This is how I'm currently trying to append it into my dataframe:
df = pd.DataFrame(columns=['Last updated',
                           'Symbol',
                           'Name',
                           'Rank'])
df = df.append(my_dict, ignore_index=True)
#print(df)
df.to_excel(r'\walletframe.xlsx', index = False, header = True)
But my output only has a single row containing all the values.
The answer was pretty simple: instead of using
df = df.append(my_dict)
I used
df = pd.DataFrame.from_dict(my_dict, orient='index').T
which builds the frame with the keys as rows (padding the shorter lists with NaN) and then transposes it, so the columns line up even though some lists have missing values.
Credits to #Ank who helped me find the solution!
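Another way to get the same shape, assuming the uneven lists should simply be padded with NaN, is to wrap each list in a Series so pandas aligns the lengths for you:
import pandas as pd
df = pd.DataFrame({key: pd.Series(values) for key, values in my_dict.items()})
df.to_excel(r'\walletframe.xlsx', index=False, header=True)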

Store DenseVector in DataFrame column in PySpark

I am trying to store a DenseVector into a DataFrame in a new column.
I tried the following code, but got an AttributeError saying 'numpy.ndarray' object has no attribute '_get_object_id'.
from pyspark.sql import functions
from pyspark.mllib.linalg import Vectors
df = spark.createDataFrame([{'name': 'Alice', 'age': 1},
                            {'name': 'Bob', 'age': 2}])
vec = Vectors.dense([1.0, 3.0, 2.9])
df.withColumn('vector', functions.lit(vec))
I'm hoping to store a vector per row for computation purposes. Any help is appreciated.
[Python 3.7.3, Spark version 2.4.3, via Jupyter All-Spark-Notebook]
EDIT
I tried to follow the answer here as suggested by Florian, but I could not adapt the udf to take in a custom pre-constructed vector.
conv = functions.udf(lambda x: DenseVector(x), VectorUDT())
# Same with
# conv = functions.udf(lambda x: x, VectorUDT())
df.withColumn('vector', conv(vec)).show()
I get this error:
TypeError: Invalid argument, not a string or column: [1.0,3.0,2.9] of type <class 'pyspark.mllib.linalg.DenseVector'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
You could wrap the creation of the udf inside a function, so it returns the udf with your vector. An example is given below, hope this helps!
import pyspark.sql.functions as F
from pyspark.ml.linalg import VectorUDT, DenseVector
df = spark.createDataFrame([{'name': 'Alice', 'age': 1},
                            {'name': 'Bob', 'age': 2}])
def vector_column(x):
    return F.udf(lambda: x, VectorUDT())()
vec = DenseVector([1.0, 3.0, 2.9])
df.withColumn("vector", vector_column(vec)).show()
Output:
+---+-----+-------------+
|age| name| vector|
+---+-----+-------------+
| 1|Alice|[1.0,3.0,2.9]|
| 2| Bob|[1.0,3.0,2.9]|
+---+-----+-------------+
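If you would rather avoid a udf altogether, one alternative (a sketch, not taken from the original answer) is to put the pre-constructed vector into a one-row DataFrame and cross-join it onto the existing one; crossJoin is available in Spark 2.1+:
from pyspark.ml.linalg import DenseVector
vec_df = spark.createDataFrame([(DenseVector([1.0, 3.0, 2.9]),)], ['vector'])
result = df.crossJoin(vec_df)
result.show()
Since vec_df has a single row, every row of df gets the same vector, matching the output above.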

Dask categorize() won't work after using .loc

I'm having a serious issue using dask (dask version: 1.00, pandas version: 0.23.3). I am trying to load a dask dataframe from a CSV file, filter the results into two separate dataframes, and perform operations on both.
However, after I split the dataframes and try to set the category columns as 'known', they remain 'unknown'. Thus I cannot continue with my operations (which require the category columns to be 'known').
NOTE: I have created a minimal example as suggested, using pandas instead of read_csv().
import pandas as pd
import dask.dataframe as dd
# Specify dtypes
b_dtypes = {
    'symbol': 'category',
    'price': 'float64',
}
i_dtypes = {
    'symbol': 'category',
    'price': 'object'
}
# Specify a function to quickly set dtypes
def to_dtypes(df, dtypes):
    for column, dtype in dtypes.items():
        if column in df.columns:
            df[column] = df.loc[:, column].astype(dtype)
    return df
# Set up our test data
data = [
    ['B', 'IBN', '9.9800'],
    ['B', 'PAY', '21.5000'],
    ['I', 'PAY', 'seventeen'],
    ['I', 'SPY', 'ten']
]
# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')
# Convert into dask
df = dd.from_pandas(pdf, npartitions=3)
#
## At this point 'df' simulates what I get when I read the mixed-type CSV file via dask
#
# Split the dataframe by the 'type' column
b_df = df.loc[df['type'] == 'B', :]
i_df = df.loc[df['type'] == 'I', :]
# Convert columns into our intended dtypes
b_df = to_dtypes(b_df, b_dtypes)
i_df = to_dtypes(i_df, i_dtypes)
# Let's convert our 'symbol' column to known categories
b_df = b_df.categorize(columns=['symbol'])
i_df['symbol'] = i_df['symbol'].cat.as_known()
# Is our symbol column known now?
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)
#
## print() returns 'False' for both, which is driving me up the wall.
## (Please help...)
#
UPDATE: So it seems that if I set the 'npartitions' parameter to 1, then print() returns True in both cases. So this appears to be an issue with the partitions containing different categories. However, loading both dataframes into only two partitions is not feasible, so is there a way I can tell dask to do some sort of re-sorting to make the categories consistent across partitions?
The answer to your problem is basically contained in the dask docs. I'm referring to the part of the example commented with '# categorize requires computation, and results in known categoricals'. I'll expand on it here, because it seems to me you're misusing loc.
import pandas as pd
import dask.dataframe as dd
# Set up our test data
data = [['B', 'IBN', '9.9800'],
        ['B', 'PAY', '21.5000'],
        ['I', 'PAY', 'seventeen'],
        ['I', 'SPY', 'ten']]
# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')
# Convert into dask
ddf = dd.from_pandas(pdf, npartitions=3)
# Split the dataframe by the 'type' column
# reset_index is not necessary
b_df = ddf[ddf["type"] == "B"].reset_index(drop=True)
i_df = ddf[ddf["type"] == "I"].reset_index(drop=True)
# Convert columns into our intended dtypes
b_df = b_df.categorize(columns=['symbol'])
b_df["price"] = b_df["price"].astype('float64')
i_df = i_df.categorize(columns=['symbol'])
# Is our symbol column known now? YES
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)
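As a quick sanity check (a sketch assuming the code above has run), once the categories are known you can inspect the labels and use operations that require known categoricals, such as dask's get_dummies:
print(b_df['symbol'].cat.categories)
print(dd.get_dummies(b_df, columns=['symbol']).compute())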

What is the most efficient way to create a DataFrame from a JSON file in Python?

I have a JSON file that I want to convert into a DataFrame object in Python. I found a way to do the conversion but unfortunately it takes ages, and thus I'm asking if there are more efficient and elegant ways to do the conversion.
I use the json library to open the JSON file as a dictionary, which works fine:
import json
with open('path/file.json') as d:
    file = json.load(d)
Here's some mock data that mimics the structure of the real data set:
dict1 = {'first_level': [{'A': 'abc',
                          'B': 123,
                          'C': [{'D': [{'E': 'zyx'}]}]},
                         {'A': 'bcd',
                          'B': 234,
                          'C': [{'D': [{'E': 'yxw'}]}]},
                         {'A': 'cde',
                          'B': 345},
                         {'A': 'def',
                          'B': 456,
                          'C': [{'D': [{'E': 'xwv'}]}]}]}
Then I create an empty DataFrame and append the data that I'm interested in to it with a for loop:
df = pd.DataFrame(columns = ['A', 'B', 'C'])
for i in range(len(dict1['first_level'])):
    try:
        data = {'A': dict1['first_level'][i]['A'],
                'B': dict1['first_level'][i]['B'],
                'C': dict1['first_level'][i]['C'][0]['D'][0]['E']}
        df = df.append(data, ignore_index = True)
    except KeyError:
        data = {'A': dict1['first_level'][i]['A'],
                'B': dict1['first_level'][i]['B']}
        df = df.append(data, ignore_index = True)
Is there a way to get the data straight from the JSON more efficiently or can I write the for loop more elegantly?
(Running through the dataset (~150k elements) takes over an hour. I'm using Python 3.6.3, 64-bit.)
You could use pandas.read_json: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
Or use Spark & PySpark to convert to a dataframe pretty easily and manage your data that way, but that might be more than you need.
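For the structure shown in the question, most of the time is likely spent in df.append(), which copies the whole frame on every iteration. A sketch of a faster approach (using the same field names as above) is to collect plain dicts in a list and build the DataFrame once at the end:
import pandas as pd
rows = []
for item in dict1['first_level']:
    row = {'A': item.get('A'), 'B': item.get('B')}
    if 'C' in item:
        row['C'] = item['C'][0]['D'][0]['E']
    rows.append(row)
df = pd.DataFrame(rows, columns=['A', 'B', 'C'])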
