Store DenseVector in DataFrame column in PySpark

I am trying to store a DenseVector into a DataFrame in a new column.
I tried the following code, but got an AttributeError saying 'numpy.ndarray' object has no attribute '_get_object_id'.
from pyspark.sql import functions
from pyspark.mllib.linalg import Vectors

df = spark.createDataFrame([{'name': 'Alice', 'age': 1},
                            {'name': 'Bob', 'age': 2}])
vec = Vectors.dense([1.0, 3.0, 2.9])
df.withColumn('vector', functions.lit(vec))
I'm hoping to store a vector per row for computation purpose. Any help is appreciated.
[Python 3.7.3, Spark version 2.4.3, via Jupyter All-Spark-Notebook]
EDIT
I tried to follow the answer here as suggested by Florian, but I could not adapt the udf to take in a custom pre-constructed vector.
conv = functions.udf(lambda x: DenseVector(x), VectorUDT())
# Same with
# conv = functions.udf(lambda x: x, VectorUDT())
df.withColumn('vector', conv(vec)).show()
I get this error:
TypeError: Invalid argument, not a string or column: [1.0,3.0,2.9] of type <class 'pyspark.mllib.linalg.DenseVector'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

You could wrap the creation of the udf inside a function, so it returns the udf with your vector. An example is given below, hope this helps!
import pyspark.sql.functions as F
from pyspark.ml.linalg import VectorUDT, DenseVector

df = spark.createDataFrame([{'name': 'Alice', 'age': 1},
                            {'name': 'Bob', 'age': 2}])

def vector_column(x):
    return F.udf(lambda: x, VectorUDT())()

vec = DenseVector([1.0, 3.0, 2.9])
df.withColumn("vector", vector_column(vec)).show()
Output:
+---+-----+-------------+
|age| name| vector|
+---+-----+-------------+
| 1|Alice|[1.0,3.0,2.9]|
| 2| Bob|[1.0,3.0,2.9]|
+---+-----+-------------+
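A variant sketch (my own adaptation, not part of the original answer) avoids closing over the Python object by building an array of literal columns and converting it to a vector inside the udf:

import pyspark.sql.functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

# convert an array column of doubles into a DenseVector
to_vec = F.udf(lambda xs: Vectors.dense(xs), VectorUDT())
values = [1.0, 3.0, 2.9]
df.withColumn('vector', to_vec(F.array(*[F.lit(v) for v in values]))).show()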

Related

Infer multivalent features with tfdv from pandas dataframe

I want to infer a schema with tensorflow data validation (tfdv) based on a pandas dataframe of the training data. The dataframe contains a column with a multivalent feature, where multiple values (or None) of the feature can be present at the same time.
Given the following dataframe:
df = pd.DataFrame([{'feat_1': 13, 'feat_2': 'AA, BB', 'feat_3': 'X'},
                   {'feat_1': 7, 'feat_2': 'AA', 'feat_3': 'Y'},
                   {'feat_1': 7, 'feat_2': None, 'feat_3': None}])
Inferring and displaying the schema shows that tfdv treats the 'feat_2' values as a single string instead of splitting them at the ',' to produce a domain of 'AA', 'BB'.
If I save the values of the feature as a list instead, e.g. ['AA', 'BB'], the schema inference throws an error:
ArrowTypeError: ("Expected bytes, got a 'list' object", 'Conversion failed for column feat_2 with type object')
Is there any way to achieve this with tfdv?
A string will be interpreted as a single string. Regarding your issue with the list, it is likely related to this limitation: currently only pandas columns of primitive types are supported. I could not find anything more recent. Here is a workaround:
import pandas as pd
import tensorflow_data_validation as tfdv
df = pd.DataFrame([{'feat_1': 13, 'feat_2': 'AA, BB', 'feat_3': 'X'},
                   {'feat_1': 7, 'feat_2': 'AA', 'feat_3': 'Y'},
                   {'feat_1': 7, 'feat_2': None, 'feat_3': None}])
df['feat_2'] = df['feat_2'].str.split(',')
df = df.explode('feat_2').reset_index(drop=True)
train_stats = tfdv.generate_statistics_from_dataframe(df)
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)
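One caveat, as a side note of my own: str.split(',') keeps the leading space in ' BB', so the inferred domain would contain ' BB' rather than 'BB'. Stripping after the explode fixes that, and tfdv.get_domain (assuming your tfdv version exposes it) lets you verify the result:

# place this right after the explode, before generating statistics
df['feat_2'] = df['feat_2'].str.strip()

# verify the inferred string domain for feat_2
print(tfdv.get_domain(schema, 'feat_2'))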

Row wise operation in Pandas DataFrame

I have a DataFrame:
import pandas as pd

df = pd.DataFrame({
    "First": ['First1', 'First2', 'First3'],
    "Secnd": ['Secnd1', 'Secnd2', 'Secnd3']
})
df.index = ['Row1', 'Row2', 'Row3']
I would like to use a lambda function in the apply method to create a list of dictionaries (including the index) as below:
[
    {'Row1': ['First1', 'Secnd1']},
    {'Row2': ['First2', 'Secnd2']},
    {'Row3': ['First3', 'Secnd3']},
]
If I use something like .apply(lambda x: <some operation>) here, x contains only the row's values, not the index.
Cheers,
DD
To expand Hans Bambel's answer to get the exact desired output:
[{k: list(v.values())} for k, v in df.to_dict('index').items()]
You don't need apply here. You can just use the to_dict() function with the "index" argument:
df.to_dict("index")
This gives the output:
{'Row1': {'First': 'First1', 'Secnd': 'Secnd1'},
'Row2': {'First': 'First2', 'Secnd': 'Secnd2'},
'Row3': {'First': 'First3', 'Secnd': 'Secnd3'}}
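If you do want the apply route from the question, note that with axis=1 each row arrives as a Series whose .name attribute is the index label. A minimal sketch:

# each row is a Series; row.name is the index label, row.tolist() the values
records = df.apply(lambda row: {row.name: row.tolist()}, axis=1).tolist()
# [{'Row1': ['First1', 'Secnd1']}, {'Row2': ['First2', 'Secnd2']}, {'Row3': ['First3', 'Secnd3']}]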

MemoryError in Python when saving list to dataframe

I am trying to save a Python list that contains JSON strings into a DataFrame in a Jupyter notebook:
df = pd.io.json.json_normalize(mon_list)
df[['gfmsStr','_id']]
But then I received this error:
MemoryError
Then if I run other cells, they all start to show the memory error. I am wondering what caused this and if there is any way I can increase the available memory to avoid the error.
Thanks!
Update:
The contents of mon_list look like the following:
mon_list[1]
[{'id': 1, 'name': {'first': 'Coleen', 'last': 'Volk'}},
 {'name': {'given': 'Mose', 'family': 'Regner'}},
 {'id': 2, 'name': 'Faye Raker'}]
Do you really have a list? Or do you have a JSON file? What format is the "mon_list" variable?
This is how you convert a list to a DataFrame:
import pandas as pd

# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
       'portal', 'for', 'Geeks']

# calling the DataFrame constructor on the list
df = pd.DataFrame(lst)
https://www.geeksforgeeks.org/create-a-pandas-dataframe-from-lists/
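If the list itself fits in memory but normalizing it in one go does not, a chunked sketch of my own (using pd.io.json.json_normalize as in the question; newer pandas spells it pd.json_normalize; the chunk size is an arbitrary choice) can lower the peak memory:

import pandas as pd

CHUNK = 10000  # arbitrary, tune for your machine
chunks = (pd.io.json.json_normalize(mon_list[i:i + CHUNK])
          for i in range(0, len(mon_list), CHUNK))
df = pd.concat(chunks, ignore_index=True)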

List of dict to Pandas DF - 'NoneType' object has no attribute 'keys'

I want to convert a list of dicts into a pandas DataFrame. One row looks like this: {'id': 5102, 'lat': 41.9258689, 'lng': -91.4231934}
When I check the values with type(), I get int, float, float.
temp_df = pd.DataFrame(geocode_list)
Then I get this error: AttributeError: 'NoneType' object has no attribute 'keys'
I don't know what causes this issue.
I reproduced your likely problem; check out the following fix:
import numpy as np
import pandas as pd
random = np.random.uniform(size=(100, 2))
data = [{'id': i, 'lon': x[0], 'lat': x[1]} for i, x in enumerate(random)]
# added invalid entry
data[20] = None
# filter out invalid entry
data = [i for i in data if i is not None]
# should work now
df = pd.DataFrame(data)
If your list does not contain any None entries, then this will work:
pd.DataFrame(geocode_list, columns=['id', 'lat', 'lng'])

Convert pyspark.sql.dataframe.DataFrame type Dataframe to Dictionary

I have a PySpark DataFrame and I need to convert it into a Python dictionary.
The code below is reproducible:
from pyspark.sql import Row
rdd = sc.parallelize([Row(name='Alice', age=5, height=80),
                      Row(name='Alice', age=5, height=80),
                      Row(name='Alice', age=10, height=80)])
df = rdd.toDF()
Once I have this dataframe, I need to convert it into dictionary.
I tried like this
df.set_index('name').to_dict()
But it gives an error. How can I achieve this?
Please see the example below:
>>> from pyspark.sql.functions import col
>>> df = (sc.textFile('data.txt')
...       .map(lambda line: line.split(","))
...       .toDF(['name', 'age', 'height'])
...       .select(col('name'), col('age').cast('int'), col('height').cast('int')))
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice| 5| 80|
| Bob| 5| 80|
|Alice| 10| 80|
+-----+---+------+
>>> list_persons = map(lambda row: row.asDict(), df.collect())
>>> list_persons
[
{'age': 5, 'name': u'Alice', 'height': 80},
{'age': 5, 'name': u'Bob', 'height': 80},
{'age': 10, 'name': u'Alice', 'height': 80}
]
>>> dict_persons = {person['name']: person for person in list_persons}
>>> dict_persons
{u'Bob': {'age': 5, 'name': u'Bob', 'height': 80}, u'Alice': {'age': 10, 'name': u'Alice', 'height': 80}}
The input that I'm using to test data.txt:
Alice,5,80
Bob,5,80
Alice,10,80
First we load the data using PySpark by reading the lines. Then we convert the lines to columns by splitting on the comma. Then we convert the native RDD to a DataFrame and add names to the columns. Finally we cast the columns to the appropriate types.
Then we collect everything to the driver, and using a Python list comprehension we convert the data to the preferred form. We convert each Row object to a dictionary using the asDict() method. In the output we can observe that Alice appears only once, but this is of course because the key 'Alice' gets overwritten.
Please keep in mind that you want to do all the processing and filtering inside pyspark before returning the result to the driver.
Hope this helps, cheers.
You need to first convert to a pandas.DataFrame using toPandas(), then you can use the to_dict() method on the transposed dataframe with orient='list':
df.toPandas().set_index('name').T.to_dict('list')
# Out[1]: {u'Alice': [10, 80]}
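Note that with this approach duplicate keys collapse: in the sample data every row is named 'Alice', so only the last one survives, which is why the output above holds a single entry. If you need every row, a records-oriented variant (my own sketch) keeps them all:

# one dict per row, duplicates preserved
df.toPandas().to_dict('records')
# e.g. [{'name': 'Alice', 'age': 5, 'height': 80}, ...]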
Each Row in an RDD has a built-in asDict() method that represents the row as a dict.
If you have a dataframe df, you need to convert it to an RDD and apply asDict() to each row.
new_rdd = df.rdd.map(lambda row: row.asDict(True))
One can then use the new_rdd to perform normal python map operations like:
# You can define normal Python functions like the one below and plug them in when needed
def transform(row):
    # add a new key to each row
    row["new_key"] = "my_new_value"
    return row

new_rdd = new_rdd.map(lambda row: transform(row))
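To inspect the result, collect the transformed RDD back to the driver (only advisable for small data):

# materialize the transformed rows on the driver
print(new_rdd.collect())
# e.g. [{'name': 'Alice', 'age': 5, 'height': 80, 'new_key': 'my_new_value'}, ...]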
One easy way is to collect the Row objects and iterate over them with a dictionary comprehension. I will try to demonstrate something similar here:
Let's assume a movie DataFrame:
movie_df
+-------+----------+
|movieId|avg_rating|
+-------+----------+
|      1|      3.92|
|     10|       3.5|
|    100|      2.79|
| 100044|       4.0|
| 100068|       3.5|
| 100083|       3.5|
| 100106|       3.5|
| 100159|       4.5|
| 100163|       2.9|
| 100194|       4.5|
+-------+----------+
We can use dictionary comprehension and iterate over the row RDDs like below:
movie_dict = {int(row.asDict()['movieId']): row.asDict()['avg_rating']
              for row in movie_avg_rating.collect()}
print(movie_dict)
{1: 3.92,
 10: 3.5,
 100: 2.79,
 100044: 4.0,
 100068: 3.5,
 100083: 3.5,
 100106: 3.5,
 100159: 4.5,
 100163: 2.9,
 100194: 4.5}
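A shorter variant of the same idea, as a sketch assuming the DataFrame holds exactly the key and value columns, uses the RDD's collectAsMap():

# build the same dict directly from (key, value) pairs
movie_dict = (movie_avg_rating.rdd
              .map(lambda row: (int(row['movieId']), row['avg_rating']))
              .collectAsMap())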
