Converting rdd of numpy arrays to pyspark dataframe using Vectors [duplicate] - python

I have a DenseVector RDD like this
>>> frequencyDenseVectors.collect()
[DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0])]
I want to convert this into a DataFrame. I tried this:
>>> spark.createDataFrame(frequencyDenseVectors, ['rawfeatures']).collect()
It gives an error like this
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 520, in createDataFrame
rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 360, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 340, in _inferSchema
schema = _infer_schema(first)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/types.py", line 991, in _infer_schema
fields = [StructField(k, _infer_type(v), True) for k, v in items]
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/types.py", line 968, in _infer_type
raise TypeError("not supported type: %s" % type(obj))
TypeError: not supported type: <type 'numpy.ndarray'>
Old solution:
frequencyVectors.map(lambda vector: DenseVector(vector.toArray()))
Edit 1 - Reproducible code
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.sql.functions import split
from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.mllib.linalg import SparseVector, DenseVector
sqlContext = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)
sc.setLogLevel('ERROR')
sentenceData = spark.createDataFrame([
(0, "Hi I heard about Spark"),
(0, "I wish Java could use case classes"),
(1, "Logistic regression models are neat")
], ["label", "sentence"])
sentenceData = sentenceData.withColumn("sentence", split("sentence", "\s+"))
sentenceData.show()
vectorizer = CountVectorizer(inputCol="sentence", outputCol="rawfeatures").fit(sentenceData)
countVectors = vectorizer.transform(sentenceData).select("label", "rawfeatures")
idf = IDF(inputCol="rawfeatures", outputCol="features")
idfModel = idf.fit(countVectors)
tfidf = idfModel.transform(countVectors).select("label", "features")
frequencyDenseVectors = tfidf.rdd.map(lambda vector: [vector[0],DenseVector(vector[1].toArray())])
frequencyDenseVectors.map(lambda x: (x, )).toDF(["rawfeatures"])

You cannot convert an RDD[Vector] directly. It should be mapped to an RDD of objects which can be interpreted as structs, for example RDD[Tuple[Vector]]:
frequencyDenseVectors.map(lambda x: (x, )).toDF(["rawfeatures"])
Otherwise Spark will try to convert the object's __dict__ and end up using the unsupported NumPy array as a field.
from pyspark.ml.linalg import DenseVector
from pyspark.sql.types import _infer_schema
v = DenseVector([1, 2, 3])
_infer_schema(v)
TypeError Traceback (most recent call last)
...
TypeError: not supported type: <class 'numpy.ndarray'>
vs.
_infer_schema((v, ))
StructType(List(StructField(_1,VectorUDT,true)))
Notes:
In Spark 2.0 you have to use the correct local types:
pyspark.ml.linalg when working with the DataFrame based pyspark.ml API.
pyspark.mllib.linalg when working with the RDD based pyspark.mllib API.
These two namespaces are no longer compatible and require explicit conversions (see for example How to convert from org.apache.spark.mllib.linalg.VectorUDT to ml.linalg.VectorUDT).
The code provided in the edit is not equivalent to the one from the original question. You should be aware that tuple and list don't have the same semantics. If you map a vector to a pair, use a tuple and convert directly to a DataFrame:
tfidf.rdd.map(
lambda row: (row[0], DenseVector(row[1].toArray()))
).toDF()
Using a tuple (product type) would work for a nested structure as well, but I doubt this is what you want:
(tfidf.rdd
.map(lambda row: (row[0], DenseVector(row[1].toArray())))
.map(lambda x: (x, ))
.toDF())
A list at any place other than the top level row is interpreted as an ArrayType.
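For instance (a quick sketch reusing the _infer_schema helper from above; the exact repr may differ slightly between Spark versions):
from pyspark.sql.types import _infer_schema

# a nested tuple is inferred as a nested struct field...
_infer_schema(((1.0, 2.0), ))   # -> struct<_1:struct<_1:double,_2:double>>

# ...while a nested list is inferred as an array column
_infer_schema(([1.0, 2.0], ))   # -> struct<_1:array<double>>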
It is much cleaner to use a UDF for conversion (Spark Python: Standard scaler error "Do not support ... SparseVector").

I believe the problem here is that createDataFrame does not take a DenseVector as an argument. Please try to convert the DenseVector into a corresponding collection (i.e. an array or a list). In Scala and Java the
toArray()
method is available; you can convert the DenseVector into an array or a list and then try to create the DataFrame.
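In PySpark, DenseVector also exposes a toArray() method (returning a NumPy array). A minimal sketch of that approach, assuming the original frequencyDenseVectors RDD of DenseVectors from the top of the question:
# Sketch: turn each DenseVector into a plain Python list of floats, wrapped in a
# one-element tuple so that Spark infers a single array<double> column.
rows = frequencyDenseVectors.map(lambda v: (v.toArray().tolist(), ))
df = spark.createDataFrame(rows, ["rawfeatures"])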

Related

Calculating Expected Value With Matrix Values

I have the following input data
class_p = [0.0234375, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1748046875, 0.0439453125, 0.0, 0.35302734375, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3828125]
league_p = [0.4765625, 0.0, 0.00634765625, 0.4658203125, 0.0, 0.0, 0.046875, 0.0, 0.0, 0.0029296875, 0.0, 0.0, 0.0, 0.0, 0.0]
a2_p = [0.1171875, 0.0, 0.0, 0.1171875, 0.0, 0.0078125, 0.30322265625, 0.31103515625, 0.0, 0.0, 0.0, 0.1435546875, 0.0, 0.0, 0.0]
p1_p = [0.0, 0.03125, 0.375, 0.09375, 0.0234375, 0.0, 0.46875, 0.0078125, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
p2_p = [0.3984375, 0.0, 0.0, 0.3828125, 0.08935546875, 0.08935546875, 0.023345947265625, 0.007720947265625, 0.0, 0.0, 0.0087890625, 0.00018310546875, 0.0, 0.0, 0.0]
class_v = [55, 75, 55, 75, 500, 10000, 55, 55, 55, 75, 75, 55, 55, 500, 55, 55, 75, 75, 55, 55, 55]
league_v = [0, 0, 0, 0, 0, 0, 0, 0, 40, 40, 40, 40, 1500, 1500, 3000]
a2_v= [0, 0, 0, 0, 0, 0, 0, 0, 40, 40, 40, 40, 1500, 1500, 3000]
p1_v = [0, 0, 0, 0, 0, 0, 0, 40, 40, 40, 40, 40, 1500, 1500, 3000]
p2_v = [0, 0, 0, 0, 0, 0, 0, 0, 40, 40, 40, 40, 1500, 1500, 3000]
With that data, I am generating the odds of each combination occurring.
As an example to generate the chance of a given combination
class_p[0]
league_p[6]
a2_p[11]
p1_p[7]
p2_p[3]
I would multiply their values with each other
0.0234375 × 0.046875 × 0.1435546875 × 0.0078125 × 0.3828125
That would give me 4.716785042546689510345458984375 × 10^-7
Since the given combination had class_p[0], league_p[6], a2_p[11], p1_p[7], p2_p[3], I would take the following values in the "values" arrays.
I would sum
class_v[0] + league_v[6] + a2_v[11] + p1_v[7] + p2_v[3]
That would give me 55+0+40+40+0 = 135
To finalize the process I would do
(0.0234375*0.046875*0.1435546875*0.0078125*0.3828125)*(55+0+40+40+0) = 0.00006367659807
The full final calc is
(0.0234375 × 0.046875 × 0.1435546875 × 0.0078125 × 0.3828125) × (55 + 0 + 40 + 40 + 0)
(combination_chance) * (combination_value)
I need to do this process for all possible combinations of combination_chance.
This should give me a column of values (1xN). If I sum the values of that column, I reach the overall EV by summing the EV of the individual combinations.
Calculating combination_chance is working just fine. My issue is how to line up a given combination with its corresponding value sum (combination_value). At the moment, I have additional identifiers attached to the *_p arrays and I then do a string comparison with them to determine which combination value to use. This is very slow for billions of comparisons, so I am exploring a better approach.
I am using python 3.8 & numpy 1.24
Edit
The question has been adjusted to include much more detail
Broadcasting
Ok, so it seems that this is a simple broadcasting problem.
You want a 5D-array of probabilities, times a 5d-array of values. And, of course, you want it without any for loop.
In numpy, the classical way to have numpy do the nested loops for you (which is indeed much faster than doing them yourself; the first rule of numpy is "avoid iterating over elements at all costs, no for loops") is broadcasting.
Let's start with a 2D example (as was your first intention, and that was a good idea; the problem was that it was ambiguous, but restricting your question to 2D was not bad).
You have
class_p = np.array([0.0234375, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1748046875, 0.0439453125, 0.0, 0.35302734375, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3828125])
league_p = np.array([0.4765625, 0.0, 0.00634765625, 0.4658203125, 0.0, 0.0, 0.046875, 0.0, 0.0, 0.0029296875, 0.0, 0.0, 0.0, 0.0, 0.0])
One way (not the only one, but probably the one easier to adapt to any similar question) is to use broadcasting.
If you convert class_p into a column, that is a 21×1 2D array, and league_p into a row, that is a 1×15 2D array, then multiplying the two gives a 21×15 2D array containing all combinations.
Because
np.array([[1],[2],[3]]) * np.array([[4,5]])
is
[[4,5],
[8,10],
[12,15]]
That's how broadcasting works.
There are several ways to convert a 1D array to a row or a column of a 2D array. For example, you could use .reshape, like class_p.reshape(-1,1) and league_p.reshape(1,-1). But the fastest is to add a new axis, like class_p[:,None] and league_p[None,:]. Note that the second way doesn't really create a new array; it is just a different view of the same array. That is why it is faster.
So, our 2D probability map is
class_p[:,None]*league_p[None,:]
Likewise, to get all 21×15 combinations of sums of values, you can rely on the same broadcasting to perform the addition:
class_v[:,None]+league_v[None,:]
Broadcasting solution
So the solution, in 2D, using broadcasting, is
class_p[:,None]*league_p[None,:] * (class_v[:,None] + league_v[None,:])
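Note that the value lists from the question need to be NumPy arrays as well for this [:,None] indexing to work. A minimal 2D sketch with the data above:
import numpy as np

class_v = np.array([55, 75, 55, 75, 500, 10000, 55, 55, 55, 75,
                    75, 55, 55, 500, 55, 55, 75, 75, 55, 55, 55])
league_v = np.array([0, 0, 0, 0, 0, 0, 0, 0, 40, 40, 40, 40, 1500, 1500, 3000])

# one probability-weighted value per (class, league) combination
ev_2d = class_p[:, None] * league_p[None, :] * (class_v[:, None] + league_v[None, :])
print(ev_2d.shape)  # (21, 15)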
In 5D, with all your variables, it is still manageable (but don't add too many dimensions! The result would soon become huge, and I suspect what you are really interested in at the end is just the sum of all that), this time not in one line (not that it couldn't be done that way, but that would be a very long line...)
pr = class_p[:,None,None,None,None]*league_p[None,:,None,None,None]*a2_p[None,None,:,None,None]*p1_p[None,None,None,:,None]*p2_p[None,None,None,None,:]
vl = class_v[:,None,None,None,None]+league_v[None,:,None,None,None]+a2_v[None,None,:,None,None]+p1_v[None,None,None,:,None]+p2_v[None,None,None,None,:]
pr*vl
add.outer and multiply.outer
As you can see, in 5D it is a little bit tedious. But I wanted to show you the principle of broadcasting before introducing another (not really shorter, but a bit less tedious) way, which was already given by Reinderien. Since that answer came before you clarified the question, it did not give the right result, but the principle is the same.
In 2D
np.multiply.outer(class_p, league_p) * np.add.outer(class_v, league_v)
Unfortunately, those functions take only 2 args, so in 5D you have to chain them:
pr = np.multiply.outer(class_p, np.multiply.outer(league_p, np.multiply.outer(a2_p, np.multiply.outer(p1_p, p2_p))))
vl = np.add.outer(class_v, np.add.outer(league_v, np.add.outer(a2_v, np.add.outer(p1_v, p2_v))))
pr * vl
Expected value
Note that if the aim of all this is to compute the expected "value" (whatever that value is), that is Σ p(i,j,k,l,m)×v(i,j,k,l,m) over all possible outcomes, then doing it this way is probably not a good idea.
For your example, it is manageable. You are computing "only" about 1 million possible outcomes, that is 1 million probabilities (each taking 4 multiplications) and 1 million associated values (4 additions each), then performing 1 million multiplications between those two sets of 1 million probabilities and values, and then summing the result, which is one extra million additions. Altogether, that is only around 10 million elementary arithmetic operations. Not much for a modern computer, and the response still feels instantaneous. But it is O(Nᵏ) in both CPU and memory, N being the typical length of an array and k the number of variables.
But if you intend to add more dimensions (more variables, each with its own set of probabilities and values), this becomes needlessly explosive in both CPU time and memory (those 5D arrays of probabilities and values are stored). And even as it stands, if you intend to perform this computation more than once, the expected value can be computed much faster, using just O(Nk) operations.
I spare you the development (it is just a matter of expanding the sum Σᵢⱼₖₗₘ pᵢpⱼpₖpₗpₘ (vᵢ+vⱼ+vₖ+vₗ+vₘ) into five terms, each a product of four plain sums Σp and one weighted sum Σpv); you can compute it faster like this:
P1 = class_p.sum()
PV1 = (class_p*class_v).sum()
P2 = league_p.sum()
PV2 = (league_p*league_v).sum()
P3 = a2_p.sum()
PV3 = (a2_p*a2_v).sum()
P4 = p1_p.sum()
PV4 = (p1_p*p1_v).sum()
P5 = p2_p.sum()
PV5 = (p2_p*p2_v).sum()
expectedValue = P1*P2*P3*P4*PV5 + P1*P2*P3*PV4*P5 + P1*P2*PV3*P4*P5 + P1*PV2*P3*P4*P5 + PV1*P2*P3*P4*P5
sameAs = (pr*vl).sum()
It appears more complicated because there are more lines, but each line works along one dimension only. So it replaces on the order of n₁n₂n₃n₄n₅ operations with on the order of n₁+n₂+n₃+n₄+n₅ operations, where n₁,...,n₅ are the sizes of the arrays of each of the 5 variables.
So, again, if your objective is to compute the expected value, then building the full 5D arrays (as your question asks) is a really costly way to do it.
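As a quick sanity check of that identity in the 2D case (a sketch reusing P1, PV1, P2, PV2 from the code above, and assuming class_v and league_v have been converted to NumPy arrays as in the 2D sketch earlier):
# brute force: build the full 2D map and sum it
brute = (class_p[:, None] * league_p[None, :]
         * (class_v[:, None] + league_v[None, :])).sum()
# two-variable version of the expanded identity
fast = PV1 * P2 + P1 * PV2
assert np.isclose(brute, fast)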
This doesn't make any attempt to cache intermediate results, etc.
import numpy as np
class_percentages = (0.0, 0.0, 0.0, 0.3, 0.50)
league_percentages = (0.1, 0.0, 0.2, 0.1, 0.05)
class_values = (50, 50, 50, 75, 100)
league_values = (0, 10, 10, 25, 75)
combined = np.add.outer(class_percentages, league_percentages)*np.add.outer(class_values, league_values)
print(combined)
Output:
[[ 5. 0. 12. 7.5 6.25]
[ 5. 0. 12. 7.5 6.25]
[ 5. 0. 12. 7.5 6.25]
[30. 25.5 42.5 40. 52.5 ]
[60. 55. 77. 75. 96.25]]

H5PY - How to store many 2D arrays of different dimensions

I would like to organize my collected data (from computer simulations) into a hdf5 file using Python.
I measured positions and velocities [x,y,z,vx,vy,vz] of all atoms within a certain space region over many time steps. The number of atoms, of course, varies from time step to time step.
A minimal example could look as follows:
[
[ [x1,y1,z1,vx1,vy1,vz1], [x2,y2,z2,vx2,vy2,vz2] ],
[ [x1,y1,z1,vx1,vy1,vz1], [x2,y2,z2,vx2,vy2,vz2], [x3,y3,z3,vx3,vy3,vz3] ]
]
(2 time steps,
first time step: 2 atoms,
second time step: 3 atoms)
My idea was to create a hdf5 dataset within Python which stores all the information. At each time step it should store a 2d array of alls positions/velocities of all atoms, i.e.
dataset[0] = [ [x1,y1,z1,vx1,vy1,vz1], [x2,y2,z2,vx2,vy2,vz2] ]
dataset[1] = [ [x1,y1,z1,vx1,vy1,vz1], [x2,y2,z2,vx2,vy2,vz2], [x3,y3,z3,vx3,vy3,vz3] ].
The idea is clear, I think. However, I struggle with the definition of the correct data type of the data set with varying array length.
My code looks like this:
import numpy as np
import h5py
file = h5py.File ('file.h5','w')
columnNo = 6
rowtype = np.dtype("%sfloat32" % columnNo)
dt = h5py.special_dtype( vlen=np.dtype(rowtype) )
dataset = file.create_dataset("dset", (2,), dtype=dt)
print dataset.value
testarray = np.array([[1.,2.,3.,2.,3.,4.],[1.,2.,3.,2.,3.,4.]])
print testarray
dataset[0] = testarray
print dataset[0]
This, however, does not work. When I run the script I get the error message "AttributeError: 'float' object has no attribute 'dtype'."
It seems that my defined dtype is wrong.
Does anybody see how it should be defined correctly?
Thanks very much,
Sven
The error in your case is buried, though it is clear it occurs when trying to assign the testarray to the dataset:
Traceback (most recent call last):
File "stack41465480.py", line 26, in <module>
dataset[0] = testarray
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/build/h5py-GhwtGD/h5py-2.6.0/h5py/_objects.c:2577)
...
File "h5py/_conv.pyx", line 712, in h5py._conv.ndarray2vlen (/build/h5py-GhwtGD/h5py-2.6.0/h5py/_conv.c:6171)
AttributeError: 'float' object has no attribute 'dtype'
I'm not skilled with special_dtype and vlen, but I was able to write numpy structured arrays to h5py.
import numpy as np
import h5py
file = h5py.File ('file.h5','w')
columnNo = 6
# rowtype = np.dtype("%sfloat32" % columnNo)
rowtype = np.dtype([('f0', '<f4',(6,))])
dt = h5py.special_dtype( vlen=np.dtype(rowtype) )
print('rowtype',rowtype)
print('dt',dt)
dataset = file.create_dataset("dset", (2,), dtype=rowtype)
print('value')
print(dataset.value[0])
arr = np.ones((2,),dtype=rowtype)
print(repr(arr))
dataset[0] = arr[0]
print(dataset.value)
testarray = np.array([([1.,2.,3.,2.,3.,4.],),([2.,3.,4.,1.,2.,3.],)], dtype=rowtype)
print(repr(testarray))
dataset[1] = testarray[1]
print(dataset.value)
print(dataset.value['f0'])
producing
1316:~/mypy$ python3 stack41465480.py
rowtype [('f0', '<f4', (6,))]
dt object
value
([0.0, 0.0, 0.0, 0.0, 0.0, 0.0],)
array([([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],), ([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],)],
dtype=[('f0', '<f4', (6,))])
[([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],) ([0.0, 0.0, 0.0, 0.0, 0.0, 0.0],)]
array([([1.0, 2.0, 3.0, 2.0, 3.0, 4.0],), ([2.0, 3.0, 4.0, 1.0, 2.0, 3.0],)],
dtype=[('f0', '<f4', (6,))])
[([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],) ([2.0, 3.0, 4.0, 1.0, 2.0, 3.0],)]
[[ 1. 1. 1. 1. 1. 1.]
[ 2. 3. 4. 1. 2. 3.]]
Thanks for the quick answer. It helped a lot.
If I now simply change the data type of the data set to
dtype = dt,
I get what I would like to have.
Here, the Python code (for completeness):
import numpy as np
import h5py
file = h5py.File ('file.h5','w')
columnNo = 6
rowtype = np.dtype([('f0', '<f4',(6,))])
dt = h5py.special_dtype( vlen=np.dtype(rowtype) )
print('rowtype',rowtype)
print('dt',dt)
dataset = file.create_dataset("dset", (2,), dtype=dt)
# print('value')
# print(dataset.value[0])
arr = np.ones((3,),dtype=rowtype)
# print(repr(arr))
dataset[0] = arr
# print(dataset.value)
testarray = np.array([([1.,2.,3.,2.,3.,4.],),([2.,3.,4.,1.,2.,3.],)], dtype=rowtype)
# print(repr(testarray))
dataset[1] = testarray
print(dataset.value)
for i in range(2): print(dataset[i])
And the corresponding output reads
('rowtype', dtype([('f0', '<f4', (6,))]))
('dt', dtype('O'))
[ array([([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],),
([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],), ([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],)],
dtype=[('f0', '<f4', (6,))])
array([([1.0, 2.0, 3.0, 2.0, 3.0, 4.0],), ([2.0, 3.0, 4.0, 1.0, 2.0, 3.0],)],
dtype=[('f0', '<f4', (6,))])]
[([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],) ([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],)
([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],)]
[([1.0, 2.0, 3.0, 2.0, 3.0, 4.0],) ([2.0, 3.0, 4.0, 1.0, 2.0, 3.0],)]
Just to get it right: The problem in my original code was a bad definition of my rowtype data structure, right?
Best,
Sven

Similarity Measure/Matrix for data (recommender system)- Python

I am new to machine learning and am trying out the following problem.
The input is 2 arrays of descriptions of the same length, and the output is an array of similarity scores: the first string from the first array compared to the first string in the second array, and so on.
Each item in the array (a numpy array) is a description string. Can you write a function that finds out how similar two strings are by calculating how many identical and co-occurring word IDs there are, and assigns a score (one possible weight could be based on the frequency of co-occurrence vs. the sum of the frequencies of the individual word IDs)? Then apply the function to the two arrays to get an array of scores.
Please also let me know if there are other approaches you would consider as well.
Thanks!
Data:
array(['0/1/2/3/4/5/6/5/7/8/9/3/10', '11/12/13/14/15/15/16/17/12',
'18/19/20/21/22/23/24/25',
'26/27/28/29/30/31/32/33/34/35/36/37/38/39/33/34/40/41',
'5/42/43/15/44/45/46/47/48/26/49/50/51/52/49/53/54/51/55/56/22',
'57/58/59/60/61/49/62/23/57/58/63/57/58', '64/65/66/63/67/68/69',
'70/71/72/73/74/75/76/77',
'78/79/80/81/82/83/84/85/86/87/88/89/90/91',
'33/34/92/93/94/95/85/96/97/98/99/60/85/86/100/101/102/103',
'104/105/106/107/108/109/110/86/107/111/112/113/5/114/110/113/115/116',
'117/118/119/120/121/12/122/123/124/125',
'14/126/127/128/122/129/130/131/132/29/54/29/129/49/3/133/134/135/136',
'137/138/139/140/141/142',
'143/144/145/146/147/148/149/150/151/152/4/153/154/155/156/157/158/128/159',
'160/161/162/163/131/2/164/165/166/167/168/169/49/170/109/171',
'172/173/174/175/176/177/73/178/104/179/180/179/181/173',
'182/144/183/179/73',
'184/163/68/185/163/8/186/187/188/54/189/190/191',
'181/192/0/1/193/194/22/195',
'113/196/197/198/68/199/68/200/201/202/203/201',
'204/205/206/207/208/209/68/200',
'163/210/211/122/212/213/214/215/216/217/100/101/160/139/218/76/179/219',
'220/221/222/223/5/68/224/225/54/225/226/227/5/221/222/223',
'214/228/5/6/5/215/228/228/229',
'230/231/232/233/122/215/128/214/128/234/234',
'235/236/191/237/92/93/238/239',
'13/14/44/44/240/241/242/49/54/243/244/245/55/56',
'220/21/246/38/247/201/248/73/160/249/250/203/201',
'214/49/251/252/253/254/255/256/257/258'],
dtype='|S127')
array(['151/308/309/310/311/215/312/160/313/214/49/12',
'314/315/316/275/317/42/318/319/320/212/49/170/179/29/54/29/321/322/323',
'324/325/62/220/326/194/327/328/218/76/241/329',
'330/29/22/103/331/314/68/80/49',
'78/332/85/96/97/227/333/4/334/188',
'57/335/336/34/187/337/21/338/212/213/339/340',
'341/342/167/343/8/254/154/61/344',
'2/292/345/346/42/347/348/348/100/349/202/161/263',
'283/39/312/350/26/351', '352/353/33/34/144/218/73/354/355',
'137/356/357/358/357/359/22/73/170/87/88/78/123/360/361/53/362',
'23/363/10/364/289/68/123/354/355',
'188/28/365/149/366/98/367/368/369/370/371/372/368',
'373/155/33/34/374/25/113/73', '104/375/81/82/168/169/81/82/18/19',
'179/376/377/378/179/87/88/379/20',
'380/85/381/333/382/215/128/383/384', '385/129/386/387/388',
'389/280/26/27/390/391/302/392/393/165/394/254/302/214/217/395/396',
'397/398/291/140/399/211/158/27/400', '401/402/92/93/68/80',
'77/129/183/265/403/404/405/406/60/407/162/408/409/410/411/412/413/156',
'129/295/90/259/38/39/119/414/415/416/14/318/417/418',
'419/420/421/422/423/23/424/241/421/425/58',
'426/244/427/5/428/49/76/429/430/431',
'257/432/433/167/100/101/434/435/436', '437/167/438/344/356/170',
'439/440/441/442/192/443/68/80/444/445/111', '446/312/23/447/448',
'385/129/218/449/450/451/22/452/125/129/453/212/128/454/455/456/457/377'],
dtype='|S127')
The following code should do what you need in Python 3.x:
import numpy as np
from collections import Counter
def jaccardSim(c1, c2):
    cU = c1 | c2
    cI = c1 & c2
    sim = sum(cI.values()) / sum(cU.values())
    return sim

def byteArraySim(b1, b2):
    cA = [Counter(b1[i].decode(encoding="utf-8", errors="strict").split("/"))
          for i in range(len(b1))]
    cB = [Counter(b2[i].decode(encoding="utf-8", errors="strict").split("/"))
          for i in range(len(b2))]
    # Assuming both arrays have the same length
    cSim = [jaccardSim(cA[i], cB[i]) for i in range(len(b1))]
    return cSim  # Array of similarities
The Jaccard similarity score is used in this implementation. You may use other scores, such as cosine or Hamming, to your liking.
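For example, a cosine variant over the same Counter objects could look like this (a sketch, not part of the original implementation; use it in place of jaccardSim inside byteArraySim):
import math

def cosineSim(c1, c2):
    # cosine similarity between two bag-of-word-ID Counter vectors
    common = set(c1) & set(c2)
    dot = sum(c1[k] * c2[k] for k in common)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0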
Assuming that the arrays are stored in variables a and b, the resulting function byteArraySim(a,b) outputs the following similarity scores:
[0.0,
0.0,
0.0,
0.038461538461538464,
0.0,
0.041666666666666664,
0.0,
0.0,
0.0,
0.08,
0.0,
0.05555555555555555,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.058823529411764705,
0.0,
0.0,
0.0,
0.05555555555555555,
0.0,
0.0,
0.0,
0.0,
0.0]

Pandas sklearn one-hot encoding dataframe or numpy?

How can I transform a pandas data frame to sklearn one-hot-encoded (dataframe / numpy array) where some columns do not require encoding?
mydf = pd.DataFrame({'Target':[0,1,0,0,1, 1,1],
'GroupFoo':[1,1,2,2,3,1,2],
'GroupBar':[2,1,1,0,3,1,2],
'GroupBar2':[2,1,1,0,3,1,2],
'SomeOtherShouldBeUnaffected':[2,1,1,0,3,1,2]})
columnsToEncode = ['GroupFoo', 'GroupBar']
This is an already label-encoded data frame, and I would like to encode only the columns marked by columnsToEncode.
My problem is that I am unsure whether a pd.DataFrame or the numpy array representation is better, and how to re-merge the encoded part with the other one.
My attempts so far:
myEncoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
myEncoder.fit(X_train)
df = pd.concat([
    df[~columnsToEncode],  # select all other / numeric
    # select category to one-hot encode
    pd.Dataframe(encoder.transform(X_train[columnsToEncode]))  #.toarray() # not sure what this is for
], axis=1).reindex_axis(X_train.columns, axis=1)
Note: I am aware of Pandas: Get Dummies / http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html but that does not play well with a train/test split where I require such an encoding per fold.
This library provides several categorical encoders which make sklearn / numpy play nicely with pandas: https://github.com/wdm0006/categorical_encoding
However, they do not yet support "handle unknown category".
For now I will use
myEncoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
myEncoder.fit(df[columnsToEncode])
pd.concat([df.drop(columnsToEncode, 1),
           pd.DataFrame(myEncoder.transform(df[columnsToEncode]))], axis=1).reindex()
As this supports unseen categories, I will stick for now with half pandas, half numpy because of the nice pandas labels for the numeric columns.
For One Hot Encoding I recommend using ColumnTransformer and OneHotEncoder instead of get_dummies. That's because OneHotEncoder returns an object which can be used to encode unseen samples using the same mapping that you used on your training data.
The following code encodes all the columns provided in the columns_to_encode variable:
import pandas as pd
import numpy as np
df = pd.DataFrame({'cat_1': ['A1', 'B1', 'C1'], 'num_1': [100, 200, 300],
'cat_2': ['A2', 'B2', 'C2'], 'cat_3': ['A3', 'B3', 'C3'],
'label': [1, 0, 0]})
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
columns_to_encode = [0, 2, 3] # Change here
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), columns_to_encode)], remainder='passthrough')
X = np.array(ct.fit_transform(X))
X:
array([[1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 100],
[0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 200],
[0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 300]], dtype=object)
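One advantage over get_dummies is that the fitted transformer can be reused on unseen rows with the same mapping. A small sketch (X_new is a made-up example here; note that OneHotEncoder's default handle_unknown='error' will raise on category values it did not see during fit):
X_new = pd.DataFrame({'cat_1': ['B1'], 'num_1': [400],
                      'cat_2': ['A2'], 'cat_3': ['C3']}).values
print(ct.transform(X_new))
# e.g. [[0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 400]]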
To avoid multicollinearity due to the dummy variable trap, I would also suggest removing one of the dummy columns generated for each column that you encoded. The following code encodes all the columns provided in the columns_to_encode variable AND removes the last dummy column of each one-hot encoded column:
import pandas as pd
import numpy as np
def sum_prev(l_in):
    l_out = []
    l_out.append(l_in[0])
    for i in range(len(l_in)-1):
        l_out.append(l_out[i] + l_in[i+1])
    return [e - 1 for e in l_out]
df = pd.DataFrame({'cat_1': ['A1', 'B1', 'C1'], 'num_1': [100, 200, 300],
'cat_2': ['A2', 'B2', 'C2'], 'cat_3': ['A3', 'B3', 'C3'],
'label': [1, 0, 0]})
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
columns_to_encode = [0, 2, 3] # Change here
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), columns_to_encode)], remainder='passthrough')
columns_to_encode = [df.iloc[:, del_idx].nunique() for del_idx in columns_to_encode]
columns_to_encode = sum_prev(columns_to_encode)
X = np.array(ct.fit_transform(X))
X = np.delete(X, columns_to_encode, 1)
X:
array([[1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 100],
[0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 200],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 300]], dtype=object)
I believe that this update to the initial answer is even better in order to perform dummy coding:
import logging
import pandas as pd
from sklearn.base import TransformerMixin
log = logging.getLogger(__name__)
class CategoricalDummyCoder(TransformerMixin):
    """Identifies categorical columns by dtype of object and dummy codes them. Optionally a pandas.DataFrame
    can be returned where categories are of pandas.Category dtype and not binarized for better coding strategies
    than dummy coding."""

    def __init__(self, only_categoricals=False):
        self.categorical_variables = []
        self.categories_per_column = {}
        self.only_categoricals = only_categoricals

    def fit(self, X, y):
        self.categorical_variables = list(X.select_dtypes(include=['object']).columns)
        logging.debug(f'identified the following categorical variables: {self.categorical_variables}')
        for col in self.categorical_variables:
            self.categories_per_column[col] = X[col].astype('category').cat.categories
        logging.debug('fitted categories')
        return self

    def transform(self, X):
        for col in self.categorical_variables:
            logging.debug(f'transforming cat col: {col}')
            X[col] = pd.Categorical(X[col], categories=self.categories_per_column[col])
            if self.only_categoricals:
                X[col] = X[col].cat.codes
        if not self.only_categoricals:
            return pd.get_dummies(X, sparse=True)
        else:
            return X
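A minimal usage sketch (X_train, X_test and y_train are hypothetical pandas objects here): fit on the training split so the category levels are frozen, then transform both splits; levels unseen at fit time become NaN and therefore all-zero dummy rows.
coder = CategoricalDummyCoder()
coder.fit(X_train, y_train)                # learns the category levels per object column
X_train_coded = coder.transform(X_train)   # dummy-coded training data
X_test_coded = coder.transform(X_test)     # same dummy columns, built from the frozen levels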

polyfit() got an unexpected keyword argument 'w'

I'm trying to use np.polyfit and I keep getting the error:
TypeError: polyfit() got an unexpected keyword argument 'w'
The documentation on that function clearly mentions this argument so I'm not sure what's going on. I'm using SciPy 0.12.0 and NumPy 1.6.1.
Here's a MWE that returns that error:
import numpy as np
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0])
weight = np.array([0.2, 0.8, 0.4, 0.6, 0.1, 0.3])
poli = np.polyfit(x, y, 3, w=weight)
This is the reference for your NumPy version; the 'w' argument was only introduced in a later version.
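If upgrading NumPy is not an option, a weighted fit can be reproduced manually. A sketch of the same MWE using np.vander and np.linalg.lstsq (weighting each row of the design matrix and of y, which mirrors what newer polyfit versions do with w):
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0])
weight = np.array([0.2, 0.8, 0.4, 0.6, 0.1, 0.3])
deg = 3

V = np.vander(x, deg + 1)              # columns: x**3, x**2, x, 1
lhs = V * weight[:, None]              # weight each equation (row)
rhs = y * weight
coeffs = np.linalg.lstsq(lhs, rhs)[0]  # highest-degree coefficient first, like polyfit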
