Spark - calculate the sum of features for each sample - python

If I have an RDD that looks like the one below, I know how to calculate the sum of each feature across all samples:
import numpy as np
from pyspark import SparkContext
x = np.arange(10) # first sample with 10 features [0 1 2 3 4 5 6 7 8 9]
y = np.arange(10) # second sample with 10 features [0 1 2 3 4 5 6 7 8 9]
z = (x,y)
sc = SparkContext()
rdd1 = sc.parallelize(z)
rdd1.sum()
The output will be an array like this: ([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18]), which is what I want.
My Question is:
If I construct an RDD by parsing a CSV file as below, in which each element of the RDD is a tuple or list, how can I calculate the sum of each tuple/list element (each feature) as above? If I use sum I get this error:
Rdd : [(0.00217010083485, 0.00171658370653), (7.24521659993e-05, 4.18413109325e-06), ....]
TypeError: unsupported operand type(s) for +: 'int' and 'tuple'
[EDIT] To be more specific:
rdd = sc.parallelize([(1,3),(2,4)])
I want my output to be [3,7]. Each tuple is a data instance that I have, and each element of tuple is my feature. I want to calculate the sum of each feature per all data samples.

In that case, you will need the reduce method: zip each pair of tuples and add them element by element:
rdd.reduce(lambda x, y: [t1+t2 for t1, t2 in zip(x, y)])
# [3, 7]
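Outside of Spark, the same lambda can be checked locally with functools.reduce, a plain list standing in for the RDD:

```python
from functools import reduce

# a plain list standing in for the RDD of feature tuples
data = [(1, 3), (2, 4)]

# the same lambda: zip consecutive tuples, add them element by element
feature_sums = reduce(lambda x, y: [t1 + t2 for t1, t2 in zip(x, y)], data)
print(feature_sums)  # [3, 7]
```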

You can do something like this (note that in Python 3, map returns an iterator, so wrap it in list to materialize the result):
z = zip(x, y)
# z is [(0, 0), (1, 1), (2, 2), ...]
list(map(np.sum, z))
That should do the trick.
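Equivalently, without numpy at all, a list comprehension over the zipped pairs gives the same per-feature sums (a minimal pure-Python sketch of the same idea):

```python
x = list(range(10))
y = list(range(10))

# sum each aligned pair of features
pairwise_sums = [sum(pair) for pair in zip(x, y)]
print(pairwise_sums)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```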

Here is a solution using a PySpark DataFrame, which scales better for larger RDDs:
rdd = sc.parallelize([(1, 3), (2, 4)])
df = rdd.toDF()  # transform the RDD into a DataFrame
col_sum = df.groupby().sum().rdd.map(lambda x: x.asDict()).collect()[0]
[v for k, v in col_sum.items()]  # sum of columns: [3, 7]
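As a local sanity check (plain pandas, no Spark cluster needed), the same per-column sums can be computed like this; this is just a sketch of the equivalent computation, not the PySpark API:

```python
import pandas as pd

# two samples, two features each, as in the example above
df = pd.DataFrame([(1, 3), (2, 4)])

# sum each column (feature) across all rows (samples)
col_sums = df.sum().tolist()
print(col_sums)  # [3, 7]
```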


Lists become pd.Series, then again lists with one dimension more

I have another problem with pandas; I will never master this library.
First, this is - I think - how zip() is supposed to work with lists:
import numpy as np
import pandas as pd
a = [1,2]
b = [3,4]
print(type(a))
print(type(b))
vv = zip([1,2], [3,4])
for i, v in enumerate(vv):
    print(f"{i}: {v}")
with output:
<class 'list'>
<class 'list'>
0: (1, 3)
1: (2, 4)
The problem: I create a dataframe with list elements (in the actual code the lists come from grouping operations and I cannot change them; basically they contain all the values of a dataframe grouped by a column).
# create dataframe
values = [{'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}]
df = pd.DataFrame.from_dict(values)
print(df)
x y
0 [1, 2, 3] [4, 5, 6]
However, the columns are now pd.Series:
print(type(df["x"]))
<class 'pandas.core.series.Series'>
If I do this:
col1 = df["x"].tolist()
col2 = df["y"].tolist()
print(f"col1 is of type {type(col1)}, with length {len(col1)}, first el is {col1[0]} of type {type(col1[0])}")
col1 is of type <class 'list'>, with length 1, first el is [1, 2, 3] of type <class 'list'>
Basically, tolist() returned a list of lists (why?):
Indeed:
print("ZIP AND ITER")
vv = zip(col1, col2)
for v in vv:
    print(v)
ZIP AND ITER
([1, 2, 3], [4, 5, 6])
I need only to compute this:
# this fails because x (y) is a list
# df['s'] = [np.sqrt(x**2 + y**2) for x, y in zip(df["x"], df["y"])]
I could use df["x"][0], but that seems not very elegant.
Question:
How am I supposed to compute sqrt(x^2 + y^2) when x and y are in the two columns df["x"] and df["y"]?
This should calculate df['s']:
df['s'] = df.apply(lambda row: [np.sqrt(x**2 + y**2) for x, y in zip(row["x"], row["y"])], axis=1)
Basically, tolist() returned a list of lists (why?):
Because your dataframe has only 1 row, with two columns, and both columns contain a list as their value. Returning such a column as a list of its values therefore gives a list with one element: the list that is the value.
I think you wanted to create a dataframe like this:
values = {'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}
x y
0 1 4
1 2 5
2 3 6
Your original code, in contrast,
values = [{'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}]
df = pd.DataFrame.from_dict(values)
print(df) # yields
x y
0 [1, 2, 3] [4, 5, 6]
An elegant solution to computing sqrt(x^2 + y^2) can be done by converting the dataframe as following:
new_df = df.iloc[0,:].apply(pd.Series).T.reset_index(drop=True)
This yields the following output:
x y
0 1 4
1 2 5
2 3 6
Now compute sqrt(x^2 + y^2):
np.sqrt(new_df['x']**2 + new_df['y']**2)
This yields:
0 4.123106
1 5.385165
2 6.708204
dtype: float64
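Another possible route, assuming pandas >= 1.3 (which supports exploding multiple columns at once), is to explode both list columns and vectorize the computation. This is a sketch of the same idea, not the answer's method:

```python
import numpy as np
import pandas as pd

# one row whose cells hold lists, as in the question
df = pd.DataFrame([{'x': [1, 2, 3], 'y': [4, 5, 6]}])

# explode both list columns at once (multi-column explode needs pandas >= 1.3)
flat = df.explode(['x', 'y']).astype(float).reset_index(drop=True)

# vectorized sqrt(x^2 + y^2) over the flattened columns
s = np.sqrt(flat['x'] ** 2 + flat['y'] ** 2)
print(s.tolist())
```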

Loop Over every Nth item in Dictionary

Can anyone advise how to loop over every Nth item in a dictionary?
Essentially I have a dictionary of dataframes, and I want to create a new dictionary from every 3rd dataframe item (including the first), based on the index positioning of the original. Once I have this, I would like to concatenate the dataframes together.
So, for example, if I have 12 dataframes, I would like the new dictionary to contain the first, fourth, seventh, tenth, and so on.
Thanks in advance!
If the dict is required, you may use a tuple of the dict keys:
custom_dict = {
'first': 1,
'second': 2,
'third': 3,
'fourth': 4,
'fifth': 5,
'sixth': 6,
'seventh': 7,
'eighth': 8,
'nineth': 9,
'tenth': 10,
'eleventh': 11,
'twelveth': 12,
}
for key in tuple(custom_dict)[::3]:
    print(custom_dict[key])
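An alternative sketch using itertools.islice to take every 3rd item directly, without building the full key tuple (the integer values here are hypothetical stand-ins for the dataframes):

```python
from itertools import islice

# stand-in dict: keys df1..df12 mapping to placeholder values
custom_dict = {f'df{i}': i for i in range(1, 13)}

# take every 3rd item starting from the first (dicts preserve insertion order)
every_third = dict(islice(custom_dict.items(), 0, None, 3))
print(list(every_third.values()))  # [1, 4, 7, 10]
```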
Then, you may call pandas.concat:
df = pd.concat(
    [
        custom_dict[key]
        for key in tuple(custom_dict)[::3]
    ],
    # =====================================================================
    # axis=0  # to append one DataFrame to another vertically
    # =====================================================================
    axis=1  # to append one DataFrame to another horizontally
)
This assumes custom_dict[key] returns a pandas.DataFrame, not an int as in my code above.
What you ask is a bit unusual. Anyway, you have two main options.
Convert your dictionary values to a list and slice that:
out = pd.concat(list(dfs.values())[::3])
output:
a b c
0 x x x
0 x x x
0 x x x
0 x x x
Slice your dictionary keys and generate a sub-dictionary:
out = pd.concat({k: dfs[k] for k in list(dfs)[::3]})
output:
a b c
df1 0 x x x
df4 0 x x x
df7 0 x x x
df10 0 x x x
Used input:
dfs = {f'df{i+1}': pd.DataFrame([['x']*3], columns=['a', 'b', 'c']) for i in range(12)}

Split disorganized arrays with numpy

I am using the below code to read arrays from csv files.
x,y = np.loadtxt(filename, delimiter=';', unpack=True, skiprows=1, usecols=(1,2))
where x is an array that goes like this: [5,5,5,0,1,1,2,3,3,4,5,5,5]
and y is [111.0,111.1,111.2,111.3,111.4,111.5...]
I want to split both arrays accordingly using x.
So my expected output would be something like this:
[1,1,1,1,1..][111.4,111.5,111.6...]
[2,2,2,2,..][111.10,111.11,111.12...]
[5,5,5,5,5,...][111.0,111.1,111.2...111.20,111.21,111.22]
...
so that I can choose an x value and it will return its respective y values.
I've tried using np.split(x, [21,1,2,3...]) but it doesn't seem to be working for me.
Despite the fact that my solution is probably not the most efficient one performance-wise, you can use it as a starting point for future investigation:
import numpy as np

# some dummy data
x = np.array([5, 5, 5, 0, 1, 1, 2, 3, 3, 4, 5, 5, 5])
y = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

def split_by_ids(data: np.array, ids: np.array):
    splits = []  # result storage
    # get unique indices with their counts from the ids array
    elems, counts = np.unique(ids, return_counts=True)
    # go through each index and its count
    for index, count in zip(elems, counts):
        # create an array of the repeated index and grab the corresponding values from data
        splits.append((np.repeat(index, count), data[ids == index]))
    return splits

split_result = split_by_ids(y, x)
for ids, values in split_result:
    print(f'Ids: {ids}, Values: {values}')
The code above results in:
Ids: [0], Values: [3]
Ids: [1 1], Values: [4 5]
Ids: [2], Values: [6]
Ids: [3 3], Values: [7 8]
Ids: [4], Values: [9]
Ids: [5 5 5 5 5 5], Values: [ 0 1 2 10 11 12]
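A more vectorized sketch of the same grouping: stable-sort y by x, then split at the positions where the sorted x value changes (same dummy x and y as above):

```python
import numpy as np

x = np.array([5, 5, 5, 0, 1, 1, 2, 3, 3, 4, 5, 5, 5])
y = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

order = np.argsort(x, kind='stable')          # stable sort keeps y's original order within a group
xs, ys = x[order], y[order]
boundaries = np.flatnonzero(np.diff(xs)) + 1  # positions where the sorted x value changes
groups = {int(g[0]): v for g, v in zip(np.split(xs, boundaries),
                                       np.split(ys, boundaries))}
print(groups[5])
```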

Two arrays (one indicating an index, the other the number of repetitions). I want to remove indices based on the number of repetitions (python)

I am working in a Colab notebook with some dataframes and I have two numpy arrays:
- The first one indicates the index of a row.
- The other one indicates the number of repetitions (I did some processing before all this).
If I print both arrays I get something like this:
print(uniqueValues, occurCount)
OUTPUT: [ 13 33 66 ... 99907 99911 99928] [7 1 6 ... 1 6 4]
We can interpret it as: 13 is repeated 7 times, 33 is repeated 1 time, and so on.
Now the question:
How can I remove the index and the repetition count from both arrays, based on the number of repetitions?
Example:
if < 5 then remove element
Expected output:[ 13 66 ... 99911] [7 6 ... 6]
You can use the matching values from occurCount as a filter on uniqueValues and occurCount using boolean indexing:
uniqueValues = uniqueValues[occurCount >= 5]
occurCount = occurCount[occurCount >= 5]
For example:
import numpy as np
uniqueValues = np.array([13, 33, 66, 99907, 99911, 99928])
occurCount = np.array([7, 1, 6, 1, 6, 4])
uniqueValues = uniqueValues[occurCount >= 5]
occurCount = occurCount[occurCount >= 5]
print(uniqueValues )
print(occurCount)
Output:
[ 13 66 99911]
[7 6 6]
uniqueValues = np.array([13, 33, 66, 99907, 99911, 99928])
occurCount = np.array([7, 1, 6, 1, 6, 4])
np.array([uniqueValues, occurCount])[:, occurCount >= 5]
This will return a 2-dimensional array with your results, but the logic is the same as pointed out by Nick.
Create a new list in which you append the indexes of the occurCount values that meet the criterion < 5. Then use these index values to delete the corresponding entries from both arrays and store the new versions. You need to assign the results back to the variables because np.delete returns a new array rather than modifying the original in place.
import numpy as np
uniqueValues = np.array([13, 33, 66, 99907, 99911, 99928])
occurCount = np.array([7, 1, 6, 1, 6, 4])
indexes = []
for index, item in enumerate(occurCount):
    if item < 5:
        indexes.append(index)
occurCount = np.delete(occurCount, indexes)
uniqueValues = np.delete(uniqueValues, indexes)
print(uniqueValues, occurCount)

Use of sample and seed in numpy

import pandas as pd
die = pd.DataFrame([1, 2, 3, 4, 5, 6])
sum_of_dice = die.sample(n=2, replace=True).sum().loc[0]
print (sum_of_dice)
Can someone explain to me what .sum().loc[0] is doing here?
It's always useful to print the intermediate steps to get an idea.
sum calculates the sum of the dataframe for each column.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html
loc selects a group of rows/columns.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
sum returns a Series with one element, but as we need the sum as an integer, not a Series, we use loc to get the first element.
import pandas as pd
die = pd.DataFrame([1, 2, 3, 4, 5, 6])
sum_of_dice = die.sample(n=2, replace=True)
print(sum_of_dice)
sum_of_dice = sum_of_dice.sum()
print('---')
print (sum_of_dice)
sum_of_dice = sum_of_dice.loc[0]
print('---')
print (sum_of_dice)
0
4 5
0 1
---
0 6
dtype: int64
---
6
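Regarding the seed from the title: sample takes a random_state parameter, and fixing it makes the draw reproducible. A small sketch (the specific value drawn depends on the pandas version, so only reproducibility is shown):

```python
import pandas as pd

die = pd.DataFrame([1, 2, 3, 4, 5, 6])

# same random_state => same two rows sampled => same sum of dice
roll_a = die.sample(n=2, replace=True, random_state=42).sum().loc[0]
roll_b = die.sample(n=2, replace=True, random_state=42).sum().loc[0]
print(roll_a == roll_b)   # True: the seeded draw is deterministic
print(2 <= roll_a <= 12)  # True: two dice always sum to between 2 and 12
```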
