I need to count occurrences of a value in several columns, and I want all those individual counts, one per column, in a list.
Is there a faster/better way of doing this? My solution takes quite some time.
from pyspark.sql.functions import col

dataframe.cache()
counts = [dataframe.filter(col(str(i)) == "value").count() for i in range(150)]
You can do a conditional count aggregation:
import pyspark.sql.functions as F
df2 = df.agg(*[
    F.count(F.when(F.col(str(i)) == "value", 1)).alias(str(i))
    for i in range(150)
])
result = df2.toPandas().transpose()[0].tolist()
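For a quick sanity check of the conditional-count idea outside Spark, the same per-column result can be reproduced in plain pandas (a minimal sketch with toy data, not part of the original answer; column names are illustrative):

```python
import pandas as pd

# toy frame with two string columns named "0" and "1"
df = pd.DataFrame({'0': ['value', 'x', 'value'], '1': ['y', 'value', 'z']})

# one pass over the data: compare every cell to "value", then sum matches per column
counts = (df == 'value').sum().tolist()
print(counts)  # [2, 1]
```

The Spark `F.count(F.when(...))` aggregation does the same thing in a single job, which is why it is much faster than 150 separate `.filter(...).count()` jobs.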
You can try the following approach/design:
Write a map function for each row of the data frame like this:
VALUE = 'value'
def row_mapper(df_row):
    return [each == VALUE for each in df_row]
Write a reduce function for the data frame that takes two rows as input:
def reduce_rows(df_row1, df_row2):
    return [x + y for x, y in zip(df_row1, df_row2)]
Note: these are simple Python functions to help you understand the idea, not UDFs you can apply directly in PySpark.
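Since these are plain Python functions, the idea can be demonstrated without Spark at all, e.g. with functools.reduce over a toy list of rows (a sketch; the data here is made up):

```python
from functools import reduce

VALUE = 'value'
rows = [['value', 'x'], ['value', 'value'], ['y', 'value']]

def row_mapper(df_row):
    # True/False per cell; booleans sum as 1/0 in the reduce step
    return [each == VALUE for each in df_row]

def reduce_rows(df_row1, df_row2):
    # element-wise sum accumulates per-column match counts
    return [x + y for x, y in zip(df_row1, df_row2)]

counts = reduce(reduce_rows, map(row_mapper, rows))
print(counts)  # [2, 2]
```

In PySpark itself, the equivalent would be something along the lines of `df.rdd.map(row_mapper).reduce(reduce_rows)`, adjusted to your row type.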
Apologies if this question appears to be a duplicate of other questions, but I could not find an answer that addresses my problem exactly.
I split a dataframe, called "data", into multiple subsets that are stored in a dictionary of dataframes named "dfs" as follows:
# Partition DF
dfs = {}
chunk = 5
for n in range(data.shape[0] // chunk + 1):
    df_temp = data.iloc[n*chunk:(n+1)*chunk]
    df_temp = df_temp.reset_index(drop=True)
    dfs[n] = df_temp
Now, I would like to apply a pre-defined helper function called "fun_c" to EACH of the dataframes (that are stored in the dictionary object called "dfs").
Is it correct for me to apply the function to the dfs in one go, as follows?
result = fun_c(dfs)
If not, what would be the correct way of doing this?
It depends on the output you're looking for:
If you want a dict in the output, then apply the function to each dict item:
result = {key: fun_c(val) for key, val in dfs.items()}
If you want a list of dataframes/values in the output, then apply the function to each dict value:
result = [fun_c(val) for val in dfs.values()]
But this style isn't wrong either; you can iterate however you like inside the helper function as well:
def fun_c(dfs):
    result = None
    # either
    for key, val in dfs.items():
        pass
    # or
    for val in dfs.values():
        pass
    return result
Let me know if this helps!
Since you want this:
Now, I would like to apply a pre-defined helper function called
"fun_c" to EACH of the dataframes (that are stored in the dictionary
object called "dfs").
Let's say your dataframe dict looks like this and your helper function takes in a single dataframe.
dfs = {0 : df0, 1: df1, 2: df2, 3:df3}
Let's iterate through the dictionary, apply the fun_c function on each of the dataframes, and save the results in another dictionary having the same keys:
dfs_result = {k: fun_c(v) for k, v in dfs.items()}
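To make that concrete, here is a minimal sketch with toy frames and a stand-in fun_c (your real helper is whatever you have pre-defined; this one just reports the row count):

```python
import pandas as pd

# toy dict of dataframes, mimicking the partitioned "dfs"
dfs = {0: pd.DataFrame({'a': [1, 2]}), 1: pd.DataFrame({'a': [3]})}

def fun_c(df):
    # stand-in helper for illustration only
    return len(df)

dfs_result = {k: fun_c(v) for k, v in dfs.items()}
print(dfs_result)  # {0: 2, 1: 1}
```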
I am using a custom function in pandas that iterates over cells in a dataframe, finds the same row in a different dataframe, extracts it as a tuple, extracts a random value from that tuple, adds a user-specified amount of noise to the value, and returns it to the original dataframe. I was hoping to find a way to do this that uses applymap; is it possible? I couldn't find a way using applymap, so I used itertuples, but an applymap solution should be more efficient.
import random

import numpy as np
import pandas as pd

# Mock data creation
key = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})
results = pd.DataFrame(np.zeros((3, 3)))
def apply_value(value):
    key_index = # <-- THIS IS WHERE I NEED A WAY TO ACCESS INDEX
    key_tup = key.iloc[key_index]
    length = len(key_tup) - 1
    random_int = random.randint(1, length)
    random_value = key_tup[random_int]
    return random_value
results = results.applymap(apply_value)
If I understood your problem correctly, this piece of code should work. The problem is that applymap does not expose the dataframe's index, so you have to nest two apply calls: the outer one iterates over rows (where we get the key from x.name), and the inner one iterates over the columns of each row. Hope it helps. Let me know if it does :D
# Mock data creation
key = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})
results = pd.DataFrame(np.zeros((3, 3)))

def apply_value(value, key_index):
    key_tup = key.loc[key_index]
    length = len(key_tup) - 1
    random_int = random.randint(1, length)
    random_value = key_tup[random_int]
    return random_value
results = results.apply(lambda x: x.apply(lambda d: apply_value(d, x.name)), axis=1)
Strictly speaking, you don't need to access the row index inside your function; there are other, simpler ways to implement this.
You can probably do without it entirely; you don't even need a pandas JOIN/merge of rows of key.
But first, you need to fix your example data, if key is really supposed to be a dataframe of tuples.
So you want to:
sweep over each row with apply(..., axis=1)
lookup the value of each cell key.loc[key_index]...
...which is supposed to give you a tuple key_tup, but in your example key was a simple dataframe, not a dataframe of tuples
key_tup = key.iloc[key_index]
the business with:
length = (len(key_tup) - 1)
random_int = random.randint(1, length)
random_value = key_tup[random_int]
can be simplified to just:
np.random.choice(key_tup)
in which case you likely don't need to declare apply_value()
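Putting those points together, the whole thing can be sketched without apply_value at all (note np.random.choice here draws from the entire row, including the first column, which differs slightly from the original randint(1, length)):

```python
import numpy as np
import pandas as pd

key = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})
results = pd.DataFrame(np.zeros((3, 3)))

# for each row i of results, fill every cell with an independent random draw
# from the corresponding row i of key
results = results.apply(
    lambda row: pd.Series(np.random.choice(key.loc[row.name], size=len(row))),
    axis=1)
```

Adding the user-specified noise would then be a single vectorized step on top, e.g. `results + noise`.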
I am removing a number of records in a pandas data frame which contains diverse combinations of NaN in the 4-column frame. I have created a function called complete_cases to provide indexes of rows which meet the following condition: all columns in the row are NaN.
I have tried this function below:
def complete_cases(dataframe):
    indx = [x for x in list(dataframe.index)
            if dataframe.loc[x, :].isna().sum() == len(dataframe.columns)]
    return indx
I am wondering whether this is optimal enough or there is a better way to do this.
Absolutely. All you need to do is
df.dropna(axis=0, how='all', inplace=True)
This will remove all rows whose values are all missing (your complete_cases condition), and updates the data frame "inplace".
I'd recommend using loc, isna, and all with the 'columns' axis, like this:
df.loc[df.isna().all(axis='columns')]
This way you'll filter exactly the all-NaN rows your complete_cases function targets (analogous to negating R's complete.cases).
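For the all-NaN condition from the question specifically, `all` (rather than `any`) over the columns axis picks out the matching rows; a minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan], 'b': [2.0, np.nan, 3.0]})

# indexes of rows where every column is NaN (here, only row 1)
all_nan_idx = df.index[df.isna().all(axis='columns')].tolist()
print(all_nan_idx)  # [1]
```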
A possible solution:
Count the number of columns with "NA" creating a column to save it
Based on this new column, filter the rows of the data frame as you wish
Remove the (now) unnecessary column
It is possible to do it with a lambda function. For example, if you want to remove rows that have 10 "NA" values:
df['count'] = df.apply(lambda x: 0 if x.isna().sum() == 10 else 1, axis=1)
df = df[df['count'] != 0]
del df['count']
df.apply is a method that can apply a certain function to all the columns in a dataframe, or to the required columns. However, my aim is to compute the hash of a string: this string is the concatenation of all the values in a row across all the columns. My current code returns NaN.
The current code is:
df["row_hash"] = df["row_hash"].apply(self.hash_string)
The function self.hash_string is:
def hash_string(self, value):
    return sha1(str(value).encode('utf-8')).hexdigest()
Yes, it would be easier to merge all columns of the Pandas dataframe, but the existing answers couldn't help me either.
The file that I am reading is(the first 10 rows):
16012,16013,16014,16015,16016,16017,16018,16019,16020,16021,16022
16013,16014,16015,16016,16017,16018,16019,16020,16021,16022,16023
16014,16015,16016,16017,16018,16019,16020,16021,16022,16023,16024
16015,16016,16017,16018,16019,16020,16021,16022,16023,16024,16025
16016,16017,16018,16019,16020,16021,16022,16023,16024,16025,16026
The col names are: col_test_1, col_test_2, .... , col_test_11
You can create a new column, which is concatenation of all others:
df['new'] = df.astype(str).values.sum(axis=1)
And then apply your hash function on it
df["row_hash"] = df["new"].apply(self.hash_string)
or this one-liner should work (summing strings over axis=1 in pandas concatenates them, and keeping it as a Series gives you .apply):
df["row_hash"] = df.astype(str).sum(axis=1).apply(hash_string)
However, not sure if you need a separate function here, so:
df["row_hash"] = df.astype(str).sum(axis=1).apply(lambda x: sha1(x.encode('utf-8')).hexdigest())
You can use apply twice, first on the row elements then on the result:
df.apply(lambda x: ''.join(x.astype(str)),axis=1).apply(self.hash_string)
Sidenote: I don't understand why you are defining hash_string as an instance method (instead of a plain function), since it doesn't use the self argument. In case you have problems, you can just pass it as a function:
df.apply(lambda x: ''.join(x.astype(str)),axis=1).apply(lambda value: sha1(str(value).encode('utf-8')).hexdigest())
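As a quick check of the concatenate-then-hash approach on toy data (column names here are illustrative, not from the original question):

```python
from hashlib import sha1

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# join each row's values into one string, then hash that string
row_hash = df.apply(lambda x: ''.join(x.astype(str)), axis=1) \
             .apply(lambda v: sha1(v.encode('utf-8')).hexdigest())
print(row_hash[0] == sha1(b'1x').hexdigest())  # True
```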
The following code applies a function f to a dataframe column data_df["c"] and concats the results to the original dataframe, i.e. concatenating 1024 columns to the dataframe data_df.
data_df = apply_and_concat(data_df, "c", lambda x: f(x, y), [y + "-dim" + str(i) for i in range(0,1024)])
def apply_and_concat(df, field, func, column_names):
    return pd.concat((
        df,
        df[field].apply(
            lambda cell: pd.Series(func(cell), index=column_names))), axis=1)
The problem is that I want to execute this dynamically, meaning that I don't know how many columns it returns. f returns a list. Is there any better or easier way to add these columns without needing to specify the number of columns beforehand?
Your use of pd.concat(df, df.apply(...), axis=1) already solves the main task well. It seems like your main question really boils down to "how do I name an unknown number of columns", where you're happy to use a name based on sequential integers. For that, use itertools.count():
import itertools
f_modified = lambda x: dict(zip(
    ('{}-dim{}'.format(y, i) for i in itertools.count()),
    f(x, y)
))
Then use f_modified instead of f. That way, you get a dictionary instead of a list, with an arbitrary number of dynamically generated names as keys. When converting this dictionary to a Series, you'll end up with the keys being used as the index, so you don't need to provide an explicit list as the index, and hence don't need to know the number of columns in advance.
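A minimal sketch of that pattern with a stand-in f that returns a three-element list (the prefix y and the function body are illustrative only; your real f can return any length):

```python
import itertools

y = 'feat'  # hypothetical prefix standing in for the original y

def f(x, y):
    # stand-in for the real f, which returns a list of unknown length
    return [x, x + 1, x + 2]

# zip stops when f's list is exhausted, so the infinite name generator is safe
f_modified = lambda x: dict(zip(
    ('{}-dim{}'.format(y, i) for i in itertools.count()),
    f(x, y)
))
print(f_modified(10))  # {'feat-dim0': 10, 'feat-dim1': 11, 'feat-dim2': 12}
```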