I'm trying to calculate the proportion of a specific value occurring in a specific column within subgroups.
Sample dataframe:
pdf = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'letter': ['L', 'A', 'L', 'L', 'L', 'L', 'L', 'A', 'L', 'L']
})
df = spark.createDataFrame(pdf)
df.show()
I tried to rely on this answer, using the following code:
df\
.groupby('id')\
.agg((count(col('letter') == 'L') / count(col('letter'))).alias('prop'))\
.show()
I obtained a column full of 1.0, even when I changed 'L' to 'A'.
My desired output is, for each group, the proportion of 'L' values within the group:
+---+--------+
| id|    prop|
+---+--------+
|  1|    0.75|
|  2|     1.0|
|  3| 0.66667|
+---+--------+
You can use sum with when instead to count the occurrences of L:
from pyspark.sql import functions as F

df.groupby('id')\
  .agg((F.sum(F.when(F.col('letter') == 'L', 1)) / F.count(F.col('letter'))).alias('prop'))\
  .show()
This gives you the proportion among non-null values only. If you want to calculate it over all rows, divide by count("*") instead of count(col('letter')).
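For instance, a rough sketch of that variant (same aggregation, only the denominator changes):
# divide by count("*") to include rows where `letter` is null (sketch)
df.groupby('id')\
  .agg((F.sum(F.when(F.col('letter') == 'L', 1)) / F.count("*")).alias('prop'))\
  .show()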
Before you count, you need to mask the non-L letters with nulls using when:
from pyspark.sql.functions import col, count, when

df\
.groupby('id')\
.agg((count(when(col('letter') == 'L', 1)) / count(col('letter'))).alias('prop'))\
.show()
Note that count only counts non-null entries; it does not count only true entries, as your code assumed. Your original expression would be more suitable with count_if from Spark SQL.
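For illustration, a rough sketch of that count_if route (assuming Spark 3.0+, where count_if is available as a SQL aggregate):
from pyspark.sql.functions import expr

# count_if counts only rows where the condition is true (Spark >= 3.0)
df.groupby('id')\
  .agg(expr("count_if(letter = 'L') / count(letter)").alias('prop'))\
  .show()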
Related
I have a PySpark DataFrame and I want to map values of a column.
Sample dataset:
data = [(1, 'N'),
        (2, 'N'),
        (3, 'C'),
        (4, 'S'),
        (5, 'North'),
        (6, 'Central'),
        (7, 'Central'),
        (8, 'South')]
columns = ["ID", "City"]
df = spark.createDataFrame(data=data, schema=columns)
The mapping dictionary is:
{'N': 'North', 'C': 'Central', 'S': 'South'}
And I use the following code:
from pyspark.sql import functions as F
from itertools import chain
mapping_dict = {'N': 'North', 'C': 'Central', 'S': 'South'}
mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping_dict.items())])
df_new = df.withColumn('City_New', mapping_expr[df['City']])
And the results are:
+---+-------+--------+
| ID|   City|City_New|
+---+-------+--------+
|  1|      N|   North|
|  2|      N|   North|
|  3|      C| Central|
|  4|      S|   South|
|  5|  North|    null|
|  6|Central|    null|
|  7|Central|    null|
|  8|  South|    null|
+---+-------+--------+
As you can see, I get null values for the rows whose values are not included in the mapping dictionary. To solve this, I could define the mapping dictionary as:
{'N': 'North', 'C': 'Central', 'S': 'South',
 'North': 'North', 'Central': 'Central', 'South': 'South'}
However, if there are many unique values in the dataset, it becomes tedious to define the mapping dictionary by hand. Is there a better way to do this?
You can use coalesce. Here's how it would look.
from pyspark.sql import functions as func

map_dict = {'N': 'North', 'C': 'Central', 'S': 'South'}  # the mapping from the question

# create separate case-whens for each key-value pair
map_whens = [func.when(func.upper('city') == k.upper(), v) for k, v in map_dict.items()]
# [Column<'CASE WHEN (upper(city) = N) THEN North END'>,
#  Column<'CASE WHEN (upper(city) = C) THEN Central END'>,
#  Column<'CASE WHEN (upper(city) = S) THEN South END'>]
# pass case whens to coalesce with last value as `city` field
data_sdf. \
    withColumn('city_new', func.coalesce(*map_whens, 'city')). \
    show()
# +---+-------+--------+
# | id|   city|city_new|
# +---+-------+--------+
# |  1|      N|   North|
# |  2|      N|   North|
# |  3|      C| Central|
# |  4|      S|   South|
# |  5|  North|   North|
# |  6|Central| Central|
# |  7|Central| Central|
# |  8|  South|   South|
# +---+-------+--------+
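If you prefer to keep the create_map expression from the question, a similar sketch is to wrap the lookup in coalesce so unmapped values fall back to the original column:
from itertools import chain
from pyspark.sql import functions as F

mapping_dict = {'N': 'North', 'C': 'Central', 'S': 'South'}
mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping_dict.items())])

# fall back to the original City value when the map lookup returns null
df_new = df.withColumn('City_New', F.coalesce(mapping_expr[F.col('City')], F.col('City')))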
Given a random dataset, I need to find rows related to the first row.
| Row | Foo | Bar | Baz | Qux |
|-----|-----|-----|-----|-----|
| 0   | A   | A🔴 | A   | A   |
| 1   | B   | B   | B   | B   |
| 2   | C   | C   | C   | D🟠 |
| 3   | D   | A🔴 | D   | D🟠 |
I should get the related rows which are 0, 2, and 3 because 0['Bar'] == 3['Bar'] and 3['Qux'] == 2['Qux'].
I could just iterate over the columns to find the similarities, but that would be slow and inefficient, and I would also need to iterate again whenever new similarities are found.
I hope someone can point me in the right direction, e.g. which pandas concept I should look at or which functions could help me retrieve this intersecting data. Do I even need pandas?
Edit:
Here is the solution as suggested by @goodside. It loops until no new matching indices are found.
table = [
    ['A', 'A', 'A', 'A'],
    ['B', 'B', 'B', 'B'],
    ['C', 'C', 'C', 'D'],
    ['D', 'A', 'D', 'D']
]

comparators = [0]  # start with the first row
while True:
    for idx_row, row in enumerate(table):
        if idx_row in comparators:
            continue
        for idx_col, cell in enumerate(row):
            for comparator in comparators:
                if cell == table[comparator][idx_col]:
                    # found a new related row; restart the scan from the top
                    comparators.append(idx_row)
                    break
            else:
                continue
            break
        else:
            continue
        break
    else:
        # a full pass found no new related rows
        break

for item in comparators:
    print(table[item])
This is a graph problem. You can use networkx:
# get the list of connected nodes per column
def get_edges(s):
    return df['Row'].groupby(s).agg(frozenset)

edges = set(df.apply(get_edges).stack())
edges = list(map(set, edges))
# [{2}, {2, 3}, {0, 3}, {3}, {1}, {0}]
from itertools import pairwise, chain
# pairwise is python ≥ 3.10, see the doc for a recipe for older versions
# create the graph
import networkx as nx
G = nx.from_edgelist(chain.from_iterable(pairwise(e) for e in edges))
G.add_nodes_from(set.union(*edges))
# get the connected components
list(nx.connected_components(G))
Output: [{0, 2, 3}, {1}]
NB. You can read more on the logic to create the graph in this question of mine.
Used input:
df = pd.DataFrame({'Row': [0, 1, 2, 3],
'Foo': ['A', 'B', 'C', 'D'],
'Bar': ['A', 'B', 'C', 'A'],
'Baz': ['A', 'B', 'C', 'D'],
'Qux': ['A', 'B', 'D', 'D']})
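If you only need the rows related to the first row, a small follow-up sketch (reusing G and df from above) is to keep the connected component that contains row 0:
# select the component containing row 0, then filter the frame to those rows
related = next(c for c in nx.connected_components(G) if 0 in c)
print(df[df['Row'].isin(related)])   # rows 0, 2 and 3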
I have a dataframe that looks like this
_____________________
|col1 | col2 | col3 |
---------------------
| a | b | c |
| d | b | c |
| e | f | g |
| h | f | j |
---------------------
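(For reference, a sketch of this frame as a DataFrame; the construction below is assumed from the table above.)
import pandas as pd

# assumed construction matching the table above
df = pd.DataFrame({'col1': ['a', 'd', 'e', 'h'],
                   'col2': ['b', 'b', 'f', 'f'],
                   'col3': ['c', 'c', 'g', 'j']})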
I want to get a dictionary structure that looks as follows
{
b : { col1: [a,d], col2: b, col3: c},
f : { col1: [e, h], col2: f, col3: [g, j]}
}
I have seen this answer. But it seems like overkill for what I want to do as it converts every value of the key inside the nested dictionary into a list. I would only like to convert col1 into a list when creating the dictionary. Is this possible?
Use a custom lambda function that returns the unique values as a list if there are multiple of them, otherwise a scalar:
d = (df.set_index('col2', drop=False)
.groupby(level=0)
.agg(lambda x: list(set(x)) if len(set(x)) > 1 else list(set(x))[0])
.to_dict('index'))
print (d)
{'b': {'col1': ['d', 'a'], 'col2': 'b', 'col3': 'c'},
'f': {'col1': ['h', 'e'], 'col2': 'f', 'col3': ['j', 'g']}}
If order is important, use dict.fromkeys to remove duplicates:
d = (df.set_index('col2', drop=False)
.groupby(level=0)
.agg(lambda x: list(dict.fromkeys(x)) if len(set(x)) > 1 else list(set(x))[0])
.to_dict('index'))
print (d)
{'b': {'col1': ['a', 'd'], 'col2': 'b', 'col3': 'c'},
'f': {'col1': ['e', 'h'], 'col2': 'f', 'col3': ['g', 'j']}}
I have two dataframes and I want to populate new column values in dataframe 1 based on matching Zipcode and date from dataframe 2.
The sample input and desired output are given below. The date formats are not the same. Dataframe 1 has more than 100k records, and dataframe 2 has columns for every month.
Any suggestions would be of great help since I am a newbie to Python.
You are looking for pd.merge. Here is an example showing how you can use it:
import pandas as pd

df1 = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6],
                    'y': ['a', 'b', 'c', 'd', 'e', 'f']})
df2 = pd.DataFrame({'x2': [1, 2, 3, 4, 5, 6],
                    'y': ['h', 'i', 'j', 'k', 'l', 'm']})
pd.merge(df1, df2, left_on='x1', right_on='x2')
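Since your two frames match on both Zipcode and date, and the date formats differ, one possible sketch (the column names here are assumptions, as the sample data wasn't posted) is to normalize the dates first and then merge on both keys:
import pandas as pd

# hypothetical frames; real column names may differ
df1 = pd.DataFrame({'Zipcode': [10001, 10002],
                    'Date': ['01/15/2021', '02/20/2021']})
df2 = pd.DataFrame({'Zipcode': [10001, 10002],
                    'Date': ['2021-01-15', '2021-02-20'],
                    'Value': [3.2, 4.8]})

# normalize the differing date formats before merging
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])

merged = df1.merge(df2, on=['Zipcode', 'Date'], how='left')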
I'm trying to simplify pandas and python syntax when executing a basic Pandas operation.
I have 4 columns:
a_id
a_score
b_id
b_score
I create a new label called doc_type based on the following:
a >= b, doc_type: a
b > a, doc_type: b
I'm struggling with how to handle, in pandas, the case where a exists but b doesn't; in that case a needs to be the label. Right now it falls through to the else branch and returns b.
I had to add two extra comparisons, which may be inefficient at scale since I already compare the data earlier. I'm looking for a way to improve this.
df = pd.DataFrame({
'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
'a_score': [1, 2, 3, 4, '', 6, 7],
'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})
print df
# Replace empty string with NaN
m_score = r['a_score'] >= r['b_score']
m_doc = (r['a_id'].isnull() & r['b_id'].isnull())
df = df.apply(lambda x: x.str.strip() if isinstance(x, str) else x).replace('', np.nan)
# Calculate higher score
df['doc_id'] = df.apply(lambda df: df['a_id'] if df['a_score'] >= df['b_score'] else df['b_id'], axis=1)
# Select type based on higher score
r['doc_type'] = numpy.where(m_score, 'a',
numpy.where(m_doc, numpy.nan, 'b'))
# Additional lines looking for improvement:
df['doc_type'].loc[(df['a_id'].isnull() & df['b_id'].notnull())] = 'b'
df['doc_type'].loc[(df['a_id'].notnull() & df['b_id'].isnull())] = 'a'
print df
Use numpy.where, assuming your logic is:
Both exist, the doc_type will be the one with higher score;
One missing, the doc_type will be the one not null;
Both missing, the doc_type will be null;
I added an extra edge case in the last line:
import numpy as np

df = df.replace('', np.nan)
df['doc_type'] = np.where(df.b_id.isnull() | (df.a_score >= df.b_score),
                          np.where(df.a_id.isnull(), None, 'a'), 'b')
df
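For comparison, here is a sketch of the same three cases spelled out with np.select; this is just an alternative way to write the logic listed above:
import numpy as np

df = df.replace('', np.nan)

conditions = [
    df.a_score.isnull() & df.b_score.isnull(),         # both missing -> null
    df.b_score.isnull() | (df.a_score >= df.b_score),  # b missing or a wins -> 'a'
]
df['doc_type'] = np.select(conditions, [None, 'a'], default='b')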
Not sure I fully understand all conditions or if this has any particular edge cases, but I think you can just do an np.argmax on the columns and swap the values for 'a' or 'b' when you're done:
In [21]: import numpy as np
In [22]: df['doc_type'] = pd.Series(np.argmax(df[["a_score", "b_score"]].values, axis=1)).replace({0: 'a', 1: 'b'})
In [23]: df
Out[23]:
a_id a_score b_id b_score doc_type
0 A 1 a 0.10 a
1 B 2 b 0.20 a
2 C 3 c 3.10 b
3 D 4 d 4.10 b
4 2 e 5.00 b
5 F f 5.99 a
6 G 7 NaN a
Use the apply method in pandas with a custom function; trying it out on your dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
    'a_score': [1, 2, 3, 4, '', 6, 7],
    'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})
df = df.replace('', np.nan)

def func(row):
    if np.isnan(row.a_score) and np.isnan(row.b_score):
        return np.nan
    elif np.isnan(row.b_score) and not np.isnan(row.a_score):
        # only a exists
        return 'a'
    elif not np.isnan(row.b_score) and np.isnan(row.a_score):
        # only b exists
        return 'b'
    elif row.a_score >= row.b_score:
        return 'a'
    elif row.b_score > row.a_score:
        return 'b'

df['doc_type'] = df.apply(func, axis=1)
You can make the function as complicated as you need, include any number of comparisons, and add more conditions later if necessary.