Confused about the usage of .apply and lambda - python

After encountering code that min-max normalises selected DataFrame columns using .apply and a lambda, I was confused about the usage of both .apply and lambda. First, does .apply apply the desired change to all elements of all the specified columns at once, or to each column one by one? Second, does the x in lambda x: iterate through every element of the specified columns, or through the columns separately? Third, do x.min and x.max give the minimum and maximum of all the elements in the specified columns, or the minimum and maximum of each column separately? Any answer explaining the whole process would make me more than grateful.
Thanks.

I think it is best to avoid apply here (it loops under the hood) and instead work with a subset of the DataFrame selected by a list of columns:
import pandas as pd

df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
c = ['B','C','D']
First select the minimal values of the selected columns, and similarly the maximal values:
print (df[c].min())
B 4
C 2
D 0
dtype: int64
Then subtract and divide:
print ((df[c] - df[c].min()))
B C D
0 0 5 1
1 1 6 3
2 0 7 5
3 1 2 7
4 1 0 1
5 0 1 0
print (df[c].max() - df[c].min())
B 1
C 7
D 7
dtype: int64
df[c] = (df[c] - df[c].min()) / (df[c].max() - df[c].min())
print (df)
A B C D E F
0 a 0.0 0.714286 0.142857 5 a
1 b 1.0 0.857143 0.428571 3 a
2 c 0.0 1.000000 0.714286 6 a
3 d 1.0 0.285714 1.000000 9 b
4 e 1.0 0.000000 0.142857 2 b
5 f 0.0 0.142857 0.000000 4 b
EDIT:
To debug apply, it is best to create a custom function:
def f(x):
    #x is one column (a Series) per call
    print (x)
    #scalar - the column minimum
    print (x.min())
    #new Series - the normalised column
    print ((x - x.min()) / (x.max() - x.min()))
    return (x - x.min()) / (x.max() - x.min())

df[c] = df[c].apply(f)
print (df)
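For the original question, the equivalent apply + lambda form is below. apply passes the selected columns to the lambda one at a time, so x is a single column (a Series), x.min() / x.max() are that column's scalar minimum and maximum, and the lambda returns the new normalised column:
#x is one column of df[c] per call, so min/max are computed per column
df[c] = df[c].apply(lambda x: (x - x.min()) / (x.max() - x.min()))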

Check whether the data are really being normalised: depending on how apply is called, x.min and x.max may simply be the min and max of a single value, in which case no normalisation occurs.
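To verify that the normalisation actually happened, a quick check (a small sketch reusing df and c from the first answer): after min-max scaling, every selected column should have minimum 0 and maximum 1.
print (df[c].min())   #expected: 0.0 for B, C and D
print (df[c].max())   #expected: 1.0 for B, C and D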

Related

Python dataframe:: get count across two columns for each unique value in either column

I have a Python dataframe with columns 'Expected' and 'Actual' that show a product (A, B, C or D) for each record:
ID  Expected  Actual
1   A         B
2   A         A
3   C         B
4   B         D
5   C         D
6   A         A
7   B         B
8   A         D
I want to get a count from both columns for each unique value found in either column (the two columns don't share all the same products). The result should look like this:
Value  Expected  Actual
A      4         2
B      2         3
C      2         0
D      0         3
Thank you for all your help
You can use apply and value_counts
df = pd.DataFrame({'Expected':['A','A','C','B','C','A','B','A'],'Actual':['B','A','B','D','D','A','B','D']})
df.apply(pd.Series.value_counts).fillna(0)
output:
Expected Actual
A 4.0 2.0
B 2.0 3.0
C 2.0 0.0
D 0.0 3.0
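To match the requested output exactly, with integer counts and Value as a regular column, a possible follow-up (the out name is illustrative):
out = (df.apply(pd.Series.value_counts)
         .fillna(0)
         .astype(int)
         .rename_axis('Value')
         .reset_index())
print(out)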
I would do it the following way:
import pandas as pd
df = pd.DataFrame({'Expected':['A','A','C','B','C','A','B','A'],'Actual':['B','A','B','D','D','A','B','D']})
ecnt = df['Expected'].value_counts()
acnt = df['Actual'].value_counts()
known = sorted(set(df['Expected']).union(df['Actual']))
cntdf = pd.DataFrame({'Value':known,'Expected':[ecnt.get(k,0) for k in known],'Actual':[acnt.get(k,0) for k in known]})
print(cntdf)
output
Value Expected Actual
0 A 4 2
1 B 2 3
2 C 2 0
3 D 0 3
Explanation: the main idea is to keep separate value counts for the Expected column and the Actual column. If you would rather have Value as the index of your pandas.DataFrame you can do:
...
cntdf = pd.DataFrame([acnt,ecnt]).T.fillna(0)
print(cntdf)
output
Actual Expected
D 3.0 0.0
B 3.0 2.0
A 2.0 4.0
C 0.0 2.0
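If you prefer integer counts and the alphabetical order of the expected output, a small addition to the same code (a sketch reusing acnt and ecnt from above):
cntdf = pd.DataFrame([acnt, ecnt]).T.fillna(0).astype(int).sort_index()
print(cntdf)
output
   Actual  Expected
A       2         4
B       3         2
C       0         2
D       3         0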

pandas - groupby a column and get the max length of another string column with nulls

I have a pandas DataFrame like this:
source text_column
0 a abcdefghi
1 a abcde
2 b qwertyiop
3 c plmnkoijb
4 a NaN
5 c abcde
6 b qwertyiop
7 b qazxswedcdcvfr
and I would like to get the maximum length of text_column after grouping by the source column, like below:
source something
a 9
b 14
c 9
Here's what I have tried so far, and all of them raise errors:
>>> # first creating the group by object
>>> text_group = mydf.groupby(by=['source'])
>>> # now try to get the max length of "text_column" by each "source"
>>> text_group['text_column'].map(len).max()
>>> text_group['text_column'].len().max()
>>> text_group['text_column'].str.len().max()
How do I get the max length of text_column when grouping by another column?
And to avoid creating a new question: how do I also get the 2nd biggest length and the respective values (the 1st and 2nd longest strings in text_column)?
The first idea is to use a lambda function with Series.str.len and max:
df = (df.groupby('source')['text_column']
        .agg(lambda x: x.str.len().max())
        .reset_index(name='something'))
print (df)
source something
0 a 9.0
1 b 14.0
2 c 9.0
Or you can first use Series.str.len and then aggregate max:
df = (df['text_column'].str.len()
        .groupby(df['source'])
        .max()
        .reset_index(name='something'))
print (df)
If you need integers, first use DataFrame.dropna:
df = (df.dropna(subset=['text_column'])
        .assign(text_column=lambda x: x['text_column'].str.len())
        .groupby('source', as_index=False)['text_column']
        .max())
print (df)
source text_column
0 a 9
1 b 14
2 c 9
EDIT: For the first and second top values, use DataFrame.sort_values with GroupBy.head:
df1 = (df.dropna(subset=['text_column'])
         .assign(something=lambda x: x['text_column'].str.len())
         .sort_values(['source','something'], ascending=[True, False])
         .groupby('source', as_index=False)
         .head(2))
print (df1)
source text_column something
0 a abcdefghi 9
1 a abcde 5
7 b qazxswedcdcvfr 14
2 b qwertyiop 9
3 c plmnkoijb 9
5 c abcde 5
An alternative solution with SeriesGroupBy.nlargest, which is obviously slower:
df1 = (df.dropna(subset=['text_column'])
         .assign(something=lambda x: x['text_column'].str.len())
         .groupby('source')['something']
         .nlargest(2)
         .reset_index(level=1, drop=True)
         .reset_index())
print (df1)
source something
0 a 9
1 a 5
2 b 14
3 b 9
4 c 9
5 c 5
The last solution creates new columns top1 and top2:
df = df.dropna(subset=['text_column']).assign(something=lambda x: x['text_column'].str.len())
df = df.sort_values(['source','something'], ascending=[True, False])
df['g'] = df.groupby('source').cumcount().add(1)
df = (df[df['g'].le(2)].pivot(index='source', columns='g', values='something')
        .add_prefix('top')
        .rename_axis(index=None, columns=None))
print (df)
top1 top2
a 9 5
b 14 9
c 9 5
Just get the lengths first with assign and str.len:
df.assign(text_column=df['text_column'].str.len()).groupby('source', as_index=False).max()
source text_column
0 a 9.0
1 b 14.0
2 c 9.0
The easiest solution to me looks something like this (tested); you do not actually need a groupby:
df['str_len'] = df.text_column.str.len()
df.sort_values(['str_len'], ascending=False)\
  .drop_duplicates(['source'])\
  .drop(columns='text_column')
source str_len
7 b 14.0
0 a 9.0
3 c 9.0
Regarding your 2nd question, I think a groupby serves you well:
top_x = 2
df.groupby('source', as_index=False)\
  .apply(lambda sourcedf: sourcedf.sort_values('str_len').nlargest(top_x, columns='str_len', keep='all'))\
  .drop(columns='text_column')

Pandas merge dividing a value each time it's merged

I have the following 2 dataframes:
df :
Name
1 A
2 B
3 C
4 C
5 D
6 D
7 D
8 D
and df_value :
Name Value
1 A 50
2 B 100
3 C 200
4 D 800
I want to merge both dataframes (into df), but with the new Value being the df_value Value divided by the number of occurrences of Name in df.
Output :
Name Value
1 A 50
2 B 100
3 C 100
4 C 100
5 D 200
6 D 200
7 D 200
8 D 200
A appears once, has a Value of 50 in df_value, so its value is 50. Same logic for B.
C appears 2 times, has a value of 200 in df_value, so its value is 200 / 2 = 100
D appears 4 times, has a value of 800 in df_value, so its value is 800 / 4 = 200
I'm pretty sure there's a really easy way to do that but I can't find it.
Thanks in advance.
Use Series.map on the Name column with a Series built from df_value, then divide by the mapped values of Series.value_counts:
df['Value'] = (df['Name'].map(df_value.set_index('Name')['Value'])
                 .div(df['Name'].map(df['Name'].value_counts())))
print (df)
Name Value
1 A 50.0
2 B 100.0
3 C 100.0
4 C 100.0
5 D 200.0
6 D 200.0
7 D 200.0
8 D 200.0
Another solution, thanks to @sammywemmy, is mapping by already divided values:
df.assign(Value=df.Name.map(df_value.set_index("Name").Value.div(df.Name.value_counts())))
A solution with merge is also possible; another alternative for the counts uses GroupBy.transform:
df['Value'] = (df.merge(df_value, on='Name', how='left')['Value']
                 .div(df.groupby('Name')['Name'].transform('size')))
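For readability, the same divide-then-merge idea can also be written in explicit steps (a sketch; counts and df_value_div are illustrative names):
#number of occurrences of each Name in df
counts = df['Name'].value_counts()
#divide the reference values before merging
df_value_div = df_value.assign(Value=df_value['Value'].div(df_value['Name'].map(counts)))
df = df.merge(df_value_div, on='Name', how='left')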
If it is important to keep the existing dataframes as they are, and there is no restriction against using 2 lines of code:
df1 = df.merge(df_value, on='Name', how='left')
df1['Value'] = df1.groupby('Name')[['Value']].transform(lambda x: x/len(x))
Otherwise, a one-liner solution that modifies the existing df a bit:
df['Value'] = df.merge(df_value, on='Name', how='left').groupby('Name')[['Value']].transform(lambda x: x/len(x))
Both give the same output, with different variable names:
Name Value
0 A 50.0
1 B 100.0
2 C 100.0
3 C 100.0
4 D 200.0
5 D 200.0
6 D 200.0
7 D 200.0

Find and match elements in a column and change the values of corresponding rows in another column

I have a DataFrame that looks like this:
df = pd.DataFrame({'ID':['A','B','A','C','C'], 'value':[2,4,9,1,3.5]})
df
ID value
0 A 2.0
1 B 4.0
2 A 9.0
3 C 1.0
4 C 3.5
What I need to do is go through the ID column and, for each unique value, multiply the corresponding rows in the value column based on a reference that I have.
For example, if I have the following reference:
if A multiply by 10
if B multiply by 3
if C multiply by 2
Then the desired output would be:
df
ID value
0 A 2.0*10
1 B 4.0*3
2 A 9.0*10
3 C 1.0*2
4 C 3.5*2
Thanks in advance.
Use Series.map with a dictionary to create a Series of multipliers, then multiply the value column by it:
d = {'A':10, 'B':3,'C':2}
df['value'] = df['value'].mul(df['ID'].map(d))
print (df)
ID value
0 A 20.0
1 B 12.0
2 A 90.0
3 C 2.0
4 C 7.0
Detail:
print (df['ID'].map(d))
0 10
1 3
2 10
3 2
4 2
Name: ID, dtype: int64
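Note that Series.map returns NaN for any ID missing from the dictionary. If such rows should keep their original value, one possible extension (a sketch) is to fill the missing multipliers with 1:
#unmapped IDs get multiplier 1, so their value is unchanged
df['value'] = df['value'].mul(df['ID'].map(d).fillna(1))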

Group DataFrame, apply function with inputs then add result back to original

I can't find this question answered anywhere, so I am trying here instead:
What I'm trying to do is basically alter an existing DataFrame object using groupby-functionality, and a self-written function:
benchmark =
x y z field_1
1 1 3 a
1 2 5 b
9 2 4 a
1 2 5 c
4 6 1 c
What I want to do, is to groupby field_1, apply a function using specific columns as input, in this case columns x and y, then add back the result to the original DataFrame benchmark as a new column called new_field. The function itself is dependent on the value in field_1, i.e. field_1=a will yield a different result compared to field_1=b etc. (hence the grouping to start with).
Pseudo-code would be something like:
1. grouped_data = benchmark.groupby(['field_1'])
2. apply own_function to grouped_data; with inputs ('x', 'y', grouped_data)
3. add back result from function to benchmark as column 'new_field'
Thanks,
EDIT:
Elaboration:
I also have a DataFrame separate_data containing separate values for x,
separate_data =
x a b c
1 1 3 7
2 2 5 6
3 2 4 4
4 2 5 9
5 6 1 10
that will need to be interpolated onto the existing benchmark DataFrame. Which column of separate_data should be used for the interpolation depends on the field_1 column in benchmark (i.e. the values a, b, c above). The interpolated value in the new column is based on the x value in benchmark.
Result:
benchmark =
x y z field_1 field_new
1 1 3 a interpolate using separate_data with x=1 and col=a
1 2 5 b interpolate using separate_data with x=1 and col=b
9 2 4 a ... etc
1 2 5 c ...
4 6 1 c ...
Makes sense?
EDIT:
I think you need to reshape separate_data first with set_index + stack, set the index names with rename_axis and set the name of the Series with rename.
Then it is possible to group by both levels and apply some function.
Then join it to benchmark with the default left join:
separate_data1 = separate_data.set_index('x').stack().rename_axis(('x','field_1')).rename('d')
print (separate_data1)
x field_1
1 a 1
b 3
c 7
2 a 2
b 5
c 6
3 a 2
b 4
c 4
4 a 2
b 5
c 9
5 a 6
b 1
c 10
Name: d, dtype: int64
If necessary, apply some function with a groupby over both levels, mainly if there are duplicate (x, field_1) pairs, since it then returns unique pairs:
def func(x):
    #sample function
    return x / 2 + x ** 2
separate_data1 = separate_data1.groupby(level=['x','field_1']).apply(func)
print (separate_data1)
x field_1
1 a 1.5
b 10.5
c 52.5
2 a 5.0
b 27.5
c 39.0
3 a 5.0
b 18.0
c 18.0
4 a 5.0
b 27.5
c 85.5
5 a 39.0
b 1.5
c 105.0
Name: d, dtype: float64
benchmark = benchmark.join(separate_data1, on=['x','field_1'])
print (benchmark)
x y z field_1 d
0 1 1 3 a 1.5
1 1 2 5 b 10.5
2 9 2 4 a NaN
3 1 2 5 c 52.5
4 4 6 1 c 85.5
I think you cannot use transform here, because multiple columns need to be read together.
So use apply:
df1 = benchmark.groupby(['field_1']).apply(func)
Then there are multiple ways to add the new column, e.g. join (default left join) or map.
A sample solution with both methods is here.
Or it is possible to use a flexible apply, which can return a new DataFrame with the new column.
Try something like this:
groups = benchmark.groupby(benchmark["field_1"])
benchmark = benchmark.join(groups.apply(your_function), on="field_1")
In your_function you would create the new column using the other columns that you need, e.g. average them, sum them, etc.
Documentation for apply.
Documentation for join.
Here is a working example:
# Sample function that sums x and y, then appends the field as a string.
def func(x, y, z):
    return (x + y).astype(str) + z

benchmark['new_field'] = benchmark.groupby('field_1')\
    .apply(lambda x: func(x['x'], x['y'], x['field_1']))\
    .reset_index(level=0, drop=True)
Result:
benchmark
Out[139]:
x y z field_1 new_field
0 1 1 3 a 2a
1 1 2 5 b 3b
2 9 2 4 a 11a
3 1 2 5 c 3c
4 4 6 1 c 10c
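For the interpolation described in the question's elaboration, here is a minimal sketch, assuming linear interpolation with numpy.interp, that separate_data['x'] is sorted, and that every field_1 value has a matching column in separate_data:
import numpy as np
import pandas as pd

def interp_group(g):
    # g.name is the field_1 value of this group; interpolate the matching
    # separate_data column at the group's x positions
    return pd.Series(np.interp(g['x'], separate_data['x'], separate_data[g.name]),
                     index=g.index)

benchmark['field_new'] = benchmark.groupby('field_1', group_keys=False).apply(interp_group)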
