I have a dataframe that contains some NaN values in column t. The values in t belong to a certain id and should be the same for every row with that id:
df = pd.DataFrame({"t" : [4, 4, 1, 1, float('nan'), 2, 2, 2, float('nan'), 10],
"id": [1, 1, 2, 2, 3, 3, 3 , 3, 4, 4]})
Therefore, I would like to overwrite each NaN in t with the non-NaN value of t for the respective id and ultimately end up with:
df = pd.DataFrame({"t" : [4, 4, 1, 1, 2, 2, 2, 2, 10, 10],
"id": [1, 1, 2, 2, 3, 3, 3 , 3, 4, 4]})
New strategy: create a map by dropping the NaN rows, then reassign the missing values using loc and a mask.
import pandas as pd
df = pd.DataFrame({"t" : [4, 4, 1, 1, float('nan'), 2, 2, 2, float('nan'), 10],
"id": [1, 1, 2, 2, 3, 3, 3 , 3, 4, 4]})
# create mask
m = pd.isna(df['t'])
# create map
#d = df[~m].set_index('id')['t'].drop_duplicates()
d = df[~m].set_index('id')['t'].to_dict()
# assign map to the slice of the dataframe containing nan
df.loc[m,'t'] = df.loc[m,'id'].map(d)
print(df)
which returns:
id t
0 1 4.0
1 1 4.0
2 2 1.0
3 2 1.0
4 3 2.0
5 3 2.0
6 3 2.0
7 3 2.0
8 4 10.0
9 4 10.0
Use sort_values with groupby and transform('first') to overwrite the same column:
df['t'] = df.sort_values(['id','t']).groupby('id')['t'].transform('first')
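A similar one-liner (just a sketch, not from the original answer) fills only the missing values with each group's first non-null t, so no sort is needed:
# using the df from the question: fill NaNs in t with the first non-null t of the same id
df['t'] = df['t'].fillna(df.groupby('id')['t'].transform('first'))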
An alternative solution is to map by a Series created with dropna and drop_duplicates:
df['t'] = df['id'].map(df.dropna(subset=['t']).drop_duplicates('id').set_index('id')['t'])
print(df)
id t
0 1 4.0
1 1 4.0
2 2 1.0
3 2 1.0
4 3 2.0
5 3 2.0
6 3 2.0
7 3 2.0
8 4 10.0
9 4 10.0
I want to group a single-column DataFrame and count the number of elements in each group. However, I always end up with an empty DataFrame:
pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5]}).groupby("A").count()
Out[46]:
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5]
If I add a second column, I get the desired result:
pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5], "B":[1, 2, 3, 4, 5, 5, 5]}).groupby("A").count()
Out[45]:
B
A
1 1
2 1
3 1
4 1
5 3
Can you explain the reason for this?
Give this a shot:
import pandas as pd
print(pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5]}).groupby("A")["A"].count())
which prints:
A
1 1
2 1
3 1
4 1
5 3
You have to select the grouped-by column explicitly: after groupby("A"), column A becomes the group index, so there are no data columns left to count and the result is empty. Selecting A again counts it per group:
import pandas as pd
pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5]}).groupby("A").A.count()
Output:
A
1 1
2 1
3 1
4 1
5 3
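For what it's worth (not part of either answer above), GroupBy.size or Series.value_counts give the same counts without re-selecting the column:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 5, 5]})
# size() counts rows per group, so it works even when no other columns exist
print(df.groupby("A").size())
# value_counts() computes the counts directly on the column
print(df["A"].value_counts(sort=False))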
I have a dataframe as follows:
df = pd.DataFrame({'A': [1, 3, 3, 4, 5, 3, 3],
                   'B': [0, 2, 3, 4, 5, 6, 7],
                   'C': [7, 2, 2, 5, 7, 2, 2]})
I would like to drop the duplicated values from columns A and C, but only where they are consecutive.
If I use
df.drop_duplicates(subset=['A','C'], keep='first')
it will drop rows 2, 5 and 6. However, I only want to drop rows 2 and 6. The desired result is:
df = pd.DataFrame({'A': [1, 3, 4, 5, 3],
                   'B': [0, 2, 4, 5, 6],
                   'C': [7, 2, 5, 7, 2]})
Here's how you can do this, using shift:
df.loc[(df[["A", "C"]].shift() != df[["A", "C"]]).any(axis=1)].reset_index(drop=True)
Output:
A B C
0 1 0 7
1 3 2 2
2 4 4 5
3 5 5 7
4 3 6 2
This question is a nice reference.
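To see why the shift comparison keeps exactly those rows, here is the intermediate mask for the same df (a quick illustration, not part of the original answer):
mask = (df[["A", "C"]].shift() != df[["A", "C"]]).any(axis=1)
print(mask.tolist())
# [True, True, False, True, True, True, False] -> rows 2 and 6 are dropped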
You can just keep every other occurrence of each A, C pair:
df = df.loc[df.groupby(["A", "C"]).cumcount() % 2 == 0]
Outputs:
A B C
0 1 0 7
1 3 2 2
3 4 4 5
4 5 5 7
5 3 6 2
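For reference (computed on the original df from the question), cumcount numbers each row's occurrence within its (A, C) group, so the rows kept are those with an even occurrence number:
print(df.groupby(["A", "C"]).cumcount().tolist())
# [0, 0, 1, 0, 0, 2, 3] -> rows 2 and 6 have odd counts and are dropped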
I wonder if there is a nicer way in pandas to achieve the same:
import numpy as np
import pandas as pd

x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2]
x = np.asarray(x)
df = pd.DataFrame(columns=['id', 'start', 'end'])

if len(x) > 1:
    i = 0
    for j in range(1, len(x)):
        if x[j] == x[j-1]:
            continue
        else:
            df.loc[len(df)] = [x[i], i, j-1]
            i = j
    df.loc[len(df)] = [x[i], i, j]
else:
    df.loc[len(df)] = [x[0], 0, 0]
The output looks like this:
[1 1 1 2 2 2 3 3 3 5 5 1 1 2 2]
id start end
0 1 0 2
1 2 3 5
2 3 6 8
3 5 9 10
4 1 11 12
5 2 13 14
Thanks for any helpful hints.
Here's a way you could do it using numpy:
x = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2])
# Value at the start of each run (each element that differs from the previous one)
vals = x[x != np.roll(x, 1)]
# array([1, 2, 3, 5, 1, 2])
# Indices where changes in x occur
d = np.flatnonzero(np.diff(x) != 0)
# array([ 2, 5, 8, 10, 12])
start = np.hstack([[0], d + 1])
# array([ 0, 3, 6, 9, 11, 13])
end = np.hstack([d, len(x) - 1])
# array([ 2, 5, 8, 10, 12, 14])
pd.DataFrame({'id':vals, 'start':start, 'end':end})
id start end
0 1 0 2
1 2 3 5
2 3 6 8
3 5 9 10
4 1 11 12
5 2 13 14
Another solution:
df = pd.DataFrame(data=[1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2], columns=['id'])
# label each run of consecutive equal ids with a counter and group on it
g = df.groupby((df.id != df.id.shift()).cumsum())['id']
# first() gives the run's value; duplicated(keep='last').idxmax()/idxmin() give its first/last index
df_new = pd.concat([g.first(),
                    g.apply(lambda x: x.duplicated(keep='last').idxmax()),
                    g.apply(lambda x: x.duplicated(keep='last').idxmin())], axis=1)
df_new.columns = ['id', 'start', 'end']
print(df_new)
id start end
id
1 1 0 2
2 2 3 5
3 3 6 8
4 5 9 10
5 1 11 12
6 2 13 14
You could do the following, using only pandas:
import numpy as np
import pandas as pd
x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2]
s = pd.Series(x)
# store group-by to avoid repetition
groups = s.groupby((s != s.shift()).cumsum())
# get id and size for each group
ids, size = groups.first(), groups.size()
# get start
start = size.cumsum().shift().fillna(0).astype(np.int32)
# get end
end = (start + size - 1)
df = pd.DataFrame({'id': ids, 'start': start, 'end': end}, columns=['id', 'start', 'end'])
print(df)
Output
id start end
1 1 0 2
2 2 3 5
3 3 6 8
4 5 9 10
5 1 11 12
6 2 13 14
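If you want the 0-based index from the question's output instead of the group labels, you can add (a small follow-up, not in the original answer):
df = df.reset_index(drop=True)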
Using itertools.groupby:
import pandas as pd
from itertools import groupby

x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2]
l = []
# group consecutive equal values together with their positions
for i in [list(g) for _, g in groupby(enumerate(x), lambda x: x[1])]:
    l.append((i[0][1], i[0][0], i[-1][0]))
print(pd.DataFrame(l, columns=['id', 'start', 'end']))
Output:
id start end
0 1 0 2
1 2 3 5
2 3 6 8
3 5 9 10
4 1 11 12
5 2 13 14
I have to duplicate rows that have a certain value in a column and replace the value with another value.
For instance, I have this data:
import pandas as pd
df = pd.DataFrame({'Date': [1, 2, 3, 4], 'B': [1, 2, 3, 2], 'C': ['A','B','C','D']})
Now, I want to duplicate the rows that have 2 in column 'B', then change the 2 to 4:
df = pd.DataFrame({'Date': [1, 2, 2, 3, 4, 4], 'B': [1, 2, 4, 3, 2, 4], 'C': ['A','B','B','C','D','D']})
Please help me on this one. Thank you.
You can use append to add the rows where B == 2, which you can extract with a boolean mask (df.B.eq(2)), reassigning B to 4 with assign. If order matters, you can then sort by C (to reproduce your desired frame):
>>> df.append(df[df.B.eq(2)].assign(B=4)).sort_values('C')
B C Date
0 1 A 1
1 2 B 2
1 4 B 2
2 3 C 3
3 2 D 4
3 4 D 4
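Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on a recent pandas the same idea can be written with pd.concat (a sketch of the equivalent call, not part of the original answer):
import pandas as pd

df = pd.DataFrame({'Date': [1, 2, 3, 4], 'B': [1, 2, 3, 2], 'C': ['A', 'B', 'C', 'D']})
# duplicate the B == 2 rows with B set to 4, then restore the desired order
out = pd.concat([df, df[df.B.eq(2)].assign(B=4)]).sort_values('C')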
I am grouping data in a pandas dataframe and using some aggregation functions to generate results data. Input data:
A B C D E F
0 aa 5 3 2 2 2
1 aa 3 2 2 3 3
2 ac 2 0 2 7 7
3 ac 9 2 3 8 8
4 ac 2 3 7 3 3
5 ad 0 0 0 1 1
6 ad 9 9 9 9 9
7 ad 6 6 6 6 6
8 ad 3 3 3 3 3
The pandas groupby object seems to only operate on one column at a time, but I want to generate the statistics on all columns in my df. For example, I can use grouped['C'].agg([np.mean, len]) to generate the statistics on column 'C', but what if I want to generate these statistics on all columns B - F?
The output from this is:
A count_C mean_C
0 aa 2 2.500000
1 ac 3 1.666667
2 ad 4 4.500000
But what I want is:
A count_B mean_B count_C mean_C count_D mean_D etc...
0 aa 2 4.000000 2 2.500000 2 2.0 etc...
1 ac 3 4.333333 3 1.666667 3 4.0
2 ad 4 4.500000 4 4.500000 4 4.5
Is there any easy way to do the group by with aggregation in a single command? If not, is there an easy way to iterate over all columns and merge in new aggregation statistics results for each column?
Here's my full code so far:
import pandas as pd
import numpy as np
import pprint as pp
test_dataframe = pd.DataFrame({
    'A': ['aa', 'aa', 'ac', 'ac', 'ac', 'ad', 'ad', 'ad', 'ad'],
    'B': [5, 3, 2, 9, 2, 0, 9, 6, 3],
    'C': [3, 2, 0, 2, 3, 0, 9, 6, 3],
    'D': [2, 2, 2, 3, 7, 0, 9, 6, 3],
    'E': [2, 3, 7, 8, 3, 1, 9, 6, 3],
    'F': [2, 3, 7, 8, 3, 1, 9, 6, 3]
})
#group, aggregate, convert object to df, sort index
grouped = test_dataframe.groupby(['A'])
grouped_stats = grouped['C'].agg([np.mean, len])
grouped_stats = pd.DataFrame(grouped_stats).reset_index()
grouped_stats.rename(columns = {'mean':'mean_C', 'len':'count_C'}, inplace=True)
grouped_stats.sort_index(axis=1, inplace=True)
print "Input: "
pp.pprint(test_dataframe)
print "Output: "
pp.pprint(grouped_stats)
You don't have to call grouped['B'], grouped['C'], and so on one by one; simply call .agg on the entire groupby object and pandas will apply the aggregation functions to all columns.
import pandas as pd
test_dataframe = pd.DataFrame({
    'A': ['aa', 'aa', 'ac', 'ac', 'ac', 'ad', 'ad', 'ad', 'ad'],
    'B': [5, 3, 2, 9, 2, 0, 9, 6, 3],
    'C': [3, 2, 0, 2, 3, 0, 9, 6, 3],
    'D': [2, 2, 2, 3, 7, 0, 9, 6, 3],
    'E': [2, 3, 7, 8, 3, 1, 9, 6, 3],
    'F': [2, 3, 7, 8, 3, 1, 9, 6, 3]
})
agg_funcs = ['count', 'mean']
test_dataframe = test_dataframe.groupby(['A']).agg(agg_funcs)
columns = 'B C D E F'.split()
names = [y + '_' + x for x in columns for y in agg_funcs]
test_dataframe.columns = names
Out[89]:
count_B mean_B count_C mean_C count_D mean_D count_E mean_E count_F mean_F
A
aa 2 4.0000 2 2.5000 2 2.0 2 2.50 2 2.50
ac 3 4.3333 3 1.6667 3 4.0 3 6.00 3 6.00
ad 4 4.5000 4 4.5000 4 4.5 4 4.75 4 4.75
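If you prefer not to hard-code the column list, the flat names can also be built from the MultiIndex that .agg returns (a sketch along the same lines, not from the original answer):
import pandas as pd

test_dataframe = pd.DataFrame({
    'A': ['aa', 'aa', 'ac', 'ac', 'ac', 'ad', 'ad', 'ad', 'ad'],
    'B': [5, 3, 2, 9, 2, 0, 9, 6, 3],
    'C': [3, 2, 0, 2, 3, 0, 9, 6, 3],
    'D': [2, 2, 2, 3, 7, 0, 9, 6, 3],
    'E': [2, 3, 7, 8, 3, 1, 9, 6, 3],
    'F': [2, 3, 7, 8, 3, 1, 9, 6, 3]
})

out = test_dataframe.groupby('A').agg(['count', 'mean'])
# flatten the (column, statistic) MultiIndex into names like 'count_B', 'mean_B'
out.columns = ['{}_{}'.format(stat, col) for col, stat in out.columns]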