Overwrite NaNs in a column based on identifier - python

I have a dataframe that contains some NaN values in a column t. Each value in t belongs to a certain id and should be the same for every row with that id:
df = pd.DataFrame({"t" : [4, 4, 1, 1, float('nan'), 2, 2, 2, float('nan'), 10],
"id": [1, 1, 2, 2, 3, 3, 3 , 3, 4, 4]})
Therefore, I would like to overwrite each NaN in t with the non-NaN value of t for the respective id and ultimately end up with
df = pd.DataFrame({"t" : [4, 4, 1, 1, 2, 2, 2, 2, 10, 10],
"id": [1, 1, 2, 2, 3, 3, 3 , 3, 4, 4]})

New strategy... create a mapping from id to t by dropping the NaNs, then reassign using loc and a mask.
import pandas as pd
df = pd.DataFrame({"t" : [4, 4, 1, 1, float('nan'), 2, 2, 2, float('nan'), 10],
"id": [1, 1, 2, 2, 3, 3, 3 , 3, 4, 4]})
# create mask
m = pd.isna(df['t'])
# create map
#d = df[~m].set_index('id')['t'].drop_duplicates()
d = df[~m].set_index('id')['t'].to_dict()
# assign map to the slice of the dataframe containing nan
df.loc[m,'t'] = df.loc[m,'id'].map(d)
print(df)
df returns:
id t
0 1 4.0
1 1 4.0
2 2 1.0
3 2 1.0
4 3 2.0
5 3 2.0
6 3 2.0
7 3 2.0
8 4 10.0
9 4 10.0

Use sort_values with groupby and transform taking 'first'; sorting by t pushes the NaNs to the end of each id group, so the first value per group is a non-NaN one:
df['t'] = df.sort_values(['id','t']).groupby('id')['t'].transform('first')
An alternative solution is to map id to a Series built with dropna and drop_duplicates:
df['t'] = df['id'].map(df.dropna(subset=['t']).drop_duplicates('id').set_index('id')['t'])
print (df)
id t
0 1 4.0
1 1 4.0
2 2 1.0
3 2 1.0
4 3 2.0
5 3 2.0
6 3 2.0
7 3 2.0
8 4 10.0
9 4 10.0
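A closely related one-liner, offered here only as a hedged sketch rather than one of the original answers: assuming every id has at most one distinct non-NaN t, you can forward- and back-fill within each group:

import pandas as pd

df = pd.DataFrame({"t" : [4, 4, 1, 1, float('nan'), 2, 2, 2, float('nan'), 10],
                   "id": [1, 1, 2, 2, 3, 3, 3, 3, 4, 4]})

# fill each id group from its own non-NaN values, in both directions
df['t'] = df.groupby('id')['t'].transform(lambda s: s.ffill().bfill())
print(df)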


Use of index in pandas DataFrame for groupby and aggregation

I want to aggregate a single column DataFrame and count the number of elements. However, I always end up with an empty DataFrame:
pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5]}).groupby("A").count()
Out[46]:
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5]
If I add a second column, I get the desired result:
pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5], "B":[1, 2, 3, 4, 5, 5, 5]}).groupby("A").count()
Out[45]:
B
A
1 1
2 1
3 1
4 1
5 3
Can you explain the reason for this?
Give this a shot:
import pandas as pd
print(pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5]}).groupby("A")["A"].count())
prints
A
1 1
2 1
3 1
4 1
5 3
The reason you get an empty DataFrame is that grouping the single-column frame by A turns A into the group key (the index), leaving no columns to count. You have to select the grouped column again in your result:
import pandas as pd
pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5]}).groupby("A").A.count()
Output:
A
1 1
2 1
3 1
4 1
5 3
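If you only need the counts per value, a hedged alternative (not from the original answers): size counts rows per group regardless of which columns remain, and value_counts works directly on the column:

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 5, 5]})

# size() counts the rows in each group, so it works even with no other columns
print(df.groupby("A").size())

# value_counts() gives the same counts directly from the column
print(df["A"].value_counts(sort=False))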

Pandas drop duplicated values partially

I have a dataframe as
df=pd.DataFrame({'A':[1, 3, 3, 4, 5, 3, 3],
'B':[0, 2, 3, 4, 5, 6, 7],
'C':[7, 2, 2, 5, 7, 2, 2]})
I would like to drop the duplicated values from columns A and C. However, I only want to drop a row when it repeats the A and C values of the row directly above it (consecutive duplicates).
If I use
df.drop_duplicates(subset=['A','C'], keep='first')
It will drop rows 2, 5 and 6. However, I only want to drop rows 2 and 6. The desired result is:
df=pd.DataFrame({'A':[1, 3, 4, 5, 3],
'B':[0, 2, 4, 5, 6],
'C':[7, 2, 5, 7, 2]})
Here's how you can do this using shift: a row is kept only when its A or C value differs from the row directly above it, so only consecutive duplicates are dropped.
df.loc[(df[["A", "C"]].shift() != df[["A", "C"]]).any(axis=1)].reset_index(drop=True)
Output:
A B C
0 1 0 7
1 3 2 2
2 4 4 5
3 5 5 7
4 3 6 2
This question is a nice reference.
You can keep every other occurrence of each A, C pair (the 1st, 3rd, 5th, ... occurrence within that pair's group):
df=df.loc[df.groupby(["A", "C"]).cumcount()%2==0]
Outputs:
A B C
0 1 0 7
1 3 2 2
3 4 4 5
4 5 5 7
5 3 6 2
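For reference, a minimal sketch (same idea as the shift answer, not part of the original posts) that builds the mask explicitly and shows which row labels get dropped:

import pandas as pd

df = pd.DataFrame({'A': [1, 3, 3, 4, 5, 3, 3],
                   'B': [0, 2, 3, 4, 5, 6, 7],
                   'C': [7, 2, 2, 5, 7, 2, 2]})

# a row is a consecutive duplicate when both A and C match the row above
dup_mask = (df[['A', 'C']] == df[['A', 'C']].shift()).all(axis=1)
print(df.index[dup_mask].tolist())           # [2, 6] -> exactly the rows to drop
print(df[~dup_mask].reset_index(drop=True))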

Group array elements according to occurrence, keep order, get first and last indices

I wonder if there is a nicer way in pandas to achieve the same:
import numpy as np
import pandas as pd

x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2]
x = np.asarray(x)
df = pd.DataFrame(columns=['id', 'start', 'end'])
if len(x) > 1:
    i = 0
    for j in range(1, len(x)):
        if x[j] == x[j-1]:
            continue
        else:
            df.loc[len(df)] = [x[i], i, j-1]
            i = j
    df.loc[len(df)] = [x[i], i, j]
else:
    df.loc[len(df)] = [x[0], 0, 0]
The output looks like this
[1 1 1 2 2 2 3 3 3 5 5 1 1 2 2]
id start end
0 1 0 2
1 2 3 5
2 3 6 8
3 5 9 10
4 1 11 12
5 2 13 14
Thanks for helpful hints.
Here's a way you could do it using numpy:
x = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2])
# values that start a new run: compare each element with the previous one (np.roll(x, 1) shifts x right by one)
vals = x[x != np.roll(x, 1)]
# array([1, 2, 3, 5, 1, 2])
# Indices where changes in x occur
d = np.flatnonzero(np.diff(x) != 0)
# array([ 2, 5, 8, 10, 12])
start = np.hstack([0] + [d+1])
# array([ 0, 3, 6, 9, 11, 13])
end = np.hstack([d, len(x)-1])
# array([ 2, 5, 8, 10, 12, 14])
pd.DataFrame({'id':vals, 'start':start, 'end':end})
id start end
0 1 0 2
1 2 3 5
2 3 6 8
3 5 9 10
4 1 11 12
5 2 13 14
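One caveat, added here as an aside rather than part of the original answer: np.roll(x, 1) compares x[0] with x[-1], so if the array happened to start and end with the same value the first run would be missed and vals would be shorter than start and end. A slightly more defensive variant for vals:

import numpy as np

x = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2])
# position 0 always starts a run; every later run start differs from its predecessor
run_starts = np.concatenate(([True], x[1:] != x[:-1]))
vals = x[run_starts]   # array([1, 2, 3, 5, 1, 2])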
Another solution:
df= pd.DataFrame(data=[1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2],columns=['id'])
g=df.groupby((df.id!=df.id.shift()).cumsum())['id']
df_new=pd.concat([g.first(),g.apply(lambda x: x.duplicated(keep='last').idxmax()),\
g.apply(lambda x: x.duplicated(keep='last').idxmin())],axis=1)
df_new.columns=['id','start','end']
print(df_new)
id start end
id
1 1 0 2
2 2 3 5
3 3 6 8
4 5 9 10
5 1 11 12
6 2 13 14
You could do the following, using only pandas:
import numpy as np
import pandas as pd
x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2]
s = pd.Series(x)
# store group-by to avoid repetition
groups = s.groupby((s != s.shift()).cumsum())
# get id and size for each group
ids, size = groups.first(), groups.size()
# get start
start = size.cumsum().shift().fillna(0).astype(np.int32)
# get end
end = (start + size - 1)
df = pd.DataFrame({'id': ids, 'start': start, 'end': end}, columns=['id', 'start', 'end'])
print(df)
Output
id start end
1 1 0 2
2 2 3 5
3 3 6 8
4 5 9 10
5 1 11 12
6 2 13 14
Using itertools.groupby:
import pandas as pd
from itertools import groupby
x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 5, 5, 1, 1, 2, 2]
l = []
for i in [list(g) for _, g in groupby(enumerate(x), lambda x: x[1])]:
    l.append((i[0][1], i[0][0], i[-1][0]))
print (pd.DataFrame(l, columns=['id','start','end']))
Output:
id start end
0 1 0 2
1 2 3 5
2 3 6 8
3 5 9 10
4 1 11 12
5 2 13 14

Duplicating rows with certain value in a column

I have to duplicate rows that have a certain value in a column and replace the value with another value.
For instance, I have this data:
import pandas as pd
df = pd.DataFrame({'Date': [1, 2, 3, 4], 'B': [1, 2, 3, 2], 'C': ['A','B','C','D']})
Now, I want to duplicate the rows that have 2 in column 'B', then change the 2 to 4:
df = pd.DataFrame({'Date': [1, 2, 2, 3, 4, 4], 'B': [1, 2, 4, 3, 2, 4], 'C': ['A','B','B','C','D','D']})
Please help me on this one. Thank you.
You can use append to append the rows where B == 2, which you extract with boolean indexing, reassigning B to 4 using assign. If order matters, you can then sort by C to reproduce your desired frame:
>>> df.append(df[df.B.eq(2)].assign(B=4)).sort_values('C')
B C Date
0 1 A 1
1 2 B 2
1 4 B 2
2 3 C 3
3 2 D 4
3 4 D 4
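Note that DataFrame.append has since been removed from pandas (2.0+); a sketch of the same idea with pd.concat, assuming the df from the question:

import pandas as pd

df = pd.DataFrame({'Date': [1, 2, 3, 4], 'B': [1, 2, 3, 2], 'C': ['A', 'B', 'C', 'D']})

# copy the rows where B == 2, set B to 4 on the copies, then restore the order via C
dup = df[df['B'].eq(2)].assign(B=4)
out = pd.concat([df, dup]).sort_values('C')
print(out)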

pandas grouped with aggregation stats across all dataframe columns

I am grouping data in a pandas dataframe and using some aggregation functions to generate results data. Input data:
A B C D E F
0 aa 5 3 2 2 2
1 aa 3 2 2 3 3
2 ac 2 0 2 7 7
3 ac 9 2 3 8 8
4 ac 2 3 7 3 3
5 ad 0 0 0 1 1
6 ad 9 9 9 9 9
7 ad 6 6 6 6 6
8 ad 3 3 3 3 3
The pandas groupby object seems to only operate on one column at a time, but I want to generate the statistics on all columns in my df. For example, I can use grouped['C'].agg([np.mean, len]) to generate the statistics for column 'C', but what if I want these statistics for all of columns B - F?
The output from this is:
A count_C mean_C
0 aa 2 2.500000
1 ac 3 1.666667
2 ad 4 4.500000
But what I want is:
A count_B mean_B count_C mean_C count_D mean_D etc...
0 aa 2 4.000000 2 2.500000 2 2.0 etc...
1 ac 3 4.333333 3 1.666667 3 4.0
2 ad 4 4.500000 4 4.500000 4 4.5
Is there any easy way to do the group by with aggregation in a single command? If not, is there an easy way to iterate over all columns and merge in new aggregation statistics results for each column?
Here's my full code so far:
import pandas as pd
import numpy as np
import pprint as pp
test_dataframe = pd.DataFrame({
'A' : ['aa', 'aa', 'ac', 'ac', 'ac', 'ad', 'ad', 'ad', 'ad'],
'B' : [5, 3, 2, 9, 2, 0, 9, 6, 3],
'C' : [3, 2, 0, 2, 3, 0, 9, 6, 3],
'D' : [2, 2, 2, 3, 7, 0, 9, 6, 3],
'E' : [2, 3, 7, 8, 3, 1, 9, 6, 3],
'F' : [2, 3, 7, 8, 3, 1, 9, 6, 3]
})
#group, aggregate, convert object to df, sort index
grouped = test_dataframe.groupby(['A'])
grouped_stats = grouped['C'].agg([np.mean, len])
grouped_stats = pd.DataFrame(grouped_stats).reset_index()
grouped_stats.rename(columns = {'mean':'mean_C', 'len':'count_C'}, inplace=True)
grouped_stats.sort_index(axis=1, inplace=True)
print("Input: ")
pp.pprint(test_dataframe)
print("Output: ")
pp.pprint(grouped_stats)
You don't have to call grouped['B'], grouped['C'], and so on one by one; simply call agg on the entire groupby object and pandas will apply the aggregate functions to all columns.
import pandas as pd
test_dataframe = pd.DataFrame({
'A' : ['aa', 'aa', 'ac', 'ac', 'ac', 'ad', 'ad', 'ad', 'ad'],
'B' : [5, 3, 2, 9, 2, 0, 9, 6, 3],
'C' : [3, 2, 0, 2, 3, 0, 9, 6, 3],
'D' : [2, 2, 2, 3, 7, 0, 9, 6, 3],
'E' : [2, 3, 7, 8, 3, 1, 9, 6, 3],
'F' : [2, 3, 7, 8, 3, 1, 9, 6, 3]
})
agg_funcs = ['count', 'mean']
test_dataframe = test_dataframe.groupby(['A']).agg(agg_funcs)
columns = 'B C D E F'.split()
names = [y + '_' + x for x in columns for y in agg_funcs]
test_dataframe.columns = names
Out[89]:
count_B mean_B count_C mean_C count_D mean_D count_E mean_E count_F mean_F
A
aa 2 4.0000 2 2.5000 2 2.0 2 2.50 2 2.50
ac 3 4.3333 3 1.6667 3 4.0 3 6.00 3 6.00
ad 4 4.5000 4 4.5000 4 4.5 4 4.75 4 4.75
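As an aside (not part of the original answer), named aggregation in pandas 0.25+ can build the flat column names directly instead of renaming a MultiIndex afterwards; a minimal sketch with the same data:

import pandas as pd

test_dataframe = pd.DataFrame({
    'A' : ['aa', 'aa', 'ac', 'ac', 'ac', 'ad', 'ad', 'ad', 'ad'],
    'B' : [5, 3, 2, 9, 2, 0, 9, 6, 3],
    'C' : [3, 2, 0, 2, 3, 0, 9, 6, 3],
    'D' : [2, 2, 2, 3, 7, 0, 9, 6, 3],
    'E' : [2, 3, 7, 8, 3, 1, 9, 6, 3],
    'F' : [2, 3, 7, 8, 3, 1, 9, 6, 3]
})

# one (column, function) pair per output column: count_B, mean_B, count_C, ...
agg_spec = {f'{func}_{col}': (col, func)
            for col in ['B', 'C', 'D', 'E', 'F']
            for func in ['count', 'mean']}
out = test_dataframe.groupby('A').agg(**agg_spec).reset_index()
print(out)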
