ffill not retaining all columns - python

I have a df like this:
Key Class
1 Green
1 NaN
1 NaN
2 Red
2 NaN
2 NaN
and I want to forward fill Class. I'm using this code:
df=df.Class.fillna(method='ffill')
and this returns:
Green
Green
Green
Red
Red
Red
how can I retain the Key column while doing this?

df['Class'] = df.Class.fillna(method='ffill')
In your code you're assigning the result back to the whole dataframe; instead, assign only to the Class column, as above.
or another way is to do the following
In [126]:
df.ffill()
Out[126]:
Key Class
0 1 Green
1 1 Green
2 1 Green
3 2 Red
4 2 Red
5 2 Red
You can also set the inplace parameter to True if you don't want to create a new copy of your df:
df.ffill(inplace=True)
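In recent pandas releases the method= argument of fillna is deprecated in favour of the dedicated ffill() method, so the same fix can be written with that API. A minimal sketch, rebuilding the example frame from the question:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Key': [1, 1, 1, 2, 2, 2],
                   'Class': ['Green', np.nan, np.nan, 'Red', np.nan, np.nan]})
df['Class'] = df['Class'].ffill()  # only Class is overwritten, Key stays untouched
print(df)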

Related

Remove duplicates in the pandas dataframe with respect to columns in python using dynamic columns list? [duplicate]

I'm aiming to drop duplicate values in a df, specifically where the values in two separate columns are both the same. Using the frame below, I want to drop rows where Value and Item are duplicated, but among the duplicates keep the row where df['Group1'] == df['Group'].
Note: df = df.drop_duplicates(['Value', 'Item']) will not always be ideal, as it depends on the row ordering within Group. For instance, for the duplicates in Item 80.0 and 260.0 the first row should be kept, but for Item 300.0 the second row should be kept. I don't want to sort values here either, as the group strings could change; for example the groups could be Blue and Green, which would alter the intended output.
df = pd.DataFrame({
'Value' : ['X','X','Y','Z','D','D','E','E','X'],
'Item' : [80.0,80.0,200.0,210.0,260.0,260.0,300.0,300.0,310.0],
'Group' : ['Red','Green','Red','Green','Red','Green','Green','Red','Green'],
'Group1' : ['Red','Red','Red','Red','Red','Red','Red','Red','Red'],
'Group2' : ['Green','Green','Green','Green','Green','Green','Green','Green','Green'],
})
df = df[df['Group1'] == df['Group']].drop_duplicates(subset = ['Item','Value'])
If I perform df = df.drop_duplicates(['Value', 'Item']), the output is:
Value Item Group Group1 Group2
0 X 80.0 Red Red Green
2 Y 200.0 Red Red Green
3 Z 210.0 Green Red Green
4 D 260.0 Red Red Green
6 E 300.0 Green Red Green # incorrect
8 X 310.0 Green Red Green
intended output:
Value Item Group Group1 Group2
0 X 80.0 Red Red Green
1 Y 200.0 Red Red Green
2 Z 210.0 Green Red Green
3 D 260.0 Red Red Green
4 E 300.0 Red Red Green
5 X 310.0 Green Red Green
df1 = df.drop_duplicates(subset = ['Item','Value'])
df2 = df[df['Group'] == df['Group1']]
Dataframe df1 drops duplicate rows on the columns Item and Value.
Dataframe df2 keeps the rows where column Group equals column Group1, i.e. the rows you said you want to keep.
One thing left to do is to replace the values in df1 with the values from df2 wherever both their Item and Value columns match.
pandas.DataFrame.update() can modify in place using non-NA values from another DataFrame. You can use it like:
df1.set_index(['Value', 'Item'], inplace=True)
df1.update(df2.set_index(['Value', 'Item']))
df1.reset_index(inplace=True) # to recover the initial structure
# print(df1)
Value Item Group Group1 Group2
0 X 80.0 Red Red Green
1 Y 200.0 Red Red Green
2 Z 210.0 Green Red Green
3 D 260.0 Red Red Green
4 E 300.0 Red Red Green
5 X 310.0 Green Red Green
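For reference, a self-contained sketch that stitches the steps above together (the frame construction simply mirrors the question; this is a convenience re-packaging, not a different method):
import pandas as pd
df = pd.DataFrame({
    'Value': ['X','X','Y','Z','D','D','E','E','X'],
    'Item': [80.0,80.0,200.0,210.0,260.0,260.0,300.0,300.0,310.0],
    'Group': ['Red','Green','Red','Green','Red','Green','Green','Red','Green'],
    'Group1': ['Red']*9,
    'Group2': ['Green']*9,
})
df1 = df.drop_duplicates(subset=['Item', 'Value'])  # one row per (Item, Value) pair
df2 = df[df['Group'] == df['Group1']]               # rows where Group matches Group1
df1 = df1.set_index(['Value', 'Item'])
df1.update(df2.set_index(['Value', 'Item']))        # align on (Value, Item) and overwrite matches
df1 = df1.reset_index()
print(df1)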
Besides update, you can use the index of the dataframe df2 to slice df1 and then assign.
df1.set_index(['Value', 'Item'], inplace=True)
df2.set_index(['Value', 'Item'], inplace=True)
df1.loc[df2.index] = df2
df1.reset_index(inplace=True)
Another option is to concatenate the rows where Group equals Group1 ahead of the remaining rows, so that drop_duplicates (which keeps the first occurrence) prefers the matching row, and then restore the original order with sort_index:
df = pd.concat([df[df.Group == df.Group1], df[df.Group != df.Group1]]).drop_duplicates(subset=['Item', 'Value']).sort_index()
Output
Value Item Group Group1 Group2
0 X 80.0 Red Red Green
2 Y 200.0 Red Red Green
3 Z 210.0 Green Red Green
4 D 260.0 Red Red Green
7 E 300.0 Red Red Green
8 X 310.0 Green Red Green

How to replace empty value with value from another row

I am dealing with a dataframe that has some empty cells. For example:
df = pd.DataFrame(data={'id': [1, 2, 3, 1, 2, 3], 'id2': [1,1,1,2,2,2], 'color': ["red", "", "green", "yellow", "", "blue"], 'rate':["good","","good","average","","good"]})
id id2 color rate
0 1 1 red good
1 2 1
2 3 1 green good
3 1 2 yellow average
4 2 2
5 3 2 blue good
For both the columns "color" and "rate", I would like to replace the empty rows with values from another row where id is 1. Therefore, I would like the dataframe to look like this in the end:
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
I prefer not to simply fill the empty cells from the previous row; instead, I would like to indicate an id and replace the empty cells with the values from the row that has that specific id.
If I understand correctly, you can groupby, transform with first, and finally assign the result to the empty rows:
df.loc[df["color"].eq(""), ["color", "rate"]] = df.groupby(df["id"].eq(1).cumsum())[["color","rate"]].transform("first")
print (df)
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
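The grouping key is the subtle part here: df['id'].eq(1).cumsum() increments every time id equals 1, so each block of rows that starts at an id == 1 row gets its own label, and transform('first') broadcasts that first row's color and rate across the block. A quick look at the intermediate key for the example df:
print(df['id'].eq(1).cumsum().tolist())
# [1, 1, 1, 2, 2, 2]  -> two groups, each starting at an id == 1 row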
If I understood you correctly:
empty_rows = df.loc[df['color']=='',['color','rate']].index
df.loc[empty_rows, ['color','rate']] = df.loc[df['id']==1,['color','rate']].values
This only works if the number of rows with empty values in color/rate equals the number of rows with id == 1; please expand the question if that is not the intention.
Result df:
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
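A hedged guard for the count precondition mentioned above could be added before the .values assignment (the assertion is illustrative and not part of the original answer):
assert len(empty_rows) == (df['id'] == 1).sum(), 'empty rows and id == 1 rows must line up one-to-one'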

Sampling of a pandas dataframe with groupby on one column such that the sample has no duplicates in another column

Assume the below dummy dataframe:
a b
0 red 0
1 red 1
2 red 3
3 blue 0
4 blue 1
5 blue 3
6 black 4
7 black 2
I want to sample grouped on column a, with a sample size of 1 per group. The only condition is that, in the resulting sample, the values of column b should be unique, giving me a result something like:
a b
2 red 0
3 blue 1
5 black 4
The below result is not acceptable:
a b
2 red 1
3 blue 1
5 black 4
Right now I am using the code below with the pandas sample function, but it can return duplicate values in column b:
df.groupby('a').sample(n=1)
sample(n=1) selects one row at random from each group in column 'a'.
You can use df.groupby('a').first(), which takes the first row of each group and would return the output you are looking for.
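A minimal sketch of that call on the dummy frame (assuming df holds the a and b columns shown above; whether the resulting b values are unique still depends on the data):
result = df.groupby('a', sort=False).first().reset_index()  # keeps the first row of each group in 'a'
print(result)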

Parsing Pandas Series From Another Series

I am trying to slice a Series of text using a Series of numbers, as in the code below, but all I get in return is a Series of NaNs.
import numpy as np
import pandas as pd
numData = np.array([4,6,4,3,6])
txtData = np.array(['bluebox','yellowbox','greybox','redbox','orangebox'])
n = pd.Series(numData)
t = pd.Series(txtData)
x = t.str[:n]
print (x)
output is
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
I would like the output to be
0 blue
1 yellow
2 grey
3 red
4 orange
Is there an easy way to do this?
You can use a simple list comprehension if, in reality, you can't just chop off the last 3 characters and need to rely on your per-row slice ranges. You will need error handling if your data aren't guaranteed to all be strings (slicing past the end of a string is harmless, but slicing a non-string such as NaN will raise); a hedged variant is sketched after the output below.
pd.Series([x[:end] for x,end in zip(t,n)], index=t.index)
0 blue
1 yellow
2 grey
3 red
4 orange
dtype: object
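A hedged variant of the same comprehension with the error handling mentioned above (safe_slice is a hypothetical helper, not part of the original answer; it assumes non-string entries should become None rather than raise):
def safe_slice(value, end):
    # only slice real strings; anything else (NaN, numbers) maps to None
    return value[:int(end)] if isinstance(value, str) else None
x = pd.Series([safe_slice(v, e) for v, e in zip(t, n)], index=t.index)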
You can use pd.Series.str.slice, since every entry here ends with the 3-character suffix 'box':
t.str.slice(stop=-3)
# short hand for this is t.str[:-3]
0 blue
1 yellow
2 grey
3 red
4 orange
dtype: object
Or turn numData into an iterator using iter and use slice:
it = iter(numData)
t.map(lambda x:x[slice(next(it))])
0 blue
1 yellow
2 grey
3 red
4 orange
dtype: object
numdata_iter = iter(numData)
x = t.apply(lambda text: text[:next(numdata_iter)])
We turn numData into an iterator and then call next on it for each slice inside apply.

Count the amount of NaNs in each group

In this dataframe, I'm trying to count how many NaN's there are for each color within the color column.
This is what the sample data looks like; in reality there are 100k rows.
color value
0 blue 10
1 blue NaN
2 red NaN
3 red NaN
4 red 8
5 red NaN
6 yellow 2
I'd like the output to look like this:
color count
0 blue 1
1 red 3
2 yellow 0
You can use isna on the value column, group it by the color column, and sum to add up all the True rows in each group:
df.value.isna().groupby(df.color).sum().reset_index()
color value
0 blue 1.0
1 red 3.0
2 yellow 0.0
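If you want integer counts and a column literally named count, as in the desired output, a hedged variation on the same idea:
df.value.isna().groupby(df.color).sum().astype(int).reset_index(name='count')
color count
0 blue 1
1 red 3
2 yellow 0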
You may also use agg() with isnull() (or isna()) as follows:
df.groupby('color').agg({'value': lambda x: x.isnull().sum()}).reset_index()
Use isna().sum()
df.groupby('color').value.apply(lambda x: x.isna().sum())
color
blue 1
red 3
yellow 0
Another approach uses size and count: size counts all rows in each group, while count counts only the non-NaN values, so their difference is the NaN count.
g=df.groupby('color')['value']
g.size()-g.count()
Out[115]:
color
blue 1
red 3
yellow 0
Name: value, dtype: int64
