Pandas - Recode variable based on value_counts() results - python

This is my dataframe:
df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 3, 4],
                   'col2': [1, 3, 2, 4, 6, 5, 7]})
I want to recode values based on how often they appear in my dataset. Here I want to relabel every value that occurs only once as "other". This is the desired output:
#desired
"col1": [1,1,1,2,2,"other", "other"]
I tried this but it did not work:
df["recoded"] = np.where(df["col1"].value_counts() > 1, df["col1"], "other")
My idea is to save the value counts, filter them, and then loop over the resulting array, but this seems overly complicated. Is there an easy "pythonic/pandas" way to achieve this?

You are close. You need Series.map so that the counts have the same length as the original DataFrame:
df["recoded"] = np.where(df["col1"].map(df["col1"].value_counts()) > 1, df["col1"], "other")
Or use GroupBy.transform with 'size' to get the group size for every row:
df["recoded"] = np.where(df.groupby('col1')["col1"].transform('size') > 1,
                         df["col1"],
                         "other")
If you only need to check for duplicates, use Series.duplicated with keep=False, which returns a mask that marks every duplicated value:
df["recoded"] = np.where(df["col1"].duplicated(keep=False), df["col1"], "other")
print(df)
   col1  col2 recoded
0     1     1       1
1     1     3       1
2     1     2       1
3     2     4       2
4     2     6       2
5     3     5   other
6     4     7   other
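To see why the original attempt failed and what map changes, here is a minimal sketch using the question's data. The names counts and per_row are only for illustration. It uses Series.where on an object-typed copy instead of np.where, which is a swapped-in variant and not what the answer above uses. The reason is that np.where, as far as I can tell, returns a string array when integers are mixed with "other", so 1 would come back as '1'.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 1, 2, 2, 3, 4],
                   "col2": [1, 3, 2, 4, 6, 5, 7]})

# value_counts() has one entry per distinct value (4 here), not one per row (7),
# so it cannot be used directly as a row-wise condition.
counts = df["col1"].value_counts()

# map() looks up each row's value in counts, giving a Series aligned with df.
per_row = df["col1"].map(counts)

# Keep the value where it occurs more than once, otherwise use "other".
# astype(object) lets integers and the string label share one column.
df["recoded"] = df["col1"].astype(object).where(per_row > 1, "other")

print(df)
```

If string labels throughout are fine for your use case, the np.where versions above are shorter.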

Related

Pandas group by function to do different methods if index in list

I'm wondering if it's possible to create your own groupby function that runs a different method for the output in a single column, depending on whether the index is in some list or not. For example:
df = pd.DataFrame({'ID' : [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'Data' : [5, 7, 6, 13, 14, 11, 10, 2, 4, 3]})
some_list = [2, 3]
I want to group by the ID column and return the average of the Data column (df.groupby('ID').mean() for most values). However, if the ID is in some_list, I would like the average to be calculated as the sum of Data divided by 4 (df.groupby('ID').sum()/4). The output for the above would look like this:
ID Data
0 1 6.00
1 2 12.00
2 3 2.25
I know I could do both methods separately and join them into one column after the groupby, but I was wondering if it's possible to do this in one step. Maybe with df.groupby('ID').apply(function)?
I've looked at this question, but it didn't help me.
Try groupby with apply and a condition:
df.groupby('ID', as_index=False)['Data'].apply(lambda x: x.sum() / 4 if x.name in some_list else x.mean())
Output:
ID Data
0 1 6.00
1 2 12.00
2 3 2.25
If performance is important, don't use groupby.apply. Instead, filter the rows in some_list and aggregate them with sum divided by 4, then fill in the remaining IDs with the aggregated mean:
s = df[df['ID'].isin(some_list)].groupby('ID')['Data'].sum().div(4)
df = s.combine_first(df.groupby('ID')['Data'].mean()).reset_index()
print (df)
ID Data
0 1 6.00
1 2 12.00
2 3 2.25
Here is an alternative solution:
df = df.groupby('ID')['Data'].agg(['sum','mean']).reset_index()
df['Value'] = np.where(df['ID'].isin(some_list), df.pop('sum').div(4), df.pop('mean'))
print (df)
ID Value
0 1 6.00
1 2 12.00
2 3 2.25
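Another way to look at it: both cases are "sum divided by something", so only the divisor differs per group. The sketch below is a variation on the answers above, not taken from them; stats, divisor, and result are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   "Data": [5, 7, 6, 13, 14, 11, 10, 2, 4, 3]})
some_list = [2, 3]

# One pass over the groups: the sum and the row count per ID.
stats = df.groupby("ID")["Data"].agg(["sum", "count"])

# The divisor is 4 for IDs in some_list, otherwise the group size (a plain mean).
divisor = np.where(stats.index.isin(some_list), 4, stats["count"])

result = (stats["sum"] / divisor).rename("Data").reset_index()
print(result)
```

This avoids groupby.apply and needs only a single groupby call.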

How to filter DataFrame using isna?

This seems super basic, and yet I am failing to filter this dataframe. As you can see from the screenshot, I load a very basic set of data. I check whether any values in column 'Col3' are NaN. Finally, I try to filter the dataframe using that result. I am hoping to get back just the second row (with index 1). But as you can see, I get all 5 rows, and the values for Col3 are now all NaN.
I am using Python 3.7.3 and Pandas 1.1.4
Trying wwnde's suggestion to use brackets instead of .loc did not seem to work:
Try this.
Sample data (since you didn't give me any):
df = pd.DataFrame({'group1': [0, 0, 1, 1], 'group2': [2, 2, 3, 4],
'base': [0, 1, 2, 3], 'x1': [3, 4, 5, 6], 'x2': [np.nan, 6, np.nan, 8]})
df[df['x2'].isna()]
group1 group2 base x1 x2
0 0 2 0 3 NaN
2 1 3 2 5 NaN
Use the loc accessor if you need to select particular columns:
df.loc[df['x2'].isna(), :'base']  # 'base' and all preceding columns
group1 group2 base
0 0 2 0
2 1 3 2
or
df.loc[df['x2'].isna(),['base','x1']]
base x1
0 0 3
2 2 5
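The screenshot from the question is not available, so the following is only a guess at what might produce "all rows, but NaN values". Indexing with a boolean Series selects rows, while indexing with a boolean DataFrame keeps the shape and blanks out cells. The data and the names row_mask, rows, frame_mask, and masked are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the original screenshot is not available.
df = pd.DataFrame({"Col1": [1, 2, 3], "Col3": [10.0, np.nan, 30.0]})

# Boolean Series: one True/False per row, so indexing drops non-matching rows.
row_mask = df["Col3"].isna()
rows = df[row_mask]

# Boolean DataFrame: one True/False per cell, so indexing keeps every row
# and replaces each cell where the mask is False with NaN.
frame_mask = df.isna()
masked = df[frame_mask]

print(rows)
print(masked)
```

If your filter returns every row, it is worth checking whether the mask is a Series (what you want for row selection) or a DataFrame.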

Excluding Zeros from DataFrame

I am trying to get rid of values that are 0.000000 in my data frame so I can find the min/max values, excluding zero.
My dataframe, Answer2, looks like this:
When I try to exclude zeros using the code below, I still get the same dataframe with the zeros intact:
no_zero=Answer2.loc[(Answer2!=0.000000).any(1)]
no_zero
Any idea on how I can remove zeros?
You could replace the zeros with NaN:
import numpy
df = df.replace(0, numpy.nan)
Then define your own min and max functions that ignore NaN values:
def non_zero_min(series):
    return series.dropna().min()
df.apply(non_zero_min, axis=1)
For example, where df is:
df = pd.DataFrame({"A": [0, 2, 5, 5, 7], "B":[1, 5, 0, 1, 7]})
A B
0 0 1
1 2 5
2 5 0
3 5 1
4 7 7
You can do this to replace every 0 in the dataframe with NaN:
df[df != 0]
Then you can find the minimum or maximum value with pandas.Series.min() and pandas.Series.max(), which skip NaN by default.
For example:
df = df[df != 0]
df.A.min() # 2.0
df.B.min() # 1.0
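Putting the two answers together, here is a short sketch on the example frame. It uses DataFrame.mask, which is equivalent to df[df != 0] here, and relies on min() and max() skipping NaN by default. The names non_zero, col_min, col_max, and row_min are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 2, 5, 5, 7], "B": [1, 5, 0, 1, 7]})

# Cells equal to 0 become NaN; the columns are upcast to float to hold NaN.
non_zero = df.mask(df == 0)

col_min = non_zero.min()        # per column, NaN skipped
col_max = non_zero.max()
row_min = non_zero.min(axis=1)  # per row, NaN skipped

print(col_min)
print(col_max)
print(row_min)
```

Because the zeros are only masked in non_zero, the original df is left untouched.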

Python, Pandas: The Merged Sum of Some Rows according to Column value

I have the following example pandas DataFrame. I am trying to sum some specific rows. I have researched how to do this, but I could not find a solution. Could you point me in the right direction, please? The example is below. I thought I could apply groupby and sum, but there is a column (Value_3) that I do not want to sum; it should stay the same. Value_3 is a constant determined by the Machine and Shift values.
data = {'Machine':['Mch_1', 'Mch_1', 'Mch_1', 'Mch_1', 'Mch_2', 'Mch_2'], 'Shift':['Day', 'Day', 'Night', 'Night', 'Night', 'Night'], 'Value_1':[1, 2, 0, 0, 1, 3], 'Value_2':[0, 2, 2, 1, 3, 0], 'Value_3':[5, 5, 2, 2, 6, 6]}
df = pd.DataFrame(data)
Output:
  Machine  Shift  Value_1  Value_2  Value_3
0   Mch_1    Day        1        0        5
1   Mch_1    Day        2        2        5
2   Mch_1  Night        0        2        2
3   Mch_1  Night        0        1        2
4   Mch_2  Night        1        3        6
5   Mch_2  Night        3        0        6
What I would like to have is shown in the dataframe below.
expected = {'Machine':['Mch_1', 'Mch_1', 'Mch_2'], 'Shift':['Day', 'Night', 'Night'], 'Value_1':[3, 0, 4], 'Value_2':[2, 3, 3], 'Value_3':[5, 2, 6]}
df_expected = pd.DataFrame(expected)
df_expected
Output:
  Machine  Shift  Value_1  Value_2  Value_3
0   Mch_1    Day        3        2        5
1   Mch_1  Night        0        3        2
2   Mch_2  Night        4        3        6
Thank you very much.
The first idea is to pass a dictionary of aggregation functions; for the last column you can use first or last:
d = {'Value_1':'sum','Value_2':'sum','Value_3':'first'}
df1 = df.groupby(['Machine','Shift'], as_index=False).agg(d)
For a more dynamic solution, that is, to sum all columns except Value_3, build the dictionary from all columns not in the specified list with dict.fromkeys and Index.difference:
d = dict.fromkeys(df.columns.difference(['Machine','Shift', 'Value_3']), 'sum')
d['Value_3'] = 'first'
df1 = df.groupby(['Machine','Shift'], as_index=False).agg(d)
print (df1)
Machine Shift Value_1 Value_2 Value_3
0 Mch_1 Day 3 2 5
1 Mch_1 Night 0 3 2
2 Mch_2 Night 4 3 6
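Using first for Value_3 is only safe if the value really is constant within each Machine/Shift group, as the question states. The sketch below checks that assumption first and then does the same aggregation with keyword-style (named) aggregation instead of a dictionary. The names is_constant and out are illustrative.

```python
import pandas as pd

data = {"Machine": ["Mch_1", "Mch_1", "Mch_1", "Mch_1", "Mch_2", "Mch_2"],
        "Shift": ["Day", "Day", "Night", "Night", "Night", "Night"],
        "Value_1": [1, 2, 0, 0, 1, 3],
        "Value_2": [0, 2, 2, 1, 3, 0],
        "Value_3": [5, 5, 2, 2, 6, 6]}
df = pd.DataFrame(data)

keys = ["Machine", "Shift"]

# Sanity check: every group should contain exactly one distinct Value_3.
is_constant = bool(df.groupby(keys)["Value_3"].nunique().eq(1).all())

# Named aggregation: output_column=(input_column, function).
out = df.groupby(keys, as_index=False).agg(
    Value_1=("Value_1", "sum"),
    Value_2=("Value_2", "sum"),
    Value_3=("Value_3", "first"),
)

print(is_constant)
print(out)
```

If the check ever comes back False, first would silently pick one of several values, so it is worth deciding what should happen in that case.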

how to add columns label on a Pandas DataFrame

I can't understand how to add column names to a pandas dataframe. A simple example will clarify my issue:
dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]}
df = pd.DataFrame(dic)
Now if I type df, then I get
a b c
0 4 4 5
1 1 2 7
2 3 1 9
3 1 4 1
Say that I now generate another dataframe just by summing up the columns of the previous one:
a = df.sum()
If I type 'a', then I get
a 9
b 11
c 22
That looks like a dataframe with an index but without a name on its only column. So I wrote
a.columns = ['column']
or
a.columns = ['index', 'column']
and in both cases Python was happy, because it didn't give me any error message. But if I type 'a', I still can't see the column names anywhere. What's wrong here?
DataFrame.sum() performs an aggregation and therefore returns a Series, not a DataFrame. A Series has no columns, only an index. If you want to create a DataFrame from your sum, you can replace a = df.sum() with:
a = pd.DataFrame(df.sum(), columns=['whatever_name_you_want'])
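As a complement, here is a small sketch of a few other ways to label the result, using the question's data. The label "total" and the names totals, named, frame, and flat are placeholders I picked for illustration.

```python
import pandas as pd

dic = {"a": [4, 1, 3, 1], "b": [4, 2, 1, 4], "c": [5, 7, 9, 1]}
df = pd.DataFrame(dic)

totals = df.sum()  # a Series indexed by 'a', 'b', 'c'

# A Series has a single name rather than column labels.
named = totals.rename("total")

# to_frame() turns the Series into a one-column DataFrame.
frame = totals.to_frame(name="total")

# reset_index() moves the old index into a regular column as well.
flat = totals.rename_axis("column").reset_index(name="total")

print(frame)
print(flat)
```

Which form is best depends on whether you want the original column names to stay in the index or become a column of their own.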
