Using python df.replace with dict does not permanently change values - python

I generated a DataFrame that includes a column called "pred_categories" with numerical values of 0, 1, 2, and 3. See below:
fileids pred_categories
0 /Saf/DA192069.txt 3
1 /Med/DA000038.txt 2
2 /Med/DA000040.txt 2
3 /Saf/DA191905.txt 3
4 /Med/DA180730.txt 2
I wrote a dict:
di = {3: "SAF", 2: "MED", 1: "FAC", 0: "ENV"}
And it works at first:
df.replace({'pred_categories': di})
Out[16]:
fileids pred_categories
0 /Saf/DA192069.txt SAF
1 /Med/DA000038.txt MED
2 /Med/DA000040.txt MED
3 /Saf/DA191905.txt SAF
4 /Med/DA180730.txt MED
5 /Saf/DA192307.txt SAF
6 /Env/DA178021.txt ENV
7 /Fac/DA358334.txt FAC
8 /Env/DA178049.txt ENV
9 /Env/DA178020.txt ENV
10 /Env/DA178031.txt ENV
11 /Med/DA000050.txt MED
12 /Med/DA180720.txt MED
13 /Med/DA000010.txt MED
14 /Fac/DA358391.txt FAC
but then when checking
df.head()
it doesn't seem to permanently "save" it in the DataFrame. Any pointers on what I'm doing wrong?
print(df)
fileids pred_categories
0 /Saf/DA192069.txt 3
1 /Med/DA000038.txt 2
2 /Med/DA000040.txt 2
3 /Saf/DA191905.txt 3
4 /Med/DA180730.txt 2
5 /Saf/DA192307.txt 3
6 /Env/DA178021.txt 0
7 /Fac/DA358334.txt 1
8 /Env/DA178049.txt 0
9 /Env/DA178020.txt 0
10 /Env/DA178031.txt 0
11 /Med/DA000050.txt 2
12 /Med/DA180720.txt 2
13 /Med/DA000010.txt 2
14 /Fac/DA358391.txt 1

Per default .replace() returns changed DF, but it doesn't change it in place, so you have to do it this way:
df = df.replace({'pred_categories': di})
or
df.replace({'pred_categories': di}, inplace=True)

Related

Drop rows if value in column changes

Assume I have the following pandas data frame:
my_class value
0 1 1
1 1 2
2 1 3
3 2 4
4 2 5
5 2 6
6 2 7
7 2 8
8 2 9
9 3 10
10 3 11
11 3 12
I want to identify the indices of "my_class" where the class changes and remove n rows after and before this index. The output of this example (with n=2) should look like:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
My approach:
# where class changes happen
s = df['my_class'].ne(df['my_class'].shift(-1).fillna(df['my_class']))
# mask with `bfill` and `ffill`
df[~(s.where(s).bfill(limit=1).ffill(limit=2).eq(1))]
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
One of possible solutions is to:
Make use of the fact that the index contains consecutive integers.
Find index values where class changes.
For each such index generate a sequence of indices from n-2
to n+1 and concatenate them.
Retrieve rows with indices not in this list.
The code to do it is:
ind = df[df['my_class'].diff().fillna(0, downcast='infer') == 1].index
df[~df.index.isin([item for sublist in
[ range(i-2, i+2) for i in ind ] for item in sublist])]
my_class = np.array([1] * 3 + [2] * 6 + [3] * 3)
cols = np.c_[my_class, np.arange(len(my_class)) + 1]
df = pd.DataFrame(cols, columns=['my_class', 'value'])
df['diff'] = df['my_class'].diff().fillna(0)
idx2drop = []
for i in df[df['diff'] == 1].index:
idx2drop += range(i - 2, i + 2)
print(df.drop(idx_drop)[['my_class', 'value']])
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12

How to calculate amounts that row values greater than a specific value in pandas?

How to calculate amounts that row values greater than a specific value in pandas?
For example, I have a Pandas DataFrame dff. I want to count row values greater than 0.
dff = pd.DataFrame(np.random.randn(9,3),columns=['a','b','c'])
dff
a b c
0 -0.047753 -1.172751 0.428752
1 -0.763297 -0.539290 1.004502
2 -0.845018 1.780180 1.354705
3 -0.044451 0.271344 0.166762
4 -0.230092 -0.684156 -0.448916
5 -0.137938 1.403581 0.570804
6 -0.259851 0.589898 0.099670
7 0.642413 -0.762344 -0.167562
8 1.940560 -1.276856 0.361775
I am using an inefficient way. How to be more efficient?
dff['count'] = 0
for m in range(len(dff)):
og = 0
for i in dff.columns:
if dff[i][m] > 0:
og += 1
dff['count'][m] = og
dff
a b c count
0 -0.047753 -1.172751 0.428752 1
1 -0.763297 -0.539290 1.004502 1
2 -0.845018 1.780180 1.354705 2
3 -0.044451 0.271344 0.166762 2
4 -0.230092 -0.684156 -0.448916 0
5 -0.137938 1.403581 0.570804 2
6 -0.259851 0.589898 0.099670 2
7 0.642413 -0.762344 -0.167562 1
8 1.940560 -1.276856 0.361775 2
You can create a boolean mask of your DataFrame, that is True wherever a value is greater than your threshold (in this case 0), and then use sum along the first axis.
dff.gt(0).sum(1)
0 1
1 1
2 2
3 2
4 0
5 2
6 2
7 1
8 2
dtype: int64

Compute average of the pandas df conditioned on a parameter

I have the following df:
import numpy as np
import pandas as pd
a = []
for i in range(5):
tmp_df = pd.DataFrame(np.random.random((10,4)))
tmp_df['lvl'] = i
a.append(tmp_df)
df = pd.concat(a, axis=0)
df =
0 1 2 3 lvl
0 0.928623 0.868600 0.854186 0.129116 0
1 0.667870 0.901285 0.539412 0.883890 0
2 0.384494 0.697995 0.242959 0.725847 0
3 0.993400 0.695436 0.596957 0.142975 0
4 0.518237 0.550585 0.426362 0.766760 0
5 0.359842 0.417702 0.873988 0.217259 0
6 0.820216 0.823426 0.585223 0.553131 0
7 0.492683 0.401155 0.479228 0.506862 0
..............................................
3 0.505096 0.426465 0.356006 0.584958 3
4 0.145472 0.558932 0.636995 0.318406 3
5 0.957969 0.068841 0.612658 0.184291 3
6 0.059908 0.298270 0.334564 0.738438 3
7 0.662056 0.074136 0.244039 0.848246 3
8 0.997610 0.043430 0.774946 0.097294 3
9 0.795873 0.977817 0.780772 0.849418 3
0 0.577173 0.430014 0.133300 0.760223 4
1 0.916126 0.623035 0.240492 0.638203 4
2 0.165028 0.626054 0.225580 0.356118 4
3 0.104375 0.137684 0.084631 0.987290 4
4 0.934663 0.835608 0.764334 0.651370 4
5 0.743265 0.072671 0.911947 0.925644 4
6 0.212196 0.587033 0.230939 0.994131 4
7 0.945275 0.238572 0.696123 0.536136 4
8 0.989021 0.073608 0.720132 0.254656 4
9 0.513966 0.666534 0.270577 0.055597 4
I am learning neat pandas functionality and thus wondering, what is the easiest way to compute average along lvl column?
What I mean is:
(df[df.lvl ==0 ] + df[df.lvl ==1 ] + df[df.lvl ==2 ] + df[df.lvl ==3 ] + df[df.lvl ==4 ]) / 5
The desired output should be a table of shape (10,4), without the column lvl, where each element is the average of 5 elements (with lvl = [0,1,2,3,4]. I hope it helps.
I think need:
np.random.seed(456)
a = []
for i in range(5):
tmp_df = pd.DataFrame(np.random.random((10,4)))
tmp_df['lvl'] = i
a.append(tmp_df)
df = pd.concat(a, axis=0)
#print (df)
df1 = (df[df.lvl ==0 ] + df[df.lvl ==1 ] +
df[df.lvl ==2 ] + df[df.lvl ==3 ] +
df[df.lvl ==4 ]) / 5
print (df1)
0 1 2 3 lvl
0 0.411557 0.520560 0.578900 0.541576 2
1 0.253469 0.655714 0.532784 0.620744 2
2 0.468099 0.576198 0.400485 0.333533 2
3 0.620207 0.367649 0.531639 0.475587 2
4 0.699554 0.548005 0.683745 0.457997 2
5 0.322487 0.316137 0.489660 0.362146 2
6 0.430058 0.159712 0.631610 0.641141 2
7 0.399944 0.511944 0.346402 0.754591 2
8 0.400190 0.373925 0.340727 0.407988 2
9 0.502879 0.399614 0.321710 0.715812 2
df = df.set_index('lvl')
df2 = df.groupby(df.groupby('lvl').cumcount()).mean()
print (df2)
0 1 2 3
0 0.411557 0.520560 0.578900 0.541576
1 0.253469 0.655714 0.532784 0.620744
2 0.468099 0.576198 0.400485 0.333533
3 0.620207 0.367649 0.531639 0.475587
4 0.699554 0.548005 0.683745 0.457997
5 0.322487 0.316137 0.489660 0.362146
6 0.430058 0.159712 0.631610 0.641141
7 0.399944 0.511944 0.346402 0.754591
8 0.400190 0.373925 0.340727 0.407988
9 0.502879 0.399614 0.321710 0.715812
EDIT:
If each subset of DataFrame have index from 0 to len(subset):
df2 = df.mean(level=0)
print (df2)
0 1 2 3 lvl
0 0.411557 0.520560 0.578900 0.541576 2
1 0.253469 0.655714 0.532784 0.620744 2
2 0.468099 0.576198 0.400485 0.333533 2
3 0.620207 0.367649 0.531639 0.475587 2
4 0.699554 0.548005 0.683745 0.457997 2
5 0.322487 0.316137 0.489660 0.362146 2
6 0.430058 0.159712 0.631610 0.641141 2
7 0.399944 0.511944 0.346402 0.754591 2
8 0.400190 0.373925 0.340727 0.407988 2
9 0.502879 0.399614 0.321710 0.715812 2
The groupby function is exactly what you want. It will group based on a condition, in this case where 'lvl' is the same, and then apply the mean function to the values for each column in that group.
df.groupby('lvl').mean()
it seems like you want to group by the index and take average of all the columns except lvl
i.e.
df.groupby(df.index)[[0,1,2,3]].mean()
For a dataframe generated using
np.random.seed(456)
a = []
for i in range(5):
tmp_df = pd.DataFrame(np.random.random((10,4)))
tmp_df['lvl'] = i
a.append(tmp_df)
df = pd.concat(a, axis=0)
df.groupby(df.index)[[0,1,2,3]].mean()
outputs:
0 1 2 3
0 0.411557 0.520560 0.578900 0.541576
1 0.253469 0.655714 0.532784 0.620744
2 0.468099 0.576198 0.400485 0.333533
3 0.620207 0.367649 0.531639 0.475587
4 0.699554 0.548005 0.683745 0.457997
5 0.322487 0.316137 0.489660 0.362146
6 0.430058 0.159712 0.631610 0.641141
7 0.399944 0.511944 0.346402 0.754591
8 0.400190 0.373925 0.340727 0.407988
9 0.502879 0.399614 0.321710 0.715812
which is identical to the output from
df.groupby(df.groupby('lvl').cumcount()).mean()
without resorting to double groupby.
IMO this is cleaner to read and will for large dataframe, will be much faster.

Groupby on condition and calculate sum of subgroups

Here is my data:
import numpy as np
import pandas as pd
z = pd.DataFrame({'a':[1,1,1,2,2,3,3],'b':[3,4,5,6,7,8,9], 'c':[10,11,12,13,14,15,16]})
z
a b c
0 1 3 10
1 1 4 11
2 1 5 12
3 2 6 13
4 2 7 14
5 3 8 15
6 3 9 16
Question:
How can I do calculation on different element of each subgroup? For example, for each group, I want to extract any element in column 'c' which its corresponding element in column 'b' is between 4 and 9, and sum them all.
Here is the code I wrote: (It runs but I cannot get the correct result)
gbz = z.groupby('a')
# For displaying the groups:
gbz.apply(lambda x: print(x))
list = []
def f(x):
list_new = []
for row in range(0,len(x)):
if (x.iloc[row,0] > 4 and x.iloc[row,0] < 9):
list_new.append(x.iloc[row,1])
list.append(sum(list_new))
results = gbz.apply(f)
The output result should be something like this:
a c
0 1 12
1 2 27
2 3 15
It might just be easiest to change the order of operations, and filter against your criteria first - it does not change after the groupby.
z.query('4 < b < 9').groupby('a', as_index=False).c.sum()
which yields
a c
0 1 12
1 2 27
2 3 15
Use
In [2379]: z[z.b.between(4, 9, inclusive=False)].groupby('a', as_index=False).c.sum()
Out[2379]:
a c
0 1 12
1 2 27
2 3 15
Or
In [2384]: z[(4 < z.b) & (z.b < 9)].groupby('a', as_index=False).c.sum()
Out[2384]:
a c
0 1 12
1 2 27
2 3 15
You could also groupby first.
z = z.groupby('a').apply(lambda x: x.loc[x['b']\
.between(4, 9, inclusive=False), 'c'].sum()).reset_index(name='c')
z
a c
0 1 12
1 2 27
2 3 15
Or you can use
z.groupby('a').apply(lambda x : sum(x.loc[(x['b']>4)&(x['b']<9),'c']))\
.reset_index(name='c')
Out[775]:
a c
0 1 12
1 2 27
2 3 15

Divide part of a dataframe by another while keeping columns that are not being divided

I have two data frames as below:
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 1 1 0.161456 0.033139 0.991840 2.111023 0.846197
1 1 10 0.636140 1.024235 36.333741 16.074662 3.142135
2 1 13 0.605840 0.034337 2.085061 2.125908 0.069698
3 1 14 0.038481 0.152382 4.608259 4.960007 0.162162
4 1 5 0.035628 0.087637 1.397457 0.768467 0.052605
5 1 6 0.114375 0.020196 0.220193 7.662065 0.077727
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 1 1 0.305224 0.542488 66.428382 73.615079 10.342252
1 1 10 0.814696 1.246165 73.802644 58.064363 11.179206
2 1 13 0.556437 0.517383 50.555948 51.913547 9.412299
3 1 14 0.314058 1.148754 56.165767 61.261950 9.142128
4 1 5 0.499129 0.460813 40.182454 41.770906 8.263437
5 1 6 0.300203 0.784065 47.359506 52.841821 9.833513
I want to divide the numerical values in the selected cells of the first by the second and I am using the following code:
df1_int.loc[:,'C14-Cer':].div(df2.loc[:,'C14-Cer':])
However, this way I lose the information from the column "Sample_name".
C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 0.528977 0.061088 0.014931 0.028677 0.081819
1 0.780831 0.821909 0.492309 0.276842 0.281070
2 1.088785 0.066367 0.041243 0.040951 0.007405
3 0.122529 0.132650 0.082047 0.080964 0.017738
4 0.071381 0.190178 0.034778 0.018397 0.006366
5 0.380993 0.025759 0.004649 0.145000 0.007904
How can I perform the division while keeping the column "Sample_name" in the resulting dataframe?
You can selectively overwrite using loc, the same way that you're already performing the division:
df1_int.loc[:,'C14-Cer':] = df1_int.loc[:,'C14-Cer':].div(df2.loc[:,'C14-Cer':])
This preserves the sample_name col:
In [12]:
df.loc[:,'C14-Cer':] = df.loc[:,'C14-Cer':].div(df1.loc[:,'C14-Cer':])
df
Out[12]:
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
index
0 1 1 0.528975 0.061087 0.014931 0.028677 0.081819
1 1 10 0.780831 0.821910 0.492309 0.276842 0.281070
2 1 13 1.088785 0.066367 0.041243 0.040951 0.007405
3 1 14 0.122528 0.132650 0.082047 0.080964 0.017738
4 1 5 0.071380 0.190179 0.034778 0.018397 0.006366
5 1 6 0.380992 0.025758 0.004649 0.145000 0.007904

Categories

Resources