Use loc and iloc together in pandas - python

Say I have the following dataframe. For the first two rows where column a equals 1, I want to set column c to 2.
>>> df = pd.DataFrame({"a" : [1,1,1,1,2,2,2,2], "b" : [2,3,1,4,5,6,7,2], "c" : [1,2,3,4,5,6,7,8]})
>>> df.loc[df["a"] == 1, "c"].iloc[0:2] = 2
>>> df
a b c
0 1 2 1
1 1 3 2
2 1 1 3
3 1 4 4
4 2 5 5
5 2 6 6
6 2 7 7
7 2 2 8
The code in the second line doesn't work because the chained indexing assigns to a copy, so the original dataframe is not modified. How would I do this?

A dirty way would be:
df.loc[df[df['a'] == 1][:2].index, 'c'] = 2
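Written out as a runnable sketch (same sample frame as above), the idea is to compute the wanted row labels first and then assign through a single .loc call:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 1, 2, 2, 2, 2],
                   "b": [2, 3, 1, 4, 5, 6, 7, 2],
                   "c": [1, 2, 3, 4, 5, 6, 7, 8]})

# Grab the index labels of the first two rows where a == 1,
# then assign through one .loc call (no chained indexing).
first_two = df[df["a"] == 1].index[:2]
df.loc[first_two, "c"] = 2

print(df["c"].tolist())  # [2, 2, 3, 4, 5, 6, 7, 8]
```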

You can use Index.isin:
import pandas as pd
df = pd.DataFrame({"a" : [1,1,1,1,2,2,2,2],
                   "b" : [2,3,1,4,5,6,7,2],
                   "c" : [1,2,3,4,5,6,7,8]})
# more general index
df.index = df.index + 10
print (df)
a b c
10 1 2 1
11 1 3 2
12 1 1 3
13 1 4 4
14 2 5 5
15 2 6 6
16 2 7 7
17 2 2 8
print (df.index.isin(df.index[:2]))
[ True True False False False False False False]
df.loc[(df["a"] == 1) & (df.index.isin(df.index[:2])), "c"] = 2
print (df)
a b c
10 1 2 2
11 1 3 2
12 1 1 3
13 1 4 4
14 2 5 5
15 2 6 6
16 2 7 7
17 2 2 8
If the index is nice (a default RangeIndex starting from 0, with no duplicates):
df.loc[(df["a"] == 1) & (df.index < 2), "c"] = 2
print (df)
a b c
0 1 2 2
1 1 3 2
2 1 1 3
3 1 4 4
4 2 5 5
5 2 6 6
6 2 7 7
7 2 2 8
Another solution:
mask = df["a"] == 1
mask = mask & (mask.cumsum() < 3)
df.loc[mask, "c"] = 2
print (df)
a b c
0 1 2 2
1 1 3 2
2 1 1 3
3 1 4 4
4 2 5 5
5 2 6 6
6 2 7 7
7 2 2 8
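The cumulative-sum trick behind the last solution generalizes: to restrict any boolean mask to its first k True values, AND it with a running count. A small sketch (the helper name first_k_true is my own):

```python
import pandas as pd

def first_k_true(mask: pd.Series, k: int) -> pd.Series:
    """Keep only the first k True values of a boolean mask."""
    # cumsum counts True values seen so far, so "<= k" cuts off the rest.
    return mask & (mask.cumsum() <= k)

df = pd.DataFrame({"a": [1, 1, 1, 1, 2, 2, 2, 2],
                   "c": [1, 2, 3, 4, 5, 6, 7, 8]})
df.loc[first_k_true(df["a"] == 1, 2), "c"] = 2
print(df["c"].tolist())  # [2, 2, 3, 4, 5, 6, 7, 8]
```

This works regardless of the index, since the final assignment is driven purely by the boolean mask.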

Related

How to identify one column with continuous number and same value of another column?

I have a DataFrame with two columns A and B.
I want to create a new column named C to identify the continuous A with the same B value.
Here's an example
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,5,6,10,11,12,13,18], 'B':[1,1,2,2,3,3,3,3,4,4]})
I found a similar question, but that method only identifies the continuous A regardless of B.
df['C'] = df['A'].diff().ne(1).cumsum().sub(1)
I have tried to groupby B and apply the function like this:
df['C'] = df.groupby('B').apply(lambda x: x['A'].diff().ne(1).cumsum().sub(1))
However, it doesn't work: TypeError: incompatible index of inserted column with frame index.
The expected output is
A B C
1 1 0
2 1 0
3 2 1
5 2 2
6 3 3
10 3 4
11 3 4
12 3 4
13 4 5
18 4 6
Let's create a sequential counter using groupby, diff, and cumsum, then factorize to re-encode the counter:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().factorize()[0]
Result
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
Use DataFrameGroupBy.diff, compare for inequality with 1 using Series.ne, take Series.cumsum, and finally subtract 1:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().sub(1)
print (df)
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
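To see why the per-group diff works, it can help to print the intermediate steps of the chain (same sample data as above):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 5, 6, 10, 11, 12, 13, 18],
                   'B': [1, 1, 2, 2, 3, 3, 3, 3, 4, 4]})

step = df.groupby('B')['A'].diff()   # NaN at each group start, else the gap to the previous A
breaks = step.ne(1)                  # True where a new run starts (gap != 1, incl. group starts)
df['C'] = breaks.cumsum().sub(1)     # running count of run starts, shifted to start at 0

print(df['C'].tolist())  # [0, 0, 1, 2, 3, 4, 4, 4, 5, 6]
```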

How to add dataframe to multiindex dataframe at specific location

I'm organizing data from separate files into one portable, multiindex
dataframe, with multiindex ("A", "B", "C"). Some of the info is gathered
from the filenames read in, and should populate the "A", and "B", of the
multiindex. "C" should take the form of the index of the file read in.
The columns should take the form of the columns read in.
Let's say the files read in become:
df1
0 1 2 3 4
0 0 9 9 8 5
1 0 8 2 1 2
2 9 1 6 4 3
3 1 4 1 4 4
4 5 4 6 6 2
df2
0 1 2 3 4
0 4 5 0 7 3
1 8 2 9 1 0
2 5 9 1 6 6
3 4 1 4 6 5
4 3 0 0 8 8
How do I get to this end result:
multiindex_df
0 1 2 3 4
A B C
1 1 0 0 9 9 8 5
1 0 8 2 1 2
2 9 1 6 4 3
3 1 4 1 4 4
4 5 4 6 6 2
1 2 0 4 5 0 7 3
1 8 2 9 1 0
2 5 9 1 6 6
3 4 1 4 6 5
4 3 0 0 8 8
Starting from:
import pandas as pd
import numpy as np
multiindex_df = pd.DataFrame(
    index=pd.MultiIndex.from_arrays(
        [[], [], []], names=["A", "B", "C"]))
df1 = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df1_a = 1
df1_b = 1
df2 = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df2_a = 1
df2_b = 2
breakpoint()
This is what I have in mind, but gives a key error:
multiindex_df.loc[(df1_a, df1_b, slice(None))] = df1
multiindex_df.loc[(df2_a, df2_b, slice(None))] = df2
You could do this as follows:
multiindex_df = pd.concat([df1, df2], keys=[1,2])
multiindex_df = pd.concat([multiindex_df], keys=[1])
multiindex_df.index.names = ['A','B','C']
print(multiindex_df)
0 1 2 3 4
A B C
1 1 0 0 9 9 8 5
1 0 8 2 1 2
2 9 1 6 4 3
3 1 4 1 4 4
4 5 4 6 6 2
2 0 4 5 0 7 3
1 8 2 9 1 0
2 5 9 1 6 6
3 4 1 4 6 5
4 3 0 0 8 8
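If I'm reading the concat docs right, the two calls can also be collapsed into one by passing tuple keys, which populate both outer levels at once:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(25).reshape(5, 5))
df2 = pd.DataFrame(np.arange(25, 50).reshape(5, 5))

# Tuple keys provide the two outer levels (A, B) in one call;
# each frame's own index becomes level C.
out = pd.concat([df1, df2], keys=[(1, 1), (1, 2)], names=["A", "B", "C"])
```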
Alternatively, you could do it like below:
# collect your dfs inside a dict
dfs = {'1': df1, '2': df2}
# create list for index tuples
multi_index = []
for idx, val in enumerate(dfs):
    for x in dfs[val].index:
        # append a tuple per row, e.g. (1, 1, 0), (1, 1, 1), etc.
        multi_index.append((1, idx + 1, x))
# concat your dfs
multiindex_df_two = pd.concat([df1, df2])
# create multiindex from tuples, and add names
multiindex_df_two.index = pd.MultiIndex.from_tuples(multi_index, names=['A', 'B', 'C'])
# check
multiindex_df.equals(multiindex_df_two)  # True
Ouroboros' answer made me realize that instead of trying to fit the read-in file dfs into a pre-formatted df, the cleaner solution is to format each individual file df, then concat. In order to do that, I have to re-format the file df indices into a multiindex, which prompted this question and answer. Having that down, and using Ouroboros' answer, the solution becomes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df1_a = 1
df1_b = 1
df1.index = pd.MultiIndex.from_product(
    [[df1_a], [df1_b], df1.index], names=["A", "B", "C"])
df2 = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df2_a = 1
df2_b = 2
df2.index = pd.MultiIndex.from_product(
    [[df2_a], [df2_b], df2.index], names=["A", "B", "C"])
multiindex_df = pd.concat([df1, df2])
Which is obviously well suited for a loop.
Output:
df1
0 1 2 3 4
0 5 3 1 1 3
1 8 9 7 5 6
2 8 6 6 7 7
3 3 4 9 7 2
4 3 2 1 6 2
df2
0 1 2 3 4
0 5 0 6 9 3
1 7 5 5 9 6
2 2 1 9 6 3
3 9 4 3 7 0
4 5 9 5 9 6
multiindex_df
0 1 2 3 4
A B C
1 1 0 5 3 1 1 3
1 8 9 7 5 6
2 8 6 6 7 7
3 3 4 9 7 2
4 3 2 1 6 2
2 0 5 0 6 9 3
1 7 5 5 9 6
2 2 1 9 6 3
3 9 4 3 7 0
4 5 9 5 9 6
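Wrapped in the loop mentioned above, the whole thing might look like this (the (a, b) pairs are hypothetical stand-ins for metadata parsed from real filenames):

```python
import pandas as pd
import numpy as np

# Hypothetical (a, b) metadata per file, standing in for values
# parsed from real filenames.
files = {(1, 1): pd.DataFrame(np.random.randint(10, size=(5, 5))),
         (1, 2): pd.DataFrame(np.random.randint(10, size=(5, 5)))}

parts = []
for (a, b), frame in files.items():
    frame = frame.copy()
    # Promote each frame's own index to level C under the (A, B) keys.
    frame.index = pd.MultiIndex.from_product(
        [[a], [b], frame.index], names=["A", "B", "C"])
    parts.append(frame)

multiindex_df = pd.concat(parts)
```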

Sort a subset of columns of a pandas dataframe alphabetically by column name

I'm having trouble finding the solution to a fairly simple problem.
I would like to alphabetically arrange certain columns of a pandas dataframe that has over 100 columns (i.e. so many that I don't want to list them manually).
Example df:
import pandas as pd
subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'subject': subject,
                   'timepoint': timepoint,
                   'c': c,
                   'd': d,
                   'a': a,
                   'b': b})
df.head()
subject timepoint c d a b
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
How could I rearrange the column names to generate a df.head() that looks like this:
subject timepoint a b c d
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
i.e. keep the first two columns where they are and then alphabetically arrange the remaining column names.
Thanks in advance.
You can split your dataframe based on column names using the normal indexing operator [], sort the other columns alphabetically with sort_index(axis=1), and concat back together:
>>> pd.concat([df[['subject', 'timepoint']],
...            df[df.columns.difference(['subject', 'timepoint'])]
...              .sort_index(axis=1)],
...           ignore_index=False, axis=1)
subject timepoint a b c d
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
5 1 6 7 7 7 7
6 2 1 3 3 3 3
7 2 2 4 4 4 4
8 2 3 1 1 1 1
9 2 4 2 2 2 2
10 2 5 3 3 3 3
11 2 6 4 4 4 4
12 3 1 5 5 5 5
13 3 2 4 4 4 4
14 3 4 5 5 5 5
15 4 1 8 8 8 8
16 4 2 4 4 4 4
17 4 3 5 5 5 5
18 4 4 6 6 6 6
19 4 5 2 2 2 2
20 4 6 3 3 3 3
Specify the first two columns you want to keep (or determine them from the data), then sort all of the other columns. Use .loc with the correct list to then "sort" the DataFrame.
import numpy as np
first_cols = ['subject', 'timepoint']
#first_cols = df.columns[0:2].tolist() # OR determine first two
other_cols = np.sort(df.columns.difference(first_cols)).tolist()
df = df.loc[:, first_cols+other_cols]
print(df.head())
subject timepoint a b c d
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
You can try getting the dataframe columns as a list, rearranging them, and assigning the result back with df = df[cols]:
import pandas as pd
subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'subject': subject,
                   'timepoint': timepoint,
                   'c': c,
                   'd': d,
                   'a': a,
                   'b': b})
cols = df.columns.tolist()
cols = cols[:2] + sorted(cols[2:])
df = df[cols]
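The slicing idea generalizes to any number of fixed leading columns; a small helper (the name sort_cols_after is my own) makes that explicit:

```python
import pandas as pd

def sort_cols_after(df: pd.DataFrame, n_fixed: int) -> pd.DataFrame:
    """Keep the first n_fixed columns in place; sort the rest by name."""
    cols = df.columns.tolist()
    return df[cols[:n_fixed] + sorted(cols[n_fixed:])]

df = pd.DataFrame({'subject': [1], 'timepoint': [1],
                   'c': [2], 'd': [2], 'a': [2], 'b': [2]})
print(sort_cols_after(df, 2).columns.tolist())
# ['subject', 'timepoint', 'a', 'b', 'c', 'd']
```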

Python Pandas keep maximum 3 consecutive duplicates

I have this table:
import pandas as pd
list1 = [1,1,2,2,3,3,3,3,4,1,1,1,1,2,2]
df = pd.DataFrame(list1)
df.columns = ['A']
I want to keep at most 3 consecutive duplicates, and keep everything where a value repeats fewer than 3 times (or not at all).
The result should look like this:
list2 = [1,1,2,2,3,3,3,4,1,1,1,2,2]
result = pd.DataFrame(list2)
result.columns = ['A']
Use GroupBy.head on groups of consecutive values, created by comparing shifted values for inequality (Series.ne with Series.shift) and taking the cumulative sum with Series.cumsum:
df1 = df.groupby(df.A.ne(df.A.shift()).cumsum()).head(3)
print (df1)
A
0 1
1 1
2 2
3 2
4 3
5 3
6 3
8 4
9 1
10 1
11 1
13 2
14 2
Detail:
print (df.A.ne(df.A.shift()).cumsum())
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 3
8 4
9 5
10 5
11 5
12 5
13 6
14 6
Name: A, dtype: int32
Let us do:
df[df.groupby(df['A'].diff().ne(0).cumsum())['A'].cumcount() < 3]
A
0 1
1 1
2 2
3 2
4 3
5 3
6 3
8 4
9 1
10 1
11 1
13 2
14 2
Solving with itertools.groupby, which groups only consecutive duplicates, then slicing the first 3 elements of each group:
import itertools
pd.Series(itertools.chain.from_iterable([*g][:3] for i,g in itertools.groupby(df['A'])))
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 1
9 1
10 1
11 2
12 2
dtype: int64
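The groupby approach generalizes to any cap n; a small helper (the name keep_max_consecutive is my own) wrapping the run-id/cumcount idea:

```python
import pandas as pd

def keep_max_consecutive(s: pd.Series, n: int) -> pd.Series:
    """Drop values beyond the first n of each consecutive run."""
    run_id = s.ne(s.shift()).cumsum()           # label consecutive runs 1, 2, 3, ...
    return s[s.groupby(run_id).cumcount() < n]  # keep the first n of each run

s = pd.Series([1, 1, 2, 2, 3, 3, 3, 3, 4, 1, 1, 1, 1, 2, 2])
print(keep_max_consecutive(s, 3).tolist())
# [1, 1, 2, 2, 3, 3, 3, 4, 1, 1, 1, 2, 2]
```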

How to set value to a cell filtered by rows in python DataFrame?

import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]],columns=['A','B','C'])
df[df['B']%2 ==0]['C'] = 5
I am expecting this code to change the value of column C to 5 wherever B is even, but it is not working.
It returns the table as follows:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
I am expecting it to return
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
If you need to change values of a column in a DataFrame, DataFrame.loc with the condition and the column name is the way to go:
df.loc[df['B']%2 ==0, 'C'] = 5
print (df)
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
Your solution is a nice example of chained indexing - see the docs.
You could just change the order to:
df['C'][df['B'] % 2 == 0] = 5
This may appear to work, but it is still chained indexing: it relies on df['C'] returning a view, can raise SettingWithCopyWarning, and stops working under pandas' copy-on-write mode, so the .loc form is the safe choice.
Using numpy.where:
import numpy as np
df['C'] = np.where(df['B'] % 2 == 0, 5, df['C'])
Output
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
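Another option worth knowing is Series.mask, which replaces values where a condition holds and avoids chained indexing entirely:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                  columns=['A', 'B', 'C'])
# Replace C with 5 wherever B is even.
df['C'] = df['C'].mask(df['B'] % 2 == 0, 5)
print(df['C'].tolist())  # [5, 6, 5, 12]
```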
