Selecting particular values from a column in a dataframe - python

I have a dataset with only two columns. I would like to extract a small part out of it based on some condition on one column. Consider this as my dataset.
A B
1 10
1 9
2 11
3 12
3 11
4 9
Suppose I want to extract only those rows whose values in B are between 10 and 12 (inclusive), so that I get a new dataset:
A B
1 10
2 11
3 12
3 11
I tried using df.loc[df["B"] == range(10, 12)] but it does not work. Can someone help me with this?

You can use .between
In [1031]: df.loc[df.B.between(10, 12)]
Out[1031]:
A B
0 1 10
2 2 11
3 3 12
4 3 11
Or, isin
In [1032]: df.loc[df.B.isin(range(10, 13))]
Out[1032]:
A B
0 1 10
2 2 11
3 3 12
4 3 11
Or, query
In [1033]: df.query('10 <= B <= 12')
Out[1033]:
A B
0 1 10
2 2 11
3 3 12
4 3 11
Or, good ol' boolean indexing
In [1034]: df.loc[(df.B >= 10) & (df.B <= 12)]
Out[1034]:
A B
0 1 10
2 2 11
3 3 12
4 3 11

Here's one more (not using .loc() or .query()) which looks more like the initial (unsuccessful) attempt:
df[df.B.isin(range(10,13))]
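As a minimal sketch of why the original attempt fails and the answers above succeed: == compares element by element instead of testing membership, and range(10, 12) stops at 11, so 12 would be excluded anyway. The snippet below rebuilds the question's dataset and checks that the four approaches select the same rows. The variable names (by_between, by_isin, and so on) are illustrative only.

```python
import pandas as pd

# Dataset from the question
df = pd.DataFrame({"A": [1, 1, 2, 3, 3, 4], "B": [10, 9, 11, 12, 11, 9]})

# between() includes both endpoints by default
by_between = df[df["B"].between(10, 12)]

# range() excludes its stop value, so 13 is needed to include 12
by_isin = df[df["B"].isin(range(10, 13))]

# query() accepts a chained comparison written as a string
by_query = df.query("10 <= B <= 12")

# Explicit boolean mask; the parentheses are required around each comparison
by_mask = df[(df["B"] >= 10) & (df["B"] <= 12)]

print(by_between)
```

Note that isin only matches the exact values listed, so it suits integer data; for floats, between or the boolean mask is the safer choice.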

Related

How to identify one column with continuous number and same value of another column?

I have a DataFrame with two columns A and B.
I want to create a new column named C to identify the continuous A with the same B value.
Here's an example
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,5,6,10,11,12,13,18], 'B':[1,1,2,2,3,3,3,3,4,4]})
I found a similar question, but that method only identifies the continuous A regardless of B.
df['C'] = df['A'].diff().ne(1).cumsum().sub(1)
I have tried to groupby B and apply the function like this:
df['C'] = df.groupby('B').apply(lambda x: x['A'].diff().ne(1).cumsum().sub(1))
However, it doesn't work: TypeError: incompatible index of inserted column with frame index.
The expected output is
A B C
1 1 0
2 1 0
3 2 1
5 2 2
6 3 3
10 3 4
11 3 4
12 3 4
13 4 5
18 4 6
Let's create a sequential counter using groupby, diff and cumsum, then use factorize to re-encode the counter so that it starts at 0:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().factorize()[0]
Result
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
Use DataFrameGroupBy.diff, test the result for not being equal to 1, take the cumulative sum with Series.cumsum, and finally subtract 1:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().sub(1)
print (df)
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
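The one-liner above can be unpacked step by step. This is a sketch using the question's data; the intermediate names step and new_run are made up for illustration. Selecting the column before calling diff keeps the original row index, which is likely why it assigns cleanly while the groupby(...).apply(...) attempt, whose result is also indexed by the group key, does not.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 5, 6, 10, 11, 12, 13, 18],
    "B": [1, 1, 2, 2, 3, 3, 3, 3, 4, 4],
})

# Difference to the previous A within the same B group;
# the first row of every group gets NaN
step = df.groupby("B")["A"].diff()

# A new run starts wherever the step is not exactly 1 (NaN counts as "not 1")
new_run = step.ne(1)

# Counting the run starts gives a run id; subtract 1 to start at 0
df["C"] = new_run.cumsum().sub(1)

print(df)
```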

How to Convert the row unique values in to columns

I have this dataFrame
dd = pd.DataFrame({'a':[1,1,1,1,2,2,2,2],'feature':[10,10,20,20,10,10,20,20],'h':['h_30','h_60','h_30','h_60','h_30','h_60','h_30','h_60'],'count':[1,2,3,4,5,6,7,8]})
a feature h count
0 1 10 h_30 1
1 1 10 h_60 2
2 1 20 h_30 3
3 1 20 h_60 4
4 2 10 h_30 5
5 2 10 h_60 6
6 2 20 h_30 7
7 2 20 h_60 8
I want to turn the unique values of the h column into columns and use the count numbers as their values, like this:
a feature h_30 h_60
0 1 10 1 2
1 1 20 3 4
2 2 10 5 6
3 2 20 7 8
I tried this but got an error saying ValueError: Length of passed values is 8, index implies 2
dd.pivot(index = ['a','feature'],columns ='h',values = 'count' )
DataFrame.pivot does not accept a list of columns as index in pandas versions below 1.1.0. From the docs:
Changed in version 1.1.0: Also accept list of index names.
Try this:
import pandas as pd
pd.pivot_table(
    dd, index=["a", "feature"], columns="h", values="count"
).reset_index().rename_axis(None, axis=1)  # pass axis by keyword; newer pandas rejects it positionally
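A self-contained sketch of the reshaping, using the question's data. One assumption added here: aggfunc="first" is passed because pivot_table averages by default, which would turn the counts into floats. Since each (a, feature, h) combination occurs exactly once, taking the first value changes nothing else. The name wide is illustrative.

```python
import pandas as pd

dd = pd.DataFrame({
    "a": [1, 1, 1, 1, 2, 2, 2, 2],
    "feature": [10, 10, 20, 20, 10, 10, 20, 20],
    "h": ["h_30", "h_60", "h_30", "h_60", "h_30", "h_60", "h_30", "h_60"],
    "count": [1, 2, 3, 4, 5, 6, 7, 8],
})

wide = (
    dd.pivot_table(index=["a", "feature"], columns="h",
                   values="count", aggfunc="first")
      .reset_index()              # turn a and feature back into ordinary columns
      .rename_axis(None, axis=1)  # drop the leftover "h" label on the column axis
)

print(wide)
```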

Python Pandas keep maximum 3 consecutive duplicates

I have this table:
import pandas as pd
list1 = [1,1,2,2,3,3,3,3,4,1,1,1,1,2,2]
df = pd.DataFrame(list1)
df.columns = ['A']
I want to keep at most 3 consecutive duplicates, or keep all of them when there are fewer than 3 (or no) duplicates.
The result should look like this:
list2 = [1,1,2,2,3,3,3,4,1,1,1,2,2]
result = pd.DataFrame(list2)
result.columns = ['A']
Use GroupBy.head, grouping by a helper Series that labels each consecutive run. The helper is built by comparing each value with its shifted neighbour for inequality and taking the cumulative sum with Series.cumsum:
df1 = df.groupby(df.A.ne(df.A.shift()).cumsum()).head(3)
print (df1)
A
0 1
1 1
2 2
3 2
4 3
5 3
6 3
8 4
9 1
10 1
11 1
13 2
14 2
Detail:
print (df.A.ne(df.A.shift()).cumsum())
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 3
8 4
9 5
10 5
11 5
12 5
13 6
14 6
Name: A, dtype: int32
Let us do (here the column still has its default name 0, i.e. before the rename to 'A'):
df[df.groupby(df[0].diff().ne(0).cumsum())[0].cumcount()<3]
0
0 1
1 1
2 2
3 2
4 3
5 3
6 3
8 4
9 1
10 1
11 1
13 2
14 2
Solving with itertools.groupby, which groups only consecutive duplicates, then slicing the first 3 elements of each group:
import itertools
pd.Series(itertools.chain.from_iterable([*g][:3] for i,g in itertools.groupby(df['A'])))
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 1
9 1
10 1
11 2
12 2
dtype: int64
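The pandas approaches above generalise to any run length. The helper below is a hypothetical sketch, not part of pandas: cap_consecutive and its max_run parameter are names invented for illustration.

```python
import pandas as pd


def cap_consecutive(series, max_run=3):
    """Keep at most max_run values from every run of consecutive duplicates."""
    # The label increases by 1 each time the value differs from the previous one
    run_id = series.ne(series.shift()).cumsum()
    # 0-based position of each element inside its run
    position = series.groupby(run_id).cumcount()
    return series[position < max_run]


s = pd.Series([1, 1, 2, 2, 3, 3, 3, 3, 4, 1, 1, 1, 1, 2, 2], name="A")
capped = cap_consecutive(s)

print(capped.tolist())
```

The original index labels are preserved, so the dropped rows (7 and 12 here) remain visible as gaps; call reset_index(drop=True) on the result if a fresh index is wanted.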

How to set value to a cell filtered by rows in python DataFrame?

import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]],columns=['A','B','C'])
df[df['B']%2 ==0]['C'] = 5
I am expecting this code to change the value of columns C to 5, wherever B is even. But it is not working.
It returns the table as follows:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
I am expecting it to return
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
To change the values of a column in the DataFrame itself, use DataFrame.loc with both the condition and the column name:
df.loc[df['B']%2 ==0, 'C'] = 5
print (df)
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
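Your solution is a nice example of chained indexing (see the pandas docs on "Returning a view versus a copy").
You could also swap the order of the two indexers:
df['C'][df['B']%2 == 0] = 5
This works in older pandas versions, but it is still chained assignment: it may emit a warning, and with Copy-on-Write enabled it no longer modifies df. Prefer df.loc.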
Using numpy.where (requires import numpy as np):
df['C'] = np.where(df['B']%2 == 0, 5, df['C'])
Output
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
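A small sketch of why the original assignment had no effect: boolean indexing with df[mask] produces a new object, so writing to it leaves df untouched. The explicit .copy() below stands in for the temporary object that df[mask]['C'] = 5 writes to. The names even_b, subset and before are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                  columns=["A", "B", "C"])

even_b = df["B"] % 2 == 0

# Filtering rows yields a separate object; writing to it does not touch df
subset = df[even_b].copy()
subset["C"] = 5
before = df["C"].tolist()  # df is still unchanged at this point

# A single .loc call selects rows and column together and writes into df itself
df.loc[even_b, "C"] = 5

print(df)
```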

Separate DataFrame into N (almost) equal segments

Say I have a data frame that looks like this:
Id ColA
1 2
2 2
3 3
4 5
5 10
6 12
7 18
8 20
9 25
10 26
I would like my code to add a new column at the end of the DataFrame that splits the observations into 5 equally sized segments, numbered from 5 down to 1.
Id ColA Segment
1 2 5
2 2 5
3 3 4
4 5 4
5 10 3
6 12 3
7 18 2
8 20 2
9 25 1
10 26 1
I tried the following code, but it doesn't work:
df['segment'] = pd.qcut(df['Id'],5)
I also want to know what would happen if the total number of observations were not divisible by 5.
Actually, you were closer to the answer than you think. This will work regardless of whether len(df) is a multiple of 5 or not.
bins = 5
df['Segment'] = bins - pd.qcut(df['Id'], bins).cat.codes
df
Id ColA Segment
0 1 2 5
1 2 2 5
2 3 3 4
3 4 5 4
4 5 10 3
5 6 12 3
6 7 18 2
7 8 20 2
8 9 25 1
9 10 26 1
Where
pd.qcut(df['Id'], bins).cat.codes
0 0
1 0
2 1
3 1
4 2
5 2
6 3
7 3
8 4
9 4
dtype: int8
represents the categorical intervals returned by pd.qcut as integer codes (shown here for the 10-row DataFrame).
Another example, for a DataFrame with 7 rows.
df = df.head(7).copy()
df['Segment'] = bins - pd.qcut(df['Id'], bins).cat.codes
df
Id ColA Segment
0 1 2 5
1 2 2 5
2 3 3 4
3 4 5 3
4 5 10 2
5 6 12 1
6 7 18 1
This should also work (requires import numpy as np):
df['segment'] = np.linspace(1, 6, len(df), endpoint=False, dtype=int)
It creates an array of integers from 1 to 5 with one entry per row. If you want them to run from 5 down to 1, just add [::-1] at the end of the line.
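As a variation on the qcut answer, passing labels=False makes pd.qcut return the integer bin codes directly, which avoids the .cat.codes step. The helper below is a hypothetical sketch (segment_ids is not a pandas function) and assumes the Id values are simply 1..n, as in the question.

```python
import pandas as pd


def segment_ids(n_rows, bins=5):
    """Return segment numbers (bins down to 1) for Ids 1..n_rows."""
    ids = pd.Series(range(1, n_rows + 1))
    # labels=False returns the 0-based bin index instead of Interval labels
    codes = pd.qcut(ids, bins, labels=False)
    return (bins - codes).tolist()


print(segment_ids(10))  # [5, 5, 4, 4, 3, 3, 2, 2, 1, 1]
print(segment_ids(7))   # [5, 5, 4, 3, 2, 1, 1]
```

When the row count is not a multiple of the bin count, the quantile edges decide which segments receive the extra rows, as the 7-row case shows.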
