Say I have a data frame that looks like this:
Id ColA
1 2
2 2
3 3
4 5
5 10
6 12
7 18
8 20
9 25
10 26
I would like my code to create a new column at the end of the DataFrame that divides the total number of observations into 5 groups, labelled from 5 down to 1.
Id ColA Segment
1 2 5
2 2 5
3 3 4
4 5 4
5 10 3
6 12 3
7 18 2
8 20 2
9 25 1
10 26 1
I tried the following code, but it doesn't work:
df['segment'] = pd.qcut(df['Id'],5)
I also want to know what would happen if the total number of my observations was not divisible by 5.
Actually, you were closer to the answer than you think. This will work regardless of whether len(df) is a multiple of 5 or not.
bins = 5
df['Segment'] = bins - pd.qcut(df['Id'], bins).cat.codes
df
Id ColA Segment
0 1 2 5
1 2 2 5
2 3 3 4
3 4 5 4
4 5 10 3
5 6 12 3
6 7 18 2
7 8 20 2
8 9 25 1
9 10 26 1
where

pd.qcut(df['Id'], bins).cat.codes

0    0
1    0
2    1
3    1
4    2
5    2
6    3
7    3
8    4
9    4
dtype: int8

represents the categorical intervals returned by pd.qcut as integer codes.
Another example, for a DataFrame with 7 rows.
df = df.head(7).copy()
df['Segment'] = bins - pd.qcut(df['Id'], bins).cat.codes
df
Id ColA Segment
0 1 2 5
1 2 2 5
2 3 3 4
3 4 5 3
4 5 10 2
5 6 12 1
6 7 18 1
This should work:

import numpy as np

df['segment'] = np.linspace(1, 6, len(df), False, dtype=int)

It creates an array of integers from 1 to 5 with the same length as your DataFrame. If you want the values to run from 5 down to 1, just add [::-1] at the end of the line.
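For completeness, here is a minimal, self-contained sketch that rebuilds the sample data from the question and applies both approaches; the reversed slice shows the [::-1] trick mentioned above:

import numpy as np
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({'Id': range(1, 11),
                   'ColA': [2, 2, 3, 5, 10, 12, 18, 20, 25, 26]})

bins = 5

# qcut approach: integer codes 0..4, flipped so the earliest fifth gets label 5
df['Segment'] = bins - pd.qcut(df['Id'], bins).cat.codes

# linspace approach: evenly spaced ints 1..5, reversed to run 5..1
df['segment'] = np.linspace(1, 6, len(df), False, dtype=int)[::-1]

print(df)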
Related
I have a dataframe like the one below containing patients' stays in the ICU (in hours), shown by ICULOS.
df # Main dataframe
dfy = df.copy()
dfy
P_ID  ICULOS  Count
1     1       5
1     2       5
1     3       5
1     4       5
1     5       5
2     1       9
2     2       9
2     3       9
2     4       9
2     5       9
2     6       9
2     7       9
2     8       9
2     9       9
3     1       3
3     2       3
3     3       3
4     1       7
4     2       7
4     3       7
4     4       7
4     5       7
4     6       7
4     7       7
I calculated their ICULOS count and placed it in a new column named Count using this code:
dfy['Count'] = dfy.groupby(['P_ID'])['ICULOS'].transform('count')
Now, I want to remove those patients (based on P_ID) whose Count is less than 8. Note that I want to remove the whole patient record. So, after removing the patients with Count < 8, only P_ID = 2 will remain, as its count is 9.
The desired output:
P_ID  ICULOS  Count
2     1       9
2     2       9
2     3       9
2     4       9
2     5       9
2     6       9
2     7       9
2     8       9
2     9       9
I tried the following code, but for some reason it is not working for me. It did work for me, but when I re-ran the code after a few days it gave me 0 results. Can someone suggest better code? Thanks.
dfy = dfy.drop_duplicates(subset=['P_ID'],keep='first')
lis1 = dfy['P_ID'].tolist()
Icu_less_8 = dfy.loc[dfy['Count'] < 8]
lis2 = Icu_less_8.P_ID.to_list()
lis_3 = [k for k in tqdm_notebook(lis1) if k not in lis2]
# removing those patients who have ICULOS of less than 8 hours
df_1 = pd.DataFrame()
for l in tqdm_notebook(lis_3, desc='Progress'):
    df_1 = df_1.append(df.loc[df['P_ID'] == l])
You can filter rows directly on the count produced by transform, using Series.ge:
In [1521]: dfy[dfy.groupby(['P_ID'])['ICULOS'].transform('count').ge(8)]
Out[1521]:
P_ID ICULOS Count
5 2 1 9
6 2 2 9
7 2 3 9
8 2 4 9
9 2 5 9
10 2 6 9
11 2 7 9
12 2 8 9
13 2 9 9
EDIT after OP's comment: For multiple conditions, do:
In [1533]: x = dfy.groupby(['P_ID'])['ICULOS'].transform('count')
In [1539]: dfy.loc[x[x.ge(8) & x.le(72)].index]
Out[1539]:
P_ID ICULOS Count
5 2 1 9
6 2 2 9
7 2 3 9
8 2 4 9
9 2 5 9
10 2 6 9
11 2 7 9
12 2 8 9
13 2 9 9
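As a small side note, Series.between expresses the same two-sided condition in one call; this is just a sketch reusing the transformed count x from above:

x = dfy.groupby(['P_ID'])['ICULOS'].transform('count')
dfy[x.between(8, 72)]   # both bounds are inclusive by default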
I am trying to conduct a mixed model analysis but would like to include only individuals who have data at all available timepoints. Here is an example of what my dataframe looks like:
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
print(df)
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 3 1 5
13 3 2 4
14 3 4 5
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
I want to keep only individuals in the id column who have all 6 timepoints, i.e. IDs 1, 2, and 4 (and cut out all of ID 3's data).
Here's the ideal output:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 4 1 8
13 4 2 4
14 4 3 5
15 4 4 6
16 4 5 2
17 4 6 3
Any help much appreciated.
You can count the number of unique timepoints you have, then filter your dataframe accordingly with transform('nunique') and loc, keeping only the IDs that contain all 6 of them:
t = len(set(timepoint))
res = df.loc[df.groupby('id')['timepoint'].transform('nunique').eq(t)]
Prints:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
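If you prefer it, an equivalent (though usually slower) way to keep whole groups is DataFrame.groupby(...).filter, which drops every id whose group fails the predicate; a quick sketch:

t = df['timepoint'].nunique()
res = df.groupby('id').filter(lambda g: g['timepoint'].nunique() == t)
print(res)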
Here is a sample dataset:
id a
0 5 1
1 5 0
2 5 4
3 5 6
4 5 2
5 5 3
6 9 0
7 9 1
8 9 6
9 9 2
10 9 4
From the dataset, I want to generate a sum column. For the first 3 rows of each id group, sum is the running (cumulative) sum of a. From the 4th row onwards, each row contains the sum of the previous 3 rows' a values (grouped by id). Loop through each row.
Desired Output:
id a sum
0 5 1 1
1 5 0 1
2 5 4 5
3 5 6 5
4 5 2 10
5 5 3 12
6 9 0 0
7 9 1 1
8 9 6 7
9 9 2 7
10 9 4 9
Code I tried:
df['sum']=df['a'].rolling(min_periods=1, window=3).groupby(df['id']).cumsum()
You can define a function like the one below:

import numpy as np

def cumsum_last3(DF):
    nrow = DF.shape[0]
    DF["sum"] = 0
    DF["sum"].iloc[0] = DF["a"].iloc[0]
    DF["sum"].iloc[1] = DF["a"].iloc[0] + DF["a"].iloc[1]
    for a in range(nrow - 2):
        cums = np.sum(DF["a"].iloc[a:a + 3])
        DF["sum"].iloc[a + 2] = cums
    return DF

DF_cums = cumsum_last3(DF)
DF_cums
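Note that the function above neither groups by id nor restricts itself to the previous 3 rows, so its numbers differ from the desired output in the question. A vectorised sketch that does reproduce the expected sum column, assuming the rule is "cumulative sum over a group's first 3 rows, then the sum of the previous 3 rows", could look like this:

import pandas as pd

df = pd.DataFrame({'id': [5, 5, 5, 5, 5, 5, 9, 9, 9, 9, 9],
                   'a':  [1, 0, 4, 6, 2, 3, 0, 1, 6, 2, 4]})

def last3_sum(s):
    # sum of the previous 3 values; NaN for the first 3 rows of each group...
    prev3 = s.shift().rolling(3).sum()
    # ...which are filled with a plain cumulative sum instead
    return prev3.fillna(s.cumsum())

df['sum'] = df.groupby('id')['a'].transform(last3_sum).astype(int)
print(df)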
I have this table:
import pandas as pd
list1 = [1,1,2,2,3,3,3,3,4,1,1,1,1,2,2]
df = pd.DataFrame(list1)
df.columns = ['A']
I want to keep at most 3 consecutive duplicates, or keep everything in case there are fewer than 3 (or no) duplicates.
The result should look like this:
list2 = [1,1,2,2,3,3,3,4,1,1,1,2,2]
result = pd.DataFrame(list2)
result.columns = ['A']
Use GroupBy.head with consecutive groups created by comparing shifted values for inequality and taking the cumulative sum with Series.cumsum:
df1 = df.groupby(df.A.ne(df.A.shift()).cumsum()).head(3)
print (df1)
A
0 1
1 1
2 2
3 2
4 3
5 3
6 3
8 4
9 1
10 1
11 1
13 2
14 2
Detail:
print (df.A.ne(df.A.shift()).cumsum())
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 3
8 4
9 5
10 5
11 5
12 5
13 6
14 6
Name: A, dtype: int32
Let us do:

df[df.groupby(df['A'].diff().ne(0).cumsum())['A'].cumcount() < 3]

    A
0 1
1 1
2 2
3 2
4 3
5 3
6 3
8 4
9 1
10 1
11 1
13 2
14 2
Solving with itertools.groupby, which groups only consecutive duplicates, then slicing the first 3 elements of each group:
import itertools
pd.Series(itertools.chain.from_iterable([*g][:3] for i,g in itertools.groupby(df['A'])))
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 1
9 1
10 1
11 2
12 2
dtype: int64
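If you want the result back as a DataFrame with column A, as in result from the question, one possible sketch is:

import itertools
import pandas as pd

result = pd.DataFrame({'A': list(itertools.chain.from_iterable(
    list(g)[:3] for _, g in itertools.groupby(df['A'])))})
print(result)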
I have a pandas data structure, which I create like this:
test_inputs = pd.read_csv("../input/test.csv", delimiter=',')
Its shape
print(test_inputs.shape)
is this
(28000, 784)
I would like to print a subset of its rows, like this:
print(test_inputs[100:200, :])
print(test_inputs[100:200, :].shape)
However, I am getting:
TypeError: unhashable type: 'slice'
Any idea what could be wrong?
Indexing in pandas is really confusing: it looks like list indexing, but it is not. You need to use .iloc, which indexes by position:
print(test_inputs.iloc[100:200, :])
And if you don't use column selection you can omit it
print(test_inputs.iloc[100:200])
P.S. Using .loc is not what you want, as it looks not at the row number but at the row index (which can be filled with anything, not necessarily numbers, and not necessarily unique). A range in .loc finds the rows with index values 100 and 200 and returns the lines between them. If you have just created the DataFrame, .iloc and .loc may give the same result, but using .loc in this case is very bad practice: it will lead to hard-to-understand problems when the index changes for some reason (for example, after you select some subset of rows, the row number and the index will no longer coincide).
P.P.S. You can use test_inputs[100:200], but not test_inputs[100:200, :], because the pandas designers tried to combine several popular approaches into one construction. test_inputs['column'] is equivalent to test_inputs.loc[:, 'column'], yet, surprisingly, slicing with integers, test_inputs[100:200], is equivalent to test_inputs.iloc[100:200] (while slicing with non-integer values is equivalent to .loc row slicing). And if you pass a pair of values to [], it is treated as a tuple for multilevel column indexing, so multi_level_columns_df['level_1', 'level_2'] is equivalent to multi_level_columns_df.loc[:, ('level_1', 'level_2')]. That is why your original construction led to the error: a slice can't be used as part of a multilevel index.
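To illustrate that last point, here is a small sketch with a made-up two-level column frame; the names level_1 and level_2 are only for illustration:

import pandas as pd

cols = pd.MultiIndex.from_tuples([('level_1', 'level_2'), ('level_1', 'other')])
multi_level_columns_df = pd.DataFrame([[1, 2], [3, 4]], columns=cols)

# Both lines select the same single column
print(multi_level_columns_df['level_1', 'level_2'])
print(multi_level_columns_df.loc[:, ('level_1', 'level_2')])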
There are more possible solutions, but their output is not the same:
loc selects by labels, so both bounds are included, while iloc and plain slicing select by position, where the start bound is included and the upper bound is excluded (see the docs on selection by position):
import numpy as np
import pandas as pd

test_inputs = pd.DataFrame(np.random.randint(10, size=(28, 7)))
print(test_inputs.loc[10:20])
0 1 2 3 4 5 6
10 3 2 0 6 6 0 0
11 5 0 2 4 1 5 2
12 5 3 5 4 1 3 5
13 9 5 6 6 5 0 1
14 7 0 7 4 2 2 5
15 2 4 3 3 7 2 3
16 8 9 6 0 5 3 4
17 1 1 0 7 2 7 7
18 1 2 2 3 5 8 7
19 5 1 1 0 1 8 9
20 3 6 7 3 9 7 1
print(test_inputs.iloc[10:20])
0 1 2 3 4 5 6
10 3 2 0 6 6 0 0
11 5 0 2 4 1 5 2
12 5 3 5 4 1 3 5
13 9 5 6 6 5 0 1
14 7 0 7 4 2 2 5
15 2 4 3 3 7 2 3
16 8 9 6 0 5 3 4
17 1 1 0 7 2 7 7
18 1 2 2 3 5 8 7
19 5 1 1 0 1 8 9
print(test_inputs[10:20])
0 1 2 3 4 5 6
10 3 2 0 6 6 0 0
11 5 0 2 4 1 5 2
12 5 3 5 4 1 3 5
13 9 5 6 6 5 0 1
14 7 0 7 4 2 2 5
15 2 4 3 3 7 2 3
16 8 9 6 0 5 3 4
17 1 1 0 7 2 7 7
18 1 2 2 3 5 8 7
19 5 1 1 0 1 8 9
print(test_inputs.values[100:200, :])
print(test_inputs.values[100:200, :].shape)
This code also works for me.
I was facing the same problem, and even the above solutions couldn't fix it; it was some problem with pandas. What I did was convert the DataFrame into a NumPy array, and that fixed the issue.
import pandas as pd
import numpy as np
test_inputs = pd.read_csv("../input/test.csv", delimiter=',')
test_inputs = np.asarray(test_inputs)
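After that conversion, NumPy's positional slicing accepts the original row/column syntax from the question, for example:

print(test_inputs[100:200, :])
print(test_inputs[100:200, :].shape)   # (100, 784) for the shape in the question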