I have a dataframe like the one below, containing each patient's stay in the ICU (in hours), shown by ICULOS.
df # Main dataframe
dfy = df.copy()
dfy
P_ID  ICULOS  Count
   1       1      5
   1       2      5
   1       3      5
   1       4      5
   1       5      5
   2       1      9
   2       2      9
   2       3      9
   2       4      9
   2       5      9
   2       6      9
   2       7      9
   2       8      9
   2       9      9
   3       1      3
   3       2      3
   3       3      3
   4       1      7
   4       2      7
   4       3      7
   4       4      7
   4       5      7
   4       6      7
   4       7      7
I calculated each patient's ICULOS count and placed it in a new column named Count using this code:
dfy['Count'] = dfy.groupby(['P_ID'])['ICULOS'].transform('count')
Now I want to remove patients, based on P_ID, whose Count is less than 8 (note: I want to remove the whole patient record). After removing the patients with Count < 8, only P_ID = 2 will remain, since its count is 9.
The desired output:
P_ID  ICULOS  Count
   2       1      9
   2       2      9
   2       3      9
   2       4      9
   2       5      9
   2       6      9
   2       7      9
   2       8      9
   2       9      9
I tried the following code, but for some reason it is not working for me. It did work once, but when I re-ran it a few days later it gave me 0 results. Can someone suggest better code? Thanks.
import pandas as pd
from tqdm.notebook import tqdm_notebook

dfy = dfy.drop_duplicates(subset=['P_ID'], keep='first')
lis1 = dfy['P_ID'].tolist()
Icu_less_8 = dfy.loc[dfy['Count'] < 8]
lis2 = Icu_less_8.P_ID.to_list()
lis_3 = [k for k in tqdm_notebook(lis1) if k not in lis2]
# removing those patients who have an ICULOS of less than 8 hours
df_1 = pd.DataFrame()
for l in tqdm_notebook(lis_3, desc='Progress'):
    df_1 = df_1.append(df.loc[df['P_ID'] == l])
You can filter rows directly on the result of transform using Series.ge:
In [1521]: dfy[dfy.groupby(['P_ID'])['ICULOS'].transform('count').ge(8)]
Out[1521]:
P_ID ICULOS Count
5 2 1 9
6 2 2 9
7 2 3 9
8 2 4 9
9 2 5 9
10 2 6 9
11 2 7 9
12 2 8 9
13 2 9 9
EDIT after OP's comment: For multiple conditions, do:
In [1533]: x = dfy.groupby(['P_ID'])['ICULOS'].transform('count')
In [1539]: dfy.loc[x[x.ge(8) & x.le(72)].index]
Out[1539]:
P_ID ICULOS Count
5 2 1 9
6 2 2 9
7 2 3 9
8 2 4 9
9 2 5 9
10 2 6 9
11 2 7 9
12 2 8 9
13 2 9 9
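If you prefer to skip the helper column entirely, a small alternative sketch uses DataFrameGroupBy.filter, which keeps or drops each patient's whole record at once:

# equivalent sketch: keep only patients whose full record has at least 8 rows
out = dfy.groupby('P_ID').filter(lambda g: len(g) >= 8)

Note that filter calls the lambda once per group in Python, so the transform approach above is usually faster on frames with many patients.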
Here is a sample dataset:
id a
0 5 1
1 5 0
2 5 4
3 5 6
4 5 2
5 5 3
6 9 0
7 9 1
8 9 6
9 9 2
10 9 4
From the dataset, I want to generate a column sum. For the first 3 rows of each id group, sum is the running total of a so far. From the 4th row of each group onward, each row contains the sum of the previous 3 rows' a values (still grouped by id). The calculation loops through each row.
Desired Output:
id a sum
0 5 1 1
1 5 0 1
2 5 4 5
3 5 6 5
4 5 2 10
5 5 3 12
6 9 0 0
7 9 1 1
8 9 6 7
9 9 2 7
10 9 4 9
Code I tried:
df['sum']=df['a'].rolling(min_periods=1, window=3).groupby(df['id']).cumsum()
You can define a function like the one below:
import numpy as np

def cumsum_last3(DF):
    nrow = DF.shape[0]
    sums = np.zeros(nrow, dtype=int)
    # first two rows: running total of "a" so far
    sums[0] = DF["a"].iloc[0]
    sums[1] = DF["a"].iloc[0] + DF["a"].iloc[1]
    # each later row: sum of the 3-row window of "a" ending at that row
    for i in range(nrow - 2):
        sums[i + 2] = DF["a"].iloc[i:i + 3].sum()
    # assign the whole column at once instead of writing through
    # DF["sum"].iloc[...], which is chained assignment and unreliable
    DF["sum"] = sums
    return DF

DF_cums = cumsum_last3(DF)
DF_cums
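For reference, here is a vectorized sketch of the rule as the question states it (a running total for the first 3 rows of each id group, then the sum of the previous 3 rows), assuming each group's rows are contiguous as in the sample:

import numpy as np

# 3-row rolling sum of "a" ending at the current row, computed per id group
s = (df.groupby('id')['a']
       .rolling(window=3, min_periods=1).sum()
       .reset_index(level=0, drop=True))
# from the 4th row of each group onward, use the window ending one row earlier
prev = s.groupby(df['id']).shift(1)
df['sum'] = np.where(df.groupby('id').cumcount() < 3, s, prev).astype(int)

On the sample above this should reproduce the desired sum column exactly, without a Python-level loop over rows.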
I have a large dataframe (5 rows x 92,579 columns) in the following format:
1 2 3 4 5 6 7 8 9 10 11 ... 92569 92570 92571 92572 92573 92574 92575 92576 92577 92578 92579
0 10 9 8 5 5 10 1 1 6 2 3 ... 9 1 8 3 2 5 5 5 2 2 8
1 3 1 7 4 4 3 8 8 3 6 7 ... 1 8 7 5 6 4 4 4 2 6 7
2 6 4 2 9 7 6 5 5 6 7 2 ... 4 5 2 6 6 9 5 9 3 10 2
3 3 8 4 4 7 3 1 1 3 7 6 ... 8 1 5 7 2 4 1 4 6 10 2
4 4 6 5 5 5 4 1 1 4 8 10 ... 6 1 7 3 6 5 5 5 8 2 9
Each of the entries ranges from 1 to 10 (representing an assignment to one of 10 clusters).
I want to create a 92,579 x 92,579 matrix that represents how many times (i.e., in how many rows) the variables in columns i and j have the same value. For example, variables 4 and 5 have the same value in 3 rows, so entries (4, 5) and (5, 4) of the co-occurrence matrix should be 3.
I only need the upper triangular portion of the desired matrix (since it will be symmetric).
I've looked at similar questions here, but they don't address both of these issues:
How to do this efficiently for a very large matrix
How to do this for non-binary entries
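A possible sketch: since the entries are cluster labels, columns i and j match in a row exactly when they carry the same label, so for each row and each label you can increment the block of columns holding that label. This assumes the full matrix fits in memory (92,579² int8 entries is roughly 8.6 GB; counts never exceed the 5 rows, so int8 suffices):

import numpy as np

A = df.to_numpy()                       # shape (5, 92579), values 1..10
n = A.shape[1]
C = np.zeros((n, n), dtype=np.int8)     # at most 5 co-occurrences per pair
for row in A:
    for k in np.unique(row):
        idx = np.flatnonzero(row == k)  # columns assigned to cluster k
        C[np.ix_(idx, idx)] += 1        # all pairs within the cluster co-occur
# C is symmetric; take the upper triangle (diagonal excluded) if needed
upper = np.triu(C, k=1)

If 8.6 GB (double that counting the np.triu copy) is too much, the same loop can be run over horizontal blocks of C, writing each finished block to disk before moving on.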
I know how to remove the index, using .to_string(index=False), but I can't figure out how to remove the column names.
matrix = [
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9]
]
def print_sudoku(s):
    df = pd.DataFrame(s)
    dff = df.to_string(index=False)
    print(dff)

print_sudoku(matrix)
The result is this:
0 1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
I want to remove the first row, which is the row of column names.
You can use header=False when converting to string: df.to_string(index=False, header=False)
ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_string.html
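Applied to the print_sudoku helper from the question, a minimal sketch:

import pandas as pd

def print_sudoku(s):
    df = pd.DataFrame(s)
    # header=False drops the column names; index=False drops the row labels
    print(df.to_string(index=False, header=False))

print_sudoku(matrix)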
What is the fastest way to sort the city_sales_rep DataFrame by the index level 'city', bearing in mind that a MultiIndex is in place? The order of the index should be exactly the order in which 'city' appears in the second DataFrame, city.
Is there an easy and fast way to do this sorting in one go?
import numpy as np
import pandas as pd

A = pd.DataFrame(np.random.randint(2,10,(10,3)))
A.columns = ['city','sales_rep','sales']
city_sales_rep = A.groupby(['city','sales_rep']).sum()
city = A.groupby(['city'])['sales'].sum().sort_values()
In my example, this produces the following city_sales_rep:
sales
city sales_rep
2 9 5
4 5 2
7 5
9 2
5 4 4
6 8 6
9 9
7 2 2
3 8
6 4
And city:
sales
city
5 4
2 5
4 9
7 14
6 15
While this seems to do what I want, it feels horribly inefficient:
city.join(city_sales_rep.reset_index(level=1),lsuffix='_x')[['sales_rep','sales']].reset_index().set_index(['city','sales_rep'])
P.S.: Edits to the title are welcome; I feel it's somewhat clunky.
One possible solution, but not sure about performance:
np.random.seed(2019)
A = pd.DataFrame(np.random.randint(2,10,(10,3)))
A.columns = ['city','sales_rep','sales']
city_sales_rep = A.groupby(['city','sales_rep']).sum()
a = np.argsort(city_sales_rep.groupby(['city'])['sales'].transform('sum'))
city_sales_rep = city_sales_rep.iloc[a]
print (city_sales_rep)
sales
city sales_rep
8 2 7
7 2 8
6 2 9
9 6 4
7 9
2 2 9
4 9
6 7
7 5
Another solution, with a new column:
city_sales_rep = A.groupby(['city','sales_rep']).sum()
city_sales_rep['new'] = city_sales_rep.groupby(['city'])['sales'].transform('sum')
city_sales_rep = city_sales_rep.sort_values('new')
print (city_sales_rep)
sales new
city sales_rep
8 2 7 7
7 2 8 8
6 2 9 9
9 6 4 13
7 9 13
2 2 9 30
4 9 30
6 7 30
7 5 30
If different cities can have duplicated sums and you are on pandas 0.23.0+, it is possible to sort by an index level and a column together (check the docs):
city_sales_rep = city_sales_rep.sort_values(['new','city'])
print (city_sales_rep)
sales new
city sales_rep
8 2 7 7
7 2 8 8
6 2 9 9
9 6 4 13
7 9 13
2 2 9 30
4 9 30
6 7 30
7 5 30
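Alternatively, since the desired order is exactly the index order of the second DataFrame city, a reindex on the 'city' level should achieve the same result without a helper column (a sketch, assuming every city in city_sales_rep also appears in city):

# reorder the outer 'city' level to match the order of the city DataFrame;
# each city's sales_rep rows move as one block
city_sales_rep = city_sales_rep.reindex(city.index, level='city')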