Bundle columns of a DataFrame into hierarchical index - python

If I have preexisting columns (say 12 columns, all with unique names), and I want to organize them into two "header" columns, such as 8 assigned to Detail and 4 assigned to Summary, what is the most effective approach besides sorting them, manually creating a new index, and then swapping out the indices?
Happy to provide more example detail, but that's the gist of what is a pretty generic problem.

You need pandas' multi-index columns capability. It's important to rename() the columns before reindex() so that no data is lost.
import random
import pandas as pd

df = pd.DataFrame({f"col-{i}": [random.randint(1, 10) for _ in range(10)]
                   for i in range(12)})
header = [f"col-{i}" for i in range(8)]
# build a multi-index
mi = pd.MultiIndex.from_tuples(
    [("Header" if c in header else "Detail", c) for c in df.columns],
    names=("Category", "Name"))
# rename before reindex to prevent data loss
df = df.rename(columns={c:mi[i] for i,c in enumerate(df.columns)}).reindex(columns=mi)
print(df.to_string())
output
Category Header Detail
Name col-0 col-1 col-2 col-3 col-4 col-5 col-6 col-7 col-8 col-9 col-10 col-11
0 5 5 6 1 8 3 8 6 8 2 8 10
1 2 7 10 5 2 10 5 10 10 7 6 1
2 10 1 1 2 7 9 2 9 4 4 7 6
3 8 10 1 3 3 4 10 10 9 7 6 8
4 6 8 7 2 5 4 3 3 7 9 8 6
5 6 4 4 4 1 5 8 4 4 1 6 8
6 3 7 3 8 8 4 6 1 5 10 5 10
7 5 1 10 9 9 7 8 2 6 7 10 4
8 2 2 1 4 8 8 7 2 5 9 9 9
9 8 6 5 6 2 8 2 8 10 7 9 3
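An alternative worth mentioning (a sketch, not part of the original answer): pd.concat with keys= builds the extra column level directly, without constructing the MultiIndex by hand.

```python
import random
import pandas as pd

df = pd.DataFrame({f"col-{i}": [random.randint(1, 10) for _ in range(10)]
                   for i in range(12)})
header = [f"col-{i}" for i in range(8)]
detail = [c for c in df.columns if c not in header]

# keys= labels each concatenated block, creating the top column level
bundled = pd.concat([df[header], df[detail]], axis=1, keys=["Header", "Detail"])
bundled.columns.names = ("Category", "Name")
```

Selecting a bundle is then just `bundled["Header"]`.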

Calculate count of a column based on other column in python dataframe

I have a dataframe like the one below, holding each patient's stay in the ICU (in hours), shown by ICULOS.
df # Main dataframe
dfy = df.copy()
dfy
P_ID  ICULOS  Count
   1       1      5
   1       2      5
   1       3      5
   1       4      5
   1       5      5
   2       1      9
   2       2      9
   2       3      9
   2       4      9
   2       5      9
   2       6      9
   2       7      9
   2       8      9
   2       9      9
   3       1      3
   3       2      3
   3       3      3
   4       1      7
   4       2      7
   4       3      7
   4       4      7
   4       5      7
   4       6      7
   4       7      7
I calculated their ICULOS count and placed it in the new column named Count using the code:
dfy['Count'] = dfy.groupby(['P_ID'])['ICULOS'].transform('count')
Now I want to remove those patients (by P_ID) whose Count is less than 8; note that I want to remove the whole patient record. After removing the patients with Count < 8, only P_ID = 2 will remain, since its count is 9.
The desired output:
P_ID  ICULOS  Count
   2       1      9
   2       2      9
   2       3      9
   2       4      9
   2       5      9
   2       6      9
   2       7      9
   2       8      9
   2       9      9
I tried the following code, but for some reason it is not working for me. It did work at first, but when I re-ran it a few days later it gave 0 results. Can someone suggest better code? Thanks.
dfy = dfy.drop_duplicates(subset=['P_ID'], keep='first')
lis1 = dfy['P_ID'].tolist()
Icu_less_8 = dfy.loc[dfy['Count'] < 8]
lis2 = Icu_less_8.P_ID.to_list()
lis_3 = [k for k in tqdm_notebook(lis1) if k not in lis2]
# removing those patients who have ICULOS of less than 8 hours
df_1 = pd.DataFrame()
for l in tqdm_notebook(lis_3, desc='Progress'):
    df_1 = df_1.append(df.loc[df['P_ID'] == l])
You can filter the rows directly on the result of transform, using Series.ge:
In [1521]: dfy[dfy.groupby(['P_ID'])['ICULOS'].transform('count').ge(8)]
Out[1521]:
P_ID ICULOS Count
5 2 1 9
6 2 2 9
7 2 3 9
8 2 4 9
9 2 5 9
10 2 6 9
11 2 7 9
12 2 8 9
13 2 9 9
EDIT after OP's comment: For multiple conditions, do:
In [1533]: x = dfy.groupby(['P_ID'])['ICULOS'].transform('count')
In [1539]: dfy.loc[x[x.ge(8) & x.le(72)].index]
Out[1539]:
P_ID ICULOS Count
5 2 1 9
6 2 2 9
7 2 3 9
8 2 4 9
9 2 5 9
10 2 6 9
11 2 7 9
12 2 8 9
13 2 9 9

Pandas - Cumulative sum of previous row values

Here is sample dataset:
id a
0 5 1
1 5 0
2 5 4
3 5 6
4 5 2
5 5 3
6 9 0
7 9 1
8 9 6
9 9 2
10 9 4
From this dataset, I want to generate a column sum. For the first 3 rows of each id group, sum is the running (cumulative) sum of a. From the 4th row of each group onward, each row contains the sum of the previous 3 rows' a values (still grouped by id), looping through each row.
Desired Output:
id a sum
0 5 1 1
1 5 0 1
2 5 4 5
3 5 6 5
4 5 2 10
5 5 3 12
6 9 0 0
7 9 1 1
8 9 6 7
9 9 2 7
10 9 4 9
Code I tried:
df['sum']=df['a'].rolling(min_periods=1, window=3).groupby(df['id']).cumsum()
You can define a function like the one below, applying the logic per id group:
import numpy as np
import pandas as pd

def cumsum_last3(df):
    df = df.copy()
    df["sum"] = 0
    for _, g in df.groupby("id"):
        a = g["a"].to_numpy()
        s = np.empty(len(a), dtype=int)
        for i in range(len(a)):
            if i < 3:
                # first three rows of a group: running (cumulative) sum
                s[i] = a[: i + 1].sum()
            else:
                # later rows: sum of the previous three rows
                s[i] = a[i - 3 : i].sum()
        df.loc[g.index, "sum"] = s
    return df

df_cums = cumsum_last3(df)
df_cums
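The same rule can also be sketched without an explicit Python loop, using per-group rolling sums (this assumes the "previous 3 rows" reading of the question above):

```python
import pandas as pd

df = pd.DataFrame({"id": [5] * 6 + [9] * 5,
                   "a": [1, 0, 4, 6, 2, 3, 0, 1, 6, 2, 4]})

def last3(s):
    # first three rows of a group: expanding sum, capped at window 3
    r = s.rolling(3, min_periods=1).sum()
    # from the 4th row on: sum of the *previous* three rows
    r.iloc[3:] = s.rolling(3).sum().shift(1).iloc[3:].to_numpy()
    return r

df["sum"] = df.groupby("id")["a"].transform(last3).astype(int)
```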

Efficiently calculate co-occurrence matrix

I have a large dataframe (5 rows x 92,579 columns) in the following format:
1 2 3 4 5 6 7 8 9 10 11 ... 92569 92570 92571 92572 92573 92574 92575 92576 92577 92578 92579
0 10 9 8 5 5 10 1 1 6 2 3 ... 9 1 8 3 2 5 5 5 2 2 8
1 3 1 7 4 4 3 8 8 3 6 7 ... 1 8 7 5 6 4 4 4 2 6 7
2 6 4 2 9 7 6 5 5 6 7 2 ... 4 5 2 6 6 9 5 9 3 10 2
3 3 8 4 4 7 3 1 1 3 7 6 ... 8 1 5 7 2 4 1 4 6 10 2
4 4 6 5 5 5 4 1 1 4 8 10 ... 6 1 7 3 6 5 5 5 8 2 9
Each of the entries ranges from 1 to 10 (representing an assignment to one of 10 clusters).
I want to create a 92579 x 92579 matrix that represents how many times (ie. in how many rows) the variables in columns i and j have the same value. For example, variables 4 and 5 have the same value in 3 rows, so entries i_{4,5} and i_{5,4} of the co-occurrence matrix should be 3.
I only need the upper triangular portion of the desired matrix (since it will be symmetric).
I've looked at similar questions here, but they don't address both of these issues:
- how to do this efficiently for a very large matrix
- how to do this for non-binary entries
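One common approach (a sketch, not taken from an answer in this thread): for each cluster label k, the boolean indicator (A == k) has a Gram matrix that counts rows where both columns equal k; summing over the 10 labels gives the full co-occurrence matrix. Shown on a small stand-in, since a 92,579 x 92,579 int64 result needs roughly 64 GB, so in practice you would chunk the columns or use a smaller dtype:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 5, 200          # small stand-in for the 5 x 92579 frame
A = rng.integers(1, 11, size=(n_rows, n_cols))

# sum of per-label Gram matrices = co-occurrence counts
C = np.zeros((n_cols, n_cols), dtype=np.int64)
for k in range(1, 11):
    ind = (A == k).astype(np.int64)
    C += ind.T @ ind

C_upper = np.triu(C)             # only the upper triangle is needed
```

The matrix products are BLAS calls, so this is far faster than comparing column pairs in Python.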

How to print a Pandas DataFrame in Jupyter Notebook where it doesn't print the Index or the Column Names

I know how to remove the Index, using the .to_string(index=False). But I'm not able to figure out how to remove the column names.
matrix = [
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9]
]
import pandas as pd

def print_sudoku(s):
    df = pd.DataFrame(s)
    dff = df.to_string(index=False)
    print(dff)
print_sudoku(matrix)
The result is this.
0 1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
I want to remove the first row, which is the row of column names.
You can use header=False when converting to string: df.to_string(index=False, header=False)
ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_string.html
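Applied to the example above, print_sudoku becomes (a minimal sketch):

```python
import pandas as pd

def print_sudoku(s):
    df = pd.DataFrame(s)
    # header=False drops the column names, index=False the row labels
    print(df.to_string(index=False, header=False))

print_sudoku([[1, 2, 3], [4, 5, 6]])
```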

Pandas sort multi-index based on aggregation of a column if I were to consider one level of the index only

What is the fastest way to sort the city_sales_rep DataFrame by the index 'city', keeping in mind that a multi-index is in place? The order of the index should be exactly the order in which the index is ordered in the second DataFrame, city.
Is there an easy and fast way to do this sorting in one go?
import numpy as np
import pandas as pd

A = pd.DataFrame(np.random.randint(2,10,(10,3)))
A.columns = ['city','sales_rep','sales']
city_sales_rep = A.groupby(['city','sales_rep']).sum()
city = A.groupby(['city'])['sales'].sum().sort_values()
Which leads in my example to city_sales_rep:
sales
city sales_rep
2 9 5
4 5 2
7 5
9 2
5 4 4
6 8 6
9 9
7 2 2
3 8
6 4
And city
sales
city
5 4
2 5
4 9
7 14
6 15
While this seems to do what I want, it feels horribly inefficient:
city.join(city_sales_rep.reset_index(level=1),lsuffix='_x')[['sales_rep','sales']].reset_index().set_index(['city','sales_rep'])
P.S.: Edits to the title are welcome, I feel like it's somewhat clunky.
One possible solution, but not sure about performance:
np.random.seed(2019)
A = pd.DataFrame(np.random.randint(2,10,(10,3)))
A.columns = ['city','sales_rep','sales']
city_sales_rep = A.groupby(['city','sales_rep']).sum()
a = np.argsort(city_sales_rep.groupby(['city'])['sales'].transform('sum'))
city_sales_rep = city_sales_rep.iloc[a]
print (city_sales_rep)
sales
city sales_rep
8 2 7
7 2 8
6 2 9
9 6 4
7 9
2 2 9
4 9
6 7
7 5
Another solution, with a new column:
city_sales_rep = A.groupby(['city','sales_rep']).sum()
city_sales_rep['new'] = city_sales_rep.groupby(['city'])['sales'].transform('sum')
city_sales_rep = city_sales_rep.sort_values('new')
print (city_sales_rep)
sales new
city sales_rep
8 2 7 7
7 2 8 8
6 2 9 9
9 6 4 13
7 9 13
2 2 9 30
4 9 30
6 7 30
7 5 30
If different cities can have duplicated sums, and you are on pandas 0.23.0+, it is possible to sort by an index level and a column together (check the docs):
city_sales_rep = city_sales_rep.sort_values(['new','city'])
print (city_sales_rep)
sales new
city sales_rep
8 2 7 7
7 2 8 8
6 2 9 9
9 6 4 13
7 9 13
2 2 9 30
4 9 30
6 7 30
7 5 30
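One more possibility, not shown in the answers above (a sketch): DataFrame.reindex accepts a level argument, so the sorted city order can be applied to level 0 of the multi-index directly:

```python
import numpy as np
import pandas as pd

np.random.seed(2019)
A = pd.DataFrame(np.random.randint(2, 10, (10, 3)),
                 columns=["city", "sales_rep", "sales"])
city_sales_rep = A.groupby(["city", "sales_rep"]).sum()
city = A.groupby("city")["sales"].sum().sort_values()

# reorder level 0 ('city') to follow the sorted per-city totals
result = city_sales_rep.reindex(city.index, level=0)
```

This avoids both the argsort indirection and the helper column.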
