I have a large dataframe (5 rows x 92,579 columns) in the following format:
1 2 3 4 5 6 7 8 9 10 11 ... 92569 92570 92571 92572 92573 92574 92575 92576 92577 92578 92579
0 10 9 8 5 5 10 1 1 6 2 3 ... 9 1 8 3 2 5 5 5 2 2 8
1 3 1 7 4 4 3 8 8 3 6 7 ... 1 8 7 5 6 4 4 4 2 6 7
2 6 4 2 9 7 6 5 5 6 7 2 ... 4 5 2 6 6 9 5 9 3 10 2
3 3 8 4 4 7 3 1 1 3 7 6 ... 8 1 5 7 2 4 1 4 6 10 2
4 4 6 5 5 5 4 1 1 4 8 10 ... 6 1 7 3 6 5 5 5 8 2 9
Each of the entries ranges from 1 to 10 (representing an assignment to one of 10 clusters).
I want to create a 92579 x 92579 matrix that represents how many times (i.e., in how many rows) the variables in columns i and j have the same value. For example, variables 4 and 5 have the same value in 3 rows, so entries (4, 5) and (5, 4) of the co-occurrence matrix should both be 3.
I only need the upper triangular portion of the desired matrix (since it will be symmetric).
I've looked at similar questions here, but they don't address both of these issues:
How to do this efficiently for a very large matrix
How to do this for non-binary entries
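A minimal NumPy sketch of one possible approach (a sketch under the assumption that the full uint8 result, roughly 8.6 GB, plus one boolean temporary of the same size fit in memory; otherwise the same comparison can be done in column blocks; df is the dataframe above):
import numpy as np

A = df.to_numpy()                  # shape (5, 92579), entries 1..10
n = A.shape[1]
# Counts are at most 5 (one per row), so uint8 suffices: the
# 92579 x 92579 result is ~8.6 GB instead of ~68 GB as int64.
C = np.zeros((n, n), dtype=np.uint8)
for row in A:                      # only 5 passes, one per row
    # add 1 wherever columns i and j hold the same value in this row
    C += row[:, None] == row[None, :]
for i in range(1, n):              # zero the lower triangle in place
    C[i, :i] = 0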
Suppose I have the following dataframe
import pandas as pd
df = pd.DataFrame({'a': [1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4],
'b': [3,4,3,7,5,9,4,2,5,6,7,8,4,2,4,5,8,0]})
a b
0 1 3
1 1 4
2 1 3
3 2 7
4 2 5
5 2 9
6 2 4
7 2 2
8 3 5
9 3 6
10 3 7
11 3 8
12 4 4
13 4 2
14 4 4
15 4 5
16 4 8
17 4 0
And I would like to make a new column c with values 1 to n, where n depends on the value of column a, as follows:
a b c
0 1 3 1
1 1 4 2
2 1 3 3
3 2 7 1
4 2 5 2
5 2 9 3
6 2 4 4
7 2 2 5
8 3 5 1
9 3 6 2
10 3 7 3
11 3 8 4
12 4 4 1
13 4 2 2
14 4 4 3
15 4 5 4
16 4 8 5
17 4 0 6
While I can write this with a for loop, my dataframe is huge and that's computationally costly. Is there an efficient way to generate such a column? Thanks.
Use GroupBy.cumcount, which numbers the rows within each group starting from 0 (hence the .add(1)):
df['c'] = df.groupby('a').cumcount().add(1)
print(df)
# Output
a b c
0 1 3 1
1 1 4 2
2 1 3 3
3 2 7 1
4 2 5 2
5 2 9 3
6 2 4 4
7 2 2 5
8 3 5 1
9 3 6 2
10 3 7 3
11 3 8 4
12 4 4 1
13 4 2 2
14 4 4 3
15 4 5 4
16 4 8 5
17 4 0 6
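One detail worth noting (a small illustrative example, not part of the original answer): cumcount numbers rows in the order they appear within each group, so it also behaves correctly when a group's rows are interleaved:
tmp = pd.DataFrame({'a': [1, 2, 1, 2]})
print(tmp.groupby('a').cumcount().add(1).tolist())
# [1, 1, 2, 2]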
I am trying to conduct a mixed-model analysis but would like to include only individuals who have data at all available timepoints. Here is an example of what my dataframe looks like:
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
print(df)
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 3 1 5
13 3 2 4
14 3 4 5
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
I want to keep only the individuals in the id column who have all 6 timepoints, i.e. IDs 1, 2, and 4 (cutting out all of ID 3's data).
Here's the ideal output:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 4 1 8
13 4 2 4
14 4 3 5
15 4 4 6
16 4 5 2
17 4 6 3
Any help much appreciated.
You can count the number of unique timepoints, then filter your dataframe accordingly with transform('nunique') and loc, keeping only the IDs that have all 6 of them:
t = df['timepoint'].nunique()   # 6 distinct timepoints in the data
res = df.loc[df.groupby('id')['timepoint'].transform('nunique').eq(t)]
Prints:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
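Note that the filtered frame keeps the original row index (15-20 for id 4). If you want it renumbered as in the ideal output above, add one more step:
res = res.reset_index(drop=True)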
Here is sample dataset:
id a
0 5 1
1 5 0
2 5 4
3 5 6
4 5 2
5 5 3
6 9 0
7 9 1
8 9 6
9 9 2
10 9 4
From the dataset, I want to generate a column sum, grouped by id: for the first 3 rows of each group, sum is the running cumulative sum of a; from the 4th row on, each row contains the sum of the previous 3 rows' a values, computed row by row.
Desired Output:
id a sum
0 5 1 1
1 5 0 1
2 5 4 5
3 5 6 5
4 5 2 10
5 5 3 12
6 9 0 0
7 9 1 1
8 9 6 7
9 9 2 7
10 9 4 9
Code I tried (it fails: a Rolling object has no groupby method, so grouping would have to come before rolling):
df['sum']=df['a'].rolling(min_periods=1, window=3).groupby(df['id']).cumsum()
You can define a function like the one below and apply it to each id group with transform:
import numpy as np
import pandas as pd

def cumsum_last3(a):
    # cumulative sum over the first 3 rows, then the sum of the
    # previous 3 rows from the 4th row on
    out = np.empty(len(a), dtype=a.dtype)
    out[:3] = np.cumsum(a[:3])
    for i in range(3, len(a)):
        out[i] = a[i - 3:i].sum()
    return out

df['sum'] = df.groupby('id')['a'].transform(lambda s: cumsum_last3(s.to_numpy()))
print(df)
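A vectorized alternative (a sketch built from rolling sums under the same rule; not the original answer's approach):
def last3(s):
    out = s.rolling(3, min_periods=1).sum()               # running sum over the first 3 rows
    out.iloc[3:] = s.shift(1).rolling(3).sum().iloc[3:]   # sum of the previous 3 rows afterwards
    return out

df['sum'] = df.groupby('id')['a'].transform(last3).astype(int)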
If I have preexisting columns (say 12 columns, all with unique names), and I want to organize them into two "header" columns, such as 8 assigned to Detail and 4 assigned to Summary, what is the most effective approach besides sorting them, manually creating a new index, and then swapping out the indices?
Happy to provide more example detail, but that's the gist of what is a pretty generic problem.
You need pandas' column MultiIndex capability. It's important to rename() the columns before reindex() so that no data is lost.
df = pd.DataFrame({f"col-{i}":[random.randint(1,10) for i in range(10)] for i in range(12)})
header = [f"col-{i}" for i in range(8)]
header
# build a multi-index
mi = pd.MultiIndex.from_tuples([tuple(["Header" if c in header else "Detail", c])
for c in df.columns], names=('Category', 'Name'))
# rename before reindex to prevent data loss
df = df.rename(columns={c:mi[i] for i,c in enumerate(df.columns)}).reindex(columns=mi)
print(df.to_string())
output
Category Header Detail
Name col-0 col-1 col-2 col-3 col-4 col-5 col-6 col-7 col-8 col-9 col-10 col-11
0 5 5 6 1 8 3 8 6 8 2 8 10
1 2 7 10 5 2 10 5 10 10 7 6 1
2 10 1 1 2 7 9 2 9 4 4 7 6
3 8 10 1 3 3 4 10 10 9 7 6 8
4 6 8 7 2 5 4 3 3 7 9 8 6
5 6 4 4 4 1 5 8 4 4 1 6 8
6 3 7 3 8 8 4 6 1 5 10 5 10
7 5 1 10 9 9 7 8 2 6 7 10 4
8 2 2 1 4 8 8 7 2 5 9 9 9
9 8 6 5 6 2 8 2 8 10 7 9 3
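With the MultiIndex in place, whole categories can be selected at once, e.g. (a usage sketch, not part of the original answer):
print(df["Header"])   # just the eight col-0 .. col-7 columns
print(df["Detail"])   # the remaining four columns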
Assume I have a dataframe like this, for example:
0 1 2 3 4 5 6 7 8 9
0 8 9 2 1 6 2 6 8 6 3
1 1 1 8 3 1 6 3 6 3 9
2 1 4 3 5 9 3 5 9 2 3
3 4 6 3 8 4 3 1 5 1 1
4 1 8 5 3 9 6 1 7 2 2
5 6 6 7 9 1 8 2 3 2 8
6 8 3 6 9 9 5 8 4 7 7
7 8 3 3 8 7 1 4 9 7 2
8 7 6 1 4 8 1 6 9 6 6
9 3 3 2 4 8 1 8 1 1 8
10 7 7 5 7 1 4 1 8 8 6
11 6 3 2 7 6 5 7 4 8 7
I would like to group the rows into "blocks" of a given length and then flatten each block into a single row. So, for example, if the block length were 3, the result here would be:
0 1 2 3 4 5 6 7 8 9 10 ... 19 20 21 22 23 24 25 26 27 28 29
2 8 9 2 1 6 2 6 8 6 3 1 ... 9 1 4 3 5 9 3 5 9 2 3
5 4 6 3 8 4 3 1 5 1 1 1 ... 2 6 6 7 9 1 8 2 3 2 8
8 8 3 6 9 9 5 8 4 7 7 8 ... 2 7 6 1 4 8 1 6 9 6 6
11 3 3 2 4 8 1 8 1 1 8 7 ... 6 6 3 2 7 6 5 7 4 8 7
How to achieve this?
I think you need reshape:
block_len = 3   # rows per block; len(df) must be divisible by this
df = pd.DataFrame(df.values.reshape(-1, block_len * df.shape[1]))
print(df)
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 \
0 8 9 2 1 6 2 6 8 6 3 ... 1 4 3 5 9 3 5 9
1 4 6 3 8 4 3 1 5 1 1 ... 6 6 7 9 1 8 2 3
2 8 3 6 9 9 5 8 4 7 7 ... 7 6 1 4 8 1 6 9
3 3 3 2 4 8 1 8 1 1 8 ... 6 3 2 7 6 5 7 4
28 29
0 2 3
1 2 8
2 6 6
3 8 7
[4 rows x 30 columns]
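Note that reshape requires the number of rows to be an exact multiple of the block length. If it isn't, one option (assuming leftover rows should simply be dropped) is to truncate first:
n = len(df) // block_len * block_len   # largest multiple of block_len
df = df.iloc[:n]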
I found this solution; maybe someone comes up with a better one:
def toBlocks(df, blocklen):
    # stack each row with the (blocklen - 1) rows before it, oldest first
    shifted = [df.shift(periods=p) for p in range(blocklen - 1, -1, -1)]
    # keep every blocklen-th row so the blocks don't overlap
    return pd.concat(shifted, axis=1)[blocklen - 1::blocklen]