Related
This question already has answers here:
Add a sequential counter column on groups to a pandas dataframe
(4 answers)
Closed 9 months ago.
Suppose I have the following dataframe
import pandas as pd
df = pd.DataFrame({'a': [1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4],
'b': [3,4,3,7,5,9,4,2,5,6,7,8,4,2,4,5,8,0]})
a b
0 1 3
1 1 4
2 1 3
3 2 7
4 2 5
5 2 9
6 2 4
7 2 2
8 3 5
9 3 6
10 3 7
11 3 8
12 4 4
13 4 2
14 4 4
15 4 5
16 4 8
17 4 0
And I would like to make a new column c with values 1 to n where n depends on the value of column a as follow:
a b c
0 1 3 1
1 1 4 2
2 1 3 3
3 2 7 1
4 2 5 2
5 2 9 3
6 2 4 4
7 2 2 5
8 3 5 1
9 3 6 2
10 3 7 3
11 3 8 4
12 4 4 1
13 4 2 2
14 4 4 3
15 4 5 4
16 4 8 5
17 4 0 6
While I can write it using a for loop, my data frame is huge and it's computationally costly, is there any efficient to generate such column? Thanks.
Use groupby_cumcount:
df['c'] = df.groupby('a').cumcount().add(1)
print(df)
# Output
a b c
0 1 3 1
1 1 4 2
2 1 3 3
3 2 7 1
4 2 5 2
5 2 9 3
6 2 4 4
7 2 2 5
8 3 5 1
9 3 6 2
10 3 7 3
11 3 8 4
12 4 4 1
13 4 2 2
14 4 4 3
15 4 5 4
16 4 8 5
17 4 0 6
I am trying to conduct a mixed model analysis but would like to only include individuals who have data in all timepoints available. Here is an example of what my dataframe looks like:
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
print(df)
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 3 1 5
13 3 2 4
14 3 4 5
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
I want to only keep individuals in the id column who have all 6 timepoints. I.e. IDs 1, 2, and 4 (and cut out all of ID 3's data).
Here's the ideal output:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 4 1 8
13 4 2 4
14 4 3 5
15 4 4 6
16 4 5 2
17 4 6 3
Any help much appreciated.
You can count the number of unique timepoints you have, and then filter your dataframe accordingly with transform('nunique') and loc keeping only the ID's that contain all 6 of them:
t = len(set(timepoint))
res = df.loc[df.groupby('id')['timepoint'].transform('nunique').eq(t)]
Prints:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
First, here is my code:
import random
Tablero = []
for i in range(0,20):
Tablero.append( [0]*9 )
for Fila in range(20):
for Columna in range(9):
Tablero[Fila][Columna] = random.randint(1,6)
for i in range (0,20):
for j in range (0, 9) :
print(str.rjust(str(Tablero[i][j]), 4), end = "")
print();
input()
And the current output:
3 5 5 2 6 2 2 3 6
6 2 5 3 4 1 6 4 5
1 5 4 2 1 4 6 6 6
4 6 6 2 1 3 5 6 1
5 2 2 1 1 4 5 6 1
3 6 3 1 6 2 6 2 6
1 1 5 2 6 5 3 6 5
1 2 1 6 1 3 4 1 6
5 2 3 2 1 6 3 4 3
4 6 6 6 6 6 5 1 2
4 4 5 4 3 6 5 1 3
4 5 1 2 2 3 5 2 5
1 2 4 5 2 2 3 2 4
2 1 5 3 4 1 2 4 6
1 3 3 4 4 4 6 5 6
6 6 2 2 3 1 6 3 3
1 5 6 4 2 6 2 1 4
2 3 5 5 6 6 2 4 4
5 5 1 6 1 2 5 4 4
2 3 6 5 4 6 2 2 3
Problem: there are cases where the same numbers appear more than 3 times in a row, this only matters for the COLUMNS.
Example:
1 3 3 4 4 4 6 5 6
And it should be:
1 3 3 4 4 2 4 2 6
I am trying this code for like a week... I was able to do it separately but adding it to this program did not work.
A simple code would be greatly appreciated.
EDIT: The program is similar to the game candy crush. In this program, only the columns matter when adding points, that is to say, only these have to be aligned (3,3,3) would be equivalent to a certain score. Therefore, the program should not generate this : 1 3 3 4 4 4 6 5 6, because otherwise it would play on its own. Since then, one must enter the row and column of a number which will be eliminated and that in that position will fall the number that is above.
df = pd.DataFrame({'site':[1,1,1,1,1,1,1,1,1,1], 'parm':[8,8,8,8,8,9,9,9,9,9],
'date':[1,2,3,4,5,1,2,3,4,5], 'obs':[1,1,2,3,3,3,5,5,6,6]})
Output
site parm date obs
0 1 8 1 1
1 1 8 2 1
2 1 8 3 2
3 1 8 4 3
4 1 8 5 3
5 1 9 1 3
6 1 9 2 5
7 1 9 3 5
8 1 9 4 6
9 1 9 5 6
I want to count repeating, sequential "obs" values within a "site" and "parm". I have this code which is close:
df['consecutive'] = df.parm.groupby((df.obs != df.obs.shift()).cumsum()).transform('size')
Output
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 3
4 1 8 5 3 3
5 1 9 1 3 3
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2
It creates the new column with the count. The gap is when the parm changes from 8 to 9 it includes the parm 9 in the parm 8 count. The expected output is:
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 2
4 1 8 5 3 2
5 1 9 1 3 1
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2
You need to throw site, parm as indicated in the question into groupby:
df['consecutive'] = (df.groupby([df.obs.ne(df.obs.shift()).cumsum(),
'site', 'parm']
)
['obs'].transform('size')
)
Output:
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 2
4 1 8 5 3 2
5 1 9 1 3 1
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2
I have a data like this:
republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
from source.
I would like to change all different distinct values from all of the data (dataframe) into numeric values in most efficient way.
In the above mentioned example I would like to transform republican-> 1 and democrat -> 2, y ->3, n->4 and ? -> 5 (or NULL).
I tried to use the following:
# Convert string column to integer
def str_column_to_int(dataset, column):
class_values = [row[column] for row in dataset]
unique = set(class_values)
lookup = dict()
for i, value in enumerate(unique):
lookup[value] = i
for row in dataset:
row[column] = lookup[row[column]]
return lookup
However, I'm not sure if using Pandas can be more efficient or there are some other better solutions for it. (This should be generic to any source of data).
Here is the transform of data into dataframe using Pandas:
import pandas as pd
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'
dataset = pd.read_csv(file_path, header=None)
v = df.values
f = pd.factorize(v.ravel())[0].reshape(v.shape)
pd.DataFrame(f)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 1 2 1 2 2 2 1 1 1 2 3 2 2 2 1 2
1 0 1 2 1 2 2 2 1 1 1 1 1 2 2 2 1 3
2 4 3 2 2 3 2 2 1 1 1 1 2 1 2 2 1 1
3 4 1 2 2 1 3 2 1 1 1 1 2 1 2 1 1 2
4 4 2 2 2 1 2 2 1 1 1 1 2 3 2 2 2 2
5 4 1 2 2 1 2 2 1 1 1 1 1 1 2 2 2 2
6 4 1 2 1 2 2 2 1 1 1 1 1 1 3 2 2 2
7 0 1 2 1 2 2 2 1 1 1 1 1 1 2 2 3 2
Use replace on the whole dataframe to make the mappings. You could first pass a dictionary of known mappings for values you need to remain consistent, and then generate a set of values for the dataset and map these extra values to say values 100 upwards.
For example, the ? here is not mapped, so would get a value of 100:
mappings = {'republican':1, 'democrat':2, 'y':3, 'n':4}
unknown = set(pd.unique(df.values.ravel())) - set(mappings.keys())
mappings.update([v, c] for c, v in enumerate(unknown, start=100))
df.replace(mappings, inplace=True)
Giving you:
republican n n.1 n.2 n.3 n.4 n.5 n.6 n.7 n.8 n.9 ? n.10 n.11 n.12 n.13 n.14
0 1 4 3 4 3 3 3 4 4 4 3 100 3 3 3 4 3
1 1 4 3 4 3 3 3 4 4 4 4 4 3 3 3 4 100
2 2 100 3 3 100 3 3 4 4 4 4 3 4 3 3 4 4
3 2 4 3 3 4 100 3 4 4 4 4 3 4 3 4 4 3
4 2 3 3 3 4 3 3 4 4 4 4 3 100 3 3 3 3
5 2 4 3 3 4 3 3 4 4 4 4 4 4 3 3 3 3
6 2 4 3 4 3 3 3 4 4 4 4 4 4 100 3 3 3
7 1 4 3 4 3 3 3 4 4 4 4 4 4 3 3 100 3
A more generalized version would be:
mappings = {v:c for c, v in enumerate(sorted(set(pd.unique(df.values.ravel()))), start=1)}
df.replace(mappings, inplace=True)
You can use:
v = df.values
a, b = v.shape
f = pd.factorize(v.T.ravel())[0].reshape(b,a).T
df = pd.DataFrame(f)
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 2 4 2 4 4 4 2 2 2 4 3 4 4 4 2 4
1 0 2 4 2 4 4 4 2 2 2 2 2 4 4 4 2 3
2 1 3 4 4 3 4 4 2 2 2 2 4 2 4 4 2 2
3 1 2 4 4 2 3 4 2 2 2 2 4 2 4 2 2 4
4 1 4 4 4 2 4 4 2 2 2 2 4 3 4 4 4 4
5 1 2 4 4 2 4 4 2 2 2 2 2 2 4 4 4 4
6 1 2 4 2 4 4 4 2 2 2 2 2 2 3 4 4 4
7 0 2 4 2 4 4 4 2 2 2 2 2 2 4 4 3 4