I have a dataframe like this (boolean values)
a b c d count
1 0 1 0 196
0 1 0 1 110
0 1 0 0 17
0 0 1 0 10
0 0 0 0 9
As you can see, someone can be a and c, or b and d, or only c.
I want to build a square matrix dataframe, where
a b
a 0 0
b 0 17
c 196 10
d 0 110
Can I get something like this? I tried
result = df.merge(df, on=['ID'])
count = pd.crosstab(df[columns_x], df[columns_y])
but it didn't give what I want.
Note
The main data frame looks like this:
a b c d
Yes No Yes No
No Yes No Yes
No Yes No No
No No Yes No
No No No No
I got the answer by simply doing a dot product.
df_transpose = df.transpose()
count = df_transpose.dot(df)
Output:
a b c d
a 222 5 8 1
b 5 154 14 22
c 8 14 34 6
d 1 22 6 29
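For completeness, a minimal sketch of the same dot-product idea starting from the Yes/No frame above; the replace step that converts the strings to 1/0 is an assumption about how the data was prepared:
import pandas as pd

# Yes/No frame as shown above (assumed construction)
df = pd.DataFrame({'a': ['Yes', 'No', 'No', 'No', 'No'],
                   'b': ['No', 'Yes', 'Yes', 'No', 'No'],
                   'c': ['Yes', 'No', 'No', 'Yes', 'No'],
                   'd': ['No', 'Yes', 'No', 'No', 'No']})

# convert Yes/No to 1/0 so the dot product counts co-occurrences
bools = df.replace({'Yes': 1, 'No': 0})

# entry (x, y) is the number of rows where both x and y are "Yes"
count = bools.T.dot(bools)
print(count)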
Related
I'm trying to create a relationship between repeated IDs in a dataframe. For example, take 91: it is repeated 4 times, so for the first 91 entry the first column will be updated to A and the second to B; for the next 91 row, first will be B and second C; then first will be C and second D, and so on. The same relationship applies to all duplicated IDs.
For IDs that are not repeated, first will be marked as A.
id first other
11 0 0
09 0 0
91 0 0
91 0 0
91 0 0
91 0 0
15 0 0
15 0 0
12 0 0
01 0 0
01 0 0
01 0 0
Expected output:
id first other
11 A 0
09 A 0
91 A B
91 B C
91 C D
91 D E
15 A B
15 B C
12 A 0
01 A B
01 B C
01 C D
I am using df.iterrows() for this, but the code is becoming very messy and it will be slow as the dataset grows. Is there an easier way of doing it?
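For reference, a minimal construction of the input frame (a sketch; the ids are written as plain integers here, matching the outputs below):
import pandas as pd

df = pd.DataFrame({'id': [11, 9, 91, 91, 91, 91, 15, 15, 12, 1, 1, 1],
                   'first': 0,
                   'other': 0})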
You can perform a mapping using a cumcount per group as source:
from string import ascii_uppercase
# mapping dictionary
# this is an example, you can use any mapping
d = dict(enumerate(ascii_uppercase))
# {0: 'A', 1: 'B', 2: 'C'...}
g = df.groupby('id')
c = g.cumcount()                         # 0-based position within each id group
m = g['id'].transform('size').gt(1)      # mask: ids that appear more than once
df['first'] = c.map(d)
df.loc[m, 'other'] = c[m].add(1).map(d)  # shifted by one letter for repeated ids
Output:
id first other
0 11 A 0
1 9 A 0
2 91 A B
3 91 B C
4 91 C D
5 91 D E
6 15 A B
7 15 B C
8 12 A 0
9 1 A B
10 1 B C
11 1 C D
Given:
id
0 11
1 9
2 91
3 91
4 91
5 91
6 15
7 15
8 12
9 1
10 1
11 1
Doing:
# Cumulative count of each id within its group
df['first'] = df.groupby('id').cumcount()
# indices of ids that appear more than once
m = df.groupby('id').filter(lambda x: len(x) > 1).index
# convert the counts to letters and build the other column
df.loc[m, 'other'] = df['first'].add(66).apply(chr)
df['first'] = df['first'].add(65).apply(chr)
# fill in missing values with 0
df['other'] = df['other'].fillna(0)
Output:
id first other
0 11 A 0
1 9 A 0
2 91 A B
3 91 B C
4 91 C D
5 91 D E
6 15 A B
7 15 B C
8 12 A 0
9 1 A B
10 1 B C
11 1 C D
I have a dataframe df with a column A of random numbers and a column B of categories. Now, I obtain another column C using the code below:
df.loc[df['A'] >= 50, 'C'] = 1
df.loc[df['A'] < 50, 'C'] = 0
I want to obtain a column 'D' that builds a running count within each group of B whenever C is 1, and returns 0 where C is 0. The required dataframe is given below.
Required df
A B C D
17 a 0 0
88 a 1 1
99 a 1 2
76 a 1 3
73 a 1 4
23 b 0 0
36 b 0 0
47 b 0 0
74 b 1 1
80 c 1 1
77 c 1 2
97 d 1 1
30 d 0 0
80 d 1 2
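For reference, a minimal construction of the sample frame (a sketch; only A and B are written out, and C is derived exactly as in the code from the question):
import pandas as pd

df = pd.DataFrame({'A': [17, 88, 99, 76, 73, 23, 36, 47, 74, 80, 77, 97, 30, 80],
                   'B': list('aaaaabbbbccddd')})
df.loc[df['A'] >= 50, 'C'] = 1
df.loc[df['A'] < 50, 'C'] = 0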
Use GroupBy.cumcount with Series.mask:
df['D'] = df.groupby(['B', 'C']).cumcount().add(1).mask(df['C'].eq(0), 0)
print (df)
A B C D
17 a 0 0
88 a 1 1
99 a 1 2
76 a 1 3
73 a 1 4
23 b 0 0
36 b 0 0
47 b 0 0
74 b 1 1
80 c 1 1
77 c 1 2
97 d 1 1
30 d 0 0
80 d 1 2
Or numpy.where:
import numpy as np

df['D'] = np.where(df['C'].eq(0), 0, df.groupby(['B', 'C']).cumcount().add(1))
I have a dataframe with 45 columns. Most hold string values, so I'm trying to use pd.get_dummies to turn the strings into numbers with df = pd.get_dummies(df, drop_first=True); however, the columns without string values are removed from my dataframe. I don't want to have to type out 40 or so column names. How can I iterate over every column, ignoring the ones without strings, and still have them remain after the get_dummies call?
Columns can be filtered by dtypes to programmatically determine which columns to pass to get_dummies, namely only the "object or category" type columns:
new_df = pd.get_dummies(
df,
columns=df.columns[(df.dtypes == 'object') | (df.dtypes == 'category')]
)
Sample Data:
import numpy as np
import pandas as pd
np.random.seed(5)
n = 10
df = pd.DataFrame({
'A': np.random.randint(1, 100, n),
'B': pd.Series(np.random.choice(list("ABCD"), n), dtype='category'),
'C': np.random.random(n) * 100,
'D': np.random.choice(list("EFGH"), n)
})
new_df = pd.get_dummies(
df,
columns=df.columns[(df.dtypes == 'object') | (df.dtypes == 'category')]
)
df:
A B C D
0 79 A 76.437261 G
1 62 D 11.090076 E
2 17 B 20.415475 E
3 74 B 11.909536 E
4 9 D 87.790307 G
5 63 A 52.367529 E
6 28 D 49.213600 F
7 31 A 73.187110 H
8 81 B 1.458075 H
9 8 D 9.336303 H
df.dtypes:
A int32
B category
C float64
D object
dtype: object
new_df:
A C B_A B_B B_D D_E D_F D_G D_H
0 79 76.437261 1 0 0 0 0 1 0
1 62 11.090076 0 0 1 1 0 0 0
2 17 20.415475 0 1 0 1 0 0 0
3 74 11.909536 0 1 0 1 0 0 0
4 9 87.790307 0 0 1 0 0 1 0
5 63 52.367529 1 0 0 1 0 0 0
6 28 49.213600 0 0 1 0 1 0 0
7 31 73.187110 1 0 0 0 0 0 1
8 81 1.458075 0 1 0 0 0 0 1
9 8 9.336303 0 0 1 0 0 0 1
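An equivalent way to pick the string-like columns, as a sketch, is DataFrame.select_dtypes:
new_df = pd.get_dummies(
    df,
    columns=df.select_dtypes(include=['object', 'category']).columns
)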
I have been stuck for 3 hours on this problem.
I have a DF like this:
p = product
order = number of sales
I don't have the release date of the product, so I assume that the release date is the first date with some sales.
Here is my dataframe:
p order
A 0
A 0
A 1
A 1
A 2
B 0
B 0
B 1
B 1
This is what I would like: an incrementing count of days since release in a column d_s_r (days since release).
p order d_s_r
A 0 0
A 0 0
A 1 1
A 1 2
A 2 3
B 0 0
B 0 0
B 1 1
B 1 2
What would be your recommendation?
I tried:
for i, row in data[data.order > 0].groupby('p'):
    list_rows = row.index.tolist()
    for m, k in enumerate(list_rows):
        data.loc[k, 'd_s_r'] = m + 1
It seems to work, but it takes too much time...
I'm sure there is an easy way but I can't find it.
Thanks in advance...
Edit:
Here's my df:
df = pd.DataFrame([['A', 0, 0], ['A', 0, 0], ['A', 12, 1], ['A', 23, 5], ['A', 25, 7],
                   ['B', 0, 0], ['B', 2, 0], ['B', 8, 5], ['B', 15, 12], ['B', 0, 3], ['B', 0, 3], ['B', 5, 4]],
                  columns=['prod', 'order', 'order_2'])
With df.groupby('prod')['order'].transform(lambda x: x.cumsum().factorize()[0])
I get:
prod order order_2 d_s_r
0 A 0 0 0
1 A 0 0 0
2 A 12 1 1
3 A 23 5 2
4 A 25 7 3
5 B 0 0 0
6 B 2 0 1
7 B 8 5 2
8 B 15 12 3
9 B 0 3 3
10 B 0 3 3
11 B 5 4 4
When I would like :
prod order order_2 d_s_r
0 A 0 0 0
1 A 0 0 0
2 A 12 1 1
3 A 23 5 2
4 A 25 7 3
5 B 0 0 0
6 B 2 0 1
7 B 8 5 2
8 B 15 12 3
9 B 0 3 4
10 B 0 3 5
11 B 5 4 6
I generally have 0's at the beginning of each group of 'p', but I could also have actual values right from the start.
And I can have 0 orders on some days (which puts the counter back here), but I still want my counter to run from the release date of the product.
I actually managed to get my results by adding a dummy column with only "1" and doing df[df['order'] > 0].groupby('p').cumsum(), but I don't think it's really a clean solution...
Use groupby on p + cumsum on order with factorize:
df['d_s_r'] = df.groupby('p')['order'].cumsum().factorize()[0]
print(df)
p order d_s_r
0 A 0 0
1 A 0 0
2 A 1 1
3 A 1 2
4 A 2 3
5 B 0 0
6 B 0 0
7 B 1 1
8 B 1 2
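For the edited frame in the question, where orders can drop back to 0 after the release, a sketch of a variant: flag the first sale per product with a cumulative max, then count rows from there (column names prod and order as in the edit):
# 1 from the first row with a sale onwards, 0 before release
released = df['order'].gt(0).astype(int).groupby(df['prod']).cummax()
# days since release; rows before the first sale stay at 0
df['d_s_r'] = released.groupby(df['prod']).cumsum()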
Dataframe:
df = pd.DataFrame({'a':['NA','W','Q','M'], 'b':[0,0,4,2], 'c':[0,12,0,2], 'd':[22, 3, 34, 12], 'e':[0,0,3,6], 'f':[0,2,0,0], 'h':[0,1,1,0] })
df
a b c d e f h
0 NA 0 0 22 0 0 0
1 W 0 12 3 0 2 1
2 Q 4 0 34 3 0 1
3 M 2 2 12 6 0 0
I want to drop the entire row if the value of column b and all the columns from e to the end contain 0.
Basically I want to get something like this:
a b c d e f h
1 W 0 12 3 0 2 1
2 Q 4 0 34 3 0 1
3 M 2 2 12 6 0 0
If you want to test the columns from e to the end together with column b (added with DataFrame.assign), use DataFrame.loc for selecting, test for not equal with DataFrame.ne, then check whether any values match (meaning not all are 0) with DataFrame.any, and finally filter by boolean indexing:
df = df[df.loc[:, 'e':].assign(b = df['b']).ne(0).any(axis=1)]
print (df)
a b c d e f h
1 W 0 12 3 0 2 1
2 Q 4 0 34 3 0 1
3 M 2 2 12 6 0 0
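As a sketch, an equivalent filter without assign keeps rows where b is nonzero or any column from e to the end is nonzero:
df = df[df['b'].ne(0) | df.loc[:, 'e':].ne(0).any(axis=1)]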