I want a solution to the problem below in pandas (Python 3.5). I have a partial solution in SQL in an earlier post.
Hi, I have a dataframe as below with thousands of IDs. Each ID has sub IDs within it, as shown. The sub IDs may change on a daily basis: a new sub ID may be added, or an existing sub ID may be lost.
I need to create 2 new columns that flag whenever a sub ID is added or lost.
So, in the format below, you can see that on the 12th a new sub ID ('d') is added, and on the 13th an existing sub ID ('c') is lost. I want to create new columns/flags to track these sub IDs. Can you please help me with this?
When a sub ID gets removed, I would like it to have an additional row, with the IS_REMOVED column = 1 on the date it is actually removed. The sample input/output dataframes are below. Thanks.
Sample input dataframe:
ID Sub Id Date
1 a 3/11/2016
1 b 3/11/2016
1 c 3/11/2016
1 a 3/12/2016
1 b 3/12/2016
1 c 3/12/2016
1 d 3/12/2016
1 a 3/13/2016
1 b 3/13/2016
1 d 3/13/2016
Sample Output:
ID SUBID UPDDATE IS_NEW IS_REMOVED
1 a 2016-03-11 0 0
1 b 2016-03-11 0 0
1 c 2016-03-11 0 0
1 a 2016-03-12 0 0
1 b 2016-03-12 0 0
1 c 2016-03-12 0 0
1 d 2016-03-12 1 0
1 a 2016-03-13 0 0
1 b 2016-03-13 0 0
1 c 2016-03-13 0 1
1 d 2016-03-13 0 0
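For reference, here is a minimal sketch that reconstructs the sample input used by the answer below (assuming Date should be parsed as a datetime column):
import pandas as pd

df = pd.DataFrame({'ID': [1] * 10,
                   'Sub Id': list('abc') + list('abcd') + list('abd'),
                   'Date': pd.to_datetime(['3/11/2016'] * 3
                                          + ['3/12/2016'] * 4
                                          + ['3/13/2016'] * 3)})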
One way you could do this, and visualize the results as you go, is to use pd.crosstab:
# presence matrix: one row per (ID, Date) pair, one 0/1 column per Sub Id
df_out = pd.crosstab([df['ID'], df['Date']], df['Sub Id'])

# row-to-row change in presence: +1 means added, -1 means removed
df_diff = df_out.diff().fillna(0).stack()

df_result = pd.concat([df.set_index(['ID', 'Date', 'Sub Id']),
                       df_diff.eq(1).mul(1).rename('IS_NEW'),
                       df_diff.eq(-1).mul(1).rename('IS_REMOVED')], axis=1)\
              .reset_index()
Output:
ID Date Sub Id IS_NEW IS_REMOVED
0 1 2016-03-11 a 0 0
1 1 2016-03-11 b 0 0
2 1 2016-03-11 c 0 0
3 1 2016-03-11 d 0 0
4 1 2016-03-12 a 0 0
5 1 2016-03-12 b 0 0
6 1 2016-03-12 c 0 0
7 1 2016-03-12 d 1 0
8 1 2016-03-13 a 0 0
9 1 2016-03-13 b 0 0
10 1 2016-03-13 c 0 1
11 1 2016-03-13 d 0 0
Visualize results:
print(df_out)
Sub Id a b c d
ID Date
1 2016-03-11 1 1 1 0
2016-03-12 1 1 1 1
2016-03-13 1 1 0 1
print(df_out.diff().fillna(0))
Sub Id a b c d
ID Date
1 2016-03-11 0.0 0.0 0.0 0.0
2016-03-12 0.0 0.0 0.0 1.0
2016-03-13 0.0 0.0 -1.0 0.0
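Note that the concatenated result also carries an all-zero row for d on 2016-03-11, because the crosstab covers every (Date, Sub Id) combination. If you want exactly the sample output, one sketch (using the df_result name assigned above) is to keep only the rows present in the input plus the removal rows:
# keep rows that exist in the input, plus the extra "removed" rows
in_input = df_result.set_index(['ID', 'Date', 'Sub Id']).index.isin(
    df.set_index(['ID', 'Date', 'Sub Id']).index)
df_result = df_result[in_input | df_result['IS_REMOVED'].eq(1)]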
I have a dataframe df
ID ID2 escto1 escto2 escto3
1 A 1 0 0
2 B 0 1 0
3 C 0 0 3
4 D 0 2 0
So, using either indexing or a wildcard-style match on the column names (like 'escto*'), I want something along the lines of:
if df.iloc[:, 2:] > 0 then df.helper = 1
or
df.loc[(df.iloc[:, 3:] > 0, 'Transfer')] = 1
So that output becomes
ID ID2 escto1 escto2 escto3 helper
1 A 1 0 0 1
2 B 0 1 0 1
3 C 0 0 3 1
4 D 0 2 0 1
One option is to use the boolean output:
df.assign(helper=df.filter(like='escto').gt(0).any(axis=1).astype(int))
ID ID2 escto1 escto2 escto3 helper
0 1 A 1 0 0 1
1 2 B 0 1 0 1
2 3 C 0 0 3 1
3 4 D 0 2 0 1
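If you prefer an explicit wildcard-style match on the column names, df.filter also accepts a regex; a minor variation of the same idea:
# same result, matching the 'escto' prefix with a regex
df.assign(helper=df.filter(regex='^escto').gt(0).any(axis=1).astype(int))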
I have a dataframe that looks like this:
A B C D
0 abc 0 cdf
abf 0 0 afg
And I want to replace any string value with 1.
The expected outcome should look like:
A B C D
0 1 0 1
1 0 0 1
Any help on how to do this is appreciated.
The safe way:
df.apply(pd.to_numeric, errors='coerce').fillna(1)
Out[217]:
A B C D
0 0.0 1.0 0 1.0
1 1.0 0.0 0 1.0
And for the specific case shown:
(~df.isin([0,'0'])).astype(int)
Out[221]:
A B C D
0 0 1 0 1
1 1 0 0 1
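If you need integers from the first approach rather than the floats introduced by fillna, a small follow-up cast works (assuming every remaining value is a whole number):
# cast back to int after coercing strings to NaN and filling with 1
df.apply(pd.to_numeric, errors='coerce').fillna(1).astype(int)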
I want to take the row-wise sum of values across columns that start with the same text string. Below is my original df with fails per course.
Original df:
ID P_English_2 P_English_3 P_German_1 P_Math_1 P_Math_3 P_Physics_2 P_Physics_4
56 1 3 1 2 0 0 3
11 0 0 0 1 4 1 0
6 0 0 0 0 0 1 0
43 1 2 1 0 0 1 1
14 0 1 0 0 1 0 0
Desired df:
ID P_English P_German P_Math P_Physics
56 4 1 2 3
11 0 0 5 1
6 0 0 0 1
43 3 1 0 2
14 1 0 1 0
Tried code:
import pandas as pd

df = pd.DataFrame({"ID": [56, 11, 6, 43, 14],
                   "P_Math_1": [2, 1, 0, 0, 0],
                   "P_English_3": [3, 0, 0, 2, 1],
                   "P_English_2": [1, 0, 0, 1, 0],
                   "P_Math_3": [0, 4, 0, 0, 1],
                   "P_Physics_2": [0, 1, 1, 1, 0],
                   "P_Physics_4": [3, 0, 0, 1, 0],
                   "P_German_1": [1, 0, 0, 1, 0]})
print(df)

categories = ['P_Math', 'P_English', 'P_Physics', 'P_German']

def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.startswith(cat)]

result = df.groupby(correct_categories(df.columns), axis=1).sum()
print(result)
Let's try groupby with axis=1:
# extract the subjects
subjects = [x[0] for x in df.columns.str.rsplit('_',n=1)]
df.groupby(subjects, axis=1).sum()
Output:
ID P_English P_German P_Math P_Physics
0 56 4 1 2 3
1 11 0 0 5 1
2 6 0 0 0 1
3 43 3 1 0 2
4 14 1 0 1 0
Or you can use wide_to_long, assuming the ID values are unique:
(pd.wide_to_long(df, stubnames=categories,
                 i=['ID'], j='count', sep='_')
   .groupby('ID').sum()
)
Output:
P_Math P_English P_Physics P_German
ID
56 2.0 4.0 3.0 1.0
11 5.0 0.0 1.0 0.0
6 0.0 0.0 1.0 0.0
43 0.0 3.0 2.0 1.0
14 1.0 1.0 0.0 0.0
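Note that groupby(..., axis=1) is deprecated in recent pandas releases; a sketch of an equivalent that transposes instead, assuming the same subject prefixes:
# transpose, group the former columns by subject prefix, transpose back
df.set_index('ID').T.groupby(lambda c: c.rsplit('_', 1)[0]).sum().T.reset_index()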
Sorry, I have a bit of trouble explaining the problem in the title.
By accident we pivoted our Pandas Dataframe to this:
df = pd.DataFrame(np.array([[1, 1, 2], [1, 2, 1], [2, 1, 2], [2, 2, 2], [3, 1, 3]]),
                  columns=['id', '3s', 'score'])
id 3s score
1 1 2
1 2 1
2 1 2
2 2 2
3 1 3
But we need to unstack this so df looks like the original version below. The '3s' column unpivots into a set of 3 ordered columns of 0s and 1s, which fill in order. So if we had '3s' = 2 with 'score' = 2, the values will be [1, 1, 0] (2 out of 3, in order) in columns ['4', '5', '6'] (the second set of 3) for the corresponding id:
df2 = pd.DataFrame(np.array([[1, 1, 1, 0, 1, 0, 0],
                             [2, 1, 1, 0, 1, 1, 0],
                             [3, 1, 1, 1, np.nan, np.nan, np.nan]]),
                   columns=['id', '1', '2', '3', '4', '5', '6'])
id 1 2 3 4 5 6
1 1 1 0 1 0 0
2 1 1 0 1 1 0
3 1 1 1
Any help greatly appreciated!
(please save me)
Use:
n = 3
df2 = df.reindex(index=df.index.repeat(n))
new_df = (df2.assign(score=df2['score'].gt(df2.groupby(['id', '3s'])
                                              .id
                                              .cumcount())
                                       .astype(int),
                     columns=df2.groupby('id').cumcount().add(1))
             .pivot_table(index='id',
                          values='score',
                          columns='columns',
                          fill_value='')
             .rename_axis(columns=None)
             .reset_index())
print(new_df)
Output:
id 1 2 3 4 5 6
0 1 1.0 1.0 0.0 1 0 0
1 2 1.0 1.0 0.0 1 1 0
2 3 1.0 1.0 1.0
If you want, you can use fill_value=0 instead:
id 1 2 3 4 5 6
0 1 1 1 0 1 0 0
1 2 1 1 0 1 1 0
2 3 1 1 1 0 0 0
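To see why the assign step above works: within each (id, 3s) block of n repeated rows, cumcount yields 0, 1, 2, and score > cumcount produces exactly score ones followed by zeros. A tiny standalone check:
s = pd.Series([2, 2, 2])                 # score repeated n = 3 times
slots = pd.Series([0, 1, 2])             # cumcount within the block
print(s.gt(slots).astype(int).tolist())  # [1, 1, 0]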
This should do the trick:
# for each '3s' group key gr (1, 2, ...), fill its three target columns in order
for gr in df.groupby('3s').groups:
    for i in range(1, 4):
        df[str(i + (gr - 1) * 3)] = np.where((df['3s'].eq(gr)) & (df['score'].ge(i)), 1, 0)
# collapse back to one row per id, dropping the pivot columns
df = df.drop(['3s', 'score'], axis=1).groupby('id').max().reset_index()
Output:
id 1 2 3 4 5 6
0 1 1 1 0 1 0 0
1 2 1 1 0 1 1 0
2 3 1 1 1 0 0 0
This is a subset of my data frame:
F1:
id code s-code
l.1 1 11
l.2 2 12
l.3 3 13
f.1 4 NA
f.2 3 1
h.1 2 1
h.3 1 1
I need to compare F1.id with F2.id, add the ids that are missing from F2 to the F2 data frame, and fill in the column values for the added ids with 0.
this is the second data frame
F2:
id head sweat pain
l.1 1 0 1
l.3 1 0 0
f.2 3 1 1
h.3 1 1 0
The output should be like this:
F3:
id head sweat pain
l.1 1 0 1
l.3 1 0 0
f.2 3 1 1
h.3 1 1 0
l.2 0 0 0
h.1 0 0 0
f.1 0 0 0
I tried different solutions, such as
F1[(F1.index.isin(F2.index)) & (F1.isin(F2))] to return the differences, but none of them worked.
By using reindex
df2.set_index('id').reindex(df1.id).fillna(0).reset_index()
Out[371]:
id head sweat pain
0 l.1 1.0 0.0 1.0
1 l.2 0.0 0.0 0.0
2 l.3 1.0 0.0 0.0
3 f.1 0.0 0.0 0.0
4 f.2 3.0 1.0 1.0
5 h.1 0.0 0.0 0.0
6 h.3 1.0 1.0 0.0
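If the float dtypes bother you, casting before restoring the index keeps the counts as integers (the value columns are all numeric at that point):
# cast the numeric columns back to int before restoring 'id' as a column
df2.set_index('id').reindex(df1.id).fillna(0).astype(int).reset_index()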
Use an outer merge + fillna:
df1[['id']].merge(df2, how='outer')\
           .fillna(0).astype(df2.dtypes)
id head sweat pain
0 l.1 1 0 1
1 l.2 0 0 0
2 l.3 1 0 0
3 f.1 0 0 0
4 f.2 3 1 1
5 h.1 0 0 0
6 h.3 1 1 0
Outside the Box
i = np.setdiff1d(F1.id, F2.id)
F2.append(pd.DataFrame(0, range(len(i)), F2.columns).assign(id=i))
id head sweat pain
0 l.1 1 0 1
1 l.3 1 0 0
2 f.2 3 1 1
3 h.3 1 1 0
0 f.1 0 0 0
1 h.1 0 0 0
2 l.2 0 0 0
With a normal index
i = np.setdiff1d(F1.id, F2.id)
F2.append(
pd.DataFrame(0, range(len(i)), F2.columns).assign(id=i),
ignore_index=True
)
id head sweat pain
0 l.1 1 0 1
1 l.3 1 0 0
2 f.2 3 1 1
3 h.3 1 1 0
4 f.1 0 0 0
5 h.1 0 0 0
6 l.2 0 0 0
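Since DataFrame.append was removed in pandas 2.0, here is a concat-based sketch of the same approach:
# build the all-zero rows for the missing ids, then concatenate
i = np.setdiff1d(F1.id, F2.id)
pd.concat([F2, pd.DataFrame(0, index=range(len(i)), columns=F2.columns).assign(id=i)],
          ignore_index=True)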