ffill with a groupby and matching a condition - python

I want to forward fill the value of log for each id once the first 1 appears in the log column.
Example:
df
id log
1 0
1 1
1 0
1 0
2 1
2 0
3 1
3 0
3 1
to the desired output:
id log ffil_log
1 0 0
1 1 1
1 0 1
1 0 1
2 1 1
2 0 1
3 1 1
3 0 1
3 1 1
My attempt was:
df['ffil_log'] = df.log.where(df.log == 1).groupby(df.id).ffill()
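For what it's worth, that attempt is close: where keeps only the 1s and ffill propagates them within each id, but every row before the first 1 is left as NaN rather than 0. Filling those back from the original column completes it (a sketch on the same df):
df['ffil_log'] = (df.log.where(df.log == 1)   # keep the 1s, everything else becomes NaN
                    .groupby(df.id).ffill()   # propagate each 1 forward within its id
                    .fillna(df.log)           # rows before the first 1 keep their 0
                    .astype(int))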

You can use cummax with groupby:
df['ffil_log'] = df.groupby('id')['log'].cummax()
For each id, once you reach a 1, it stays the value for every row after it, and you get the expected output:
id log ffil_log
0 1 0 0
1 1 1 1
2 1 0 1
3 1 0 1
4 2 1 1
5 2 0 1
6 3 1 1
7 3 0 1
8 3 1 1
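For reference, a minimal, self-contained version of the above (reconstructing the example frame from the question):
import pandas as pd

df = pd.DataFrame({'id':  [1, 1, 1, 1, 2, 2, 3, 3, 3],
                   'log': [0, 1, 0, 0, 1, 0, 1, 0, 1]})

# cummax never decreases, so within each id every row at or after
# the first 1 becomes 1
df['ffil_log'] = df.groupby('id')['log'].cummax()
print(df)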

Related

replacing the value of one column conditional on two other columns in pandas

I have a dataframe df:
year ID category
1 1 0
2 1 1
3 1 1
4 1 0
1 2 0
2 2 0
3 2 1
4 2 0
I want to create a new column such that, for each ID, once 'category' is 1 in a given 'year', 'new_category' stays 1 for all upcoming years:
year ID category new_category
1 1 0 0
2 1 1 1
3 1 1 1
4 1 0 1
1 2 0 0
2 2 0 0
3 2 1 1
4 2 0 1
I have tried an if-else approach, but I just get the same 'category' column back:
for row in range(1, df.category[i-1]):
    df['new_category'] = df['category'].replace('0', df['category'].shift(1))
But I am not getting the desired column.
TRY:
df['new_category'] = df.groupby('ID')['category'].cummax()
OUTPUT:
year ID category new_category
0 1 1 0 0
1 2 1 1 1
2 3 1 1 1
3 4 1 0 1
4 1 2 0 0
5 2 2 0 0
6 3 2 1 1
7 4 2 0 1
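One caveat worth noting: cummax is order-dependent, so this assumes the rows are already sorted by 'year' within each 'ID' (as in the example). If they might not be, sort first (a sketch):
df = df.sort_values(['ID', 'year'])
df['new_category'] = df.groupby('ID')['category'].cummax()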

Column number detection using dataframe pattern

I have a dataset appearing something like this:
empl_ID day_1 day_2 day_3 day_4 day_5 day_6 day_7 day_8 day_9 day_10
1 1 1 1 1 1 1 0 1 1 1
2 0 0 1 1 1 1 1 1 1 0
3 0 1 0 0 1 1 1 1 1 1
4 1 0 1 0 1 1 1 0 1 0
5 1 0 0 1 1 1 1 1 1 1
6 0 0 0 0 1 1 1 1 1 1
Each row is a record for an employee, identified by the 'empl_ID' column. I am trying to write Python code that finds the first occurrence of 1 in each record. For example, for empl_ID 1 the first 1 occurs in column day_1, so the label is 1. For empl_ID 2, the first 1 occurs in column day_3, so the label is 3. Similarly, the remaining employees get labels 2, 1, 1 and 5 respectively. The resulting dataset looks like this:
empl_ID day_1 day_2 day_3 day_4 day_5 day_6 day_7 day_8 day_9 day_10 label
1 1 1 1 1 1 1 0 1 1 1 1
2 0 0 1 1 1 1 1 1 1 0 3
3 0 1 0 0 1 1 1 1 1 1 2
4 1 0 1 0 1 1 1 0 1 0 1
5 1 0 0 1 1 1 1 1 1 1 1
6 0 0 0 0 1 1 1 1 1 1 5
If someone could please help me write Python code for the above problem, that would be very helpful. Thanks in advance!
# idxmax(axis=1) returns the first column label holding the row maximum (here, the first 1)
s = df.set_index('empl_ID').idxmax(axis=1).str.split('_').str[-1]
empl_ID
1 1
2 3
3 2
4 1
5 1
6 5
dtype: object
df['new'] = s.values
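Note that s holds strings (dtype: object). If you want the integer 'label' column shown in the question, cast it (a sketch on the same df):
df['label'] = s.astype(int).values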

Create Duplicate Rows and Change Values in Specific Columns

How do I create x duplicates of a given row in the dataframe and change one or more values in specific columns? The duplicated rows are then added to the end of the same dataframe.
A B C D E F
0 1 1 0 1 1 0
1 2 2 1 1 1 0
2 2 2 1 1 1 0
3 2 2 1 1 1 0
4 1 1 0 1 1 0 <- Create 25 Duplicates of this row (4) and change variable C to 1
5 1 1 0 1 1 0
6 2 2 1 1 1 0
7 2 2 1 1 1 0
8 2 2 1 1 1 0
9 1 1 0 1 1 0
I repeat only 10 times to keep the length of the result reasonable.
# [4] * 10 -> row label 4, repeated 10 times (the number of repeats)
df.append(df.loc[[4] * 10].assign(C=1), ignore_index=True)
A B C D E F
0 1 1 0 1 1 0
1 2 2 1 1 1 0
2 2 2 1 1 1 0
3 2 2 1 1 1 0
4 1 1 0 1 1 0
5 1 1 0 1 1 0
6 2 2 1 1 1 0
7 2 2 1 1 1 0
8 2 2 1 1 1 0
9 1 1 0 1 1 0
10 1 1 1 1 1 0
11 1 1 1 1 1 0
12 1 1 1 1 1 0
13 1 1 1 1 1 0
14 1 1 1 1 1 0
15 1 1 1 1 1 0
16 1 1 1 1 1 0
17 1 1 1 1 1 0
18 1 1 1 1 1 0
19 1 1 1 1 1 0
Per comments, try:
df.append(df.loc[[4] * 10].assign(**{'C': 1}), ignore_index=True)
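A heads-up for current readers: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the equivalent is pd.concat:
pd.concat([df, df.loc[[4] * 10].assign(C=1)], ignore_index=True)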
I am using repeat and reindex:
s = df.iloc[[4], :]                # pick the row you want to repeat
s = s.reindex(s.index.repeat(25))  # repeat the row the given number of times
# s = pd.DataFrame([df.iloc[4, :].tolist()] * 25)  # faster alternative to the two lines above
s.loc[:, 'C'] = 1                  # change the value
pd.concat([df, s], ignore_index=True)  # append to the original df

Identifying groups with the same column value and counting them

I am working with a dataframe consisting of a continuity column df['continuity'] and a group column df['group'].
Both are binary columns.
I want to add an extra column 'group_id' that gives consecutive runs of 1s in 'group' the same integer value: the first run gets 1, the next 2, etc. Each time the continuity value of a row is 0, the counting should restart at 1.
Since this question is rather specific, I'm not sure how to tackle it in a vectorized way. Below is an example, where the first two columns are the input and the third column is the output I'd like to have.
continuity group group_id
1 0 0
1 1 1
1 1 1
1 1 1
1 0 0
1 1 2
1 1 2
1 1 2
1 0 0
1 0 0
1 1 3
1 1 3
0 1 1
0 0 0
1 1 1
1 1 1
1 0 0
1 0 0
1 1 2
1 1 2
I believe you can use:
import numpy as np

# assign a run id in both columns: each change of value starts a new run
b = df[['continuity', 'group']].ne(df[['continuity', 'group']].shift()).cumsum()
# identify the first row of each run where group is 1
c = ~b.duplicated() & (df['group'] == 1)
# cumulative count of run starts within each continuity block, else 0
df['new'] = np.where(df['group'] == 1,
                     c.groupby(b['continuity']).cumsum(),
                     0).astype(int)
print(df)
continuity group group_id new
0 1 0 0 0
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
4 1 0 0 0
5 1 1 2 2
6 1 1 2 2
7 1 1 2 2
8 1 0 0 0
9 1 0 0 0
10 1 1 3 3
11 1 1 3 3
12 0 1 1 1
13 0 0 0 0
14 1 1 1 1
15 1 1 1 1
16 1 0 0 0
17 1 0 0 0
18 1 1 2 2
19 1 1 2 2
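For completeness, a self-contained setup that reproduces the frame above, so the snippet runs as-is (values copied from the example):
import pandas as pd

df = pd.DataFrame({
    'continuity': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1],
    'group':      [0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1],
})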

How to concatenate all values of a pandas dataframe into an integer in python?

I have the following dataframe:
1 2 3 4 5 6 7 8 9 10
dog cat 1 1 0 1 1 1 0 0 1 0
dog 1 1 1 1 1 1 0 0 1 1
fox 1 1 1 1 1 1 0 0 1 1
jumps 1 1 1 1 1 1 0 1 1 1
over 1 1 1 1 1 1 0 0 1 1
the 1 1 1 1 1 1 1 0 1 1
I want to first drop all labels from both rows and columns so the df becomes:
1 1 0 1 1 1 0 0 1 0
1 1 1 1 1 1 0 0 1 1
1 1 1 1 1 1 0 0 1 1
1 1 1 1 1 1 0 1 1 1
1 1 1 1 1 1 0 0 1 1
1 1 1 1 1 1 1 0 1 1
And then concatenate the values into one long number, so it becomes:
110111001011111100111111110011111111011111111100111111111011
Does anyone know a way of doing this in the shortest snippet of code possible? I appreciate the suggestions. Thank you.
Option 1
apply(''.join) + str.cat:
df.astype(str).apply(''.join, axis=1).str.cat(sep='')
'110111001011111100111111110011111111011111111100111111111011'
Option 2
apply + np.sum, proposed by Wen:
np.sum(df.astype(str).apply(np.sum, axis=1))
'110111001011111100111111110011111111011111111100111111111011'
IIUC
''.join(str(x) for x in sum(df.values.tolist(), []))
Out[344]: '110111001011111100111111110011111111011111111100111111111011'
Or
''.join(map(str, sum(df.values.tolist(), [])))
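A slightly more direct variant (a sketch, assuming pandas 0.24+ for to_numpy; ravel flattens the array row by row):
''.join(map(str, df.to_numpy().ravel()))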
