Python pandas: create new column with string code based on boolean rows

I have a dataframe with multiple columns containing booleans/ints (1/0). I need a new result column of strings built as a code that records: how many runs of consecutive Trues there are, whether the chain is interrupted or not, and from which column to which column each run of Trues extends.
For example this is the following dataframe:
column_1 column_2 column_3 column_4 column_5 column_6 column_7 column_8 column_9 column_10
0 0 1 0 1 1 1 1 0 0 1
1 0 1 1 0 1 1 1 0 0 1
2 1 1 0 0 0 1 1 0 0 1
3 1 1 1 0 0 0 0 1 1 1
4 1 1 1 0 0 1 0 0 1 1
5 1 1 1 0 0 0 1 1 0 1
6 0 1 1 1 1 1 1 0 1 0
For example, row 1: [0 1 1 0 1 1 1 0 0 1]
would result in the code string i2/2-3/c2-c3_c5-c7/6 in column_result, which is built from four segments that I can parse later in my code.
Segment 1:
'i' stands for interrupted; if not interrupted, it would be 'c' for consecutive.
The 2 stands for how many times it found 2 or more consecutive Trues.
Segment 2:
The counts of the consecutive groups; in this case the first group's count is 2 and the second's is 3.
Segment 3:
The number/id of the column where the first True of a consecutive group was found, and the number of the column where the last True of that group was found.
Segment 4:
Just the total count of Trues in the row.
Another example would be row 6: [0 1 1 1 1 1 1 0 1 0]
which would result in the code string c1/6/c2-c7/7 in column_result.
The code below is the starter code I used to create the above dataframe with random ints as bools:
import numpy as np
import pandas as pd

def create_custom_result(df: pd.DataFrame) -> pd.Series:
    return df

def create_dataframe() -> pd.DataFrame:
    df = pd.DataFrame()  # empty df
    for i in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:  # create random bool/int values
        df[f'column_{i}'] = np.random.randint(2, size=50)
    df["column_result"] = ''  # add result column
    return df

if __name__ == "__main__":
    df = create_dataframe()
    custom_results = create_custom_result(df=df)
Would someone have any idea how to tackle this? To be honest, I have no idea where to start. The closest thing I found was count sets of consecutive true values in a column; however, it works down a column, not horizontally across the rows. Maybe someone can tell me whether I should try np.array stuff, or maybe pandas has some function that can help me? I found some groupby functions that work horizontally, but I wouldn't know how to convert that into the string code for the result column. Or should I loop through the DataFrame by rows and then build the column_result code in segments?
Thanks in advance!
I tried some things already, like looping through the dataframe row by row, but I had no idea how to build a new column with the code strings.
I also found this article: pandas groupby, but I wouldn't know how to create new string column data from the groups it finds. Also, almost everything I find groups by a single column rather than across the rows of all columns.

Maybe this code works?
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, size=(12, 8)))
df.columns = ["col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8"]

def func(df: pd.DataFrame) -> pd.DataFrame:
    result_list = []
    copy = df.copy()
    # A row's cumulative sum increases on a 1 and stays flat on a 0.
    cumsum = copy.cumsum(axis=1)
    for r, s in cumsum.iterrows():
        count = 0
        last = -1
        interrupted = 0
        consecutive = 0
        consecutives = []
        ranges = []
        for x in s.values:
            count += 1
            if x != 0:
                if x != last:  # cumsum grew: a 1 in this column
                    consecutive += 1
                    last = x
                    if consecutive == 2:
                        ranges.append(count - 1)  # start column of a group
                elif x == last:  # cumsum flat: a 0 ends the current run
                    if consecutive > 1:
                        interrupted += 1
                        ranges.append(count - 1)  # end column of the group
                        consecutives.append(str(consecutive))
                    consecutive = 0
        else:  # for-else: runs after the loop, closing a group that touches the row end
            if consecutive > 1:
                consecutives.append(str(consecutive))
                ranges.append(count)
        result = f'{interrupted}i/{len(consecutives)}c/{"-".join(consecutives)}/{"_".join([f"c{ranges[i]}-c{ranges[i+1]}" for i in range(0, len(ranges), 2)])}/{last}'
        result_list.append(result.split("/"))
    copy["results"] = pd.Series(["/".join(i) for i in result_list])
    copy[["interrupts_count", "consecutives_count", "consecutives lengths",
          "consecutives columns ranges", "total"]] = pd.DataFrame(np.array(result_list))
    return copy

result_df = func(df)
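If I traced that correctly, the encoding differs slightly from the one requested: for the example row [0 1 1 0 1 1 1 0 0 1] it yields 2i/2c/2-3/c2-c3_c5-c7/6 rather than i2/2-3/c2-c3_c5-c7/6, with the interrupt count and the group count as the first two segments; the same components are also unpacked into separate columns, so a different string format can be rebuilt from them.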

Maybe go with a simple class that receives a slice of the original DataFrame plus each new value. Using the sliced array from the original DataFrame, calculate all starting values as fields (start of a run of consecutive True values, length of the run, last value, ...). Finally, using those starting values and each new incoming value, update the fields and prepare the string output.
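A minimal sketch of that idea, applied per row here since the question encodes rows; the class name RunTracker, its fields, and the rule that 'c' means exactly one run while 'i' means several (inferred from the question's two examples) are all assumptions:

class RunTracker:
    """Hypothetical helper: feed it one row value at a time."""

    def __init__(self):
        self.runs = []    # (start_col, end_col) of every run of >= 2 Trues
        self.run_len = 0  # length of the run currently being built
        self.pos = 0      # 1-based column counter
        self.total = 0    # total Trues seen in the row

    def update(self, value):
        self.pos += 1
        if value:
            self.total += 1
            self.run_len += 1
            return
        if self.run_len >= 2:  # a 0 closes a qualifying run
            self.runs.append((self.pos - self.run_len, self.pos - 1))
        self.run_len = 0

    def code(self):
        if self.run_len >= 2:  # close a run touching the end of the row
            self.runs.append((self.pos - self.run_len + 1, self.pos))
            self.run_len = 0
        kind = 'c' if len(self.runs) == 1 else 'i'  # assumption, see above
        lengths = '-'.join(str(e - s + 1) for s, e in self.runs)
        cols = '_'.join(f'c{s}-c{e}' for s, e in self.runs)
        return f'{kind}{len(self.runs)}/{lengths}/{cols}/{self.total}'

def encode_row(row) -> str:
    tracker = RunTracker()
    for v in row:
        tracker.update(v)
    return tracker.code()

print(encode_row([0, 1, 1, 0, 1, 1, 1, 0, 0, 1]))  # i2/2-3/c2-c3_c5-c7/6
print(encode_row([0, 1, 1, 1, 1, 1, 1, 0, 1, 0]))  # c1/6/c2-c7/7

# Applied to the question's dataframe it would be something like:
# df["column_result"] = df[[f"column_{i}" for i in range(1, 11)]].apply(encode_row, axis=1)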

Related

Count consecutive row values but reset count with every 0 in row

Within a dataframe, I need to count and sum consecutive row values in column A into a new column, column B.
Starting with column A, the script should count the consecutive runs of 1s; when a 0 appears it prints the total count in column B, then resets the count and continues through the remaining data.
Desired outcome:
A | B
0 0
1 0
1 0
1 0
1 0
0 4
0 0
1 0
1 0
0 2
I've tried using .shift() along with various if-statements but have been unsuccessful.
This could be a way to do it. Probably there exists a more elegant solution.
df['B'] = df['A'].groupby(df['A'].ne(df['A'].shift()).cumsum()).cumsum().shift(fill_value=0) * (df['A'].diff() == -1)
This part, df['A'].groupby(df['A'].ne(df['A'].shift()).cumsum()), groups the data by consecutive occurrences of values.
Then we take the cumsum, which counts the cumulative sum along each group. Then we shift the result by one row, because you want the count after the group. Finally we mask out all rows except the one immediately after the last row of each group of 1s.
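To make the chain concrete, here is a decomposition of its intermediate values on the example column A (the variable names are mine):

import pandas as pd

df = pd.DataFrame({'A': [0, 1, 1, 1, 1, 0, 0, 1, 1, 0]})

run_id = df['A'].ne(df['A'].shift()).cumsum()  # 1 2 2 2 2 3 3 4 4 5: one id per run
within = df['A'].groupby(run_id).cumsum()      # 0 1 2 3 4 0 0 1 2 0: running count inside each run
shifted = within.shift(fill_value=0)           # 0 0 1 2 3 4 0 0 1 2: pushed one row past each run
mask = df['A'].diff() == -1                    # True only on the row right after a run of 1s
df['B'] = shifted * mask                       # 0 0 0 0 0 4 0 0 0 2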
Here is one way to do it. However, I get the feeling that there might be better ways, but you can try this for now:
The routine function is used to increment the counter variable until it encounters a value of 0 in the A column, at which point it grabs the total count and then resets the counter.
I use a for-loop to iterate through the A column and append the returned B values to a list.
This list is then inserted into the dataframe.
df = pd.DataFrame({"A":[0,1,1,1,1,0,0,1,1,0]})
def routine(row, c):
val = 0
if row:
c += 1
else:
val = c
c = 0
return(val, c)
B_vals = []
counter = 0
for item in df['A'].values:
b, counter = routine(item, counter)
B_vals.append(b)
df['B'] = B_vals
print(df)
OUTPUT:
A B
0 0 0
1 1 0
2 1 0
3 1 0
4 1 0
5 0 4
6 0 0
7 1 0
8 1 0
9 0 2

How to set value of first row of pandas dataframe meeting condition?

I want to update the first row of a dataframe that meets a certain condition. Like in this question Get first row of dataframe in Python Pandas based on criteria but for setting instead of just selecting.
df[df['Qty'] > 0].iloc[0] = 5
The above line does not seem to do anything.
Given df below:
a b
0 1 2
1 2 1
2 3 1
you change the values in the first row where the value in column b is equal to 1 by:
df.loc[df[df['b'] == 1].index[0]] = 1000
Output:
a b
0 1 2
1 1000 1000
2 3 1
If you want to change the value in a specific column(s), you can do that too:
df.loc[df[df['b'] == 1].index[0],'a'] = 1000
a b
0 1 2
1 1000 1
2 3 1
I believe what you're looking for is below. (Your original line does nothing because df[df['Qty'] > 0] returns a copy, so the .iloc assignment modifies that copy rather than the original dataframe.)
idx = df[df['Qty'] > 0].index[0]
df.loc[[idx], ['Qty']] = 5
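For what it's worth, an equivalent one-liner uses idxmax to find the first index where the mask is True (a sketch; note it assumes at least one row satisfies the condition, since idxmax on an all-False mask still returns the first label):

df.loc[(df['Qty'] > 0).idxmax(), 'Qty'] = 5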

Add columns to pandas dataframe containing max of each row, AND corresponding column name

My system
Windows 7, 64 bit
python 3.5.1
The challenge
I've got a pandas dataframe, and I would like to know the maximum value for each row and append that info as a new column. I would also like to know the name of the column where the maximum value is located, and to add another column to the existing dataframe containing that column name.
A similar question has been asked and answered for R in this post.
Reproducible example
In[1]:
# Make pandas dataframe
df = pd.DataFrame({'a':[1,0,0,1,3], 'b':[0,0,1,0,1], 'c':[0,0,0,0,0]})
# Calculate max
my_series = df.max(numeric_only=True, axis = 1)
my_series.name = "maxval"
# Include maxval in df
df = df.join(my_series)
df
Out[1]:
a b c maxval
0 1 0 0 1
1 0 0 0 0
2 0 1 0 1
3 1 0 0 1
4 3 1 0 3
So far so good. Now for the "add another column containing the name of the column" part:
In[2]:
?
?
?
# This is what I'd like to accomplish:
Out[2]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a
Notice that I'd like to return all column names if multiple columns contain the same maximum value. Also please notice that the column maxval is not included in maxcol since that would not make much sense. Thanks in advance if anyone out there finds this interesting.
You can compare the df against maxval using eq with axis=0, then use apply with a lambda to produce a boolean mask to mask the columns and join them:
In [183]:
df['maxcol'] = df.loc[:, :'c'].eq(df['maxval'], axis=0).apply(lambda x: ','.join(df.columns[:3][x == x.max()]), axis=1)
df
Out[183]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a

Select last observation per group

Someone asked how to select the first observation per group in a pandas df. I am interested in both the first and the last, and I don't know an efficient way of doing it except writing a for loop.
I am going to modify his example to show what I am looking for.
basically there is a df like this:
group_id
1
1
1
2
2
2
3
3
3
I would like to have a variable that indicates the last observation in a group:
group_id indicator
1 0
1 0
1 1
2 0
2 0
2 1
3 0
3 0
3 1
Using pandas.shift, you can do something like:
df['group_indicator'] = df.group_id != df.group_id.shift(-1)
(or
df['group_indicator'] = (df.group_id != df.group_id.shift(-1)).astype(int)
if it's actually important for you to have it as an integer.)
Note:
for large datasets, this should be much faster than list comprehension (not to mention loops).
As Alexander notes, this assumes the DataFrame is sorted as it is in the example.
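A quick runnable check on the example data:

import pandas as pd

df = pd.DataFrame({"group_id": [1, 1, 1, 2, 2, 2, 3, 3, 3]})
df['group_indicator'] = (df.group_id != df.group_id.shift(-1)).astype(int)
print(df)  # group_indicator is 1 on rows 2, 5 and 8, the last row of each group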
First, we'll create a list of the index locations containing the last element of each group. You can see the elements of each group as follows:
>>> df.groupby('group_id').groups
{1: [0, 1, 2], 2: [3, 4, 5], 3: [6, 7, 8]}
We use a list comprehension to extract the last index location (idx[-1]) of each of these group index values.
We assign the indicator to the dataframe by using a list comprehension and a ternary operator (i.e. 1 if condition else 0), iterating across each element in the index and checking if it is in the idx_last_group list.
idx_last_group = [idx[-1] for idx in df.groupby('group_id').groups.values()]
df['indicator'] = [1 if idx in idx_last_group else 0 for idx in df.index]
>>> df
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Use the .tail method:
df=df.groupby('group_id').tail(1)
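Note that .tail(1) returns the last rows themselves rather than an indicator column; if you want the 0/1 indicator from the question, one hypothetical adaptation is to mark those index positions:

last_idx = df.groupby('group_id').tail(1).index
df['indicator'] = df.index.isin(last_idx).astype(int)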
You can group by 'group_id' and call nth(-1) to get the last entry for each group, then use its index to mask the df and set 'indicator' to 1, filling the rest with 0 using fillna:
In [21]:
df.loc[df.groupby('group_id')['group_id'].nth(-1).index,'indicator'] = 1
df['indicator'].fillna(0, inplace=True)
df
Out[21]:
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Here is the output from the groupby:
In [22]:
df.groupby('group_id')['group_id'].nth(-1)
Out[22]:
2 1
5 2
8 3
Name: group_id, dtype: int64
One line:
data['indicator'] = (data.groupby('group_id').cumcount() == data.groupby('group_id')['any_other_column'].transform('size') - 1).astype(int)
What we do is check whether the cumulative count (which returns a vector the same size as the dataframe) is equal to "the size of the group - 1", which we calculate using transform so that it also returns a vector the same size as the dataframe.
We need to use some other column for the transform because it won't let you transform the .groupby() variable, but this can literally be any other column; it won't be affected, since it's only used in calculating the new indicator. Use .astype(int) to make it binary, and done.
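A small runnable decomposition of that one-liner, with a hypothetical column x standing in for 'any_other_column':

import pandas as pd

data = pd.DataFrame({'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                     'x': range(9)})                           # x: any other column
pos_in_group = data.groupby('group_id').cumcount()             # 0 1 2 0 1 2 0 1 2
group_size = data.groupby('group_id')['x'].transform('size')   # 3 3 3 3 3 3 3 3 3
data['indicator'] = (pos_in_group == group_size - 1).astype(int)
print(data)  # indicator is 1 on the last row of each group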

Removing values that repeat more than 5 times in Pandas DataFrame

I am using pandas to work with csv files. I need to remove a few repeated values if they occur consecutively.
I understand there is a duplicates function that removes any value that repeats a second time, irrespective of where the repeats occur.
But I have to remove the data only if the value in a column repeats for 5 or more consecutive rows.
For example,
1
1
3
1
1
1
1
1
2
Here I don't want to remove the two 1's at the top of B, but only the 1's that repeat 5 or more times in a row.
Any pointers as to how I should go about this?
This should do it:
>>> df = pd.Series([1,1,3,1,1,1,1,1,2])
>>> df.groupby((df.shift() != df).cumsum()).filter(lambda x: len(x) < 5)
0 1
1 1
2 3
8 2
Showing how the answer by elyase also works for a DataFrame (not a Series):
>>> df = pd.DataFrame(np.array([[1,1,3,1,1,1,1,1,2]]).transpose(), columns=["col"])
>>> df.groupby((df["col"].shift() != df["col"]).cumsum()).filter(lambda x: len(x) < 5)
col
0 1
1 1
2 3
8 2
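The grouping trick, decomposed: (df.shift() != df).cumsum() stamps one id on every run of equal consecutive values, and filter then drops the groups with 5 or more members. On the example series:

import pandas as pd

s = pd.Series([1, 1, 3, 1, 1, 1, 1, 1, 2])
run_id = (s.shift() != s).cumsum()
print(run_id.tolist())  # [1, 1, 2, 3, 3, 3, 3, 3, 4] -> the five consecutive 1s share id 3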
