Removing values that repeat more than 5 times in Pandas DataFrame - python

I am using pandas to work with CSV files. I need to remove repeated values, but only when they occur consecutively.
I understand there is a drop_duplicates function that removes any value that appears a second time, irrespective of where it occurs.
But I have to remove the data only if a column's values repeat for 5 or more consecutive rows.
For example,
1
1
3
1
1
1
1
1
2
Here I don't want to remove the two 1's at the top, but only the 1's that repeat 5 times in succession.
Any pointers as to how I should go about this?

This should do it:
>> df = pd.Series([1,1,3,1,1,1,1,1,2])
>> df.groupby((df.shift() != df).cumsum())\
.filter(lambda x: len(x) < 5)
0 1
1 1
2 3
8 2
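
To see why the groupby key works: (df.shift() != df).cumsum() gives every run of consecutive equal values its own label, and filter then drops the groups with 5 or more rows. A small sketch of the intermediate result (using a fresh variable name s for clarity):
import pandas as pd

s = pd.Series([1, 1, 3, 1, 1, 1, 1, 1, 2])

# Each change of value starts a new run, so the cumulative sum of
# "current value differs from the previous one" labels every run
run_id = (s.shift() != s).cumsum()
print(run_id.tolist())   # [1, 1, 2, 3, 3, 3, 3, 3, 4]

# Keep only the runs shorter than 5 rows
print(s.groupby(run_id).filter(lambda x: len(x) < 5))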

Showing how the answer by elyase also works for a DataFrame (not a Series).
>> df = pd.DataFrame(np.array([[1,1,3,1,1,1,1,1,2]]).transpose(),columns = ["col"])
>> df.groupby((df["col"].shift() != df["col"]).cumsum()).filter(lambda x: len(x) < 5)
col
0 1
1 1
2 3
8 2

Python pandas create new column with string code based on boolean rows

I have a dataframe with multiple columns containing booleans/ints (1/0). I need a new result column with code strings that encode, for each row: how many runs of consecutive Trues there are, whether the chain is interrupted or not, and from which column to which column each run of Trues extends.
For example this is the following dataframe:
column_1 column_2 column_3 column_4 column_5 column_6 column_7 column_8 column_9 column_10
0 0 1 0 1 1 1 1 0 0 1
1 0 1 1 0 1 1 1 0 0 1
2 1 1 0 0 0 1 1 0 0 1
3 1 1 1 0 0 0 0 1 1 1
4 1 1 1 0 0 1 0 0 1 1
5 1 1 1 0 0 0 1 1 0 1
6 0 1 1 1 1 1 1 0 1 0
For example, take the following row: 1: [0 1 1 0 1 1 1 0 0 1]
It would result in the code string i2/2-3/c2-c3_c5-c7/6 in column_result, which is built from four segments that I can parse later in my code.
Segment 1:
'i' stands for interrupted (if not interrupted it would be 'c' for consecutive),
and 2 stands for how many runs of 2 or more consecutive Trues were found.
Segment 2:
The lengths of the consecutive groups; in this case the first consecutive count is 2, and the second count is 3.
Segment 3:
The number/id of the column where a run's first True was found and the column number where its last True was found.
Segment 4:
Just the total count of Trues in the row.
Another example would be the following row: 6: [0 1 1 1 1 1 1 0 1 0]
It would result in the code string c1/6/c2-c7/7 in column_result.
The code below is the starting code I used to create the above dataframe, with random ints standing in for bools:
import numpy as np
import pandas as pd

def create_custom_result(df: pd.DataFrame) -> pd.Series:
    return df

def create_dataframe() -> pd.DataFrame:
    df = pd.DataFrame()  # empty df
    for i in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:  # create random bool/int values
        df[f'column_{i}'] = np.random.randint(2, size=50)
    df["column_result"] = ''  # add result column
    return df

if __name__ == "__main__":
    df = create_dataframe()
    custom_results = create_custom_result(df=df)
Would someone have any idea of how to tackle this? To be honest, I have no idea where to start. The closest I found was: count sets of consecutive true values in a column; however, it works on a column rather than across the rows horizontally. Maybe someone can tell me whether I should try np.array tricks, or whether pandas has some function that can help? I found some groupby functions that work horizontally, but I wouldn't know how to convert that into the string code for the result column. Or should I loop through the DataFrame row by row and build the column_result code in segments?
Thanks in advance!
I tried some things already, looping through the dataframe row by row, but had no idea how to build a new column with the code strings.
I also found this article: pandas groupby, but I wouldn't know how to create new column string data from the groups I found. Also, almost everything I find groups by a single column rather than across the rows of all columns.
Maybe this code works?
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, size=(12, 8)))
df.columns = ["col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8"]

def func(df: pd.DataFrame) -> pd.DataFrame:
    result_list = []
    copy = df.copy()
    # Cumulative sum across each row: the value only increases on cells that are 1
    cumsum = copy.cumsum(axis=1)
    for r, s in cumsum.iterrows():
        count = 0
        last = -1
        interrupted = 0
        consecutive = 0
        consecutives = []
        ranges = []
        for x in s.values:
            count += 1
            if x != 0:
                if x != last:
                    # cumsum grew, so this cell is a True
                    consecutive += 1
                    last = x
                    if consecutive == 2:
                        ranges.append(count - 1)  # record the start column of the run
                elif x == last:
                    # cumsum unchanged, so this cell is a False that ends the run
                    if consecutive > 1:
                        interrupted += 1
                        ranges.append(count - 1)  # record the end column of the run
                        consecutives.append(str(consecutive))
                    consecutive = 0
        else:
            # for/else: runs once the row is exhausted, closing a run that
            # reaches the last column
            if consecutive > 1:
                consecutives.append(str(consecutive))
                ranges.append(count)
        result = f'{interrupted}i/{len(consecutives)}c/{"-".join(consecutives)}/{"_".join([f"c{ranges[i]}-c{ranges[i+1]}" for i in range(0, len(ranges), 2)])}/{last}'
        result_list.append(result.split("/"))
    copy["results"] = pd.Series(["/".join(i) for i in result_list])
    copy[["interrupts_count", "consecutives_count", "consecutives lengths",
          "consecutives columns ranges", "total"]] = pd.DataFrame(np.array(result_list))
    return copy

result_df = func(df)
Maybe go with a simple class per column that receives a Series from the original DataFrame (i.e. sliced vertically) plus each new value. Using the vertically sliced array from the original DataFrame, calculate all the starting values as fields (start of the consecutive True values, length of the consecutive True values, last value, etc.). Finally, using those starting fields and each new incoming value, update the fields and prepare the string output.
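As an alternative starting point, here is a minimal per-row sketch using itertools.groupby. The encode_row helper and the exact segment layout are reconstructed from the two example strings in the question (in particular, "more than one run means interrupted" is an assumption), so treat it as a sketch rather than a final answer:
import itertools

import numpy as np
import pandas as pd

def encode_row(values) -> str:
    # Collect (start, end) 1-based column positions of every run of 2+ consecutive 1s
    runs = []
    pos = 1
    for key, grp in itertools.groupby(values):
        length = len(list(grp))
        if key == 1 and length >= 2:
            runs.append((pos, pos + length - 1))
        pos += length
    prefix = "i" if len(runs) > 1 else "c"     # assumed: more than one run = interrupted
    lengths = "-".join(str(end - start + 1) for start, end in runs)
    cols = "_".join(f"c{start}-c{end}" for start, end in runs)
    total = int(sum(values))                   # total count of Trues in the row
    return f"{prefix}{len(runs)}/{lengths}/{cols}/{total}"

df = pd.DataFrame(np.random.randint(2, size=(7, 10)),
                  columns=[f"column_{i}" for i in range(1, 11)])
df["column_result"] = df.apply(lambda row: encode_row(row.tolist()), axis=1)

# The example row [0, 1, 1, 0, 1, 1, 1, 0, 0, 1] encodes to "i2/2-3/c2-c3_c5-c7/6"
print(encode_row([0, 1, 1, 0, 1, 1, 1, 0, 0, 1]))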

How to set value of first row of pandas dataframe meeting condition?

I want to update the first row of a dataframe that meets a certain condition. Like in this question Get first row of dataframe in Python Pandas based on criteria but for setting instead of just selecting.
df[df['Qty'] > 0].iloc[0] = 5
The above line does not seem to do anything.
Given df below:
a b
0 1 2
1 2 1
2 3 1
you change the values in the first row where the value in column b is equal to 1 by:
df.loc[df[df['b'] == 1].index[0]] = 1000
Output:
a b
0 1 2
1 1000 1000
2 3 1
If you want to change the value in a specific column (or columns), you can do that too:
df.loc[df[df['b'] == 1].index[0],'a'] = 1000
a b
0 1 2
1 1000 1
2 3 1
I believe what you're looking for is:
idx = df[df['Qty'] > 0].index[0]
df.loc[[idx], ['Qty']] = 5
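For context on why the original line does nothing: df[df['Qty'] > 0] builds a filtered copy, and the iloc assignment modifies that copy rather than df. A small sketch (the column name is taken from the question; the data is made up):
import pandas as pd

df = pd.DataFrame({'Qty': [0, 3, 7]})

# Chained indexing: the assignment goes into a temporary filtered copy, df is unchanged
df[df['Qty'] > 0].iloc[0] = 5
print(df['Qty'].tolist())   # [0, 3, 7]

# Find the label of the first qualifying row, then assign through .loc on df itself
idx = df[df['Qty'] > 0].index[0]
df.loc[idx, 'Qty'] = 5
print(df['Qty'].tolist())   # [0, 5, 7]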

How to count number of records in a group and save them in a csv file?

I have a dataset as below:
import pandas as pd
dict = {"A":[1,1,1,1,5],"B":[1,1,2,4,1]}
dt = pd.DataFrame(data=dict)
so, it is as below:
A B
1 1
1 1
1 2
1 4
5 1
I need to apply a groupby based on A and B and count how many records each group has.
I have applied the below solution:
dtSize = dt.groupby(by=["A","B"], as_index=False).size()
dtSize.to_csv("./datasets/Final DT/dtSize.csv", sep=',', encoding='utf-8', index=False)
I have 2 problems:
When I open the saved file, it only contains the last column, which holds the number of elements in each group, but it does not include the group columns themselves.
When I print the final dtSize, some of the repeated values in A are not shown, so it looks as if similar records in A are missing.
My desired output in the .csv file is as below:
A B Number of elements in group
1 1 2
1 2 1
1 4 1
5 1 1
Actually, data from A isn't missing. GroupBy.size returns a Series, so A and B are used as a MultiIndex. Due to this, repeated values for A in the first three rows aren't printed.
You're close. You need to reset the index and, optionally, name the result:
dt.groupby(['A', 'B']).size().reset_index(name='Size')
The result is:
A B Size
0 1 1 2
1 1 2 1
2 1 4 1
3 5 1 1
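Putting it together with the CSV export from the question (the output path is shortened here):
import pandas as pd

dt = pd.DataFrame({"A": [1, 1, 1, 1, 5], "B": [1, 1, 2, 4, 1]})

# size() returns a Series keyed by an (A, B) MultiIndex; reset_index turns the
# index levels back into ordinary columns so they are written to the CSV as well
dtSize = dt.groupby(["A", "B"]).size().reset_index(name="Size")
dtSize.to_csv("dtSize.csv", sep=",", encoding="utf-8", index=False)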

Removing duplicates based on two columns while deleting inconsistent data

I have a pandas dataframe like this:
a b c
0 1 1 1
1 1 1 0
2 2 4 1
3 3 5 0
4 3 5 0
where the first 2 columns ('a' and 'b') are IDs while the last one ('c') is a validation (0 = neg, 1 = pos). I know how to remove duplicates based on the values of the first 2 columns; however, in this case I would also like to get rid of inconsistent data, i.e. duplicated data validated both as positive and negative. For example, the first 2 rows are duplicated but inconsistent, so I should remove the entire record, while the last 2 rows are both duplicated and consistent, so I'd keep one of them. The expected result should be:
a b c
0 2 4 1
1 3 5 0
The real dataframe can have more than two duplicates per group, and as you can see, the index has also been reset. Thanks.
First filter the rows with GroupBy.transform and SeriesGroupBy.nunique to keep only groups with a single unique 'c' value (via boolean indexing), then apply DataFrame.drop_duplicates:
df = (df[df.groupby(['a','b'])['c'].transform('nunique').eq(1)]
.drop_duplicates(['a','b'])
.reset_index(drop=True))
print (df)
a b c
0 2 4 1
1 3 5 0
Detail:
print (df.groupby(['a','b'])['c'].transform('nunique'))
0 2
1 2
2 1
3 1
4 1
Name: c, dtype: int64
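A self-contained version of the above, with the example frame built inline so the boolean mask is visible:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3, 3],
                   'b': [1, 1, 4, 5, 5],
                   'c': [1, 0, 1, 0, 0]})

# True only for rows whose (a, b) group has a single unique validation value
consistent = df.groupby(['a', 'b'])['c'].transform('nunique').eq(1)
print(consistent.tolist())   # [False, False, True, True, True]

out = df[consistent].drop_duplicates(['a', 'b']).reset_index(drop=True)
print(out)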

Select last observation per group

Someone asked how to select the first observation per group in a pandas df. I am interested in both the first and the last, and I don't know an efficient way of doing it except writing a for loop.
I am going to modify their example to show what I am looking for.
Basically there is a df like this:
group_id
1
1
1
2
2
2
3
3
3
I would like to have a variable that indicates the last observation in a group:
group_id indicator
1 0
1 0
1 1
2 0
2 0
2 1
3 0
3 0
3 1
Using pandas.shift, you can do something like:
df['group_indicator'] = df.group_id != df.group_id.shift(-1)
(or
df['group_indicator'] = (df.group_id != df.group_id.shift(-1)).astype(int)
if it's actually important for you to have it as an integer.)
Note:
for large datasets, this should be much faster than list comprehension (not to mention loops).
As Alexander notes, this assumes the DataFrame is sorted as it is in the example.
First, we'll create a list of the index locations containing the last element of each group. You can see the elements of each group as follows:
>>> df.groupby('group_id').groups
{1: [0, 1, 2], 2: [3, 4, 5], 3: [6, 7, 8]}
We use a list comprehension to extract the last index location (idx[-1]) of each of these group index values.
We assign the indicator to the dataframe by using a list comprehension and a ternary operator (i.e. 1 if condition else 0), iterating across each element in the index and checking if it is in the idx_last_group list.
idx_last_group = [idx[-1] for idx in df.groupby('group_id').groups.values()]
df['indicator'] = [1 if idx in idx_last_group else 0 for idx in df.index]
>>> df
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Use the .tail method:
df=df.groupby('group_id').tail(1)
You can groupby 'group_id' and call nth(-1) to get the last entry for each group, then use this to mask the df, set 'indicator' to 1 for those rows, and fill the rest with 0 using fillna:
In [21]:
df.loc[df.groupby('group_id')['group_id'].nth(-1).index,'indicator'] = 1
df['indicator'].fillna(0, inplace=True)
df
Out[21]:
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Here is the output from the groupby:
In [22]:
df.groupby('group_id')['group_id'].nth(-1)
Out[22]:
2 1
5 2
8 3
Name: group_id, dtype: int64
One line:
data['indicator'] = (data.groupby('group_id').cumcount() == data.groupby('group_id')['any_other_column'].transform('size') - 1).astype(int)
What we do is check whether the cumulative count (which returns a vector the same size as the dataframe) is equal to the group size minus 1, which we calculate using transform so it also returns a vector the same size as the dataframe.
We need to use some other column for the transform because it won't let you transform the .groupby() column, but this can literally be any other column, and it won't be affected since it's only used to compute the new indicator. Use .astype(int) to make it binary, and done.
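A quick check of that one-liner on the example data (the extra value column is made up here, since any non-grouping column works for the transform):
import pandas as pd

data = pd.DataFrame({'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                     'value': range(9)})   # stand-in for "any other column"

# cumcount numbers rows 0..n-1 within each group; the last row of a group
# is the one whose cumcount equals the group size minus one
within = data.groupby('group_id').cumcount()
sizes = data.groupby('group_id')['value'].transform('size')
data['indicator'] = (within == sizes - 1).astype(int)
print(data['indicator'].tolist())   # [0, 0, 1, 0, 0, 1, 0, 0, 1]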
