Reset count of group after pandas dataframe concatenation [duplicate] - python

This question already has answers here:
Pandas recalculate index after a concatenation
(3 answers)
Closed 3 years ago.
I am doing data processing and I'm having trouble figuring out how to reset a group counter after concatenating pandas dataframes. Here is an example to illustrate my problem:
For example I have two dataframes:
Counter Value
0 1 3
1 1 4
2 1 2
3 2 4
4 2 10
Counter Value
0 1 8
1 1 10
2 2 2
3 2 4
4 2 10
After concatenation I get:
Counter Value
0 1 3
1 1 4
2 1 2
3 2 4
4 2 10
0 1 8
1 1 10
2 2 2
3 2 4
4 2 10
I want to reset the counter so that it stays sequential, with each group in the second dataframe numbered one higher than the last group of the first:
Counter Value
0 1 3
1 1 4
2 1 2
3 2 4
4 2 10
0 3 8
1 3 10
2 4 2
3 4 4
4 4 10
I tried shifting the whole dataframe up by one row, comparing the shifted values with the originals, and, where an original value was bigger than the shifted one, adding that value to all the values below it. But this approach does not always work because the raw data is noisy and inconsistent.

You can just add the maximum value in the Counter column in the first dataframe to the second before concatenating:
df2.Counter += df1.Counter.max()
pd.concat([df1, df2], ignore_index=True)
Counter Value
0 1 3
1 1 4
2 1 2
3 2 4
4 2 10
5 3 8
6 3 10
7 4 2
8 4 4
9 4 10
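For reference, here is a minimal self-contained version of this approach, rebuilding the two frames from the question's data (the variable names are mine):
import pandas as pd

df1 = pd.DataFrame({'Counter': [1, 1, 1, 2, 2], 'Value': [3, 4, 2, 4, 10]})
df2 = pd.DataFrame({'Counter': [1, 1, 2, 2, 2], 'Value': [8, 10, 2, 4, 10]})

# shift df2's group numbers past df1's last group, then concatenate
df2.Counter += df1.Counter.max()
out = pd.concat([df1, df2], ignore_index=True)
print(out)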

Or, another way, using shift():
df = pd.concat([df1, df2])
df = df.assign(Counter_1=df.Counter.ne(df.Counter.shift()).cumsum())
# to overwrite the original column instead:
# df = df.assign(Counter=df.Counter.ne(df.Counter.shift()).cumsum())
Counter Value Counter_1
0 1 3 1
1 1 4 1
2 1 2 1
3 2 4 2
4 2 10 2
0 1 8 3
1 1 10 3
2 2 2 4
3 2 4 4
4 2 10 4
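If a clean sequential index is also wanted (as in the first answer's output), the same cumsum trick can be combined with ignore_index=True. A small sketch, noting that it relies on the counter value changing at the seam between the two frames (which it does in this example):
df = pd.concat([df1, df2], ignore_index=True)
df['Counter'] = df.Counter.ne(df.Counter.shift()).cumsum()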

Related

How to identify one column with continuous number and same value of another column?

I have a DataFrame with two columns A and B.
I want to create a new column named C to identify the continuous A with the same B value.
Here's an example
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,5,6,10,11,12,13,18], 'B':[1,1,2,2,3,3,3,3,4,4]})
I found a similar question, but that method only identifies the continuous A regardless of B.
df['C'] = df['A'].diff().ne(1).cumsum().sub(1)
I have tried to groupby B and apply the function like this:
df['C'] = df.groupby('B').apply(lambda x: x['A'].diff().ne(1).cumsum().sub(1))
However, it doesn't work: TypeError: incompatible index of inserted column with frame index.
The expected output is
A B C
1 1 0
2 1 0
3 2 1
5 2 2
6 3 3
10 3 4
11 3 4
12 3 4
13 4 5
18 4 6
Let's create a sequential counter using groupby, diff and cumsum, then use factorize to re-encode the counter:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().factorize()[0]
Result
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
Use DataFrameGroupBy.diff, compare not equal to 1 with ne, take Series.cumsum, and finally subtract 1:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().sub(1)
print (df)
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
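As a hedged alternative (not from either answer above), the run counter and B can also be fed to groupby together and relabelled with ngroup(); this assumes, as in the example, that both B and the runs of A increase down the frame:
runs = df['A'].diff().ne(1).cumsum()          # consecutive runs of A, ignoring B
df['C'] = df.groupby(['B', runs]).ngroup()    # relabel each (B, run) pair as 0..n-1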

Appending a list to dataframe and adding count

I have a pandas data frame and a list -
d={'abc':[0,2,4,5,2,2],'bec':[0,5,6,4,0,2],'def':[7,6,0,1,1,2],'rtr':[5,6,7,2,0,3],'rwr':[5,6,7,1,0,5],'xx':[4,5,6,7,8,7]}
X=pd.DataFrame(d)
abc bec def rtr rwr xx
0 0 0 7 5 5 4
1 2 5 6 6 6 5
2 4 6 0 7 7 6
3 5 4 1 2 1 7
4 2 0 1 0 0 8
5 2 2 2 3 5 7
l=[ 'bec','def','cef','ghd','rtr','fgh','ewr']
Now I want to append the list to the data frame in the following way:
For each row in the dataframe, we count the number of non-zero elements in it (4 in the first row).
We take 50% of that count, rounded down (2 for the first row), and append that many elements from the list l to the row, starting from the beginning of the list. For the first row these are 'bec' and 'def'; since 'bec' is already present in the row, we simply increase its value by 1 (and likewise for 'def').
If an element from the list is not present in the dataframe, we append it as a new column at the end.
Dry run:
For row index 1, the number of non-zero elements is 6, so 50% of it is 3, and we take the first 3 elements from the list: ['bec','def','cef']. 'bec' is already present, so its count increases by 1 and becomes 6.
Similarly, 'def' is present, so it becomes 7. 'cef' isn't present in the dataframe, so we add it and set its count to 1.
The final output looks like this:
abc bec def rtr rwr xx cef
0 0 1 8 5 5 4 0
1 2 6 7 6 6 5 1
2 4 7 1 7 7 6 0
3 5 5 2 2 1 7 1
4 2 1 1 0 0 8 0
5 2 1 1 3 5 7 1
We can use ne + sum along axis=1 to count the non-zero values in each row, then floordiv by 2 to take 50% of those counts. Next, build a record per row with dict.fromkeys inside a comprehension, create a dataframe (say y) from those records, and add it to X to get the desired result:
y = pd.DataFrame(dict.fromkeys(l[:i], 1)
                 for i in X.ne(0).sum(1).floordiv(2).astype(int))
X.add(y.fillna(0), fill_value=0).astype(int)
abc bec cef def rtr rwr xx
0 0 1 0 8 5 5 4
1 2 6 1 7 6 6 5
2 4 7 0 1 7 7 6
3 5 5 1 2 2 1 7
4 2 1 0 1 0 0 8
5 2 3 1 3 3 5 7
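For readers who find the one-liner dense, here is an explicit loop-based sketch of the same logic (my own rewrite, slower but easier to step through); it also keeps the question's column order by appending new columns at the end:
out = X.copy()
for idx, row in X.iterrows():
    k = int((row != 0).sum()) // 2        # 50% of the non-zero count, rounded down
    for col in l[:k]:
        if col not in out.columns:
            out[col] = 0                  # unseen list element becomes a new column
        out.loc[idx, col] += 1
print(out)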

Shuffling one Column of a DataFrame By Group Efficiently

I am trying to implement a permutation test on a large Pandas dataframe. The dataframe looks like the following:
group some_value label
0 1 8 1
1 1 7 0
2 1 6 2
3 1 5 2
4 2 1 0
5 2 2 0
6 2 3 1
7 2 4 2
8 3 2 1
9 3 4 1
10 3 2 1
11 3 4 2
I want to group by column group, and shuffle the label column and write back to the data frame, preferably in place. The some_value column should remain intact. The result should look something like the following:
group some_value label
0 1 8 1
1 1 7 2
2 1 6 2
3 1 5 0
4 2 1 1
5 2 2 0
6 2 3 0
7 2 4 2
8 3 2 1
9 3 4 2
10 3 2 1
11 3 4 1
I used np.random.permutation but found it was very slow.
df["label"] = df.groupby("group")["label"].transform(np.random.permutation
It seems that df.sample is much faster. How can I solve this problem using df.sample() instead of np.random.permutation, and inplace?
We can use sample. Notice this assumes df = df.sort_values('group'):
df['New'] = df.groupby('group').label.apply(lambda x: x.sample(len(x))).values
Or we can do it by
df['New'] = df.sample(len(df)).sort_values('group').New.values
What about providing a custom transform function?
def sample(x):
    return x.sample(n=x.shape[0])

df.groupby("group")["label"].transform(sample)
This SO explanation of printing out what is passed into the custom function via the transform function is helpful.
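A minimal, reproducible sketch of the transform-based idea, rebuilding the frame from the question (random_state is only there to make the shuffle repeatable):
import pandas as pd

df = pd.DataFrame({
    'group':      [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'some_value': [8, 7, 6, 5, 1, 2, 3, 4, 2, 4, 2, 4],
    'label':      [1, 0, 2, 2, 0, 0, 1, 2, 1, 1, 1, 2],
})

# shuffle label within each group; .values drops the shuffled index so the
# result is assigned back purely by position
df['label'] = (df.groupby('group')['label']
                 .transform(lambda x: x.sample(frac=1, random_state=0).values))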

Filtering pandas dataframe groups based on groups comparison

I am trying to remove corrupted data from my pandas dataframe. I want to remove any group whose value is bigger than the previous group's value by more than one. Here is an example:
Value
0 1
1 1
2 1
3 2
4 2
5 2
6 8  <- if I group by Value, this group's value is bigger than the
7 8     previous group's by 6, so I want to remove this group
8 3     from the dataframe
9 3
Expected result:
Value
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
Edit:
jezrael's solution is great, but in my case it is possible that there will be duplicate group values:
Value
0 1
1 1
2 1
3 3
4 3
5 3
6 1
7 1
Sorry if I was not clear about this.
First remove duplicates to get the unique values, then compare the difference with the shifted values, and finally filter by boolean indexing:
s = df['Value'].drop_duplicates()
v = s[s.diff().gt(s.shift())]
df = df[~df['Value'].isin(v)]
print (df)
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3
Maybe:
df2 = df.drop_duplicates()
print(df[df['Value'].isin(df2.loc[~df2['Value'].gt(df2['Value'].shift(-1)), 'Value'].tolist())])
Output:
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3
We can check whether the difference is less than or equal to 5 or is NaN. After that, we check for duplicates and keep those rows:
s = df[df['Value'].diff().le(5) | df['Value'].diff().isna()]
s[s.duplicated(keep=False)]
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3
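For what it's worth, running the first answer's code on the edited example with repeated group values gives the following (the question does not state an expected result for this case, so treat this as a sketch):
import pandas as pd

df = pd.DataFrame({'Value': [1, 1, 1, 3, 3, 3, 1, 1]})

s = df['Value'].drop_duplicates()
v = s[s.diff().gt(s.shift())]     # unique values whose jump exceeds the previous value
print(df[~df['Value'].isin(v)])
#    Value
# 0      1
# 1      1
# 2      1
# 6      1
# 7      1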

Is it possible to obtain groupby style counts without collapsing Pandas DataFrame?

I have a DataFrame with 9 columns, and I'm trying to add a column of counts of unique values based on the first 3 columns (e.g. columns A, B, and C must match to count as the same value, but the remaining columns can vary). I attempted to do this with groupby:
df = pd.DataFrame(resultsFile500.groupby(['chr','start','end']).size().reset_index().rename(columns={0:'count'}))
This returns a DataFrame with 5 columns, and the counts are what I want. However, I also need values from the original data frame, so what I have been trying to do is somehow get those values of counts as a column in the original df. So, this would mean that if two rows in columns chr, start, and end, had identical values, the counts column would be 2 in both rows, but they would not be collapsed to one row. Is there an easy solution here that I'm missing, or do I need to hack something together?
You can use .transform to get non-collapsing behavior:
>>> df
a b c d e
0 3 4 1 3 0
1 3 1 4 3 0
2 4 3 3 2 1
3 3 4 1 4 0
4 0 4 3 3 2
5 1 2 0 4 1
6 3 1 4 2 1
7 0 4 3 4 0
8 1 3 0 1 1
9 3 4 1 2 1
>>> df.groupby(['a','b','c']).transform('count')
d e
0 3 3
1 2 2
2 1 1
3 3 3
4 2 2
5 1 1
6 2 2
7 2 2
8 1 1
9 3 3
>>>
Note: you'll have to choose an arbitrary column from the .transform result, but then just do:
>>> df['unique_count'] = df.groupby(['a','b','c']).transform('count')['d']
>>> df
a b c d e unique_count
0 3 4 1 3 0 3
1 3 1 4 3 0 2
2 4 3 3 2 1 1
3 3 4 1 4 0 3
4 0 4 3 3 2 2
5 1 2 0 4 1 1
6 3 1 4 2 1 2
7 0 4 3 4 0 2
8 1 3 0 1 1 1
9 3 4 1 2 1 3
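Two equivalent one-liners that avoid transforming every column and then picking one (the second uses the 'size' alias for transform, which counts rows per group; assuming a reasonably recent pandas):
df['unique_count'] = df.groupby(['a', 'b', 'c'])['d'].transform('count')
# or
df['unique_count'] = df.groupby(['a', 'b', 'c'])['a'].transform('size')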
