How do I count the number of unique strings in a rolling window of a pandas dataframe?
import numpy as np
import pandas as pd

a = pd.DataFrame(['a','b','a','a','b','c','d','e','e','e','e'])
a.rolling(3).apply(lambda x: len(np.unique(x)))
The output is the same as the original dataframe:
0
0 a
1 b
2 a
3 a
4 b
5 c
6 d
7 e
8 e
9 e
10 e
Expected:
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
I think you first need to convert the values to numeric, either with factorize or with rank. The min_periods parameter is also necessary to avoid NaN at the start of the column:
a[0] = pd.factorize(a[0])[0]
print (a)
0
0 0
1 1
2 0
3 0
4 1
5 2
6 3
7 4
8 4
9 4
10 4
b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print (b)
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
Or, starting again from the original string column:
a[0] = a[0].rank(method='dense')
0
0 1.0
1 2.0
2 1.0
3 1.0
4 2.0
5 3.0
6 4.0
7 5.0
8 5.0
9 5.0
10 5.0
b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print (b)
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
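If you'd rather avoid the numeric conversion entirely, a plain Python sliding window over the raw strings also works (a minimal sketch, assuming a still holds the original strings; each trailing window of up to w values is reduced with set):

vals = a[0].tolist()
w = 3  # window size
# distinct strings in each trailing window of up to w elements
counts = [len(set(vals[max(0, i - w + 1):i + 1])) for i in range(len(vals))]
b = pd.DataFrame(counts)
print(b)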
I have the following dataframe
print(A)
Index 1or0
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
5 6 1
6 7 1
7 8 0
8 9 1
9 10 1
And I have the following code (from Pandas Dataframe count occurrences that only happen immediately), which counts runs of values that occur immediately one after another.
ser = A["1or0"].ne(A["1or0"].shift().bfill()).cumsum()
B = (
    A.groupby(ser, as_index=False)
     .agg({"Index": ["first", "last", "count"],
           "1or0": "unique"})
     .set_axis(["StartNum", "EndNum", "Size", "Value"], axis=1)
     .assign(Value=lambda d: d["Value"].astype(str).str.strip("[]"))
)
print(B)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
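For reference, this is what the grouper series looks like for the A shown above (a quick sanity check):

print(ser.tolist())
# [0, 0, 0, 1, 1, 1, 1, 2, 3, 3]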
The issue is that when NaN values occur, the code doesn't group them into one interval; it always counts each NaN as its own one-sized interval instead of, e.g., one interval of size 3:
print(A2)
Index 1or0
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
5 6 1
6 7 1
7 8 0
8 9 1
9 10 1
10 11 NaN
11 12 NaN
12 13 NaN
print(B2)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
4 11 11 1 NaN
5 12 12 1 NaN
6 13 13 1 NaN
But I want B2 to be the following
print(B2Wanted)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
4 11 13 3 NaN
What do I need to change so that it works also with NaN?
First fillna with a value that cannot otherwise occur (here -1) before creating your grouper:
group = A['1or0'].fillna(-1).diff().ne(0).cumsum()
# or
# s = A['1or0'].fillna(-1)
# group = s.ne(s.shift()).cumsum()
B = (A.groupby(group, as_index=False)
      .agg(**{'StartNum': ('Index', 'first'),
              'EndNum': ('Index', 'last'),
              'Size': ('1or0', 'size'),
              'Value': ('1or0', 'first'),
              })
)
Output:
StartNum EndNum Size Value
0 1 3 3 0.0
1 4 7 4 1.0
2 8 8 1 0.0
3 9 10 2 1.0
4 11 13 3 NaN
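If you'd rather not pick a sentinel value, an alternative (a sketch, assuming the A2 dataframe with the NaN rows) is to treat two consecutive NaNs as equal when building the grouper:

s = A2['1or0']
# a new group starts when the value changes, except when both the current
# and previous values are NaN (NaN != NaN, so ne() alone would split the run)
change = s.ne(s.shift()) & ~(s.isna() & s.shift().isna())
group = change.cumsum()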
Let's say I have a pandas DataFrame:
import pandas as pd
df = pd.DataFrame({'a': [1,2,2,2,2,1,1,1,2,2]})
>> df
a
0 1
1 2
2 2
3 2
4 2
5 1
6 1
7 1
8 2
9 2
I want to drop duplicates when they exceed a certain threshold n, keeping only the first n. Let's say that n=3. Then, my target dataframe is
>> df
a
0 1
1 2
2 2
3 2
5 1
6 1
7 1
8 2
9 2
EDIT: Each set of consecutive repetitions is considered separately. In this example, rows 8 and 9 should be kept.
You can create a unique value for each consecutive group, then use groupby and head:
import numpy as np

group_value = np.cumsum(df.a.shift() != df.a)
df.groupby(group_value).head(3)
# result:
a
0 1
1 2
2 2
3 2
5 1
6 1
7 1
8 2
9 2
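For reference, the consecutive-group labels produced by the cumsum are (a quick check; labels start at 1 because the first comparison against the shifted-in NaN is True):

print(group_value.tolist())
# [1, 2, 2, 2, 2, 3, 3, 3, 4, 4]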
Use boolean indexing with groupby.cumcount:
N = 3
df[df.groupby('a').cumcount().lt(N)]
Output:
a
0 1
1 2
2 2
3 2
5 1
6 1
For the last N:
df[df.groupby('a').cumcount(ascending=False).lt(N)]
apply on consecutive repetitions
df[df.groupby(df['a'].ne(df['a'].shift()).cumsum()).cumcount().lt(3)]
Output:
a
0 1
1 2
2 2
3 2
5 1
6 1
7 1 # this is the 3rd of its local group
8 2
9 2
advantages of boolean indexing
You can use it for many other operations, such as setting values or masking:
group = df['a'].ne(df['a'].shift()).cumsum()
m = df.groupby(group).cumcount().lt(N)
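For reference, the mask itself looks like this on the example df (a quick check):

print(m.tolist())
# [True, True, True, True, False, True, True, True, True, True]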
df.where(m)
a
0 1.0
1 2.0
2 2.0
3 2.0
4 NaN
5 1.0
6 1.0
7 1.0
8 2.0
9 2.0
df.loc[~m] = -1
a
0 1
1 2
2 2
3 2
4 -1
5 1
6 1
7 1
8 2
9 2
I have a DataFrame that looks like the following:
a b c
0 NaN 8 NaN
1 NaN 7 NaN
2 NaN 5 NaN
3 7.0 3 NaN
4 3.0 5 NaN
5 5.0 4 NaN
6 7.0 1 NaN
7 8.0 9 3.0
8 NaN 5 5.0
9 NaN 6 4.0
What I want to create is a new DataFrame where each cell contains the count of all non-NaN values at or before that position in the same column. The resulting new DataFrame would look like this:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
I have achieved it with the following code:
out = df.copy()
for i in range(len(df)):
    out.iloc[i] = df.iloc[:i + 1].notna().sum()
However, this iterates row by row. My real DataFrame contains thousands of columns and rows, so this approach is far too slow. What can I do? Maybe it should be something related to the pandas .apply() function.
There's no need for apply. It can be done much more efficiently with notna + cumsum (notna flags the non-NaN values, cumsum accumulates the counts):
out = df.notna().cumsum()
Output:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
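If you instead wanted the count of non-NaN values strictly before each row (excluding the current one), a small variation with shift would do it (this reading of "before" is my assumption):

out_excl = df.notna().cumsum().shift(fill_value=0)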
Check with notna combined with cumsum:
out = df.notna().cumsum()
print(out)
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
I have a pandas data frame and I want to move the "F" column to after the "B" column. Is there a way to do that?
     A  B    C  D    E    F
0  7.0  1  NaN  8  1.0  6.0
1  8.0  2  5.0  8  5.0  8.0
2  9.0  3  6.0  8  NaN  5.0
3  1.0  8  1.0  3  4.0  NaN
4  6.0  8  2.0  5  0.0  9.0
5  NaN  2  NaN  1  3.0  8.0
df2
     A  B    F    C  D    E
0  7.0  1  6.0  NaN  8  1.0
1  8.0  2  8.0  5.0  8  5.0
2  9.0  3  5.0  6.0  8  NaN
3  1.0  8  NaN  1.0  3  4.0
4  6.0  8  9.0  2.0  5  0.0
5  NaN  2  8.0  NaN  1  3.0
So it should finally look like df2.
Thanks in advance.
You can try df.insert + df.pop after getting the location of "B" with get_loc:
df.insert(df.columns.get_loc("B")+1,"F",df.pop("F"))
print(df)
A B F C D E
0 7.0 1 6.0 NaN 8 1.0
1 8.0 2 8.0 5.0 8 5.0
2 9.0 3 5.0 6.0 8 NaN
3 1.0 8 NaN 1.0 3 4.0
4 6.0 8 9.0 2.0 5 0.0
5 NaN 2 8.0 NaN 1 3.0
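If you find yourself doing this often, the same idea wraps naturally into a small helper (a hypothetical convenience function, not part of pandas):

def move_after(df, col, after):
    # move column `col` so it sits immediately after column `after` (in place)
    df.insert(df.columns.get_loc(after) + 1, col, df.pop(col))
    return df

df = move_after(df, "F", "B")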
Another minimalist (and very specific!) approach:
df = df[list('ABFCDE')]
Here is a very simple one-line answer, giving a little more explanation of the answer from @warped.
You can do this after adding the 'n' column to your df, as follows:
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
df
l v n
0 a 1 0
1 b 2 0
2 c 1 0
3 d 2 0
# here you can add the below code and it should work.
df = df[list('nlv')]
df
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
However, if your column names are words instead of single letters, pass them as a plain list of strings:
import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2
df
Upper Lower Net Mid Zsore
0 a 1 0 2 2
1 b 2 0 2 2
2 c 1 0 2 2
3 d 2 0 2 2
# here you can add below line and it should work
df = df[['Mid', 'Upper', 'Lower', 'Net', 'Zsore']]
df
Mid Upper Lower Net Zsore
0 2 a 1 0 2
1 2 b 2 0 2
2 2 c 1 0 2
3 2 d 2 0 2
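An equivalent spelling is reindex, which has the side benefit of not raising on missing names (they come back as NaN columns instead; a minor variation):

df = df.reindex(columns=['Mid', 'Upper', 'Lower', 'Net', 'Zsore'])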
Assuming a df as follows:
Product Time
1 1
1 2
1 3
1 4
2 1
2 2
2 3
2 4
2 5
2 6
2 7
3 1
3 2
3 3
4 1
4 2
4 3
I would like to keep only those Products that appear more than 3 times and drop the others.
In the above example, after I do
df.groupby(['Product']).size()
I get the following output:
1 4
2 7
3 3
4 3
and based on this, from my main df, I would only like to retain Product 1 & 2
Expected output:
Product Time
1 1
1 2
1 3
1 4
2 1
2 2
2 3
2 4
2 5
2 6
2 7
Use GroupBy.transform to return a Series the same size as the original, which makes filtering with boolean indexing possible:
df = df[df.groupby(['Product'])['Product'].transform('size') > 3]
print (df)
Product Time
0 1 1
1 1 2
2 1 3
3 1 4
4 2 1
5 2 2
6 2 3
7 2 4
8 2 5
9 2 6
10 2 7
Details:
b = df.groupby(['Product'])['Product'].transform('size') > 3
a = df.groupby(['Product'])['Product'].transform('size')
print (df.assign(size=a, filter=b))
Product Time size filter
0 1 1 4 True
1 1 2 4 True
2 1 3 4 True
3 1 4 4 True
4 2 1 7 True
5 2 2 7 True
6 2 3 7 True
7 2 4 7 True
8 2 5 7 True
9 2 6 7 True
10 2 7 7 True
11 3 1 3 False
12 3 2 3 False
13 3 3 3 False
14 4 1 3 False
15 4 2 3 False
16 4 3 3 False
If the DataFrame is not large, here is an alternative with DataFrameGroupBy.filter:
df = df.groupby(['Product']).filter(lambda x: len(x) > 3)
Alternatively, use transform('size') after grouping, check which values are greater than (gt) 3, and use the result to perform boolean indexing on your dataframe:
df[df.groupby('Product').Time.transform('size').gt(3)]
Product Time
0 1 1
1 1 2
2 1 3
3 1 4
4 2 1
5 2 2
6 2 3
7 2 4
8 2 5
9 2 6
10 2 7
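Another route is to compute the sizes once with value_counts and map them back onto the column (an equivalent sketch of the same boolean-indexing idea):

sizes = df['Product'].map(df['Product'].value_counts())
out = df[sizes > 3]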
You can also do this with a helper column and boolean indexing, if you don't mind adding a column to the dataframe:
g = df.groupby('Product')
df['c'] = g['Time'].transform('count')  # new column holding each group's size
df2 = df[df['c'] > 3]
print(df2)
Product Time c
0 1 1 4
1 1 2 4
2 1 3 4
3 1 4 4
4 2 1 7
5 2 2 7
6 2 3 7
7 2 4 7
8 2 5 7
9 2 6 7
10 2 7 7
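If you don't want to keep the helper column in the result, you can drop it afterwards (a small follow-up to the code above):

df2 = df2.drop(columns='c')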