I have a pandas data frame where values should be greater or equal to preceding values. In cases where the current value is lower than the preceding values, the preceding values must be set equal to the current value. This is best explained by example below:
data = {'group':['A', 'A', 'A', 'A', 'A', 'B', 'B',
'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
'value':[0, 1, 2, 3, 2, 0, 1, 2, 3, 1, 5, 0, 1, 0, 3, 2]}
df = pd.DataFrame(data)
df
group value
0 A 0
1 A 1
2 A 2
3 A 3
4 A 2
5 B 0
6 B 1
7 B 2
8 B 3
9 B 1
10 B 5
11 C 0
12 C 1
13 C 0
14 C 3
15 C 2
and the result I am looking for is:
group value
0 A 0
1 A 1
2 A 2
3 A 2
4 A 2
5 B 0
6 B 1
7 B 1
8 B 1
9 B 1
10 B 5
11 C 0
12 C 0
13 C 0
14 C 2
15 C 2
So here's my go!
(Special thanks to #jezrael for helping me simplify it considerably!)
I'm basing this on Expanding Windows, in reverse, to always get a suffix of the elements in each group (from the last element, expanding towards first).
this expanding window has the following logic:
For element in index i, you get a Series containing all elements in group with indices >=i, and I need to return a new single value for i in the result.
What is the value corresponding to this suffix? its minimum! because if the later elements are smaller, we need to take the smallest among them.
then we can assign the result of this operation to df['value'].
try this:
df['value'] = (df.iloc[::-1]
.groupby('group')['value']
.expanding()
.min()
.reset_index(level=0, drop=True)
.astype(int))
print (df)
Output:
group value
0 A 0
1 A 1
2 A 2
3 A 2
4 A 2
5 B 0
6 B 1
7 B 1
8 B 1
9 B 1
10 B 5
11 C 0
12 C 0
13 C 0
14 C 2
15 C 2
I didnt get your output but I believe you are looking for something like
df['fwd'] = df.value.shift(-1)
df['new'] = np.where(df['value'] > df['fwd'], df['fwd'], df['value'])
Related
Say I have a sample dataframe like this, with val being a binary value (between 1 and 2 in this instance). I would like to eliminate outliers in val, changing them to be the same as the majority value.
df = pandas.DataFrame({'name':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'], 'val':[1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2]})
name val
0 A 1
1 A 2
2 A 2
3 A 2
4 B 2
5 B 1
6 B 1
7 B 1
8 C 1
9 C 1
10 C 2
11 C 2
I would like the values at indexes 0 and 4 to be corrected (to 2 and 1 respectively, here), as there is only one occurrence in each group, but C to be unaltered.
I think I could write a transform statement, but not sure how to go about it.
As you wrote you only have two possible values, you can compare the count of each value:
def fix_outliers(sr):
cnt = sr.value_counts()
return sr if cnt.iloc[0] == cnt.iloc[1] else [cnt.index[0]]*len(sr)
out = df.groupby('name')['val'].transform(fix_outliers)
Output:
>>> out
0 2
1 2
2 2
3 2
4 1
5 1
6 1
7 1
8 1
9 1
10 2
11 2
Name: val, dtype: int64
If you want to keep the value that occurs most times you can use mode to find this values, than you can check if the count of mode is equal to 1. In case it is not equal to 1 that means that has two or more values happen in the same frequency.
for name in df["name"].unique(): #find distinct names in df
if(df[(df["name"] == name)].mode()["val"].count() == 1): #check if mode is sized 1
most_common_value = df[(df["name"] == name)].mode()["val"][0] # find the mode
df.loc[df["name"] == name , "val"] = most_common_value # modify df to val be the mode
Output:
name val
0 A 2
1 A 2
2 A 2
3 A 2
4 B 1
5 B 1
6 B 1
7 B 1
8 C 1
9 C 1
10 C 2
11 C 2
Let's assume I have a df that looks like this:
import pandas as pd
d = {'group': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
'number': [0, 3, 2, 1, 2, 1, -2, 1, 2, 3, 4, 2, 1, -1, 0]}
df = pd.DataFrame(data=d)
df
group number
0 A 0
1 A 3
2 A 2
3 A 1
4 A 2
5 B 1
6 B -2
7 B 1
8 B 2
9 B 3
10 C 4
11 C 2
12 C 1
13 C -1
14 C 0
And I would like to delete a whole group if one of its values in the number column is negative. I can do:
df.groupby('group').filter(lambda g: (g.number < 0).any())
However this gives me the wrong output since this returns all groups with any rows that have a negative number in the number column. See below:
group number
5 B 1
6 B -2
7 B 1
8 B 2
9 B 3
10 C 4
11 C 2
12 C 1
13 C -1
14 C 0
How do I change this function to make it return all groups without any negative numbers in the number column. The output should be group A with its values.
Use the boolean NOT operator ~:
df.groupby('group').filter(lambda g: ~(g.number < 0).any())
Or check if all values don't match using De Morgan's Law:
df.groupby('group').filter(lambda g: (g.number >= 0).all())
You can use the all function which returns the opposite result you expect. i.e. It will do the opposite. It will return TRUE only if all are true else it will return FALSE.
Just Try :
not all(list)
I have something like that:
>>> x = {'id': [1,1,2,2,2,3,4,5,5], 'value': ['a', 'a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']}
>>> df = pd.DataFrame(x)
>>> df
id value
0 1 a
1 1 a
2 2 b
3 2 b
4 2 c
5 3 d
6 4 e
7 5 f
8 5 g
I want to filter inconsistent values in this table. For example, columns with id=2 or id=5 are inconsistent, because the same id is associated with different values. I have read solutions about where or any, but they are not something like "comparing if columns with this id always have the same value.
How can I solve this problem?
You can use groupby and filter. This should give you the ids with inconsistent values.
df.groupby('id').filter(lambda x: x.value.nunique()>1)
id value
2 2 b
3 2 b
4 2 c
7 5 f
8 5 g
In your case we do groupby + transform with nunique
unc_df=df[df.groupby('id').value.transform('nunique').ne(1)]
id value
2 2 b
3 2 b
4 2 c
7 5 f
8 5 g
I guess, you can use drop_duplicates to drop repetitive rows based on id column:
In [599]: df.drop_duplicates('id', keep='first')
Out[599]:
id value
0 1 a
2 2 b
5 3 d
6 4 e
7 5 f
So the above will pick the first value for duplicated id column. And you will have 1 row per id in your resultant dataframe.
Say I have two different columns within a large transportation dataset, one with a trip id and another with a user id. How can I count the amount of times two people have ridden on the same trip together, i.e. different user id but same trip id?
df = pd.DataFrame([[1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5], ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'B', 'C', 'D', 'D','A']]).T
df.columns = ['trip_id', 'user_id']
print(df)
trip_id user_id
0 1 A
1 1 B
2 1 C
3 2 A
4 2 B
5 3 A
6 3 B
7 4 B
8 4 C
9 4 D
10 5 D
11 5 A
The ideal output would be a sort of aggregated pivot table or crosstab that displays each user_id and their count of trips with other user_id's, so as to see who has the highest counts of trips together.
I tried something like this:
df5 = pd.crosstab(index=df4['trip_id'], columns=df4['user_id'])
df5['sum'] = df5[df5.columns].sum(axis=1)
df5
user_id A B C D sum
trip_id
1 1 1 1 0 3
2 1 1 0 0 2
3 1 1 0 0 2
4 0 1 1 1 3
5 1 0 0 1 2
which I can use to get the average users per trip, but not the frequency of unique user_ids riding together on a trip.
I also tried some variations with this:
df.trip_id = df.trip_id+'_'+df.groupby(['user_id','trip_id']).cumcount().add(1).astype(str)
df.pivot('trip_id','user_id')
but I'm not getting what I want. I'm not sure if I need to approach this by iterating with a for loop or if I'll need to stack the dataframe from a crosstab to get those aggregate values. Also, I'm trying to avoid having the trip_id and user_id in the original data be aggregated as numerical datatypes since they should not be treated as ints but strings.
Thank you for any insight you may be able to provide!
Here is an example dataset
import pandas as pd
df = pd.DataFrame([[1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3], ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B']]).T
df.columns = ['trip_id', 'user_id']
print(df)
Gives:
trip_id user_id
0 1 A
1 1 B
2 1 C
3 2 A
4 2 B
5 2 C
6 3 A
7 3 B
8 3 C
9 3 A
10 3 B
I think what you're asking for is:
df.groupby(['trip_id', 'user_id']).size()
trip_id user_id
1 A 1
B 1
C 1
2 A 1
B 1
C 1
3 A 2
B 2
C 1
dtype: int64
Am I correct?
I have the following data frame and need to repeat the values for a set of values. That is, given
test3 = pd.DataFrame(data={'x':[1, 2, 3, 4, pd.np.nan], 'y':['a', 'a', 'a', 'b', 'b']})
test3
x y
0 1 a
1 2 a
2 3 a
3 4 b
4 NaN b
I need to do something like this, but more performant:
test3['group'] = np.NaN
groups = ['a', 'b']
dfs = []
for group in groups:
temp = test3.copy()
temp['group'] = group
dfs.append(temp)
pd.concat(dfs)
That is, the expected output is:
x y group
0 1 a a
1 2 a a
2 3 a a
3 4 b a
4 NaN b a
0 1 a b
1 2 a b
2 3 a b
3 4 b b
4 NaN b b