Say I have a sample dataframe like this, where val takes one of two values (1 or 2 in this instance). I would like to eliminate outliers in val within each name group, changing them to match the group's majority value.
df = pandas.DataFrame({'name':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'], 'val':[1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2]})
name val
0 A 1
1 A 2
2 A 2
3 A 2
4 B 2
5 B 1
6 B 1
7 B 1
8 C 1
9 C 1
10 C 2
11 C 2
I would like the values at indexes 0 and 4 to be corrected (to 2 and 1 respectively, here), as there is only one occurrence in each group, but C to be unaltered.
I think I could write a transform statement, but not sure how to go about it.
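Since you wrote that there are only two possible values, you can compare the counts of each value:
def fix_outliers(sr):
    # value_counts sorts by frequency, so cnt.index[0] is the most common value
    cnt = sr.value_counts()
    # on a tie keep the group unchanged, otherwise fill it with the majority value
    if len(cnt) > 1 and cnt.iloc[0] == cnt.iloc[1]:
        return sr
    return [cnt.index[0]] * len(sr)
out = df.groupby('name')['val'].transform(fix_outliers)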
Output:
>>> out
0 2
1 2
2 2
3 2
4 1
5 1
6 1
7 1
8 1
9 1
10 2
11 2
Name: val, dtype: int64
If you want to keep the value that occurs most often, you can use mode to find it and then check whether mode returns exactly one value. If it returns more than one, two or more values occur with the same frequency, so the group is left unchanged.
for name in df["name"].unique():  # iterate over the distinct names in df
    if df[df["name"] == name].mode()["val"].count() == 1:  # check that there is exactly one mode
        most_common_value = df[df["name"] == name].mode()["val"][0]  # find the mode
        df.loc[df["name"] == name, "val"] = most_common_value  # overwrite val in this group with the mode
Output:
name val
0 A 2
1 A 2
2 A 2
3 A 2
4 B 1
5 B 1
6 B 1
7 B 1
8 C 1
9 C 1
10 C 2
11 C 2
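The same mode-based idea can also be sketched as a groupby transform, which avoids the explicit loop over names. This is only a sketch using the sample data from the question; majority_or_keep is a hypothetical helper name, not something pandas provides.

```python
import pandas as pd

df = pd.DataFrame({
    "name": list("AAAABBBBCCCC"),
    "val": [1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2],
})

def majority_or_keep(s):
    # hypothetical helper: s holds the val entries of one name group
    modes = s.mode()
    if len(modes) == 1:
        # a single mode exists, so every row in the group takes that value
        return pd.Series(modes.iloc[0], index=s.index)
    # tie between values: leave the group untouched
    return s

fixed = df.groupby("name")["val"].transform(majority_or_keep)
print(fixed.tolist())
```

Groups A and B collapse to their majority value, while C is tied and stays as it was.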
Related
I have a data frame as:
a b c d......
1 1
3 3 3 5
4 1 1 4 6
1 0
I want to select a number of columns based on the value in column "a". For the first row, for example, it would select only column b.
How can I achieve something like:
df.iloc[:,column b:number of columns corresponding to value in column a]
My expected output would be:
a b c d e
1 1 0 0 1 # 'e' contains the value in column b because column a = 1
3 3 3 5 335 # 'e' contains the values of columns b, c, d because column a = 3
4 1 1 4 1
1 0 NAN
Define a little function for this:
def select(df, r):
    return df.iloc[r, 1:1 + df.iat[r, 0]]
The function uses iat to query the a column for that row, and iloc to select columns from the same row.
Call it as such:
select(df, 0)
b 1.0
Name: 0, dtype: float64
And,
select(df, 1)
b 3.0
c 3.0
d 5.0
Name: 1, dtype: float64
Based on your edit, consider this -
df
a b c d e
0 1 1 0 0 0
1 3 3 3 5 0
2 4 1 1 4 6
3 1 0 0 0 0
Use where/mask (with numpy broadcasting) + agg here -
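# to_numpy() is used because indexing a Series with [:, None] is no longer supported in pandas 2.x
df['e'] = df.iloc[:, 1:]\
    .astype(str)\
    .where(np.arange(df.shape[1] - 1) < df.a.to_numpy()[:, None], '')\
    .agg(''.join, axis=1)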
df
a b c d e
0 1 1 0 0 1
1 3 3 3 5 335
2 4 1 1 4 1146
3 1 0 0 0 0
If nothing matches, then those entries in e will have an empty string. Just use replace -
df['e'] = df['e'].replace('', np.nan)
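To make the broadcasting step in the where call easier to follow, here is a small sketch that builds the mask on its own. It assumes the integer frame shown in this answer; the names n_value_cols and mask are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 3, 4, 1],
    "b": [1, 3, 1, 0],
    "c": [0, 3, 1, 0],
    "d": [0, 5, 4, 0],
    "e": [0, 0, 6, 0],
})

n_value_cols = df.shape[1] - 1  # every column except "a"

# shape (4,) compared against shape (4, 1) broadcasts to (4, 4):
# entry [i, j] is True when column position j is below the count in a[i]
mask = np.arange(n_value_cols) < df["a"].to_numpy()[:, None]
print(mask.astype(int))

# keep only the masked cells as strings and join them row by row
joined = df.iloc[:, 1:].astype(str).where(mask, "").agg("".join, axis=1)
print(joined.tolist())
```

Each row of the mask has as many leading True entries as the value in column a, which is what limits the join to the first a columns after a.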
A numpy slicing approach
v = df.to_numpy()  # assumed: v holds the frame's values as a NumPy array
a = v[:, 0]
b = v[:, 1:]
n, m = b.shape
b = b.ravel()
b = np.where(b == 0, '', b.astype(str))
r = np.arange(n) * m
f = lambda t: b[t[0]:t[1]]
df.assign(g=list(map(''.join, map(f, zip(r, r + a)))))
a b c d e g
0 1 1 0 0 0 1
1 3 3 3 5 0 335
2 4 1 1 4 6 1146
3 1 0 0 0 0
Edit: one line solution with slicing.
df["f"] = df.astype(str).apply(lambda r: "".join(r[1:int(r["a"])+1]), axis=1)
# df["f"] = df["f"].astype(int) if you need `f` to be integer
df
a b c d e f
0 1 1 X X X 1
1 3 3 3 5 X 335
2 4 1 1 4 6 1146
3 1 0 X X X 0
Dataset used:
df = pd.DataFrame({'a': {0: 1, 1: 3, 2: 4, 3: 1},
'b': {0: 1, 1: 3, 2: 1, 3: 0},
'c': {0: 'X', 1: '3', 2: '1', 3: 'X'},
'd': {0: 'X', 1: '5', 2: '4', 3: 'X'},
'e': {0: 'X', 1: 'X', 2: '6', 3: 'X'}})
Suggestions for improvement would be appreciated!
I have a pandas data frame where values should be greater than or equal to the preceding values. In cases where the current value is lower than the preceding values, the preceding values must be set equal to the current value. This is best explained by the example below:
data = {'group':['A', 'A', 'A', 'A', 'A', 'B', 'B',
'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
'value':[0, 1, 2, 3, 2, 0, 1, 2, 3, 1, 5, 0, 1, 0, 3, 2]}
df = pd.DataFrame(data)
df
group value
0 A 0
1 A 1
2 A 2
3 A 3
4 A 2
5 B 0
6 B 1
7 B 2
8 B 3
9 B 1
10 B 5
11 C 0
12 C 1
13 C 0
14 C 3
15 C 2
and the result I am looking for is:
group value
0 A 0
1 A 1
2 A 2
3 A 2
4 A 2
5 B 0
6 B 1
7 B 1
8 B 1
9 B 1
10 B 5
11 C 0
12 C 0
13 C 0
14 C 2
15 C 2
So here's my go!
(Special thanks to @jezrael for helping me simplify it considerably!)
I'm basing this on expanding windows, applied in reverse, so that each window is a suffix of the elements in its group (starting from the last element and expanding towards the first).
The expanding window has the following logic:
For the element at index i, you get a Series containing all elements in the group with indices >= i, and you need to return a single new value for i in the result.
What value corresponds to this suffix? Its minimum, because if any later element is smaller, we need to take the smallest among them.
We can then assign the result of this operation to df['value'].
Try this:
df['value'] = (df.iloc[::-1]
                 .groupby('group')['value']
                 .expanding()
                 .min()
                 .reset_index(level=0, drop=True)
                 .astype(int))
print (df)
Output:
group value
0 A 0
1 A 1
2 A 2
3 A 2
4 A 2
5 B 0
6 B 1
7 B 1
8 B 1
9 B 1
10 B 5
11 C 0
12 C 0
13 C 0
14 C 2
15 C 2
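As an alternative sketch of the same suffix-minimum idea, a reversed cumulative minimum per group should give the same result with a shorter chain. This swaps the expanding window for groupby cummin, so it is a different technique than the one described above, shown here on the sample data from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "group": list("AAAAABBBBBBCCCCC"),
    "value": [0, 1, 2, 3, 2, 0, 1, 2, 3, 1, 5, 0, 1, 0, 3, 2],
})

# reverse the rows so that a running minimum within each group
# becomes the minimum over "this row and everything after it"
suffix_min = df.iloc[::-1].groupby("group")["value"].cummin()

# the result keeps the original index labels, so assignment realigns it
df["value"] = suffix_min
print(df)
```

Because cummin returns a Series indexed like its input, no reset_index or dtype conversion is needed before assigning back.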
I didn't get your output, but I believe you are looking for something like:
df['fwd'] = df.value.shift(-1)
df['new'] = np.where(df['value'] > df['fwd'], df['fwd'], df['value'])
Say I have two different columns within a large transportation dataset, one with a trip id and another with a user id. How can I count the number of times two people have ridden on the same trip together, i.e. different user id but same trip id?
df = pd.DataFrame([[1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5], ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'B', 'C', 'D', 'D','A']]).T
df.columns = ['trip_id', 'user_id']
print(df)
trip_id user_id
0 1 A
1 1 B
2 1 C
3 2 A
4 2 B
5 3 A
6 3 B
7 4 B
8 4 C
9 4 D
10 5 D
11 5 A
The ideal output would be a sort of aggregated pivot table or crosstab that displays each user_id and their count of trips with other user_id's, so as to see who has the highest counts of trips together.
I tried something like this:
df5 = pd.crosstab(index=df['trip_id'], columns=df['user_id'])
df5['sum'] = df5[df5.columns].sum(axis=1)
df5
user_id A B C D sum
trip_id
1 1 1 1 0 3
2 1 1 0 0 2
3 1 1 0 0 2
4 0 1 1 1 3
5 1 0 0 1 2
which I can use to get the average users per trip, but not the frequency of unique user_ids riding together on a trip.
I also tried some variations with this:
df.trip_id = df.trip_id+'_'+df.groupby(['user_id','trip_id']).cumcount().add(1).astype(str)
df.pivot('trip_id','user_id')
but I'm not getting what I want. I'm not sure if I need to approach this by iterating with a for loop or if I'll need to stack the dataframe from a crosstab to get those aggregate values. Also, I'm trying to avoid having the trip_id and user_id in the original data be aggregated as numerical datatypes since they should not be treated as ints but strings.
Thank you for any insight you may be able to provide!
Here is an example dataset
import pandas as pd
df = pd.DataFrame([[1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3], ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B']]).T
df.columns = ['trip_id', 'user_id']
print(df)
Gives:
trip_id user_id
0 1 A
1 1 B
2 1 C
3 2 A
4 2 B
5 2 C
6 3 A
7 3 B
8 3 C
9 3 A
10 3 B
I think what you're asking for is:
df.groupby(['trip_id', 'user_id']).size()
trip_id user_id
1 A 1
B 1
C 1
2 A 1
B 1
C 1
3 A 2
B 2
C 1
dtype: int64
Am I correct?
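If the goal is instead the number of trips each pair of users shared, as the question describes, one possible sketch is to multiply the trip-by-user crosstab with its own transpose. This uses the sample data from the question with trip_id kept as strings; the names rides and together are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "trip_id": ["1", "1", "1", "2", "2", "3", "3", "4", "4", "4", "5", "5"],
    "user_id": list("ABCABABBCDDA"),
})

# 0/1 indicator of whether a user was on a trip;
# clip guards against a user appearing twice on the same trip
rides = pd.crosstab(df["trip_id"], df["user_id"]).clip(upper=1)

# entry [u, v] counts the trips on which both u and v rode
together = rides.T.dot(rides)

# the diagonal holds each user's own trip count, so zero it out
values = together.to_numpy().copy()
np.fill_diagonal(values, 0)
together = pd.DataFrame(values, index=together.index, columns=together.columns)
print(together)
```

The result is a symmetric user-by-user table, so the largest entries point to the pairs who rode together most often.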
I have a dataframe with columns named A, B, C, D, e, f, g, h. These column names are stored in a list as cols1=[A,B,C,D,e,f,g,h].
I have to group by these columns as df.groupby(['A','B','C','D','e']) and store the result in a variable named e,
then again as df.groupby(['A','B','C','D','f']) and store it in a variable named f,
and again as df.groupby(['A','B','C','D','g']), and so on until the end of the list.
This should be done in a loop. The groupby.sum() values of columns e, f, g, etc. should then be stored in a new variable so that the values of e, f, g, h can be compared.
Is there any way of doing this in pandas? Thanks in advance.
IIUC you need groupby with sum:
df = pd.DataFrame({'A':[1,8,8],
'B':[4,6,6],
'C':[7,2,2],
'D':[1,3,3],
'e':[2,3,6],
'f':[0,2,4],
'g':[7,4,1],
'h':[1,4,2]})
print (df)
A B C D e f g h
0 1 4 7 1 2 0 7 1
1 8 6 2 3 3 2 4 4
2 8 6 2 3 6 4 1 2
cols1=['A','B','C','D','e','f','g','h']
cols11 = cols1[:4]
print (cols11)
['A', 'B', 'C', 'D']
cols12 = cols1[4:]
print (cols12)
['e', 'f', 'g', 'h']
df = df.groupby(cols11)[cols12].sum()
print (df)
e f g h
A B C D
1 4 7 1 2 0 7 1
8 6 2 3 9 6 5 6
df = df.reset_index(drop=True)
print (df)
e f g h
0 2 0 7 1
1 9 6 5 6
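If the per-column loop from the question is still wanted, it could be sketched as a dictionary of grouped sums, one per value column. This follows the same reading of the question as the answer above (group by A, B, C, D and sum each remaining column); the names key_cols, value_cols, sums and comparison are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 8, 8], "B": [4, 6, 6], "C": [7, 2, 2], "D": [1, 3, 3],
    "e": [2, 3, 6], "f": [0, 2, 4], "g": [7, 4, 1], "h": [1, 4, 2],
})
cols1 = ["A", "B", "C", "D", "e", "f", "g", "h"]
key_cols, value_cols = cols1[:4], cols1[4:]

# one grouped sum per value column, stored under the column's name
sums = {col: df.groupby(key_cols)[col].sum() for col in value_cols}

# put the individual results side by side to compare them
comparison = pd.concat(sums, axis=1)
print(comparison)
```

Keeping the results in a dictionary avoids creating one variable per column while still allowing access such as sums["e"].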
I have the following data frame and need to repeat all of its rows once for each value in a set of groups. That is, given
test3 = pd.DataFrame(data={'x':[1, 2, 3, 4, np.nan], 'y':['a', 'a', 'a', 'b', 'b']})
test3
x y
0 1 a
1 2 a
2 3 a
3 4 b
4 NaN b
I need to do something like this, but more performant:
test3['group'] = np.nan
groups = ['a', 'b']
dfs = []
for group in groups:
    temp = test3.copy()
    temp['group'] = group
    dfs.append(temp)
pd.concat(dfs)
That is, the expected output is:
x y group
0 1 a a
1 2 a a
2 3 a a
3 4 b a
4 NaN b a
0 1 a b
1 2 a b
2 3 a b
3 4 b b
4 NaN b b
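One way this might be done without the Python-level loop is to repeat the row positions with np.tile and the group labels with np.repeat. This is a sketch under the assumption that the frame fits in memory once per group; whether it is actually faster would need to be measured on the real data.

```python
import numpy as np
import pandas as pd

test3 = pd.DataFrame({"x": [1, 2, 3, 4, np.nan], "y": ["a", "a", "a", "b", "b"]})
groups = ["a", "b"]

n_rows = len(test3)

# row positions 0..n-1 repeated once per group: 0,1,2,3,4,0,1,2,3,4
positions = np.tile(np.arange(n_rows), len(groups))

# each group label repeated for a full copy of the frame: a,a,a,a,a,b,b,b,b,b
labels = np.repeat(groups, n_rows)

out = test3.iloc[positions].assign(group=labels)
print(out)
```

The index repeats 0 to 4 for each group, matching the expected output above; call reset_index(drop=True) if a unique index is preferred.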