I want to filter a dataframe based on two conditions on two different columns. In the example below, I want to filter the dataframe df to keep only the rows whose uid has at least 2 rows with a val greater than or equal to 4.
df = pd.DataFrame({'uid':[1,1,1,2,2,3,3,4,4,4],'iid':[11,12,13,12,13,13,14,14,11,12], 'val':[3,4,5,3,5,4,5,4,3,4]})
For this dataframe, my output should be
df
uid iid val
0 1 11 3
1 1 12 4
2 1 13 5
5 3 13 4
6 3 14 5
7 4 14 4
8 4 11 3
9 4 12 4
Here, uid 2 is filtered out because the number of rows with uid == 2 and val >= 4 is less than 2. I want to keep only the uids for which the number of rows with val >= 4 is greater than or equal to 2.
You need groupby.transform with 'sum': first check where val is greater than or equal to (ge) 4, sum those flags per uid, and check that the result is ge 2 to use it as a boolean filter on df.
print (df[df['val'].ge(4).groupby(df['uid']).transform('sum').ge(2)])
uid iid val
0 1 11 3
1 1 12 4
2 1 13 5
5 3 13 4
6 3 14 5
7 4 14 4
8 4 11 3
9 4 12 4
EDIT: another way, avoiding groupby.transform, is to loc the rows where val is ge 4, take the uid column, call value_counts on it, and get True where the count is ge 2. Then map the result back onto the uid column to create the boolean filter on df. Same result, and potentially faster.
df[df['uid'].map(df.loc[df['val'].ge(4), 'uid'].value_counts().ge(2))]
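To make the intermediate steps visible, here is a minimal self-contained sketch of the value_counts approach, using the same df as above (the counts and mask names are just for illustration):
import pandas as pd

df = pd.DataFrame({'uid': [1,1,1,2,2,3,3,4,4,4],
                   'iid': [11,12,13,12,13,13,14,14,11,12],
                   'val': [3,4,5,3,5,4,5,4,3,4]})

# step 1: count the rows with val >= 4 per uid
counts = df.loc[df['val'].ge(4), 'uid'].value_counts()
# uid 1 -> 2, uid 3 -> 2, uid 4 -> 2, uid 2 -> 1

# step 2: map the "count >= 2" flag back onto every row's uid
# (fillna(False) guards against a uid with no val >= 4 rows at all)
mask = df['uid'].map(counts.ge(2)).fillna(False)
print(df[mask])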
I have a data frame like this:
Index  Time      Id
0      10:10:00  11
1      10:10:01  12
2      10:10:02  12
3      10:10:04  12
4      10:10:06  13
5      10:10:07  13
6      10:10:08  11
7      10:10:10  11
8      10:10:12  11
9      10:10:14  13
I want to compare the Id column for each pair of rows: row 0 with row 1, row 2 with row 3, etc.
In other words, I want to compare even rows with odd rows and keep the pairs that share the same Id.
My ideal output would be:
Index  Time      Id
2      10:10:02  12
3      10:10:04  12
4      10:10:06  13
5      10:10:07  13
6      10:10:08  11
7      10:10:10  11
I tried this, but it did not work:
df = df[
df[::2]["id"] ==df[1::2]["id"]
]
Your attempt fails because pandas only compares identically-labeled Series, and the even and odd slices have disjoint index labels. You can use a GroupBy.transform approach instead:
import numpy as np

# for each pair, is there only one kind of Id?
out = df[df.groupby(np.arange(len(df)) // 2)['Id'].transform('nunique').eq(1)]
Or, more efficiently, using the underlying numpy array:
# convert to numpy
a = df['Id'].to_numpy()
# are the odds equal to evens?
out = df[np.repeat((a[::2]==a[1::2]), 2)]
output:
Index Time Id
2 2 10:10:02 12
3 3 10:10:04 12
4 4 10:10:06 13
5 5 10:10:07 13
6 6 10:10:08 11
7 7 10:10:10 11
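As a hedged side note: the numpy version assumes an even number of rows, so every row has a partner. A minimal sketch that ignores a trailing unpaired row (the variable names are illustrative):
import numpy as np

a = df['Id'].to_numpy()
n = len(a) - (len(a) % 2)           # ignore a trailing unpaired row, if any
pairs_equal = a[0:n:2] == a[1:n:2]  # compare evens with odds
mask = np.zeros(len(a), dtype=bool)
mask[:n] = np.repeat(pairs_equal, 2)  # expand back to one flag per row
out = df[mask]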
My first question here, I hope this is understandable.
I have a pandas DataFrame:
   order_numbers  x_closest_autobahn
0             34                   3
1             11                   3
2              5                   3
3              8                  12
4              2                  12
I would like to get a new column with the order_number per closest_autobahn in ascending order:
   order_numbers  x_closest_autobahn  order_number_autobahn_x
2              5                   3                        1
1             11                   3                        2
0             34                   3                        3
4              2                  12                        1
3              8                  12                        2
I have tried:
df['order_number_autobahn_x'] = ([df.loc[(df['x_closest_autobahn'] == 3)]].sort_values(by=['order_numbers'], ascending=True, inplace=True))
I have also looked at slicing, sort_values and reset_index:
df.sort_values(by=['order_numbers'], ascending=True, inplace=True)
df = df.reset_index() # reset index to the order after sort
df['order_numbers_index'] = df.index
but I can't seem to get the DataFrame I am looking for.
Use DataFrame.sort_values by both columns, and for the counter use GroupBy.cumcount:
df = df.sort_values(['x_closest_autobahn','order_numbers'])
df['order_number_autobahn_x'] = df.groupby('x_closest_autobahn').cumcount().add(1)
print (df)
order_numbers x_closest_autobahn order_number_autobahn_x
2 5 3 1
1 11 3 2
0 34 3 3
4 2 12 1
3 8 12 2
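As a side note, GroupBy.cumcount numbers the rows within each group starting at 0, which is why add(1) is applied. A quick sketch on the sorted frame:
# cumcount runs 0, 1, 2, ... within each x_closest_autobahn group
print(df.groupby('x_closest_autobahn').cumcount().tolist())
# [0, 1, 2, 0, 1] -> .add(1) turns this into the 1-based counter [1, 2, 3, 1, 2]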
I have a dataframe like the one created below. The goal is to replace a specific value with the previous one.
import pandas as pd
test = pd.DataFrame([2,2,3,1,1,2,4,6,43,23,4,1,3,3,1,1,1,4,5], columns = ['A'])
If one wants to replace all the 1s with the previous values, a possible solution is:
for li in test[test['A'] == 1].index:
    test['A'].iloc[li] = test['A'].iloc[li-1]
However, it is very inefficient. Can you suggest a more efficient solution?
IIUC, replace 1 with np.nan, then ffill:
import numpy as np

test.replace(1, np.nan).ffill().astype(int)
Out[881]:
A
0 2
1 2
2 3
3 3
4 3
5 2
6 4
7 6
8 43
9 23
10 4
11 4
12 3
13 3
14 3
15 3
16 3
17 4
18 5
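One caveat worth hedging: if the very first value is 1, ffill has nothing to fill from, leaves a NaN, and astype(int) raises. A minimal sketch that keeps an unfillable leading 1 as-is (Series.mask is used here as an alternative to replace):
import pandas as pd

test = pd.DataFrame([1, 2, 1, 3], columns=['A'])

# mask() turns the 1s into NaN without touching anything else, then ffill
filled = test['A'].mask(test['A'].eq(1)).ffill()
# a leading 1 has nothing to fill from; keep the original value there
test['A'] = filled.fillna(test['A']).astype(int)
# result: [1, 2, 2, 3]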
I have two pandas DataFrames. I want to get the sum of items_bought for each ID in DF1, then add a column to DF2 containing the sum for each matching ID, filled with 0 where there is no match. How can I do this in an elegant and efficient manner?
DF1
ID | items_bought
1 5
3 8
2 2
3 5
4 6
2 2
DF2
ID
1
2
8
3
2
Desired Result: DF2 Becomes
ID | items_bought
1 5
2 4
8 0
3 13
2 4
df1.groupby('ID').sum().reindex(df2.ID).fillna(0).astype(int)
Out[104]:
items_bought
ID
1 5
2 4
8 0
3 13
2 4
Work on df1 to calculate the sum for each ID.
The resulting dataframe is now indexed by ID, so you can select the df2 IDs with reindex (loc would raise a KeyError in recent pandas, since ID 8 is missing from df1).
Fill the gaps with fillna.
NaN values force a float dtype; now that they are removed, convert the column back to integer.
Solution with groupby and sum, then reindex with fill_value=0, and finally reset_index:
df2 = df1.groupby('ID').items_bought.sum().reindex(df2.ID, fill_value=0).reset_index()
print (df2)
ID items_bought
0 1 5
1 2 4
2 8 0
3 3 13
4 2 4
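If you literally want to add the column to DF2 in place rather than build a new frame, a Series.map lookup is a possible sketch (same df1/df2 names as in the question):
# sum per ID in DF1, then look it up for every ID in DF2
sums = df1.groupby('ID')['items_bought'].sum()
df2['items_bought'] = df2['ID'].map(sums).fillna(0).astype(int)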
I have a dataset with a number of values like below.
>>> a.head()
value freq
3 9 1
2 11 1
0 12 4
1 15 2
I need to fill in the values between the integers in the value column. For example, I need to insert one new row of zeroes between 9 and 11, then another two between 12 and 15. The end result should be the dataset covering 9-15, with the 'missing' rows as zeroes across the board.
Is there any way to insert a new row at a specific location without replacing data? The only methods I've found involve slicing the dataframe at a location, appending a new row, and concatenating the remainder.
UPDATE: The index is completely irrelevant so don't worry about that.
You didn't say what should happen to your Index, so I'm assuming it's unimportant.
In [12]: df.index = df['value']
In [15]: df.reindex(np.arange(df.value.min(), df.value.max() + 1)).fillna(0)
Out[15]:
value freq
value
9 9 1
10 0 0
11 11 1
12 12 4
13 0 0
14 0 0
15 15 2
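Note that fillna(0) also zeroes the value column for the inserted rows (row 10 shows value 0 above). A hedged follow-up sketch, assuming the index was set to value as in In [12], restores it from the index:
import numpy as np

out = df.reindex(np.arange(df.value.min(), df.value.max() + 1)).fillna(0)
out['value'] = out.index             # restore value for the inserted rows
out = out.astype(int).reset_index(drop=True)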
Another option is to create a second dataframe with values from min to max, and outer join this to your dataframe:
import pandas as pd
a = pd.DataFrame({'value':[9,11,12,15], 'freq':[1,1,4,2]})
# value freq
#0 9 1
#1 11 1
#2 12 4
#3 15 2
b = pd.DataFrame({'value': list(range(a.value.min(), a.value.max()+1))})
value
0 9
1 10
2 11
3 12
4 13
5 14
6 15
a = pd.merge(left=a, right=b, on='value', how='outer').fillna(0).sort_values(by='value')
# value freq
#0 9 1.0
#4 10 0.0
#1 11 1.0
#2 12 4.0
#5 13 0.0
#6 14 0.0
#3 15 2.0
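As the output shows, fillna leaves freq as float, because NaN forces a float dtype. If integers are needed, a final one-line cast restores them:
a = a.astype({'freq': int})  # NaNs are gone, so freq can be an integer again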