Find rows whose values are less/greater than rows of another DataFrame - Python

I have 2 dataframes:
df = pd.DataFrame({'begin': [10, 20, 30, 40, 50],
                   'end': [15, 23, 36, 48, 56]})

   begin  end
0     10   15
1     20   23
2     30   36
3     40   48
4     50   56
df2 = pd.DataFrame({'begin2': [12, 13, 22, 40],
                    'end2': [14, 13, 26, 48]})

   begin2  end2
0      12    14
1      13    13
2      22    26
3      40    48
How can I get the rows of df2 that fall within the rows of df? I want each row of df2 to be compared with all rows of df.
That is, I want a df3 like:
   begin2  end2
0      12    14
1      13    13
3      40    48
I tried:
df3 = df2.loc[(df['begin'] <= df2['begin2']) & (df2['end2'] <= df['end'])]

But it only compares row by row and requires the DataFrames to be the same size.

You need apply with boolean indexing:
df = df2[df2.apply(lambda x: any((df['begin'] <= x['begin2']) &
                                 (x['end2'] <= df['end'])), axis=1)]
print (df)
   begin2  end2
0      12    14
1      13    13
3      40    48
Detail:
print (df2.apply(lambda x: any((df['begin'] <= x['begin2']) &
                               (x['end2'] <= df['end'])), axis=1))
0     True
1     True
2    False
3     True
dtype: bool
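
For larger frames, a broadcasting variant of the same containment check avoids the per-row Python call of apply. This is only a sketch using the sample data above, not part of the original answer:

import numpy as np
import pandas as pd

df = pd.DataFrame({'begin': [10, 20, 30, 40, 50],
                   'end': [15, 23, 36, 48, 56]})
df2 = pd.DataFrame({'begin2': [12, 13, 22, 40],
                    'end2': [14, 13, 26, 48]})

# Compare every row of df2 against every row of df at once:
# a (len(df2), len(df)) boolean matrix, True where the df2 interval lies inside the df interval.
inside = (df['begin'].values <= df2['begin2'].values[:, None]) & \
         (df2['end2'].values[:, None] <= df['end'].values)

df3 = df2[inside.any(axis=1)]
print(df3)
#    begin2  end2
# 0      12    14
# 1      13    13
# 3      40    48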

Related

Alternatives to multiple nested if elif statements

I have a data frame with four columns that have values between 0-100.
In a new column I want to assign a value dependent on the values within the first four columns.
The values from the first four columns will be assigned a number 0, 1 or 2 and then summed together as follows:
0-30   = 0
31-70  = 1
71-100 = 2
So the maximum number in the fifth column will be 8 and the minimum 0.
In the example data frame below the fifth column should result in 3 and 6. (Just in case I haven't described this clearly.)
I'm still very new to Python, and at this stage the only string to my bow is a very long and cumbersome multiple nested if statement, followed by df['E'] = df.apply().
My question is: what is the best and most efficient function/method for populating the fifth column?
data = {
    'A': [50, 90],
    'B': [2, 4],
    'C': [20, 80],
    'D': [75, 72],
}
df = pd.DataFrame(data)
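
For reference, the kind of nested-if apply approach the question describes might look like the sketch below (score is a hypothetical helper of my own, not code from the question):

def score(v):
    # map a single 0-100 value to 0, 1 or 2
    if v <= 30:
        return 0
    elif v <= 70:
        return 1
    return 2

baseline = df.apply(lambda row: sum(score(v) for v in row), axis=1)
print(baseline.tolist())  # [3, 6] for the sample data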
Edit
A more comprehensive method with np.select:
condlist = [(0 <= df) & (df <= 30),
            (31 <= df) & (df <= 70),
            (71 <= df) & (df <= 100)]
choicelist = [0, 1, 2]

df['E'] = np.select(condlist, choicelist).sum(axis=1)
print(df)
# Output
    A  B   C   D  E
0  50  2  20  75  3
1  90  4  80  72  6
Use pd.cut after flattening your dataframe into one column with melt:
df['E'] = pd.cut(pd.melt(df, ignore_index=False)['value'],
                 bins=[0, 30, 70, 100], labels=[0, 1, 2]) \
            .cat.codes.groupby(level=0).sum()
print(df)
# Output:
    A  B   C   D  E
0  50  2  20  75  3
1  90  4  80  72  6
Details:
>>> pd.melt(df, ignore_index=False)
  variable  value
0        A     50
1        A     90
0        B      2
1        B      4
0        C     20
1        C     80
0        D     75
1        D     72

>>> pd.cut(pd.melt(df, ignore_index=False)['value'],
...        bins=[0, 30, 70, 100], labels=[0, 1, 2])
0    1
1    2
0    0
1    0
0    0
1    2
0    2
1    2
Name: value, dtype: category
Categories (3, int64): [0 < 1 < 2]
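
One caveat worth adding (my note, not from the answer): pd.cut bins are right-closed and the leftmost edge is open by default, so a value of exactly 0 would fall outside the first bin unless include_lowest=True is passed:

import pandas as pd

s = pd.Series([0, 30, 31, 100])
print(pd.cut(s, bins=[0, 30, 70, 100], labels=[0, 1, 2]))
# 0 becomes NaN because the interval (0, 30] excludes 0
print(pd.cut(s, bins=[0, 30, 70, 100], labels=[0, 1, 2], include_lowest=True))
# 0 now maps to label 0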

How to change only the maximum value of a group in pandas dataframe

I have the following dataset:
Item  Count
A     60
A     20
A     21
B     33
B     33
B     32
Code to reproduce:
import pandas as pd
df = pd.DataFrame([
    ['A', 60],
    ['A', 20],
    ['A', 21],
    ['B', 33],
    ['B', 33],
    ['B', 32],
], columns=['Item', 'Count'])
Suppose I have to change only the maximum value of each group in the "Item" column by adding 1.
The output should look like this:
Item  Count  New_Count
A     60     61
A     20     20
A     21     21
B     33     34
B     33     34
B     32     32
I tried df['New_Count'] = df.groupby(['Item'])['Count'].transform(lambda x: max(x)+1), but every value in "New_Count" was replaced by the max value of its group + 1:
Item  Count  New_Count
A     60     61
A     20     61
A     21     61
B     33     34
B     33     34
B     32     34
Use idxmax:
idx = df.groupby("Item")["Count"].idxmax()
df["New_Count"] = df["Count"]
df.loc[idx, "New_Count"] += 1
This will only increment the first occurrence of the maximum in each group.
If you want to increment all the maximum values in the case of a tie, you can use transform instead. Just replace the first line above with:
idx = df.groupby("Item")["Count"].transform(max) == df["Count"]
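
Putting the tie-handling variant together (a sketch, not part of the original answer):

mask = df.groupby("Item")["Count"].transform("max") == df["Count"]
df["New_Count"] = df["Count"]
df.loc[mask, "New_Count"] += 1
# With the sample data, both B rows with Count == 33 are incremented to 34.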
You can use idxmax() to get the idx of the maximum for each group, and increment only these items, like this:
max_idxs = df.groupby(['Item'])['Count'].idxmax()
df['New_Count'] = df['Count']        # copy entire column
df.loc[max_idxs, 'New_Count'] += 1   # increment only the maximum item of each group by 1
Here's another way not using groupby but using duplicated
df.loc[~df.sort_values('Count', ascending=False).duplicated('Item'), 'Count'] += 1
Output:
  Item  Count
0    A     61
1    A     20
2    A     21
3    B     34
4    B     33
5    B     32
To change all the maximum values, including repeated ones, you will need .groupby(), .join() and np.where():
df = pd.DataFrame([
    ['A', 60],
    ['A', 60],
    ['A', 20],
    ['A', 21],
    ['B', 21],
    ['B', 33],
    ['B', 34],
], columns=['Item', 'Count'])

s = df.groupby('Item')['Count'].max().rename('newCount')
df = df.set_index('Item').join(s).reset_index()
df['newCount'] = np.where(df['Count'] != df['newCount'], df['Count'], (df['newCount'] + 1))
df.head(10)
# output
  Item  Count  newCount
0    A     60        61
1    A     60        61
2    A     20        20
3    A     21        21
4    B     21        21
5    B     33        33
6    B     34        35
Edit
We can replace the .join() with a .transform(), as suggested by @Dan:
df['newCount'] = df.groupby('Item')['Count'].transform('max')
df['newCount'] = np.where(df['Count'] != df['newCount'], df['Count'], (df['newCount'] + 1))
# output
  Item  Count  newCount
0    A     60        61
1    A     60        61
2    A     20        20
3    A     21        21
4    B     21        21
5    B     33        33
6    B     34        35

Convert Nx1 pandas dataframe with single 1xM array-containing column to M columns in Pandas dataframe

This is the current dataframe I have: it is Nx1, with each cell containing a numpy array.
print (df)
                age
0  [35, 34, 55, 56]
1  [25, 34, 35, 66]
2  [45, 35, 53, 16]
.
.
.
N  [45, 35, 53, 16]
I would like somehow to ravel each value of each cell to a new column.
# do conversion
print (df)
   age1  age2  age3  age4
0    35    34    55    56
1    25    34    35    66
2    45    35    53    16
.
.
.
N    45    35    53    16
You can reconstruct the dataframe from the lists, and customize the column names with:
df = pd.DataFrame(df.age.values.tolist())
df.columns += 1
df = df.add_prefix('age')
print(df)
   age1  age2  age3  age4
0    35    34    55    56
1    25    34    35    66
...
Here is another alternative:
import pandas as pd
df = pd.DataFrame({'age':[[35,34,55,54],[1,2,3,4],[5,6,7,8],[9,10,11,12]]})
df['age_aux'] = df['age'].astype(str).str.split(',')
for i in range(4):
    df['age_'+str(i)] = df['age_aux'].str.get(i).map(lambda x: x.lstrip('[').rstrip(']'))
df = df.drop(columns=['age','age_aux'])
print(df)
Output:
  age_0 age_1 age_2 age_3
0    35    34    55    54
1     1     2     3     4
2     5     6     7     8
3     9    10    11    12
You can create the DataFrame with the constructor for better performance and change the column names with rename and f-strings:
df1 = (pd.DataFrame(df.age.values.tolist(), index=df.index)
         .rename(columns=lambda x: f'age{x+1}'))
Another variation is to apply pd.Series to the column and massage the column names:
df = pd.DataFrame({"age": [[1, 2, 3, 4], [2, 3, 4, 5]]})
df = df["age"].apply(pd.Series)
df.columns = ["age1", "age2", "age3", "age4"]

Pandas: Drop and count consecutive duplicates with condition

I want to drop and count consecutive duplicates in column val when val equals 1.
Each run of duplicates should keep the start of its first row and the end of its last row.
df = pd.DataFrame()
df['start'] = [1, 2, 3, 4, 5, 6, 18, 30, 31]
df['end'] = [2, 3, 4, 5, 6, 18, 30, 31, 32]
df['val'] = [1, 1, 1, 1, 1, 12, 12, 1, 1]
df

   start  end  val
0      1    2    1
1      2    3    1
2      3    4    1
3      4    5    1
4      5    6    1
5      6   18   12
6     18   30   12
7     30   31    1
8     31   32    1
Expected Result
   start  end  val
0      1    6    5
1      6   18   12
2     18   30   12
3     30   32    2
I tried:
df[~((df.val==1) & (df.val == df.val.shift(1)) & (df.val == df.val.shift(-1)))]
   start  end  val
0      1    2    1
4      5    6    1
5      6   18   12
6     18   30   12
7     30   31    1
8     31   32    1
but I can't figure out how to get to my expected result. Any suggestions?
Use:
# mask by condition
m = df.val == 1
# consecutive groups
g = m.ne(m.shift()).cumsum()
# filter by condition and aggregate per group
df1 = df.groupby(g[m]).agg({'start': 'first', 'end': 'last', 'val': 'sum'})
# concat together, for correct order create index by g
df = pd.concat([df1, df.set_index(g)[~m.values]]).sort_index().reset_index(drop=True)
print (df)

   start  end  val
0      1    6    5
1      6   18   12
2     18   30   12
3     30   32    2
You could also do a two-liner with a mask to groupby:

m = (df.val.ne(1) | df.val.ne(df.val.shift())).cumsum()
df = df.groupby(m).agg({'start': 'first', 'end': 'last', 'val': 'sum'})
The solution by @jezrael is perfect, but here is a slightly different approach:
df['aux'] = (df['val'] != df['val'].shift()).cumsum()
df.loc[df['val'] == 1, 'end'] = df[df['val'] == 1].groupby('aux')['end'].transform('last')
df.loc[df['val'] == 1, 'val'] = df.groupby('aux')['val'].transform('sum')
df = df.drop_duplicates(subset=df.columns.difference(['start']), keep='first')
df = df.drop(columns=['aux'])

Count appearances of a value until it changes to another value

I have the following DataFrame:
df = pd.DataFrame([10, 10, 23, 23, 9, 9, 9, 10, 10, 10, 10, 12], columns=['values'])
I want to calculate the frequency of each value, but not an overall count - the count of each value until it changes to another value.
I tried:
df['values'].value_counts()
but it gives me
10 6
9 3
23 2
12 1
The desired output is
10:2
23:2
9:3
10:4
12:1
How can I do this?
Use:
df = df.groupby(df['values'].ne(df['values'].shift()).cumsum())['values'].value_counts()
Or:
df = df.groupby([df['values'].ne(df['values'].shift()).cumsum(), 'values']).size()
print (df)
values  values
1       10    2
2       23    2
3       9     3
4       10    4
5       12    1
Name: values, dtype: int64
Last, remove the first level:
df = df.reset_index(level=0, drop=True)
print (df)
values
10 2
23 2
9 3
10 4
12 1
dtype: int64
Explanation:
Compare the original column with the shifted one using ne (not equal), and then add cumsum for the helper Series:
a = df['values'].shift()
b = df['values'].ne(a)
c = b.cumsum()
print (pd.concat([df['values'], a, b, c],
                 keys=('orig', 'shifted', 'not_equal', 'cumsum'), axis=1))

    orig  shifted  not_equal  cumsum
0     10      NaN       True       1
1     10     10.0      False       1
2     23     10.0       True       2
3     23     23.0      False       2
4      9     23.0       True       3
5      9      9.0      False       3
6      9      9.0      False       3
7     10      9.0       True       4
8     10     10.0      False       4
9     10     10.0      False       4
10    10     10.0      False       4
11    12     10.0       True       5
You can keep track of where the changes in df['values'] occur, and group by both the changes and df['values'] (to keep the values as the index), computing the size of each group:
changes = df['values'].diff().ne(0).cumsum()
df.groupby([changes,'values']).size().reset_index(level=0, drop=True)
values
10 2
23 2
9 3
10 4
12 1
dtype: int64
itertools.groupby
from itertools import groupby
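# The list comprehension yields [run length, value] pairs; zip(*...) transposes them
# into (run lengths, values), which pd.Series takes as the data and the index.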
pd.Series(*zip(*[[len([*v]), k] for k, v in groupby(df['values'])]))
10 2
23 2
9 3
10 4
12 1
dtype: int64
It's a generator
def f(x):
    count = 1
    for this, that in zip(x, x[1:]):
        if this == that:
            count += 1
        else:
            yield count, this
            count = 1
    yield count, [*x][-1]

pd.Series(*zip(*f(df['values'])))
10 2
23 2
9 3
10 4
12 1
dtype: int64
Using crosstab
df['key']=df['values'].diff().ne(0).cumsum()
pd.crosstab(df['key'],df['values'])
Out[353]:
values  9  10  12  23
key
1       0   2   0   0
2       0   0   0   2
3       3   0   0   0
4       0   4   0   0
5       0   0   1   0
Slightly modify the result above
pd.crosstab(df['key'],df['values']).stack().loc[lambda x:x.ne(0)]
Out[355]:
key  values
1    10        2
2    23        2
3    9         3
4    10        4
5    12        1
dtype: int64
Based on Python's groupby:
from itertools import groupby
[ (k,len(list(g))) for k,g in groupby(df['values'].tolist())]
Out[366]: [(10, 2), (23, 2), (9, 3), (10, 4), (12, 1)]
This is far from the most time/memory-efficient method in this thread, but here's an iterative approach that is pretty straightforward. Please feel encouraged to suggest improvements on this method.
import pandas as pd
df = pd.DataFrame([10, 10, 23, 23, 9, 9, 9, 10, 10, 10, 10, 12], columns=['values'])
dict_count = {}
for v in df['values'].unique():
    dict_count[v] = 0

curr_val = df.iloc[0]['values']
count = 1
for i in range(1, len(df)):
    if df.iloc[i]['values'] == curr_val:
        count += 1
    else:
        if count > dict_count[curr_val]:
            dict_count[curr_val] = count
        curr_val = df.iloc[i]['values']
        count = 1
if count > dict_count[curr_val]:
    dict_count[curr_val] = count

df_count = pd.DataFrame(dict_count, index=[0])
print(df_count)
The groupby function in itertools can help you. For a str:
>>> string = 'aabbaacc'
>>> for char, freq in groupby(string):
...     print(char, len(list(freq)), sep=':', end='\n')
[out]:
a:2
b:2
a:2
c:2
This function also works for list:
>>> df = pd.DataFrame([10, 10, 23, 23, 9, 9, 9, 10, 10, 10, 10, 12], columns=['values'])
>>> for char, freq in groupby(df['values'].tolist()):
...     print(char, len(list(freq)), sep=':', end='\n')
[out]:
10:2
23:2
9:3
10:4
12:1
Note: always take the 'values' column with df['values'], because DataFrame already has a values attribute, so df.values returns the underlying array rather than the column.
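
A quick illustration of the difference (my sketch, not from the answer):

import pandas as pd

df = pd.DataFrame([10, 10, 23], columns=['values'])
print(type(df.values))      # <class 'numpy.ndarray'>: the DataFrame's values attribute
print(type(df['values']))   # <class 'pandas.core.series.Series'>: the actual column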
