I have a huge pandas DataFrame that looks like this:
id type price min max
1 ch 10 10 100
1 fo 8 20 100
1 dr 7 10 90
1 ad 5 16 20
1 dr 6 10 90
1 fo 4 20 100
2 ch 5 40 50
2 fo 3 10 50
2 ch 3 40 50
... ... ... ... ...
I would like to add a new column 'match' to get something like this:
id type price min max match
1 ch 10 10 100 false
1 fo 8 20 100 false
1 dr 7 10 90 false
1 ad 5 16 20 false
1 dr 6 10 90 true
1 fo 4 20 100 true
2 ch 5 40 50 false
2 fo 3 10 50 false
2 ch 3 40 50 true
... ... ... ... ... ...
I tried using shift:
import numpy as np

df['match'] = np.where((df['id'] == df['id'].shift()) & (df['type'] == df['type'].shift())
                       & (df['min'] == df['min'].shift()) & (df['max'] == df['max'].shift()),
                       True, False)
but that just compares the current row with the previous one. There is no specific pattern to determine the number of previous rows that match the condition. I would like to use the id as a window within which to compare rows. Is there a way to do that?
Any suggestions are highly appreciated.
Thank you
You could use duplicated, specifying the subset of columns to consider:
df.assign(match=df.duplicated(subset=['id', 'type', 'min', 'max']))
id type price min max match
0 1 ch 10 10 100 False
1 1 fo 8 20 100 False
2 1 dr 7 10 90 False
3 1 ad 5 16 20 False
4 1 dr 6 10 90 True
5 1 fo 4 20 100 True
6 2 ch 5 40 50 False
7 2 fo 3 10 50 False
8 2 ch 3 40 50 True
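By default duplicated uses keep='first', so only the second and later occurrences are flagged, which is exactly the match column above. If you instead want every row of a duplicated group marked True, keep=False does that; a minimal sketch (the match_all column name is just illustrative):
# flag all members of each duplicated group, not only the repeats
df['match_all'] = df.duplicated(subset=['id', 'type', 'min', 'max'], keep=False)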
I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the rows which have a value > 0 in column B.
So this:
A B
1 20
1 10
1 -3
2 30
2 -9
2 40
3 10
Should turn into this:
A B
1 20
1 10
2 30
2 40
3 10
Any suggestions on how this can be achieved? I shall be grateful!
In the sample data there are no duplicates, so use only:
df = df[df['B'].gt(0)]
print (df)
A B
0 1 20
1 1 10
3 2 30
5 2 40
6 3 10
If there are duplicates:
print (df)
A B
0 1 20
1 1 10
2 1 10
3 1 10
4 1 -3
5 2 30
6 2 -9
7 2 40
8 3 10
df = df[df['B'].gt(0) & ~df.duplicated()]
print (df)
A B
0 1 20
1 1 10
5 2 30
7 2 40
8 3 10
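The same two conditions can also be chained with query and drop_duplicates; this is equivalent here because duplicated rows share the same B value, so filtering on B first cannot change which duplicates survive:
df = df.query('B > 0').drop_duplicates()
print (df)
A B
0 1 20
1 1 10
5 2 30
7 2 40
8 3 10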
I have two dataframes:
a = pd.DataFrame({'id': [10, 20, 30, 40, 50, 60, 70]})
b = pd.DataFrame({'id': [10, 30, 40, 70]})
print(a)
print(b)
# a
id
0 10
1 20
2 30
3 40
4 50
5 60
6 70
# b
id
0 10
1 30
2 40
3 70
I am trying to add an extra column to a indicating whether id is present in b, like so:
# a
id present
0 10 True
1 20 False
2 30 True
3 40 True
4 50 False
5 60 False
6 70 True
What I've tried:
a.join(b,rsuffix='a')
# and then thought I'd replace nans with False and values with True
# but it does not return what I expect as it joined row by row
id ida
0 10 10.000
1 20 30.000
2 30 40.000
3 40 70.000
4 50 nan
5 60 nan
6 70 nan
Then I added:
a.join(b,rsuffix='a', on='id')
But that did not give what I expected either:
id ida
0 10 nan
1 20 nan
2 30 nan
3 40 nan
4 50 nan
5 60 nan
6 70 nan
I also tried a['present'] = b['id'].isin(a['id']), but that did not return what I expected:
id present
0 10 True
1 20 True
2 30 True
3 40 True
4 50 NaN
5 60 NaN
6 70 NaN
How can I have an extra column in a denoting if id is present in b with True / False statements?
You are close, need to test a['id'] against b['id'] with Series.isin:
a['present'] = a['id'].isin(b['id'])
print (a)
id present
0 10 True
1 20 False
2 30 True
3 40 True
4 50 False
5 60 False
6 70 True
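The reason your earlier attempt a['present'] = b['id'].isin(a['id']) produced NaN is index alignment: that expression is indexed by b (rows 0-3), so when it is assigned into a only those four positions match and rows 4-6 get NaN. You can see this by printing it on its own:
print (b['id'].isin(a['id']))
0 True
1 True
2 True
3 True
Name: id, dtype: bool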
With merge it is possible to use the parameter indicator=True in a left join and test the _merge column for 'both':
a['present'] = a.merge(b, on='id', how='left', indicator=True)['_merge'].eq('both')
print (a)
id present
0 10 True
1 20 False
2 30 True
3 40 True
4 50 False
5 60 False
6 70 True
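One caveat (an assumption worth checking about your data): if b['id'] contains duplicate values, the left merge produces more rows than a and the assignment above misaligns. Deduplicating b first keeps the row counts matched:
a['present'] = a.merge(b.drop_duplicates('id'), on='id', how='left', indicator=True)['_merge'].eq('both')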
I would like to subtract a fixed row's value from each row, grouped by the values in another column.
My data looks like this:
TRACK TIME POSITION_X
0 1 0 12
1 1 30 13
2 1 60 15
3 1 90 11
4 2 0 10
5 2 20 11
6 2 60 13
7 2 90 17
I would like to subtract the fixed row value (where TIME = 0) of the POSITION_X column within each TRACK, and create a new column ("NEW_POSX") with those values. The output should be like this:
TRACK TIME POSITION_X NEW_POSX
0 1 0 12 0
1 1 30 13 1
2 1 60 15 3
3 1 90 11 -1
4 2 0 10 0
5 2 20 11 1
6 2 60 13 3
7 2 90 17 7
I have been using the following code to get this done:
import pandas as pd
data = {'TRACK': [1,1,1,1,2,2,2,2],
'TIME': [0,30,60,90,0,20,60,90],
'POSITION_X': [12,13,15,11,10,11,13,17],
}
df = pd.DataFrame(data, columns=['TRACK', 'TIME', 'POSITION_X'])
df['NEW_POSX']= df.groupby('TRACK')['POSITION_X'].diff().fillna(0).astype(int)
df.head(8)
... but I don't get the desired output. Instead, I get a new column where every row has the previous row subtracted from it (within each "TRACK" group):
TRACK TIME POSITION_X NEW_POSX
0 1 0 12 0
1 1 30 13 1
2 1 60 15 2
3 1 90 11 -4
4 2 0 10 0
5 2 20 11 1
6 2 60 13 2
7 2 90 17 4
Can anyone help me with this?
You can use transform with 'first' to get the value at time 0, and then subtract it from the 'POSITION_X' column:
s = df.groupby('TRACK')['POSITION_X'].transform('first')
df['NEW_POSX'] = df['POSITION_X'] - s
# Same as:
# df['NEW_POSX'] = df['POSITION_X'].sub(s)
Output:
df
TRACK TIME POSITION_X NEW_POSX
0 1 0 12 0
1 1 30 13 1
2 1 60 15 3
3 1 90 11 -1
4 2 0 10 0
5 2 20 11 1
6 2 60 13 3
7 2 90 17 7
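Note that transform('first') takes the first row of each group, which equals the TIME == 0 value only because that row comes first within each TRACK in this data. If that ordering is not guaranteed, sorting beforehand makes the same approach safe (assuming TIME == 0 is the smallest TIME per track):
df = df.sort_values(['TRACK', 'TIME'])
s = df.groupby('TRACK')['POSITION_X'].transform('first')
df['NEW_POSX'] = df['POSITION_X'] - s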
I have a df with weather reporting data. It has over 2 million rows and the following columns.
ID MONTH TEMP
1 1 0
1 1 10
2 1 50
2 1 60
3 1 80
3 1 90
1 2 0
1 2 10
2 2 50
2 2 60
3 2 80
3 2 90
I am looking to create a column for the average monthly temperature. I need a faster way than for-loops. The values for the average monthly temperature come from the TEMP column, and I would like them to be specific to each ID for each MONTH.
ID MONTH TEMP AVE MONTHLY TEMP
1 1 0 5
1 1 10 5
2 1 50 55
2 1 60 55
3 1 80 85
3 1 90 85
1 2 0 5
1 2 10 5
2 2 50 55
2 2 60 55
3 2 80 85
3 2 90 85
Use groupby.transform:
df['AVE MONTHLY TEMP']=df.groupby(['ID','MONTH'])['TEMP'].transform('mean')
print(df)
Output
ID MONTH TEMP AVE MONTHLY TEMP
0 1 1 0 5
1 1 1 10 5
2 2 1 50 55
3 2 1 60 55
4 3 1 80 85
5 3 1 90 85
6 1 2 0 5
7 1 2 10 5
8 2 2 50 55
9 2 2 60 55
10 3 2 80 85
11 3 2 90 85
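If you also want the per-(ID, MONTH) averages as a standalone table, the same groupby can feed a merge back onto the original frame; a sketch using the question's column names:
monthly = df.groupby(['ID', 'MONTH'], as_index=False)['TEMP'].mean()
monthly = monthly.rename(columns={'TEMP': 'AVE MONTHLY TEMP'})
df = df.merge(monthly, on=['ID', 'MONTH'], how='left')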
I think this solution may work better if you have millions of rows, since those (ID, MONTH) groupings may repeat. It assumes the ID series is always grouped consecutively, as in your data. I'm trying to think outside the box here, since you said you have millions of rows:
df['AVG MONTHLY TEMP'] = df.groupby(df['ID'].ne(df['ID'].shift()).cumsum(), as_index=False)['TEMP'].transform('mean')
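To see what that ne/shift/cumsum grouper builds, here is a standalone sketch (made-up IDs) printing the run labels it assigns:
import pandas as pd

ids = pd.Series([1, 1, 2, 2, 3, 3, 1, 1])
print (ids.ne(ids.shift()).cumsum().tolist())
# [1, 1, 2, 2, 3, 3, 4, 4] - each consecutive run of equal IDs gets its own label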
Also, if your temperatures are ALWAYS grouped in pairs, you can use this formula as well:
import numpy as np

df.groupby(np.arange(len(df)) // 2)['TEMP'].transform('mean')
output:
ID MONTH TEMP AVG MONTHLY TEMP
0 1 1 0 5
1 1 1 10 5
2 2 1 50 55
3 2 1 60 55
4 3 1 80 85
5 3 1 90 85
6 1 2 0 5
7 1 2 10 5
8 2 2 50 55
9 2 2 60 55
10 3 2 80 85
11 3 2 90 85
I hope this helps or gives you ideas, since a million lines of data is a lot of data.
In a pandas dataframe, I want to filter the rows where some column is stable within 10.0 units.
def abs_delta_fn(window):
    x = window[0]
    for y in window[1:]:
        if abs(x - y) > 10.0:
            return False
    return True

# raw=True passes a NumPy array, so positional indexing inside the function works
df['filter'] = df['column'].rolling(5, min_periods=5).apply(abs_delta_fn, raw=True)
So, if I have a df like this
1 0
2 20
3 40
4 40
5 40
6 40
7 40
8 90
9 120
10 120
applying the rolling window I get:
1 0 nan
2 20 nan
3 40 nan
4 40 nan
5 40 False
6 40 False
7 40 True
8 90 False
9 120 False
10 120 False
How can I get this in a smart way?
1 0 nan (or False)
2 20 nan (or False)
3 40 True
4 40 True
5 40 True
6 40 True
7 40 True
8 90 False
9 120 False
10 120 False
IIUC, you already know rolling; just add apply after it. The key here is .iloc[::-1], because rolling looks backward from the current row, but you need it to look forward.
s = df.x.iloc[::-1].rolling(5, min_periods=5).apply(lambda x: (abs(x - x[0]) < 10).all(), raw=True)
df.loc[df.index.difference(sum([list(range(x, x + 5)) for x in s[s == 1].index.values], []))]
Out[1119]:
x
1 0
2 20
8 90
9 120
10 120
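If you want the True/False column from the question rather than the filtered frame, the same reversed rolling can be expanded so every member of a stable window is flagged; a sketch assuming the column is named x and the 1-based index shown above:
import pandas as pd

df = pd.DataFrame({'x': [0, 20, 40, 40, 40, 40, 40, 90, 120, 120]}, index=range(1, 11))

# 1.0 where the forward window of 5 starting at this row stays within 10 of its last value
s = (df['x'].iloc[::-1]
         .rolling(5, min_periods=5)
         .apply(lambda w: float((abs(w - w[0]) < 10).all()), raw=True)
         .iloc[::-1])

# a row counts as stable if it belongs to any qualifying window of 5
stable = set()
for start in s[s == 1].index:
    stable.update(range(start, start + 5))
df['filter'] = df.index.isin(sorted(stable))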