I have two DataFrames (missingDate and bias) and one Series (missingDateUnique).
missingDateUnique = pd.Series({0: 2459650, 9: 2459654})
missingDate = pd.DataFrame({0: [2459650, 2459650,2459650,2459654,2459654,2459654], 1: [10, 10,10,14,14,14]},index=[0,1,2,9,10,11])
bias = pd.DataFrame({0: [2459651, 2459652,2459653,2459655,2459656,2459658,2459659], 1: [11, 12,13,15,16,18,19]})
Since the missingDateUnique values are not present in the bias DataFrame, I have to check for i+1 in bias and subtract missingDate's column 1 value from bias's column 1 value.
I was doing it like this
for i in missingDateUnique:
    if i + 1 in bias[0].values:
        missingDate[1] = missingDate[1].sub(missingDate[0].map(bias.set_index(0)[1]), fill_value=0)
The result should be like this: in missingDate's first row, instead of 10 it should be 11 - 10 = 1.
Full output:
2459650 1
2459650 1
2459650 1
2459654 1
2459654 1
2459654 1
For example, for 2459654 in missingDate I have to check for both 2459655 and 2459653 in bias and subtract using either one. If neither 2459655 nor 2459653 is present, I then check for 2459656 and 2459652, and so on.
You can subtract 1 from bias column 0 and map it to missingDate column 0
missingDate[2] = missingDate[0].map(bias.assign(key=bias[0]-1).set_index('key')[1]) - missingDate[1]
print(missingDate)
0 1 2
0 2459650 10 1
1 2459650 10 1
2 2459650 10 1
9 2459654 14 1
10 2459654 14 1
11 2459654 14 1
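The mapping above only covers the +1 case. If you also need the widening search described in the question (check ±1, then ±2, and so on), a minimal sketch could look like the following; the max_offset limit and the NaN fallback are assumptions, not part of the question:
import numpy as np

def nearest_bias(date, max_offset=5):  # max_offset is an assumed search limit
    # look for date+1/date-1, then date+2/date-2, ... in bias and use the first match
    for off in range(1, max_offset + 1):
        for key in (date + off, date - off):
            match = bias.loc[bias[0] == key, 1]
            if not match.empty:
                return match.iloc[0]
    return np.nan  # assumption: no match within max_offset leaves NaN

missingDate[2] = missingDate[0].map(nearest_bias) - missingDate[1]
With the question's data this also yields 1 for every row, since 2459651 and 2459655 are found on the first pass.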
Related
I have two DataFrames, as listed below:
plusMinusOne = pd.DataFrame({0: [2459650, 2459650,2459650,2459654,2459654,2459654,2459660], 1: [100, 90,80,14,15,16,2]},index=[3,4,5,12,13,14,27])
bias = pd.DataFrame({0: [2459651, 2459652,2459653,2459655,2459656,2459658,2459659], 1: [10, 20,30,40,50,60,70]})
I have to subtract bias's column 1 from plusMinusOne's column 1 by matching bias's column 0 with plusMinusOne's column 0.
Since 2459650 is not present in the bias DataFrame, I have to check for 2459651 or 2459649 in bias and subtract either one's value. For every row I have to look 1 above or 1 below in bias and then subtract that value.
I was trying it like this:
for i in plusMinusOne[0]:
    if i + 1 in bias[0].values:
        plusMinusOne[1] = plusMinusOne[1].sub(plusMinusOne[0].map(
            bias.assign(key=bias[0]-1).set_index('key')[1]), fill_value=0)
        break
    elif i - 1 in bias[0].values:
        plusMinusOne[1] = plusMinusOne[1].sub(plusMinusOne[0].map(
            bias.assign(key=bias[0]+1).set_index('key')[1]), fill_value=0)
        break
My expected output is:
plusMinusOne
2459650 90
2459650 80
2459650 70
2459654 -26
2459654 -25
2459654 -24
2459660 -68
A row-wise solution using apply:
def bias_diff(row):
    value = 0
    if (row[0] == bias[0]).any():
        value = row[1] - bias[row[0] == bias[0]].iloc[0, 1]
    elif ((row[0] + 1) == bias[0]).any():
        value = row[1] - bias[(row[0] + 1) == bias[0]].iloc[0, 1]
    else:
        value = row[1] - bias[(row[0] - 1) == bias[0]].iloc[0, 1]
    return value
plusMinusOne[1] = plusMinusOne.apply(bias_diff, axis=1)
print(plusMinusOne)
Output
0 1
3 2459650 90
4 2459650 80
5 2459650 70
12 2459654 -26
13 2459654 -25
14 2459654 -24
27 2459660 -68
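If you want to avoid the row-wise apply, a fully vectorized sketch is possible by building three lookup Series (exact match, +1, -1) and combining them in priority order. This is an assumption about the intended precedence, it requires the shifted keys to stay unique, and it leaves a row unchanged when nothing matches:
# map each date to a bias value: exact match first, then +1, then -1 (assumed priority)
exact = plusMinusOne[0].map(bias.set_index(0)[1])
plus = plusMinusOne[0].map(bias.assign(key=bias[0] - 1).set_index('key')[1])
minus = plusMinusOne[0].map(bias.assign(key=bias[0] + 1).set_index('key')[1])
plusMinusOne[1] = plusMinusOne[1] - exact.fillna(plus).fillna(minus).fillna(0)
With the sample data this reproduces the same 90/80/70, -26/-25/-24 and -68 values shown above.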
This is not efficient code, but it works for your case. It will work for whatever difference you want by changing the value of the diff variable.
import pandas as pd

df1 = pd.DataFrame({0: [2459650, 2459650, 2459650, 2459654, 2459654, 2459654, 2459660], 1: [100, 90, 80, 14, 15, 16, 2]})
df2 = pd.DataFrame({0: [2459651, 2459652, 2459653, 2459655, 2459656, 2459658, 2459659], 1: [10, 20, 30, 40, 50, 60, 70]})
diff = 3

def data_process(df1, df2, i, diff):
    data = None
    for j in range(len(df2)):
        if df1[0][i] == df2[0][j]:
            data = df1[1][i] - df2[1][j]
        else:
            try:
                if df1[0][i] + diff == df2[0][j]:
                    data = df1[1][i] - df2[1][j]
                elif df1[0][i] - diff == df2[0][j]:
                    data = df1[1][i] - df2[1][j]
            except:
                pass
    return data

processed_data = []
for i in range(len(df1)):
    if data_process(df1, df2, i, diff) is None:
        processed_data.append(df1[1][i])
    else:
        processed_data.append(data_process(df1, df2, i, diff))
df1[2] = processed_data
print(df1[[0, 2]])
The output DataFrame for diff = 1 is:
0 2
0 2459650 90
1 2459650 80
2 2459650 70
3 2459654 -26
4 2459654 -25
5 2459654 -24
6 2459660 -68
The output DataFrame for diff = 3 is:
0 2
0 2459650 70.0
1 2459650 60.0
2 2459650 50.0
3 2459654 4.0
4 2459654 5.0
5 2459654 6.0
6 2459660 2
For 2459660 there is no +3 or -3 match (i.e. 2459657 or 2459663) in the second DataFrame, so I return the value as it is. Otherwise it would return a NaN value instead of 2.
I have a table with a lot of point information and I need to fill the Position field after a row-wise comparison of the four preceding fields.
If the X- and Y-Coordinates are equal and the ID_01 is also equal, a comparison of ID_02 is required: "End" goes into the Position field of the row with the lower ID_02 value (here the row with 35), and "Start" into the row with 36, since it is larger.
X-Coordinate  Y-Coordinate  ID_01  ID_02  Position
       45000        554000     15     35  ?
       45000        554000     15     36  ?
       94475         59530      1      1
       94491         60948      1      1
       94491         60948      1      2
       94151         64480      1      2
       94151         64480      1      3
       95408         68694      1      3
       95408         68694      1      4
       94703         69961      1      4
       94703         69961      1      5
       93719         70786      1      5
       93719         70786      1      6
       95310         72044      1      6
       95310         72044      1      7
       99525         82049      1      7
       99525         82049      1      8
      101600         84306      1      8
      102744         85032      1      9
      101600         84306      1      9
      102744         85032      1     10
      104155         86535      1     10
      104575         86430      1     11
How would you handle this in a pandas DataFrame, for instance?
You can use a boolean mask. First sort your values by ID_02, then check for duplicated values. The rows where the mask is True get the End position, the others get the Start position:
m = df.sort_values('ID_02').duplicated(['X-Coordinate', 'Y-Coordinate', 'ID_01'])
df['Position'] = np.where(m, 'End', 'Start')
print(df)
# Output
X-Coordinate Y-Coordinate ID_01 ID_02 Position
0 45000 554000 15 35 Start
1 45000 554000 15 36 End
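For reference, the snippet above can be reproduced against a minimal frame built from just the first two table rows, e.g.:
import numpy as np
import pandas as pd

# the first two rows of the table above, with Position still to be filled
df = pd.DataFrame({
    'X-Coordinate': [45000, 45000],
    'Y-Coordinate': [554000, 554000],
    'ID_01': [15, 15],
    'ID_02': [35, 36],
})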
I have a large dataset and I want to sample from it, but with a condition. What I need is a new DataFrame with almost the same count of values for a boolean column of 0 and 1.
What I have:
df['target'].value_counts()
0 = 4000
1 = 120000
What I need:
new_df['target'].value_counts()
0 = 4000
1 = 6000
I know I can use df.sample, but I don't know how to insert the condition.
Thanks
Since pandas 1.1.0, you can use groupby.sample if you need the same number of rows for each group:
df.groupby('target').sample(4000)
Demo:
df = pd.DataFrame({'x': [0] * 10 + [1] * 25})
df.groupby('x').sample(5)
x
8 0
6 0
7 0
2 0
9 0
18 1
33 1
24 1
32 1
15 1
If you need to sample conditionally based on the group value, you can do:
df.groupby('target', group_keys=False).apply(
    lambda g: g.sample(4000 if g.name == 0 else 6000)
)
Demo:
df.groupby('x', group_keys=False).apply(
    lambda g: g.sample(4 if g.name == 0 else 6)
)
x
7 0
8 0
2 0
1 0
18 1
12 1
17 1
22 1
30 1
28 1
Assuming the following input and using the values 4/6 instead of 4000/6000:
df = pd.DataFrame({'target': [0,1,1,1,0,1,1,1,0,1,1,1,0,1,1,1]})
You could groupby your target and sample to take at most N values per group:
df.groupby('target', group_keys=False).apply(lambda g: g.sample(min(len(g), 6)))
example output:
target
4 0
0 0
8 0
12 0
10 1
14 1
1 1
7 1
11 1
13 1
If you want the same size for both groups, you can simply use df.groupby('target').sample(n=4).
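If you prefer to stay with df.sample directly, a sketch without groupby (assuming there are at least 4000 zeros and 6000 ones to draw, as in the question) would be:
# sample each class separately, then combine and shuffle
zeros = df[df['target'] == 0].sample(4000)
ones = df[df['target'] == 1].sample(6000)
new_df = pd.concat([zeros, ones]).sample(frac=1)  # frac=1 shuffles the combined rows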
I have a dataframe where rows represents dates or minutes, and their corresponding values for each.
Is it possible for pandas to (1) detect an increase in value, (2) then a flat stretch for x number of rows, (3) then another increase?
For example, in the image below, at 09:34 the value is 8.135. It increases to 8.18 and stays there until 09:41, seven rows later. It then increases again to 8.185 at 09:42.
Such a group of rows needs to be identified with a 1 at the end of the group. Referring to the earlier example, the desired output is 1 at 09:42.
Note that the increasing plateaus shown are just for illustration and are not in the actual DataFrame. The code needs to find the "values" column among the 30+ other columns in the DataFrame, process that column to find the plateaus, and write to a new column called "desired output".
The numbers in the values column do not only increase; they may decrease as well. There is no set pattern, just that in this case an event occurred causing the numbers to go up. At least two identical values across two rows are needed to count as a plateau.
Sample CSV here
The closest example I can find here is the following, but the solutions seem to involve sorting the columns, which I cannot do in this case, as the rows represent time.
Pandas: for groups of rows where 2 or more particular columns values are exactly the same, how to assign a unique integer as a new column
I'm still learning Python and can write something like the code below, but I do not know where to start for the above case. Please advise!
Will be great if the above case can be solved using numpy or some other fast method!
def check2(df):
    df.loc[:, 'check2'] = np.where((df.open > df.close) & (df.close.shift(1) > df.open.shift(1)),
                                   np.where((df.open > df.close.shift(1)) & (df.close < df.open.shift(1)), 1, 0), 0)
    return df
Here is one approach (using my own data). It is aware of decreasing as well as increasing values. It detects uptick-plateau-uptick patterns and marks the ending uptick of each pattern detected:
d = {'value': [10,10,11,11,11,11,13,13,13,15,15,15,13,13,13,12,12,13,14,14,14,15,15,15]}
df = pd.DataFrame(d)
df['diff'] = df['value'].diff()
df['plat_up_end'] = (df['diff'] > 0) & (df['diff'].shift() == 0)
df['output'] = (df['plat_up_end']
                & (df['diff'].replace(to_replace=0, method='ffill') > 0).shift()).astype('int')
df[['value','output']]
value output
0 10 0
1 10 0
2 11 0
3 11 0
4 11 0
5 11 0
6 13 1
7 13 0
8 13 0
9 15 1
10 15 0
11 15 0
12 13 0
13 13 0
14 13 0
15 12 0
16 12 0
17 13 0
18 14 0
19 14 0
20 14 0
21 15 1
22 15 0
23 15 0
Does not assume a monotonically increasing series. Values may decrease as well as increase. It does assume that at least two of the same values are needed to form a plateau. Could be adjusted if not true.
Here's how to extend the minimum required number to form a plateau. You would need to modify this line:
df['plat_up_end'] = (df['diff'] > 0) & (df['diff'].shift() == 0)
This is the look-back part of that line:
(df['diff'].shift() == 0)
A zero in the 'diff' column indicates that the entry is the same as the entry before it. So you need to look back n-1 entries from the uptick to identify a plateau of n values. If you want to see at least 4 of the same values to form a plateau then you would need 3 look-backs. Notice that each shift() looks back one further than the one before it.
& (df['diff'].shift() == 0) & (df['diff'].shift(2) == 0) & (df['diff'].shift(3) == 0)
Full line:
df['plat_up_end'] = (df['diff'] > 0) & (df['diff'].shift() == 0) & (df['diff'].shift(2) == 0) & (df['diff'].shift(3) == 0)
Result:
value output
0 10 0
1 10 0
2 11 0
3 11 0
4 11 0
5 11 0
6 13 1
7 13 0
8 13 0
9 15 0
10 15 0
11 15 0
12 13 0
13 13 0
14 13 0
15 12 0
16 12 0
17 13 0
18 14 0
19 14 0
20 14 0
21 15 0
22 15 0
23 15 0
Though if the plateau size gets much bigger we would probably want to change the logic to do a rolling sum instead.
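A minimal sketch of that rolling-sum idea, assuming n is the minimum number of equal values required for a plateau and keeping the rest of the approach unchanged, could look like this (it replaces only the plat_up_end line):
n = 4  # assumed minimum plateau length
flat = df['diff'].eq(0)  # True where the value equals the previous value
# an uptick preceded by at least n-1 consecutive flat steps
df['plat_up_end'] = (df['diff'] > 0) & (flat.rolling(n - 1).sum().shift(1) == n - 1)
With n = 4 this reproduces the result shown above, where only row 6 is flagged.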
We first make a Series (tmp) that marks where the value has increased.
Then we exclude positions where the previous value has also increased.
tmp = df['values'].diff().gt(0.0, fill_value=True) # 1
df['sol'] = (tmp & ~tmp.shift(fill_value=True)).astype(int) # 2
Example Input/Output:
df = pd.DataFrame({
    't': pd.date_range('2021-01-01 12:00:00', '2021-01-01 12:10:00', freq='1min', closed='left'),
    'values': [8.0, 8.15, 8.15, 8.15, 8.2, 8.3, 8.4, 8.4, 8.5, 8.5],
    'ans': [0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
})
# ...
df
t values ans sol
0 2021-01-01 12:00:00 8.00 0 0
1 2021-01-01 12:01:00 8.15 0 0
2 2021-01-01 12:02:00 8.15 0 0
3 2021-01-01 12:03:00 8.15 0 0
4 2021-01-01 12:04:00 8.20 1 1
5 2021-01-01 12:05:00 8.30 0 0
6 2021-01-01 12:06:00 8.40 0 0
7 2021-01-01 12:07:00 8.40 0 0
8 2021-01-01 12:08:00 8.50 1 1
9 2021-01-01 12:09:00 8.50 0 0
How do I count how many values in each row are greater than a specific value in pandas?
For example, I have a pandas DataFrame dff and I want to count, for each row, the values greater than 0.
dff = pd.DataFrame(np.random.randn(9,3),columns=['a','b','c'])
dff
a b c
0 -0.047753 -1.172751 0.428752
1 -0.763297 -0.539290 1.004502
2 -0.845018 1.780180 1.354705
3 -0.044451 0.271344 0.166762
4 -0.230092 -0.684156 -0.448916
5 -0.137938 1.403581 0.570804
6 -0.259851 0.589898 0.099670
7 0.642413 -0.762344 -0.167562
8 1.940560 -1.276856 0.361775
I am using an inefficient approach. How can I make it more efficient?
dff['count'] = 0
for m in range(len(dff)):
    og = 0
    for i in dff.columns:
        if dff[i][m] > 0:
            og += 1
    dff['count'][m] = og
dff
dff
a b c count
0 -0.047753 -1.172751 0.428752 1
1 -0.763297 -0.539290 1.004502 1
2 -0.845018 1.780180 1.354705 2
3 -0.044451 0.271344 0.166762 2
4 -0.230092 -0.684156 -0.448916 0
5 -0.137938 1.403581 0.570804 2
6 -0.259851 0.589898 0.099670 2
7 0.642413 -0.762344 -0.167562 1
8 1.940560 -1.276856 0.361775 2
You can create a boolean mask of your DataFrame that is True wherever a value is greater than your threshold (in this case 0), and then sum across the columns (axis=1).
dff.gt(0).sum(1)
0 1
1 1
2 2
3 2
4 0
5 2
6 2
7 1
8 2
dtype: int64
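If you want the result stored back in the frame like your loop did, you can assign it to a new column, e.g.:
# count per row how many column values are above 0
dff['count'] = dff.gt(0).sum(axis=1)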