I have 5 years of stock data. I need to do this: take years 1, 2 and 3, and answer: what is the probability that after seeing k consecutive "down days", the next day is an "up day"? For example, if k = 3, what is the probability of seeing "−, −, −, +" as opposed to seeing "−, −, −, −"? Compute this for k = 1, 2, 3. I have played with groupby and cumsum, but can't seem to get it right.
For example:
group1 = df[df['True Label'] == "-"].groupby((df['True Label'] != "-").cumsum())
Date        True Label
2019-01-02  +
2019-01-03  -
2019-01-04  +
2019-01-07  +
2019-01-08  +
Try this bit of logic:
import pandas as pd
import numpy as np
np.random.seed(123)
s = pd.Series(np.random.choice(['+','-'], 1000))
sm = s.groupby((s == '+').cumsum()).cumcount()
prob = (sm.diff() == -3).sum() / (sm == 3).sum()
prob
Output:
0.43661971830985913
Details:
Use (s == '+').cumsum() to create groups of '-' records, then groupby and cumcount the elements within each group. The first element is the '+', and cumcount starts at zero; therefore '+--' becomes 0, 1, 2. Now take the difference to find where '-' turns into '+'.
If this is equal to -3, then we know this group has three minuses and is followed by a '+'.
Check sm == 3 to get the number of times you had '---', sum, then divide.
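Since you need this for k = 1, 2 and 3, the same logic generalizes directly. A minimal sketch that wraps it in a helper (prob_up_after_k_downs is a made-up name, not from the answer above):

import pandas as pd
import numpy as np

def prob_up_after_k_downs(s, k):
    # cumcount within each group that starts at a '+':
    # the k-th consecutive '-' in a group has cumcount == k
    sm = s.groupby((s == '+').cumsum()).cumcount()
    # a drop of exactly -k means k '-'s were immediately followed by a '+'
    return (sm.diff() == -k).sum() / (sm == k).sum()

np.random.seed(123)
s = pd.Series(np.random.choice(['+', '-'], 1000))
for k in (1, 2, 3):
    print(k, prob_up_after_k_downs(s, k))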
The data in my CSV looks like this:
staff_id,clock_time,device_id,latitude,longitude
1001,2020/9/14 4:43:00,d_1,24.59652556,118.0824644
1001,2020/9/14 8:34:40,d_1,24.59732974,118.0859631
1001,2020/9/14 3:33:34,d_1,24.73208312,118.0957197
1001,2020/9/14 4:17:29,d_1,24.59222786,118.0955275
1001,2020/9/20 5:30:56,d_1,24.59689407,118.2863806
1001,2020/9/20 7:26:05,d_1,24.58237852,118.2858955
I want to find any row where the difference between the longitude or latitude of 2 consecutive rows is greater than 0.1, then put the row indexes of the two consecutive rows into a list.
In my data, the latitude differences between rows 2 (24.59732974), 3 (24.73208312) and 4 (24.59222786) are greater than 0.1, and the longitude difference between rows 4 (118.0955275) and 5 (118.2863806) is greater than 0.1.
I want to put the indexes of rows 2, 3, 4 into a list latitude_diff_list, and the indexes of rows 4, 5 into another list longitude_diff_list. What should I do?
You need a combination of diff() to check whether the absolute difference with the next or the previous row is more than 0.1, and then get the indices of these rows (I understand you actually want the index, not the descriptive row number, i.e. an index that starts from 0). One way you could do this is:
latitude_diff_list = df.index[(abs(df['latitude'].diff()) > 0.1) | (abs(df['latitude'].diff(-1)) > 0.1)].tolist()
longitude_diff_list = df.index[(abs(df['longitude'].diff()) > 0.1) | (abs(df['longitude'].diff(-1)) > 0.1)].tolist()
You can then offset this by +1 if you want the row number starting from 1 (e.g. [i+1 for i in latitude_diff_list])
I believe you need the absolute difference between the original and shifted values, compared by DataFrame.gt for greater than:
m1 = df[['latitude','longitude']].diff().abs().gt(0.1)
m2 = df[['latitude','longitude']].shift().diff().abs().gt(0.1)
m = m1 | m2
print (m)
latitude longitude
0 False False
1 False False
2 True False
3 True False
4 True True
5 False True
latitude_diff_list = df.index[m['latitude']].tolist()
print (latitude_diff_list)
[2, 3, 4]
longitude_diff_list = df.index[m['longitude']].tolist()
print (longitude_diff_list)
[4, 5]
This should work:
import pandas as pd

df_ex = pd.read_csv('ex.csv', sep=',')
latitude_diff_list, longitude_diff_list = [], []
for idx, row in df_ex[1:].iterrows():
    if abs(row['latitude'] - df_ex.loc[idx - 1, 'latitude']) > 0.1:
        latitude_diff_list.extend([idx - 1, idx])
    if abs(row['longitude'] - df_ex.loc[idx - 1, 'longitude']) > 0.1:
        longitude_diff_list.extend([idx - 1, idx])
# drop the duplicates introduced by overlapping pairs
latitude_diff_list, longitude_diff_list = sorted(set(latitude_diff_list)), sorted(set(longitude_diff_list))
I'm struggling with a series sum after having already grouped the dataframe, and I was hoping that someone could please help me with an idea.
Basically, in the example below, I need the sum per "Material".
Material "ABC" should give me 2, and all the others, since they each have only one sign of operation, would keep their value.
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "Material": ["M-12", "H4-LAMPE", "M-12", "H4-LAMPE",
                 "ABC", "H4-LAMPE", "ABC", "ABC"],
    "Quantity": [6, 1, 3, 5, 1, 1, 10, 9],
    "TYPE": ["+", "-", "+", "-", "+", "-", "+", "-"]})
df.groupby(["Material", "Quantity"], as_index=False).count()
listX = []
for item in df["TYPE"]:
    if item == "+":
        listX.append(1)
    elif item == "-":
        listX.append(-1)
    else:
        pass
df["Sign"] = listX
df["MovementsQty"] = df["Quantity"] * df["Sign"]
#df = df.groupby(["Material", "TYPE", "Quantity1"]).sum()
df1 = df.groupby(["Material", "TYPE"]).sum()
df1.drop(columns=["Quantity", "Sign"], inplace=True)
print(df1)
The result is:

               MovementsQty
Material TYPE
ABC      +               11
         -               -9
H4-LAMPE -               -7
M-12     +                9

The desired result is one net row per Material:

Material  TYPE  MovementsQty
ABC       +     2
H4-LAMPE  -     -7
M-12      +     9
I tried to sum it again and to consider it differently, but I was not successful so far, and I think I need some help.
Thank you very much for your help.
You're on the right track. I've tried to improve your code. Just use "TYPE" to determine and assign the sign using np.where, perform the groupby and sum, and then re-compute the "Type" column based on the result.
v = (df.assign(Quantity=np.where(df.TYPE == '+', df.Quantity, -df.Quantity))
       .groupby('Material', as_index=False)[['Quantity']]
       .sum())
v.insert(1, 'Type', np.where(np.sign(v.Quantity) == 1, '+', '-'))
print (v)
Material Type Quantity
0 ABC + 2
1 H4-LAMPE - -7
2 M-12 + 9
Alternatively, you can do this with two groupby calls:
i = df.query('TYPE == "+"').groupby('Material').Quantity.sum()
j = df.query('TYPE == "-"').groupby('Material').Quantity.sum()
# Find the union of the indexes.
idx = i.index.union(j.index)
# Reindex and subtract.
v = i.reindex(idx).fillna(0).sub(j.reindex(idx).fillna(0)).reset_index()
# Insert the Type column back into the result.
v.insert(1, 'Type', np.where(np.sign(v.Quantity) == 1, '+', '-'))
print(v)
Material Type Quantity
0 ABC + 2.0
1 H4-LAMPE - -7.0
2 M-12 + 9.0
Here is another take (fairly similar to coldspeed though).
#Correct quantity with negative sign (-) according to TYPE
df.loc[df['TYPE'] == '-', 'Quantity'] *= -1
#Reconstruct df as sum of quantity to remove dups
df = df.groupby('Material')['Quantity'].sum().reset_index()
df['TYPE'] = np.where(df['Quantity'] < 0, '-', '+')
print(df)
Returns:
Material Quantity TYPE
0 ABC 2 +
1 H4-LAMPE -7 -
2 M-12 9 +
map and numpy.sign
Just sum up Quantity * TYPE and figure out the sign afterwards.
d = {'+': 1, '-': -1}
r = dict(map(reversed, d.items())).get
q = df.Quantity
m = df.Material
t = df.TYPE
s = pd.Series((q * t.map(d)).values, m, name='MovementsQty').sum(level=0)
s.reset_index().assign(TYPE=lambda x: [*map(r, np.sign(x.MovementsQty))])
Material MovementsQty TYPE
0 M-12 9 +
1 H4-LAMPE -7 -
2 ABC 2 +
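One caveat if you run this on a recent pandas: Series.sum(level=0) was deprecated in pandas 1.3 and removed in 2.0 in favor of an explicit groupby, so the aggregation would be spelled along the lines of:

s = pd.Series((q * t.map(d)).values, m, name='MovementsQty').groupby(level=0, sort=False).sum()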
I'm trying to figure out the average of increasing values in my table per column.
my table
A | B | C
----------------
0 | 5 | 10
100 | 2 | 20
50 | 2 | 30
100 | 0 | 40
function I'm trying to write for my problem
def avergeIncreace(data, value):  # not complete but what I have so far
    x = data[value].pct_change().fillna(0).gt(0)
    print(x)
pct_change() returns a series of the percentage change of the number at each index compared to the number in the row before it. fillna(0) replaces the NaN in position 0 that pct_change() creates with 0. gt(0) returns a true/false series depending on whether the value at that index is greater than 0.
current output of this function
In[1]:avergeIncreace(df,'A')
Out[1]: 0 False
1 True
2 False
3 True
Name: BAL, dtype: bool
desired output
In[1]:avergeIncreace(df,'A')
Out[1]:75
In[2]:avergeIncreace(df,'B')
Out[2]:0
In[3]:avergeIncreace(df,'C')
Out[3]:10
From my limited understanding of pandas there should be a way to return an array of all the indexes that are true and then use a for loop and go through the original data table, but I believe pandas should have a way to do this without a for loop.
Here is what I think the for-loop way would look like, plus missing code so the indexes returned are the ones that are True instead of every index:
avergeIncreace(df, 'A')

indexes = data[value].pct_change().fillna(0).gt(0).index.values  # this returns an array containing all of the indexes (True and False)
answer = 0
times = 0
for x in indexes:
    answer += (data[value][x] - data[value][x - 1])
    times += 1
print(answer / times)
How do I achieve my desired output without using a for loop in the function?
You can use mask() and diff():
df.diff().mask(df.diff()<=0, np.nan).mean().fillna(0)
Yields:
A 75.0
B 0.0
C 10.0
dtype: float64
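If you'd rather keep the question's function shape (reusing the asker's avergeIncreace name), a minimal sketch of the same mask/diff idea for a single column:

import numpy as np
import pandas as pd

def avergeIncreace(data, value):
    d = data[value].diff()
    # keep only the positive deltas; mean() skips the NaNs
    avg = d.mask(d <= 0, np.nan).mean()
    # an all-NaN mean is NaN, i.e. there were no increases at all
    return 0 if pd.isna(avg) else avg

print(avergeIncreace(df, 'A'))  # 75.0
print(avergeIncreace(df, 'B'))  # 0
print(avergeIncreace(df, 'C'))  # 10.0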
How about
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [0, 100, 50, 100],
                   'B': [5, 2, 2, 0],
                   'C': [10, 20, 30, 40]})

def averageIncrease(df, col_name):
    # Create array of deltas. Replace nan and negative values with zero
    a = np.maximum(df[col_name] - df[col_name].shift(), 0).replace(np.nan, 0)
    # Count non-zero values
    count = np.count_nonzero(a)
    if count == 0:
        # If only zero values… there is no increase
        return 0
    else:
        return np.sum(a) / count
print(averageIncrease(df, 'A'))
print(averageIncrease(df, 'B'))
print(averageIncrease(df, 'C'))
75.0
0
10.0
I have a dataframe df like
A B
1 2
3 4
I then want to create 2 new series
t = pd.Series()
r = pd.Series()
I was able to assign values to t using the condition cond as below
t = "1+" + df.A.astype(str) + '+' + df.B.astype(str)
cond = df['A']<df['B']
t[cond] = "1+" + df.loc[cond,'B'].astype(str) + '+' + df.loc[cond,'A'].astype(str)
But I'm having problems with r. I just want r to contain values of 2 when cond is satisfied and 1 otherwise.
If I just try
r = 1
r[cond] = 2
Then I get TypeError: 'int' object does not support item assignment
I figure I could just run a for loop through df and check the cases in cond through each row of df, but I was wondering if Pandas offers a more efficient way instead?
You will laugh at how easy this is:
r = cond + 1
The reason is that cond is a boolean series (True and False), and booleans evaluate to 1 and 0. If you add one to it, the booleans are coerced to int, so True maps to 2 and False maps to 1.
df = pd.DataFrame({'A': [1, 3, 4],
                   'B': [2, 4, 3]})
cond = df['A'] < df['B']
>>> cond + 1
0 2
1 2
2 1
dtype: int64
When you assign 1 to r as in
r = 1
r now references the integer 1. So when you call r[cond] you're treating an integer like a series.
You want to first create a series of ones for r the size of cond. Something like
r = pd.Series(np.ones(cond.shape))
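With that in place, the boolean assignment from the question works; e.g. with the cond from the example above:

import numpy as np
import pandas as pd

r = pd.Series(np.ones(cond.shape))
r[cond] = 2
print(r)
0    2.0
1    2.0
2    1.0
dtype: float64

np.ones produces floats, hence the float64 dtype; pass dtype=int to the Series constructor if you want integers.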
I'm trying to calculate statistics (min, max, avg...) of streaks of consecutive higher values of a column. I'm rather new to pandas and stats, searched a bit but could not find an answer.
The data is financial data, with OHLC values in columns, e.g.
Open High Low Close
Date
2013-10-20 1.36825 1.38315 1.36502 1.38029
2013-10-27 1.38072 1.38167 1.34793 1.34858
2013-11-03 1.34874 1.35466 1.32941 1.33664
2013-11-10 1.33549 1.35045 1.33439 1.34950
....
For example the average consecutive higher Low streak.
LATER EDIT
I think I didn't explain well. An item that was counted in a sequence can't be counted again. So for the sequence:
1,2,3,4,1,2,3,3,2,1
There are 4 streaks: 1,2,3,4 | 1,2,3,3 | 2 | 1
max = 4
min = 1
avg = (4+4+1+1)/4 = 2.5
import pandas as pd
import numpy as np

s = pd.Series([1, 2, 3, 4, 1, 2, 3, 3, 2, 1])

def ascends(s):
    # 1 where the step to the next value is non-decreasing, padded with 0 at both ends
    diff = np.r_[0, (np.diff(s.values) >= 0).astype(int), 0]
    diff2 = np.diff(diff)
    # positions with a descent on both sides are single-element streaks
    descends = np.where(np.logical_not(diff)[1:] & np.logical_not(diff)[:-1])[0]
    starts = np.sort(np.r_[np.where(diff2 > 0)[0], descends])
    ends = np.sort(np.r_[np.where(diff2 < 0)[0], descends])
    return ends - starts + 1

b = ascends(s)
print(b)
print(b.max())
print(b.min())
print(b.mean())
Output:
[4 4 1 1]
4
1
2.5
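For comparison, a more pandas-idiomatic variant (my own sketch, not part of the answer above) gets the same streak lengths by starting a new group at every strict drop:

s = pd.Series([1, 2, 3, 4, 1, 2, 3, 3, 2, 1])
# a new streak begins wherever the value drops below its predecessor
streaks = s.groupby((s.diff() < 0).cumsum()).size()
print(streaks.tolist())                              # [4, 4, 1, 1]
print(streaks.max(), streaks.min(), streaks.mean())  # 4 1 2.5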