I have a dataframe df that looks like this:
ID Sequence
0 A->A
1 C->C->A
2 C->B->A
3 B->A
4 A->C->A
5 A->C->C
6 A->C
7 A->C->C
8 B->B
9 C->C and so on ....
I want to create a column called 'Outcome', which is binomial in nature.
Its value depends on three lists that I generate below:
Whenever 'A' occurs in a sequence, the probability of 'Outcome' being 1 is 2%
Whenever 'B' occurs in a sequence, the probability of 'Outcome' being 1 is 6%
Whenever 'C' occurs in a sequence, the probability of 'Outcome' being 1 is 1%
Here is the code that generates these three lists (bi_A, bi_B, bi_C) -
A = 0.02
B = 0.06
C = 0.01
count_A = 0
count_B = 0
count_C = 0
for i in range(len(df)):
    if 'A' in df.sequence[i]:
        count_A += 1
    if 'B' in df.sequence[i]:
        count_B += 1
    if 'C' in df.sequence[i]:
        count_C += 1
bi_A = np.random.binomial(1, A, count_A)
bi_B = np.random.binomial(1, B, count_B)
bi_C = np.random.binomial(1, C, count_C)
What I am trying to do is combine these three lists into an "Output" column so that the probability of Output being 1 when "A" is in the sequence is 2%, and so on. How do I solve this? As I understand it there would be data overlap, where bi_A says one sequence is 0 while bi_B says it's 1, so how would we resolve that?
End data should look like -
ID Sequence Output
0 A->A 0
1 C->C->A 1
2 C->B->A 0
3 B->A 0
4 A->C->A 0
5 A->C->C 1
6 A->C 0
7 A->C->C 0
8 B->B 0
9 C->C 0
and so on ....
Such that when I compute the probability of Output = 1 given that 'A' is in the sequence, it should be 2%.
EDIT -
You can generate the sequence data using this code:
import pandas as pd
import itertools
import numpy as np
import random

alphabets = ['A', 'B', 'C']
combinations = []
for i in range(1, len(alphabets) + 1):
    combinations.append(['->'.join(p) for p in itertools.product(alphabets, repeat=i)])
combinations = sum(combinations, [])
weights = np.random.normal(100, 30, len(combinations))
weights /= sum(weights)
weights = weights.tolist()
#weights = np.random.dirichlet(np.ones(len(combinations)) * 1000., size=1)
'''n = len(combinations)
weights = [random.random() for _ in range(n)]
sum_weights = sum(weights)
weights = [w / sum_weights for w in weights]'''
df = pd.DataFrame(random.choices(population=combinations, weights=weights, k=10000),
                  columns=['sequence'])
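One way to sidestep the overlap entirely, sketched below under the assumption that each row should use the rate of the highest-risk letter it contains (the question leaves the combination rule open), is to compute a single per-row probability and make one binomial draw per row. The `rates` name and the tiny sample frame are mine, not from the original code:

```python
import numpy as np
import pandas as pd

# Per-letter rates from the question: A -> 2%, B -> 6%, C -> 1%
rates = {'A': 0.02, 'B': 0.06, 'C': 0.01}

df = pd.DataFrame({'sequence': ['A->A', 'C->C->A', 'B->A', 'C->C']})

# Assumption: a row's probability is the max rate over the letters it contains,
# so each row gets exactly one draw and no per-letter lists need merging.
p = df['sequence'].apply(lambda s: max(rates[ch] for ch in rates if ch in s))
df['Output'] = np.random.binomial(1, p)
```

With 10,000 rows, the empirical rate of Output = 1 among rows containing 'A' would then converge to the rate chosen for those rows by the combination rule, which is why the rule itself has to be decided first.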
Problem
I want to pick a subset of fixed size from a list of items such that the count of the most frequent occurrence of the labels of the selected items is minimized. In English, I have a DataFrame consisting of a list of 10000 items, generated as follows.
import random
import pandas as pd
def RandLet():
    alphabet = "ABCDEFG"
    return alphabet[random.randint(0, len(alphabet) - 1)]

items = pd.DataFrame([{"ID": i, "Label1": RandLet(), "Label2": RandLet(), "Label3": RandLet()} for i in range(10000)])
items.head(3)
Each item has 3 labels. The labels are letters within ABCDEFG, and the order of the labels doesn't matter. An item may be tagged multiple times with the same label.
[Example of the first 3 rows]
ID Label1 Label2 Label3
0 0 G B D
1 1 C B C
2 2 C A B
From this list, I want to pick 1000 items in a way that minimizes the number of occurrences of the most frequently appearing label within those items.
For example, if my DataFrame only consisted of the above 3 items, and I only wanted to pick 2 items, and I picked items with ID #1 and #2, the label 'C' appears 3 times, 'B' appears 2 times, 'A' appears 1 time, and all other labels appear 0 times - The maximum of these is 3. However, I could have done better by picking items #0 and #2, in which label 'B' appears the most frequently, coming in as a count of 2. Since 2 is less than 3, picking items #0 and #2 is better than picking items #1 and #2.
In the case where there are multiple ways to pick 1000 items such that the count of the maximum label occurrence is minimized, returning any of those selections is fine.
What I've got
To me, this feels similar to a knapsack problem in len("ABCDEFG") = 7 dimensions. I want to put 1000 items in the knapsack, and each item's size in the relevant dimension is the number of occurrences of that label on the item. To that end, I've built this function to convert my list of items into a list of sizes for the knapsack.
def ReshapeItems(items):
    alphabet = "ABCDEFG"
    item_rebuilder = []
    for i, row in items.iterrows():
        letter_counter = {}
        for letter in alphabet:
            letter_count = sum(row[[c for c in items.columns if "Label" in c]].apply(lambda x: 1 if x == letter else 0))
            letter_counter[letter] = letter_count
        letter_counter["ID"] = row["ID"]
        item_rebuilder.append(letter_counter)
    items2 = pd.DataFrame(item_rebuilder)
    return items2

items2 = ReshapeItems(items)
items2.head(3)
[Example of the first 3 rows of items2]
A B C D E F G ID
0 0 1 0 1 0 0 1 0
1 0 1 2 0 0 0 0 1
2 1 1 1 0 0 0 0 2
Unfortunately, at that point, I am completely stuck. I think that the point of knapsack problems is to maximize some sort of value, while keeping the sum of the selected items sizes under some limit - However, here my problem is the opposite, I want to minimize the sum of the selected size such that my value is at least some amount.
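The "minimize the maximum" reading also suggests a simple baseline worth having for comparison: a greedy pass that always takes the item whose labels least increase the current maximum label count. This is only a heuristic sketch (the function and data names are mine), not guaranteed optimal:

```python
from collections import Counter

def greedy_pick(rows, k):
    """rows: list of (item_id, labels) tuples; k: number of items to pick.
    Greedily picks the item minimizing the resulting max label count."""
    counts = Counter()
    chosen = []
    remaining = list(rows)
    for _ in range(k):
        # choose the item whose addition yields the smallest max label count
        best = min(remaining, key=lambda r: max((counts + Counter(r[1])).values()))
        chosen.append(best[0])
        counts += Counter(best[1])
        remaining.remove(best)
    return chosen, max(counts.values())

# The three example items from the question, labels flattened to a string
rows = [(0, "GBD"), (1, "CBC"), (2, "CAB")]
chosen, worst = greedy_pick(rows, 2)
```

On the 3-item example this achieves the optimal maximum of 2, but greedy choices can get stuck on larger inputs, so it serves as a bound to beat rather than a solution.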
What I'm looking for
Although a function that takes in items or items2 and returns a subset of these items that meets my specifications would be ideal, I'd be happy to accept any sufficiently detailed answer that points me in the right direction.
Using a different approach, here is my take on your interesting question.
def get_best_subset(
    df: pd.DataFrame, n_rows: int, key_cols: list[str], iterations: int = 50_000
) -> tuple[int, pd.DataFrame]:
    """Subset df in such a way that the frequency of the most
    frequent values in the key columns is minimal.

    Args:
        df: input dataframe.
        n_rows: number of rows in subset.
        key_cols: columns to consider.
        iterations: max number of tries. Defaults to 50_000.

    Returns:
        Minimum frequency, subset of n rows of input dataframe.
    """
    lowest_frequency: int = df.shape[0] * df.shape[1]
    best_df = pd.DataFrame([])
    # Iterate through possible subsets
    for _ in range(iterations):
        sample_df = df.sample(n=n_rows)
        # Count values in each column, concat and sum counts, get max count
        frequency = (
            pd.concat([sample_df[col].value_counts() for col in key_cols])
            .pipe(lambda df_: df_.groupby(df_.index).sum())
            .max()
        )
        if frequency < lowest_frequency:
            lowest_frequency = frequency
            best_df = sample_df
    return lowest_frequency, best_df.sort_values(by=["ID"]).reset_index(drop=True)
And so, with the toy dataframe constructor you provided:
lowest_frequency, best_df = get_best_subset(
    items, 1_000, ["Label1", "Label2", "Label3"]
)
print(lowest_frequency)
# 431
print(best_df)
# Output
ID Label1 Label2 Label3
0 39 F D A
1 46 B G E
2 52 D D B
3 56 D F B
4 72 C D E
.. ... ... ... ...
995 9958 A F E
996 9961 E E E
997 9966 G E C
998 9970 B C B
999 9979 A C G
[1000 rows x 4 columns]
I use pandas to process transport data; I am studying the attendance of bus lines. I have 2 columns that count people getting on and off the bus at each stop, and I want to create one that counts the people currently on board. At the moment, I loop through the df: for line n, it does current[n] = on[n] - off[n] + current[n-1], as shown in the following example:
for index, row in df.iterrows():
    if index == 0:
        df.loc[index, 'current'] = df.loc[index, 'on']
    else:
        df.loc[index, 'current'] = df.loc[index, 'on'] - df.loc[index, 'off'] + df.loc[index - 1, 'current']
Is there a way to avoid using a loop?
Thanks for your time!
You can use Series.cumsum(), which accumulates the numbers in a given Series.
a = pd.DataFrame([[3,4],[6,4],[1,2],[4,5]], columns=["off", "on"])
a["current"] = a["on"].cumsum() - a["off"].cumsum()
off on current
0 3 4 1
1 6 4 -1
2 1 2 0
3 4 5 1
If I've understood the problem properly, you could calculate the difference between people getting on and off, then have a running total using Series.cumsum():
import pandas as pd
# Create dataframe for demo
d = {'Stop':['A','B','C','D'],'On':[3,2,3,2],'Off':[2,1,0,1]}
df = pd.DataFrame(data=d)
# Get difference between 'On' and 'Off' columns.
df['current'] = df['On']-df['Off']
# Get cumulative sum of column
df['Total'] = df['current'].cumsum()
# Same thing in one line
df['Total'] = (df['On']-df['Off']).cumsum()
Stop On Off Total
A 3 2 1
B 2 1 2
C 3 0 5
D 2 1 6
I'm just seeking some guidance on how to do this better. I was doing some basic research comparing Monday's open and low. The code returns two lists: one with the returns ((Monday's open - Monday's low) / Monday's open) and one of 1's and 0's reflecting whether the return was positive or negative.
Please take a look, as I'm sure there's a better way to do it in pandas; I just don't know how.
# Monday only
import datetime
from statistics import mean

m_list = []  # results list
h_list = []  # hit list (open - low > 0)
n = 0  # counter variable
for t in history.index:
    # t[1] is the timestamp in the multi-index; weekday() == 0 means Monday
    if t[1].weekday() == 0:
        # .ix was removed from pandas; .iloc does positional lookup
        x = history.iloc[n]['open'] - history.iloc[n]['low']
        m_list.append((history.iloc[n]['open'] - history.iloc[n]['low']) / history.iloc[n]['open'])
        if x > 0:
            h_list.append(1)
        else:
            h_list.append(0)
    n += 1  # advance index counter
print("Mean: ", mean(m_list), "Max: ", max(m_list), "Min: ",
      min(m_list), "Hit Rate: ", sum(h_list) / len(h_list))
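For reference, the Monday filter and both lists can be produced without the counter loop at all. A sketch on a toy single-level DatetimeIndex (the question's frame keeps the timestamp in a multi-index, where `history.index.get_level_values(1)` would replace `history.index`):

```python
import pandas as pd

# Toy frame: one week of data, Monday 2024-01-01 through Sunday
idx = pd.date_range('2024-01-01', periods=7, freq='D')
history = pd.DataFrame({'open': [10, 11, 12, 13, 14, 15, 16],
                        'low':  [9, 11, 11, 12, 13, 14, 16]}, index=idx)

mondays = history[history.index.weekday == 0]                  # Monday is 0
m_list = (mondays['open'] - mondays['low']) / mondays['open']  # returns
h_list = (m_list > 0).astype(int)                              # hit list
```

The boolean mask does the weekday test for every row at once, so no per-row counter is needed.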
You can do that straightforwardly:
(history['open'] - history['low']) > 0
This will give you True for rows where open is greater and False where low is greater.
And if you want 1/0, you can multiply the above expression by 1:
((history['open'] - history['low']) > 0) * 1
Example
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': np.random.random(10),
                   'b': np.random.random(10)})
Printing the data frame:
print(df)
a b
0 0.675916 0.796333
1 0.044582 0.352145
2 0.053654 0.784185
3 0.189674 0.036730
4 0.329166 0.021920
5 0.163660 0.331089
6 0.042633 0.517015
7 0.544534 0.770192
8 0.542793 0.379054
9 0.712132 0.712552
To make a new column compare, which is 1 if a is greater and 0 if b is greater:
df['compare'] = (df['a']-df['b']>0)*1
This will add a new column compare:
a b compare
0 0.675916 0.796333 0
1 0.044582 0.352145 0
2 0.053654 0.784185 0
3 0.189674 0.036730 1
4 0.329166 0.021920 1
5 0.163660 0.331089 0
6 0.042633 0.517015 0
7 0.544534 0.770192 0
8 0.542793 0.379054 1
9 0.712132 0.712552 0
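Multiplying by 1 works; an arguably more explicit equivalent (my preference, not from the answer above) casts the boolean mask with astype:

```python
import pandas as pd

df = pd.DataFrame({'a': [0.7, 0.1, 0.5], 'b': [0.2, 0.9, 0.5]})
# Boolean comparison cast to int: 1 where a > b, else 0
df['compare'] = (df['a'] > df['b']).astype(int)
```

Both spellings produce the same int64 column; astype just states the intent directly.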
I have some data in the form:
ID A B VALUE EXPECTED RESULT
1 1 2 5 GROUP1
2 2 3 5 GROUP1
3 3 4 6 GROUP2
4 3 5 5 GROUP1
5 6 4 5 GROUP3
What I want to do is iterate through the data (thousands of rows) and create a common field so I will be able to join the data easily (A -> start node, B -> end node, VALUE -> order... the data form something like a chain where only neighbours share a common A or B).
Rules for joining:
equal value for all elements of a group
A of element one equal to B of element two (or the opposite, but NOT A=A' or B=B')
The most difficult one: assign to the same group all sequential data that form a series of intersecting nodes.
That is, the first element [1 1 2 5] has to be joined with [2 2 3 5] and then with [4 3 5 5].
Any idea how to accomplish this robustly when iterating through a large amount of data? I have a problem with rule number 3; the others are easily applied. For limited data I have had some success, but this depends on the order in which I start examining the data, and it doesn't work for the large dataset.
I can use arcpy (preferably), or even Python, R, or Matlab to solve this. I have tried arcpy with no success, so I am checking alternatives.
In ArcPy this code works, but only to a limited extent (i.e. in large features with many segments I get 3-4 groups instead of 1):
TheShapefile = "c:/Temp/temp.shp"
desc = arcpy.Describe(TheShapefile)
flds = desc.fields
fldin = 'no'
for fld in flds:  # Check if the 'new' field exists
    if fld.name == 'new':
        fldin = 'yes'
if fldin != 'yes':  # If not, create it
    arcpy.AddField_management(TheShapefile, "new", "SHORT")
arcpy.CalculateField_management(TheShapefile, "new", '!FID!', "PYTHON_9.3")  # Copy FID to new
with arcpy.da.SearchCursor(TheShapefile, ["FID", "NODE_A", "NODE_B", "ORDER_", "new"]) as TheSearch:
    for SearchRow in TheSearch:
        if SearchRow[1] == SearchRow[4]:
            Outer_FID = SearchRow[0]
        else:
            Outer_FID = SearchRow[4]
        Outer_NODEA = SearchRow[1]
        Outer_NODEB = SearchRow[2]
        Outer_ORDER = SearchRow[3]
        Outer_NEW = SearchRow[4]
        with arcpy.da.UpdateCursor(TheShapefile, ["FID", "NODE_A", "NODE_B", "ORDER_", "new"]) as TheUpdate:
            for UpdateRow in TheUpdate:
                Inner_FID = UpdateRow[0]
                Inner_NODEA = UpdateRow[1]
                Inner_NODEB = UpdateRow[2]
                Inner_ORDER = UpdateRow[3]
                if Inner_ORDER == Outer_ORDER and (Inner_NODEA == Outer_NODEB or Inner_NODEB == Outer_NODEA):
                    UpdateRow[4] = Outer_FID
                    TheUpdate.updateRow(UpdateRow)
And some data in shapefile form and dbf form
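Rule 3 can also be read as a connected-components problem: each row is an edge between its A and B nodes within a given VALUE, and rows in the same component form one group, independent of the order in which rows are examined. A plain-Python sketch with a union-find over the sample table (all names here are mine):

```python
# Union-find sketch: rows with equal VALUE that chain through shared A/B
# nodes end up in the same group. Uses the sample table from the question.
rows = [  # (ID, A, B, VALUE)
    (1, 1, 2, 5),
    (2, 2, 3, 5),
    (3, 3, 4, 6),
    (4, 3, 5, 5),
    (5, 6, 4, 5),
]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Nodes are keyed by (VALUE, node), so equal node ids under different
# VALUEs stay in separate groups (rule 1).
for rid, a, b, val in rows:
    union((val, a), (val, b))

groups = {}
labels = {}
for rid, a, b, val in rows:
    root = find((val, a))
    labels.setdefault(root, len(labels) + 1)
    groups[rid] = labels[root]
```

Union-find is near-linear in the number of rows, so this scales to the large dataset where order-dependent pairwise scans break down.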
Using matlab:
A = [1 1 2 5
     2 2 3 5
     3 3 4 6
     4 3 5 5
     5 6 4 5]
%% Initialization
% index of matrix lines sharing the same group
ind = 1
% length of the index
len = length(ind)
% the group array
g = []
% group counter
c = 1
% Start the small algorithm
while 1
    % Check if another line with the same "Value" shares some common node
    ind = find(any(ismember(A(:,2:3), A(ind,2:3)) & A(:,4) == A(ind(end),4), 2));
    % If there is no new line, we create a group with the discovered lines
    if length(ind) == len
        % group assignment
        g(A(ind,1)) = c
        c = c + 1
        % delete the already discovered lines (or nodes...)
        A(ind,:) = []
        % break if no more nodes
        if isempty(A)
            break
        end
        % reset the index for the next group
        ind = 1;
    end
    len = length(ind);
end
And here is the output:
g =
1 1 2 1 3
As expected
What I have right now looks like this:
spread
0 0.00000787
1 0.00000785
2 0.00000749
3 0.00000788
4 0.00000786
5 0.00000538
6 0.00000472
7 0.00000759
And I would like to add a new column next to it: if the value of spread is between (for example) 0 and 0.00005, then it is part of bin A; if (for example) between 0.00005 and 0.0006, then bin B (there are three bins in total). What I have tried so far:
minspread = df['spread'].min()
maxspread = df['spread'].max()
born = (float(maxspread) - float(minspread)) / 3
born1 = born + float(minspread)
born2 = float(maxspread) - born
df['Bin'] = df['spread'].apply(lambda x: 'A' if x < born1 else ('B' if born1 < x <= born2 else 'C'))
But when I do so everything ends up in the Bin A:
spread Bin
0 0.00000787 A
1 0.00000785 A
2 0.00000749 A
3 0.00000788 A
4 0.00000786 A
Does anyone have an idea how to divide the column 'spread' into three bins (A, B, C) with the same number of observations in each? Thanks!
If you get the error:
unsupported operand type(s) for +: 'decimal.Decimal' and 'float'
it means the column type is Decimal, which works poorly with pandas and should be converted to numeric.
One possible solution is to multiply the column by some big number, e.g. 10**15, and convert to integer (to avoid losing precision when converting to floats), and then use qcut:
#sample data
#from decimal import Decimal
#df['spread'] = [Decimal(x) for x in df['spread']]
df['spread1'] = (df['spread'] * 10**15).astype(np.int64)
df['bins'] = pd.qcut(df['spread1'], 3, labels=list('ABC'))
print(df)
spread spread1 bins
0 0.00000787 7870000000 C
1 0.00000785 7850000000 B
2 0.00000749 7490000000 A
3 0.00000788 7880000000 C
4 0.00000786 7860000000 C
5 0.00000538 5380000000 A
6 0.00000472 4720000000 A
7 0.00000759 7590000000 B
Solution with no new column:
s = (df['spread'] * 10**15).astype(np.int64)
df['bins'] = pd.qcut(s, 3, labels=list('ABC'))
print(df)
spread bins
0 0.00000787 C
1 0.00000785 B
2 0.00000749 A
3 0.00000788 C
4 0.00000786 C
5 0.00000538 A
6 0.00000472 A
7 0.00000759 B
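If a small loss of precision is acceptable, casting the Decimal column to float first and calling qcut directly also works. A sketch on the sample values (a plain float Series stands in for the Decimal column here):

```python
import pandas as pd

spread = pd.Series([0.00000787, 0.00000785, 0.00000749, 0.00000788,
                    0.00000786, 0.00000538, 0.00000472, 0.00000759])
# With Decimal data this would be df['spread'].astype(float)
bins = pd.qcut(spread.astype(float), 3, labels=list('ABC'))
```

qcut splits on the empirical terciles, so the bins hold (as close as possible to) equal numbers of observations, exactly as the question asks.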