Is there a standard way of doing this?
I basically have some users, who performed actions together and split off as a group. We don't know the order of the events, but can infer them:
                   A  B  C  D  E
WentToMall         1  1  1  0  0
DroveToTheMovies   1  0  0  0  0
AteLunchTogether   1  1  0  0  0
BoughtClothes      1  1  0  0  1
BoughtElectronics  1  1  0  0  0
The rule is that once they split off, they can't converge together again afterwards.
So the time series would look like:
Time 0 is always all of them together; then the largest 'grouping' splits off into 'WentToMall', where we get A,B,C together while D,E split off.
From there, it looks like AB split off from C, and AB proceed to 'AteLunchTogether', 'BoughtClothes', and 'BoughtElectronics'. Sometime during 'BoughtClothes', it looks like E split off from D.
Finally, A and B split off at the end as A 'Drove to the movies'.
If possible, I'd like to also show this visually, maybe with nodes showing the number of events separating the split (which would look like):
ABCDE ---> ABC --> AB -> A
  |         |       |-> B
  |         |
  |         |--> C
  |
  |
  |---> DE --> D
         |--> E
A problem that comes up is that sometimes you get time points which are difficult to assess or appear contradictory, and don't fit in based on the minimal number of columns. I'm not sure what to do about those either. I am given 'weights' for the actions, so I could decide based on those, or I guess generate all versions of the graph.
I was thinking maybe of using recursion to do a search through the possibilities, or something similar?
edit: the latest file is here
The process works through recursion. Pandas is useful in your scenario, though there might be more efficient ways to do this.
We search from the furthest nodes. In your case, these would be nodes A and E. How do we know these are the furthest nodes? Just count the 0 and 1 values in every row, take the sums of the 0s and 1s, and sort by the number of 1s. For the first case, it looks like this:
                     0    1
DroveToTheMovies   4.0  1.0
AteLunchTogether   3.0  2.0
BoughtElectronics  3.0  2.0
WentToMall         2.0  3.0
BoughtClothes      2.0  3.0
FirstCase          0.0  5.0
This means one person drove to the movies. You see the pattern: people join this person later on. In the first case, there are the 5 people we began with. But there is a problem: how do we know whether the previous person was in the group? Let's say X drove to the movies. Now we check who ate lunch. Say Y and Z joined the group, but not X. For this case, we check whether the latest group is contained in the new group. So until we reach the first case, we add every event to an array. Now we have a branch.
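That containment check is simple in pandas; a tiny illustration of just that step (this is what the check_in helper in the code further down does), with made-up group values:

import pandas as pd

previous_group = pd.Index(['A'])       # e.g. the person who drove to the movies
new_group = pd.Index(['A', 'B'])       # e.g. the people who ate lunch together

# True if everyone from the previous (smaller) group is also in the new group,
# i.e. the new event belongs to the same branch and someone simply joined.
print(previous_group.isin(new_group).all())   # True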
Now assume there were people who weren't in the previous group. In that case, we store this odd behaviour as well and come back to it later. In the first pass our beginning node was A; now it is E, found using the same technique. So the process is repeated again.
My final results looked like this:
   0                  1
0  DroveToTheMovies   Index(['A'], dtype='object')
1  AteLunchTogether   Index(['A', 'B'], dtype='object')
2  BoughtElectronics  Index(['A', 'B'], dtype='object')
3  WentToMall         Index(['A', 'B', 'C'], dtype='object')
4  FirstCase          Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
5  BoughtClothes      Index(['E'], dtype='object')
6  FirstCase          Index(['D', 'E'], dtype='object')
There are two FirstCase entries. You need to process these two FirstCase values and recognise that this D-E group split off from the first FirstCase group, and that E then went to buy clothes. D is unknown and could therefore be assigned to something else. And there you have it.
First branch:
ABCDE ---> ABC --> AB -> A
            |       |-> B
            |
            |--> C
Second branch:
(first case) ---> DE --> D
                   |--> E
All you have to do now is find who left each branch. For the first branch that is B, C, and D-E. These are easy to calculate from here on. Hope it helps. The code is below, and I suggest stepping through it in a debugger to make the whole idea clearer:
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1, 0, 0],
     [1, 0, 0, 0, 0],
     [1, 1, 0, 0, 0],
     [1, 1, 0, 0, 1],
     [1, 1, 0, 0, 0]], columns=list("ABCDE"))
df.index = ['WentToMall', 'DroveToTheMovies', 'AteLunchTogether',
            'BoughtClothes', 'BoughtElectronics']

first_case = pd.DataFrame(
    [[1, 1, 1, 1, 1]], columns=list("ABCDE"), index=['FirstCase'])

all_case = pd.concat([first_case, df])

def case_finder(all_case):
    df_case = all_case.apply(lambda x: x.value_counts(), axis=1).fillna(0)
    df_case = df_case.loc[df_case[1] != 0]
    return df_case.sort_values(by=1)

def check_together(x):
    x = df.iloc[x]
    activity = all_case.loc[x.name]
    does_activity = activity.loc[activity == 1]
    return activity.name, does_activity.index

def check_in(pre, now):
    return pre.isin(now).all()

def check_odd(i):
    act = check_together(i)[0]
    who = check_together(i)[1][~check_together(i)[1].isin(check_together(i - 1)[1])]
    return act, who

df = case_finder(all_case)
total = all_case.shape[0]
all_acts = []
last_stable = []

while True:
    for i in range(total):
        act, ind = check_together(i)
        if ind.size == 1:
            print("Initialized!")
            all_acts.append([act, ind])
        else:
            p_act, p_ind = check_together(i - 1)
            if check_in(p_ind, ind):
                print("So a new person joins us!")
                all_acts.append([act, ind])
            else:
                print("This is weird. We'll check later!")
                # act, who = check_odd(i)
                last_stable.append([i, p_ind])
                continue
        if act == 'FirstCase':
            break
    if len(last_stable) == 0:
        print("Process done!")
        break
    else:
        print("Update cases!")
        ls_ind = last_stable[0]
        all_case = all_case.drop(last_stable[0][1], axis=1)
        total = all_case.shape[0]
        df = case_finder(all_case)
        last_stable = last_stable[1:]

print(all_acts)
x = pd.DataFrame(all_acts)
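For the "who left" step mentioned above, here is a rough sketch that works on the x DataFrame built from all_acts at the end (column 0 is the event name, column 1 is the group); it only walks the first branch, so take it as a starting point rather than a finished solution:

# Take the first branch (everything up to and including the first 'FirstCase'),
# reverse it so it runs from the full group down to the smallest group,
# and report who split off between consecutive events.
first_branch = x.iloc[:x[x[0] == 'FirstCase'].index[0] + 1][::-1]
groups = list(first_branch[1])
events = list(first_branch[0])
for prev, curr, event in zip(groups, groups[1:], events[1:]):
    left = prev.difference(curr)   # people in the previous group but not in this one
    if len(left) > 0:
        print(f"{list(left)} split off before '{event}'")
# ['D', 'E'] split off before 'WentToMall'
# ['C'] split off before 'BoughtElectronics'
# ['B'] split off before 'DroveToTheMovies'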
An example to illustrate the point. In 1 column, there are the following 5 categories for "food_spice_levels".
high_heat, medium_heat, mild_heat, no_heat, bland
The goal is to create a new binary variable called "Spiciness" to show whether the food is spicy or not spicy: bland, no_heat, mild_heat, and medium_heat should map to 0, and high_heat to 1, and again it should all end up in 1 new column.
Current code and issues:
df['Spiciness'] = df['food_spice_levels'].map({'Bland''no_heat''mild_heat''medium_heat': 0, 'high_heat': 1})
Commas between each category in the code for the "0" category gave a syntax error. Without the commas, this warning came:
"SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead"
It did create a new column with high_heat correctly coded as "1", but all the desired "0" values got coded to "NaN", and I don't want to destroy the dataset if the warning is telling me something that can't be ignored. Can anyone help so that I get 0s and 1s in the new column while avoiding this warning? Thanks!
IIUC
df = pd.DataFrame({'food': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F'}, 'food_spice_levels': {0: 'bland', 1: 'high_heat', 2: 'mild_heat', 3: 'medium_heat', 4: 'high_heat', 5: 'bland'}})
print(df)
# food food_spice_levels
# 0 A bland
# 1 B high_heat
# 2 C mild_heat
# 3 D medium_heat
# 4 E high_heat
# 5 F bland
df['binary'] = (df['food_spice_levels']=='high_heat').astype(int)
print(df)
# food food_spice_levels binary
# 0 A bland 0
# 1 B high_heat 1
# 2 C mild_heat 0
# 3 D medium_heat 0
# 4 E high_heat 1
# 5 F bland 0
This utilizes the fact that booleans are represented as 1 (True) or 0 (False). The expression df['food_spice_levels']=='high_heat' creates a boolean Series, which is then cast to int.
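If you would rather keep the .map() approach from the question, the dictionary just needs a separate key for every category (the missing commas were the cause of the syntax error, and any category not present in the dict becomes NaN). A sketch using the toy df above:

# Map every category explicitly; anything not listed in the dict would become NaN.
mapping = {'bland': 0, 'no_heat': 0, 'mild_heat': 0, 'medium_heat': 0, 'high_heat': 1}
df['Spiciness'] = df['food_spice_levels'].map(mapping)

As for the SettingWithCopyWarning, it usually means the df you are assigning to is itself a slice of another DataFrame; creating it with .copy(), or assigning via df.loc[:, 'Spiciness'] = ..., typically makes the warning go away.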
I have a dataframe (used_dataframe) that contains duplicates. I am required to create a list that contains the indices of those duplicates.
For this I used a function I found here:
Find indices of duplicate rows in pandas DataFrame
def duplicates(x):
    #dataframe = pd.read_csv(x)
    #df = dataframe.iloc[: , 1:]
    df = x
    duplicateRowsDF = df[df.duplicated()]
    df = df[df.duplicated(keep=False)]
    tuppl = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist() #this is the function!
    n = 1 # N. . .
    indicees = [x[n] for x in tuppl]
    return indicees

duplicates(used_df)
The next function I need is one where I remove the duplicates from the dataset, which I did like this:
def handling_duplicate_entries(mn):
    x = tidy(mn)
    indices = duplicates(tidy(mn))
    used_df = x
    used_df['indexcol'] = range(0, len(tidy(mn)))
    dropped = used_df[~used_df['indexcol'].isin(indices)]
    finito = dropped.drop(columns=['indexcol'])
    return finito

handling_duplicate_entries(used_df)
And it works - but when I want to check my solution (to verify that all duplicates have been removed), which I do by calling duplicates(handling_duplicate_entries(used_df)), and which should return an empty result to show that there are no duplicates, it returns the error 'DataFrame' object has no attribute 'tolist'.
In the question at the link above, this has also been raised as a comment but not solved - and to be quite frank, I would love to find a different solution for the duplicates function, because I don't quite understand it, but so far I haven't.
Ok. I'll try to do my best.
So if you are trying to find the duplicate indices and want to store those values in a list, you can use the following code. I have also included a small example that builds a dataframe containing the duplicated rows, and one with the duplicates removed.
import pandas as pd
# Toy dataset
data = {
    'A': [0, 0, 3, 0, 3, 0],
    'B': [0, 1, 3, 2, 3, 0],
    'C': [0, 1, 3, 2, 3, 0]
}
df = pd.DataFrame(data)
group = df.groupby(list(df.columns)).size()
group = group[group>1].reset_index(name = 'count')
group = group.drop(columns=['count']).reset_index().rename(columns={'index':'count'})
idxs = df.reset_index().merge(group, how = 'right')['index'].values
duplicates = df.loc[idxs]
no_duplicates = df.loc[~df.index.isin(idxs)]
duplicates
   A  B  C
0  0  0  0
5  0  0  0
2  3  3  3
4  3  3  3
no_duplicates
   A  B  C
1  0  1  1
3  0  2  2
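Since you said you'd like a duplicates function you can actually follow, here is a shorter sketch built on pandas' own duplicated()/drop_duplicates(); note it flags every repeat after the first occurrence, whereas your original function picks out only the second member of each duplicate group:

# Indices of all rows that are repeats of an earlier row (keep='first' is the default)
dup_indices = df.index[df.duplicated()].tolist()
print(dup_indices)        # [4, 5] for the toy data above

# The dataset with duplicates removed (keeps the first occurrence of each row)
no_dups = df.drop_duplicates()
print(no_dups)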
I'm trying to figure out the average of increasing values in my table per column.
my table
  A | B |  C
------------
  0 | 5 | 10
100 | 2 | 20
 50 | 2 | 30
100 | 0 | 40
function I'm trying to write for my problem
def avergeIncreace(data, value):  # not complete but what I have so far
    x = data[value].pct_change().fillna(0).gt(0)
    print(x)
pct_change() returns a series with each value's percentage change relative to the value in the row before it. fillna(0) replaces the NaN that pct_change() leaves in position 0 with 0. gt(0) returns a True/False series depending on whether the value at each index is greater than 0.
current output of this function
In[1]: avergeIncreace(df,'A')
Out[1]: 0    False
        1     True
        2    False
        3     True
        Name: A, dtype: bool
desired output
In[1]:avergeIncreace(df,'A')
Out[1]:75
In[2]:avergeIncreace(df,'B')
Out[2]:0
In[3]:avergeIncreace(df,'C')
Out[3]:10
From my limited understanding of pandas there should be a way to return an array of all the indexes that are true and then use a for loop and go through the original data table, but I believe pandas should have a way to do this without a for loop.
Here is what I think the for-loop way would look like, minus the code that would make the returned indexes only the True ones instead of every index:
avergeIncreace(df,'A')

indexes = data[value].pct_change().fillna(0).gt(0).index.values  # this returns an array containing all of the indexes (True and False)
answer = 0
times = 0
for x in indexes:
    answer += (data[value][x] - data[value][x-1])
    times += 1
print(answer/times)
How do I achieve my desired output without using a for loop in the function?
You can use mask() and diff():
df.diff().mask(df.diff()<=0, np.nan).mean().fillna(0)
Yields:
A 75.0
B 0.0
C 10.0
dtype: float64
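For completeness, here is that one-liner in a runnable context with the table from the question (it assumes import numpy as np for np.nan):

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [0, 100, 50, 100],
                   'B': [5, 2, 2, 0],
                   'C': [10, 20, 30, 40]})

# Keep only the positive row-to-row differences, average them per column,
# and fall back to 0 for columns that never increase.
print(df.diff().mask(df.diff() <= 0, np.nan).mean().fillna(0))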
How about
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [0, 100, 50, 100],
'B': [5, 2, 2, 0],
'C': [10, 20, 30, 40]})
def averageIncrease(df, col_name):
    # Create array of deltas. Replace nan and negative values with zero
    a = np.maximum(df[col_name] - df[col_name].shift(), 0).replace(np.nan, 0)
    # Count non-zero values
    count = np.count_nonzero(a)
    if count == 0:
        # If only zero values… there is no increase
        return 0
    else:
        return np.sum(a) / count
print(averageIncrease(df, 'A'))
print(averageIncrease(df, 'B'))
print(averageIncrease(df, 'C'))
75.0
0
10.0
I am trying to find the index of each row that comes right before a "None" occurs.
pId=["a","b","c","None","d","e","None"]
df = pd.DataFrame(pId,columns=['pId'])
pId
0 a
1 b
2 c
3 None
4 d
5 e
6 None
df.index[df.pId.eq('None') & df.pId.ne(df.pId.shift(-1))]
I am expecting the output of the above code to be
Index([2,5])
It gives me
Index([3,6])
Please correct me
I am not sure about the specific example you showed. Anyway, you could do it in a simpler way:
indexes = [i-1 for i,x in enumerate(pId) if x == 'None']
The problem is that you're returning the index of the "None". You compare it against the previous item, but you're still reporting the index of the "None". Note that your accepted answer doesn't make this check.
In short, you still need to plaster a "-1" onto the result of your checking.
Just subtract 1 from df[df["pId"] == "None"].index:
import pandas as pd
pId=["a","b","c","None","d","e","None"]
df = pd.DataFrame(pId,columns=['pId'])
print(df[df["pId"] == "None"].index - 1)
Which gives you:
Int64Index([2, 5], dtype='int64')
Or if you just want a list of values:
(df[df["pId"] == "None"].index - 1).tolist()
You should be aware that for a list like:
pId=["None","None","b","c","None","d","e","None"]
You get a df like:
pId
0 None
1 None
2 b
3 c
4 None
5 d
6 e
7 None
And output like:
[-1, 0, 3, 6]
Which does not make a great deal of sense.
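If you want something that behaves sensibly in that situation too, one option is to look at the next row with shift(-1) instead of subtracting 1 afterwards; that way a "None" in the very first position simply produces no match. A sketch:

import pandas as pd

pId = ["None", "None", "b", "c", "None", "d", "e", "None"]
df = pd.DataFrame(pId, columns=['pId'])

# Rows whose *next* value is "None" and which are not "None" themselves
mask = df['pId'].shift(-1).eq('None') & df['pId'].ne('None')
print(df.index[mask].tolist())   # [3, 6]

With the original list from the question it gives [2, 5], which matches the expected output.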
I have a DataFrame with one column with positive and negative integers. For each row, I'd like to see how many consecutive rows (starting with and including the current row) have negative values.
So if a sequence was 2, -1, -3, 1, -1, the result would be 0, 2, 1, 0, 1.
I can do this by iterating over all the indices, using .iloc to split the column, and next() to find out where the next positive value is. But I feel like this isn't taking advantage of pandas' capabilities, and I imagine that there's a better way of doing it. I've experimented with using .shift() and expanding_window but without success.
Is there a more "pandastic" way of finding out how many consecutive rows after the current one meet some logical condition?
Here's what's working now:
import pandas as pd

df = pd.DataFrame({"a": [2, -1, -3, -1, 1, 1, -1, 1, -1]})
df["b"] = 0
for i in df.index:
    sub = df.iloc[i:].a.tolist()
    df.b.iloc[i] = next((sub.index(n) for n in sub if n >= 0), 1)
Edit: I realize that even my own example doesn't work when there's more than one negative value at the end. So that makes a better solution even more necessary.
Edit 2: I stated the problem in terms of integers, but originally only put 1 and -1 in my example. I need to solve for positive and negative integers in general.
FWIW, here's a fairly pandastic answer that requires no functions or applies. It borrows from here (among other answers, I'm sure), and thanks to @DSM for mentioning the ascending=False option:
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [2, -1, -3, -1, 1, 1, -1, 1, -1, -2]})
df['pos'] = df.a > 0
df['grp'] = (df['pos'] != df['pos'].shift()).cumsum()
dfg = df.groupby('grp')
df['c'] = np.where(df['a'] < 0, dfg.cumcount(ascending=False) + 1, 0)
   a    pos  grp  c
0   2   True    1  0
1  -1  False    2  3
2  -3  False    2  2
3  -1  False    2  1
4   1   True    3  0
5   1   True    3  0
6  -1  False    4  1
7   1   True    5  0
8  -1  False    6  2
9  -2  False    6  1
I think a nice thing about this method is that once you set up the 'grp' variable you can do lots of things very easily with standard groupby methods.
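For instance (an extra illustration, not part of the original answer), reusing df, dfg and np from the snippet above, you can attach the length of each negative run back onto every row with a plain transform:

# Total length of the run each row belongs to, but only for the negative runs
df['run_len'] = np.where(df['a'] < 0, dfg['a'].transform('size'), 0)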
This was an interesting puzzle. I found a way to do it using pandas tools, but I think you'll agree it's a lot more opaque :-). Here's the example:
import pandas
import numpy as np

data = pandas.Series([1, -1, -1, -1, 1, -1, -1, 1, 1, -1, 1])

x = data[::-1]  # reverse the data
print(x.groupby(((x < 0) != (x < 0).shift()).cumsum()).apply(lambda x: pandas.Series(
    np.arange(len(x)) + 1 if (x < 0).all() else np.zeros(len(x)),
    index=x.index))[::-1])
The output is correct:
0 0
1 3
2 2
3 1
4 0
5 2
6 1
7 0
8 0
9 1
10 0
dtype: float64
The basic idea is similar to what I described in my answer to this question, and you can find the same approach used in various answers that ask how to make use of inter-row information in pandas. Your question is slightly trickier because your criterion goes in reverse (asking for the number of following negatives rather than the number of preceding negatives), and because you only want one side of the grouping (i.e., you only want the number of consecutive negatives, not the number of consecutive numbers with the same sign).
Here is a more verbose version of the same code with some explanation that may make it easier to grasp:
def getNegativeCounts(x):
    # This function takes as input a sequence of numbers, all the same sign.
    # If they're negative, it returns an increasing count of how many there are.
    # If they're positive, it just returns the same number of zeros.
    # [-1, -2, -3] -> [1, 2, 3]
    # [1, 2, 3] -> [0, 0, 0]
    if (x < 0).all():
        return pandas.Series(np.arange(len(x)) + 1, index=x.index)
    else:
        return pandas.Series(np.zeros(len(x)), index=x.index)
# we have to reverse the data because cumsum only works in the forward direction
x = data[::-1]
# compute for each number whether its sign differs from the previous one's
# (True marks the start of a new block of consecutive same-sign numbers)
signChange = (x < 0) != (x < 0).shift()
# cumsum this to get an "ID" for each block of consecutive same-sign numbers
sameSignBlocks = signChange.cumsum()
# group on these block IDs
g = x.groupby(sameSignBlocks)
# for each block, apply getNegativeCounts
# this will either give us the running total of negatives in the block,
# or a stretch of zeros if the block was positive
# the [::-1] at the end reverses the result
# (to compensate for our reversing the data initially)
g.apply(getNegativeCounts)[::-1]
As you can see, run-length-style operations are not usually simple in pandas. There is, however, an open issue for adding more grouping/partitioning abilities that would ameliorate some of this. In any case, your particular use case has some specific quirks that make it a bit different from a typical run-length task.