I wish to extract a dataframe of floating-point numbers based on the first instance of the positions of MrkA and Mrk1. I am not interested in the second instance of MrkA, because I already know which columns to extract via the line that builds df1.
Input:
df = pd.DataFrame({'A':['sdfg',23,'MrkA',34,0,56],'B':['jfgh',23,'sdfg','MrkB',0,56], 'C':['cvb',7,'dsfgA','ghks',47,3],'D':['rrb',7,'gfd',3,0,7],'E':['dfg',7,'gfd',5,12,1],'F':['dfg',7,'sdfA',5,0,4],'G':['dfg',7,'sdA',5,8,9],'H':['dfg',7,'gfA',5,0,8],'I':['dfg',7,'sdfA',5,7,23]})
A B C D E F G H I
0 sdfg jfgh cvb rrb dfg dfg dfg dfg dfg
1 23 23 7 7 7 7 7 7 7
2 MrkA sdfg dsfgA MrkA gfd sdfA sdA gfA sdfA
3 34 Mrk1 ghks 3 Mrk2 5 5 5 5
4 0 0 47 0 12 0 8 0 7
5 56 56 3 7 1 4 9 8 23
for i,j in range(df.shape[1]):
    for k,l in range(df.shape[0]):
        if df.iloc[k,i] == 'MrkA' and df.iloc[l,j] == 'Mrk1':
            col = i
            row = k
            df1 = df.iloc[row+2:, [col, col+1, col+2, col+4, col+5, col+7, col+8]]
            break
Output: cannot unpack non-iterable int object
Desired Output:
A B C E F H I
4 0 0 47 12 0 0 7
5 56 56 3 1 4 8 23
How shall I proceed?
Your problem is that df.shape[0] / df.shape[1] is a single integer, so each element of range(value) is a single int; trying to unpack it into two loop variables is what raises the error.
It should be:
for i in range(df.shape[1]):
    for j in range(df.shape[0]):
Then you can apply the desired logic to extract the rows.
Note that it's unclear why you ignore the second row, which is also all numeric. If that's only a typo, you can try the following to extract all the fully numeric rows and apply your logic there:
import numpy as np

df[df.applymap(np.isreal).all(1)]
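On the sample frame above this keeps rows 1, 4 and 5, the only fully numeric rows.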
Edit
Although the logic is not clear from your specific example:
In the example you gave there is no Mrk1, but rather MrkB.
Why did column D disappear?
A hard-coded example that prints the desired rows would be something similar to the following:
import pandas as pd
df = pd.DataFrame({'A':['sdfg',23,'MrkA',34,0,56],'B':['jfgh',23,'sdfg','MrkB',0,56], 'C':['cvb',7,'dsfgA','ghks',47,3],'D':['rrb',7,'gfd',3,0,7],'E':['dfg',7,'gfd',5,12,1],'F':['dfg',7,'sdfA',5,0,4],'G':['dfg',7,'sdA',5,8,9],'H':['dfg',7,'gfA',5,0,8],'I':['dfg',7,'sdfA',5,7,23]})
for r in range(0, df.shape[0] - 1):
    for c in range(df.shape[1] - 1):
        if df.iloc[r, c] == "MrkA" and df.iloc[r + 1, c + 1] == "MrkB":
            print(df.iloc[r + 2:, :])
This gives:
A B C D E F G H I
4 0 0 47 0 12 0 8 0 7
5 56 56 3 7 1 4 9 8 23
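For the full desired output (columns D and G dropped as well), you can reuse the column offsets from the df1 line in your question at the matched position. A minimal sketch, assuming the marker pair really is 'MrkA' with 'MrkB' one row below and one column to the right, as in the sample data:
for r in range(df.shape[0] - 1):
    for c in range(df.shape[1] - 1):
        if df.iloc[r, c] == "MrkA" and df.iloc[r + 1, c + 1] == "MrkB":
            # column offsets copied verbatim from the question's df1 line
            df1 = df.iloc[r + 2:, [c, c + 1, c + 2, c + 4, c + 5, c + 7, c + 8]]
            print(df1)  # columns A B C E F H I, rows 4 and 5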
Assume I have the following pandas data frame:
my_class value
0 1 1
1 1 2
2 1 3
3 2 4
4 2 5
5 2 6
6 2 7
7 2 8
8 2 9
9 3 10
10 3 11
11 3 12
I want to identify the indices of "my_class" where the class changes and remove n rows after and before this index. The output of this example (with n=2) should look like:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
My approach:
# where class changes happen
s = df['my_class'].ne(df['my_class'].shift(-1).fillna(df['my_class']))
# mask with `bfill` and `ffill`
df[~(s.where(s).bfill(limit=1).ffill(limit=2).eq(1))]
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
One possible solution is to:
Make use of the fact that the index contains consecutive integers.
Find index values where class changes.
For each such index i, generate a sequence of indices from i-2 to i+1 and concatenate them.
Retrieve rows with indices not in this list.
The code to do it is:
ind = df[df['my_class'].diff().fillna(0, downcast='infer') == 1].index
df[~df.index.isin([item for sublist in [range(i - 2, i + 2) for i in ind]
                   for item in sublist])]
Another option is to rebuild the data with a helper 'diff' column and drop the unwanted index ranges:
import numpy as np
import pandas as pd

my_class = np.array([1] * 3 + [2] * 6 + [3] * 3)
cols = np.c_[my_class, np.arange(len(my_class)) + 1]
df = pd.DataFrame(cols, columns=['my_class', 'value'])
df['diff'] = df['my_class'].diff().fillna(0)
idx2drop = []
for i in df[df['diff'] == 1].index:
    idx2drop += range(i - 2, i + 2)
print(df.drop(idx2drop)[['my_class', 'value']])
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
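For reference, the same index-based idea can be wrapped in a function parameterized by n. A minimal sketch, assuming the frame from the question, with consecutive integers as index and class labels that only ever step upward:
def drop_around_changes(df, n=2):
    # first row of each new class
    change_idx = df.index[df['my_class'].diff().fillna(0).ne(0)]
    # drop the n rows before each boundary and the n rows starting at it
    to_drop = {j for i in change_idx for j in range(i - n, i + n)}
    return df[~df.index.isin(to_drop)]

print(drop_around_changes(df, n=2))  # keeps rows 0, 5, 6 and 11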
Here is my data:
import numpy as np
import pandas as pd
z = pd.DataFrame({'a':[1,1,1,2,2,3,3],'b':[3,4,5,6,7,8,9], 'c':[10,11,12,13,14,15,16]})
z
a b c
0 1 3 10
1 1 4 11
2 1 5 12
3 2 6 13
4 2 7 14
5 3 8 15
6 3 9 16
Question:
How can I do a calculation on different elements of each subgroup? For example, for each group I want to extract every element in column 'c' whose corresponding element in column 'b' is between 4 and 9, and sum them all.
Here is the code I wrote (it runs, but I cannot get the correct result):
gbz = z.groupby('a')
# For displaying the groups:
gbz.apply(lambda x: print(x))
list = []

def f(x):
    list_new = []
    for row in range(0, len(x)):
        if (x.iloc[row, 0] > 4 and x.iloc[row, 0] < 9):
            list_new.append(x.iloc[row, 1])
    list.append(sum(list_new))

results = gbz.apply(f)
The output result should be something like this:
a c
0 1 12
1 2 27
2 3 15
It might just be easiest to change the order of operations and filter against your criteria first; the criteria do not change after the groupby. (The reason your function gives the wrong result is that inside apply, x keeps all three columns, so x.iloc[row, 0] addresses column 'a' rather than 'b'.)
z.query('4 < b < 9').groupby('a', as_index=False).c.sum()
which yields
a c
0 1 12
1 2 27
2 3 15
Use
In [2379]: z[z.b.between(4, 9, inclusive=False)].groupby('a', as_index=False).c.sum()
Out[2379]:
a c
0 1 12
1 2 27
2 3 15
Or
In [2384]: z[(4 < z.b) & (z.b < 9)].groupby('a', as_index=False).c.sum()
Out[2384]:
a c
0 1 12
1 2 27
2 3 15
You could also groupby first.
z = (z.groupby('a')
      .apply(lambda x: x.loc[x['b'].between(4, 9, inclusive=False), 'c'].sum())
      .reset_index(name='c'))
z
a c
0 1 12
1 2 27
2 3 15
Or you can use
z.groupby('a').apply(lambda x: sum(x.loc[(x['b'] > 4) & (x['b'] < 9), 'c'])) \
    .reset_index(name='c')
Out[775]:
a c
0 1 12
1 2 27
2 3 15
I have extracted some data into a pandas DataFrame from a SQL server. The structure is like this:
df = pd.DataFrame({'Day':(1,2,3,4,1,2,3,4),'State':('A','A','A','A','B','B','B','B'),'Direction':('N','S','N','S','N','S','N','S'),'values':(12,34,22,37,14,16,23,43)})
>>> df
Day Direction State values
0 1 N A 12
1 2 S A 34
2 3 N A 22
3 4 S A 37
4 1 N B 14
5 2 S B 16
6 3 N B 23
7 4 S B 43
Now I want to replace every value that has State == 'A' with itself plus the value that has the same Day and the same Direction but State == 'B'. For example, like this:
df.loc[(df.Day == 1) & (df.Direction == 'N') & (df.State == 'A'), 'values'] = \
    df.loc[(df.Day == 1) & (df.Direction == 'N') & (df.State == 'A'), 'values'].values \
    + df.loc[(df.Day == 1) & (df.Direction == 'N') & (df.State == 'B'), 'values'].values
>>> df
Day Direction State values
0 1 N A 26
1 2 S A 34
2 3 N A 22
3 4 S A 37
4 1 N B 14
5 2 S B 16
6 3 N B 23
7 4 S B 43
Notice the first row's value has changed from 12 to 26 (12 + 14).
Since the values come from different rows, it seems difficult to use functions like combine_first.
Right now I use two loops (over 'Day' and over 'Direction') plus the assignment statement above, which gets extremely slow as the dataframe grows. Do you have a smart and efficient way of doing this?
You can first define a function that adds the B value to the A value within the same group, then apply this function to each group.
def f(x):
    x.loc[x.State == 'A', 'values'] += x.loc[x.State == 'B', 'values'].iloc[0]
    return x

df.groupby(['Day', 'Direction']).apply(f)
Out[94]:
Day Direction State values
0 1 N A 26
1 2 S A 50
2 3 N A 45
3 4 S A 80
4 1 N B 14
5 2 S B 16
6 3 N B 23
7 4 S B 43
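If groupby().apply() is still too slow on a big frame, the same addition can be done fully vectorized, avoiding the Python-level per-group calls where apply spends most of its time. A sketch, assuming every (Day, Direction) pair contains exactly one State 'B' row:
import pandas as pd

df = pd.DataFrame({'Day': (1, 2, 3, 4, 1, 2, 3, 4),
                   'State': ('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                   'Direction': ('N', 'S', 'N', 'S', 'N', 'S', 'N', 'S'),
                   'values': (12, 34, 22, 37, 14, 16, 23, 43)})

# Keep only the B values (NaN elsewhere), then broadcast each group's single
# B value to every row of its (Day, Direction) group; 'max' just picks it out.
b_per_group = (df['values'].where(df['State'] == 'B')
                           .groupby([df['Day'], df['Direction']])
                           .transform('max'))

# Add the group's B value to the A rows only.
mask = df['State'] == 'A'
df.loc[mask, 'values'] += b_per_group[mask].astype(int)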
What is the best way to implement the Apriori algorithm in pandas? So far I have got stuck on extracting the patterns using for loops. Everything from the for loop onward does not work. Is there a vectorized way to do this in pandas?
import pandas as pd
import numpy as np
trans=pd.read_table('output.txt', header=None,index_col=0)
def apriori(trans, support=4):
    ts = pd.get_dummies(trans.unstack().dropna()).groupby(level=1).sum()
    # user input
    collen, rowlen = ts.shape
    # max length of items
    tssum = ts.sum(axis=1)
    maxlen = tssum.loc[tssum.idxmax()]
    items = list(ts.columns)
    results = []
    # loop through items
    for c in range(1, maxlen):
        # generate patterns
        pattern = []
        for n in len(pattern):
            # calculate support
            pattern=['supp']=pattern.sum/rowlen
            # filter by support level
            Condit = pattern['supp'] > support
            pattern = pattern[Condit]
            results.append(pattern)
    return results
results = apriori(trans)
print(results)
When I feed it this input with support 3:
a b c d e
0
11 1 1 1 0 0
666 1 0 0 1 1
10101 0 1 1 1 0
1010 1 1 1 1 0
414147 0 1 1 0 0
10101 1 1 0 1 0
1242 0 0 0 1 1
101 1 1 1 1 0
411 0 0 1 1 1
444 1 1 1 0 0
it should output something like
Pattern support
a 6
b 7
c 7
d 7
e 3
a,b 5
a,c 4
a,d 4
Assuming I understand what you're after, maybe
from itertools import combinations
import pandas as pd

def get_support(df):
    pp = []
    for cnum in range(1, len(df.columns) + 1):
        for cols in combinations(df, cnum):
            s = df[list(cols)].all(axis=1).sum()
            pp.append([",".join(cols), s])
    sdf = pd.DataFrame(pp, columns=["Pattern", "Support"])
    return sdf
would get you started:
>>> s = get_support(df)
>>> s[s.Support >= 3]
Pattern Support
0 a 6
1 b 7
2 c 7
3 d 7
4 e 3
5 a,b 5
6 a,c 4
7 a,d 4
9 b,c 6
10 b,d 4
12 c,d 4
14 d,e 3
15 a,b,c 4
16 a,b,d 3
21 b,c,d 3
[15 rows x 2 columns]
Adding support, confidence, and lift calculations:
from itertools import combinations
import pandas as pd

def apriori(data, set_length=2):
    df_supports = []
    dataset_size = len(data)
    for combination_number in range(1, set_length + 1):
        for cols in combinations(data.columns, combination_number):
            supports = data[list(cols)].all(axis=1).sum() * 1.0 / dataset_size
            confidenceAB = data[list(cols)].all(axis=1).sum() * 1.0 / len(data[data[cols[0]] == 1])
            confidenceBA = data[list(cols)].all(axis=1).sum() * 1.0 / len(data[data[cols[-1]] == 1])
            liftAB = confidenceAB * dataset_size / len(data[data[cols[-1]] == 1])
            liftBA = confidenceAB * dataset_size / len(data[data[cols[0]] == 1])
            df_supports.append([",".join(cols), supports, confidenceAB, confidenceBA, liftAB, liftBA])
    df_supports = pd.DataFrame(df_supports, columns=['Pattern', 'Support', 'ConfidenceAB', 'ConfidenceBA', 'liftAB', 'liftBA'])
    # sort_values returns a new frame, so assign the result back
    df_supports = df_supports.sort_values(by='Support', ascending=False)
    return df_supports
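A usage sketch, assuming the one-hot transaction frame from above is available as df; note this version reports support as a fraction of the dataset, so a count of 3 out of 10 transactions corresponds to 0.3:
rules = apriori(df, set_length=2)
print(rules[rules.Support >= 0.3])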
Suppose I have the following DataFrame:
a b
0 A 1.516733
1 A 0.035646
2 A -0.942834
3 B -0.157334
4 A 2.226809
5 A 0.768516
6 B -0.015162
7 A 0.710356
8 A 0.151429
And I need to group it on the "edge" rows, those where column 'a' equals B; that means the groups will be:
a b
0 A 1.516733
1 A 0.035646
2 A -0.942834
3 B -0.157334

4 A 2.226809
5 A 0.768516
6 B -0.015162

7 A 0.710356
8 A 0.151429
That is, any time I find a 'B' in column 'a', I want to split my DataFrame.
My current solution is:
#create the dataframe
s = pd.Series(['A','A','A','B','A','A','B','A','A'])
ss = pd.Series(np.random.randn(9))
dff = pd.DataFrame({"a":s,"b":ss})
#my solution
count = 0
ls = []
for i in s:
    if i == "A":
        ls.append(count)
    else:
        ls.append(count)
        count += 1
dff['grpb'] = ls
and I got the dataframe:
a b grpb
0 A 1.516733 0
1 A 0.035646 0
2 A -0.942834 0
3 B -0.157334 0
4 A 2.226809 1
5 A 0.768516 1
6 B -0.015162 1
7 A 0.710356 2
8 A 0.151429 2
Which I can then split with dff.groupby('grpb').
Is there a more efficient way to do this using pandas' functions?
Here's a one-liner:
zip(*dff.groupby(pd.rolling_median((1*(dff['a']=='B')).cumsum(),3,True)))[-1]
[ 1 2
0 A 1.516733
1 A 0.035646
2 A -0.942834
3 B -0.157334,
1 2
4 A 2.226809
5 A 0.768516
6 B -0.015162,
1 2
7 A 0.710356
8 A 0.151429]
How about:
df.groupby((df.a == "B").shift(1).fillna(0).cumsum())
For example:
>>> df
a b
0 A -1.957118
1 A -0.906079
2 A -0.496355
3 B 0.552072
4 A -1.903361
5 A 1.436268
6 B 0.391087
7 A -0.907679
8 A 1.672897
>>> gg = list(df.groupby((df.a == "B").shift(1).fillna(0).cumsum()))
>>> pprint.pprint(gg)
[(0,
a b
0 A -1.957118
1 A -0.906079
2 A -0.496355
3 B 0.552072),
(1, a b
4 A -1.903361
5 A 1.436268
6 B 0.391087),
(2, a b
7 A -0.907679
8 A 1.672897)]
(I didn't bother getting rid of the indices; you could use [g for k, g in df.groupby(...)] if you liked.)
An alternative is:
In [36]: dff
Out[36]:
a b
0 A 0.689785
1 A -0.374623
2 A 0.517337
3 B 1.549259
4 A 0.576892
5 A -0.833309
6 B -0.209827
7 A -0.150917
8 A -1.296696
In [37]: dff['grpb'] = np.NaN
In [38]: breaks = dff[dff.a == 'B'].index
In [39]: dff['grpb'][breaks] = range(len(breaks))
In [40]: dff.fillna(method='bfill').fillna(len(breaks))
Out[40]:
a b grpb
0 A 0.689785 0
1 A -0.374623 0
2 A 0.517337 0
3 B 1.549259 0
4 A 0.576892 1
5 A -0.833309 1
6 B -0.209827 1
7 A -0.150917 2
8 A -1.296696 2
Or using itertools to create 'grpb' is an option too.
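For example, a sketch with itertools.accumulate that builds the same 'grpb' labels as above (the group id increases on the row after each 'B'):
from itertools import accumulate

edges = (dff.a == 'B').tolist()
dff['grpb'] = list(accumulate([0] + edges[:-1]))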
Here is a generic helper that labels the edge rows and numbers the groups between them:
def vGroup(dataFrame, edgeCondition, groupName='autoGroup'):
    groupNum = 0
    dataFrame[groupName] = ''
    # loop over each row
    for inx, row in dataFrame.iterrows():
        if edgeCondition[inx]:
            dataFrame.loc[inx, groupName] = 'edge'
            groupNum += 1
        else:
            dataFrame.loc[inx, groupName] = groupNum
    return dataFrame[groupName]
vGroup(dff, dff['a'] == 'B')