Change pandas dataframe based on a series - python

I have data that I have converted into a pandas DataFrame:
import pandas as pd

d = [
    (1, 70399, 0.988375133622),
    (1, 33919, 0.981573492596),
    (1, 62461, 0.981426807114),
    (579, 1, 0.983018778374),
    (745, 1, 0.995580488899),
    (834, 1, 0.980942505189)
]
df_beda = pd.DataFrame(d, columns=['source', 'target', 'weight'])
and I need to build a series that maps the source and target values to new ids:
e = []
for x in d:
    e.append(x[0])
    e.append(x[1])
e = sorted(set(e))
df_new = pd.DataFrame(e, columns=['source_target']).sort_values(['source_target'], ascending=[True])
df_new.source_target = (df_new.source_target.diff() != 0).cumsum() - 1
new_ser = pd.Series(df_new.source_target.values, index=e).drop_duplicates()
so I get this series:
source_target
1 0
579 1
745 2
834 3
33919 4
62461 5
70399 6
dtype: int64
I have tried to change the dataframe df_beda based on the new_ser series using:
df_beda.target = df_beda.target.mask(df_beda.target.isin(new_ser), df_beda.target.map(new_ser)).astype(int)
df_beda.source = df_beda.source.mask(df_beda.source.isin(new_ser), df_beda.source.map(new_ser)).astype(int)
but the result is:
source target weight
0 0 70399 0.988375
1 0 33919 0.981573
2 0 62461 0.981427
3 579 0 0.983019
4 745 0 0.995580
5 834 0 0.980943
It's wrong; the ideal result is:
source target weight
0 0 6 0.988375
1 0 4 0.981573
2 0 5 0.981427
3 1 0 0.983019
4 2 0 0.995580
5 3 0 0.980943
Can anyone show me where my mistake is?
Thanks

If the order doesn't matter, you can do the following. Avoid for loops unless absolutely necessary.
import numpy as np

uniq_vals = np.unique(df_beda[['source', 'target']])
map_dict = dict(zip(uniq_vals, range(len(uniq_vals))))
df_beda[['source', 'target']] = df_beda[['source', 'target']].replace(map_dict)
print(df_beda)
source target weight
0 0 6 0.988375
1 0 4 0.981573
2 0 5 0.981427
3 1 0 0.983019
4 2 0 0.995580
5 3 0 0.980943
If you want to roll back, you can create an inverse map from the original one, because it is guaranteed to be a one-to-one mapping.
inverse_map = {v: k for k, v in map_dict.items()}
df_beda[['source', 'target']] = df_beda[['source', 'target']].replace(inverse_map)
print(df_beda)
source target weight
0 1 70399 0.988375
1 1 33919 0.981573
2 1 62461 0.981427
3 579 1 0.983019
4 745 1 0.995580
5 834 1 0.980943
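As for where the original attempt went wrong (my reading; the answer above doesn't spell it out): Series.isin(new_ser) tests membership against new_ser's values (the new ids 0-6), not its index (the original ids), so the mask only fires for rows whose value happens to collide with a new id, here 1. Since every source and target value appears in new_ser's index, a plain map() is enough and no mask is needed. A minimal sketch of the corrected attempt, reusing the question's df_beda and new_ser:
# map() looks each value up in new_ser's index, which holds the original ids,
# so both columns are translated to the new dense ids in one pass.
df_beda['source'] = df_beda['source'].map(new_ser).astype(int)
df_beda['target'] = df_beda['target'].map(new_ser).astype(int)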

Related

Counting frequencies of a list of words in each row in a data frame in python

I would like to ask how to create new columns for an existing data frame from a list of words. I was counting verb frequencies in each string in a data frame. The verb list looks as below:
<bound method DataFrame.to_dict of verb
0 agree
1 bear
2 care
3 choose
4 be>
The code below runs, but the output is the total frequency of all the words combined, instead of a column of counts for each word in the word list.
#ver.1 code
import pandas as pd
verb = pd.read_csv('cog_verb.csv')
df2 = pd.DataFrame(df.answer_id)
for x in verb:
    df2[f'count_{x}'] = lemma.str.count('|'.join(r"\b{}\b".format(x)))
The code was updated to reflect the helpful comment by Drakax, as below:
#updated code
for x in verb:
    df2.to_dict()[f'count_{x}'] = lemma.str.count('|'.join(r"\b{}\b".format(x)))
but both versions produced the same output:
<bound method DataFrame.to_dict of answer_id count_verb
0 312 91
1 1110 123
2 2700 102
3 2764 217
4 2806 182
.. ... ...
321 33417 336
322 36558 517
323 37316 137
324 37526 119
325 45683 1194
[326 rows x 2 columns]>
----- updated info -----
As advised by Drakax, I have added the first data frame below.
df.to_dict
<bound method DataFrame.to_dict of answer_id text
0 312 ANON_NAME_0\n Here are a few instructions for ...
1 1110 October16,2006 \nDear Dad,\n\n I am going to g...
2 2700 My Writing Habits\n I do many things before I...
3 2764 My Ideas about Writing\n I have many ideas bef...
4 2806 I've main habits for writing and I sure each o...
.. ... ...
321 33417 ????????????????????????\n???????????????? ?? ...
322 36558 In this world, there are countless numbers of...
323 37316 My Friend's Room\nWhen I was kid I used to go ...
324 37526 ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ...
325 45683 Primary and Secondary Education in South Korea...
[326 rows x 2 columns]>
While the above totals are correct, I want each word's frequency in its own column.
I appreciate any help you can provide. Many thanks in advance!
Well, it still seems a bit of a mess, but I think I've understood what you want; you can adapt/update your code from mine:
1. This step is only for me: creating a new DF with randomly generated strings:
import pandas as pd
from pandas._testing import rands_array

randstr = rands_array(10, 10)
df = pd.DataFrame(data=randstr, columns=["randstr"])
df
      randstr  count
0  20uDmHdBL5      1
1  E62AeycGdy      1
2  tHz99eI8BC      1
3  iZLXfs7R4k      1
4  bURRiuxHvc      2
5  lBDzVuB3z9      1
6  GuIZHOYUr5      1
7  k4wVvqeRkD      1
8  oAIGt8pHbI      1
9  N3BUMfit7a      2
2. Then, to count the occurrences of your desired regex, simply do this:
reg = ['a', 'e', 'i', 'o', 'u']  # this is where you store your verbs

def count_reg(df):
    for i in reg:
        df[i] = df['randstr'].str.count(i)
    return df

count_reg(df)
      randstr  a  e  i  o  u
0  h2wcd5yULo  0  0  0  1  0
1  uI400TZnJl  0  0  0  0  1
2  qMiI7morYG  0  0  1  1  0
3  f6Aw6AH3TL  0  0  0  0  0
4  nJ0h9IsDn6  0  0  0  0  0
5  tWyNxnzLwv  0  0  0  0  0
6  V4sTYcPsiB  0  0  1  0  0
7  tSgni67247  0  0  1  0  0
8  sUZn3L08JN  0  0  0  0  0
9  qDiG3Zynk0  0  0  1  0  0
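Adapting that pattern to the question's own setting, here is a minimal sketch (the sample data is made up, and the text column stands in for the question's lemmatized strings): str.count with a \b word-boundary regex yields one count column per verb:
import pandas as pd

# Hypothetical stand-ins for cog_verb.csv and the answers table.
verbs = ['agree', 'bear', 'care', 'choose', 'be']
df = pd.DataFrame({
    'answer_id': [312, 1110],
    'text': ['I agree to bear it and I care.', 'To be or not to be.'],
})

df2 = pd.DataFrame(df['answer_id'])
for v in verbs:
    # \b keeps whole-word matches only, so 'be' does not match inside 'bear'.
    df2[f'count_{v}'] = df['text'].str.count(rf'\b{v}\b')

print(df2)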

Conditional sum of non zero values

I have a dataframe as below:
Datetime Data Fn
0 18747.385417 11275.0 0
1 18747.388889 8872.0 1
2 18747.392361 7050.0 0
3 18747.395833 8240.0 1
4 18747.399306 5158.0 1
5 18747.402778 3926.0 0
6 18747.406250 4043.0 0
7 18747.409722 2752.0 1
8 18747.420139 3502.0 1
9 18747.423611 4026.0 1
I want to calculate the running sum of consecutive non-zero values of the column Fn.
I want my result dataframe to be as below:
Datetime Data Fn Sum
0 18747.385417 11275.0 0 0
1 18747.388889 8872.0 1 1
2 18747.392361 7050.0 0 0
3 18747.395833 8240.0 1 1
4 18747.399306 5158.0 1 2
5 18747.402778 3926.0 0 0
6 18747.406250 4043.0 0 0
7 18747.409722 2752.0 1 1
8 18747.420139 3502.0 1 2
9 18747.423611 4026.0 1 3
You can use groupby() and cumsum():
groups = df.Fn.eq(0).cumsum()
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
Details
First use df.Fn.eq(0).cumsum() to create pseudo-groups of consecutive non-zeros. Each zero will get a new id while consecutive non-zeros will keep the same id:
groups = df.Fn.eq(0).cumsum()
# groups Fn (Fn added just for comparison)
# 0 1 0
# 1 1 1
# 2 2 0
# 3 2 1
# 4 2 1
# 5 3 0
# 6 4 0
# 7 4 1
# 8 4 1
# 9 4 1
Then group df.Fn.ne(0) on these pseudo-groups and cumsum() to generate the within-group sequences:
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
# Datetime Data Fn Sum
# 0 18747.385417 11275.0 0 0
# 1 18747.388889 8872.0 1 1
# 2 18747.392361 7050.0 0 0
# 3 18747.395833 8240.0 1 1
# 4 18747.399306 5158.0 1 2
# 5 18747.402778 3926.0 0 0
# 6 18747.406250 4043.0 0 0
# 7 18747.409722 2752.0 1 1
# 8 18747.420139 3502.0 1 2
# 9 18747.423611 4026.0 1 3
How about using cumsum and resetting when the value is 0:
df['Fn2'] = df['Fn'].replace({0: False, 1: True})
df['Fn2'] = df['Fn2'].cumsum() - df['Fn2'].cumsum().where(df['Fn2'] == False).ffill().astype(int)
df
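One edge case worth noting (my addition, not from the answer above): if Fn starts with a nonzero value, the .where(...).ffill() chain leaves leading NaNs and .astype(int) raises. Filling with 0 before subtracting guards that case and writes the result straight into Sum:
flags = df['Fn'].ne(0)
# Cumulative count, frozen at each zero; fillna(0) covers a leading nonzero run.
resets = flags.cumsum().where(~flags).ffill().fillna(0)
df['Sum'] = (flags.cumsum() - resets).astype(int)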
You can store the Fn column in a list, then build a new list by iterating over it: while the values stay greater than zero, keep adding the current value to the running sum; otherwise reset to zero. Afterwards, make a dataframe from the list and concat it column-wise to the existing dataframe.
fn = df['Fn'].tolist()
sum_list = [fn[0]]
for i in range(1, len(fn)):
    if fn[i] > 0:
        sum_list.append(sum_list[-1] + fn[i])  # extend the running sum
    else:
        sum_list.append(0)                     # reset at zero
dfsum = pd.DataFrame(sum_list, columns=['Sum'])
df = pd.concat([df, dfsum], axis=1)
Hope this will help you; the idea is the main thing, so adjust the details as needed.
try this:
sum_arr = [0]
for val in df['Fn']:
    if val > 0:
        sum_arr.append(sum_arr[-1] + 1)
    else:
        sum_arr.append(0)
df['sum'] = sum_arr[1:]
df

Checking for subset in a column?

I'm trying to flag some price data as "stale" if the quoted price of the security hasn't changed over, let's say, 3 trading days. I'm currently trying it with:
firm["dev"] = np.std(firm["Price"],firm["Price"].shift(1),firm["Price"].shift(2))
firm["flag"] == np.where(firm["dev"] = 0, 1, 0)
But I'm getting nowhere with it. This is what my dataframe would look like.
Index  Price  Flag
1      10     0
2      11     0
3      12     0
4      12     0
5      12     1
6      11     0
7      13     0
Any help is appreciated!
If you are okay with the other conditions, you can first check whether series.diff() equals 0 and take a cumsum, testing for a cumulative count of n-1 (here 2). Also check that the current row equals the previous one; when both conditions hold, assign a flag of 1, else 0.
n = 3
firm['Flag'] = (firm['Price'].diff().eq(0).cumsum().eq(n-1) &
                firm['Price'].eq(firm['Price'].shift())).astype(int)
EDIT: to make it a generalized function for consecutive n, use this:
def fun(df, col, n):
    c = df[col].diff().eq(0)
    return (c | c.shift(-1)).cumsum().ge(n) & df[col].eq(df[col].shift())

firm['flag_2'] = fun(firm, 'Price', 2).astype(int)
firm['flag_3'] = fun(firm, 'Price', 3).astype(int)
print(firm)
       Price  Flag  flag_2  flag_3
Index
1         10     0       0       0
2         11     0       0       0
3         12     0       0       0
4         12     0       1       0
5         12     1       1       1
6         11     0       0       0
7         13     0       0       0
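A rolling-window alternative (my sketch, not from the answer): the price is stale when the last n quotes are identical, which is exactly when the rolling max equals the rolling min over an n-row window:
import pandas as pd

firm = pd.DataFrame({'Price': [10, 11, 12, 12, 12, 11, 13]},
                    index=range(1, 8))
firm.index.name = 'Index'

n = 3
# All n values in a window are equal exactly when the window's max equals its
# min; the first n-1 windows are NaN and the comparison yields False there.
firm['Flag'] = (firm['Price'].rolling(n).max()
                .eq(firm['Price'].rolling(n).min())
                .astype(int))
print(firm)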

Compute average of the pandas df conditioned on a parameter

I have the following df:
import numpy as np
import pandas as pd
a = []
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10,4)))
    tmp_df['lvl'] = i
    a.append(tmp_df)
df = pd.concat(a, axis=0)
df =
0 1 2 3 lvl
0 0.928623 0.868600 0.854186 0.129116 0
1 0.667870 0.901285 0.539412 0.883890 0
2 0.384494 0.697995 0.242959 0.725847 0
3 0.993400 0.695436 0.596957 0.142975 0
4 0.518237 0.550585 0.426362 0.766760 0
5 0.359842 0.417702 0.873988 0.217259 0
6 0.820216 0.823426 0.585223 0.553131 0
7 0.492683 0.401155 0.479228 0.506862 0
..............................................
3 0.505096 0.426465 0.356006 0.584958 3
4 0.145472 0.558932 0.636995 0.318406 3
5 0.957969 0.068841 0.612658 0.184291 3
6 0.059908 0.298270 0.334564 0.738438 3
7 0.662056 0.074136 0.244039 0.848246 3
8 0.997610 0.043430 0.774946 0.097294 3
9 0.795873 0.977817 0.780772 0.849418 3
0 0.577173 0.430014 0.133300 0.760223 4
1 0.916126 0.623035 0.240492 0.638203 4
2 0.165028 0.626054 0.225580 0.356118 4
3 0.104375 0.137684 0.084631 0.987290 4
4 0.934663 0.835608 0.764334 0.651370 4
5 0.743265 0.072671 0.911947 0.925644 4
6 0.212196 0.587033 0.230939 0.994131 4
7 0.945275 0.238572 0.696123 0.536136 4
8 0.989021 0.073608 0.720132 0.254656 4
9 0.513966 0.666534 0.270577 0.055597 4
I am learning pandas and wondering: what is the easiest way to compute the average along the lvl column?
What I mean is:
(df[df.lvl ==0 ] + df[df.lvl ==1 ] + df[df.lvl ==2 ] + df[df.lvl ==3 ] + df[df.lvl ==4 ]) / 5
The desired output should be a table of shape (10, 4), without the lvl column, where each element is the average of the 5 elements with lvl = [0, 1, 2, 3, 4]. I hope that clarifies it.
I think you need:
np.random.seed(456)
a = []
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10,4)))
    tmp_df['lvl'] = i
    a.append(tmp_df)
df = pd.concat(a, axis=0)
#print (df)

df1 = (df[df.lvl == 0] + df[df.lvl == 1] +
       df[df.lvl == 2] + df[df.lvl == 3] +
       df[df.lvl == 4]) / 5
print (df1)
0 1 2 3 lvl
0 0.411557 0.520560 0.578900 0.541576 2
1 0.253469 0.655714 0.532784 0.620744 2
2 0.468099 0.576198 0.400485 0.333533 2
3 0.620207 0.367649 0.531639 0.475587 2
4 0.699554 0.548005 0.683745 0.457997 2
5 0.322487 0.316137 0.489660 0.362146 2
6 0.430058 0.159712 0.631610 0.641141 2
7 0.399944 0.511944 0.346402 0.754591 2
8 0.400190 0.373925 0.340727 0.407988 2
9 0.502879 0.399614 0.321710 0.715812 2
df = df.set_index('lvl')
df2 = df.groupby(df.groupby('lvl').cumcount()).mean()
print (df2)
0 1 2 3
0 0.411557 0.520560 0.578900 0.541576
1 0.253469 0.655714 0.532784 0.620744
2 0.468099 0.576198 0.400485 0.333533
3 0.620207 0.367649 0.531639 0.475587
4 0.699554 0.548005 0.683745 0.457997
5 0.322487 0.316137 0.489660 0.362146
6 0.430058 0.159712 0.631610 0.641141
7 0.399944 0.511944 0.346402 0.754591
8 0.400190 0.373925 0.340727 0.407988
9 0.502879 0.399614 0.321710 0.715812
EDIT: If each subset of the DataFrame has an index running from 0 to len(subset), you can average over the index level directly (note that df.mean(level=0) was deprecated and later removed; on recent pandas use df.groupby(level=0).mean()):
df2 = df.mean(level=0)
print (df2)
0 1 2 3 lvl
0 0.411557 0.520560 0.578900 0.541576 2
1 0.253469 0.655714 0.532784 0.620744 2
2 0.468099 0.576198 0.400485 0.333533 2
3 0.620207 0.367649 0.531639 0.475587 2
4 0.699554 0.548005 0.683745 0.457997 2
5 0.322487 0.316137 0.489660 0.362146 2
6 0.430058 0.159712 0.631610 0.641141 2
7 0.399944 0.511944 0.346402 0.754591 2
8 0.400190 0.373925 0.340727 0.407988 2
9 0.502879 0.399614 0.321710 0.715812 2
The groupby function is exactly what you want. It will group based on a condition, in this case where 'lvl' is the same, and then apply the mean function to the values for each column in that group.
df.groupby('lvl').mean()
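Note (my addition, not part of this answer): df.groupby('lvl').mean() returns one row per lvl value, i.e. a (5, 4) table of per-level means, which is a different aggregation from the per-position average the question asks for:
# Per-level means: one row for each lvl value 0..4.
print(df.groupby('lvl').mean().shape)                    # (5, 4)
# Per-position means across levels: one row per original index 0..9.
print(df.groupby(df.index)[[0, 1, 2, 3]].mean().shape)   # (10, 4)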
It seems like you want to group by the index and take the average of all the columns except lvl, i.e.
df.groupby(df.index)[[0,1,2,3]].mean()
For a dataframe generated using
np.random.seed(456)
a = []
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10,4)))
    tmp_df['lvl'] = i
    a.append(tmp_df)
df = pd.concat(a, axis=0)
df.groupby(df.index)[[0,1,2,3]].mean()
outputs:
0 1 2 3
0 0.411557 0.520560 0.578900 0.541576
1 0.253469 0.655714 0.532784 0.620744
2 0.468099 0.576198 0.400485 0.333533
3 0.620207 0.367649 0.531639 0.475587
4 0.699554 0.548005 0.683745 0.457997
5 0.322487 0.316137 0.489660 0.362146
6 0.430058 0.159712 0.631610 0.641141
7 0.399944 0.511944 0.346402 0.754591
8 0.400190 0.373925 0.340727 0.407988
9 0.502879 0.399614 0.321710 0.715812
which is identical to the output from
df.groupby(df.groupby('lvl').cumcount()).mean()
without resorting to a double groupby. IMO this is cleaner to read and, for large dataframes, will be much faster.
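For completeness, a self-contained sketch (my addition) of the index-based grouping; groupby(level=0) also works on pandas 2.x, where df.mean(level=0) is gone:
import numpy as np
import pandas as pd

np.random.seed(456)
a = []
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10, 4)))
    tmp_df['lvl'] = i
    a.append(tmp_df)
df = pd.concat(a, axis=0)

# Rows sharing the same index label (0..9) are averaged across the 5 levels.
out = df.drop(columns='lvl').groupby(level=0).mean()
print(out)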

best way to implement Apriori in python pandas

What is the best way to implement the Apriori algorithm in pandas? So far I am stuck on extracting the patterns using for loops. Everything from the for loop onward does not work. Is there a vectorized way to do this in pandas?
import pandas as pd
import numpy as np

trans = pd.read_table('output.txt', header=None, index_col=0)

def apriori(trans, support=4):
    ts = pd.get_dummies(trans.unstack().dropna()).groupby(level=1).sum()
    # user input
    collen, rowlen = ts.shape
    # max length of items
    tssum = ts.sum(axis=1)
    maxlen = tssum.loc[tssum.idxmax()]
    items = list(ts.columns)
    results = []
    # loop through items
    for c in range(1, maxlen):
        # generate patterns
        pattern = []
        for n in len(pattern):
            # calculate support
            pattern = ['supp'] = pattern.sum / rowlen
            # filter by support level
            Condit = pattern['supp'] > support
            pattern = pattern[Condit]
            results.append(pattern)
    return results

results = apriori(trans)
print(results)
When I feed it this input with support 3:
a b c d e
0
11 1 1 1 0 0
666 1 0 0 1 1
10101 0 1 1 1 0
1010 1 1 1 1 0
414147 0 1 1 0 0
10101 1 1 0 1 0
1242 0 0 0 1 1
101 1 1 1 1 0
411 0 0 1 1 1
444 1 1 1 0 0
it should output something like
Pattern support
a 6
b 7
c 7
d 7
e 3
a,b 5
a,c 4
a,d 4
Assuming I understand what you're after, maybe
from itertools import combinations

def get_support(df):
    pp = []
    for cnum in range(1, len(df.columns) + 1):
        for cols in combinations(df, cnum):
            s = df[list(cols)].all(axis=1).sum()
            pp.append([",".join(cols), s])
    sdf = pd.DataFrame(pp, columns=["Pattern", "Support"])
    return sdf
would get you started:
>>> s = get_support(df)
>>> s[s.Support >= 3]
Pattern Support
0 a 6
1 b 7
2 c 7
3 d 7
4 e 3
5 a,b 5
6 a,c 4
7 a,d 4
9 b,c 6
10 b,d 4
12 c,d 4
14 d,e 3
15 a,b,c 4
16 a,b,d 3
21 b,c,d 3
[15 rows x 2 columns]
Adding support, confidence, and lift calculation:
from itertools import combinations
import pandas as pd

def apriori(data, set_length=2):
    df_supports = []
    dataset_size = len(data)
    for combination_number in range(1, set_length + 1):
        for cols in combinations(data.columns, combination_number):
            supports = data[list(cols)].all(axis=1).sum() * 1.0 / dataset_size
            confidenceAB = data[list(cols)].all(axis=1).sum() * 1.0 / len(data[data[cols[0]] == 1])
            confidenceBA = data[list(cols)].all(axis=1).sum() * 1.0 / len(data[data[cols[-1]] == 1])
            liftAB = confidenceAB * dataset_size / len(data[data[cols[-1]] == 1])
            liftBA = confidenceAB * dataset_size / len(data[data[cols[0]] == 1])
            df_supports.append([",".join(cols), supports, confidenceAB, confidenceBA, liftAB, liftBA])
    df_supports = pd.DataFrame(df_supports, columns=['Pattern', 'Support', 'ConfidenceAB', 'ConfidenceBA', 'liftAB', 'liftBA'])
    df_supports = df_supports.sort_values(by='Support', ascending=False)
    return df_supports
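For reference, the first answer's get_support can be exercised end-to-end by encoding the question's table as a one-hot DataFrame (a sketch; the original index labels are dropped since they don't affect the support counts):
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    'b': [1, 0, 1, 1, 1, 1, 0, 1, 0, 1],
    'c': [1, 0, 1, 1, 1, 0, 0, 1, 1, 1],
    'd': [0, 1, 1, 1, 0, 1, 1, 1, 1, 0],
    'e': [0, 1, 0, 0, 0, 0, 1, 0, 1, 0],
})

s = get_support(df)
print(s[s.Support >= 3])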
