I'm new to Pandas and would love some help. I'm trying to take:
factor
1
1
2
1
1
3
1
2
and produce:
factor running_div
1 1
1 1
2 0.5
1 0.5
1 0.5
3 0.1666667
1 0.1666667
2 0.0833333
I can do it by looping through with .iloc, but I'm trying to use vectorized operations for efficiency. I've looked at rolling windows and .shift(1), but can't get it working. I'd appreciate any guidance anyone could provide.
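For reference, a loop-based version of what I'm doing now might look like this (a minimal sketch; it assumes the data is in a DataFrame named df with a factor column):
import pandas as pd

df = pd.DataFrame({"factor": [1, 1, 2, 1, 1, 3, 1, 2]})

# Carry a running value: start with the first factor, then divide by each later factor
running = df["factor"].iloc[0]
values = [running]
for f in df["factor"].iloc[1:]:
    running = running / f
    values.append(running)
df["running_div"] = values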
Use NumPy's ufunc.accumulate:
import numpy as np

df['cum_div'] = np.divide.accumulate(df.factor.to_numpy())
factor cum_div
0 1 1.000000
1 1 1.000000
2 2 0.500000
3 1 0.500000
4 1 0.500000
5 3 0.166667
6 1 0.166667
7 2 0.083333
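ufunc.accumulate applies the ufunc cumulatively along the array, so the result is factor[0], factor[0]/factor[1], factor[0]/factor[1]/factor[2], and so on; a minimal sketch to sanity-check that:
import numpy as np

a = np.array([1, 1, 2, 1, 1, 3, 1, 2], dtype=float)
expected = [a[0]]
for x in a[1:]:
    expected.append(expected[-1] / x)  # divide the running value by each next element
assert np.allclose(np.divide.accumulate(a), expected)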
You can try this:
import pandas as pd

df = pd.DataFrame([1, 1, 2, 1, 1, 3, 1, 2], columns=["factor"])
# Keep the first factor as a float, take the reciprocal of the rest, then take the cumulative product
df["running_div"] = float(df["factor"].iloc[0])
df.loc[df.index[1:], "running_div"] = 1 / df["factor"].loc[df.index[1:]]
df["running_div"] = df["running_div"].cumprod()
print(df)
Output:
factor running_div
0 1 1.000000
1 1 1.000000
2 2 0.500000
3 1 0.500000
4 1 0.500000
5 3 0.166667
6 1 0.166667
7 2 0.083333
A cumulative division is done by keeping the first element and then cumulatively multiplying by the inverse of each subsequent element.
Hence, using np.cumprod:
import numpy as np

df['division'] = np.cumprod([df.factor.iloc[0], *(1 / df.factor.iloc[1:])])
factor division
0 1 1.000000
1 1 1.000000
2 2 0.500000
3 1 0.500000
4 1 0.500000
5 3 0.166667
6 1 0.166667
7 2 0.083333
Related
I have code implemented in pandas, but I'm having trouble converting it to dask because I need to use set_index(). What is the best workaround? I'm using dask because I need to scale this to much larger dataframes.
I am looking to return a dataframe where each element is divided by the column-wise sum of a group.
Here's an example dataframe:
df = [
[1,4,2,1],
[4,4,0,-1],
[2,3,1,6],
[-2,1,0,-1],
[6,-3,-2,-1],
[1,0,5,5],
]
df = pd.DataFrame(df)
lab_id = ['a','b','a','b','a','c']
df['lab_id'] = lab_id
df
0 1 2 3 lab_id
0 1 4 2 1 a
1 4 4 0 -1 b
2 2 3 1 6 a
3 -2 1 0 -1 b
4 6 -3 -2 -1 a
5 1 0 5 5 c
Currently in pandas I do a groupby and sum to return a dataframe:
sum_df = df.groupby('lab_id').sum()
sum_df
0 1 2 3
lab_id
a 9 4 1 6
b 2 5 0 -2
c 1 0 5 5
And then I set the index of the original data frame and divide by the sum dataframe:
df.set_index('lab_id')/sum_df
0 1 2 3
lab_id
a 0.111111 1.00 2.0 0.166667
a 0.222222 0.75 1.0 1.000000
a 0.666667 -0.75 -2.0 -0.166667
b 2.000000 0.80 NaN 0.500000
b -1.000000 0.20 NaN 0.500000
c 1.000000 NaN 1.0 1.000000
The main problem is that I'm having a huge issue setting the index in dask, whose documentation explicitly says to avoid the set_index() and reset_index() methods. I simply can't find a way around using them!
I have tried many arcane ways to set the index outside of dask, such as creating a new dataframe with the index already set and a row of dummy data, then iteratively assigning the columns from the old dataframe (this is some of the worst code I've written).
Try transform, which keeps the original index and so avoids set_index entirely:
df.loc[:,[0,1,2,3]] = df/df.groupby('lab_id').transform('sum')[[0,1,2,3]]
df
Out[767]:
0 1 2 3 lab_id
0 0.111111 1.00 2.0 0.166667 a
1 2.000000 0.80 NaN 0.500000 b
2 0.222222 0.75 1.0 1.000000 a
3 -1.000000 0.20 NaN 0.500000 b
4 0.666667 -0.75 -2.0 -0.166667 a
5 1.000000 NaN 1.0 1.000000 c
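An equivalent pandas formulation divides only the numeric columns with div, again without ever setting the index (a minimal sketch; whether your dask version supports the same groupby().transform call is worth verifying separately):
cols = [0, 1, 2, 3]
# Divide each row by its group's column-wise sum; the original index is untouched
df[cols] = df[cols].div(df.groupby('lab_id')[cols].transform('sum'))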
I have a dataframe df like:
GROUP TYPE COUNT
A 1 5
A 2 10
B 1 3
B 2 9
C 1 20
C 2 100
I would like to add a row for each group such that the new row is the COUNT where TYPE equals 1 divided by the COUNT where TYPE equals 2 for that GROUP, like so:
GROUP TYPE COUNT
A 1 5
A 2 10
A .5
B 1 3
B 2 9
B .33
C 1 20
C 2 100
C .2
Thanks in advance.
df2 = df.pivot(index='GROUP', columns='TYPE', values='COUNT')
df2['div'] = df2[1]/df2[2]
df2.reset_index().melt('GROUP').sort_values('GROUP')
Output:
GROUP TYPE value
0 A 1 5.000000
3 A 2 10.000000
6 A div 0.500000
1 B 1 3.000000
4 B 2 9.000000
7 B div 0.333333
2 C 1 20.000000
5 C 2 100.000000
8 C div 0.200000
My approach would be to reshape the dataframe by pivoting, so every type has its own column. Then the division is very easy, and melting reshapes it back to the original form. In my opinion this is also a very readable solution.
Of course, if you prefer np.nan to 'div' as the TYPE, you can replace it very easily, but I'm not sure if that's what you want.
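For instance, a minimal sketch of that replacement, assuming the melted result above is stored in a variable (here called out, a name chosen just for illustration):
import numpy as np

out = df2.reset_index().melt('GROUP').sort_values('GROUP')
# Swap the placeholder 'div' label for NaN in the TYPE column
out['TYPE'] = out['TYPE'].replace('div', np.nan)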
# Sort and filter the original df so each group is ordered and contains only TYPE 1 and 2
s = df[df.TYPE.isin([1, 2])].sort_values(['GROUP', 'TYPE']).groupby('GROUP').COUNT.apply(lambda x: x.iloc[0] / x.iloc[1])
# Concat the result back onto the original frame
pd.concat([df, s.reset_index()]).sort_values('GROUP')
Out[77]:
COUNT GROUP TYPE
0 5.000000 A 1.0
1 10.000000 A 2.0
0 0.500000 A NaN
2 3.000000 B 1.0
3 9.000000 B 2.0
1 0.333333 B NaN
4 20.000000 C 1.0
5 100.000000 C 2.0
2 0.200000 C NaN
You can do:
import numpy as np
import pandas as pd

def add_quotient(x):
    # Build the extra row from the last row of the group, then overwrite its values
    last_row = x.iloc[-1].copy()
    last_row['COUNT'] = x[x.TYPE == 1].COUNT.min() / x[x.TYPE == 2].COUNT.max()
    last_row['TYPE'] = np.nan
    return x.append(last_row)  # on pandas >= 2.0, use pd.concat([x, last_row.to_frame().T]) instead

print(df.groupby('GROUP').apply(add_quotient))
Output
GROUP TYPE COUNT
GROUP
A 0 A 1.0 5.000000
1 A 2.0 10.000000
1 A NaN 0.500000
B 2 B 1.0 3.000000
3 B 2.0 9.000000
3 B NaN 0.333333
C 4 C 1.0 20.000000
5 C 2.0 100.000000
5 C NaN 0.200000
Note that the function selects the min of the TYPE == 1 values and the max of the TYPE == 2 values, in case there is more than one value per group. The TYPE of the new row is set to np.nan, but that can easily be changed.
Here's a way: first use sort_values by ['GROUP', 'TYPE'] to ensure TYPE 1 comes before TYPE 2 within each group, then group by GROUP.
Then use first and nth(1) to compute the quotient, and outer-merge with df:
g = df.sort_values(['GROUP', 'TYPE']).groupby('GROUP')
s = (g.first()/ g.nth(1)).COUNT.reset_index()
df.merge(s, on = ['GROUP','COUNT'], how='outer').fillna(' ').sort_values('GROUP')
GROUP TYPE COUNT
0 A 1 5.000000
1 A 2 10.000000
6 A 0.500000
2 B 1 3.000000
3 B 2 9.000000
7 B 0.333333
4 C 1 20.000000
5 C 2 100.000000
8 C 0.200000
series outcome
1 T
1 F
1 T
2 T
2 F
3 T
4 F
4 T
5 F
I have a data frame looking something like this, and I am trying to compute the proportion of T outcomes for each series. However, I do not understand why I can't get the code below to work:
series = np.unique(series)
count = 0
pcorrect = np.zeros(len(nseries))
for s in nseries:
    if data.loc[data['series'] == s]:
        outcome_count = data['outcome'].value_counts()
        nstarted_trials = outcome_count['T'] + outcome_count['F']
        pcorrect[count] = outcome_count['T'] / nstarted_trials
        count += 1
I think you can use crosstab:
pd.crosstab(df.series,df.outcome,margins = True)
Out[698]:
outcome F T All
series
1 1 2 3
2 1 1 2
3 0 1 1
4 1 1 2
5 1 0 1
All 4 5 9
If you need percentages:
pd.crosstab(df.series,df.outcome,margins = True, normalize=True)
Out[699]:
outcome F T All
series
1 0.111111 0.222222 0.333333
2 0.111111 0.111111 0.222222
3 0.000000 0.111111 0.111111
4 0.111111 0.111111 0.222222
5 0.111111 0.000000 0.111111
All 0.444444 0.555556 1.000000
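If all you need is the proportion of T per series (the pcorrect values the question is trying to build), a minimal sketch, assuming the frame is named df as above:
# Mean of a boolean mask per series = proportion of 'T' outcomes
pcorrect = df['outcome'].eq('T').groupby(df['series']).mean()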
I have a pandas dataframe that contains dates, items, and 2 values. All I'm looking to do is output another column C that is column A divided by column B when column B is greater than 0, and 0 when column B equals 0.
date item A B C
1/1/2017 a 0 3 0
1/1/2017 b 2 0 0
1/1/2017 c 5 2 2.5
1/1/2017 d 4 1 4
1/1/2017 e 3 3 1
1/1/2017 f 0 4 0
1/2/2017 a 3 3 1
1/2/2017 b 2 2 1
1/2/2017 c 3 9 0.333333333
1/2/2017 d 4 0 0
1/2/2017 e 5 3 1.666666667
1/2/2017 f 3 0 0
This is the code I've written, but the kernel keeps dying (keep in mind this is just an example table; I have about 30,000 rows, so nothing too crazy):
df['C'] = df.loc[df['B'] > 0, 'A'] / df['B']
Any idea what's going on? Is something running infinitely that's causing it to crash? Thanks for the help.
You can get that using np.where:
df['C'] = np.round(np.where(df['B'] > 0, df['A']/df['B'], 0), 1)
Or if you want to use loc
df.loc[df['B'] > 0, 'C'] = df['A']/df['B']
and then fillna(0)
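Put together, a minimal sketch of that loc-then-fillna approach (assuming the frame is named df as in the question and C does not exist yet):
df.loc[df['B'] > 0, 'C'] = df['A'] / df['B']  # divide only where B > 0; other rows are left as NaN
df['C'] = df['C'].fillna(0)                   # then fill the remaining NaNs with 0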
Option 1
Use pd.Series.mask to hide the zeros, then fill the resulting NaN cells with fillna.
v = (df.A / df.B.mask(df.B == 0)).fillna(0)
v
0 0.000000
1 0.000000
2 2.500000
3 4.000000
4 1.000000
5 0.000000
6 1.000000
7 1.000000
8 0.333333
9 0.000000
10 1.666667
11 0.000000
dtype: float64
df['C'] = v
Alternatively, replace those zeros with np.inf, because x / inf = 0.
df['C'] = (df.A / df.B.mask(df.B == 0, np.inf))
Option 2
Direct replacement with df.replace
df.A / df.B.replace(0, np.inf)
0 0.000000
1 0.000000
2 2.500000
3 4.000000
4 1.000000
5 0.000000
6 1.000000
7 1.000000
8 0.333333
9 0.000000
10 1.666667
11 0.000000
dtype: float64
Keep in mind, you can do an astype conversion, if you want mixed integers and floats as your result:
df.A.div(df.B.replace(0, np.inf)).astype(object)
0 0
1 0
2 2.5
3 4
4 1
5 0
6 1
7 1
8 0.333333
9 0
10 1.66667
11 0
dtype: object
I have an adjacency matrix (dm) of items vs. items; the value between two items (e.g., item0 and item1) is the number of times those items appear together. How can I scale all the values in pandas to between 0 and 1?
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
However, I am not sure how to apply the scaler to the pandas dataframe.
You can assign the resulting array back to the dataframe with loc:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 5, (5, 5)))
df
Out[277]:
0 1 2 3 4
0 2 3 2 3 1
1 2 3 4 4 2
2 2 3 4 3 2
3 1 1 2 1 4
4 4 2 2 3 1
df.loc[:,:] = scaler.fit_transform(df)
df
Out[279]:
0 1 2 3 4
0 0.333333 1.0 0.0 0.666667 0.000000
1 0.333333 1.0 1.0 1.000000 0.333333
2 0.333333 1.0 1.0 0.666667 0.333333
3 0.000000 0.0 0.0 0.000000 1.000000
4 1.000000 0.5 0.0 0.666667 0.000000
You can do the same with (df - df.min()) / (df.max() - df.min()).
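For instance, a minimal pure-pandas sketch of that, assuming the frame is named df as above:
# Column-wise min-max scaling without sklearn; like MinMaxScaler, each column is scaled independently
df = (df - df.min()) / (df.max() - df.min())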