Python pandas dataframe restriction

series  outcome
1       T
1       F
1       T
2       T
2       F
3       T
4       F
4       T
5       F
I have a data frame that looks something like this, and I am trying to compute the proportion of T outcomes for each series. However, I do not understand why I cannot get the following to work:
nseries = np.unique(data['series'])
count = 0
pcorrect = np.zeros(len(nseries))
for s in nseries:
    if data.loc[data['series'] == s]:
        outcome_count = data['outcome'].value_counts()
        nstarted_trials = outcome_count['T'] + outcome_count['F']
        pcorrect[count] = outcome_count['T'] / nstarted_trials
        count += 1

I think you can use crosstab:
pd.crosstab(df.series,df.outcome,margins = True)
Out[698]:
outcome  F  T  All
series
1        1  2    3
2        1  1    2
3        0  1    1
4        1  1    2
5        1  0    1
All      4  5    9
If you need percentages:
pd.crosstab(df.series,df.outcome,margins = True, normalize=True)
Out[699]:
outcome         F         T       All
series
1        0.111111  0.222222  0.333333
2        0.111111  0.111111  0.222222
3        0.000000  0.111111  0.111111
4        0.111111  0.111111  0.222222
5        0.111111  0.000000  0.111111
All      0.444444  0.555556  1.000000
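If you specifically want the proportion of T within each series (what the loop above was trying to compute), a sketch of one way is to normalize by row instead; the column names are assumed to match the frame above:
pd.crosstab(df.series, df.outcome, normalize='index')['T']
normalize='index' divides each row of the crosstab by its row total, so the 'T' column is the fraction of T outcomes per series.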

Related

Divide element by sum of groupby in dask without setting index for every column

I have code implemented in pandas, but I am having trouble converting it to dask because I need to use set_index(). What is the best workaround? I am using dask because I need to scale this to much larger dataframes.
I am looking to return a dataframe where each element is divided by the column-wise sum of a group.
Here is an example dataframe:
df = [
    [1, 4, 2, 1],
    [4, 4, 0, -1],
    [2, 3, 1, 6],
    [-2, 1, 0, -1],
    [6, -3, -2, -1],
    [1, 0, 5, 5],
]
df = pd.DataFrame(df)
lab_id = ['a','b','a','b','a','c']
df['lab_id'] = lab_id
df
0 1 2 3 lab_id
0 1 4 2 1 a
1 4 4 0 -1 b
2 2 3 1 6 a
3 -2 1 0 -1 b
4 6 -3 -2 -1 a
5 1 0 5 5 c
Currently in pandas I do a groupby and sum to return a dataframe:
sum_df = df.groupby('lab_id').sum()
sum_df
0 1 2 3
lab_id
a 9 4 1 6
b 2 5 0 -2
c 1 0 5 5
And then I set the index of the original data frame and divide by the sum dataframe:
df.set_index('lab_id')/sum_df
0 1 2 3
lab_id
a 0.111111 1.00 2.0 0.166667
a 0.222222 0.75 1.0 1.000000
a 0.666667 -0.75 -2.0 -0.166667
b 2.000000 0.80 NaN 0.500000
b -1.000000 0.20 NaN 0.500000
c 1.000000 NaN 1.0 1.000000
The main problem is that I am having a huge issue setting the index in dask, whose documentation explicitly recommends avoiding the set_index() and reset_index() methods. I simply can't find a way around using them!
I have tried many arcane ways to set the index outside of dask, such as creating a new dataframe with the index already set and a row of dummy data, and then iteratively assigning the columns from the old dataframe (this is some of the worst code I've written).
Try with transform
df.loc[:,[0,1,2,3]] = df/df.groupby('lab_id').transform('sum')[[0,1,2,3]]
df
Out[767]:
0 1 2 3 lab_id
0 0.111111 1.00 2.0 0.166667 a
1 2.000000 0.80 NaN 0.500000 b
2 0.222222 0.75 1.0 1.000000 a
3 -1.000000 0.20 NaN 0.500000 b
4 0.666667 -0.75 -2.0 -0.166667 a
5 1.000000 NaN 1.0 1.000000 c
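For completeness, here is a self-contained sketch of the transform approach on the question's data (value_cols is just a hypothetical name for the numeric columns); the per-group sums are broadcast back to one row per original row, so no set_index is needed:
import pandas as pd

df = pd.DataFrame([[1, 4, 2, 1],
                   [4, 4, 0, -1],
                   [2, 3, 1, 6],
                   [-2, 1, 0, -1],
                   [6, -3, -2, -1],
                   [1, 0, 5, 5]])
df['lab_id'] = ['a', 'b', 'a', 'b', 'a', 'c']

value_cols = [0, 1, 2, 3]
# per-group column sums, aligned with the original rows
group_sums = df.groupby('lab_id')[value_cols].transform('sum')
df[value_cols] = df[value_cols] / group_sums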

Pandas running division for a column

I'm new to Pandas and would love some help. I'm trying to take:
factor
1
1
2
1
1
3
1
2
and produce:
factor running_div
1 1
1 1
2 0.5
1 0.5
1 0.5
3 0.1666667
1 0.1666667
2 0.0833333
I can do it by looping through with .iloc, but I am trying to use vectorized math for efficiency. I have looked at rolling windows and at using .shift(1), but can't get it working. I would appreciate any guidance anyone could provide.
Use numpy's ufunc.accumulate (casting to float first, since np.divide has no integer loop):
df['cum_div'] = np.divide.accumulate(df.factor.to_numpy(dtype=float))
factor cum_div
0 1 1.000000
1 1 1.000000
2 2 0.500000
3 1 0.500000
4 1 0.500000
5 3 0.166667
6 1 0.166667
7 2 0.083333
You can try this:
import pandas as pd
df=pd.DataFrame([1,1,2,1,1,3,1,2], columns=["factor"])
df["running_div"]=df["factor"].iloc[0]
df["running_div"].loc[df.index[1:]]=1/df["factor"].loc[df.index[1:]]
df["running_div"]=df["running_div"].cumprod()
print(df)
Output:
factor running_div
0 1 1.000000
1 1 1.000000
2 2 0.500000
3 1 0.500000
4 1 0.500000
5 3 0.166667
6 1 0.166667
7 2 0.083333
[Program finished]
A cumulative division is done by keeping the first element and then cumulatively multiplying by the inverse of each following element.
Hence, using np.cumprod:
df['division'] = np.cumprod([df.factor.iloc[0], *1/df.factor.iloc[1:]])
factor division
0 1 1.000000
1 1 1.000000
2 2 0.500000
3 1 0.500000
4 1 0.500000
5 3 0.166667
6 1 0.166667
7 2 0.083333

pandas- new calculated row for each unique string/group in a column

I have a dataframe df like:
GROUP TYPE COUNT
A 1 5
A 2 10
B 1 3
B 2 9
C 1 20
C 2 100
I would like to add a row for each group such that the new row contains the quotient of COUNT where TYPE equals 1 divided by COUNT where TYPE equals 2 for that GROUP, like so:
GROUP TYPE COUNT
A 1 5
A 2 10
A .5
B 1 3
B 2 9
B .33
C 1 20
C 2 100
C .2
Thanks in advance.
df2 = df.pivot(index='GROUP', columns='TYPE', values='COUNT')
df2['div'] = df2[1]/df2[2]
df2.reset_index().melt('GROUP').sort_values('GROUP')
Output:
GROUP TYPE value
0 A 1 5.000000
3 A 2 10.000000
6 A div 0.500000
1 B 1 3.000000
4 B 2 9.000000
7 B div 0.333333
2 C 1 20.000000
5 C 2 100.000000
8 C div 0.200000
My approach would be to reshape the dataframe by pivoting, so every type has its own column. Then the division is very easy, and then by melting you reshape it back to the original shape. In my opinion this is also a very readable solution.
Of course, if you prefer np.nan to 'div' as the TYPE, you can replace it very easily, but I'm not sure whether that's what you want.
s = df[df.TYPE.isin([1, 2])].sort_values(['GROUP', 'TYPE']).groupby('GROUP').COUNT.apply(lambda x: x.iloc[0] / x.iloc[1])
# sort and filter the original df first, so rows are ordered and only TYPE 1 and 2 remain
pd.concat([df, s.reset_index()]).sort_values('GROUP')
# concat the result back
Out[77]:
COUNT GROUP TYPE
0 5.000000 A 1.0
1 10.000000 A 2.0
0 0.500000 A NaN
2 3.000000 B 1.0
3 9.000000 B 2.0
1 0.333333 B NaN
4 20.000000 C 1.0
5 100.000000 C 2.0
2 0.200000 C NaN
You can do:
import numpy as np
import pandas as pd
def add_quotient(x):
    last_row = x.iloc[-1]
    last_row['COUNT'] = x[x.TYPE == 1].COUNT.min() / x[x.TYPE == 2].COUNT.max()
    last_row['TYPE'] = np.nan
    return x.append(last_row)

print(df.groupby('GROUP').apply(add_quotient))
Output
GROUP TYPE COUNT
GROUP
A 0 A 1.0 5.000000
1 A 2.0 10.000000
1 A NaN 0.500000
B 2 B 1.0 3.000000
3 B 2.0 9.000000
3 B NaN 0.333333
C 4 C 1.0 20.000000
5 C 2.0 100.000000
5 C NaN 0.200000
Note that the function selects the min of the TYPE == 1 rows and the max of the TYPE == 2 rows, in case there is more than one value per group. The TYPE of the new row is set to np.nan, but that can easily be changed.
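DataFrame.append was removed in pandas 2.0, so on a recent pandas the same idea can be written with pd.concat instead; this is a sketch reusing the hypothetical add_quotient name:
import numpy as np
import pandas as pd

def add_quotient(x):
    # quotient of the TYPE == 1 count over the TYPE == 2 count for this group
    quot = x.loc[x.TYPE == 1, 'COUNT'].min() / x.loc[x.TYPE == 2, 'COUNT'].max()
    extra = pd.DataFrame({'GROUP': [x['GROUP'].iloc[0]], 'TYPE': [np.nan], 'COUNT': [quot]})
    return pd.concat([x, extra], ignore_index=True)

print(df.groupby('GROUP', group_keys=False).apply(add_quotient))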
Here's a way: first use sort_values with by=['GROUP', 'TYPE'] to ensure TYPE 1 comes before TYPE 2 within each group, then group by GROUP.
Then use first and nth(1) to compute the quotient, and outer-merge the result with df:
g = df.sort_values(['GROUP', 'TYPE']).groupby('GROUP')
s = (g.first()/ g.nth(1)).COUNT.reset_index()
df.merge(s, on = ['GROUP','COUNT'], how='outer').fillna(' ').sort_values('GROUP')
GROUP TYPE COUNT
0 A 1 5.000000
1 A 2 10.000000
6 A 0.500000
2 B 1 3.000000
3 B 2 9.000000
7 B 0.333333
4 C 1 20.000000
5 C 2 100.000000
8 C 0.200000

Pandas- Dividing a column by another column conditional on if values are greater than 0?

I have a pandas dataframe that contains dates, items, and 2 values. All I'm looking to do is output another column that is column A divided by column B when column B is greater than 0, and 0 when column B equals 0.
date item A B C
1/1/2017 a 0 3 0
1/1/2017 b 2 0 0
1/1/2017 c 5 2 2.5
1/1/2017 d 4 1 4
1/1/2017 e 3 3 1
1/1/2017 f 0 4 0
1/2/2017 a 3 3 1
1/2/2017 b 2 2 1
1/2/2017 c 3 9 0.333333333
1/2/2017 d 4 0 0
1/2/2017 e 5 3 1.666666667
1/2/2017 f 3 0 0
This is the code I've written, but the kernel keeps dying (keep in mind this is just an example table; I have about 30,000 rows, so nothing too crazy):
df['C'] = df.loc[df['B'] > 0, 'A'] / df['B']
Any idea what's going on? Is something running infinitely that's causing it to crash? Thanks for the help.
You can get that using np.where:
df['C'] = np.round(np.where(df['B'] > 0, df['A']/df['B'], 0), 1)
Or if you want to use loc
df.loc[df['B'] > 0, 'C'] = df['A']/df['B']
and then fillna(0)
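Putting the loc variant together (a small sketch; the fillna(0) covers the rows where B is 0, which the masked assignment leaves as NaN):
df.loc[df['B'] > 0, 'C'] = df['A'] / df['B']
df['C'] = df['C'].fillna(0)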
Option 1
Use pd.Series.mask to hide the zeros, and then fill the resulting NaN cells with fillna.
v = (df.A / df.B.mask(df.B == 0)).fillna(0)
v
0 0.000000
1 0.000000
2 2.500000
3 4.000000
4 1.000000
5 0.000000
6 1.000000
7 1.000000
8 0.333333
9 0.000000
10 1.666667
11 0.000000
dtype: float64
df['C'] = v
Alternatively, replace those zeros with np.inf, because x / inf = 0.
df['C'] = (df.A / df.B.mask(df.B == 0, np.inf))
Option 2
Direct replacement with df.replace
df.A / df.B.replace(0, np.inf)
0 0.000000
1 0.000000
2 2.500000
3 4.000000
4 1.000000
5 0.000000
6 1.000000
7 1.000000
8 0.333333
9 0.000000
10 1.666667
11 0.000000
dtype: float64
Keep in mind, you can do an astype conversion, if you want mixed integers and floats as your result:
df.A.div(df.B.replace(0, np.inf)).astype(object)
0 0
1 0
2 2.5
3 4
4 1
5 0
6 1
7 1
8 0.333333
9 0
10 1.66667
11 0
dtype: object

How to take the log of only non-zero values in a dataframe and replace 0's with NA's?

How do I take the log of the non-zero values in a dataframe and replace the 0's with NA's?
I have a dataframe like the one below:
time y1 y2
0 2017-08-06 00:52:00 0 10
1 2017-08-06 00:52:10 1 20
2 2017-08-06 00:52:20 2 0
3 2017-08-06 00:52:30 3 0
4 2017-08-06 00:52:40 0 5
5 2017-08-06 00:52:50 4 6
6 2017-08-06 00:53:00 6 11
7 2017-08-06 00:53:10 7 12
8 2017-08-06 00:53:20 8 0
9 2017-08-06 00:53:30 0 13
I want to take the log of all columns except the first column, time. The log should be calculated only for the non-zero values, and the zeros should be replaced with NA's. How do I do this?
So I tried to do something like this:
cols = df.columns.difference(['time'])
# Replacing 0's with NA's:
df[cols] = df[cols].mask(np.isclose(df[cols].values, 0), np.nan)
df[cols] = np.log(df[cols])  # but this will also try to take the log of the NA's
Please help.
The output should be a dataframe with the same time column, all zeros replaced with NA's, and the log of the remaining values in every column except the first.
If I understand correctly, you can just replace the zeros with np.nan and then call np.log directly - it ignores NaN values just fine.
np.log(df[['y1', 'y2']].replace(0, np.nan))
Example
>>> df = pd.DataFrame({'time': pd.date_range('20170101', '20170110'),
...                    'y1': np.random.randint(0, 3, 10),
...                    'y2': np.random.randint(0, 3, 10)})
>>> df
time y1 y2
0 2017-01-01 1 2
1 2017-01-02 0 1
2 2017-01-03 2 0
3 2017-01-04 0 1
4 2017-01-05 1 0
5 2017-01-06 1 1
6 2017-01-07 2 0
7 2017-01-08 1 0
8 2017-01-09 0 1
9 2017-01-10 2 1
>>> df[['log_y1', 'log_y2']] = np.log(df[['y1', 'y2']].replace(0, np.nan))
>>> df
time y1 y2 log_y1 log_y2
0 2017-01-01 1 2 0.000000 0.693147
1 2017-01-02 0 1 NaN 0.000000
2 2017-01-03 2 0 0.693147 NaN
3 2017-01-04 0 1 NaN 0.000000
4 2017-01-05 1 0 0.000000 NaN
5 2017-01-06 1 1 0.000000 0.000000
6 2017-01-07 2 0 0.693147 NaN
7 2017-01-08 1 0 0.000000 NaN
8 2017-01-09 0 1 NaN 0.000000
9 2017-01-10 2 1 0.693147 0.000000
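Applied to the frame in the question, the same idea works column-wise, reusing the cols selection from the question (a sketch assuming df holds the original data):
cols = df.columns.difference(['time'])
df[cols] = np.log(df[cols].replace(0, np.nan))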
