I have an adjacency matrix (dm) of items vs. items; the value for a pair of items (e.g., item0, item1) is the number of times those items appear together. How can I scale all the values in pandas to between 0 and 1?
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
However, I am not sure how to apply the scaler to the pandas DataFrame.
You can assign the resulting array back to the dataframe with loc:
df = pd.DataFrame(np.random.randint(1, 5, (5, 5)))
df
Out[277]:
0 1 2 3 4
0 2 3 2 3 1
1 2 3 4 4 2
2 2 3 4 3 2
3 1 1 2 1 4
4 4 2 2 3 1
df.loc[:,:] = scaler.fit_transform(df)
df
Out[279]:
0 1 2 3 4
0 0.333333 1.0 0.0 0.666667 0.000000
1 0.333333 1.0 1.0 1.000000 0.333333
2 0.333333 1.0 1.0 0.666667 0.333333
3 0.000000 0.0 0.0 0.000000 1.000000
4 1.000000 0.5 0.0 0.666667 0.000000
You can do the same with (df - df.min()) / (df.max() - df.min()).
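For reference, here is a minimal self-contained sketch of that pure-pandas alternative (the name dm is just a stand-in for your co-occurrence matrix):
import numpy as np
import pandas as pd
dm = pd.DataFrame(np.random.randint(1, 5, (5, 5)))  # stand-in for the item-vs-item matrix
dm_scaled = (dm - dm.min()) / (dm.max() - dm.min())  # column-wise min-max scaling to [0, 1]
Note that both variants scale each column independently; a column whose values are all equal would give a zero denominator (NaN) with the pandas formula.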
The objective is to multiply each column in a Pandas DataFrame by a constant value, where each column has its own constant.
For example, the columns 'a_b_c', 'dd_ee', 'ff_ff', 'abc', 'devb' are multiplied by the constants 15, 20, 15, 15, 20, respectively.
The constant values and their associated columns are stored in a dict const_val:
const_val = dict(a_b_c=15,
                 dd_ee=20,
                 ff_ff=15,
                 abc=15,
                 devb=20)
Currently, I am using a for-loop to multiply each column by its associated constant value, as shown in the code below:
for dpair in const_val:
    df[('per_a', dpair)] = df[dpair] * const_val[dpair] / reval
However, I wonder whether there is a more elegant way of doing this.
The full code is provided below:
import pandas as pd
import numpy as np
np.random.seed(0)
const_val = dict(a_b_c=15,
                 dd_ee=20,
                 ff_ff=15,
                 abc=15,
                 devb=20)
df = pd.DataFrame(data=np.random.randint(5, size=(3, 6)),
columns=['id','a_b_c','dd_ee','ff_ff','abc','devb'])
reval=6
for dpair in const_val:
    df[('per_a', dpair)] = df[dpair] * const_val[dpair] / reval
The expected output is shown below:
id a_b_c dd_ee ... (per_a, ff_ff) (per_a, abc) (per_a, devb)
0 4 0 3 ... 7.5 7.5 3.333333
1 3 2 4 ... 0.0 0.0 13.333333
2 2 1 0 ... 2.5 2.5 0.000000
Please note that (per_a, ff_ff), (per_a, abc) and (per_a, devb) are MultiIndex columns. Their representation might look different in your environment.
p.s., I am using IntelliJ IDEA
If you only have numbers in your DataFrame:
out = df.mul(pd.Series(const_val).reindex(df.columns, fill_value=1), axis=1)
If you have a mix of numeric and non-numeric columns:
out = df.select_dtypes('number').mul(pd.Series(const_val), axis=1).combine_first(df)
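A hedged note on this second variant: combine_first aligns on the union of the column labels, so the columns of out may come back reordered; if the original order matters, re-select it afterwards:
out = out[df.columns]  # restore the original column order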
Update:
out = df.join(df[list(const_val)].mul(pd.Series(const_val), axis=1)
.div(reval).add_prefix('per_a_'))
Output
id a_b_c dd_ee ff_ff abc devb per_a_a_b_c per_a_dd_ee per_a_ff_ff per_a_abc per_a_devb
0 1 4 3 0 3 0 10.0 10.000000 0.0 7.5 0.0
1 2 3 0 1 3 3 7.5 0.000000 2.5 7.5 10.0
2 3 0 1 1 1 0 0.0 3.333333 2.5 2.5 0.0
Update for multiindex/tuple column headers:
cols = pd.Index(const_val.keys())
mi = pd.MultiIndex.from_product([['per_a'], cols])
df[mi] = df[cols] * pd.Series(const_val) / reval
print(df)
Output:
id a_b_c dd_ee ff_ff abc devb (per_a, a_b_c) (per_a, dd_ee) (per_a, ff_ff) (per_a, abc) (per_a, devb)
0 4 0 3 3 3 1 0.0 10.000000 7.5 7.5 3.333333
1 3 2 4 0 0 4 5.0 13.333333 0.0 0.0 13.333333
2 2 1 0 1 1 0 2.5 0.000000 2.5 2.5 0.000000
Try this, using pandas' intrinsic data alignment tenets to align the data via indexing:
cols = pd.Index(const_val.keys())
df[cols + '_per_a'] = df[cols] * pd.Series(const_val) / reval
Output:
id a_b_c dd_ee ff_ff abc devb a_b_c_per_a dd_ee_per_a ff_ff_per_a abc_per_a devb_per_a
0 4 0 3 3 3 1 0.0 10.000000 7.5 7.5 3.333333
1 3 2 4 0 0 4 5.0 13.333333 0.0 0.0 13.333333
2 2 1 0 1 1 0 2.5 0.000000 2.5 2.5 0.000000
df
id a_b_c dd_ee ff_ff abc devb
0 4 0 3 3 3 1
1 3 2 4 0 0 4
2 2 1 0 1 1 0
Convert const_val to a Series:
s = pd.Series(const_val)
s
a_b_c 15
dd_ee 20
ff_ff 15
abc 15
devb 20
dtype: int64
Use broadcasting:
out = df[['id']].join(df[df.columns[1:]].mul(s))
out
id a_b_c dd_ee ff_ff abc devb
0 4 0 60 45 45 20
1 3 30 80 0 0 80
2 2 15 0 15 15 0
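If you also need the division by reval and the per_a prefix from the question's expected output, the same Series can be reused; a sketch building on the answer above (reval is taken from the question, and this is not part of the original answer):
out = df.join(df[df.columns[1:]].mul(s).div(reval).add_prefix('per_a_'))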
I have code implemented in pandas, but I am having trouble converting it to Dask because I need to use set_index(). What is the best workaround? I am using Dask because I need to scale this to much larger dataframes.
I am looking to return a dataframe where each element is divided by the column-wise sum of a group.
An example dataframe looks like this:
df = [
[1,4,2,1],
[4,4,0,-1],
[2,3,1,6],
[-2,1,0,-1],
[6,-3,-2,-1],
[1,0,5,5],
]
df = pd.DataFrame(df)
lab_id = ['a','b','a','b','a','c']
df['lab_id'] = lab_id
df
0 1 2 3 lab_id
0 1 4 2 1 a
1 4 4 0 -1 b
2 2 3 1 6 a
3 -2 1 0 -1 b
4 6 -3 -2 -1 a
5 1 0 5 5 c
Currently, in pandas, I do a groupby-sum to return a dataframe:
sum_df = df.groupby('lab_id').sum()
sum_df
0 1 2 3
lab_id
a 9 4 1 6
b 2 5 0 -2
c 1 0 5 5
And then I set the index of the original data frame and divide by the sum dataframe:
df.set_index('lab_id')/sum_df
0 1 2 3
lab_id
a 0.111111 1.00 2.0 0.166667
a 0.222222 0.75 1.0 1.000000
a 0.666667 -0.75 -2.0 -0.166667
b 2.000000 0.80 NaN 0.500000
b -1.000000 0.20 NaN 0.500000
c 1.000000 NaN 1.0 1.000000
The main problem is that I am having a huge issue setting the index in Dask, which explicitly advises avoiding the set_index() and reset_index() methods. I simply can't find a way around using them!
I have tried many arcane ways to set the index outside of Dask, such as creating a new dataframe with the index already set and a row of dummy data, then iteratively assigning the columns from the old dataframe (this is some of the worst code I've written).
Try with transform
df.loc[:,[0,1,2,3]] = df/df.groupby('lab_id').transform('sum')[[0,1,2,3]]
df
Out[767]:
0 1 2 3 lab_id
0 0.111111 1.00 2.0 0.166667 a
1 2.000000 0.80 NaN 0.500000 b
2 0.222222 0.75 1.0 1.000000 a
3 -1.000000 0.20 NaN 0.500000 b
4 0.666667 -0.75 -2.0 -0.166667 a
5 1.000000 NaN 1.0 1.000000 c
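If transform is not convenient in your Dask version, another way to avoid set_index entirely is to compute the group sums and merge them back on lab_id, then divide. Here is a sketch in plain pandas, starting from the original df from the question (groupby-sum, merge, and elementwise division are also available in dask.dataframe, though I have not verified identical behaviour there):
sum_df = df.groupby('lab_id')[[0, 1, 2, 3]].sum().add_suffix('_sum').reset_index()
merged = df.merge(sum_df, on='lab_id')  # row-aligned group sums, no set_index needed
for c in [0, 1, 2, 3]:
    merged[c] = merged[c] / merged[f'{c}_sum']
out = merged[[0, 1, 2, 3, 'lab_id']]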
I'm new to Pandas and would love some help. I'm trying to take:
factor
1
1
2
1
1
3
1
2
and produce:
factor running_div
1 1
1 1
2 0.5
1 0.5
1 0.5
3 0.1666667
1 0.1666667
2 0.0833333
I can do it by looping through with .iloc, but I'm trying to use vectorized math for efficiency. I have looked at rolling windows and using .shift(1), but can't get it working. I would appreciate any guidance anyone could provide.
Use NumPy's ufunc.accumulate:
df['cum_div'] = np.divide.accumulate(df.factor.to_numpy())
factor cum_div
0 1 1.000000
1 1 1.000000
2 2 0.500000
3 1 0.500000
4 1 0.500000
5 3 0.166667
6 1 0.166667
7 2 0.083333
You can try this:
import pandas as pd
df=pd.DataFrame([1,1,2,1,1,3,1,2], columns=["factor"])
df["running_div"]=df["factor"].iloc[0]
df["running_div"].loc[df.index[1:]]=1/df["factor"].loc[df.index[1:]]
df["running_div"]=df["running_div"].cumprod()
print(df)
Output:
factor running_div
0 1 1.000000
1 1 1.000000
2 2 0.500000
3 1 0.500000
4 1 0.500000
5 3 0.166667
6 1 0.166667
7 2 0.083333
A cumulative division is done by keeping the first element and then cumulatively multiplying by the inverse of each subsequent element.
Hence, using np.cumprod:
df['division'] = np.cumprod([df.factor.iloc[0], *1/df.factor.iloc[1:]])
factor division
0 1 1.000000
1 1 1.000000
2 2 0.500000
3 1 0.500000
4 1 0.500000
5 3 0.166667
6 1 0.166667
7 2 0.083333
If I have a dataframe:
A B C
0.0285714285714285 4 0.11428571
0.107142857142857 4 0.42857143
0.007142857142857 6 0.04285714
1.2 4 5.5
1.5 3 3
The desired output is:
A*B C Difference
0.114285714285714 0.11428571 0.000000004285714
0.428571428571428 0.42857143 -0.000000001428572
0.042857142857142 0.04285714 0.000000002857142
4.8 5.5 -0.7
4.5 3 1.5
Count: 2
I want to ignore rows like the first 3, because the difference is very small; only the first digit after the decimal point should be considered.
Could you please help me with this?
EDIT:
Because values in column A are objects (obviously strings):
df['A'] = df['A'].astype(float)
If that does not work because of bad values (e.g. some strings), the bad values are replaced by NaNs:
df['A'] = pd.to_numeric(df['A'], errors='coerce')
Use Series.mask to set the new column by a condition with Series.between:
# multiply columns A and B
df['A*B'] = df["A"] * df["B"]
# subtract to get a Series of differences
diff = df['A*B'] - df['C']
# create a mask for small differences
mask = diff.between(-0.1, 0.1)
df["difference"] = diff.mask(mask, 0)
print (df)
A B C A*B difference
0 0.028571 4 0.114286 0.114286 0.0
1 0.107143 4 0.428571 0.428571 0.0
2 0.007143 6 0.042857 0.042857 0.0
3 1.200000 4 5.500000 4.800000 -0.7
4 1.500000 3 3.000000 4.500000 1.5
print (f'Count: {(~mask).sum()}')
Count: 2
If order is important, add DataFrame.insert with DataFrame.pop to extract the columns:
df.insert(0, 'A*B', df.pop("A")*df.pop("B"))
diff = df['A*B'] - df['C']
mask = diff.between(-0.1, 0.1)
df["difference"] = diff.mask(mask, 0)
print (df)
A*B C difference
0 0.114286 0.114286 0.0
1 0.428571 0.428571 0.0
2 0.042857 0.042857 0.0
3 4.800000 5.500000 -0.7
4 4.500000 3.000000 1.5
print (f'Count: {(~mask).sum()}')
Count: 2
Using np.where to check whether the result is significant enough:
df["difference"] = np.where((df["A"]*df["B"]-df["C"]>=0.1)|(df["A"]*df["B"]-df["C"]<=-0.1),df["A"]*df["B"]-df["C"],0)
print (df)
#
A B C difference
0 0.028571 4 0.114286 0.0
1 0.107143 4 0.428571 0.0
2 0.007143 6 0.042857 0.0
3 1.200000 4 5.500000 -0.7
4 1.500000 3 3.000000 1.5
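The double comparison can also be written with a single absolute-value check; this is an equivalent rewrite of the same condition, not a different method:
import numpy as np
diff = df["A"] * df["B"] - df["C"]
df["difference"] = np.where(diff.abs() >= 0.1, diff, 0)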
I have a dataframe df like:
GROUP TYPE COUNT
A 1 5
A 2 10
B 1 3
B 2 9
C 1 20
C 2 100
I would like to add a row for each group such that the new row contains the quotient of COUNT where TYPE equals 1 divided by COUNT where TYPE equals 2 for that GROUP, like so:
GROUP TYPE COUNT
A 1 5
A 2 10
A .5
B 1 3
B 2 9
B .33
C 1 20
C 2 100
C .2
Thanks in advance.
df2 = df.pivot(index='GROUP', columns='TYPE', values='COUNT')
df2['div'] = df2[1]/df2[2]
df2.reset_index().melt('GROUP').sort_values('GROUP')
Output:
GROUP TYPE value
0 A 1 5.000000
3 A 2 10.000000
6 A div 0.500000
1 B 1 3.000000
4 B 2 9.000000
7 B div 0.333333
2 C 1 20.000000
5 C 2 100.000000
8 C div 0.200000
My approach would be to reshape the dataframe by pivoting so that every type has its own column. Then the division is very easy, and by melting you reshape it back to the original shape. In my opinion this is also a very readable solution.
Of course, if you prefer np.nan to div as a type, you can replace it very easily, but I'm not sure if that's what you want.
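If you do want NaN instead of the string 'div' in TYPE, a small follow-up along these lines should work (a sketch; res is just a name for the melted result above):
import numpy as np
res = df2.reset_index().melt('GROUP').sort_values('GROUP')
res['TYPE'] = res['TYPE'].replace('div', np.nan)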
s = df[df.TYPE.isin([1, 2])].sort_values(['GROUP', 'TYPE']).groupby('GROUP').COUNT.apply(lambda x: x.iloc[0] / x.iloc[1])
# sort and filter the original df so the rows are ordered and only contain TYPE 1 and 2
pd.concat([df, s.reset_index()]).sort_values('GROUP')
# concat the result back
Out[77]:
COUNT GROUP TYPE
0 5.000000 A 1.0
1 10.000000 A 2.0
0 0.500000 A NaN
2 3.000000 B 1.0
3 9.000000 B 2.0
1 0.333333 B NaN
4 20.000000 C 1.0
5 100.000000 C 2.0
2 0.200000 C NaN
You can do:
import numpy as np
import pandas as pd
def add_quotient(x):
    last_row = x.iloc[-1]
    last_row['COUNT'] = x[x.TYPE == 1].COUNT.min() / x[x.TYPE == 2].COUNT.max()
    last_row['TYPE'] = np.nan
    return x.append(last_row)
print(df.groupby('GROUP').apply(add_quotient))
Output
GROUP TYPE COUNT
GROUP
A 0 A 1.0 5.000000
1 A 2.0 10.000000
1 A NaN 0.500000
B 2 B 1.0 3.000000
3 B 2.0 9.000000
3 B NaN 0.333333
C 4 C 1.0 20.000000
5 C 2.0 100.000000
5 C NaN 0.200000
Note that the function selects the min of the TYPE == 1 rows and the max of the TYPE == 2 rows, in case there is more than one value per group. The TYPE is set to np.nan, but that can easily be changed.
Here's a way: first use sort_values by ['GROUP', 'TYPE'] to ensure that TYPE 1 comes before TYPE 2 within each group, then group by GROUP.
Then use first and nth(1) (the first and second rows of each group) to compute the quotient, and outer-merge with df:
g = df.sort_values(['GROUP', 'TYPE']).groupby('GROUP')
s = (g.first()/ g.nth(1)).COUNT.reset_index()
df.merge(s, on = ['GROUP','COUNT'], how='outer').fillna(' ').sort_values('GROUP')
GROUP TYPE COUNT
0 A 1 5.000000
1 A 2 10.000000
6 A 0.500000
2 B 1 3.000000
3 B 2 9.000000
7 B 0.333333
4 C 1 20.000000
5 C 2 100.000000
8 C 0.200000