I have an input file foo.txt with the following lines
A 1 2 3
B 4 5 6
C 7 8 9
I have written the following lines
import numpy as np
import pandas as pd
file="foo.txt"
source_array=pd.read_csv(file, sep=" ", header=None)
name_array=source_array.iloc[:,0].to_numpy()
number_array=source_array.iloc[:,1:4].to_numpy()
r1=np.array([[1,0,0],[0,1,0],[0,0,1]])
r2=np.array([[0.5,-0.30902,-0.80902],[0.30902,-0.80902,0.5],[-0.80902,-0.5,-0.30902]])
r3=np.array([[0.5,0.30902,-0.80902],[-0.30902,-0.80902,-0.5],[-0.80902,0.5,-0.30902]])
mult_array=np.array([r1,r2,r3])
out_array=np.empty((0,3))
for i in range(number_array.shape[0]):
    lad=number_array[i,0:3]
    lad=lad.reshape(1,3)
    print(lad)
    for j in range(mult_array.shape[0]):
        operated_array=np.dot(lad,mult_array[j])
        out_array=np.append(out_array,operated_array,axis=0)
        #print(operated_array)
np.savetxt('foo2.txt',out_array, fmt='%.2f')
After performing the dot multiplication I get the following output
1.00 2.00 3.00
-1.31 -3.43 -0.74
-2.55 0.19 -2.74
4.00 5.00 6.00
-1.31 -8.28 -2.59
-4.40 0.19 -7.59
7.00 8.00 9.00
-1.31 -13.14 -4.44
-6.25 0.19 -12.44
But the expected output in foo2.txt is
A 1.00 2.00 3.00
A -1.31 -3.43 -0.74
A -2.55 0.19 -2.74
B 4.00 5.00 6.00
B -1.31 -8.28 -2.59
B -4.40 0.19 -7.59
C 7.00 8.00 9.00
C -1.31 -13.14 -4.44
C -6.25 0.19 -12.44
How can I repeat the row name once for every dot multiplication I perform on that row?
For clarity, the output of print(df) for the input is
df
0 1 2 3
0 A 1 2 3
1 B 4 5 6
2 C 7 8 9
We do not need a for loop; explode can do this (here df uses the row names as its index, see "Data input" below):
df['new']=np.dot(df,mult_array).tolist()
s=df.new.explode()
output=pd.DataFrame(s.tolist(),index=s.index).round(2)
Out[30]:
0 1 2
A 1.00 2.00 3.00
A -1.31 -3.43 -0.74
A -2.55 0.19 -2.74
B 4.00 5.00 6.00
B -1.31 -8.28 -2.59
B -4.40 0.19 -7.59
C 7.00 8.00 9.00
C -1.31 -13.14 -4.44
C -6.25 0.19 -12.44
Data input
df
0 1 2
A 1 2 3
B 4 5 6
C 7 8 9
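For writing the names next to the numbers, here is a minimal sketch of a variant that uses np.repeat for the labels. It assumes the names and numbers are already split into two arrays, as name_array and number_array are in the question; the variable names names, numbers, products and out are made up for the illustration, and np.einsum is swapped in for the nested np.dot calls.

```python
import numpy as np
import pandas as pd

# Stand-ins for name_array and number_array from the question
names = np.array(["A", "B", "C"])
numbers = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)

r1 = np.eye(3)
r2 = np.array([[0.5, -0.30902, -0.80902],
               [0.30902, -0.80902, 0.5],
               [-0.80902, -0.5, -0.30902]])
r3 = np.array([[0.5, 0.30902, -0.80902],
               [-0.30902, -0.80902, -0.5],
               [-0.80902, 0.5, -0.30902]])
mult_array = np.array([r1, r2, r3])

# products[i, j] holds numbers[i] @ mult_array[j]
products = np.einsum("ik,jkl->ijl", numbers, mult_array)

# Repeat each name once per matrix so the labels line up with the rows
out = pd.DataFrame(
    products.reshape(-1, 3),
    index=np.repeat(names, len(mult_array)),
)

# With no path, to_csv returns the text; pass 'foo2.txt' to write the file
text = out.to_csv(sep=" ", header=False, float_format="%.2f")
print(text)
```

Passing a file name as the first argument of to_csv should write the same text to disk.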
I have data like below with nan index and nan cells
id no name percentage score result
0 0 0.30 pencils 0.84 0.974185 1
1 1 0.18 computer 1.14 1.0 1
2 2 0.27 laptop 1.32 1.0 1
nan 0 0.18 apples 1.59 0.999655 4
4 1 0.84 vegetables 1.770008 0.99992 4
id no name percentage score result
0 0 nan chicken 0.84 0.974185 1
1 1 0.18 fish 1.14 . 1
2 2 0.27 meat 1.32 1.0 1
I want to keep the original index and drop every row that has a NaN index, a NaN cell, or a special character such as ".", and also drop the repeated header row, as below
id no name percentage score result
0 0 0.30 pencils 0.84 0.974185 1
1 1 0.18 computer 1.14 1.0 1
2 2 0.27 laptop 1.32 1.0 1
4 1 0.84 vegetables 1.770008 0.99992 4
2 2 0.27 meat 1.32 1.0 1
I tried, but I cannot keep the original index.
You can do this:
In [394]: df
Out[394]:
id no name percentage score result
0 0 0.30 pencils 0.84 0.974185 1.0
1 1 0.18 computer 1.14 1.0 1.0
2 2 0.27 laptop 1.32 1.0 1.0
NaN 0 0.18 apples 1.59 0.999655 4.0
4 1 0.84 vegetables 1.770008 0.99992 4.0
id no name percentage score result NaN
0 0 NaN chicken 0.84 0.974185 1.0
1 1 0.18 fish 1.14 . 1.0
2 2 0.27 meat 1.32 1.0 1.0
In [393]: df[~df.eq('.').any(axis=1) & ~df.index.isin(df.columns) & df.index.notna()].dropna()
Out[393]:
id no name percentage score result
0 0 0.30 pencils 0.84 0.974185 1.0
1 1 0.18 computer 1.14 1.0 1.0
2 2 0.27 laptop 1.32 1.0 1.0
4 1 0.84 vegetables 1.770008 0.99992 4.0
2 2 0.27 meat 1.32 1.0 1.0
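The filter above can be tried on a small frame. The following is only a sketch: the way the frame is built (a mixed-type index that contains NaN and the stray header label "id") is an assumption about how the data was loaded, and the names has_dot, is_header, has_index and clean are illustrative.

```python
import numpy as np
import pandas as pd

columns = ["id", "no", "name", "percentage", "score", "result"]
rows = [
    [0, 0.30, "pencils", 0.84, "0.974185", 1.0],
    [1, 0.18, "computer", 1.14, "1.0", 1.0],
    [2, 0.27, "laptop", 1.32, "1.0", 1.0],
    [0, 0.18, "apples", 1.59, "0.999655", 4.0],
    [1, 0.84, "vegetables", 1.770008, "0.99992", 4.0],
    ["no", "name", "percentage", "score", "result", np.nan],  # repeated header
    [0, np.nan, "chicken", 0.84, "0.974185", 1.0],
    [1, 0.18, "fish", 1.14, ".", 1.0],
    [2, 0.27, "meat", 1.32, "1.0", 1.0],
]
index = [0, 1, 2, np.nan, 4, "id", 0, 1, 2]
df = pd.DataFrame(rows, columns=columns, index=index)

has_dot = df.eq(".").any(axis=1).to_numpy()  # rows containing a lone "."
is_header = df.index.isin(df.columns)        # repeated header row
has_index = df.index.notna()                 # rows whose index label is not NaN

# Boolean masks keep the surviving rows' original index labels;
# dropna() then removes rows with NaN cells
clean = df[~has_dot & ~is_header & has_index].dropna()
print(clean)
```

Because the rows are selected with boolean masks and never reset, the original index labels are kept.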
I'm looking to create a rolling grouped cumulative sum across two dataframes. I can get the result via iteration, but wanted to see if there was a more intelligent way.
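I need the 5-row block of A to roll through the rows of B and accumulate. Think of it as a rolling balance with a block of contributions and rolling returns: in each period the contribution is added to the beginning balance and the sum grows by that period's return.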
So, here's the calculation for C
A B
1 100.00 1 0.01 101.00
2 110.00 2 0.02 215.22 102.00
3 120.00 3 0.03 345.28 218.36 103.00
4 130.00 4 0.04 494.29 351.89 221.52 104.00
5 140.00 5 0.05 666.00 505.99 358.60 224.70 105.00
6 0.06 684.75 517.91 365.38 227.90 106.00
7 0.07 703.97 530.06 372.25 231.12
8 0.08 723.66 542.43 379.21
9 0.09 743.85 555.04
10 0.10 764.54
C Row 5
Beginning Balance Contribution Return Ending Balance
0.00 100.00 0.01 101.00
101.00 110.00 0.02 215.22
215.22 120.00 0.03 345.28
345.28 130.00 0.04 494.29
494.29 140.00 0.05 666.00
C Row 6
Beginning Balance Contribution Return Ending Balance
0.00 100.00 0.02 102.00
102.00 110.00 0.03 218.36
218.36 120.00 0.04 351.89
351.89 130.00 0.05 505.99
505.99 140.00 0.06 684.75
Here's what the source data looks like:
A B
1 100.00 1 0.01
2 110.00 2 0.02
3 120.00 3 0.03
4 130.00 4 0.04
5 140.00 5 0.05
6 0.06
7 0.07
8 0.08
9 0.09
10 0.10
Here is the desired result:
C
1 NaN
2 NaN
3 NaN
4 NaN
5 666.00
6 684.75
7 703.97
8 723.66
9 743.85
10 764.54
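One possible way to get C without an explicit Python loop is sketched below. It relies on reading the worked tables as ending balance = (beginning balance + contribution) * (1 + return), so each contribution ends up multiplied by the growth factors from its own period to the end of the window. The names a, b, growth, factors and c are illustrative, and numpy's sliding_window_view (NumPy 1.20 or later) is an assumed tool, not something from the question.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([100.0, 110.0, 120.0, 130.0, 140.0])              # block A
b = pd.Series(np.arange(1, 11) / 100, index=range(1, 11))      # returns B

# One row per window of len(a) consecutive growth factors (1 + return)
growth = sliding_window_view(1 + b.to_numpy(), len(a))

# Reverse cumulative product: factors[w, k] is the product of the growth
# factors from position k to the end of window w
factors = np.cumprod(growth[:, ::-1], axis=1)[:, ::-1]

# Each contribution times its accumulated growth, summed per window;
# the first len(a) - 1 rows have no full window and stay NaN
c = pd.Series(np.nan, index=b.index)
c.iloc[len(a) - 1:] = factors @ a
print(c.round(2))
```

For row 5 this reproduces the chain 101.00, 215.22, 345.28, 494.29, 666.00 from the worked table in a single matrix product.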
I want to compute the correlations in this DataFrame, but instead of displaying them the way shown below, I want to rank the values from lowest to largest.
import pandas as pd
import numpy as np
rs = np.random.RandomState(1)
df = pd.DataFrame(rs.rand(9, 8))
corr = df.corr()
corr.style.background_gradient().format(precision=2)  # set_precision(2) in older pandas
0 1 2 3 4 5 6 7
0 1 0.42 0.031 -0.16 -0.35 0.23 -0.22 0.4
1 0.42 1 -0.24 -0.55 0.011 0.3 -0.26 0.23
2 0.031 -0.24 1 0.29 0.44 0.29 0.23 0.25
3 -0.16 -0.55 0.29 1 -0.33 -0.42 0.58 -0.37
4 -0.35 0.011 0.44 -0.33 1 0.46 0.074 0.19
5 0.23 0.3 0.29 -0.42 0.46 1 -0.41 0.71
6 -0.22 -0.26 0.23 0.58 0.074 -0.41 1 -0.66
7 0.4 0.23 0.25 -0.37 0.19 0.71 -0.66 1
You can use sort_values:
import pandas as pd
import numpy as np
rs = np.random.RandomState(1)
df = pd.DataFrame(rs.rand(9, 8))
corr = df.corr()
print(corr)
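print(corr.sort_values(by=0, axis=1, ascending=False)) # by=0 sorts the columns by the row labelled 0; ascending=False gives the largest-first order printed below, drop it for lowest to largest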
Results:
0 1 2 3 4 5 6 7
0 1.000000 0.418246 0.030692 -0.160001 -0.352993 0.230069 -0.216804 0.395662
1 0.418246 1.000000 -0.244115 -0.549013 0.010745 0.299203 -0.262351 0.232681
2 0.030692 -0.244115 1.000000 0.288011 0.435907 0.285408 0.225205 0.253840
3 -0.160001 -0.549013 0.288011 1.000000 -0.326950 -0.415688 0.578549 -0.366539
4 -0.352993 0.010745 0.435907 -0.326950 1.000000 0.455738 0.074293 0.193905
5 0.230069 0.299203 0.285408 -0.415688 0.455738 1.000000 -0.413383 0.708467
6 -0.216804 -0.262351 0.225205 0.578549 0.074293 -0.413383 1.000000 -0.664207
7 0.395662 0.232681 0.253840 -0.366539 0.193905 0.708467 -0.664207 1.000000
0 1 7 5 2 3 6 4
0 1.000000 0.418246 0.395662 0.230069 0.030692 -0.160001 -0.216804 -0.352993
1 0.418246 1.000000 0.232681 0.299203 -0.244115 -0.549013 -0.262351 0.010745
2 0.030692 -0.244115 0.253840 0.285408 1.000000 0.288011 0.225205 0.435907
3 -0.160001 -0.549013 -0.366539 -0.415688 0.288011 1.000000 0.578549 -0.326950
4 -0.352993 0.010745 0.193905 0.455738 0.435907 -0.326950 0.074293 1.000000
5 0.230069 0.299203 0.708467 1.000000 0.285408 -0.415688 -0.413383 0.455738
6 -0.216804 -0.262351 -0.664207 -0.413383 0.225205 0.578549 1.000000 0.074293
7 0.395662 0.232681 1.000000 0.708467 0.253840 -0.366539 -0.664207 0.193905
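If the goal is to rank every pairwise correlation rather than reorder the columns by one row, a sketch along these lines may help. It keeps each pair once by masking everything except the strict upper triangle; upper and pairs are illustrative names and this is an alternative to the sort_values call above, not part of it.

```python
import numpy as np
import pandas as pd

rs = np.random.RandomState(1)
df = pd.DataFrame(rs.rand(9, 8))
corr = df.corr()

# True above the diagonal only, so each pair appears once and the 1.0 diagonal is skipped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Long format: one (row, column) pair per entry, sorted from lowest to largest
pairs = corr.where(upper).stack().dropna().sort_values()
print(pairs.round(2))
```

The result is a Series indexed by (row, column) pairs, so pairs.head() and pairs.tail() give the weakest and strongest correlations.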
To better explain my problem, let's pretend I have a shop with 3 unique customers, and my dataframe contains every purchase of my customers with weekday, name and paid price.
name price weekday
0 Paul 18.44 0
1 Micky 0.70 0
2 Sarah 0.59 0
3 Sarah 0.27 1
4 Paul 3.45 2
5 Sarah 14.03 2
6 Paul 17.21 3
7 Micky 5.35 3
8 Sarah 0.49 4
9 Micky 17.00 4
10 Paul 2.62 4
11 Micky 17.61 5
12 Micky 10.63 6
The information I would like to get is the average price per unique customer per weekday. What I often do in similar situations is to group by several columns with sum and then take the average of a subset of the columns.
df = df.groupby(['name','weekday']).sum()
price
name weekday
Micky 0 0.70
3 5.35
4 17.00
5 17.61
6 10.63
Paul 0 18.44
2 3.45
3 17.21
4 2.62
Sarah 0 0.59
1 0.27
2 14.03
4 0.49
df = df.groupby(['weekday']).mean()
price
weekday
0 6.576667
1 0.270000
2 8.740000
3 11.280000
4 6.703333
5 17.610000
6 10.630000
Of course this only works if every unique customer has at least one purchase per day; otherwise the missing days are left out of the mean instead of counting as zero.
Is there an elegant way to get a zero value for all combinations between unique index values that have no sum after the first groupby?
My solutions so far have been either to reindex on a MultiIndex I created from the unique values of the grouped columns, or to use the unstack-fillna-stack combination, but neither really satisfies me.
Appreciate your help!
IIUC, let's use unstack and fillna then stack:
df_out = df.groupby(['name','weekday']).sum().unstack().fillna(0).stack()
Output:
price
name weekday
Micky 0 0.70
1 0.00
2 0.00
3 5.35
4 17.00
5 17.61
6 10.63
Paul 0 18.44
1 0.00
2 3.45
3 17.21
4 2.62
5 0.00
6 0.00
Sarah 0 0.59
1 0.27
2 14.03
3 0.00
4 0.49
5 0.00
6 0.00
And,
df_out.groupby('weekday').mean()
Output:
price
weekday
0 6.576667
1 0.090000
2 5.826667
3 7.520000
4 6.703333
5 5.870000
6 3.543333
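A slightly shorter variant of the same idea is sketched here, assuming the purchase table from the question; unstack accepts a fill_value argument, so the separate fillna and stack steps can be skipped. The names totals and avg_per_weekday are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Paul", "Micky", "Sarah", "Sarah", "Paul", "Sarah", "Paul",
             "Micky", "Sarah", "Micky", "Paul", "Micky", "Micky"],
    "price": [18.44, 0.70, 0.59, 0.27, 3.45, 14.03, 17.21,
              5.35, 0.49, 17.00, 2.62, 17.61, 10.63],
    "weekday": [0, 0, 0, 1, 2, 2, 3, 3, 4, 4, 4, 5, 6],
})

# One row per customer, one column per weekday, zeros where nothing was bought
totals = df.groupby(["name", "weekday"])["price"].sum().unstack(fill_value=0)

# Column-wise mean: average spend per customer for each weekday
avg_per_weekday = totals.mean()
print(avg_per_weekday)
```

Because the zeros are filled in before averaging, every weekday is divided by all three customers.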
I think you can use pivot_table to do all the steps at once. I'm not exactly sure what you want but the default aggregation from pivot_table is the mean. You can change it to 'sum'.
df1 = df.pivot_table(index='name', columns='weekday', values='price',
                     fill_value=0, aggfunc='sum')
weekday 0 1 2 3 4 5 6
name
Micky 0.70 0.00 0.00 5.35 17.00 17.61 10.63
Paul 18.44 0.00 3.45 17.21 2.62 0.00 0.00
Sarah 0.59 0.27 14.03 0.00 0.49 0.00 0.00
And then take the mean of each column.
df1.mean()
weekday
0 6.576667
1 0.090000
2 5.826667
3 7.520000
4 6.703333
5 5.870000
6 3.543333
I'm trying to reshape a dataframe, but I'm not able to get the results I need.
The dataframe looks like this:
m r s p O W N
1 4 3 1 2.81 3.70 3.03
1 4 4 1 2.14 2.82 2.31
1 4 5 1 1.47 1.94 1.59
1 4 3 2 0.58 0.78 0.60
1 4 4 2 0.67 0.00 0.00
1 4 5 2 1.03 2.45 1.68
1 4 3 3 1.98 1.34 1.81
1 4 4 3 0.00 0.04 0.15
1 4 5 3 0.01 0.00 0.26
I need to reshape the dataframe so it will look like this:
m r s p O W N p O W N p O W N
1 4 3 1 2.81 3.70 3.03 2 0.58 0.78 0.60 3 1.98 1.34 1.81
1 4 4 1 2.14 2.82 2.31 2 0.67 0.00 0.00 3 0.00 0.04 0.15
1 4 5 1 1.47 1.94 1.59 2 1.03 2.45 1.68 3 0.01 0.00 0.26
I tried to use the pivot_table function
df.pivot_table(index=['m','r','s'], columns=['p'], values=['O','W','N'])
but I'm not able to get quite what I want. Does anyone know how to do this?
As someone who fancies himself as pretty handy with pandas, the pivot_table and melt functions are confusing to me. I prefer to stick with a well-defined and unique index and use the stack and unstack methods of the dataframe itself.
First, I'll ask if you really need to repeat the p-column like that? I can sort of see its value when presenting data, but IMO pandas isn't really set up to work like that. We could shoehorn it in, but let's see if a simpler solution gets you what you need.
Here's what I would do:
from io import StringIO
import pandas
datatable = StringIO("""\
m r s p O W N
1 4 3 1 2.81 3.70 3.03
1 4 4 1 2.14 2.82 2.31
1 4 5 1 1.47 1.94 1.59
1 4 3 2 0.58 0.78 0.60
1 4 4 2 0.67 0.00 0.00
1 4 5 2 1.03 2.45 1.68
1 4 3 3 1.98 1.34 1.81
1 4 4 3 0.00 0.04 0.15
1 4 5 3 0.01 0.00 0.26""")
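df = (
    pandas.read_table(datatable, sep=r'\s+')
    .set_index(['m', 'r', 's', 'p'])
    .unstack(level='p')
)
df.columns = df.columns.swaplevel(0, 1)
# DataFrame.sort was removed from pandas; sort the columns by the p level only,
# so that the O, W, N order within each p is kept
df = df.sort_index(axis=1, level=0, sort_remaining=False)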
print(df)
Which prints:
p 1 2 3
O W N O W N O W N
m r s
1 4 3 2.81 3.70 3.03 0.58 0.78 0.60 1.98 1.34 1.81
4 2.14 2.82 2.31 0.67 0.00 0.00 0.00 0.04 0.15
5 1.47 1.94 1.59 1.03 2.45 1.68 0.01 0.00 0.26
So now the columns are a MultiIndex and you can access, for example, all of the values where p = 2 with df[2] or df.xs(2, level='p', axis=1), which gives me:
O W N
m r s
1 4 3 0.58 0.78 0.60
4 0.67 0.00 0.00
5 1.03 2.45 1.68
Similarly, you can get all of the W columns with: df.xs('W', level=1, axis=1)
(we say level=1 because that column level does not have a name, so we use its position instead)
p 1 2 3
m r s
1 4 3 3.70 0.78 1.34
4 2.82 0.00 0.04
5 1.94 2.45 0.00
You can similarly query the rows (the m, r, s index levels) by using axis=0.
If you really need the p values in a column, just add it there manually and reindex your columns:
for p in df.columns.get_level_values('p').unique():
    df[p, 'p'] = p
cols = pandas.MultiIndex.from_product([[1,2,3], list('pOWN')])
df = df.reindex(columns=cols)
print(df)
1 2 3
p O W N p O W N p O W N
m r s
1 4 3 1 2.81 3.70 3.03 2 0.58 0.78 0.60 3 1.98 1.34 1.81
4 1 2.14 2.82 2.31 2 0.67 0.00 0.00 3 0.00 0.04 0.15
5 1 1.47 1.94 1.59 2 1.03 2.45 1.68 3 0.01 0.00 0.26
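If the flat layout from the question, with p O W N repeated in a single header row, is really what is needed, one alternative sketch is to split the frame by p and glue the pieces side by side. This swaps in groupby plus pd.concat instead of the MultiIndex approach above; parts and flat are illustrative names, and the result has duplicate column labels, which pandas allows but which makes label-based selection ambiguous.

```python
from io import StringIO

import pandas as pd

datatable = StringIO("""\
m r s p O W N
1 4 3 1 2.81 3.70 3.03
1 4 4 1 2.14 2.82 2.31
1 4 5 1 1.47 1.94 1.59
1 4 3 2 0.58 0.78 0.60
1 4 4 2 0.67 0.00 0.00
1 4 5 2 1.03 2.45 1.68
1 4 3 3 1.98 1.34 1.81
1 4 4 3 0.00 0.04 0.15
1 4 5 3 0.01 0.00 0.26""")
df = pd.read_csv(datatable, sep=r"\s+")

# One block per value of p, each indexed by (m, r, s) so the blocks align row by row
parts = [group.set_index(["m", "r", "s"]) for _, group in df.groupby("p")]

# Place the blocks next to each other, then turn m, r, s back into columns
flat = pd.concat(parts, axis=1).reset_index()
print(flat)
```

This is mainly useful for presentation or export; for further analysis the MultiIndex columns shown above are easier to query.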
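import csv

sk = ''
# Collect the lowercased first field of every line into one comma-separated string
with open('df_col2.csv', 'r') as ann:
    for col in ann:
        an = col.lower().strip('\n').split(',')
        sk += an[0] + ','
sk = sk[:-1]  # drop the trailing comma
# Write the joined string as a single-cell row
with open('ss2.csv', 'w', newline='') as b:
    a = csv.writer(b)
    a.writerow([sk])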