I'm trying to reshape a dataframe, but I'm not able to get the results I need.
The dataframe looks like this:
m r s p O W N
1 4 3 1 2.81 3.70 3.03
1 4 4 1 2.14 2.82 2.31
1 4 5 1 1.47 1.94 1.59
1 4 3 2 0.58 0.78 0.60
1 4 4 2 0.67 0.00 0.00
1 4 5 2 1.03 2.45 1.68
1 4 3 3 1.98 1.34 1.81
1 4 4 3 0.00 0.04 0.15
1 4 5 3 0.01 0.00 0.26
I need to reshape the dataframe so it will look like this:
m r s p O W N p O W N p O W N
1 4 3 1 2.81 3.70 3.03 2 0.58 0.78 0.60 3 1.98 1.34 1.81
1 4 4 1 2.14 2.82 2.31 2 0.67 0.00 0.00 3 0.00 0.04 0.15
1 4 5 1 1.47 1.94 1.59 2 1.03 2.45 1.68 3 0.01 0.00 0.26
I tried to use the pivot_table function
df.pivot_table(index=['m','r','s'], columns=['p'], values=['O','W','N'])
but I'm not able to get quite what I want. Does anyone know how to do this?
I fancy myself pretty handy with pandas, yet I still find the pivot_table and melt functions confusing. I prefer to stick with a well-defined and unique index and use the stack and unstack methods of the dataframe itself.
First, do you really need to repeat the p-column like that? I can sort of see its value when presenting data, but IMO pandas isn't really set up to work like that. We could shoehorn it in, but let's see if a simpler solution gets you what you need.
Here's what I would do:
from io import StringIO
import pandas
datatable = StringIO("""\
m r s p O W N
1 4 3 1 2.81 3.70 3.03
1 4 4 1 2.14 2.82 2.31
1 4 5 1 1.47 1.94 1.59
1 4 3 2 0.58 0.78 0.60
1 4 4 2 0.67 0.00 0.00
1 4 5 2 1.03 2.45 1.68
1 4 3 3 1.98 1.34 1.81
1 4 4 3 0.00 0.04 0.15
1 4 5 3 0.01 0.00 0.26""")
df = (
    pandas.read_table(datatable, sep=r'\s+')
    .set_index(['m', 'r', 's', 'p'])
    .unstack(level='p')
)
df.columns = df.columns.swaplevel(0, 1)
# DataFrame.sort() was removed from pandas; sort_index() is its replacement
df.sort_index(axis=1, inplace=True)
print(df)
Which prints:
p 1 2 3
O W N O W N O W N
m r s
1 4 3 2.81 3.70 3.03 0.58 0.78 0.60 1.98 1.34 1.81
4 2.14 2.82 2.31 0.67 0.00 0.00 0.00 0.04 0.15
5 1.47 1.94 1.59 1.03 2.45 1.68 0.01 0.00 0.26
So now the columns are a MultiIndex and you can access, for example, all of the values where p = 2 with df[2] or df.xs(2, level='p', axis=1), which gives me:
O W N
m r s
1 4 3 0.58 0.78 0.60
4 0.67 0.00 0.00
5 1.03 2.45 1.68
Similarly, you can get all of the W columns with: df.xs('W', level=1, axis=1)
(we say level=1 because that column level does not have a name, so we use its position instead)
p 1 2 3
m r s
1 4 3 3.70 0.78 1.34
4 2.82 0.00 0.04
5 1.94 2.45 0.00
You can similarly query the rows by using axis=0.
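As a minimal sketch of row-wise selection (the sample data is rebuilt here so the snippet runs on its own, and the names raw, wide and rows_s4 are only for illustration), xs with axis=0 picks rows by an index level:

```python
from io import StringIO
import pandas as pd

raw = StringIO("""\
m r s p O W N
1 4 3 1 2.81 3.70 3.03
1 4 4 1 2.14 2.82 2.31
1 4 5 1 1.47 1.94 1.59
1 4 3 2 0.58 0.78 0.60
1 4 4 2 0.67 0.00 0.00
1 4 5 2 1.03 2.45 1.68
1 4 3 3 1.98 1.34 1.81
1 4 4 3 0.00 0.04 0.15
1 4 5 3 0.01 0.00 0.26""")

wide = (
    pd.read_csv(raw, sep=r"\s+")
    .set_index(["m", "r", "s", "p"])
    .unstack(level="p")
)

# axis=0 addresses the row index: keep only the rows where level "s" equals 4.
rows_s4 = wide.xs(4, level="s", axis=0)
print(rows_s4)
```

The selected level is dropped from the result, so rows_s4 is indexed by m and r only.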
If you really need the p values in a column, just add it there manually and reindex your columns:
for p in df.columns.get_level_values('p').unique():
    df[p, 'p'] = p
cols = pandas.MultiIndex.from_product([[1,2,3], list('pOWN')])
df = df.reindex(columns=cols)
print(df)
1 2 3
p O W N p O W N p O W N
m r s
1 4 3 1 2.81 3.70 3.03 2 0.58 0.78 0.60 3 1.98 1.34 1.81
4 1 2.14 2.82 2.31 2 0.67 0.00 0.00 3 0.00 0.04 0.15
5 1 1.47 1.94 1.59 2 1.03 2.45 1.68 3 0.01 0.00 0.26
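If the goal is the exact flat layout from the question (one header row with p, O, W, N repeated), one possible sketch is to drop the outer column level and reset the index. The small frame below is a hand-built stand-in with only two p-groups, and the name flat is just for illustration:

```python
import pandas as pd

# Stand-in for the reshaped frame above, trimmed to two rows and two p-groups.
idx = pd.MultiIndex.from_tuples([(1, 4, 3), (1, 4, 4)], names=["m", "r", "s"])
cols = pd.MultiIndex.from_product([[1, 2], list("pOWN")])
wide = pd.DataFrame(
    [[1, 2.81, 3.70, 3.03, 2, 0.58, 0.78, 0.60],
     [1, 2.14, 2.82, 2.31, 2, 0.67, 0.00, 0.00]],
    index=idx,
    columns=cols,
)

flat = wide.copy()
flat.columns = flat.columns.get_level_values(1)  # keep only the p/O/W/N labels
flat = flat.reset_index()                        # m, r, s become ordinary columns
print(list(flat.columns))
```

Keep in mind that duplicate column names make label-based selection ambiguous (flat['p'] returns every column called p), so this layout is probably best kept for presentation or export.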
import csv

sk = ''
with open('df_col2.csv', 'r') as ann:
    for col in ann:
        an = col.lower().strip('\n').split(',')
        sk += an[0] + ','
sk = sk[:-1]  # drop the trailing comma
with open('ss2.csv', 'w', newline='') as b:
    a = csv.writer(b)
    a.writerow([sk])
I have a data frame like 1 and I am trying to create a new data frame 2 which consists of ratios of each column of above data frame.
I tried below mentioned logic.
df_new = pd.concat([df[df.columns.difference([col])].div(df[col], axis=0)\
.add_suffix('/R') for col in df.columns], axis=1)
Output is:
B/R C/R D/R A/R C/R D/R A/R B/R D/R A/R B/R C/R
0 0.46 1.16 0.78 2.16 2.50 1.69 0.86 0.40 0.68 1.28 0.59 1.48
1 1.05 1.25 1.64 0.95 1.19 1.55 0.80 0.84 1.30 0.61 0.64 0.77
2 1.56 2.78 2.78 0.64 1.79 1.79 0.36 0.56 1.00 0.36 0.56 1.00
3 0.54 2.23 0.35 1.86 4.14 0.64 0.45 0.24 0.16 2.89 1.56 6.44
However, I am facing two issues here. First, I get both A/B and B/A; the second is not needed and it increases the number of columns. Is there a way to output only A/B and leave out B/A?
Second, naming the columns with the add_suffix method does not convey which column is divided by which. Is there a way to create column names like A/B for column A divided by column B?
Use itertools.combinations and divide each pair of columns in a dict comprehension:
import pandas as pd
from itertools import combinations

df = pd.DataFrame({
    'A':[5,3,6,9,2,4],
    'B':[4,5,4,5,5,4],
    'C':[7,8,9,4,2,3],
    'D':[1,3,5,7,1,8],
})
# one ratio column per unordered pair, named numerator/denominator
L = {f'{a}/{b}': df[a].div(df[b]) for a, b in combinations(df.columns, 2)}
df = pd.concat(L, axis=1)
print(df)
A/B A/C A/D B/C B/D C/D
0 1.25 0.714286 5.000000 0.571429 4.000000 7.000000
1 0.60 0.375000 1.000000 0.625000 1.666667 2.666667
2 1.50 0.666667 1.200000 0.444444 0.800000 1.800000
3 1.80 2.250000 1.285714 1.250000 0.714286 0.571429
4 0.40 1.000000 2.000000 2.500000 5.000000 2.000000
5 1.00 1.333333 0.500000 1.333333 0.500000 0.375000
I have an input file foo.txt with the following lines
A 1 2 3
B 4 5 6
C 7 8 9
I have written the following lines
import numpy as np
import pandas as pd
file="foo.txt"
source_array=pd.read_csv(file, sep=" ", header=None)
name_array=source_array.iloc[:,0].to_numpy()
number_array=source_array.iloc[:,1:4].to_numpy()
r1=np.array([[1,0,0],[0,1,0],[0,0,1]])
r2=np.array([[0.5,-0.30902,-0.80902],[0.30902,-0.80902,0.5],[-0.80902,-0.5,-0.30902]])
r3=np.array([[0.5,0.30902,-0.80902],[-0.30902,-0.80902,-0.5],[-0.80902,0.5,-0.30902]])
mult_array=np.array([r1,r2,r3])
out_array=np.empty((0,3))
for i in range(number_array.shape[0]):
    lad=number_array[i,0:3]
    lad=lad.reshape(1,3)
    print(lad)
    for j in range(mult_array.shape[0]):
        operated_array=np.dot(lad,mult_array[j])
        out_array=np.append(out_array,operated_array,axis=0)
        #print(operated_array)
np.savetxt('foo2.txt',out_array, fmt='%.2f')
After performing the dot multiplication I get the following output
1.00 2.00 3.00
-1.31 -3.43 -0.74
-2.55 0.19 -2.74
4.00 5.00 6.00
-1.31 -8.28 -2.59
-4.40 0.19 -7.59
7.00 8.00 9.00
-1.31 -13.14 -4.44
-6.25 0.19 -12.44
But the expected output in foo2.txt is
A 1.00 2.00 3.00
A -1.31 -3.43 -0.74
A -2.55 0.19 -2.74
B 4.00 5.00 6.00
B -1.31 -8.28 -2.59
B -4.40 0.19 -7.59
C 7.00 8.00 9.00
C -1.31 -13.14 -4.44
C -6.25 0.19 -12.44
How can I duplicate the row name as many times as I perform the dot multiplication?
For clarity, the output of print(df) for the input is
df
0 1 2 3
0 A 1 2 3
1 B 4 5 6
2 C 7 8 9
We do not need a for loop; np.dot together with explode does the job:
df['new']=np.dot(df,mult_array).tolist()
s=df.new.explode()
output=pd.DataFrame(s.tolist(),index=s.index).round(2)
Out[30]:
0 1 2
A 1.00 2.00 3.00
A -1.31 -3.43 -0.74
A -2.55 0.19 -2.74
B 4.00 5.00 6.00
B -1.31 -8.28 -2.59
B -4.40 0.19 -7.59
C 7.00 8.00 9.00
C -1.31 -13.14 -4.44
C -6.25 0.19 -12.44
Data input
df
0 1 2
A 1 2 3
B 4 5 6
C 7 8 9
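An alternative sketch that stays in NumPy until the very end: np.repeat duplicates each row name once per matrix, which answers the "duplicate the row name" part directly. The input is typed in by hand here instead of being read from foo.txt, and the names names, numbers, products and labels are only for illustration:

```python
import numpy as np
import pandas as pd

names = np.array(["A", "B", "C"])
numbers = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)

r1 = np.eye(3)
r2 = np.array([[0.5, -0.30902, -0.80902],
               [0.30902, -0.80902, 0.5],
               [-0.80902, -0.5, -0.30902]])
r3 = np.array([[0.5, 0.30902, -0.80902],
               [-0.30902, -0.80902, -0.5],
               [-0.80902, 0.5, -0.30902]])
mult_array = np.array([r1, r2, r3])

# (n, 3) dot (k, 3, 3) gives (n, k, 3); flatten it to one output row per (row, matrix) pair.
products = np.dot(numbers, mult_array).reshape(-1, 3)
# Repeat every name k times so the labels line up with the flattened rows.
labels = np.repeat(names, len(mult_array))

out = pd.DataFrame(products, index=labels).round(2)
print(out.to_string(header=False))
```

To get the file from the question, something like out.to_csv('foo2.txt', sep=' ', header=False, float_format='%.2f') should work in place of the print.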
I have a file like:
AFA MT 0 0 1.22 259 169 FOD 0 50.01 1.3 1.370 0.00 -0.02 1.78 0 0.0
S 2 50.620 1.960 2.452 0.00 -0.49 0.31
MKE MS 0 0 4.22 256 149 MDO 1 30.00 1.4 2.370 3.00 -0.52 4.82 0 0.0
KTE KL 0 0 1.22 259 169 FID 0 10.01 2.0 2.470 1.00 -0.12 0.78 1 1.0
S 3 70.610 1.960 2.52 0.00 -0.19 0.41
...
...
S lines are not always there, but when they are, they always start with S.
I would like to split the file and create a dictionary keyed only by the first field (AFA, KTE ...), while keeping the "S 2 50.620 ... 0.31" part as part of the value of the preceding key whenever it exists
(i.e. merge each S line with the previous line whenever one occurs).
So far I did:
import collections
st = {}
with open("file.txt") as f:
    for line in f:
        if len(line.split())==17 or len(line.split())==8 :
            key, value = line.split(None, 1)
            st[key] = (value.split())
            #order output as the order in file
            st=collections.OrderedDict(st)
            print ([key] ,[value])
but this gives me:
['AFA'] ['MT 0 0 1.22 259 169 FOD 0 50.01 1.3 1.370 0.00 -0.02 1.78 0 0.0']
['S'] ['2 50.620 1.960 2.452 0.00 -0.49 0.31']
['MKE'] ['MS 0 0 4.22 256 149 MDO 1 30.00 1.4 2.370 3.00 -0.52 4.82 0 0.0']
['KTE'] ['KL 0 0 1.22 259 169 FID 0 10.01 2.0 2.470 1.00 -0.12 0.78 1 1.0']
['S'] ['3 70.610 1.960 2.52 0.00 -0.19 0.41']
whereas what I want to get is:
['AFA'] ['MT 0 0 1.22 259 169 FOD 0 50.01 1.3 1.370 0.00 -0.02 1.78 0 0.0 S 2 50.620 1.960 2.452 0.00 -0.49 0.31']
['MKE'] ['MS 0 0 4.22 256 149 MDO 1 30.00 1.4 2.370 3.00 -0.52 4.82 0 0.0']
['KTE'] ['KL 0 0 1.22 259 169 FID 0 10.01 2.0 2.470 1.00 -0.12 0.78 1 1.0 S 3 70.610 1.960 2.52 0.00 -0.19 0.41']
The logic you could use is:
Remember the last non-S line you've read
If you read an S-line, add it to the remembered non-S line & put that in your dictionary, then forget the remembered non-S line
If you read a non-S-line, put any remembered non-S line in your dictionary & remember the new line instead
When you are done with the file, put any remembered non-S line in your dictionary
This seems to do what you want, without much modification to your attempt:
import pprint
st = {}
with open("file.txt") as f:
    for line in f:
        if len(line.split())==17 :
            key, value = line.split(None, 1)
            st[key] = (value.split())
        elif len(line.split())==8 :
            st[key] += line.split()
pprint.pprint(st)
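For comparison, here is a rough sketch of the remember-and-flush logic listed above. It recognises S lines by their first field rather than by counting fields, which is an assumption on my part (it presumes no record key is literally "S"). The function name merge_s_lines and the in-memory sample are made up for illustration:

```python
def merge_s_lines(lines):
    """Map each record key to its fields, appending a following S line if there is one."""
    records = {}
    pending = None  # fields of the last non-S line that has not been stored yet
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if fields[0] == "S":
            if pending is not None:
                records[pending[0]] = pending[1:] + fields
                pending = None
        else:
            if pending is not None:
                records[pending[0]] = pending[1:]
            pending = fields
    if pending is not None:  # flush whatever is left at the end of the input
        records[pending[0]] = pending[1:]
    return records


sample = [
    "AFA MT 0 0 1.22 259 169 FOD 0 50.01 1.3 1.370 0.00 -0.02 1.78 0 0.0",
    "S 2 50.620 1.960 2.452 0.00 -0.49 0.31",
    "MKE MS 0 0 4.22 256 149 MDO 1 30.00 1.4 2.370 3.00 -0.52 4.82 0 0.0",
    "KTE KL 0 0 1.22 259 169 FID 0 10.01 2.0 2.470 1.00 -0.12 0.78 1 1.0",
    "S 3 70.610 1.960 2.52 0.00 -0.19 0.41",
]
result = merge_s_lines(sample)
for key, value in result.items():
    print([key], [" ".join(value)])
```

With a real file you would pass the open file object instead of the list. Plain dicts keep insertion order on Python 3.7+, so OrderedDict is not needed there.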
To explain my problem better, let's pretend I have a shop with 3 unique customers, and my dataframe contains every purchase of my customers with weekday, name and paid price.
name price weekday
0 Paul 18.44 0
1 Micky 0.70 0
2 Sarah 0.59 0
3 Sarah 0.27 1
4 Paul 3.45 2
5 Sarah 14.03 2
6 Paul 17.21 3
7 Micky 5.35 3
8 Sarah 0.49 4
9 Micky 17.00 4
10 Paul 2.62 4
11 Micky 17.61 5
12 Micky 10.63 6
The information I would like to get is the average price per unique customer per weekday. What I often do in similar situations is group by several columns with sum and then take the average over a subset of the columns.
df = df.groupby(['name','weekday']).sum()
price
name weekday
Micky 0 0.70
3 5.35
4 17.00
5 17.61
6 10.63
Paul 0 18.44
2 3.45
3 17.21
4 2.62
Sarah 0 0.59
1 0.27
2 14.03
4 0.49
df = df.groupby(['weekday']).mean()
price
weekday
0 6.576667
1 0.270000
2 8.740000
3 11.280000
4 6.703333
5 17.610000
6 10.630000
Of course, this only gives the right answer if every unique customer has at least one purchase on every weekday.
Is there an elegant way to get a zero value for all combinations of unique index values that have no sum after the first groupby?
So far my solutions have been either to reindex on a MultiIndex I created from the unique values of the grouped columns, or to combine unstack, fillna and stack, but neither solution really satisfies me.
Appreciate your help!
IIUC, let's use unstack and fillna then stack:
df_out = df.groupby(['name','weekday']).sum().unstack().fillna(0).stack()
Output:
price
name weekday
Micky 0 0.70
1 0.00
2 0.00
3 5.35
4 17.00
5 17.61
6 10.63
Paul 0 18.44
1 0.00
2 3.45
3 17.21
4 2.62
5 0.00
6 0.00
Sarah 0 0.59
1 0.27
2 14.03
3 0.00
4 0.49
5 0.00
6 0.00
And,
df_out.groupby('weekday').mean()
Output:
price
weekday
0 6.576667
1 0.090000
2 5.826667
3 7.520000
4 6.703333
5 5.870000
6 3.543333
I think you can use pivot_table to do all the steps at once. I'm not exactly sure what you want but the default aggregation from pivot_table is the mean. You can change it to 'sum'.
df1 = df.pivot_table(index='name', columns='weekday', values='price',
fill_value=0, aggfunc='sum')
weekday 0 1 2 3 4 5 6
name
Micky 0.70 0.00 0.00 5.35 17.00 17.61 10.63
Paul 18.44 0.00 3.45 17.21 2.62 0.00 0.00
Sarah 0.59 0.27 14.03 0.00 0.49 0.00 0.00
And then take the mean of each column.
df1.mean()
weekday
0 6.576667
1 0.090000
2 5.826667
3 7.520000
4 6.703333
5 5.870000
6 3.543333
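For completeness, the reindex route mentioned in the question could look roughly like this. It is only a sketch, with the sample data typed in by hand and the names sums, full_index, filled and daily_mean chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Paul", "Micky", "Sarah", "Sarah", "Paul", "Sarah", "Paul",
             "Micky", "Sarah", "Micky", "Paul", "Micky", "Micky"],
    "price": [18.44, 0.70, 0.59, 0.27, 3.45, 14.03, 17.21,
              5.35, 0.49, 17.00, 2.62, 17.61, 10.63],
    "weekday": [0, 0, 0, 1, 2, 2, 3, 3, 4, 4, 4, 5, 6],
})

sums = df.groupby(["name", "weekday"])["price"].sum()

# Every (name, weekday) combination, whether or not a purchase exists for it.
full_index = pd.MultiIndex.from_product(
    [df["name"].unique(), sorted(df["weekday"].unique())],
    names=["name", "weekday"],
)
filled = sums.reindex(full_index, fill_value=0)

daily_mean = filled.groupby(level="weekday").mean()
print(daily_mean)
```

This should give the same weekday means as the two answers above; it is more typing, but it makes the "all combinations" step explicit.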
I have data that looks like this:
1.00 1.00 1.00
3.23 4.23 0.33
1.23 0.13 3.44
4.55 12.3 14.1
2.00 2.00 2.00
1.21 1.11 1.11
3.55 5.44 5.22
4.11 1.00 4.00
It comes in chunks of 4 lines. The first line of each chunk is the index and the rest are the values.
A chunk always has 4 lines, but the number of columns can be more than 3.
For example:
1.00 1.00 1.00 <- 1st chunk, the index = 1
3.23 4.23 0.33 <- values
1.23 0.13 3.44 <- values
4.55 12.3 14.1 <- values
My example above only contains 2 chunks, but actually it can contain more than that.
What I want to do is to create a dictionary of data frames so I can process them
chunk by chunk. Namely from this:
In [1]: import pandas as pd
In [2]: df = pd.read_table("http://dpaste.com/29R0BSS.txt",header=None, sep = " ")
In [3]: df
Out[3]:
0 1 2
0 1.00 1.00 1.00
1 3.23 4.23 0.33
2 1.23 0.13 3.44
3 4.55 12.30 14.10
4 2.00 2.00 2.00
5 1.21 1.11 1.11
6 3.55 5.44 5.22
7 4.11 1.00 4.00
Into a list of data frames, such that I can do something like this (I did this by hand):
>> # Let's call new data frame `nd`.
>> nd[1]
>> 0 1 2
0 3.23 4.23 0.33
1 1.23 0.13 3.44
2 4.55 12.30 14.10
There are lots of ways to do this; I tend to use groupby, e.g. something like
>>> import numpy as np
>>> grouped = df.groupby(np.arange(len(df)) // 4)
>>> d = {v.iloc[0, 0]: v.iloc[1:].reset_index(drop=True) for k,v in grouped}
>>> for k,v in d.items():
...     print(k)
...     print(v)
...
1.0
0 1 2
0 3.23 4.23 0.33
1 1.23 0.13 3.44
2 4.55 12.30 14.10
2.0
0 1 2
0 1.21 1.11 1.11
1 3.55 5.44 5.22
2 4.11 1.00 4.00
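If groupby feels like overkill, plain positional slicing does the same thing. A small sketch, with the sample data typed in by hand rather than downloaded, and CHUNK and nd as illustrative names:

```python
import pandas as pd

df = pd.DataFrame([
    [1.00, 1.00, 1.00],
    [3.23, 4.23, 0.33],
    [1.23, 0.13, 3.44],
    [4.55, 12.3, 14.1],
    [2.00, 2.00, 2.00],
    [1.21, 1.11, 1.11],
    [3.55, 5.44, 5.22],
    [4.11, 1.00, 4.00],
])

CHUNK = 4  # one index line followed by three value lines
nd = {}
for start in range(0, len(df), CHUNK):
    block = df.iloc[start:start + CHUNK]
    key = block.iloc[0, 0]                          # first cell of the index line
    nd[key] = block.iloc[1:].reset_index(drop=True)  # remaining lines are the values

print(nd[1])
```

The keys are floats taken from the data (1.0, 2.0, ...), but nd[1] works as well because 1 and 1.0 hash and compare equal in Python.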