How to unstack a dictionary of dataframes with multiple entries? - python

Hi I have this dictionary below
str1: x y
a 1.0 -3.0
b 2.0 -2.5
str2: x y
a 3.0 -2.0
b 4.0 -1.5
str3: x y
a 5.0 -1.0
b 6.0 -0.5
The result I would like is to unstack it into a dataframe with index=[str1,str2,str3] and columns=[a,b]. To choose whether the x or the y values fill a given row of the expected dataframe, I use an integer N.
You can see N as a cutoff: every row above it uses x values, and every row below uses y values.
If N=1, I use x values for str1, y values for str2 and str3.
If N=2, I use x values for str1 and str2, y values for str3.
If N=3, I use x values for str1, str2 and str3.
Which gives me for N = 1:
a b
str1 1.0 2.0 (x values)
str2 -2.0 -1.5 (y values)
str3 -1.0 -0.5 (y values)
I know that I can build two dataframes by unstacking on x and on y, then concatenate the rows I want to keep, but I wanted to know if there is a faster way.

To solve this in a Pythonic way, you could first translate your rule (whether to use x or y values) into a dictionary, for instance with a dictionary comprehension:
# replicate the dictionary in the post
>>> import pandas as pd
>>> d = {'str1': {'a': {'x': 1, 'y': -3}, 'b': {'x': 2, 'y': -2.5}},
...      'str2': {'a': {'x': 3, 'y': -2}, 'b': {'x': 4, 'y': -1.5}},
...      'str3': {'a': {'x': 5, 'y': -1}, 'b': {'x': 6, 'y': -0.5}}}
>>> indexes = ['str1', 'str2', 'str3']
>>> N_map = {1: {'str1': 'x', 'str2': 'y', 'str3': 'y'},
...          2: {'str1': 'x', 'str2': 'x', 'str3': 'y'}}
Then we can loop through N=1,... and construct each dataframe with a list/dictionary comprehension:
# only take the first two rules as an example
>>> for i in range(1, 3):
...     df_d = {col: [d[index][col][N_map[i][index]] for index in indexes]
...             for col in ['a', 'b']}
...     print(pd.DataFrame(df_d, index=indexes))
a b
str1 1 2.0
str2 -2 -1.5
str3 -1 -0.5
a b
str1 1 2.0
str2 3 4.0
str3 -1 -0.5

Here is a version using a dict comprehension over an OrderedDict (a bit more pythonic):
import collections
import pandas as pd

def N_unstack(d, N):
    d = collections.OrderedDict(d)
    idx = list('x' * N + 'y' * (len(d) - N))
    return pd.DataFrame({k: v[idx[i]] for i, (k, v) in enumerate(d.items())}).T
Output for N_unstack(d,1) where d is the dictionary of dataframes:
a b
str1 1.0 2.0
str2 -2.0 -1.5
str3 -1.0 -0.5
Here is how I would do it (using pd.concat). It's a bit verbose:
def N_unstack(d, N):
    idx = list('x' * N + 'y' * (len(d) - N))
    df = pd.concat([d['str1'][idx[0]], d['str2'][idx[1]], d['str3'][idx[2]]], axis=1).T
    df.index = ['str1', 'str2', 'str3']
    return df
Edit: made the code a bit more pythonic
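If the keys should not be hardcoded as in the pd.concat version, the same idea generalizes to any number of entries. A sketch, assuming d is a dict of DataFrames indexed by 'a'/'b' with columns 'x'/'y', as in the question:

```python
import pandas as pd

def N_unstack(d, N):
    # the first N entries contribute their 'x' column, the rest their 'y' column
    labels = ['x'] * N + ['y'] * (len(d) - N)
    cols = [df[lab] for df, lab in zip(d.values(), labels)]
    return pd.concat(cols, axis=1, keys=list(d)).T

d = {'str1': pd.DataFrame({'x': [1.0, 2.0], 'y': [-3.0, -2.5]}, index=['a', 'b']),
     'str2': pd.DataFrame({'x': [3.0, 4.0], 'y': [-2.0, -1.5]}, index=['a', 'b']),
     'str3': pd.DataFrame({'x': [5.0, 6.0], 'y': [-1.0, -0.5]}, index=['a', 'b'])}
print(N_unstack(d, 1))
```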

With this dictionary of DataFrames:
d2
"""
{'str1': a b
x 1.0 2.0
y -3.0 -2.5,
'str2': a b
x 3.0 4.0
y -2.0 -1.5,
'str3': a b
x 5.0 6.0
y -1.0 -0.5}
"""
Define
df2 = pd.concat(d2)
df2.set_index(df2.index.droplevel(1),inplace=True) # remove 'x','y' labels
select = { N:[ 2*i + (i>=N) for i in range(3)] for N in range(1,4) }
Then, with for example N = 1:
In [3]: df2.iloc[select[N]]
Out[3]:
a b
str1 1.0 2.0
str2 -2.0 -1.5
str3 -1.0 -0.5
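Put together as a runnable sketch (the d2 literal below is reconstructed from the printout in the question):

```python
import pandas as pd

d2 = {'str1': pd.DataFrame({'a': [1.0, -3.0], 'b': [2.0, -2.5]}, index=['x', 'y']),
      'str2': pd.DataFrame({'a': [3.0, -2.0], 'b': [4.0, -1.5]}, index=['x', 'y']),
      'str3': pd.DataFrame({'a': [5.0, -1.0], 'b': [6.0, -0.5]}, index=['x', 'y'])}

df2 = pd.concat(d2)
df2.set_index(df2.index.droplevel(1), inplace=True)  # remove 'x','y' labels

# row 2*i is the 'x' row of the i-th key; adding (i >= N) switches to its 'y' row
select = {N: [2*i + (i >= N) for i in range(3)] for N in range(1, 4)}
print(df2.iloc[select[2]])
```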

Related

How to storage the outputs of an iterable function

Probably this is very simple, but I cannot figure out the proper way to produce a dataframe in pandas with the outputs of my function.
Let's say that I have a function that divides each element of a list (leaving aside the easiest way to divide a list):
X = [1,2,3,4,5,6]
for i in X:
    def SUM(X):
        output = i / 2
        return output
    df = SUM(X)
At the end, 'df' holds only the result of the last iteration. But how can I append all the outputs to a DataFrame?
Thanks by your suggestions
Why not create the DataFrame in a first step and then process the column values with Series.apply?
import pandas as pd

X = [1,2,3,4,5,6]

def SUM(X):
    output = X / 2
    return output

df = pd.DataFrame({'in': X})
df['out'] = df['in'].apply(SUM)
print(df)
in out
0 1 0.5
1 2 1.0
2 3 1.5
3 4 2.0
4 5 2.5
5 6 3.0
If your function should be used as-is, collect the outputs in a list first:
X = [1,2,3,4,5,6]

def SUM(X):
    output = X / 2
    return output

out = [SUM(i) for i in X]
df = pd.DataFrame({'out': out})
print(df)
out
0 0.5
1 1.0
2 1.5
3 2.0
4 2.5
5 3.0
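Since dividing by 2 is elementwise arithmetic, the apply and list-comprehension versions can also be written without any Python-level loop at all; pandas broadcasts the scalar across the whole column:

```python
import pandas as pd

X = [1, 2, 3, 4, 5, 6]
df = pd.DataFrame({'in': X})
df['out'] = df['in'] / 2  # vectorized division over the whole column
print(df)
```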

TypeError: can't multiply sequence by non-int of type 'numpy.float64' - multiply column by value

I am having problems creating a new column in my dataframe by multiplying an existing column by a value. I have looked over similar questions but have been unable to understand how to fix my code below:
list = []
i = 1
for col in df.columns[1:19]:
    #calculations
    x = df[[df.columns[i], df.columns[i+1], df.columns[i+2]]].values
    Q = np.cov(x.T)
    eval, evec = np.linalg.eig(Q)
    w = np.array([2*(evec[0,2]/evec[1,2]), 2*(evec[1,2]/evec[1,2]), 2*(evec[2,2]/evec[1,2])])
    #create new columns in dataframe with applied weights
    df['w1_PCA'] = df.columns[i] * w[0]
    df['b_PCA'] = df.columns[i+1] * w[1]
    df['w2_PCA'] = df.columns[i+2] * w[2]
    i = i + 1
print(x)
Receiving the error as follows:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-26-d7d86010b8f8> in <module>
19
20 #create new columns in dataframe for back-applied PCA weights
---> 21 df['w1_PCA'] = df.columns[i] * w[0]
22 df['b_PCA'] = df.columns[i+1] * w[1]
23 df['w2_PCA'] = df.columns[i+2] * w[2]
TypeError: can't multiply sequence by non-int of type 'numpy.float64'
Could someone please advise me as to where I am going wrong with this?
Any help is much appreciated!
The error is thrown because df.columns[i] is the column label of your data frame df, i.e. a string (in my case, with your code), not the column's values, so multiplying it by a numpy.float64 fails. Select actual values instead and, where they are integers, convert them with float().
I created a short example of your problem and got rid of the error as I understand it, while adding three further columns with some values inserted. I hope you can apply this solution to your data frame or data set. Below you can find two examples, depending on what precisely you want to do.
Solution 1:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c': [2,3,4], 'd': [2,3,4],
                   'e': [2,3,4], 'f': [2,3,4], 'g': [2,3,4]})
list = []
i = 1
for col in df.columns[1:5]:
    #calculations
    x = df[[df.columns[i], df.columns[i+1], df.columns[i+2]]].values
    Q = np.cov(x.T)
    eval, evec = np.linalg.eig(Q)
    w = np.array([2*(evec[0,2]/evec[1,2]), 2*(evec[1,2]/evec[1,2]), 2*(evec[2,2]/evec[1,2])])
    #create new columns in dataframe with applied weights
    df['w1_PCA'] = float(df['a'][0]) * w[0]
    df['b_PCA'] = float(df['b'][0]) * w[1]
    df['w2_PCA'] = df['c'][0] * w[2]
    i = i + 1
The resulting df in this case is:
a b c d e f g w1_PCA b_PCA w2_PCA
0 1 2 2 2 2 2 2 -0.0 4.0 -4.0
1 2 3 3 3 3 3 3 -0.0 4.0 -4.0
2 3 4 4 4 4 4 4 -0.0 4.0 -4.0
Alternatively, you could apply a function to the column df['a'] and store the results in new columns. Replace lines 21 to 23 of your code with the three lines below, which map the function over the whole column:
Solution 2
df['w1_PCA'] = df['a'].apply(lambda x: float(x) * w[0])
df['b_PCA'] = df['b'].apply(lambda x: float(x) * w[1])
df['w2_PCA'] = df['c'].apply(lambda x: float(x) * w[2])
Result:
a b c d e f g w1_PCA b_PCA w2_PCA
0 1 2 2 2 2 2 2 -0.0 4.0 -4.0
1 2 3 3 3 3 3 3 -0.0 6.0 -6.0
2 3 4 4 4 4 4 4 -0.0 8.0 -8.0
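If the intent in the original loop was to scale the actual column values rather than a single cell, selecting the column by its label sidesteps the TypeError entirely. A sketch, with hypothetical weights w standing in for the real PCA output:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [2, 3, 4]})
w = np.array([0.5, 2.0, -1.0])  # hypothetical weights, not the real PCA result

i = 0
# df[df.columns[i]] is the column's values; df.columns[i] alone is just its name
df['w1_PCA'] = df[df.columns[i]] * w[0]
df['b_PCA'] = df[df.columns[i + 1]] * w[1]
df['w2_PCA'] = df[df.columns[i + 2]] * w[2]
print(df)
```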

Pandas conditional map/fill/replace

d1=pd.DataFrame({'x':['a','b','c','c'],'y':[-1,-2,-3,0]})
d2=pd.DataFrame({'x':['d','c','a','b'],'y':[0.1,0.2,0.3,0.4]})
I want to replace d1.y where y<0 with the corresponding y in d2. It's something like VLOOKUP in Excel. The core problem is replacing y according to x, rather than simply manipulating y. What I want is
Out[40]:
x y
0 a 0.3
1 b 0.4
2 c 0.2
3 c 0.0
Use Series.map with condition:
s = d2.set_index('x')['y']
d1.loc[d1.y < 0, 'y'] = d1['x'].map(s)
print (d1)
x y
0 a 0.3
1 b 0.4
2 c 0.2
3 c 0.0
You can also try this, but note that it aligns on row index rather than on x, so it only gives the desired result when the rows of d2 happen to be in the same x order as d1 (they are not in this example):
d1.loc[d1.y < 0, 'y'] = d2.loc[d1.y < 0, 'y']
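An equivalent one-liner with Series.where keeps the non-negative entries and falls back to the x-based lookup everywhere else:

```python
import pandas as pd

d1 = pd.DataFrame({'x': ['a', 'b', 'c', 'c'], 'y': [-1, -2, -3, 0]})
d2 = pd.DataFrame({'x': ['d', 'c', 'a', 'b'], 'y': [0.1, 0.2, 0.3, 0.4]})

s = d2.set_index('x')['y']
# where() keeps values satisfying the condition and replaces the rest
d1['y'] = d1['y'].where(d1['y'] >= 0, d1['x'].map(s))
print(d1)
```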

constrain a series or array to a range of values

I have a series of values that I want to have constrained to be within +1 and -1.
s = pd.Series(np.random.randn(10000))
I know I can use apply, but is there a simple vectorized approach?
s_ = s.apply(lambda x: min(max(x, -1), 1))
s_.head()
0 -0.256117
1 0.879797
2 1.000000
3 -0.711397
4 -0.400339
dtype: float64
Use clip:
s = s.clip(-1,1)
Example Input:
s = pd.Series([-1.2, -0.5, 1, 1.1])
0 -1.2
1 -0.5
2 1.0
3 1.1
Example Output:
0 -1.0
1 -0.5
2 1.0
3 1.0
You can use the between Series method:
In [11]: s[s.between(-1, 1)]
Out[11]:
0 -0.256117
1 0.879797
3 -0.711397
4 -0.400339
5 0.667196
...
Note: This discards the values outside of the between range.
Use nested np.where
pd.Series(np.where(s < -1, -1, np.where(s > 1, 1, s)))
One more suggestion:
s[s<-1] = -1
s[s>1] = 1
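A quick check that the three approaches (clip, nested np.where, boolean assignment) agree on the same input:

```python
import numpy as np
import pandas as pd

s = pd.Series([-1.2, -0.5, 1.0, 1.1])

clipped = s.clip(-1, 1)
nested = pd.Series(np.where(s < -1, -1, np.where(s > 1, 1, s)))

masked = s.copy()
masked[masked < -1] = -1
masked[masked > 1] = 1

print(clipped.tolist())  # [-1.0, -0.5, 1.0, 1.0]
```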

Interpolate on the fly to get previous valid entry from pandas DataFrame

If I have an indexed pandas.DataFrame like this:
>>> Dxz = pandas.DataFrame({"x": [False,False,True], "z": [0,2,0], "p": [0.4,0.2,1]})
>>> Dxz.set_index(["x","z"], inplace=True)
>>> Dxz
p
x z
False 0 0.4
2 0.2
True 0 1.0
How do I get it to return me the value for p given a valid index tuple, and the value of the previous present index tuple if the index is not valid? For example, assuming it was a method “lookup_or_interpolate”, I'd like to see something like this:
>>> Dxz.lookup_or_interpolate((False, 0))["p"]
0.4
>>> Dxz.lookup_or_interpolate((False, 1))["p"]
0.4
>>> Dxz.lookup_or_interpolate((True, 23))["p"]
1.0
>>> Dxz
p
x z
False 0 0.4
2 0.2
True 0 1.0
use reindex:
import pandas as pd
Dxz = pd.DataFrame({"x": [False,False,True], "z": [0,2,0], "p": [0.4,0.2,1]})
Dxz.set_index(["x","z"], inplace=True)
print(Dxz.reindex(pd.MultiIndex.from_tuples([(False, 0), (False, 1), (False, 100), (True, 23)]), method="ffill"))
output:
p
False 0 0.4
1 0.4
100 0.2
True 23 1.0
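The reindex trick can be wrapped into the lookup_or_interpolate helper the question sketches (the name is the asker's, not a pandas API):

```python
import pandas as pd

def lookup_or_interpolate(df, key):
    # reindex to the requested key, forward-filling from the previous valid entry
    return df.reindex(pd.MultiIndex.from_tuples([key]), method="ffill").iloc[0]

Dxz = pd.DataFrame({"x": [False, False, True], "z": [0, 2, 0], "p": [0.4, 0.2, 1]})
Dxz.set_index(["x", "z"], inplace=True)

print(lookup_or_interpolate(Dxz, (False, 1))["p"])  # falls back to (False, 0)
```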
