How to reshape a dataframe with identical column names - python

I want to reshape this table:
Back Vowels  x       x       x       y       y      y
a:           -26.69  -40.06  -49.59  -15.56  -7.5   -11.89
o:           ...     ...     ...     ...     ...    ...
to the following format:
Back Vowels  x       y
a:           -26.69  -15.56
a:           -40.06  -7.5
a:           -49.59  -11.89
o:           ...     ...
What pandas function should I use?
Thank you!:)
I am having trouble formulating the right question. I looked into pivot_table(), melt(), and stack(), but they seem to do something else.

I don't know exactly how you ended up with multiple columns with the same name, so I assumed they come from some kind of merge.
The following input reproduces that situation:
df = pd.DataFrame({'Back Vowels': ["a:", "o:"],
'x': [-26.69,"..."],
'y': [-15.56,"..."],})
df2 = pd.DataFrame({'Back Vowels': ["a:", "o:"],
'x': [-40.06,"..."],
'y': [-11.89,"..."],})
df3 = pd.DataFrame({'Back Vowels': ["a:", "o:"],
'x': [-49.59,"..."],
'y': [-7.5,"..."],})
df = pd.merge(pd.merge(df,df2,'inner','Back Vowels'),df3,'inner', 'Back Vowels')
df= df.rename(columns={"x_x": "x", "x_y": "x", "y_x": "y" , "y_y": "y"})
Output:
Back Vowels x y x y x y
0 a: -26.69 -15.56 -40.06 -11.89 -49.59 -7.5
1 o: ... ... ... ... ... ...
This function will merge columns with same names together:
def same_merge(x): return ','.join(x[x.notnull()].astype(str))
Apply the function to each group of same-named columns:
df_new = df.groupby(level=0, axis=1).apply(lambda x: x.apply(same_merge, axis=1))
Split the joined strings back into lists:
df_new['x'] = df_new['x'].str.split(",")
df_new['y'] = df_new['y'].str.split(",")
Finally explode the columns to get desired output:
df_final = df_new.explode(list('xy')).reset_index(drop=True)
Final output:
Back Vowels x y
0 a: -26.69 -15.56
1 a: -40.06 -11.89
2 a: -49.59 -7.5
3 o: ... ...
4 o: ... ...
5 o: ... ...

TLDR
Transpose the dataframe, aggregate x and y into lists, then transpose back and explode:
df = df.set_index('Back Vowels').T.groupby(level=0).agg(list).T.explode(['x', 'y'])
# x y
# Back Vowels
# a: -26.69 -15.56
# a: -40.06 -7.5
# a: -49.59 -11.89
# o: 6.69 5.56
# o: 0.06 0.5
# o: 9.59 1.89
Details
Given this dataframe:
df = pd.DataFrame(
data=[['a:', -26.69, -40.06, -49.59, -15.56, -7.5, -11.89], ['o:', 6.69, 0.06, 9.59, 5.56, 0.5, 1.89]],
columns=['Back Vowels', *'xxx', *'yyy'],
)
# Back Vowels x x x y y y
# 0 a: -26.69 -40.06 -49.59 -15.56 -7.5 -11.89
# 1 o: 6.69 0.06 9.59 5.56 0.5 1.89
Work with the transposed dataframe:
df = df.set_index('Back Vowels').T
# Back Vowels a: o:
# x -26.69 6.69
# x -40.06 0.06
# x -49.59 9.59
# y -15.56 5.56
# y -7.50 0.50
# y -11.89 1.89
Group by the index (level 0) and aggregate x and y into lists:
df = df.groupby(level=0).agg(list)
# Back Vowels a: o:
# x [-26.69, -40.06, -49.59] [6.69, 0.06, 9.59]
# y [-15.56, -7.5, -11.89] [5.56, 0.5, 1.89]
Then transpose back and explode the x and y lists into rows:
df = df.T.explode(['x', 'y'])
# x y
# Back Vowels
# a: -26.69 -15.56
# a: -40.06 -7.5
# a: -49.59 -11.89
# o: 6.69 5.56
# o: 0.06 0.5
# o: 9.59 1.89
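If the transposes feel hard to follow, here is another minimal sketch that gets the same rows by pairing the i-th x column with the i-th y column and stacking the pieces. It assumes there are as many x columns as y columns and that they appear in matching order; the names keys, xs, ys, pieces and result are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['a:', -26.69, -40.06, -49.59, -15.56, -7.5, -11.89],
          ['o:', 6.69, 0.06, 9.59, 5.56, 0.5, 1.89]],
    columns=['Back Vowels', *'xxx', *'yyy'],
)

keys = df['Back Vowels']
xs = df['x']  # selecting a duplicated name returns a DataFrame of all 'x' columns
ys = df['y']

# One small frame per (x, y) pair, picked by position.
pieces = [
    pd.DataFrame({'Back Vowels': keys, 'x': xs.iloc[:, i], 'y': ys.iloc[:, i]})
    for i in range(xs.shape[1])
]

# Stack the pieces, then use a stable sort on the original row index so that
# the rows of each vowel stay together in their original pair order.
result = (
    pd.concat(pieces)
    .sort_index(kind='mergesort')
    .reset_index(drop=True)
)
print(result)
```

Unlike the explode approach, the x and y columns keep their float dtype here instead of becoming object columns.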

Related

Applying/Composing a function N times to a pandas column, N being different for each row

Suppose we have this simple pandas.DataFrame:
import pandas as pd
df = pd.DataFrame(
columns=['quantity', 'value'],
data=[[1, 12.5], [3, 18.0]]
)
>>> print(df)
quantity value
0 1 12.5
1 3 18.0
I would like to create a new column, say modified_value, that applies a function N times to the value column, N being the quantity column.
Suppose that function is new_value = round(value/2, 1), the expected result would be:
quantity value modified_value
0 1 12.5 6.2 # applied 1 time
1 3 18.0 2.2 # applied 3 times, 18.0 -> 9.0 -> 4.5 -> 2.2
What would be an elegant/vectorized way to do so?
You can write a custom repeat function, then use apply:
def repeat(func, x, n):
    ret = x
    for i in range(int(n)):
        ret = func(ret)
    return ret
def my_func(val): return round(val/2, 1)
df['new_col'] = df.apply(lambda x: repeat(my_func, x['value'], x['quantity']),
                         axis=1)
# or without apply
# df['new_col'] = [repeat(my_func, v, n) for v,n in zip(df['value'], df['quantity'])]
Use reduce:
from functools import reduce
def repeated(f, n):
    def rfun(p):
        return reduce(lambda x, _: f(x), range(n), p)
    return rfun
def myfunc(value): return round(value/2, 1)
df['modified_valued'] = df.apply(lambda x: repeated(myfunc,
                                                    int(x['quantity']))(x['value']),
                                 axis=1)
We can also use a list comprehension instead of apply:
df['modified_valued'] = [repeated(myfunc, int(quantity))(value)
                         for quantity, value in zip(df['quantity'], df['value'])]
Output
quantity value modified_valued
0 1 12.5 6.2
1 3 18.0 2.2
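Since the question asked for something vectorized, here is a minimal sketch of one way to do it, assuming the function can be expressed with NumPy operations (np.round stands in for Python's round here, and the two can differ in edge cases). It loops over the number of steps rather than over the rows, and on each step only updates the rows that still have applications left. The column name modified_value comes from the question; the variable names are just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'quantity': [1, 3], 'value': [12.5, 18.0]})

# Work on a float copy so the original 'value' column is left untouched.
result = df['value'].to_numpy(dtype=float).copy()
n = df['quantity'].to_numpy()

# One pass per step; rows whose quantity is already used up are masked out.
# Assumes the frame is not empty (n.max() would fail otherwise).
for step in range(int(n.max())):
    active = n > step
    result[active] = np.round(result[active] / 2, 1)

df['modified_value'] = result
print(df)
```

The loop runs max(quantity) times instead of once per row, so it should only pay off when the frame has many rows and the quantities are small.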

For loop to iterate the operation

import numpy as np
import pandas as pd
from scipy.spatial.distance import directed_hausdorff
df:
1 1.1 2 2.1 3 3.1 4 4.1
45.13 7.98 45.10 7.75 45.16 7.73 NaN NaN
45.35 7.29 45.05 7.68 45.03 7.96 45.05 7.65
Distance calculated for one pair:
x = df['3']
y = df['3.1']
P = np.array([x, y])
q = df['4']
w = df['4.1']
Q = np.array([q, w])
Q_final = list(zip(Q[0], Q[1]))
P_final = list(zip(P[0], P[1]))
directed_hausdorff(P_final, Q_final)[0]
Desired output:
The same process, in a for loop, over the whole dataset:
distance from a['0'] to a['0'] is 0
from a['0'] to a['1'] is 0.234 (some number)
from a['0'] to a['2'] is ...
From [0] to all the others, then from [1] to all the others, and so on.
In the end I should get a matrix with 0s on the diagonal.
I have tried:
space = list(df.index)
dist = []
for j in space:
    for k in space:
        if k != j:
            dist.append((j, k, directed_hausdorff(P_final, Q_final)[0]))
But I get the same value every time: the distance between [3] and [4].
I am not entirely sure what you are trying to do, but based on how you calculated the first one, here is a possible solution:
import pandas as pd
import numpy as np
from scipy.spatial.distance import directed_hausdorff
df = pd.read_csv('something.csv')
groupby = lambda l, n: [tuple(l[i:i+n]) for i in range(0, len(l), n)]
values = groupby(df.columns.values, 2)
matrix = np.zeros((4, 4))
for Ps in values:
    x = df[str(Ps[0])]
    y = df[str(Ps[1])]
    P = np.array([x, y])
    for Qs in values:
        q = df[str(Qs[0])]
        w = df[str(Qs[1])]
        Q = np.array([q, w])
        Q_final = list(zip(Q[0], Q[1]))
        P_final = list(zip(P[0], P[1]))
        matrix[values.index(Ps), values.index(Qs)] = directed_hausdorff(P_final, Q_final)[0]
print(matrix)
Output:
[[0. 0.49203658 0.47927028 0.46861498]
[0.31048349 0. 0.12083046 0.1118034 ]
[0.25179357 0.22135944 0. 0.31064449]
[0.33955854 0.03 0.13601471 0. ]]
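The loop in the question returns the same number every time because P_final and Q_final are built once, outside the loop, and never change. Below is a rough sketch of the pairwise idea that sizes the matrix from the data instead of hard-coding (4, 4). Two things in it are my own assumptions: rows containing NaN are dropped per column pair, and directed_distance is a hypothetical plain-NumPy stand-in for the directed distance (it is not scipy's function), used only so the snippet runs without scipy. You could call directed_hausdorff(P, Q)[0] in its place.

```python
import numpy as np
import pandas as pd

def directed_distance(P, Q):
    """For each point in P take the distance to its nearest point in Q,
    then return the largest of those distances."""
    diffs = P[:, None, :] - Q[None, :, :]        # shape (len(P), len(Q), 2)
    dists = np.sqrt((diffs ** 2).sum(axis=2))    # pairwise Euclidean distances
    return dists.min(axis=1).max()

# The two rows shown in the question.
df = pd.DataFrame(
    [[45.13, 7.98, 45.10, 7.75, 45.16, 7.73, np.nan, np.nan],
     [45.35, 7.29, 45.05, 7.68, 45.03, 7.96, 45.05, 7.65]],
    columns=['1', '1.1', '2', '2.1', '3', '3.1', '4', '4.1'],
)

# Each consecutive pair of columns is treated as the (x, y) points of one set.
# Assumption: rows with a NaN in either column are dropped for that set.
curves = [
    df.iloc[:, i:i + 2].dropna().to_numpy()
    for i in range(0, df.shape[1], 2)
]

n = len(curves)
matrix = np.zeros((n, n))
for i, P in enumerate(curves):
    for j, Q in enumerate(curves):
        matrix[i, j] = directed_distance(P, Q)

print(matrix)
```

The diagonal comes out as 0 because every point of a set is its own nearest neighbour. The matrix is generally not symmetric, since the directed distance from P to Q differs from the one from Q to P.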

While loop for iterating all combinations between two values

I want to create a loop that loads all the combinations of two variables into a dataframe in separate columns. I want variable "a" to hold values between 0 and 1 in 0.1 increments, and the same for variable "b". In other words, there should be 100 iterations when complete, starting with 0 & 0 and ending with 1 & 1.
I've tried the following code
data = [['Decile 1', 10], ['Decile_2', 15], ['Decile_3', 14]]
staging_table = pd.DataFrame(data, columns = ['Decile', 'Volume'])
profile_table = pd.DataFrame(columns = ['Decile', 'Volume'])
a = 0
b = 0
finished = False
while not finished:
    if b != 1:
        if a != 1:
            a = a + 0.1
            staging_table['CAM1_Modifier'] = a
            staging_table['CAM2_Modifier'] = b
            profile_table = profile_table.append(staging_table)
        else:
            b = b + 0.1
    else:
        finished = True
profile_table
You can use itertools.product to get all the combinations:
import itertools
import pandas as pd
x = [i / 10 for i in range(11)]
df = pd.DataFrame(
list(itertools.product(x, x)),
columns=["a", "b"]
)
# a b
# 0 0.0 0.0
# 1 0.0 0.1
# 2 0.0 0.2
# ... ... ...
# 118 1.0 0.8
# 119 1.0 0.9
# 120 1.0 1.0
#
# [121 rows x 2 columns]
itertools is your friend.
from itertools import product
for a, b in product(map(lambda x: x / 10, range(11)),
                    map(lambda x: x / 10, range(11))):
    ...
range(11) gives us the integers from 0 to 10 (regrettably, range does not accept floats). Then we divide those values by 10 to get your range from 0 to 1. Finally, we take the Cartesian product of that iterable with itself to get every combination.
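It may also help to see why the while loop in the question never stops: adding 0.1 ten times does not land exactly on 1.0 in floating point, so a != 1 stays true. The sketch below shows that, and then builds what I assume the profile table was meant to be, namely every (CAM1_Modifier, CAM2_Modifier) pair attached to every row of staging_table. It uses a cross join, which needs pandas 1.2 or later; the name grid is just for illustration.

```python
import itertools
import pandas as pd

# Repeated addition drifts: after ten steps `a` is close to 1.0 but not equal.
a = 0.0
for _ in range(10):
    a += 0.1
print(a == 1.0)  # False

staging_table = pd.DataFrame(
    [['Decile 1', 10], ['Decile_2', 15], ['Decile_3', 14]],
    columns=['Decile', 'Volume'],
)

# Build the steps from integers so that 0.0 and 1.0 are hit exactly.
steps = [i / 10 for i in range(11)]
grid = pd.DataFrame(
    list(itertools.product(steps, steps)),
    columns=['CAM1_Modifier', 'CAM2_Modifier'],
)

# Cross join: every modifier pair is combined with every staging row.
profile_table = grid.merge(staging_table, how='cross')
print(profile_table.shape)
```

With 11 steps per variable there are 121 modifier pairs, so the result has 121 * 3 = 363 rows.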

How can I remove the first element from a list with pandas.read_csv?

I have:
X = pd.read_csv(
"data/train.csv", header=0, usecols=['Type', 'Age', 'Breed1', 'Breed2', 'Gender', 'Color1', 'Color2', 'Color3', 'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed', 'Sterilized', 'Health', 'Quantity', 'Fee', 'VideoAmt', 'PhotoAmt'])
Y = pd.read_csv(
"data/train.csv", header=0, usecols=['AdoptionSpeed'])
print(Y)
This gives:
AdoptionSpeed
0 2
1 0
2 3
3 2
4 2
5 2
6 1
7 3
I assume the first column is the index, and the second is the AdoptionSpeed. I want to then map over the values, but when I do something like:
Y = map(lambda y: float(y) / 4, Y)
I get an error:
ValueError: could not convert string to float: AdoptionSpeed
So how do I remove that first row? Or better yet - is there a better way to map?
Use:
Y = map(lambda y: float(y) / 4, Y['AdoptionSpeed'].tolist())
to make it work.
Even better:
Y = Y['AdoptionSpeed'].apply(lambda y: float(y) / 4)
When working with pandas, do not use map like this. Use a column-wise operation, or pandas's apply.
Something like this for the division:
# cast type
Y['AdoptionSpeed'] = Y['AdoptionSpeed'].astype(float)
# divide by 4, assign to a new column
Y['AdoptionSpeed_4'] = Y['AdoptionSpeed'] / 4
# or apply
Y['AdoptionSpeed_4'] = Y['AdoptionSpeed'].apply(lambda v: v / 4)
More like
df.AdoptionSpeed.map(lambda x : x/4)
Out[52]:
0 0.50
1 0.00
2 0.75
3 0.50
4 0.50
5 0.50
6 0.25
7 0.75
Name: AdoptionSpeed, dtype: float64
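To see where the original error comes from: iterating over a DataFrame yields its column labels, not its rows, so the built-in map was handed the string 'AdoptionSpeed'. There is no header row to remove. A small sketch, using the eight values printed in the question in place of the CSV file (the names labels and scaled are mine):

```python
import pandas as pd

# Stand-in for the frame read from data/train.csv in the question.
Y = pd.DataFrame({'AdoptionSpeed': [2, 0, 3, 2, 2, 2, 1, 3]})

# Iterating a DataFrame gives the column labels.
labels = list(Y)
print(labels)  # ['AdoptionSpeed']

# Plain arithmetic on the column is vectorized and needs no map or apply.
scaled = Y['AdoptionSpeed'] / 4
print(scaled.tolist())
```

Dividing an integer column by 4 already produces floats, so the explicit float() cast is not needed either.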

Python Pandas: Passing arguments to a function in agg()

I am trying to reduce data in a pandas dataframe by using different kind of functions and argument values. However, I did not manage to change the default arguments in the aggregation functions. Here is an example:
>>> df = pd.DataFrame({'x': [1,np.nan,2,1],
... 'y': ['a','a','b','b']})
>>> df
x y
0 1.0 a
1 NaN a
2 2.0 b
3 1.0 b
Here is an aggregation function, for which I would like to test different values of b:
>>> def translate_mean(x, b=10):
...     y = [elem + b for elem in x]
...     return np.mean(y)
In the following code, I can use this function with the default b value, but I would like to pass other values:
>>> df.groupby('y').agg(translate_mean)
x
y
a NaN
b 11.5
Any ideas?
Just pass as arguments to agg (this works with apply, too).
df.groupby('y').agg(translate_mean, b=4)
Out:
x
y
a NaN
b 5.5
Maybe you can try using apply in this case:
df.groupby('y').apply(lambda x: translate_mean(x['x'], 20))
Now the result is:
y
a NaN
b 21.5
In case you have multiple columns and want to apply different functions and different parameters to each column, you can use a lambda function with agg.
For example:
>>> df = pd.DataFrame({'x': [1,np.nan,2,1],
...                    'y': ['a','a','b','b'],
...                    'z': [0.1,0.2,0.3,0.4]})
>>> df
x y z
0 1.0 a 0.1
1 NaN a 0.2
2 2.0 b 0.3
3 1.0 b 0.4
>>> def translate_mean(x, b=10):
...     y = [elem + b for elem in x]
...     return np.mean(y)
To group by column 'y' and apply translate_mean with b=10 for column 'x' and b=25 for column 'z', you can try this:
df_res = df.groupby(by='y').agg({
    'x': lambda x: translate_mean(x, 10),
    'z': lambda x: translate_mean(x, 25)})
Hopefully, it helps.
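A variation on the same idea, as a sketch: named aggregation lets you pick the output column names while still wrapping translate_mean in a lambda to set b per column. The output names x_shift10 and z_shift25 are made up for this example, and z is assumed to be numeric.

```python
import numpy as np
import pandas as pd

def translate_mean(x, b=10):
    # Shift every element by b, then average; a NaN in the group propagates.
    y = [elem + b for elem in x]
    return np.mean(y)

df = pd.DataFrame({'x': [1, np.nan, 2, 1],
                   'y': ['a', 'a', 'b', 'b'],
                   'z': [0.1, 0.2, 0.3, 0.4]})

# Each keyword is an output column: (input column, aggregation function).
res = df.groupby('y').agg(
    x_shift10=('x', lambda s: translate_mean(s, b=10)),
    z_shift25=('z', lambda s: translate_mean(s, b=25)),
)
print(res)
```

Group a still gives NaN for x because of the missing value; swapping np.mean for np.nanmean inside translate_mean would be one way to ignore it, if that is what you want.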
