Python pandas: flatten with arrays in column

I have a pandas DataFrame with one column containing arrays. I'd like to "flatten" it by repeating the values of the other columns for each element of the arrays.
I managed to do it by building a temporary list of values while iterating over every row, but that is "pure Python" and slow.
Is there a way to do this in pandas/NumPy? In other words, I'm trying to improve the flatten function in the example below.
Thanks a lot.
import pandas as pd

toConvert = pd.DataFrame({
    'x': [1, 2],
    'y': [10, 20],
    'z': [(101, 102, 103), (201, 202)]
})

def flatten(df):
    tmp = []
    def backend(r):
        x = r['x']
        y = r['y']
        zz = r['z']
        for z in zz:
            tmp.append({'x': x, 'y': y, 'z': z})
    df.apply(backend, axis=1)
    return pd.DataFrame(tmp)
print(flatten(toConvert).to_string(index=False))
Which gives:
 x   y    z
 1  10  101
 1  10  102
 1  10  103
 2  20  201
 2  20  202

Here's a NumPy based solution -
np.column_stack((toConvert[['x','y']].values.repeat(
    list(map(len, toConvert.z)), axis=0), np.hstack(toConvert.z)))
(Note: on Python 3, the map call must be materialized with list before being passed to repeat.)
Sample run -
In [78]: toConvert
Out[78]:
   x   y                z
0  1  10  (101, 102, 103)
1  2  20       (201, 202)
In [79]: np.column_stack((toConvert[['x','y']].values.repeat(
    ...:     list(map(len, toConvert.z)), axis=0), np.hstack(toConvert.z)))
Out[79]:
array([[  1,  10, 101],
       [  1,  10, 102],
       [  1,  10, 103],
       [  2,  20, 201],
       [  2,  20, 202]])
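If you want the result back as a DataFrame rather than a raw array, you can wrap it up again; here's a small self-contained sketch of the same approach (reusing the original column names, and materializing the tuple lengths as a list so it works on Python 3):
import numpy as np
import pandas as pd

toConvert = pd.DataFrame({
    'x': [1, 2],
    'y': [10, 20],
    'z': [(101, 102, 103), (201, 202)]
})

# length of each tuple in z, one repeat count per row
lens = [len(t) for t in toConvert.z]
arr = np.column_stack((toConvert[['x', 'y']].values.repeat(lens, axis=0),
                       np.hstack(toConvert.z)))

# wrap the stacked array back into a DataFrame
out = pd.DataFrame(arr, columns=['x', 'y', 'z'])
print(out)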

You need numpy.repeat with str.len to create columns x and y; for z, chain the tuples together:
import pandas as pd
import numpy as np
from itertools import chain

df = pd.DataFrame({
    "x": np.repeat(toConvert.x.values, toConvert.z.str.len()),
    "y": np.repeat(toConvert.y.values, toConvert.z.str.len()),
    "z": list(chain.from_iterable(toConvert.z))})
print(df)
   x   y    z
0  1  10  101
1  1  10  102
2  1  10  103
3  2  20  201
4  2  20  202
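For completeness: pandas 0.25 and later ship a built-in for exactly this operation, DataFrame.explode, which repeats the other columns for each element of a list-like column:
# pandas >= 0.25
out = toConvert.explode('z').reset_index(drop=True)
out['z'] = out['z'].astype(int)  # explode leaves the exploded column as object dtype
print(out)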


Find nearest index in one dataframe to another

I am new to Python and its libraries. I searched all the forums but could not find a proper solution. This is the first time I'm posting a question here, so sorry if I did something wrong.
I have two DataFrames like the ones below, containing X Y Z coordinates (UTM) and other features.
In [2]: a = {
   ...:     'X': [1, 2, 5, 7, 10, 5, 2, 3, 24, 21],
   ...:     'Y': [3, 4, 8, 15, 20, 12, 23, 22, 14, 7],
   ...:     'Z': [12, 4, 9, 16, 13, 1, 8, 17, 11, 19],
   ...: }
   ...:
In [3]: b = {
   ...:     'X': [1, 8, 20, 7, 32],
   ...:     'Y': [6, 4, 17, 45, 32],
   ...:     'Z': [52, 12, 6, 8, 31],
   ...: }
In [4]: df1 = pd.DataFrame(data=a)
In [5]: df2 = pd.DataFrame(data=b)
In [6]: print(df1)
    X   Y   Z
0   1   3  12
1   2   4   4
2   5   8   9
3   7  15  16
4  10  20  13
5   5  12   1
6   2  23   8
7   3  22  17
8  24  14  11
9  21   7  19
In [7]: print(df2)
    X   Y   Z
0   1   6  52
1   8   4  12
2  20  17   6
3   7  45   8
4  32  32  31
I need to find the closest point (by distance) in df1 to each point of df2 and create a new DataFrame from the results.
So I wrote the code below, which does find the closest point (distance) to df2.iloc[0].
In [8]: x = (
   ...:     np.sqrt(
   ...:         ((df1['X'].sub(df2["X"].iloc[0]))**2)
   ...:         .add(((df1['Y'].sub(df2["Y"].iloc[0]))**2))
   ...:         .add(((df1['Z'].sub(df2["Z"].iloc[0]))**2))
   ...:     )
   ...: ).idxmin()
In [9]: x1 = df1.iloc[[x]]
In[10]: print(x1)
   X   Y   Z
3  7  15  16
So, I guess I need a loop to iterate through df2 and apply the above code to each row, producing a new DataFrame with the closest point in df1 to each point of df2. But I couldn't make it work. Please advise.
This is actually a great example of a case where numpy's broadcasting rules have distinct advantages over pandas.
Manually aligning df1's coordinates as column vectors (by referencing df1[[col]].to_numpy()) and df2's coordinates as row vectors (df2[col].to_numpy()), we can get the distance from every element in each dataframe to each element in the other very quickly with automatic broadcasting:
In [26]: dists = np.sqrt(
    ...:     (df1[['X']].to_numpy() - df2['X'].to_numpy()) ** 2
    ...:     + (df1[['Y']].to_numpy() - df2['Y'].to_numpy()) ** 2
    ...:     + (df1[['Z']].to_numpy() - df2['Z'].to_numpy()) ** 2
    ...: )
In [27]: dists
Out[27]:
array([[40.11234224,  7.07106781, 24.35159132, 42.61455151, 46.50806382],
       [48.05205511, 10.        , 22.29349681, 41.49698784, 49.12229636],
       [43.23193264,  5.83095189, 17.74823935, 37.06750599, 42.29657197],
       [37.58989226, 11.74734012, 16.52271164, 31.04834939, 33.74907406],
       [42.40283009, 16.15549442, 12.56980509, 25.67099531, 30.85449724],
       [51.50728104, 13.92838828, 16.58312395, 33.7934905 , 45.04442252],
       [47.18050445, 20.32240143, 19.07878403, 22.56102835, 38.85871846],
       [38.53569774, 19.33907961, 20.85665361, 25.01999201, 33.7194306 ],
       [47.68647607, 18.89444363,  7.07106781, 35.48239   , 28.0713377 ],
       [38.60051813, 15.06651917, 16.43167673, 41.96427052, 29.83286778]])
Argmin will now give you the correct vector of positional indices:
In [28]: dists.argmin(axis=0)
Out[28]: array([3, 2, 8, 6, 8])
Or, to select the appropriate values from df1:
In [29]: df1.iloc[dists.argmin(axis=0)]
Out[29]:
    X   Y   Z
3   7  15  16
2   5   8   9
8  24  14  11
6   2  23   8
8  24  14  11
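Note that row 8 of df1 is (correctly) selected twice, so the result keeps duplicate index labels. If you'd rather have the result indexed like df2, a one-line addition (mine, not part of the original answer) takes care of it:
closest = df1.iloc[dists.argmin(axis=0)].reset_index(drop=True)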
Edit
An answer popped up just after mine, then was deleted, which made reference to scipy.spatial.distance_matrix, computing dists with:
distance_matrix(df1[list('XYZ')].to_numpy(), df2[list('XYZ')].to_numpy())
Not sure why that answer was deleted, but this seems like a really nice, clean approach to getting the array I produced manually above!
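A minimal sketch of that approach, assuming df1 and df2 as defined in the question:
from scipy.spatial import distance_matrix

# pairwise distances, shape (len(df1), len(df2))
dists = distance_matrix(df1[list('XYZ')].to_numpy(), df2[list('XYZ')].to_numpy())
print(df1.iloc[dists.argmin(axis=0)])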
Performance Note
Note that if you are just trying to get the closest value, there's no need to take the square root, as this is a costly operation compared to addition, subtraction, and powers, and sorting on dist**2 is still valid.
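A sketch of the same selection computed on squared distances (it picks identical rows, just without the sqrt):
# broadcast (len(df1), 1, 3) against (1, len(df2), 3), then sum over coordinates
diff = df1[list('XYZ')].to_numpy()[:, None, :] - df2[list('XYZ')].to_numpy()[None, :, :]
sq_dists = (diff ** 2).sum(axis=-1)       # shape (len(df1), len(df2))
print(df1.iloc[sq_dists.argmin(axis=0)])  # same rows as with np.sqrt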
First, you define a function that returns the closest point using numpy.where. Then you use the apply function to run through df2.
import pandas as pd
import numpy as np

a = {
    'X': [1, 2, 5, 7, 10, 5, 2, 3, 24, 21],
    'Y': [3, 4, 8, 15, 20, 12, 23, 22, 14, 7],
    'Z': [12, 4, 9, 16, 13, 1, 8, 17, 11, 19]
}
b = {
    'X': [1, 8, 20, 7, 32],
    'Y': [6, 4, 17, 45, 32],
    'Z': [52, 12, 6, 8, 31]
}
df1 = pd.DataFrame(a)
df2 = pd.DataFrame(b)

dist = lambda dx, dy, dz: np.sqrt(dx**2 + dy**2 + dz**2)

def closest(row):
    darr = dist(df1['X'] - row['X'], df1['Y'] - row['Y'], df1['Z'] - row['Z'])
    idx = np.where(darr == np.amin(darr))[0][0]
    return df1['X'][idx], df1['Y'][idx], df1['Z'][idx]

df2['closest'] = df2.apply(closest, axis=1)
print(df2)
Output:
    X   Y   Z       closest
0   1   6  52   (7, 15, 16)
1   8   4  12     (5, 8, 9)
2  20  17   6  (24, 14, 11)
3   7  45   8    (2, 23, 8)
4  32  32  31  (24, 14, 11)
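If you'd rather have the closest point as three separate columns instead of one tuple column, one common pattern is the following (the cX/cY/cZ column names are my own choice):
df2[['cX', 'cY', 'cZ']] = pd.DataFrame(df2['closest'].tolist(), index=df2.index)
print(df2.drop(columns='closest'))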

fastest way to create a matrix whose cols are a product of each other

Suppose I have a matrix X, with n columns. I want to create a new matrix, Y such that each column of Y is a product of two different columns of X.
Currently, I am using a double loop, something like this (not my actual code, but it captures the essence):
Y = np.empty((X.shape[0], int(n * (n - 1) / 2)))
cnt = 0
for j1 in range(0, n - 1):
    for j2 in range(j1 + 1, n):
        Y[:, cnt] = X[:, j1] * X[:, j2]
        cnt += 1
I was wondering if anyone knows a faster way to generate (populate) matrix Y than this double loop? For instance, is there any function in NumPy that can be used to build such a matrix quickly?
Since you are looking for combinations of columns without repetition (i.e. col 0 * col 1 is the same as col 1 * col 0), I would use itertools since the combination is over something relatively smaller (the indices):
>>> from itertools import combinations
>>> x = np.arange(24).reshape(6,4)
>>> list(combinations(range(x.shape[1]), 2))  # For illustrative purposes. We want all pairs of different columns.
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
>>> np.vstack([x[:, i]*x[:, j] for i, j in combinations(range(x.shape[1]), 2)]).T
array([[  0,   0,   0,   2,   3,   6],
       [ 20,  24,  28,  30,  35,  42],
       [ 72,  80,  88,  90,  99, 110],
       [156, 168, 180, 182, 195, 210],
       [272, 288, 304, 306, 323, 342],
       [420, 440, 460, 462, 483, 506]])
Using broadcasting (which, depending on your input, might be faster):
Z = X.T[:,None]*X.T
output = Z[np.triu_indices(X.shape[1],k=1)].T
example input/output:
X = np.arange(24).reshape(6,4)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]]
output:
[[  0   0   0   2   3   6]
 [ 20  24  28  30  35  42]
 [ 72  80  88  90  99 110]
 [156 168 180 182 195 210]
 [272 288 304 306 323 342]
 [420 440 460 462 483 506]]
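To convince yourself the two answers agree (and to time them on your own data), here's a quick sanity-check sketch with random input; shapes are arbitrary, and timings will depend on your matrix:
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = rng.random((500, 20))

def loop_version(X):
    n = X.shape[1]
    Y = np.empty((X.shape[0], n * (n - 1) // 2))
    for cnt, (j1, j2) in enumerate(combinations(range(n), 2)):
        Y[:, cnt] = X[:, j1] * X[:, j2]
    return Y

def broadcast_version(X):
    Z = X.T[:, None] * X.T                        # all pairwise column products
    return Z[np.triu_indices(X.shape[1], k=1)].T  # keep each unordered pair once

assert np.allclose(loop_version(X), broadcast_version(X))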

How to aggregate two largest values per group in pandas?

I was going through this link: Return top N largest values per group using pandas
and found multiple ways to find the topN values per group.
However, I prefer the dictionary method with the agg function, and I would like to know whether it is possible to get the equivalent of the dictionary method for the following problem.
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 1],
                   'C': [10, 20, 30, 40, 50],
                   'D': ['X', 'Y', 'X', 'Y', 'Y']})
print(df)
   A  B   C  D
0  1  1  10  X
1  1  1  20  Y
2  1  2  30  X
3  2  2  40  Y
4  2  1  50  Y
I can do this:
df1 = df.groupby(['A'])['C'].nlargest(2).droplevel(-1).reset_index()
print(df1)
   A   C
0  1  30
1  1  20
2  2  50
3  2  40
# also this
df1 = df.sort_values('C', ascending=False).groupby('A', sort=False).head(2)
print(df1)
# also this
df.set_index('C').groupby('A')['B'].nlargest(2).reset_index()
Required
df.groupby('A', as_index=False).agg(
    {'C': lambda ser: ser.nlargest(2)}  # something like this
)
Is it possible to use the dictionary here?
If you want to get a dictionary mapping each value of A to its 2 top values from C, you can run:
df.groupby(['A'])['C'].apply(lambda x: x.nlargest(2).tolist()).to_dict()
For your DataFrame, the result is:
{1: [30, 20], 2: [50, 40]}
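As for the dictionary form the question asks about: on recent pandas versions, agg accepts a lambda that reduces each group to a list, so something very close to the requested syntax works (hedged: handling of non-scalar agg results has varied across pandas versions):
out = df.groupby('A').agg({'C': lambda ser: ser.nlargest(2).tolist()})
print(out)
#           C
# A
# 1  [30, 20]
# 2  [50, 40]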

How to get a pandas dataframe where the columns are the subsequent n elements from another dataframe column?

A very simple example just for understanding.
I have the following pandas dataframe:
import pandas as pd
df = pd.DataFrame({'A':pd.Series([1, 2, 13, 14, 25, 26, 37, 38])})
df
    A
0   1
1   2
2  13
3  14
4  25
5  26
6  37
7  38
Set n = 3
First example
How to get a new dataframe df1 (in an efficient way), like the following:
   D1  D2  D3   T
0   1   2  13  14
1   2  13  14  25
2  13  14  25  26
3  14  25  26  37
4  25  26  37  38
Hint: think of the first n columns as the data (Dx) and the last column as the target (T). In the 1st example the target (e.g. 25) depends on the preceding n elements (2, 13, 14).
Second example
What if the target is some elements ahead (e.g. +3)?
   D1  D2  D3   T
0   1   2  13  26
1   2  13  14  37
2  13  14  25  38
Thank you for your help,
Gilberto
P.S. If you think the title can be improved, please suggest how to modify it.
Update
Thanks to @Divakar and this post, the rolling function can be defined as:
import numpy as np
def rolling(a, window):
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
a = np.arange(1000000000)
b = rolling(a, 4)
In less than 1 second!
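Note that NumPy 1.20+ ships a safe built-in wrapper around the same trick, so the shape/stride bookkeeping can be left to the library:
import numpy as np

a = np.arange(1000000000)
b = np.lib.stride_tricks.sliding_window_view(a, 4)  # read-only view, no copy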
Let's see how we can solve it with NumPy tools. Imagine you have the column data as a NumPy array; call it a. For such sliding windowed operations, NumPy strides are a very efficient tool, since they create views into the input array without actually making copies.
Let's directly use the methods with the sample data and start with case #1 -
In [29]: a # Input data
Out[29]: array([ 1, 2, 13, 14, 25, 26, 37, 38])
In [30]: m = a.strides[0] # Get strides
In [31]: n = 3 # parameter
In [32]: nrows = a.size - n # Get number of rows in o/p
In [33]: a2D = np.lib.stride_tricks.as_strided(a,shape=(nrows,n+1),strides=(m,m))
In [34]: a2D
Out[34]:
array([[ 1,  2, 13, 14],
       [ 2, 13, 14, 25],
       [13, 14, 25, 26],
       [14, 25, 26, 37],
       [25, 26, 37, 38]])
In [35]: np.may_share_memory(a,a2D)
Out[35]: True # a2D is a view into a
Case #2 would be similar with an additional parameter for the Target column -
In [36]: n2 = 3 # Additional param
In [37]: nrows = a.size - n - n2 + 1
In [38]: part1 = np.lib.stride_tricks.as_strided(a,shape=(nrows,n),strides=(m,m))
In [39]: part1 # These are D1, D2, D3, etc.
Out[39]:
array([[ 1,  2, 13],
       [ 2, 13, 14],
       [13, 14, 25]])
In [43]: part2 = a[n+n2-1:] # This is target col
In [44]: part2
Out[44]: array([26, 37, 38])
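To get the exact frame from the question's second example, the two pieces can be glued together; a small sketch reusing part1 and part2 from above (pd.DataFrame copies the strided view, which is fine here):
import pandas as pd

out = pd.DataFrame(part1, columns=['D1', 'D2', 'D3'])
out['T'] = part2
print(out)
#    D1  D2  D3   T
# 0   1   2  13  26
# 1   2  13  14  37
# 2  13  14  25  38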
I found another method: view_as_windows
import numpy as np
from skimage.util.shape import view_as_windows
window_shape = (4, )
aa = np.arange(1000000000) # 1 billion!
bb = view_as_windows(aa, window_shape)
bb
array([[        0,         1,         2,         3],
       [        1,         2,         3,         4],
       [        2,         3,         4,         5],
       ...,
       [999999994, 999999995, 999999996, 999999997],
       [999999995, 999999996, 999999997, 999999998],
       [999999996, 999999997, 999999998, 999999999]])
Around 1 second.
What do you think?
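One thing worth checking: like the as_strided version, view_as_windows returns a view rather than a copy, which is why it is instant even for a billion elements. A quick check on a small array:
import numpy as np
from skimage.util.shape import view_as_windows

aa = np.arange(10)
bb = view_as_windows(aa, (4,))
print(np.shares_memory(aa, bb))  # True: no data was copied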

Replacing row values in pandas

I would like to replace row values in pandas.
For example:
import pandas as pd
import numpy as np
a = np.array(([100, 100, 101, 101, 102, 102],
              np.arange(6)))
pd.DataFrame(a.T)
Result:
array([[100,   0],
       [100,   1],
       [101,   2],
       [101,   3],
       [102,   4],
       [102,   5]])
Here, I would like to replace the rows with the values [101, 3] with [200, 10] and the result should therefore be:
array([[100,   0],
       [100,   1],
       [101,   2],
       [200,  10],
       [102,   4],
       [102,   5]])
Update
In a more general case I would like to replace multiple rows.
Therefore the old and new row values are represented by n×2 matrices (n is the number of rows to replace). For example:
old_vals = np.array(([[101, 3]],
                     [[100, 0]],
                     [[102, 5]]))
new_vals = np.array(([[200, 10]],
                     [[300, 20]],
                     [[400, 30]]))
And the result is:
array([[300,  20],
       [100,   1],
       [101,   2],
       [200,  10],
       [102,   4],
       [400,  30]])
For the single row case:
In [35]:
df.loc[(df[0]==101) & (df[1]==3)] = [[200,10]]
df
Out[35]:
     0   1
0  100   0
1  100   1
2  101   2
3  200  10
4  102   4
5  102   5
For the multiple-row case the following would work:
In [60]:
a = np.array(([100, 100, 101, 101, 102, 102],
              [0, 1, 3, 3, 3, 4]))
df = pd.DataFrame(a.T)
df
Out[60]:
     0  1
0  100  0
1  100  1
2  101  3
3  101  3
4  102  3
5  102  4
In [61]:
df.loc[(df[0]==101) & (df[1]==3)] = 200,10
df
Out[61]:
     0   1
0  100   0
1  100   1
2  200  10
3  200  10
4  102   3
5  102   4
For a multi-row update like the one you propose, the following works when each replacement target matches a single row. First construct a dict whose keys are the old values to search for and whose values are the new replacement rows:
In [78]:
old_keys = [tuple(x) for x in old_vals.reshape(-1, 2)]  # reshape drops the extra nesting
replace_vals = dict(zip(old_keys, new_vals.reshape(-1, 2)))
replace_vals
Out[78]:
{(100, 0): array([300, 20]),
 (101, 3): array([200, 10]),
 (102, 5): array([400, 30])}
We can then iterate over the dict and then set the rows using the same method as my first answer:
In [93]:
for k, v in replace_vals.items():
    df.loc[(df[0] == k[0]) & (df[1] == k[1])] = [[v[0], v[1]]]
df
Out[93]:
     0   1
0  300  20
1  100   1
2  101   2
3  200  10
4  102   4
5  400  30
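The same update can be written without the intermediate dict by pairing the old and new rows directly; a sketch (again, the reshape flattens the extra bracket level in the question's arrays):
for old, new in zip(old_vals.reshape(-1, 2), new_vals.reshape(-1, 2)):
    df.loc[(df[0] == old[0]) & (df[1] == old[1])] = list(new)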
The simplest way should be this one:
df.loc[[3],0:1] = 200,10
In this case, 3 is the row label (here, the fourth row) of the data frame, while 0 and 1 are the columns.
This code instead lets you iterate over each row, check its content, and replace it with what you want. Note that the rows yielded by iterrows are copies, so the update must be written back through df.loc:
target = [101, 3]
mod = [200, 10]
for index, row in df.iterrows():
    if row[0] == target[0] and row[1] == target[1]:
        df.loc[index, 0] = mod[0]
        df.loc[index, 1] = mod[1]
print(df)
Replace 'A' with 1 and 'B' with 2:
df = df.replace(['A', 'B'], [1, 2])
This is applied over the entire DataFrame, no matter the column.
However, we can target a single column this way:
df[column] = df[column].replace(['A', 'B'], [1, 2])
Another possibility is:
import io

a = np.array(([100, 100, 101, 101, 102, 102],
              np.arange(6)))
df = pd.DataFrame(a.T)

string = df.to_string(header=False, index=False, index_names=False)
dictionary = {'100 0': '300 20',
              '101 3': '200 10',
              '102 5': '400 30'}

def replace_all(text, dic):
    for i, j in dic.items():
        text = text.replace(i, j)
    return text

string = replace_all(string, dictionary)
# header=None is needed because to_string was written without a header row
df = pd.read_csv(io.StringIO(string), delim_whitespace=True, header=None)
I found this solution better since, when dealing with a large amount of data to replace, it takes less time than EdChum's solution.
