Numpy get maximum value based on XYZ - python

I'm trying to read a CSV file with some XYZ data, but when gridding it with Python Natgrid I get an error: two input triples have the same x/y coordinates. Here is my array:
np.array([[41.540588, -100.348335, 0.052785],
          [41.540588, -100.348335, 0.053798],
          [42.540588, -102.348335, 0.021798],
          [42.540588, -102.348335, 0.022798],
          [43.540588, -103.348335, 0.031798]])
I want to remove the XY duplicates and keep only the maximum Z value for each pair. Based on the example above, the rows with the smaller Z should be dropped, leaving:
np.array([[41.540588, -100.348335, 0.053798],
          [42.540588, -102.348335, 0.022798],
          [43.540588, -103.348335, 0.031798]])
I have tried using np.unique, but so far I haven't had any luck because it doesn't work with rows (only columns).

Here is a numpy way: sort by Z in descending order, find the index of the first occurrence of each unique X/Y pair, and index with those positions:
a = np.array([[41.540588, -100.348335, 0.052785],
              [41.540588, -100.348335, 0.053798],
              [42.540588, -102.348335, 0.021798],
              [42.540588, -102.348335, 0.022798],
              [43.540588, -103.348335, 0.031798]])
# sort by Z in descending order so the largest Z comes first for each X/Y pair
b = a[np.argsort(a[:, 2])[::-1]]
# get the first index for each unique (X, Y) pair
u = np.unique(b[:, :2], return_index=True, axis=0)[1]
# index with those positions
c = b[u]
>>> c
array([[ 4.15405880e+01, -1.00348335e+02,  5.37980000e-02],
       [ 4.25405880e+01, -1.02348335e+02,  2.27980000e-02],
       [ 4.35405880e+01, -1.03348335e+02,  3.17980000e-02]])

If you are able to use pandas, you can take advantage of groupby and max:
>>> pandas.DataFrame(arr).groupby([0,1], as_index=False).max().values
array([[ 4.15405880e+01, -1.00348335e+02,  5.37980000e-02],
       [ 4.25405880e+01, -1.02348335e+02,  2.27980000e-02],
       [ 4.35405880e+01, -1.03348335e+02,  3.17980000e-02]])
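A self-contained version of that one-liner, assuming arr is the (n, 3) XYZ array from the question:
import numpy as np
import pandas as pd

arr = np.array([[41.540588, -100.348335, 0.052785],
                [41.540588, -100.348335, 0.053798],
                [42.540588, -102.348335, 0.021798],
                [42.540588, -102.348335, 0.022798],
                [43.540588, -103.348335, 0.031798]])

# group on the X and Y columns (labels 0 and 1) and take the maximum Z per group
res = pd.DataFrame(arr).groupby([0, 1], as_index=False).max().values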

You can use Pandas via sorting and dropping duplicates:
import pandas as pd
df = pd.DataFrame(arr)
res = df.sort_values(2, ascending=False)\
        .drop_duplicates([0, 1])\
        .sort_values(0).values
print(res)
array([[ 4.15405880e+01, -1.00348335e+02,  5.37980000e-02],
       [ 4.25405880e+01, -1.02348335e+02,  2.27980000e-02],
       [ 4.35405880e+01, -1.03348335e+02,  3.17980000e-02]])

Related

Remove first values repeated in an array... Python, Numpy, Pandas, Arrays

I have this NumPy array result (final), and I want to reduce it: if a value is repeated, I want to delete its first occurrence and keep the second, third, and later occurrences.
import hmac
import hashlib
import time
from argparse import _MutuallyExclusiveGroup
from tkinter import *
import pandas as pd
import base64
import matplotlib.pyplot as plt
import numpy as np
key="800070FF00FF08012"
key=bytes(key,'utf-8')
collision=[]
for x in range(1, 1000001):
    msg = bytes(f'{x}', 'utf-8')
    digest = hmac.new(key, msg, "sha256").digest()
    code = base64.b64encode(digest).decode('utf-8')
    code = code[:6]
    key = key.replace(key, digest)
    collision.append(code)
df=pd.DataFrame(collision)
df=df[df.duplicated(keep=False)]
df_index=df.index.to_numpy()
df=df.values.flatten()
final=np.stack((df_index,df),axis=1)
Results of the variable "final":
I HAVE:
[[14093 'JRp1kX']
[43985 'KGlW7X']
[59212 'pU97Tr']
[90668 'ecTjTB']
[140615 'JRp1kX']
[218480 '25gtjT']
[344174 'dtXg6E']
[380467 'DdHQ3M']
[395699 'vnFw/c']
[503504 'dtXg6E']
[531073 'KGlW7X']
[633091 'ecTjTB']
[671091 'vnFw/c']
[672111 '25gtjT']
[785568 'pU97Tr']
[991540 'DdHQ3M']
[991548 'JRp1kX']]
And I WANT TO HAVE:
[[140615 'JRp1kX']
[503504 'dtXg6E']
[531073 'KGlW7X']
[633091 'ecTjTB']
[671091 'vnFw/c']
[672111 '25gtjT']
[785568 'pU97Tr']
[991540 'DdHQ3M']
[991548 'JRp1kX']]
Eliminating the first values that were repeated in the array.
Does someone have some code that could work for my case?
In simpler terms: if you have the list [1,2,3,4,5,1,3,5,5],
I would like to end up with [2,4,1,3,5,5].
df = pd.DataFrame([1, 2, 3, 4, 5, 1, 3, 5, 5])
# keep the unique rows
unique_mask = ~df.duplicated(keep=False)
# keep the repeated rows (skipping the first for each non-unique)
repeated_mask = df.duplicated()
df.loc[unique_mask | repeated_mask]
   0
1  2
3  4
5  1
6  3
7  5
8  5
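The same pair of masks can be applied to the two-column final array from the question; here is a minimal sketch, using a small stand-in array (the real one comes from your HMAC loop) and de-duplicating on column 1, which holds the codes:
import numpy as np
import pandas as pd

# stand-in for the question's `final` array
final = np.array([[14093, 'JRp1kX'],
                  [43985, 'KGlW7X'],
                  [140615, 'JRp1kX'],
                  [531073, 'KGlW7X'],
                  [991548, 'JRp1kX']], dtype=object)

df = pd.DataFrame(final)
unique_mask = ~df.duplicated(subset=1, keep=False)  # codes that appear only once
repeated_mask = df.duplicated(subset=1)             # 2nd, 3rd, ... occurrences
result = df.loc[unique_mask | repeated_mask].to_numpy()
# [[140615 'JRp1kX'] [531073 'KGlW7X'] [991548 'JRp1kX']]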
final is a numpy array, so you can use np.unique on the second column to get the indices of first occurrences and the number of occurrences, which lets you avoid deleting values that appear only once:
_, idx, counts = np.unique(final[:, 1], return_index=True, return_counts=True)
idx = idx[counts > 1]
final = np.delete(final, idx, axis=0)
This works on the 2-D array; for your simpler 1-D list example, use
_, idx, counts = np.unique(final, return_index=True, return_counts=True)
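For instance, a quick check against the simplified example from the question:
import numpy as np

final = np.array([1, 2, 3, 4, 5, 1, 3, 5, 5])
_, idx, counts = np.unique(final, return_index=True, return_counts=True)
idx = idx[counts > 1]           # first occurrences of values that repeat
print(np.delete(final, idx))    # [2 4 1 3 5 5]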
Maybe you could use a for loop.
to_remove = list()
for i in range(len(your_list)):
    # mark this index if the same value shows up again later in the list
    if your_list[i] in your_list[i + 1:]:
        to_remove.append(i)

removed_count = 0
for i in to_remove:
    del your_list[i - removed_count]
    removed_count += 1
You cannot delete inside the first loop, because i would already have moved on to the next position, so an element would be skipped every time you delete one. The [i - removed_count] is needed because every time you delete a lower index, all higher indexes shift down by one. It could probably be written more efficiently, but this should work, maybe with small changes.
After you generate df, add the following lines:
df=pd.DataFrame(collision)
# ... your code ends here
removed_already=[]
for idx in df[df.duplicated(keep=False)].index:
    if df.loc[idx][0] not in removed_already:
        removed_already.append(df.loc[idx][0])
        df.drop(index=idx, inplace=True)
# your code continues
df_index=df.index.to_numpy()
df=df.values.flatten()
final=np.stack((df_index,df),axis=1)

Sort numpy string array using positional data

I have a numpy array of strings
names = array([
'p00x00', 'p01x00', 'p02x00', 'p03x00', 'p04x00', 'p05x00',
'p00x01', 'p01x01', 'p02x01', 'p03x01', 'p04x01', 'p05x01',
'p00x02', 'p01x02', 'p02x02', 'p03x02', 'p04x02', 'p05x02',
'p00x03', 'p01x03', 'p02x03', 'p03x03', 'p04x03', 'p05x03',
'p00x04', 'p01x04', 'p02x04', 'p03x04', 'p04x04', 'p05x04',
'p00x05', 'p01x05', 'p02x05', 'p03x05', 'p04x05', 'p05x05'])
And corresponding position data
X = array([2.102235, 2.094113, 2.086038, 2.077963, 2.069849, 2.061699])
Y = array([-7.788431, -7.780364, -7.772306, -7.764247, -7.756188, -7.748114])
How can I sort names using X and Y such that I get out a sorted grid of names with shape (6, 6)? Note that there are essentially 6 unique X and Y positions -- I'm not just arbitrarily choosing 6x6.
names = array([
['p00x00', 'p01x00', 'p02x00', 'p03x00', 'p04x00', 'p05x00'],
['p00x01', 'p01x01', 'p02x01', 'p03x01', 'p04x01', 'p05x01'],
['p00x02', 'p01x02', 'p02x02', 'p03x02', 'p04x02', 'p05x02'],
['p00x03', 'p01x03', 'p02x03', 'p03x03', 'p04x03', 'p05x03'],
['p00x04', 'p01x04', 'p02x04', 'p03x04', 'p04x04', 'p05x04'],
['p00x05', 'p01x05', 'p02x05', 'p03x05', 'p04x05', 'p05x05']])
I realize in this case that I could simply reshape the array, but in general the data will not work out this neatly.
You can use numpy.argsort to get the indices that would sort an array. You can then use these indices to reorder your names array.
import numpy as np
names = np.array([
'p00x00', 'p01x00', 'p02x00', 'p03x00', 'p04x00', 'p05x00',
'p00x01', 'p01x01', 'p02x01', 'p03x01', 'p04x01', 'p05x01',
'p00x02', 'p01x02', 'p02x02', 'p03x02', 'p04x02', 'p05x02',
'p00x03', 'p01x03', 'p02x03', 'p03x03', 'p04x03', 'p05x03',
'p00x04', 'p01x04', 'p02x04', 'p03x04', 'p04x04', 'p05x04',
'p00x05', 'p01x05', 'p02x05', 'p03x05', 'p04x05', 'p05x05'])
X = np.array([2.102235, 2.094113, 2.086038, 2.077963, 2.069849, 2.061699])
Y = np.array([-7.788431, -7.780364, -7.772306, -7.764247, -7.756188, -7.748114])
x_order = np.argsort(X)
y_order = np.argsort(Y)
names_ordered = names.reshape(6, 6)[tuple(np.meshgrid(x_order, y_order))]
print(names_ordered)
gives the following output:
[['p00x05' 'p00x04' 'p00x03' 'p00x02' 'p00x01' 'p00x00']
['p01x05' 'p01x04' 'p01x03' 'p01x02' 'p01x01' 'p01x00']
['p02x05' 'p02x04' 'p02x03' 'p02x02' 'p02x01' 'p02x00']
['p03x05' 'p03x04' 'p03x03' 'p03x02' 'p03x01' 'p03x00']
['p04x05' 'p04x04' 'p04x03' 'p04x02' 'p04x01' 'p04x00']
['p05x05' 'p05x04' 'p05x03' 'p05x02' 'p05x01' 'p05x00']]

Creating new pandas columns with original value plus random number in error range

I have a pandas dataframe which has a column 'INTENSITY' and a numpy array of same length containing the error for each intensity. I would like to generate columns with randomly generated numbers in the error range.
So far I use two nested for loops to create the new columns but I feel like this is inefficient:
theor_err = [ sqrt(abs(x)) for x in theor_df[str(INTENSITY)] ]
theor_err = np.asarray(theor_err)
for nr_sample in range(2):
    sample = np.zeros(len(theor_df[str(INTENSITY)]))
    for i, error in enumerate(theor_err):
        sample[i] = theor_df[str(INTENSITY)][i] + random.uniform(-error, error)
    theor_df['gen_{}'.format(nr_sample)] = Series(sample, index=theor_df.index)
theor_df.head()
Is there a more efficient way of approaching a problem like this?
NumPy can handle whole arrays for you, so you can do it like this:
import pandas as pd
import numpy as np
a=pd.DataFrame([10,20,15,30],columns=['INTENSITY'])
a['theor_err']=np.sqrt(np.abs(a.INTENSITY))
a['sample']=np.random.uniform(-a['theor_err'],a['theor_err'])
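If you want several sample columns rather than one, a sketch along the same lines (column names like gen_0 are just illustrative) can broadcast the per-row error bounds across a second axis:
import numpy as np
import pandas as pd

a = pd.DataFrame([10, 20, 15, 30], columns=['INTENSITY'])
a['theor_err'] = np.sqrt(np.abs(a.INTENSITY))

k = 3                                       # number of sample columns
err = a['theor_err'].to_numpy()[:, None]    # shape (n, 1), broadcasts to (n, k)
noise = np.random.uniform(-err, err, size=(len(a), k))
samples = pd.DataFrame(a['INTENSITY'].to_numpy()[:, None] + noise,
                       columns=[f'gen_{i}' for i in range(k)],
                       index=a.index)
a = pd.concat([a, samples], axis=1)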
Suppose you want to generate 6 samples. You can try the code below and tune the number of samples by changing the value of k.
df = pd.DataFrame([[1],[2],[3],[4],[-5]], columns=["intensity"])
k = 6
sample_names = ["sample" + str(i+1) for i in range(k)]
df["err"] = np.sqrt(np.abs((df["intensity"])))
df[sample_names] = pd.DataFrame(
    df["err"].map(lambda x: np.random.uniform(-x, x, k)).values.tolist())
df.loc[:,sample_names] = df.loc[:,sample_names].add(df.intensity, axis=0)

Create nested list from Pandas dataframe

I have a simple pandas dataframe with two columns. I would like to generate a nested list of those two columns.
geo = pd.DataFrame({'lat': [40.672304, 40.777169, 40.712196],
                    'lon': [-73.935385, -73.988911, -73.957649]})
My solution to this problem is the following:
X = [[i] for i in geo['lat'].tolist()]
Y = [i for i in geo['lon'].tolist()]
for key, value in enumerate(X):
    X[key].append(Y[key])
However, I feel there must be a better way than this.
Thanks!
pandas is built on top of numpy. A DataFrame stores its values in a numpy array, which has a tolist method.
>>> geo = pd.DataFrame({'lat': [40.672304, 40.777169, 40.712196],
...                     'lon': [-73.935385, -73.988911, -73.957649]})
>>> geo.values
array([[ 40.672304, -73.935385],
       [ 40.777169, -73.988911],
       [ 40.712196, -73.957649]])
>>> geo.values.tolist()
[[40.672304, -73.935385], [40.777169, -73.988911], [40.712196, -73.957649]]
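On newer pandas (0.24+), the documentation recommends DataFrame.to_numpy() over .values; it gives the same nested list here:
import pandas as pd

geo = pd.DataFrame({'lat': [40.672304, 40.777169, 40.712196],
                    'lon': [-73.935385, -73.988911, -73.957649]})
nested = geo.to_numpy().tolist()
# [[40.672304, -73.935385], [40.777169, -73.988911], [40.712196, -73.957649]]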
How about:
out_list = []
for index, row in geo.iterrows():
    out_list.append([row.lat, row.lon])

Pandas Dataframe or Panel to 3d numpy array

Setup:
pdf = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
pdf['a'][2:]=pdf['a'][0]
pdf['a'][:2]=pdf['a'][1]
pdf.set_index(['a','b'])
output:
                          c         d         e
a        b
0.439502 0.115087  0.832546  0.760513  0.776555
         0.609107  0.247642  0.031650  0.727773
0.995370 0.299640  0.053523  0.565753  0.857235
         0.392132  0.832560  0.774653  0.213692
Each data series is grouped by the index ID a, and b represents a time index for the other features of a. Is there a way to get pandas to produce a numpy 3d array that reflects the a groupings? Currently it reads the data as two dimensional, so pdf.shape outputs (4, 5). What I would like is for the array to have the variable-length form:
array([[[-1.38655912, -0.90145951, -0.95106951,  0.76570984],
        [-0.21004144, -2.66498267, -0.29255182,  1.43411576],
        [-0.21004144, -2.66498267, -0.29255182,  1.43411576]],

       [[ 0.0768149 , -0.7566995 , -2.57770951,  0.70834656],
        [-0.99097395, -0.81592084, -1.21075386,  0.12361382]]])
Is there a native Pandas way to do this? Note that number of rows per a grouping in the actual data is variable, so I cannot just transpose or reshape pdf.values. If there isn't a native way, what's the best method for iteratively constructing the arrays from hundreds of thousands of rows and hundreds of columns?
I just had an extremely similar problem and solved it like this:
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)))
output:
array([[[ 0.47780308,  0.93422319,  0.00526572,  0.41645868,  0.82089215],
        [ 0.47780308,  0.15372096,  0.20948369,  0.76354447,  0.27743855]],

       [[ 0.75146799,  0.39133973,  0.25182206,  0.78088926,  0.30276705],
        [ 0.75146799,  0.42182369,  0.01166461,  0.00936464,  0.53208731]]])
verifying it is 3d, a3d.shape gives (2, 2, 5).
Lastly, to make the newly created dimension the last dimension (instead of the first) then use:
a3d = np.dstack(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)))
which has a shape of (2, 5, 2)
For cases where the data is ragged (as brought up by CharlesG in the comments) you can use something like the following if you want to stick to a numpy solution. But be aware that the best strategy to deal with missing data varies from case to case. In this example we simply add zeros for the missing rows.
Example setup with ragged shape:
pdf = pd.DataFrame(np.random.rand(5,5), columns = list('abcde'))
pdf['a'][2:]=pdf['a'][0]
pdf['a'][:2]=pdf['a'][1]
pdf.set_index(['a','b'])
dataframe:
                          c         d         e
a        b
0.460013 0.577535  0.299304  0.617103  0.378887
         0.167907  0.244972  0.615077  0.311497
0.318823 0.640575  0.768187  0.652760  0.822311
         0.424744  0.958405  0.659617  0.998765
         0.077048  0.407182  0.758903  0.273737
One possible solution:
n_max = pdf.groupby('a').size().max()
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)
                       .apply(lambda x: np.pad(x, ((0, n_max - len(x)), (0, 0)), 'constant'))))
a3d.shape gives (2, 3, 5)
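If as_matrix is no longer available in your pandas version, here is a hedged sketch of the same zero-padding idea with to_numpy() and a plain loop over the groups:
import numpy as np
import pandas as pd

pdf = pd.DataFrame(np.random.rand(5, 5), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']

n_max = pdf.groupby('a').size().max()
a3d = np.array([np.pad(g.to_numpy(), ((0, n_max - len(g)), (0, 0)), 'constant')
                for _, g in pdf.groupby('a')])
# a3d.shape == (2, 3, 5), zero-padded where a group has fewer rows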
as_matrix is deprecated. Here we assume the first key is a, and the groups in a may have different lengths; this method handles both problems.
import pandas as pd
import numpy as np
from typing import List
def make_cube(df: pd.DataFrame, idx_cols: List[str]) -> np.ndarray:
    """Make an array cube from a Dataframe

    Args:
        df: Dataframe
        idx_cols: columns defining the dimensions of the cube

    Returns:
        multi-dimensional array
    """
    assert len(set(idx_cols) & set(df.columns)) == len(idx_cols), 'idx_cols must be subset of columns'
    df = df.set_index(keys=idx_cols)  # don't overwrite a parameter, thus copy!
    idx_dims = [len(level) + 1 for level in df.index.levels]
    idx_dims.append(len(df.columns))
    cube = np.empty(idx_dims)
    cube.fill(np.nan)
    cube[tuple(np.array(df.index.to_list()).T)] = df.values
    return cube
Test:
pdf = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
pdf['a'][2:]=pdf['a'][0]
pdf['a'][:2]=pdf['a'][1]
# a, b must be integer
pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
           .assign(b=lambda df: df.groupby(['a'])['b'].cumcount())
        )
make_cube(pdf1, ['a', 'b']).shape
gives (2, 2, 3).
pdf = pd.DataFrame(np.random.rand(5,5), columns = list('abcde'))
pdf['a'][2:]=pdf['a'][0]
pdf['a'][:2]=pdf['a'][1]
pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
           .assign(b=lambda df: df.groupby(['a'])['b'].cumcount())
        )
make_cube(pdf1, ['a', 'b']).shape
gives (2, 3, 3).
panel.values
will return a numpy array directly. This will by necessity be the highest acceptable dtype, since everything is smushed into a single 3-d numpy array. It will be a new array, not a view of the pandas data (no matter the dtype).
Instead of the deprecated .as_matrix or, alternatively, .values, the pandas documentation recommends using .to_numpy():
'Warning: We recommend using DataFrame.to_numpy() instead.'
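For example, the first answer's approach can be written with the recommended accessor (a sketch, assuming the pdf from the setup above with equal-sized groups):
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.to_numpy)))
# same shape as before, e.g. (2, 2, 5) for the 4-row setup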
