Remove specific elements from a numpy array - python

I have an np.array from which I would like to remove specific elements based on the element "name" rather than its index. Is this sort of thing possible with np.delete()?
Namely, my original ndarray is
textcodes = data['CODES'].unique()
which captures the unique text codes for a given quarter.
Specifically, I want to remove certain codes that I need to run through a separate process and put into a separate ndarray:
sep_list = np.array(['SPCFC_CODE_1','SPCFC_CODE_2','SPCFC_CODE_3','SPCFC_CODE_4'])
I'm having trouble removing the codes in sep_list from textcodes because I don't know where those codes will be indexed; the positions change each quarter, so I'd like to automate the removal based on the specific names, which will always be the same.
Any help is greatly appreciated. Thank you.

You should be able to do something like:
import numpy as np
data = [3,2,1,0,10,5]
bad_list = [1, 2]
data = np.asarray(data)
new_list = np.asarray([x for x in data if x not in bad_list])
print("BAD")
print(data)
print("GOOD")
print(new_list)
Yields:
BAD
[ 3  2  1  0 10  5]
GOOD
[ 3  0 10  5]
It is impossible to tell for sure since you did not provide sample data, but the following implementation using your variables should work:
import numpy as np
textcodes = data['CODES'].unique()
sep_list = np.array(['SPCFC_CODE_1','SPCFC_CODE_2','SPCFC_CODE_3','SPCFC_CODE_4'])
final_list = [x for x in textcodes if x not in sep_list]
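If you'd rather stay entirely in numpy, np.isin with invert=True should give the same result as the list comprehension above; a minimal sketch, assuming textcodes is a 1-D array of code strings:
import numpy as np
# keep only the codes that are NOT in sep_list
final_list = textcodes[np.isin(textcodes, sep_list, invert=True)]
# and, if needed, the codes to send through the separate process
sep_codes = textcodes[np.isin(textcodes, sep_list)]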

Related

Print Pandas without dtype

I've read a few other posts about this but the other solutions haven't worked for me. I'm trying to look at 2 different CSV files and compare data from 1 column from each file. Here's what I have so far:
import pandas as pd
import numpy as np
dataBI = pd.read_csv("U:/eu_inventory/EO BI Orders.csv")
dataOrderTrimmed = dataBI.iloc[:,1:2].values
dataVA05 = pd.read_csv("U:/eu_inventory/VA05_Export.csv")
dataVAOrder = dataVA05.iloc[:,1:2].values
dataVAList = []
ordersInBoth = []
ordersInBI = []
ordersInVA = []
for order in np.nditer(dataOrderTrimmed):
    if order in dataVAOrder:
        ordersInBoth.append(order)
    else:
        ordersInBI.append(order)
So if the order number from dataOrderTrimmed is also in dataVAOrder, I want to add it to ordersInBoth; otherwise I want to add it to ordersInBI. I think it splits the information correctly, but if I try to print ordersInBoth, each item prints as array(5555555, dtype=int64). I want a list of the order numbers, not arrays, and without the dtype information. Let me know if you need more information or if the way I've typed it out is confusing. Thanks!
The way you're using .iloc gives you a DataFrame, which becomes a 2-D array when you access .values. If you just want the values in the column at index 1, then you should just say:
dataOrderTrimmed = dataBI.iloc[:, 1].values
Then you can iterate over dataOrderTrimmed directly (i.e. you don't need nditer), and you will get regular scalar values.
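Putting that together with the file paths from the question, a minimal sketch (assuming the order numbers live in the column at index 1 of both files) might look like:
import pandas as pd
dataBI = pd.read_csv("U:/eu_inventory/EO BI Orders.csv")
dataVA05 = pd.read_csv("U:/eu_inventory/VA05_Export.csv")
biOrders = dataBI.iloc[:, 1].values         # 1-D array of scalar order numbers
vaOrders = set(dataVA05.iloc[:, 1].values)  # a set makes membership tests fast
ordersInBoth = [order for order in biOrders if order in vaOrders]
ordersInBI = [order for order in biOrders if order not in vaOrders]
print(ordersInBoth)  # plain order numbers, no array(...) wrapper or dtype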

DataFrame is empty, expected data in it

I want to find duplicate items between 2 columns in Excel. So for example my Excel file consists of:
  list_A  list_B
0  ideal   ideal
1  brown  colour
2   blue    blew
3    red     red
I checked the pandas documentation and tried the duplicated() method, but I simply don't know why it keeps saying "DataFrame is empty". It finds both columns, and I guess it iterates over them, so why doesn't it find the values and compare them?
I also tried using iterrows but honestly don't know how to implement it.
When running the code I get this output:
Empty DataFrame
Columns: [list_A, list_B]
Index: []
import pandas as pd
pt = pd.read_excel(r"C:\Users\S531\Desktop\pt.xlsx")
dfObj = pd.DataFrame(pt)
doubles = dfObj[dfObj.duplicated()]
print(doubles)
The output I'm looking for is:
  list_A  list_B
0  ideal   ideal
3    red     red
Final solved code looks like this:
import pandas as pd
pt = pd.read_excel(r"C:\Users\S531\Desktop\pt.xlsx")
doubles = pt[pt['list_A'] == pt['list_B']]
print(doubles)
The term "duplicate" is usually used to mean rows that are exact duplicates of previous rows (see the documentation of pd.DataFrame.duplicate).
What you are looking for is just the rows where these two columns are equal. For that, you want:
doubles = pt[pt['list_A'] == pt['list_B']]
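A small self-contained example (rebuilding the question's data inline rather than reading it from Excel) shows the difference:
import pandas as pd
df = pd.DataFrame({'list_A': ['ideal', 'brown', 'blue', 'red'],
                   'list_B': ['ideal', 'colour', 'blew', 'red']})
# duplicated() flags rows that repeat an earlier row -- there are none here,
# which is why the original attempt printed an empty DataFrame
print(df[df.duplicated()])
# comparing the two columns element-wise keeps the rows you want
print(df[df['list_A'] == df['list_B']])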

Pulling elements in order based on first element using key array

I'm looking for a vectorized approach for the following problem:
Suppose I have two arrays: one with a bunch of non-contiguous ids in the first column and some data in the remaining columns, and a second array indicating which data lines I need to pull:
data_array = np.array([[101,4],[102,7],[201,2],[203,9],[403,12]])
key_array = np.array([101,403,201])
The output must stay in the order given by the key_array, leading to the following:
output_array = np.array([[101,4],[403,12],[201,2]])
I can easily do this through a list comprehension:
output_array = np.array([data_array[i==data_array[:,0]][0] for i in key_array])
but this is not a vectorized solution. Using numpy's isin() comes very close to working, but it does not preserve the given order:
data_array[np.isin(data_array[:,0],key_array)]
#[[101 4]
# [201 2] not the order given by the key_array!
# [403 12]]
I tried to make the above work with argsort(), but I haven't been able to get anything working. Any help would be greatly appreciated.
We can use np.searchsorted -
s = data_array[:,0].argsort()
out = data_array[s[np.searchsorted(data_array[:,0],key_array,sorter=s)]]
If the first column of data_array is already sorted, this simplifies to a one-liner -
out = data_array[np.searchsorted(data_array[:,0],key_array)]
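For reference, running the general version on the sample arrays from the question gives the desired order:
import numpy as np
data_array = np.array([[101, 4], [102, 7], [201, 2], [203, 9], [403, 12]])
key_array = np.array([101, 403, 201])
s = data_array[:, 0].argsort()
out = data_array[s[np.searchsorted(data_array[:, 0], key_array, sorter=s)]]
print(out)
# [[101   4]
#  [403  12]
#  [201   2]]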

How to print column in python array?

I have an array with 3 numbers per row, 4 rows deep.
I am struggling to figure out how I can write the code to print all numbers from a specified column rather than from a row.
I have searched for tutorials that explain this easily and just cannot find any that have helped.
Can anyone point me in the right direction?
If you're thinking of Python lists as rows and columns, it's probably better to use numpy arrays (if you're not already). Then you can print the various rows and columns easily, e.g.
import numpy as np
a = np.array([[1,2,6],[4,5,8],[8,3,5],[6,5,4]])
#Print first column
print(a[:,0])
#Print second row
print(a[1,:])
Note that otherwise you have a list of lists and you'd need to use something like,
b = [[1,2,6],[4,5,8],[8,3,5],[6,5,4]]
print([i[0] for i in b])
You can do this:
>>> a = [[1,2,3],[1,1,1],[2,1,1],[4,1,2]]
>>> print([row[0] for row in a])
[1, 1, 2, 4]

how to read a file, sort based on specified column

I am trying to convert all my code, written in MATLAB, to Python. I have a problem and I couldn't find a way to solve it. Maybe someone has an idea.
I have a file which has m rows and two columns. I want to read the file and then sort it based on the second column. Then, from the first column of the sorted data (first row to 1000th row), I must find the values larger than a threshold (here, for example, 0.2) and sum them.
Hope someone has an idea.
Thanks
If the fields are separated by tabs and the rows by newlines, for example, the problem is quite simple:
with open("filename.csv") as f:
    data = [list(map(float, line.split("\t"))) for line in f]
data.sort(key=lambda x: x[1])
result = sum(x[0] for x in data[:1000] if x[0] > 0.2)
Consider using NumPy arrays and their accompanying functions. They are (usually) quite similar to those in MATLAB, which might make your conversion from the latter easier.
import numpy as np
data = np.genfromtxt("filename.csv", delimiter="\t", dtype=float)
idx = np.argsort(data[:, 1])   # sort order given by the second column
data1000 = data[idx[:1000]]    # first 1000 rows of the sorted data
result = np.sum(data1000[data1000[:, 0] > 0.2, 0])
