Print numpy array without ellipsis - python

I want to print a numpy array without truncation. I have seen other solutions but those don't seem to work.
Here is the code snippet:
total_list = np.array(total_list)
np.set_printoptions(threshold=np.inf)
print(total_list)
And this is what the output looks like:
22 A
23 G
24 C
25 T
26 A
27 A
28 A
29 G
..
232272 G
232273 T
232274 G
232275 C
232276 T
232277 C
232278 G
232279 T
This is the entire code. I might be making a mistake in type casting.
import csv
import pandas as pd
import numpy as np
seqs = pd.read_csv('BAP_GBS_BTXv2_imp801.hmp.csv')
plts = pd.read_csv('BAP16_PlotPlan.csv')
required_rows = np.array([7,11,14,19,22,31,35,47,50,55,58,63,66,72,74,79,82,87,90,93,99])
total_list = []
for i in range(len(required_rows)):
curr_row = required_rows[i];
print(curr_row)
for j in range(len(plts.RW)):
if(curr_row == plts.RW[j]):
curr_plt = plts.PI[j]
curr_range = plts.RA1[j]
curr_plt = curr_plt.replace("_", "").lower()
if curr_plt in seqs.columns:
new_item = [curr_row,curr_range,seqs[curr_plt]]
total_list.append(new_item)
print(seqs[curr_plt])
total_list = np.array(total_list)
'''
np.savetxt("foo.csv", total_list[:,2], delimiter=',',fmt='%s')
total_list[:,2].tofile('seqs.csv',sep=',',format='%s')
'''
np.set_printoptions(threshold='nan')
print(total_list)

use the following snippet to get no ellipsis.
import numpy
import sys
numpy.set_printoptions(threshold=sys.maxsize)
EDIT:
If you have a pandas.DataFrame use the following snippet to print your array:
def print_full(x):
pd.set_option('display.max_rows', len(x))
print(x)
pd.reset_option('display.max_rows')
Or you can use the pandas.DataFrame.to_string() method to get the desired result.
EDIT':
An earlier version of this post suggested the option below
numpy.set_printoptions(threshold='nan')
Technically, this might work, however, the numpy documentation specifies int and None as allowed types. Reference: https://docs.scipy.org/doc/numpy/reference/generated/numpy.set_printoptions.html.

You can get around the weird Numpy repr/print behavior by changing it to a list:
print list(total_list)
should print out your list of 2-element np arrays.

You are not printing numpy arrays.
Add the following line after the imports:
pd.set_option('display.max_rows', 100000)

#for a 2d array
def print_full(x):
dim = x.shape
pd.set_option('display.max_rows', dim[0])#dim[0] = len(x)
pd.set_option('display.max_columns', dim[1])
print(x)
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')

It appears that as of Python 3, the threshold can no longer be unlimited.
Therefore, the recommended option is:
import numpy
import sys
numpy.set_printoptions(threshold=sys.maxsize)

Related

Python - change numpy (int) array to HH:MM

I'm trying to get my code to take times in the 24hr format (such as 0930 (09:30 AM) or 20h45 (08:45 PM)) and output it as 09:30 and 20:45, respectively.
I tried using datetime and strftime, etc., but I can't use it on Numpy arrays.
I've also tried formatting as Numpy datetime, but I can't seem to get the HH:MM format.
This is my array (example)
[0845 0925 1046 2042 2153]
and I would like to output this:
[08:45 09:25 10:46 20:42 21:53]
Thank you all in advance.
Although I'm not entirely sure what are you trying to acomplish, I think this is the desired output.
For parsing dates you should first use "strptime" to get a datetime object and then "strftime" to get it back into desired string.
You are saying you got numpy arrays, but u got leading zeroes in this example you gave, so I guess it is an np array with defined string dytpe.
Custom functions can be vectorized to work on numpy arrays.
import numpy as np
from datetime import datetime
a=np.array(["0845", "0925", "1046", "2042", "2153"],dtype = str)
def fun(x):
x=datetime.strptime(x,"%H%M")
return datetime.strftime(x,"%H:%M")
vfunc = np.vectorize(fun)
result = vfunc(a)
print(result)
You can leverage pandas.to_datetime():
import numpy as np
import pandas as pd
x = np.array(["0845","0925","1046","2042","2153"])
y = pd.to_datetime(x, format="%H%M").to_numpy()
Outputs:
>>> x
['0845' '0925' '1046' '2042' '2153']
>>> y
['1900-01-01T08:45:00.000000000' '1900-01-01T09:25:00.000000000'
'1900-01-01T10:46:00.000000000' '1900-01-01T20:42:00.000000000'
'1900-01-01T21:53:00.000000000']
More info:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
Assuming that the initial array includes only valid values (eg. values like '08453' or '2500' are not valid), this is a simple solution that does not require importing extra modules:
import numpy
arr = numpy.array(["0845","0925","1046","2042","2153"])
new_arr = []
for x in arr:
elem = x[:2] + ":" + x[2:]
new_arr.append(elem)
print(new_arr)

Improving performance to assign class to x/y coordinates in python

I have following data using python 2.7.17:
import numpy as np
from collections import defaultdict
N = 600000
x = np.random.randint(1,160,N)
y = np.random.randint(1,160,N)
classes = np.random.randint(1,10,N)
data = np.append([x],[y],axis=0)
data = np.append(data,[classes],axis=0)
x/y are coordinates of a map and classes is a corresponding class to these coordinates.
x/y can appear more often with different class values.
I'm getting these data from a previous function.
I need to get one class per x/y coordinates. At the moment I'm counting the times the class is appearing per x/y. I take the class with the highest count. If there are equal counts, the higher class is assigned. I have done it with dictionaries see here:
data_zip = zip(zip(data[0],data[1]),data[2])
dic = defaultdict(list)
for k,v in data_zip:
dic[k].append(v)
for k,v in dic.items():
un,co = np.unique(v,return_counts=True)
value = un[co.argmax()].max()
I need to speed up the process. Don't know how I can reach better performance.
I tried it with dict comprehension, using pandas instead of numpy.
What can I do to speed up the process?
Thanks all
If I understand the goal correctly, using Pandas core functions would probably be faster in this case.
The logic used is:
1.transform to get the count of each (x, y, class) combination;
2.sort by count, then by class to get the highest one in the top;
3.dropduplicates and leave only the top row of the combination.
Here goes:
import numpy as np
import pandas as pd
N = 600000
df = pd.DataFrame()
df['x'] = np.random.randint(1,160,N)
df['y'] = np.random.randint(1,160,N)
df['classes'] = np.random.randint(1,10,N)
df['class_count'] = df.groupby(['x','y'])['classes'].transform('size')
df.sort_values(['class_count','classes'], ascending=False, inplace=True)
df.drop_duplicates(['x', 'y'], inplace=True)
Here is a smaple of the output:
x y classes class_count
2259 89 80 9 46
14854 151 12 9 44
35451 152 42 9 44
That means that the (89,80) point most frequent class is 9 (and it occured 46 times)
if the final array can be of different shape (last previous solution)
i think the fast approach is to use numpy indexed
import numpy as np
import numpy_indexed as npi
N = 600000
data = np.random.randint(1,160,(3, N))
data[2] = np.random.randint(1,10,N)
uniq_data = npi.group_by(data, axis=1).unique
This is pretty fast and full vectorial

Replace cells in a dataframe with a range of values

I have a large dataframe that has certain cells which have values like: <25-27>. Is there a simple way to convert these into something like:25|26|27 ?
Source data frame:
import pandas as pd
import numpy as np
f = {'function':['2','<25-27>','200'],'CP':['<31-33>','210','4001']}
filter = pd.DataFrame(data=f)
filter
Output Required
output = {'function':['2','25|26|27','200'],'CP':['31|32|33','210','4001']}
op = pd.DataFrame(data=output)
op
thanks a lot !
import re
def convert_range(x):
m = re.match("<([0-9]+)+\-([0-9]+)>", x)
if m is None:
return x
s1, s2 = m.groups()
return "|".join([str(s) for s in range(int(s1), int(s2)+1)])
op = filter.applymap(convert_range)

Explain Function Mistake

I managed to write my first function. however I do not understand it :-)
I approached my real problem with a simplified on. See the following code:
import pandas as pd
import matplotlib as plt
import numpy as np
from pyXSteam.XSteam import XSteam
steamTable = XSteam(XSteam.UNIT_SYSTEM_MKS)
T1_T_in = [398,397,395]
T1_p_in = [29,29,29]
T1_mPkt_in = [2.2,3,3.5]
def Power(druck,temp,menge):
H = []
Q = []
for i in range(len(druck)):
H.append(steamTable.h_pt(druck[i],temp[i]))
Q.append(H[i]*menge[i])
return Q
t1Q=Power(T1_p_in,T1_T_in,T1_mPkt_in)
t3Q = Power(T3_p_in,T3_T_in,T3_mPkt_in)
print(t1Q)
print(t3Q)
It works. The real problem now is different in that way that I read the data from an excel file. I got an error message and (according my learnings from this good homepage :-)) I added ".tolist()" in the function and it works. I do not understand why I need to change it to a list? Can anybody explain it to me? Thank you for your help.
import pandas as pd
import matplotlib as plt
import numpy as np
from pyXSteam.XSteam import XSteam
steamTable = XSteam(XSteam.UNIT_SYSTEM_MKS)
pfad="XXX.xlsx"
df = pd.read_excel(pfad)
T1T_in = df.iloc[2:746,1]
T1p_in = df.iloc[2:746,2]
T1mPkt_in = df.iloc[2:746,3]
def Power(druck,temp,menge):
H = []
Q = []
for i in range(len(druck)):
H.append(steamTable.h_pt(druck.tolist()[i],temp.tolist()[i]))
Q.append(H[i]*menge.tolist()[i])
return Q
t1Q=Power(T1p_in,T1T_in,T1mPkt_in)
t1Q[0:10]
The reason your first example works is because you are passing the T1_mPkt_in variable into the menge parameter as a list:
T1_mPkt_in = [2.2,3,3.5]
Your second example is not working because you pass the T1_mPkt_in variable into the menge parameter as a series and not a list:
T1mPkt_in = df.iloc[2:746,3]
If you print out the type of T1_mPkt_in, you will get:
<class 'pandas.core.series.Series'>
In pandas, to convert a series back into a list, you can call .tolist() to store the data in a list so that you can properly index it.

Get the same hash value for a Pandas DataFrame each time

My goal is to get unique hash value for a DataFrame. I obtain it out of .csv file.
Whole point is to get the same hash each time I call hash() on it.
My idea was that I create the function
def _get_array_hash(arr):
arr_hashable = arr.values
arr_hashable.flags.writeable = False
hash_ = hash(arr_hashable.data)
return hash_
that is calling underlying numpy array, set it to immutable state and get hash of the buffer.
INLINE UPD.
As of 08.11.2016, this version of the function doesn't work anymore. Instead, you should use
hash(df.values.tobytes())
See comments for the Most efficient property to hash for numpy array.
END OF INLINE UPD.
It works for regular pandas array:
In [12]: data = pd.DataFrame({'A': [0], 'B': [1]})
In [13]: _get_array_hash(data)
Out[13]: -5522125492475424165
In [14]: _get_array_hash(data)
Out[14]: -5522125492475424165
But then I try to apply it to DataFrame obtained from a .csv file:
In [15]: fpath = 'foo/bar.csv'
In [16]: data_from_file = pd.read_csv(fpath)
In [17]: _get_array_hash(data_from_file)
Out[17]: 6997017925422497085
In [18]: _get_array_hash(data_from_file)
Out[18]: -7524466731745902730
Can somebody explain me, how's that possible?
I can create new DataFrame out of it, like
new_data = pd.DataFrame(data=data_from_file.values,
columns=data_from_file.columns,
index=data_from_file.index)
and it works again
In [25]: _get_array_hash(new_data)
Out[25]: -3546154109803008241
In [26]: _get_array_hash(new_data)
Out[26]: -3546154109803008241
But my goal is to preserve the same hash value for a dataframe across application launches in order to retrieve some value from cache.
As of Pandas 0.20.1, you can use the little known (and poorly documented) hash_pandas_object (source code) which was recently made public in pandas.util. It returns one hash value for reach row of the dataframe (and works on series etc. too)
import pandas as pd
import numpy as np
np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
df = pd.DataFrame(arr)
print(df)
# 0 1 2 3
# 0 42 foo 42 42
# 1 foo foo 42 bar
# 2 42 42 42 42
from pandas.util import hash_pandas_object
h = hash_pandas_object(df)
print(h)
# 0 5559921529589760079
# 1 16825627446701693880
# 2 7171023939017372657
# dtype: uint64
You can always do hash_pandas_object(df).sum() if you want an overall hash of all rows.
Joblib provides a hashing function optimized for objects containing numpy arrays (e.g. pandas dataframes).
import joblib
joblib.hash(df)
I had a similar problem: check if a dataframe is changed and I solved it by hashing the msgpack serialization string. This seems stable among different reloading the same data.
import pandas as pd
import hashlib
DATA_FILE = 'data.json'
data1 = pd.read_json(DATA_FILE)
data2 = pd.read_json(DATA_FILE)
assert hashlib.md5(data1.to_msgpack()).hexdigest() == hashlib.md5(data2.to_msgpack()).hexdigest()
assert hashlib.md5(data1.values.tobytes()).hexdigest() != hashlib.md5(data2.values.tobytes()).hexdigest()
This function seems to work fine:
from hashlib import sha256
def hash_df(df):
s = str(df.columns) + str(df.index) + str(df.values)
return sha256(s.encode()).hexdigest()

Categories

Resources