Hello all, I have a list of delimiter-separated strings:
lists=['1|Abra|23|43|0','2|Cadabra|15|18|0','3|Grabra|4|421|0','4|Lol|1|15|0']
I need to convert it to a numpy array and then sort it just like Excel does: first by Column 3, then by Column 2, and finally by the last column.
I've tried this:
def man():
    a = np.array(lists[0].split('|'))
    for line in lists:
        temp = np.array(line.split('|'))
        a = np.concatenate((a, temp))
    a.sort(order=[0, 1])

man()
Of course, no luck, because it is wrong! Unfortunately I'm not strong in numpy arrays. Can somebody help me, please? :(
This works just perfectly for me, but here numpy builds the array from a file, so to make it work I had to write my list of strings to a file, then read it back and convert it to an array:
import numpy as np
# let numpy guess the type with dtype=None
my_data = np.genfromtxt('Selector/tmp.txt', delimiter='|', dtype=None, names=["Num", "Date", "Desc", "Rgh", "Prc", "Color", "Smb", "MType"])
my_data.sort(order=["Color","Prc", "Rgh"])
# save specifying required format (tab separated values)
print(my_data)
How do I keep everything as is, but change the conversion step so that it builds the same array not from a file but from the list?
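For what it's worth, np.genfromtxt accepts any iterable of strings (each one treated as a line), so presumably the list can be passed in directly without the temp file. A sketch, with placeholder field names I made up for the five columns of the list:

import numpy as np

lists = ['1|Abra|23|43|0', '2|Cadabra|15|18|0', '3|Grabra|4|421|0', '4|Lol|1|15|0']
# each string in the list is treated as one line of input
my_data = np.genfromtxt(lists, delimiter='|', dtype=None,
                        names=["Num", "Name", "ColA", "ColB", "ColC"])
# Column 3, then Column 2, then the last column
my_data.sort(order=["ColA", "Name", "ColC"])
print(my_data)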
There may be better solutions, but for a start I would sort the array once by each column in reverse order.
I assume you want to sort by column 3 and ties are resolved by column 2. Finally, remaining ties are resolved by the last column. Thus, you'd actually sort by the last column first, then by 2, then by 3.
Furthermore, you can easily convert the list to an array using a list comprehension.
import numpy as np
lists=['1|Abra|23|43|0','2|Cadabra|15|18|0','3|Grabra|4|421|0','4|Lol|1|15|0']
# convert to numpy array by splitting each row
a = np.array([l.split('|') for l in lists])
# specify columns to sort by, in order
sort_cols = [3, 2, -1]
# sort by columns in reverse order;
# this only works correctly if the sort is stable, so request
# kind='stable' explicitly (the default quicksort is not)
for sc in sort_cols[::-1]:
    order = np.argsort(a[:, sc], kind='stable')
    a = a[order]
# note: every entry is a string here, so each sort is
# lexicographic ('15' < '421' < '43'), not numeric
print(a)
You can use a list comprehension to split your strings and convert the digits to int. Then create your numpy array with a proper structured dtype, and use the np.sort() function, passing the expected order:
>>> dtype = [('1st', int), ('2nd', '|S7'), ('3rd', int), ('4th', int), ('5th', int)]
>>>
>>> a = np.array([tuple([int(i) if i.isdigit() else i for i in sub.split('|')]) for sub in lists], dtype=dtype)
>>> np.sort(a, axis=0, order=['3rd','2nd', '5th'])
array([(4, 'Lol', 1, 15, 0), (3, 'Grabra', 4, 421, 0),
(2, 'Cadabra', 15, 18, 0), (1, 'Abra', 23, 43, 0)],
dtype=[('1st', '<i8'), ('2nd', 'S7'), ('3rd', '<i8'), ('4th', '<i8'), ('5th', '<i8')])
You can also do this in pure Python, which for shorter data sets is more efficient. You can simply use the sorted() function, passing a proper key function:
from operator import itemgetter
sorted([[int(i) if i.isdigit() else i for i in sub.split('|')] for sub in lists], key=itemgetter(2, 1, 4))
Here itemgetter(2, 1, 4) picks the same '3rd', '2nd' and '5th' fields as the np.sort() call above.
This is what I wrote:
import numpy as np
a = np.array([[]])
np.insert(a, 0, 1, axis=1)
My code just ignores the insert line for some reason. I even tried np.put_along_axis(), but it raises an error.
I just want to insert or append or put a number into an ndarray.
This forces me to turn it into a normal list, append and turn it back.
Please help
Referring to the documentation:
Returns: out: ndarray
A copy of arr with values inserted. Note that insert does not occur in-place: a new array is returned. If axis is None, out is a flattened array.
So I think all that's missing here is to assign the modified array back to a:
a = np.insert(a, 0, 1, axis=1)
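A minimal check that the assignment is all that was missing:

import numpy as np

a = np.array([[]])              # shape (1, 0)
a = np.insert(a, 0, 1, axis=1)  # keep the returned copy
print(a)                        # [[1.]]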
Is there a Pythonic way to filter a pd.DataFrame based on the type of its index elements? When reading an Excel file of time-series data, I often wish to discard rows whose indices are not datetime objects. My current solution is as follows.
import datetime
import pandas as pd
df = pd.DataFrame(index=[1, datetime.datetime(2020, 1, 1), '2019'], data=[1, 2, 3])
df[df.index.map(lambda i: isinstance(i, datetime.datetime))]
You could use a list comprehension instead of the map-lambda construction:
df[[isinstance(df.index[i], datetime.datetime) for i in range(len(df))]]
But I'm not sure that's more Pythonic.
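A variant that iterates the index values directly instead of positions, which may read slightly cleaner (still just a list comprehension):

import datetime
import pandas as pd

df = pd.DataFrame(index=[1, datetime.datetime(2020, 1, 1), '2019'], data=[1, 2, 3])
# build the boolean mask from the index values themselves
df[[isinstance(i, datetime.datetime) for i in df.index]]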
I have a pandas dataframe with a mix of datatypes (dtypes) that I wish to convert to a numpy structured array (or record array, basically the same thing in this case). For purely numeric dataframes, this is easy to do with the to_records() method. I also need the dtypes of pandas columns to be converted to strings rather than objects so that I can use the numpy method tofile() which will output numbers and strings to a binary file, but will not output objects.
In a nutshell, I need to convert pandas columns with dtype=object to numpy structured arrays of string or unicode dtype.
Here's an example, with code that would be sufficient if all columns had a numerical (float or int) dtype.
import pandas as pd
df=pd.DataFrame({'f_num': [1.,2.,3.], 'i_num':[1,2,3],
'char': ['a','bb','ccc'], 'mixed':['a','bb',1]})
struct_arr=df.to_records(index=False)
print('struct_arr',struct_arr.dtype,'\n')
# struct_arr (numpy.record, [('f_num', '<f8'), ('i_num', '<i8'),
# ('char', 'O'), ('mixed', 'O')])
But because I want to end up with string dtypes, I need to add this additional and somewhat involved code:
lst = []
for col in struct_arr.dtype.names:  # the only iterator I could find
                                    # for the column labels
    dt = struct_arr[col].dtype
    if dt == 'O':  # 'O' means 'object'
        # an explicit string length appears to be required,
        # so I calculate it with pandas' len & max methods
        dt = 'U' + str(df[col].astype(str).str.len().max())
    lst.append((col, dt))
struct_arr = struct_arr.astype(lst)
print('struct_arr',struct_arr.dtype)
# struct_arr (numpy.record, [('f_num', '<f8'), ('i_num', '<i8'),
# ('char', '<U3'), ('mixed', '<U2')])
See also: How to change the dtype of certain columns of a numpy recarray?
This seems to work, as the character and mixed dtypes are now <U3 and <U2 rather than 'O' or 'object'. I'm just checking if there is a simpler or more elegant approach. But since pandas does not have a native string type as numpy does, maybe there is not?
Combining suggestions from #jpp (list comp for conciseness) & #hpaulj (cannibalize to_records for speed), I came up with the following, which is cleaner code and also about 5x faster than my original code (tested by expanding the sample dataframe above to 10,000 rows):
names = df.columns
arrays = [ df[col].to_numpy() for col in names ]  # get_values() was removed in newer pandas
formats = [ array.dtype if array.dtype != 'O'
else f'{array.astype(str).dtype}' for array in arrays ]
rec_array = np.rec.fromarrays( arrays, dtype={'names': names, 'formats': formats} )
The above will output unicode rather than byte strings, which is probably better in general, but in my case I need byte strings because I'm reading the binary file in Fortran, and those seem to read in more easily. Hence, it may be better to replace the "formats" line above with this:
formats = [ array.dtype if array.dtype != 'O'
else array.astype(str).dtype.str.replace('<U','S') for array in arrays ]
E.g. a dtype of <U4 becomes S4.
As far as I am aware, there is no native functionality for this. For example, the maximum length of all values within a series is not stored anywhere.
However, you can implement your logic more efficiently via a list comprehension and f-strings:
data_types = [(col, struct_arr[col].dtype if struct_arr[col].dtype != 'O' else
               f'U{df[col].astype(str).str.len().max()}') for col in struct_arr.dtype.names]
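The resulting list of (name, dtype) pairs can then be applied just as before, e.g.:

struct_arr = struct_arr.astype(data_types)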
I'm confused about the results for indexing columns in pandas.
Both
db['varname']
and
db[['varname']]
give me the column value of 'varname'. However it looks like there is some subtle difference, since the output from db['varname'] shows me the type of the value.
The first looks up a specific key in your df, a specific column; the second is a list of columns to sub-select from your df, so it returns all columns matching the values in the list.
The other subtle difference is that the first by default returns a Series object, whilst the second returns a DataFrame, even if you pass a list containing a single item.
Example:
In [2]:
df = pd.DataFrame(columns=['VarName','Another','me too'])
df
Out[2]:
Empty DataFrame
Columns: [VarName, Another, me too]
Index: []
In [3]:
print(type(df['VarName']))
print(type(df[['VarName']]))
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
So when you pass a list, it tries to match all the elements:
In [4]:
df[['VarName','Another']]
Out[4]:
Empty DataFrame
Columns: [VarName, Another]
Index: []
but without the additional [], this will raise a KeyError:
df['VarName','Another']
KeyError: ('VarName', 'Another')
Because you're then trying to look up a single column keyed by the tuple ('VarName', 'Another'), which doesn't exist.
This is close to a dupe of another question, and I got this answer from it at https://stackoverflow.com/a/45201532/1331446, credit to #SethMMorton.
Answering here as this is the top hit on Google and it took me ages to "get" this.
Pandas has no [[ operator at all.
When you see df[['col_name']] you're really seeing:
col_names = ['col_name']
df[col_names]
In consequence, the only thing that [[ does for you is make the result a DataFrame rather than a Series.
[ on a DataFrame looks at the type of the parameter: if it's a scalar, then you're only after one column, and it hands it back as a Series; if it's a list, then you must be after a set of columns, so it hands back a DataFrame (with only those columns).
That's it!
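A quick demonstration of that dispatch (the column name here is just for illustration):

import pandas as pd

df = pd.DataFrame({'col_name': [1, 2, 3]})
print(type(df['col_name']))    # scalar key -> pandas.core.series.Series
print(type(df[['col_name']]))  # list key   -> pandas.core.frame.DataFrame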
As #EdChum pointed out, [] will return pandas.core.series.Series whereas [[]] will return pandas.core.frame.DataFrame.
Both are different data structures in pandas.
For sklearn, it is better to use db[['varname']], which has a 2D shape.
for example:
from sklearn.preprocessing import KBinsDiscretizer

est = KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='uniform')
est.fit(db[['varname']])  # using db['varname'] here causes an error
In [84]: single_brackets = np.array( [ 0, 13, 31, 1313 ] )
In [85]: single_brackets.shape, single_brackets.ndim
Out[85]: ((4,), 1)
# (4,) : 4 elements/values
# 1    : a one-dimensional array (generally... in pandas we call a 1D array a "Series")
In [86]: double_brackets = np.array( [[ 0, 13, 31, 1313 ]] )
In [87]: double_brackets.shape, double_brackets.ndim
Out[87]: ((1, 4), 2)
# (1, 4) : 1 row and 4 columns
# 2      : a two-dimensional array (generally... in pandas we call a 2D array a "DataFrame")
This is a NumPy concept... don't blame pandas:
[ ] -> a one-dimensional array, which yields a Series
[[ ]] -> a two-dimensional array, which yields a DataFrame
Still not convinced?
Check this:
In [89]: three_brackets = np.array( [[[ 0, 13, 31, 1313 ]]] )
In [93]: three_brackets.shape, three_brackets.ndim
Out[93]: ((1, 1, 4), 3)
# (1, 1, 4) -> in general... (blocks, rows, columns)
# 3 -> a three-dimensional array
Try creating some NumPy arrays, reshape them, and check ndim.
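For instance, a quick sketch with reshape:

import numpy as np

arr = np.array([0, 13, 31, 1313])
print(arr.shape, arr.ndim)        # (4,) 1
print(arr.reshape(1, 4).ndim)     # 2 -> [[...]]
print(arr.reshape(1, 1, 4).ndim)  # 3 -> [[[...]]]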
I know how to use numpy.savetxt to write an array to a file. How can I write multiple arrays to the same file?
Essentially I want to do math to a column of numbers, and then replace the old column with the modified numbers. I read the easiest way to do this is to write a new file completely, put the modified numbers in, and just 'copy and paste' the other numbers in the file.
Any help is appreciated.
Thanks!
Answering a very old post for my own use. I've used the following to write out two 1D arrays of the same size as CSV:
import numpy as np
x = [1, 2, 3]
y = [4, 5, 6]
zipped = list(zip(x, y))  # zip() returns an iterator in Python 3, so materialize it first
# >>> [(1, 4), (2, 5), (3, 6)]
# write the pairs as comma-separated integers
np.savetxt('z.csv', zipped, fmt='%i,%i')
If you want to write multiple arrays to a file for later use, look into numpy.savez.
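For example, a minimal sketch (the file name is arbitrary):

import numpy as np

a = np.arange(5)
b = np.linspace(0.0, 1.0, 5)
np.savez('arrays.npz', a=a, b=b)  # several arrays in one file, keyed by name

with np.load('arrays.npz') as data:
    print(data['a'], data['b'])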
However, from your description, it sounds like you want to operate on a particular column of a delimited text file.
In that case, just load the entire thing and operate on only the column you need:
E.g.
import numpy as np
data = np.loadtxt('test.txt')
# Multiply the 4th column by 5
data[:,3] *= 5
# Do something more complicated to the 2nd column
data[:,1] = np.cos(data[:,1])
# Save the array back to the file
np.savetxt('test.txt', data)
import numpy

list1 = [1, 2, 3, 4]
list2 = [0.45, 0.98, 0.89, 0.21]

# stack the two lists as rows, then transpose so each list becomes a column
dat = numpy.array([list1, list2])
dat = dat.T
numpy.savetxt('data.txt', dat, delimiter=',')