Python equivalent of R c() function, for dataframe column indices?

I would like to select from a pandas dataframe specific columns using column index.
In particular, I would like to select the columns given by the index vector that c(12:26,69:85,96:99,134:928,933:935,940:967) generates in R. How can I do that in Python?
I am thinking of something like the following, but of course Python does not have a function called c()...
input2 = input2.iloc[:,c(12:26,69:85,96:99,134:928,933:935,940:967)]

The equivalent is numpy's r_. It concatenates integer slices without needing to spell out a range for each of them (note that, like Python slices, the stop of each slice is exclusive, whereas R's a:b includes b):
import numpy as np
import pandas as pd

np.r_[2:4, 7:11, 21:25]
Out: array([ 2, 3, 7, 8, 9, 10, 21, 22, 23, 24])
df = pd.DataFrame(np.random.randn(1000))
df.iloc[np.r_[2:4, 7:11, 21:25]]
Out:
0
2 2.720383
3 0.656391
7 -0.581855
8 0.047612
9 1.416250
10 0.206395
21 -1.519904
22 0.681153
23 -1.208401
24 -0.358545
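Applied to the column spec from the question, a direct translation is sketched below; because R's a:b includes b while Python slices exclude the stop, each endpoint gains 1. (And if the R numbers were meant as 1-based column positions, every number would need a further shift of -1 before use with .iloc.)

cols = np.r_[12:27, 69:86, 96:100, 134:929, 933:936, 940:968]
input2 = input2.iloc[:, cols]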

Putting @hrbrmstr's comment into an answer, because it solved my issue and I want to make it clear that this question is resolved. In addition, please note that range(a, b) gives the numbers (a, a+1, ..., b-2, b-1) and does not include b.
R's combine function
c(4,12:26,69:85,96:99,134:928,933:935)
is translated into Python as
[4] + list(range(12,27)) + list(range(69,86)) + list(range(96,100)) + list(range(134,929)) + list(range(933,936))
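If this translation is needed in more than one place, it can be wrapped in a small helper. This is a minimal sketch of the pattern above; r_c is a hypothetical name, taking plain ints or inclusive (start, stop) pairs:

def r_c(*parts):
    # hypothetical helper mimicking R's c() for integer indices;
    # tuples are treated as inclusive ranges, like R's a:b
    out = []
    for p in parts:
        if isinstance(p, tuple):
            a, b = p
            out.extend(range(a, b + 1))
        else:
            out.append(p)
    return out

r_c(4, (12, 26), (69, 85))  # [4, 12, 13, ..., 26, 69, ..., 85]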

To answer the actual question,
Python equivalent of R c() function, for dataframe column indices?
I'm using this definition of c()
c = lambda v: v.split(',') if ":" not in v else eval(f'np.r_[{v}]')
Then we can do things like:
df = pd.DataFrame({'x': np.random.randn(1000),
                   'y': np.random.randn(1000)})
# row selection
df.iloc[c('2:4,7:11,21:25')]
# columns by name
df[c('x,y')]
# columns by range
df.T[c('12:15,17:25,500:750')]
That's pretty much as close as it gets in terms of R-like syntax.
To the curious mind
Note there is a performance penalty in using c() as above vs. np.r_. To paraphrase Knuth, let's not optimize prematurely ;-)
%timeit np.r_[2:4, 7:11, 21:25]
27.3 µs ± 786 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit c("2:4, 7:11, 21:25")
53.7 µs ± 977 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Related

How to iterate with different values in for loop?

I want to know whether we can hard-code which values the loop variable takes in a for loop.
Currently, I am iterating this way:
Eg:
for i in range(0, 10, 2):
    print(i)
The output will be 0, 2, 4, 6, 8.
If I also want the values 5 and 7 along with the increments of 2, how can I do that?
Eg:
for i in range(0, 10, 2, 4, 5, 6, 7, 8):
If I understand your question correctly, this code is for you:
Using a list comprehension (the predicate has to live inside the comprehension, and integer equality should be tested with ==, not is):
forced_values = [5, 7]
[i for i in range(0, 10) if i % 2 == 0 or i in forced_values]
# output: [0, 2, 4, 5, 6, 7, 8]
or equivalently:
sorted(list(range(0, 10, 2)) + forced_values)
Comparison of execution times:
Benchmark:
n = 10000000 # size of range values
m = 10000000 # size of forced_value list
1. Solution with generators comprehension:
%%timeit
list(i for i in range(0, n) if i % 2 == 0 or i in range(0, m))
# 3.47 s ± 265 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2. Solution with sorting:
%%timeit
sorted(list(range(0, n, 2)) + list(range(0, m)))
# 1.59 s ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
or with unordered list, if the order is not important:
%%timeit
list(range(0, n, 2)) + list(range(0, m))
# 1.03 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3. Solution proposed by @blhsing with the more_itertools package, specifically its collate function:
%%timeit
from more_itertools import collate

l = []
for i in collate(range(0, n, 2), range(0, m)):
    l.append(i)
# 6.89 s ± 886 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The best solution, even on very large lists, seems to be the second one, which is 2 to 4 times faster than the other proposed solutions.
To clarify, here are some relevant comments:
if I need to parse 1 through 1000 with exceptions of increments in between, is there a way other than specifying indexes?
Ex: for i in range(1, 1000, 20); I need i values of 178, 235, 650 in between. Can I do that in a for loop?
The technical answer is: yes and no. Yes, because of course you can do it in a for loop. No, because there is no way around specifying the exceptions. (Otherwise they wouldn't be exceptions, would they?)
You still use a for loop, because Python's for loop is not really about indices or ranges. It's about iterating over arbitrary objects. It so happens that the simple numeric for loop that many other languages have is most directly translated into Python as a loop over a range. But really, a Python for loop is simply of the form
for x in y:
    # do stuff here
And the loop iterates over y, no matter what y is, as long as it's iterable, with x taking the value of one element of y on each iteration. That's it.
But, what you seem to be really after is a way to loop over a bunch of numbers that mostly follow a simple pattern. I would probably do it like this:
values = list(range(1, 1000, 20)) + [178, 235, 650]
for i in sorted(values):
    print(i)
Or, if you don't mind a longer line:
for i in sorted(list(range(1, 1000, 20)) + [178, 235, 650]):
    print(i)
You can give the for loop a list:
for i in [2, 4, 5, 6, 7, 8]:
You could also create a custom iterator if it's following an algorithm. That's described in another answer here: Build a Basic Python Iterator.
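For instance, here is a minimal sketch of such an iterator: a generator that lazily merges a stepped range with a few extra values (stepped_with_extras is a hypothetical name; duplicates are kept if an extra collides with a range value):

import heapq

def stepped_with_extras(start, stop, step, extras):
    # heapq.merge lazily merges two sorted iterables,
    # so the full list is never materialized
    yield from heapq.merge(range(start, stop, step), sorted(extras))

for i in stepped_with_extras(1, 1000, 20, [178, 235, 650]):
    print(i)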

Vectorized conversion of decimal integer array to binary array in numpy

I'm trying to convert an array of integers into their binary representations in Python. I know that native Python has a built-in function called bin that does this. NumPy also has a similar function: numpy.binary_repr.
The problem is that none of these are vectorized approaches, as in, they only take one single value at a time. So, in order for me to convert a whole array of inputs, I have to use a for-loop and call these functions multiple times, which isn't very efficient.
Is there any way to perform this conversion without for-loops? Are there any vectorized forms of these functions? I've tried numpy.apply_along_axis but no luck. I've also tried using np.fromiter and map and it was also a no go.
I know similar questions have been asked a few other times (like here), but none of the answers given are actually vectorized.
Pointing me into any direction would be greatly appreciated!
Thanks =)
The easiest way is to use binary_repr with vectorize; it will preserve the original array shape, e.g.:
import numpy as np

binary_repr_v = np.vectorize(np.binary_repr)
x = np.arange(-9, 21).reshape(3, 2, 5)
print(x)
print()
print(binary_repr_v(x, 8))
The output:
[[[-9 -8 -7 -6 -5]
  [-4 -3 -2 -1  0]]

 [[ 1  2  3  4  5]
  [ 6  7  8  9 10]]

 [[11 12 13 14 15]
  [16 17 18 19 20]]]

[[['11110111' '11111000' '11111001' '11111010' '11111011']
  ['11111100' '11111101' '11111110' '11111111' '00000000']]

 [['00000001' '00000010' '00000011' '00000100' '00000101']
  ['00000110' '00000111' '00001000' '00001001' '00001010']]

 [['00001011' '00001100' '00001101' '00001110' '00001111']
  ['00010000' '00010001' '00010010' '00010011' '00010100']]]
The quickest way I've found (so far) is to use the pd.Series.apply() function.
Here are the testing results:
import pandas as pd
import numpy as np
x = np.random.randint(1,10000000,1000000)
# Fastest method
%timeit pd.Series(x).apply(bin)
# 135 ms ± 539 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# rafaelc's method
%timeit [np.binary_repr(z) for z in x]
# 725 ms ± 5.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# aparpara's method
binary_repr_v = np.vectorize(np.binary_repr)
%timeit binary_repr_v(x, 8)
# 7.46 s ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
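For what it's worth, a fully vectorized route does exist for fixed-width output via np.unpackbits, which operates on uint8 data; a minimal sketch, assuming the values fit in one unsigned byte:

import numpy as np

x = np.array([3, 5, 250], dtype=np.uint8)
bits = np.unpackbits(x[:, None], axis=1)
# bits is a (3, 8) array of 0s and 1s, one row per input value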

reduce one dimension within an ndarray

I have points coordinates stored in a 3-dimensional array:
(UPD. The array is actually a numpy-derived ndarray, sorry for the confusion in the initial version.)
a = [ [[11,12]], [[21,22]], [[31,32]], [[41,42]] ]
You see that each coordinate pair is stored as a nested 2-d array like [[11,12]], while I would like it to be [11,12], i.e. my array should have this content:
b = [ [11,12], [21,22], [31,32], [41,42] ]
So, how to get from a to b form? For now my solution is to create a list and then convert it to an array with numpy:
b = numpy.array([p[0] for p in a])
This works but I assume there must be a simpler and cleaner way...
UPD. Originally I tried a simple comprehension: b = [p[0] for p in a] - but then b turned out to be a list, not an array. I assume that's because the original a is a NumPy ndarray.
If you do want to use numpy:
b = np.array(a)[:, 0, :]
This will be faster than a comprehension.
Well... I certainly thought it would be
a = np.random.random((100_000, 1, 2)).tolist()
%timeit np.array([x[0] for x in a])
41.1 ms ± 304 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.array(a)[:, 0, :]
57.6 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit x = np.array(a); x.shape = len(a), 2
58.2 ms ± 381 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Edit
Oh, if it's already a NumPy array then definitely use this method (slice it directly, without the np.array conversion). Or use .squeeze() if you're sure it's not empty.
Here is another solution using list comprehension:
b = [x[0] for x in a]
If you are going to use numpy later, then it's best to avoid the list comprehension. Also it's always good practice to automate things as much as possible, so instead of manually selecting the singleton dimension just let numpy take care of:
b = numpy.array(a).squeeze()
Unless there are other singleton dimensions that you need to keep.
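A quick check of the shapes, as a sketch of what squeeze does here:

import numpy as np

a = [[[11, 12]], [[21, 22]], [[31, 32]], [[41, 42]]]
np.array(a).shape            # (4, 1, 2)
np.array(a).squeeze().shape  # (4, 2)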
In order to flatten a "nested 2-d array" like the ones you describe, you just need to get its first element: arr[0].
Apply this concept in several ways:
list comprehension (fastest): flatter_a_compr = [e[0] for e in a]
iterating (second fastest):
b = []
for e in a:
    b.append(e[0])
lambda (un-Pythonic): flatter_a = list(map(lambda e: e[0], a))
numpy (slowest): flatter_a_numpy = np.array(a)[:, 0, :]
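Since the question's a is already an ndarray, note that plain slicing needs no conversion at all and returns a view rather than a copy; a minimal sketch:

import numpy as np

a = np.array([[[11, 12]], [[21, 22]], [[31, 32]], [[41, 42]]])
b = a[:, 0, :]             # shape (4, 2); a view onto a, no data copied
b2 = a.reshape(len(a), 2)  # equivalent here, also a view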

Get part of array plus first element in numpy (In a pythonic way)

I have a NumPy array and I need to get (without changing the original) the same array, but with the first item placed at the end. Since I am using this a lot, I am looking for a clean way of getting this.
So for example, if my original array is [1,2,3,4], I would like to get an array [4,1,2,3] without modifying the original array.
I found one solution:
x = [1,2,3,4]
a = np.append(x[1:], x[0])
However, I am looking for a more Pythonic way. Basically something like this:
x = [1,2,3,4]
a = x[(:1,0)]
However, this of course doesn't work. Is there a better way of doing what I want than using the append() function?
np.roll is easy to use, but not the fastest method. It is general purpose, with multiple dimensions and shifts.
Its action can be simplified to:
def simple_roll(x):
    res = np.empty_like(x)
    res[0] = x[-1]
    res[1:] = x[:-1]
    return res
In [90]: np.roll(np.arange(1,5),1)
Out[90]: array([4, 1, 2, 3])
In [91]: simple_roll(np.arange(1,5))
Out[91]: array([4, 1, 2, 3])
time tests:
In [92]: timeit np.roll(np.arange(1001),1)
36.8 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [93]: timeit simple_roll(np.arange(1001))
5.54 µs ± 24.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
We could also use r_ to construct one index array to do the copy. But it is slower (due to advanced indexing as opposed to slicing):
def simple_roll1(x):
    idx = np.r_[-1, 0:x.shape[0] - 1]
    return x[idx]
In [101]: timeit simple_roll1(np.arange(1001))
34.2 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
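The same slicing idea extends to an arbitrary shift; a minimal sketch for 1-D arrays (simple_roll_k is a hypothetical name):

def simple_roll_k(x, k):
    # shift a 1-D array right by k positions, like np.roll(x, k)
    k %= len(x)
    res = np.empty_like(x)
    res[:k] = x[len(x) - k:]
    res[k:] = x[:len(x) - k]
    return res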
You can use np.roll, as from the docs:
Roll array elements along a given axis.
Elements that roll beyond the last position are re-introduced at the
first.
np.roll([1,2,3,4], 1)
# array([4, 1, 2, 3])
To roll in the other direction, use a negative shift:
np.roll([1,2,3,4], -1)
# array([2, 3, 4, 1])

get column names from csv file using pandas [duplicate]

I want to get a list of the column headers from a Pandas DataFrame. The DataFrame will come from user input, so I won't know how many columns there will be or what they will be called.
For example, if I'm given a DataFrame like this:
>>> my_dataframe
y gdp cap
0 1 2 5
1 2 3 9
2 8 7 2
3 3 4 7
4 6 7 7
5 4 8 3
6 8 2 8
7 9 9 10
8 6 6 4
9 10 10 7
I would get a list like this:
>>> header_list
['y', 'gdp', 'cap']
You can get the values as a list by doing:
list(my_dataframe.columns.values)
Also you can simply use (as shown in Ed Chum's answer):
list(my_dataframe)
There is a built-in method which is the most performant:
my_dataframe.columns.values.tolist()
.columns returns an Index, .columns.values returns an array and this has a helper function .tolist to return a list.
If performance is not as important to you, Index objects define a .tolist() method that you can call directly:
my_dataframe.columns.tolist()
The difference in performance is obvious:
%timeit df.columns.tolist()
16.7 µs ± 317 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit df.columns.values.tolist()
1.24 µs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
For those who hate typing, you can just call list on df, as so:
list(df)
I did some quick tests, and perhaps unsurprisingly the built-in version using dataframe.columns.values.tolist() is the fastest:
In [1]: %timeit [column for column in df]
1000 loops, best of 3: 81.6 µs per loop
In [2]: %timeit df.columns.values.tolist()
10000 loops, best of 3: 16.1 µs per loop
In [3]: %timeit list(df)
10000 loops, best of 3: 44.9 µs per loop
In [4]: % timeit list(df.columns.values)
10000 loops, best of 3: 38.4 µs per loop
(I still really like the list(dataframe) though, so thanks EdChum!)
It gets even simpler (as of Pandas 0.16.0):
df.columns.tolist()
will give you the column names in a nice list.
Extended Iterable Unpacking (Python 3.5+): [*df] and Friends
Unpacking generalizations (PEP 448) were introduced in Python 3.5, so the following operations are all possible.
df = pd.DataFrame('x', columns=['A', 'B', 'C'], index=range(5))
df
A B C
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
If you want a list....
[*df]
# ['A', 'B', 'C']
Or, if you want a set,
{*df}
# {'A', 'B', 'C'}
Or, if you want a tuple,
*df, # Please note the trailing comma
# ('A', 'B', 'C')
Or, if you want to store the result somewhere,
*cols, = df # A wild comma appears, again
cols
# ['A', 'B', 'C']
... if you're the kind of person who converts coffee to typing sounds, well, this is going to consume your coffee more efficiently ;)
P.S.: if performance is important, you will want to ditch the
solutions above in favour of
df.columns.to_numpy().tolist()
# ['A', 'B', 'C']
This is similar to Ed Chum's answer, but updated for
v0.24 where .to_numpy() is preferred to the use of .values. See
this answer (by me) for more information.
Visual Check
Since I've seen this discussed in other answers, you can use iterable unpacking (no need for explicit loops).
print(*df)
A B C
print(*df, sep='\n')
A
B
C
Critique of Other Methods
Don't use an explicit for loop for an operation that can be done in a single line (list comprehensions are okay).
Next, using sorted(df) does not preserve the original order of the columns. For that, you should use list(df) instead.
Next, list(df.columns) and list(df.columns.values) are poor suggestions (as of the current version, v0.24). Both Index (returned from df.columns) and NumPy arrays (returned by df.columns.values) define a .tolist() method, which is faster and more idiomatic.
Lastly, listification i.e., list(df) should only be used as a concise alternative to the aforementioned methods for Python 3.4 or earlier where extended unpacking is not available.
>>> list(my_dataframe)
['y', 'gdp', 'cap']
To list the columns of a dataframe while in debugger mode, use a list comprehension:
>>> [c for c in my_dataframe]
['y', 'gdp', 'cap']
By the way, you can get a sorted list simply by using sorted:
>>> sorted(my_dataframe)
['cap', 'gdp', 'y']
That's available as my_dataframe.columns.
It's interesting, but df.columns.values.tolist() is almost three times faster than df.columns.tolist(), even though I thought they were the same:
In [97]: %timeit df.columns.values.tolist()
100000 loops, best of 3: 2.97 µs per loop
In [98]: %timeit df.columns.tolist()
10000 loops, best of 3: 9.67 µs per loop
A DataFrame follows the dict-like convention of iterating over the “keys” of the objects.
my_dataframe.keys()
Create a list of keys/columns, using the object method to_list() or the Pythonic way:
my_dataframe.keys().to_list()
list(my_dataframe.keys())
Basic iteration on a DataFrame returns column labels:
[column for column in my_dataframe]
Do not convert a DataFrame into a list, just to get the column labels. Do not stop thinking while looking for convenient code samples.
xlarge = pd.DataFrame(np.arange(100000000).reshape(10000,10000))
list(xlarge) # Compute time and memory consumption depend on dataframe size - O(N)
list(xlarge.keys()) # Constant time operation - O(1)
In the Notebook
For data exploration in the IPython notebook, my preferred way is this:
sorted(df)
Which will produce an easy to read alphabetically ordered list.
In a code repository
In code I find it more explicit to do
df.columns
Because it tells others reading your code what you are doing.
%%timeit
final_df.columns.values.tolist()
948 ns ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
list(final_df.columns)
14.2 µs ± 79.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
list(final_df.columns.values)
1.88 µs ± 11.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
final_df.columns.tolist()
12.3 µs ± 27.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
list(final_df.head(1).columns)
163 µs ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The simplest option would be:
list(my_dataframe.columns) or my_dataframe.columns.tolist()
No need for the complex stuff above :)
It's very simple. You can do it as:
list(df.columns)
For a quick, neat, visual check, try this:
for col in df.columns:
    print(col)
As answered by Simeon Visser, you could do
list(my_dataframe.columns.values)
or
list(my_dataframe) # For less typing.
But I think the sweet spot is:
list(my_dataframe.columns)
It is explicit and at the same time not unnecessarily long.
I feel the question deserves an additional explanation.
As fixxxer noted, the answer depends on the Pandas version you are using in your project, which you can get with the pd.__version__ command.
If, like me, you are for some reason using a version of Pandas older than 0.16.0 (on Debian 8 (Jessie) I use 0.14.1), then you need to use:
df.keys().tolist(), because there isn't any df.columns attribute implemented yet.
The advantage of this keys method is that it works even in newer versions of Pandas, so it's more universal.
import pandas as pd
# create test dataframe
df = pd.DataFrame('x', columns=['A', 'B', 'C'], index=range(2))
list(df.columns)
Returns
['A', 'B', 'C']
n = []
for i in my_dataframe.columns:
    n.append(i)
print(n)
This is the easiest way to reach your goal.
my_dataframe.columns.values.tolist()
and if you are lazy, try this:
list(my_dataframe)
If the DataFrame happens to have an Index or MultiIndex and you want those included as column names too:
names = list(filter(None, df.index.names + df.columns.values.tolist()))
It avoids calling reset_index() which has an unnecessary performance hit for such a simple operation.
I've run into needing this more often because I'm shuttling data from databases where the dataframe index maps to a primary/unique key, but is really just another "column" to me. It would probably make sense for pandas to have a built-in method for something like this (totally possible I've missed it).
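As an illustration of the snippet above, a minimal sketch with a named index (the data here is made up for the example):

import pandas as pd

df = pd.DataFrame({'gdp': [2, 3], 'cap': [5, 9]},
                  index=pd.Index([2000, 2001], name='year'))
names = list(filter(None, df.index.names + df.columns.values.tolist()))
# ['year', 'gdp', 'cap']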
It's the simple code for you:
for i in my_dataframe:
    print(i)
Just do it.
Even though the solution that was provided previously is nice, I would also expect something like frame.column_names() to be a function in Pandas, but since it is not, maybe it would be nice to use the following syntax. It somehow preserves the feeling that you are using pandas in a proper way by calling the tolist() function:
frame.columns.tolist()
listHeaders = [colName for colName in my_dataframe]
