Count total number of list elements in a pandas column - Python

I have a pandas DataFrame A with a column keywords
(here I'm showing only 4 rows, but in reality there are millions):
keywords
['loans','mercedez','bugatti']
['trump','usa']
['galaxy','7s','canon','macbook']
['beiber','spiderman','marvels','ironmen']
I want to sum the total number of list elements in the column keywords and store it in a variable. Something like:
total_sum = elements in keywords[0] + elements in keywords[1] + elements in keywords[2] + elements in keywords[3]
total_sum = 3 + 2 + 4 + 4
total_sum = 13
How can I do this in pandas?

IIUC
Setup
import pandas as pd

df = pd.DataFrame()
df['keywords'] = [['loans', 'mercedez', 'bugatti'],
                  ['trump', 'usa'],
                  ['galaxy', '7s', 'canon', 'macbook'],
                  ['beiber', 'spiderman', 'marvels', 'ironmen']]
Then just use str.len and sum:
df.keywords.str.len().sum()
Detail:
df.keywords.str.len()
0 3
1 2
2 4
3 4
Name: keywords, dtype: int64
P.S.: If you have strings that look like lists, use ast.literal_eval to convert them to real lists first (remember to import ast):
df.keywords.transform(ast.literal_eval).str.len().sum()
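For example, a minimal sketch of the string case (the sample values here are hypothetical):
import ast
import pandas as pd

df = pd.DataFrame({'keywords': ["['loans','mercedez','bugatti']", "['trump','usa']"]})
df['keywords'] = df['keywords'].transform(ast.literal_eval)  # parse strings into real lists
df.keywords.str.len().sum()  # 5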

Using sum and map:
sum(map(len, df.keywords))
Sample
df = pd.DataFrame({
    'keywords': [['a', 'b', 'c'], ['c', 'd'], ['a', 'b', 'c', 'd'], ['g', 'h', 'i']]
})
sum(map(len, df.keywords))
12
Timings
df = pd.concat([df]*10000)
%timeit sum(map(len, df.keywords))
1.87 ms ± 52.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.keywords.map(len).sum()
13.5 ms ± 661 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.keywords.str.len().sum()
14.3 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Validation
>>> sum(map(len, df.keywords)) == df.keywords.map(len).sum() == df.keywords.str.len().sum()
True
A bit of a disclaimer: using pandas methods on columns that contain lists is always going to be inefficient (which is why the non-pandas methods are so much faster here), since DataFrames are not meant to store lists. You should try to avoid this whenever possible.
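One way to avoid it, sketched below on the setup above and assuming pandas 0.25+ for DataFrame.explode, is to keep one keyword per row (a "long" format); counting then becomes a plain row count:
long_df = df.explode('keywords')  # one row per (original row, keyword) pair
len(long_df)                      # 13; note explode emits a NaN row for an empty list
Whether the reshape is worth it depends on how often you need per-element operations.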

You can try this one:
df.keywords.map(len).sum()

Simple as that.
Maybe Pandas has evolved since the answers above were written. You can also store the lengths in a new column:
df['len_of_list'] = df.my_columns_with_list.agg([len])
Cheers,

I want to sum total number of list elements in column keywords
This is different from what you pseudo-coded. If what you actually want is the number of cells in the column (i.e. the number of rows), you can call .size:
total_sum = df.keywords.size
Note that .size counts rows (4 here), not the elements inside the lists; the comparison below makes the distinction concrete.
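Using the setup from the first answer:
df.keywords.size              # 4  (number of rows)
df.keywords.str.len().sum()   # 13 (total number of list elements, as asked)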

Method 1:
len([item for sublist in df.keywords for item in sublist])
Method 2:
df.keywords.apply(len).sum()
For example:
df = [{"item": "a", "item_price": [1,1.5,2]}, {"item": "b", "item_price": [0.5,0.75,1]}]
df = pd.DataFrame(df)
print(df)
print("Ans:",len([item for sublist in df.item_price for item in sublist]))
OUTPUT
df
item item_price
0 a [1, 1.5, 2]
1 b [0.5, 0.75, 1]
Ans: 6

This is more like a list-flattening problem:
import itertools
len(list(itertools.chain(*df.keywords.values.tolist())))
Out[57]: 13
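A slightly leaner variant of the same idea: itertools.chain.from_iterable avoids both the * unpacking and the intermediate tolist() call:
len(list(itertools.chain.from_iterable(df.keywords)))  # 13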

Related

How to get unique lists in a Pandas column of lists

I have the following DataFrame:
import pandas as pd
df = pd.DataFrame({'name': ["John", "Jack", "Jeff", "Kate"], "hobbies":[["pirates"], ["pirates"], ["climbing", "yoga"], ["yoga"]]})
# name hobbies
# 0 John [pirates]
# 1 Jack [pirates]
# 2 Jeff [climbing, yoga]
# 3 Kate [yoga]
I would like to have a list of the unique lists in hobbies.
Just to be clear, I don't want the list of unique hobbies (i.e. ["pirates", "climbing", "yoga"]), which is already covered in several questions including this one: pandas get unique values from column of lists
I would like instead the list [['pirates'], ['yoga'], ['climbing', 'yoga']].
I have thought of the following way but that does not seem very "panda-ic":
[list(t) for t in {tuple(h) for h in df["hobbies"]}]
Is there a better way to do it?
Let us change the lists to tuples so we can use drop_duplicates:
out = df.hobbies.apply(tuple).drop_duplicates().apply(list).tolist()
Out[143]: [['pirates'], ['climbing', 'yoga'], ['yoga']]
If you do not need to convert back to lists, you could do:
df.hobbies.apply(tuple).unique()
You could use numpy to do it:
import numpy as np
np.unique(df['hobbies'].to_numpy()).tolist()
Lists aren't hashable, so use tuples instead and then convert back to lists:
[*map(list,df['hobbies'].map(tuple).unique())]
output:
[['pirates'], ['climbing', 'yoga'], ['yoga']]
Using unpacking instead of calling list on the map object has proven faster for me:
%%timeit
list(map(list,df['hobbies'].map(tuple).unique()))
385 µs ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
[*map(list,df['hobbies'].map(tuple).unique())]
296 µs ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Is there a more concise way to conditionally loop over rows in a dataframe?

I have a simple dataframe and would like to apply a function to a particular column based on the status of another column.
myDF = pd.DataFrame({'trial': ['A','B','A','B','A','B'], 'score': [1,2,3,4,5,6]})
I would like to multiply each observation in the score column by 10, but only if the trial name is 'A'. If it is 'B', I would like the score to remain the same.
I have a working version that does what I need, but I'm wondering if there is a way I can do this without having to create the new dataframe. Not that it makes a huge difference, but eight lines of code seems pretty long and I'm guessing there might be a simpler solution I have overlooked.
newDF = pd.DataFrame(columns=['trial', 'score'])
for row in myDF.iterrows():
    if row[1][0] == 'A':
        newScore = {'trial': 'A', 'score': row[1][1] * 10}
        newDF = newDF.append(newScore, ignore_index=True)
    else:
        newScore = {'trial': 'B', 'score': row[1][1]}
        newDF = newDF.append(newScore, ignore_index=True)
You can use loc to select the rows and the column to multiply by 10:
myDF.loc[myDF['trial'].eq('A'), 'score'] *= 10
print(myDF)
trial score
0 A 10
1 B 2
2 A 30
3 B 4
4 A 50
5 B 6
and it will be much faster than looping.
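If you prefer an expression over in-place assignment, a roughly equivalent vectorized form (a sketch, assuming numpy is available) is:
import numpy as np
myDF['score'] = np.where(myDF['trial'] == 'A', myDF['score'] * 10, myDF['score'])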
Pandas is pretty fast and does the job well! But functions like .iterrows() are dead slow compared to other methods. There are various articles on this topic; one such article is linked here.
Instead, you can simply use the .apply() function, which works wonders - and you can even plug in any custom function!
Here is an example for your case:
myDF["score"] = myDF.apply(lambda x : x[1] * 10 if x[0] == "A" else x[1], axis = 1)
You can even apply any function using .apply and a lambda, as below:
def updateScore(trial, score):
    return score if trial != 'A' else score * 10
myDF["score"] = myDF.apply(lambda x : updateScore(trial = x[0], score = x[1]), axis = 1)
For more details, you can check the documentation.
An alternative way to do this is using replace.
myDF = pd.DataFrame({'trial': ['A','B','A','B','A','B'], 'score': [1,2,3,4,5,6]})
myDF['score'] = myDF['score'].mul(myDF['trial'].replace({'A':10, 'B':1}))
Performance
Below I tested the performance of the different solutions using a dataset of 100000 rows.
import numpy as np

N = 100000
myDF = pd.DataFrame({'trial': np.random.choice(['A', 'B'], N), 'score': np.random.choice(np.arange(0, 10), N)})
Given solutions:
%timeit myDF['score'] = myDF['score'].mul(myDF['trial'].replace({'A':10, 'B':1}))
22.5 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit myDF.loc[myDF['trial'].eq('A'), 'score'] *= 10
6.4 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit myDF["score"] = myDF.apply(lambda x : x[1] * 10 if x[0] == "A" else x[1], axis = 1)
587 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
def updateScore(trial, score):
    return score if trial != 'A' else score * 10
%timeit myDF["score"] = myDF.apply(lambda x : updateScore(trial = x[0], score = x[1]), axis = 1)
603 ms ± 4.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
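Another vectorized option, not benchmarked above, is Series.mask, which replaces values where the condition holds (a sketch, equivalent in effect to the loc approach):
myDF['score'] = myDF['score'].mask(myDF['trial'].eq('A'), myDF['score'] * 10)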

Elegant way to remove elements from list item in data frame if not contained in another list

Let's say I have the following list:
list = ['a', 'b', 'c', 'd']
And a DataFrame like this:
df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})
Out:
content
0 [a, b, abc]
1 [c, d, xyz]
2 [d, xyz]
I need a function that can remove every element from the 'content' column that is not in 'list', so my output would look like this:
Out:
content
0 [a, b]
1 [c, d]
2 [d]
Please consider that my actual df has about 1m rows and the list about 1k items. I tried iterating over the rows, but that took ages...
IIUC (here l is the keep-list from the question):
df['new'] = [[y for y in x if y in l] for x in df.content]
df
Out[535]:
content new
0 [a, b, abc] [a, b]
1 [c, d, xyz] [c, d]
2 [d, xyz] [d]
One way to do this is with apply:
keep = ['a', 'b', 'c', 'd'] # don't use list as a variable name
df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})
df['fixed_content'] = df.apply(lambda row: [x for x in row['content'] if x in keep],axis=1)
Assuming the lists in your series contain unique values, you can use dict.fromkeys to get a keys view and intersect it with the keep-list L. Note that the & intersection returns a set, so the order within each result is not guaranteed (see the second row of the output below):
df['content'] = [list(dict.fromkeys(x).keys() & L) for x in df['content']]
print(df)
content
0 [a, b]
1 [d, c]
2 [d]
Another option, using filter:
>>> list1 = ['a', 'b', 'c', 'd']
>>> df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})
>>> df['content']=[list(filter(lambda x:x in list1,i)) for i in df['content']]
>>> df
content
0 [a, b]
1 [c, d]
2 [d]
Given that the list of strings we want to check membership against is of length ~1k, any of the answers already posted can be made significantly more efficient by first converting this list to a set.
In my testing, the fastest method was converting the list to a set and then using the answer posted by W-B:
l = set(l)
df['new'] = [[y for y in x if y in l] for x in df.content]
Full testing code and results below. I had to make some assumptions about the exact nature of the real dataset, but I think my randomly generated lists of strings should be reasonably representative.
Note that I excluded the solution from T Burgis, as I ran into an error with it - it could have been me doing something wrong, but since they had already commented that W-B's solution was faster, I didn't try too hard to figure it out. I should also note that, for consistency's sake, I assigned the result to df['new'] for all solutions, regardless of whether or not the original answer did so.
import random
import string
import pandas as pd
def initial_setup():
    """
    Returns a 1m row x 1 column DataFrame, and a 992 element list of strings (all unique).
    """
    random.seed(1)
    keep = list(set([''.join(random.choices(string.ascii_lowercase, k=random.randint(1, 5))) for i in range(1250)]))
    content = [[''.join(random.choices(string.ascii_lowercase, k=random.randint(1, 5))) for i in range(5)] for j in range(1000000)]
    df = pd.DataFrame({'content': content})
    return df, keep
def jpp(df, L):
    df['new'] = [list(dict.fromkeys(x).keys() & L) for x in df['content']]

def wb(df, l):
    df['new'] = [[y for y in x if y in l] for x in df.content]

def jonathon(df, list1):
    df['new'] = [list(filter(lambda x: x in list1, i)) for i in df['content']]
Tests without conversion to set:
In [3]: df, keep = initial_setup()
...: %timeit jpp(df, keep)
...:
16.9 s ± 333 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: df, keep = initial_setup()
...: %timeit wb(df, keep)
1min ± 612 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: df, keep = initial_setup()
...: %timeit jonathon(df, keep)
1min 2s ± 1.26 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Tests with conversion to set:
In [6]: df, keep = initial_setup()
...: %timeit jpp(df, set(keep))
...:
1.7 s ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: df, keep = initial_setup()
...: %timeit wb(df, set(keep))
...:
689 ms ± 20.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: df, keep = initial_setup()
...: %timeit jonathon(df, set(keep))
...:
1.26 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Get index number from multi-index dataframe in python

There seem to be a lot of answers on how to get the last index value from a pandas dataframe, but what I am trying to get is the index position number of the last row for every index at level 0 in a multi-index dataframe. I found a way using a loop, but the dataframe is millions of lines and the loop is slow. I assume there is a more pythonic way of doing this.
Here is a mini example of df3. I want a list (or maybe an array) of the index positions of the last row of each stock, i.e. the last row before the index changes to a new stock. The Index column below shows the values I want:
Stock Date Index
AAPL 12/31/2004
1/3/2005
1/4/2005
1/5/2005
1/6/2005
1/7/2005
1/10/2005 3475
AMZN 12/31/2004
1/3/2005
1/4/2005
1/5/2005
1/6/2005
1/7/2005
1/10/2005 6951
BAC 12/31/2004
1/3/2005
1/4/2005
1/5/2005
1/6/2005
1/7/2005
1/10/2005 10427
This is the code I am using, where df3 is the dataframe:
test_index_list = []
for start_index in range(len(df3) - 1):
    end_index = start_index + 1
    if df3.index[start_index][0] != df3.index[end_index][0]:
        test_index_list.append(start_index)
I changed Divakar's answer a bit, using get_level_values to grab the first level of the MultiIndex:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbc')}).set_index(['F', 'A', 'B'])
print (df)
C D E
F A B
a a 4 7 1 5
b 5 8 3 3
c 4 9 5 6
b d 5 4 7 9
e 5 2 1 2
c f 4 3 0 4
def start_stop_arr(initial_list):
    a = np.asarray(initial_list)
    mask = np.concatenate(([True], a[1:] != a[:-1], [True]))
    idx = np.flatnonzero(mask)
    stop = idx[1:] - 1
    return stop
print (df.index.get_level_values(0))
Index(['a', 'a', 'a', 'b', 'b', 'c'], dtype='object', name='F')
print (start_stop_arr(df.index.get_level_values(0)))
[2 4 5]
dict.values
Using a dict to track positions leaves the last position found for each label as the one that matters:
list(dict(map(reversed, enumerate(df.index.get_level_values(0)))).values())
[2, 4, 5]
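The same trick, unpacked: enumerate yields (position, label) pairs, and reversing each pair makes the label the key, so later positions overwrite earlier ones and the dict ends up holding the last position of every label:
pairs = ((label, pos) for pos, label in enumerate(df.index.get_level_values(0)))
list(dict(pairs).values())
# [2, 4, 5]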
With a loop
Create a function that takes a factorization and the number of unique values:
def last(bins, k):
    a = np.zeros(k, np.int64)
    for i, b in enumerate(bins):
        a[b] = i
    return a
You can then get the factorization with
f, u = pd.factorize(df.index.get_level_values(0))
last(f, len(u))
array([2, 4, 5])
However, the way MultiIndex is usually constructed, the labels objects are already factorizations and the levels objects are unique values.
last(df.index.labels[0], df.index.levels[0].size)
array([2, 4, 5])
What's more, we can use Numba's just-in-time compilation to super-charge this:
from numba import njit

@njit
def nlast(bins, k):
    a = np.zeros(k, np.int64)
    for i, b in enumerate(bins):
        a[b] = i
    return a
nlast(df.index.labels[0], df.index.levels[0].size)
array([2, 4, 5])
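Note that in pandas 0.24+ MultiIndex.labels was deprecated in favour of MultiIndex.codes, so on newer versions the same call becomes:
nlast(df.index.codes[0], df.index.levels[0].size)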
Timing
%%timeit
f, u = pd.factorize(df.index.get_level_values(0))
last(f, len(u))
641 µs ± 9.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
f, u = pd.factorize(df.index.get_level_values(0))
nlast(f, len(u))
264 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
nlast(df.index.labels[0], len(df.index.levels[0]))
4.06 µs ± 43.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
last(df.index.labels[0], len(df.index.levels[0]))
654 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
list(dict(map(reversed, enumerate(df.index.get_level_values(0)))).values())
709 µs ± 4.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
jezrael's solution. Also very fast.
%timeit start_stop_arr(df.index.get_level_values(0))
113 µs ± 83.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.unique
I did not time this because I don't like it. See below:
Using np.unique and the return_index argument. This returns the first place each unique value is found. After this, I'd do some shifting to get at the last position of the prior unique value.
Note: this works if the level values are in contiguous groups. If they aren't, we have to do sorting and unsorting that isn't worth it - unless it really is, in which case I'll show how to do it. (A non-contiguous alternative is also sketched after this answer.)
i = np.unique(df.index.get_level_values(0), return_index=True)[1]
np.append(i[1:], len(df)) - 1
array([2, 4, 5])
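For the non-contiguous case just mentioned, one alternative (a sketch, not benchmarked above) is Series.duplicated with keep='last', which flags everything except the final occurrence of each label:
lvl = pd.Series(df.index.get_level_values(0))
np.flatnonzero(~lvl.duplicated(keep='last'))
# array([2, 4, 5])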
Setup
from jezrael's answer:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbc')}).set_index(['F', 'A', 'B'])

Get column names from CSV file using pandas [duplicate]

I want to get a list of the column headers from a Pandas DataFrame. The DataFrame will come from user input, so I won't know how many columns there will be or what they will be called.
For example, if I'm given a DataFrame like this:
>>> my_dataframe
y gdp cap
0 1 2 5
1 2 3 9
2 8 7 2
3 3 4 7
4 6 7 7
5 4 8 3
6 8 2 8
7 9 9 10
8 6 6 4
9 10 10 7
I would get a list like this:
>>> header_list
['y', 'gdp', 'cap']
You can get the values as a list by doing:
list(my_dataframe.columns.values)
Also you can simply use (as shown in Ed Chum's answer):
list(my_dataframe)
There is a built-in method which is the most performant:
my_dataframe.columns.values.tolist()
.columns returns an Index, .columns.values returns an array and this has a helper function .tolist to return a list.
If performance is not as important to you, Index objects define a .tolist() method that you can call directly:
my_dataframe.columns.tolist()
The difference in performance is obvious:
%timeit df.columns.tolist()
16.7 µs ± 317 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit df.columns.values.tolist()
1.24 µs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
For those who hate typing, you can just call list on df, as so:
list(df)
I did some quick tests, and perhaps unsurprisingly the built-in version using dataframe.columns.values.tolist() is the fastest:
In [1]: %timeit [column for column in df]
1000 loops, best of 3: 81.6 µs per loop
In [2]: %timeit df.columns.values.tolist()
10000 loops, best of 3: 16.1 µs per loop
In [3]: %timeit list(df)
10000 loops, best of 3: 44.9 µs per loop
In [4]: %timeit list(df.columns.values)
10000 loops, best of 3: 38.4 µs per loop
(I still really like the list(dataframe) though, so thanks EdChum!)
It gets even simpler (as of Pandas 0.16.0):
df.columns.tolist()
will give you the column names in a nice list.
Extended Iterable Unpacking (Python 3.5+): [*df] and Friends
Unpacking generalizations (PEP 448) were introduced in Python 3.5, so the following operations are all possible.
df = pd.DataFrame('x', columns=['A', 'B', 'C'], index=range(5))
df
A B C
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
If you want a list....
[*df]
# ['A', 'B', 'C']
Or, if you want a set,
{*df}
# {'A', 'B', 'C'}
Or, if you want a tuple,
*df, # Please note the trailing comma
# ('A', 'B', 'C')
Or, if you want to store the result somewhere,
*cols, = df # A wild comma appears, again
cols
# ['A', 'B', 'C']
... if you're the kind of person who converts coffee to typing sounds, well, this is going to consume your coffee more efficiently ;)
P.S.: if performance is important, you will want to ditch the
solutions above in favour of
df.columns.to_numpy().tolist()
# ['A', 'B', 'C']
This is similar to Ed Chum's answer, but updated for
v0.24 where .to_numpy() is preferred to the use of .values. See
this answer (by me) for more information.
Visual Check
Since I've seen this discussed in other answers, you can use iterable unpacking (no need for explicit loops).
print(*df)
A B C
print(*df, sep='\n')
A
B
C
Critique of Other Methods
Don't use an explicit for loop for an operation that can be done in a single line (list comprehensions are okay).
Next, using sorted(df) does not preserve the original order of the columns. For that, you should use list(df) instead.
Next, list(df.columns) and list(df.columns.values) are poor suggestions (as of the current version, v0.24). Both Index (returned from df.columns) and NumPy arrays (returned by df.columns.values) define a .tolist() method, which is faster and more idiomatic.
Lastly, listification, i.e. list(df), should only be used as a concise alternative to the aforementioned methods for Python 3.4 or earlier, where extended unpacking is not available.
>>> list(my_dataframe)
['y', 'gdp', 'cap']
To list the columns of a dataframe while in debugger mode, use a list comprehension:
>>> [c for c in my_dataframe]
['y', 'gdp', 'cap']
By the way, you can get a sorted list simply by using sorted:
>>> sorted(my_dataframe)
['cap', 'gdp', 'y']
That's available as my_dataframe.columns.
Interestingly, df.columns.values.tolist() is almost three times faster than df.columns.tolist(), even though you might expect them to be the same - most likely because .values.tolist() is a single call into NumPy's C-level tolist, while Index.tolist() goes through extra pandas-level indirection:
In [97]: %timeit df.columns.values.tolist()
100000 loops, best of 3: 2.97 µs per loop
In [98]: %timeit df.columns.tolist()
10000 loops, best of 3: 9.67 µs per loop
A DataFrame follows the dict-like convention of iterating over the "keys" of the object.
my_dataframe.keys()
Create a list of keys/columns - with the object method to_list() or the Pythonic way:
my_dataframe.keys().to_list()
list(my_dataframe.keys())
Basic iteration on a DataFrame returns column labels:
[column for column in my_dataframe]
Do not convert a DataFrame into a list, just to get the column labels. Do not stop thinking while looking for convenient code samples.
xlarge = pd.DataFrame(np.arange(100000000).reshape(10000,10000))
list(xlarge) # Compute time and memory consumption depend on dataframe size - O(N)
list(xlarge.keys()) # Constant time operation - O(1)
In the Notebook
For data exploration in the IPython notebook, my preferred way is this:
sorted(df)
Which will produce an easy to read alphabetically ordered list.
In a code repository
In code I find it more explicit to do
df.columns
Because it tells others reading your code what you are doing.
%%timeit
final_df.columns.values.tolist()
948 ns ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
list(final_df.columns)
14.2 µs ± 79.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
list(final_df.columns.values)
1.88 µs ± 11.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
final_df.columns.tolist()
12.3 µs ± 27.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
list(final_df.head(1).columns)
163 µs ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The simplest option would be:
list(my_dataframe.columns) or my_dataframe.columns.tolist()
No need for the complex stuff above :)
It's very simple. You can do it as:
list(df.columns)
For a quick, neat, visual check, try this:
for col in df.columns:
    print(col)
As answered by Simeon Visser, you could do
list(my_dataframe.columns.values)
or
list(my_dataframe) # For less typing.
But I think the sweet spot is:
list(my_dataframe.columns)
It is explicit and at the same time not unnecessarily long.
I feel the question deserves an additional explanation.
As fixxxer noted, the answer depends on the Pandas version you are using in your project, which you can get with pd.__version__.
If, like me, you are for some reason using a version of Pandas older than 0.16.0 (on Debian 8 (Jessie) I use 0.14.1), then you need to use df.keys().tolist(), because df.columns.tolist() isn't implemented yet.
The advantage of this keys method is that it works even in newer versions of Pandas, so it's more universal.
import pandas as pd
# create test dataframe
df = pd.DataFrame('x', columns=['A', 'B', 'C'], index=range(2))
list(df.columns)
Returns
['A', 'B', 'C']
n = []
for i in my_dataframe.columns:
    n.append(i)
print(n)
This is the easiest way to reach your goal.
my_dataframe.columns.values.tolist()
and if you are lazy, try this:
list(my_dataframe)
If the DataFrame happens to have an Index or MultiIndex and you want those included as column names too:
names = list(filter(None, df.index.names + df.columns.values.tolist()))
It avoids calling reset_index() which has an unnecessary performance hit for such a simple operation.
I've run into needing this more often because I'm shuttling data from databases where the dataframe index maps to a primary/unique key, but is really just another "column" to me. It would probably make sense for pandas to have a built-in method for something like this (totally possible I've missed it).
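For instance, a small hypothetical frame with a named index:
df = pd.DataFrame({'id': [1, 2], 'val': ['x', 'y']}).set_index('id')
list(filter(None, df.index.names + df.columns.values.tolist()))
# ['id', 'val']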
Here is simple code for you:
for i in my_dataframe:
    print(i)
Just do it.
Even though the solution that was provided previously is nice, I would also expect something like frame.column_names() to be a function in Pandas. Since it is not, it is nice to use the following syntax, which somehow preserves the feeling that you are using pandas properly by calling its tolist function:
frame.columns.tolist()
listHeaders = [colName for colName in my_dataframe]
