how to set columns of pandas dataframe as list - python

I have a pandas dataframe, and when I try to access its columns (like df[["a"]]) it is not possible because
the columns are defined as an "Index" object (pandas.core.indexes.base.Index), e.g. Index(['col2','col2'], dtype='object').
I tried to convert it with df.columns = df.columns.tolist() and also df.columns = [str(col) for col in df.columns],
but the columns remained an Index object.
What I want is for df.columns to return a list object.
What can I do?

columns is an attribute, not a method, so it is not callable. You need to remove the parentheses ():
df.columns will give you the names of the columns as an Index object.
list(df.columns) will give you the names of the columns as a list.
In your example, list(ss.columns) will return a list of column names.
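A minimal sketch of the difference, using a small made-up frame (the column names "a" and "b" are assumptions, not the asker's data). It shows that df.columns is an Index while list(df.columns) is a plain list, and that selecting columns works either way:

```python
import pandas as pd

# Hypothetical frame standing in for the asker's data.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

cols_index = df.columns       # a pandas Index, not a list
cols_list = list(df.columns)  # a plain Python list of the column names

# Column selection does not require df.columns to be a list.
subset = df[["a"]]

print(type(cols_index).__name__)
print(cols_list)
print(subset)
```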

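Try this:
df.columns.values.tolist()
Note that df.columns.tolist() returns the same list. Keep the result in its own variable: assigning a list back to df.columns stores it as an Index again, which is why your attempt appeared to have no effect.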
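As a rough illustration, assuming a small made-up frame with columns "col1" and "col2", both spellings give the same list, and writing a list back to df.columns turns it into an Index again:

```python
import pandas as pd

# Hypothetical frame; the column names are placeholders.
df = pd.DataFrame({"col1": [1], "col2": [2]})

via_values = df.columns.values.tolist()  # through the underlying NumPy array
via_index = df.columns.tolist()          # directly from the Index

# Assigning a list to df.columns does not keep it as a list:
# pandas wraps it in an Index again.
df.columns = via_index
still_index = isinstance(df.columns, pd.Index)

print(via_values, via_index, still_index)
```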

You have to wrap it in the list constructor to make it behave like a list, i.e. list(ss.columns).
list(ss.columns)
Hope this works!

Related

Get a pandas column name as a string

I have a dataframe containing multiple columns and multiple rows. I am trying to find the column which contains the entry 'some_string'. I managed to do this with
col = df.columns[df.isin(['some_string']).any()]
I would like to have col as a string, but instead it is of the following type
In [47]:
print(col)
Out[47]:
Index(['col_N'], dtype='object')
So how can I get just 'col_N' returned? I just can't find an answer to that! Thanks.
You can treat your output as a list. If you have only one match, you can ask for
print(col[0])
If you have one or more and you want to print them all, you can convert it to a list:
print(list(col))
or you can unpack the values of col into print:
print(*col)
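The options above can be sketched on a small made-up frame (the data and the extra column "col_A" are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data; only 'col_N' contains the search string.
df = pd.DataFrame({
    "col_A": ["x", "y"],
    "col_N": ["some_string", "z"],
})

# Same expression as in the question: an Index of matching column names.
col = df.columns[df.isin(["some_string"]).any()]

first_name = col[0]    # a single column name as a plain string
all_names = list(col)  # every matching name as a list

print(first_name)
print(all_names)
```

Indexing with col[0] raises an IndexError when nothing matches, so checking len(col) first may be worthwhile.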
I think converting it to a list will help:
list_of_columns = list(df.columns)

How to find if a value exists in all rows of a dataframe?

I have an array of unique elements and a dataframe.
I want to find out if the elements in the array exist in all the rows of the dataframe.
p.s. I am new to Python.
This is the piece of code I've written.
for i in uniqueArray:
    for index, row in newDF.iterrows():
        if i in row['MKT']:
            # do something to find out if the element i exists in all rows
Also, this way of iterating is quite expensive. Is there a better way to do the same?
Thanks in Advance.
Pandas allows you to filter a whole column as if it were Excel:
import pandas
df = pandas.DataFrame(tableData)
Imagine your column names are "Column1", "Column2", etc.
df2 = df[df["Column1"] == "ValueToFind"]
df2 now has only the rows that have "ValueToFind" in df["Column1"]. You can combine several filters with the logical operators & (AND) and | (OR).
You can try
for i in uniqueArray:
    # string methods live under .str; contains() treats i as a regular expression by default
    if newDF['MKT'].str.contains(i).any():
        # do your task
        pass
You can use the isin() method of a pd.Series object.
Assuming you have a data frame named df and you want to check whether your column 'MKT' includes any items of your uniqueArray:
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will only contain the rows where the value of MKT is contained in uniqueArray.
Now do your things on new_df, and join/merge/concat to the former df as you wish.
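A possible vectorised sketch of the original question, under two assumptions that the thread does not confirm: 'MKT' holds strings, and "exists in a row" means the element occurs as a substring of that row's value. The sample data is made up.

```python
import pandas as pd

# Hypothetical data; the real newDF and uniqueArray are not shown in the question.
newDF = pd.DataFrame({"MKT": ["US,EU", "US", "EU,US"]})
uniqueArray = ["US", "EU"]

# For each element: does it occur (as a literal substring) in every row of 'MKT'?
in_all_rows = {
    item: bool(newDF["MKT"].str.contains(item, regex=False).all())
    for item in uniqueArray
}

# For exact matches instead of substrings, isin() filters the rows directly.
exact_rows = newDF[newDF["MKT"].isin(["US"])].copy()

print(in_all_rows)
print(exact_rows)
```

Replacing .all() with .any() answers the weaker question of whether the element appears in at least one row.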

Creating new pandas dataframe from certain columns of existing dataframe

I have read a csv file into a pandas dataframe and want to do some simple manipulations on the dataframe. I can not figure out how to create a new dataframe based on selected columns from my original dataframe. My attempt:
names = ['A','B','C','D']
dataset = pandas.read_csv('file.csv', names=names)
new_dataset = dataset['A','D']
I would like to create a new dataframe with the columns A and D from the original dataframe.
This is called subsetting: pass a list of columns in []:
dataset = pandas.read_csv('file.csv', names=names)
new_dataset = dataset[['A','D']]
which is the same as:
new_dataset = dataset.loc[:, ['A','D']]
If you only need the filtered output, add the usecols parameter to read_csv:
new_dataset = pandas.read_csv('file.csv', names=names, usecols=['A','D'])
EDIT:
If you use only:
new_dataset = dataset[['A','D']]
and then manipulate the data, you may get:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
If you modify values in new_dataset later, you will find that the modifications do not propagate back to the original data (dataset), and that pandas issues a warning.
As EdChum pointed out, add copy to remove the warning:
new_dataset = dataset[['A','D']].copy()
You must pass a list of column names to select columns. Otherwise, the key is interpreted as a tuple, i.e. a MultiIndex key; df['A','D'] would work only if df.columns were a MultiIndex.
The most obvious way is df.loc[:, ['A', 'D']], but there are other ways (note how all of them take lists):
df1 = df.filter(items=['A', 'D'])
df1 = df.reindex(columns=['A', 'D'])
df1 = df.get(['A', 'D']).copy()
N.B. items is the first positional argument, so df.filter(['A', 'D']) also works.
Note that filter() and reindex() return a copy as well, so you don't need to worry about getting SettingWithCopyWarning later.
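A small sketch of the selection approaches above, using an in-memory frame instead of file.csv (the values are made up). It shows that a copy can be modified without touching the original:

```python
import pandas as pd

# Hypothetical stand-in for the frame read from file.csv.
dataset = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6], "D": [7, 8]})

# Select columns with a list and take an explicit copy.
new_dataset = dataset[["A", "D"]].copy()

# Modifying the copy leaves the original frame unchanged.
new_dataset.loc[0, "A"] = 100

# The alternative spellings select the same columns.
same_via_loc = dataset.loc[:, ["A", "D"]]
same_via_filter = dataset.filter(items=["A", "D"])

print(new_dataset)
print(dataset)
```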

pandas automatically create dataframe from list of series with column names

I have a list of pandas series objects. I have a list of functions that generate them. How do I create a dataframe of the objects with the column names being the names of the functions that created the objects?
So, to create the regular dataframe, I've got:
pandas.concat([list of series objects],axis=1,join='inner')
But I don't currently have a way to insert all the functionA.__name__, functionB.__name__, etc. as column names in the dataframe.
How would I preserve the same conciseness, and set the column names?
IIUC, you can first build the concatenated dataframe df:
df = pandas.concat([list of series objects],axis=1,join='inner')
and then assign the column names as a list of function names:
df.columns = [functionA.__name__, functionB.__name__, etc.]
Hope that helps.
You can set the column names in a second step:
df = pandas.concat([list of series objects],axis=1,join='inner')
df.columns = [functionA.__name__, functionB.__name__]
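A runnable sketch of this two-step approach, with two placeholder generator functions (functionA and functionB here are stand-ins, since the asker's functions are not shown). It also shows the keys argument of concat, which may keep it to one call:

```python
import pandas as pd

# Placeholder functions standing in for the asker's series generators.
def functionA():
    return pd.Series([1, 2, 3])

def functionB():
    return pd.Series([4, 5, 6])

functions = [functionA, functionB]
series_list = [f() for f in functions]
names = [f.__name__ for f in functions]

# Two steps: concatenate, then assign the names.
df = pd.concat(series_list, axis=1, join="inner")
df.columns = names

# One step: with axis=1, keys become the column names.
df_keys = pd.concat(series_list, axis=1, join="inner", keys=names)

print(df)
```

Building names from the same list of functions that produced the series keeps the two in the same order.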

How to get pandas.DataFrame columns containing specific dtype

I'm using df.columns.values to make a list of column names which I then iterate over and make charts, etc... but when I set this up I overlooked the non-numeric columns in the df. Now, I'd much rather not simply drop those columns from the df (or a copy of it). Instead, I would like to find a slick way to eliminate them from the list of column names.
Now I have:
names = df.columns.values
what I'd like to get to is something that behaves like:
names = df.columns.values(column_type=float64)
Is there any slick way to do this? I suppose I could make a copy of the df, and drop those non-numeric columns before doing columns.values, but that strikes me as clunky.
Welcome any inputs/suggestions. Thanks.
Someone will possibly give you a better answer than this, but if all your numeric data are int64 or float64, you can create a dict of the column data types and then use its values to build your list of columns.
So, for example, in a dataframe where I have columns of type float64, int64 and object, you can first look at the data types like so:
DF.dtypes
and if they conform to the standard whereby the non-numeric columns of data are all object types (as they are in my dataframes), then you can do the following to get a list of the numeric columns:
[key for key in dict(DF.dtypes) if dict(DF.dtypes)[key] in ['float64', 'int64']]
It's just a simple list comprehension. Nothing fancy. Again, though, whether this works for you will depend upon how you set up your dataframe...
dtypes is a pandas Series.
That means it has index and values attributes.
If you only need the column names:
headers = df.dtypes.index
This returns an Index containing the column names of the "df" dataframe; wrap it in list() if you need a plain list.
There's a new feature in 0.14.1, select_dtypes, which selects columns by dtype given a list of dtypes to include or exclude.
For example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': np.random.randn(1000),
                   'b': range(1000),
                   'c': ['a'] * 1000,
                   'd': pd.date_range('2000-1-1', periods=1000)})
df.select_dtypes(['float64','int64'])
Out[129]:
          a  b
0  0.153070  0
1  0.887256  1
2 -1.456037  2
3 -1.147014  3
...
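To get back to the original goal, a list of numeric column names rather than a filtered frame, one possible sketch is to take .columns of the select_dtypes result. The small frame below is made up, and include=["number"] is used here as a broader alternative to listing 'float64' and 'int64' explicitly:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with float, integer, string and datetime columns.
df = pd.DataFrame({
    "a": np.random.randn(5),
    "b": range(5),
    "c": ["a"] * 5,
    "d": pd.date_range("2000-1-1", periods=5),
})

# Names of the numeric columns only; df itself is left untouched.
names = df.select_dtypes(include=["number"]).columns.tolist()

for name in names:
    # e.g. make one chart per numeric column
    print(name, df[name].mean())
```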
To get the column names from a pandas dataframe in Python 3:
Here I am creating a data frame from a fileName.csv file
>>> import pandas as pd
>>> df = pd.read_csv('fileName.csv')
>>> columnNames = list(df.head(0))
>>> print(columnNames)
You can also get the column names from a pandas data frame via df.columns, which returns the column names as well as the dtype. Here I read a CSV file from https://mlearn.ics.uci.edu/databases/autos/imports-85.data; it has no header row, so you have to define a header list that contains the column names.
import pandas as pd
url="https://mlearn.ics.uci.edu/databases/autos/imports-85.data"
df=pd.read_csv(url,header = None)
headers=["symboling","normalized-losses","make","fuel-type","aspiration","num-of-doors","body-style",
"drive-wheels","engine-location","wheel-base","length","width","height","curb-weight","engine-type",
"num-of-cylinders","engine-size","fuel-system","bore","stroke","compression-ratio","horsepower","peak-rpm"
,"city-mpg","highway-mpg","price"]
df.columns=headers
print(df.columns)