I am trying to understand the meaning of the output of the following code:
import pandas as pd
index = ['index1','index2','index3']
columns = ['col1','col2','col3']
df = pd.DataFrame([[1,2,3],[1,2,3],[1,2,3]], index=index, columns=columns)
print(df.index)
I would expect just a list containing the index of the dataframe:
['index1', 'index2', 'index3']
however the output is:
Index([u'index1', u'index2', u'index3'], dtype='object')
This is the pretty-printed output of the pandas.Index object. If you look at its type, it shows the class:
In [45]:
index = ['index1','index2','index3']
columns = ['col1','col2','col3']
df = pd.DataFrame([[1,2,3],[1,2,3],[1,2,3]], index=index, columns=columns)
df.index
Out[45]:
Index(['index1', 'index2', 'index3'], dtype='object')
In [46]:
type(df.index)
Out[46]:
pandas.indexes.base.Index
So what it shows is that you have an Index with the elements 'index1' and so on. The dtype is object, which is how pandas stores str values.
If you don't pass your list of strings for the index, you get the default integer index, which is the newer RangeIndex type:
In [47]:
df = pd.DataFrame([[1,2,3],[1,2,3],[1,2,3]], columns=columns)
df.index
Out[47]:
RangeIndex(start=0, stop=3, step=1)
If you wanted a list of the values:
In [51]:
list(df.index)
Out[51]:
['index1', 'index2', 'index3']
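To make the distinction concrete, here is a minimal sketch using the same frame as above. The variable names labels, is_index and default_index are only illustrative. It shows that df.index is an Index object, that tolist() gives the plain list the question expected, and that omitting the index yields a RangeIndex:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
    index=["index1", "index2", "index3"],
    columns=["col1", "col2", "col3"],
)

# The index is a pandas object, not a list.
is_index = isinstance(df.index, pd.Index)

# tolist() (or list(df.index)) returns the labels as a plain Python list.
labels = df.index.tolist()

# Without an explicit index, pandas falls back to a RangeIndex.
default_index = pd.DataFrame([[1, 2, 3]] * 3).index

print(labels)
print(type(default_index).__name__)
```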
Related
I have a dataframe which I read from a CSV as:
df = pd.read_csv(csv_path, header = None)
By default, Pandas assigns the header (df.columns) to be [0, 1, 2, ...] of type int64
What's the best way to convert this to type str, such that df.columns results in ['0', '1', '2',...] (i.e. type str)?
Currently, the best way I can think of doing this is df.columns = list(map(str, df.columns))
Unfortunately, df.astype(str) only affects the values and not the column names
You can use astype(str) with column names like this:
df.columns = df.columns.astype(str)
Example:
In [2472]: l = [1,2]
In [2473]: l1 = [2,3]
In [2475]: df = pd.DataFrame([l, l1])
In [2476]: df
Out[2476]:
0 1
0 1 2
1 2 3
In [2480]: df.columns = df.columns.astype(str)
In [2482]: df.columns
Out[2482]: Index(['0', '1'], dtype='object')
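The same conversion applied to the read_csv(header=None) case from the question might look like the sketch below. The inline CSV text is made up for illustration and stands in for csv_path:

```python
import io
import pandas as pd

# Made-up CSV content without a header row (stand-in for csv_path).
csv_text = "1,2,3\n4,5,6\n"

df = pd.read_csv(io.StringIO(csv_text), header=None)
original = df.columns.tolist()  # integer labels assigned by pandas

# Convert the column labels themselves, not the cell values.
df.columns = df.columns.astype(str)
renamed = df.columns.tolist()

print(original, renamed)
```

After this, columns are looked up with string keys such as df['0'] rather than df[0].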
I am trying to create a dataframe in Pandas from the AB column in my csv file. (AB is the 27th column).
I am using this line:
df = pd.read_csv(filename, error_bad_lines = False, usecols = [27])
... which is resulting in this error:
ValueError: Usecols do not match names.
I'm very new to Pandas. Could someone point out what I'm doing wrong?
Here is a small demo:
CSV file (without a header, i.e. there are NO column names):
1,2,3,4,5,6,7,8,9,10
11,12,13,14,15,16,17,18,19,20
We are going to read only the 8th column:
In [1]: fn = r'D:\temp\.data\1.csv'
In [2]: df = pd.read_csv(fn, header=None, usecols=[7], names=['col8'])
In [3]: df
Out[3]:
col8
0 8
1 18
PS: pay attention to header=None, usecols=[7], names=['col8']
If you don't use header=None and names parameters, the first row will be used as a header:
In [6]: df = pd.read_csv(fn, usecols=[7])
In [7]: df
Out[7]:
8
0 18
In [8]: df.columns
Out[8]: Index(['8'], dtype='object')
And if we want to read only the last (10th) column:
In [9]: df = pd.read_csv(fn, usecols=[10])
... skipped ...
ValueError: Usecols do not match names.
This fails because pandas counts columns starting from 0, so we have to do it this way:
In [12]: df = pd.read_csv(fn, usecols=[9], names=['col10'])
In [13]: df
Out[13]:
col10
0 10
1 20
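To avoid the off-by-one mistake, the position can be computed from the column count. The following is a minimal sketch that uses an in-memory copy of the demo file instead of a path on disk. The names n_cols and last are illustrative, and the column is renamed after reading rather than via names=:

```python
import io
import pandas as pd

# In-memory copy of the two-line demo file (no header row).
csv_text = "1,2,3,4,5,6,7,8,9,10\n11,12,13,14,15,16,17,18,19,20\n"

# Count the fields on the first line; valid positions are 0 .. n_cols - 1.
n_cols = len(csv_text.splitlines()[0].split(","))

# The 10th column lives at zero-based position 9.
last = pd.read_csv(io.StringIO(csv_text), header=None, usecols=[n_cols - 1])
last.columns = ["col10"]

print(last)
```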
usecols accepts column names as well as zero-based column positions.
If your CSV file has a header row with a column named AB, you can pass usecols=['AB'] instead of usecols=[27]. The error says that the columns requested in usecols do not match the columns found in the file.
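As a small illustration, the sketch below selects the same column once by name and once by zero-based position. The three-column header A,B,AB is made up for the example and is not the asker's real file:

```python
import io
import pandas as pd

# Made-up file with a header row; AB is the third column (position 2).
csv_text = "A,B,AB\n1,2,3\n4,5,6\n"

by_name = pd.read_csv(io.StringIO(csv_text), usecols=["AB"])
by_pos = pd.read_csv(io.StringIO(csv_text), usecols=[2])

print(by_name.equals(by_pos))
```

Selecting by name is usually less fragile when the header is known, since it does not depend on counting columns.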
I'm trying to read a csv file as a DataFrame with pandas, and I want to read the index column as strings. However, since the index column doesn't contain any non-digit characters, pandas handles this data as integers. How can I read it as strings?
Here are my csv file and code:
[sample.csv]
uid,f1,f2,f3
01,0.1,1,10
02,0.2,2,20
03,0.3,3,30
[code]
df = pd.read_csv('sample.csv', index_col="uid", dtype=float)
print(df.index.values)
The result: df.index is integer, not string:
>>> [1 2 3]
But I want to get df.index as string:
>>> ['01', '02', '03']
And an additional condition: the rest of the columns have to be numeric, and there are too many of them for me to list by name.
Pass the dtype param to specify the dtype:
In [159]:
import pandas as pd
import io
t="""uid,f1,f2,f3
01,0.1,1,10
02,0.2,2,20
03,0.3,3,30"""
df = pd.read_csv(io.StringIO(t), dtype={'uid':str})
df.set_index('uid', inplace=True)
df.index
Out[159]:
Index(['01', '02', '03'], dtype='object', name='uid')
So in your case the following should work:
df = pd.read_csv('sample.csv', dtype={'uid':str})
df.set_index('uid', inplace=True)
The one-line equivalent doesn't work, due to a pandas bug (still outstanding at the time of writing) where the dtype param is ignored on cols that are to be treated as the index:
df = pd.read_csv('sample.csv', dtype={'uid':str}, index_col='uid')
You can dynamically do this if we assume the first column is the index column:
In [171]:
t="""uid,f1,f2,f3
01,0.1,1,10
02,0.2,2,20
03,0.3,3,30"""
cols = pd.read_csv(io.StringIO(t), nrows=1).columns.tolist()
index_col_name = cols[0]
dtypes = dict(zip(cols[1:], [float]* len(cols[1:])))
dtypes[index_col_name] = str
df = pd.read_csv(io.StringIO(t), dtype=dtypes)
df.set_index('uid', inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 01 to 03
Data columns (total 3 columns):
f1 3 non-null float64
f2 3 non-null float64
f3 3 non-null float64
dtypes: float64(3)
memory usage: 96.0+ bytes
In [172]:
df.index
Out[172]:
Index(['01', '02', '03'], dtype='object', name='uid')
Here we read just the header row to get the column names:
cols = pd.read_csv(io.StringIO(t), nrows=1).columns.tolist()
We then generate a dict of the column names with the desired dtypes:
index_col_name = cols[0]
dtypes = dict(zip(cols[1:], [float]* len(cols[1:])))
dtypes[index_col_name] = str
We get the index name, assuming it's the first entry, and create a dict that maps the rest of the cols to float as the desired dtype. We then add the index col with str as its dtype. You can then pass this dict as the dtype param to read_csv.
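The steps above can be wrapped in a small helper. This is only a sketch: read_with_str_index is a hypothetical name, it takes the CSV content as a string for the sake of a self-contained example, and it assumes the first column is the index and every other column is numeric:

```python
import io
import pandas as pd

def read_with_str_index(csv_text):
    """Read CSV text with a str index (first column) and float data columns."""
    # nrows=0 parses only the header row, which is all we need here.
    header = pd.read_csv(io.StringIO(csv_text), nrows=0).columns.tolist()
    index_col, value_cols = header[0], header[1:]

    # float for every data column, str for the index column.
    dtypes = {name: float for name in value_cols}
    dtypes[index_col] = str

    # Set the index after reading, as in the answer above.
    return pd.read_csv(io.StringIO(csv_text), dtype=dtypes).set_index(index_col)

sample = "uid,f1,f2,f3\n01,0.1,1,10\n02,0.2,2,20\n03,0.3,3,30\n"
df = read_with_str_index(sample)
print(df.index.tolist())
```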
If the result is not a string you have to convert it to be a string.
try:
result = [str(i) for i in result]
or in this case:
print([str(i) for i in df.index.values])
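One caveat with converting after the fact: once pandas has parsed the column as integers, the leading zeros are already gone, so str() gives '1' rather than '01'. The sketch below shows this and one possible workaround, zero-padding, which assumes every uid has a known fixed width of 2:

```python
import io
import pandas as pd

csv_text = "uid,f1,f2,f3\n01,0.1,1,10\n02,0.2,2,20\n03,0.3,3,30\n"

# Without a dtype, the uid column is parsed as integers.
df = pd.read_csv(io.StringIO(csv_text), index_col="uid")

# Plain string conversion cannot recover the leading zeros.
converted = [str(i) for i in df.index.values]

# Assumption: all ids are exactly 2 characters wide, so padding restores them.
padded = df.index.astype(str).str.zfill(2).tolist()

print(converted, padded)
```

If the width is not fixed, reading the column with dtype str in the first place is the safer route.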
I have a list with lots of dataframes
col = ['open', 'high', 'low', 'close']
index = [1, 2, 3, 4]
df1 = pd.DataFrame(columns=col, index=index)
df2 = pd.DataFrame(columns=col, index=index)
df3 = pd.DataFrame(columns=col, index=index)
dflist = [df1, df2, df3]
I need to rename all the columns of all the dataframes in the list. I need to add the name of each dataframe to the name of each column. I tried to do it with a for loop.
for key in dflist:
    key.rename(columns=lambda x: key+x)
Obviously, this is not working. The desired output would be:
In [1]: df1.columns.tolist()
Out [2]: ['df1open', 'df1high', 'df1low', 'df1close']
In [3]: df2.columns.tolist()
Out [4]: ['df2open', 'df2high', 'df2low', 'df2close']
In [5]: df3.columns.tolist()
Out [6]: ['df3open', 'df3high', 'df3low', 'df3close']
Thanks for your help.
You want to use a dict instead of a list to store the DataFrames, if you need to somehow access their "names" and manipulate them programmatically (think when you have thousands of them). Also note the use of the inplace argument, which is common in pandas:
import pandas as pd
col = ['open', 'high', 'low', 'close']
index = [1, 2, 3, 4]
df_all = {'df1': pd.DataFrame(columns=col, index=index),
'df2': pd.DataFrame(columns=col, index=index),
'df3': pd.DataFrame(columns=col, index=index)}
# items() works on both Python 2 and 3 (iteritems() is Python 2 only)
for key, df in df_all.items():
    df.rename(columns=lambda x: key+x, inplace=True)
print(df_all['df1'].columns.tolist())
Output:
['df1open', 'df1high', 'df1low', 'df1close']
There are a couple of issues here. Firstly, dflist is the list of DataFrames, as opposed to the names of those DataFrames. So df1 is not the same as "df1", which means that key + x isn't a string concatenation.
Secondly, the rename() function returns a new DataFrame. So you have to pass the inplace=True parameter to overwrite the existing column names.
Try this instead:
dflist = ['df1', 'df2', 'df3']
for key in dflist:
    df = eval(key)  # look up the DataFrame object by its variable name
    df.rename(columns=lambda x: key+x, inplace=True)
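As an alternative sketch that avoids both eval and inplace, the frames can be kept in a dict and renamed with DataFrame.add_prefix, which returns a new DataFrame. The names frames and renamed are illustrative:

```python
import pandas as pd

col = ["open", "high", "low", "close"]
index = [1, 2, 3, 4]

# Keep the frames in a dict so each one carries its name as the key.
frames = {name: pd.DataFrame(columns=col, index=index) for name in ("df1", "df2", "df3")}

# add_prefix returns a renamed copy; the originals are left untouched.
renamed = {name: frame.add_prefix(name) for name, frame in frames.items()}

print(renamed["df2"].columns.tolist())
```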
df = pd.DataFrame({'a':[2,3,5], 'b':[1,2,3], 'c':[12,13,14]})
df.set_index(['a','b'], inplace=True)
display(df)
s = df.iloc[1]
# How to get 'a' and 'b' value from s?
It is so annoying that once columns become indices we cannot simply use df['colname'] to fetch values.
Does this encourage us to use set_index(drop=False)?
When I print s I get
In [8]: s = df.iloc[1]
In [9]: s
Out[9]:
c 13
Name: (3, 2), dtype: int64
which has a and b in the name part, which you can access with:
s.name
Something else that you can do is
df.index.values
and specifically for your iloc[1]
df.index.values[1]
Does this help? Other than this I am not sure what you are looking for.
If you want to get the level names "a" and "b":
df.index.names
gives:
FrozenList(['a', 'b'])
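Putting the pieces together, a minimal sketch that pairs the level names with the values stored in s.name. The names key and a_values are illustrative, and get_level_values is shown as one further way to read an index level for the whole frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 3, 5], "b": [1, 2, 3], "c": [12, 13, 14]})
df = df.set_index(["a", "b"])

s = df.iloc[1]

# s.name is the index tuple of the selected row; zip it with the level names.
key = dict(zip(df.index.names, s.name))

# All values of one index level across the frame.
a_values = df.index.get_level_values("a").tolist()

print(key, a_values)
```

Calling df.reset_index() is another option when the index levels should be available as ordinary columns again.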