How to specify column names while reading an Excel file using Pandas?

I read an Excel Sheet into a pandas DataFrame this way:
import pandas as pd
xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet1")
The first cell of each column is used as the column name for the DataFrame. I want to specify my own column names. How do I do this?

This thread is five years old and outdated now, but it still shows up at the top of search results, so I am adding this note. Pandas (as of v0.22) has keyword arguments to specify column names when parsing Excel files. Use:
import pandas as pd
xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet1", header=None, names=['A', 'B', 'C'])
If header=None is not set, pandas treats the first row as a header and drops it during parsing. If there is indeed a header but you don't want to use it, you have two choices: either (1) use the names kwarg only; or (2) use names together with header=None and skiprows=1. I personally prefer the second option, since it clearly notes that the input file is not in the format I want and that I am working around it.
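A minimal sketch of the two options, using read_csv on an in-memory CSV for illustration (read_excel's header already defaults to 0, so with Excel option (1) needs only names; here that default is spelled out explicitly, and the file contents are made up):

```python
import io

import pandas as pd

# A file whose header row we do not want to keep
raw = "old_a,old_b\n1,2\n3,4\n"

# Option (1): give our own names and consume the header row
df1 = pd.read_csv(io.StringIO(raw), names=['A', 'B'], header=0)

# Option (2): pretend there is no header, then skip the header line
df2 = pd.read_csv(io.StringIO(raw), names=['A', 'B'], header=None, skiprows=1)

assert df1.equals(df2)
print(df1.columns.tolist())  # ['A', 'B']
```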

I think setting them afterwards is the only way in this case. So if you have, for example, four columns in your DataFrame:
df.columns = ['W','X','Y','Z']
If you know in advance what the headers in the Excel file are, it's probably better to rename them. This would rename W to A, X to B, and so on:
df = df.rename(columns={'W': 'A', 'X': 'B'})  # and so on for the remaining columns
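For example, with a hypothetical four-column frame:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=['W', 'X', 'Y', 'Z'])

# Rename selectively; unmapped columns keep their names
renamed = df.rename(columns={'W': 'A', 'X': 'B'})
print(renamed.columns.tolist())  # ['A', 'B', 'Y', 'Z']

# Or overwrite all names at once
df.columns = ['A', 'B', 'C', 'D']
print(df.columns.tolist())  # ['A', 'B', 'C', 'D']
```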

As Ram said, this post comes up at the top of search results and may be useful to some.
In pandas 0.24.2 (possibly earlier as well), read_excel itself can ignore the source headers, assign your own column names, and apply a few other useful controls:
DID = pd.read_excel(file1, sheet_name=0, header=None, usecols=[0, 1, 6],
                    names=['A', 'ID', 'B'], dtype={2: str}, skiprows=10)
# for example:
# usecols  => read only the given column indexes
# names    => assign your own column names
# dtype    => specify the data types
# skiprows => skip the given number of rows from the top

Call .parse with the header=None keyword argument:
df = xl.parse("Sheet1", header=None)

In case the Excel sheet only contains the data, without headers:
df = pd.read_excel("the excel file", header=None, names=["A", "B", "C"])
In case the Excel sheet already contains header names, use skiprows to skip that line:
df = pd.read_excel("the excel file", header=None, names=["A", "B", "C"], skiprows=1)


How to read an Excel file with nested columns with pandas?

I am trying to read an Excel file using pandas but my columns and index are changed:
df = pd.read_excel('Assignment.xlsx',sheet_name='Assignment',index_col=0)
(Screenshots of the Excel file and the resulting Jupyter notebook output omitted.)
By default pandas considers the first row to be the header. You need to tell it to take two rows as the header:
df = pd.read_excel("xyz.xlsx", header=[0, 1], usecols="A:I", skiprows=[0])
print(df)
Whether you need skiprows depends on your file; if you remove it, the first row is used as a header without any unnamed entries.
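A sketch of what header=[0, 1] produces, using read_csv on a made-up two-row header (the same mechanism read_excel uses):

```python
import io

import pandas as pd

raw = (
    "group,group,total\n"  # first header row
    "x,y,z\n"              # second header row
    "1,2,3\n"              # data
)
df = pd.read_csv(io.StringIO(raw), header=[0, 1])

# Columns become a two-level MultiIndex
print(df.columns.tolist())        # [('group', 'x'), ('group', 'y'), ('total', 'z')]
print(df[('group', 'x')].iloc[0])  # 1
```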

Can't drop columns with pandas if index_col=0 is used while reading CSVs [duplicate]

I have the following code which imports a CSV file. There are 3 columns and I want to set the first two of them to variables. When I set the second column to the variable "efficiency" the index column is also tacked on. How can I get rid of the index column?
df = pd.DataFrame.from_csv('Efficiency_Data.csv', header=0, parse_dates=False)
energy = df.index
efficiency = df.Efficiency
print(efficiency)
I tried using
del df['index']
after I set
energy = df.index
which I found in another post but that results in "KeyError: 'index' "
When writing to and reading from a CSV file, include the arguments index=False and index_col=False, respectively. An example follows.
To write:
df.to_csv(filename, index=False)
and to read from the csv
df = pd.read_csv(filename, index_col=False)
This should prevent the issue so you don't need to fix it later.
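A minimal round trip through an in-memory buffer (made-up data) showing that no stray index column comes back:

```python
import io

import pandas as pd

df = pd.DataFrame({'Energy': [100, 200], 'Efficiency': [0.62, 0.71]})

buf = io.StringIO()
df.to_csv(buf, index=False)  # no index column written
buf.seek(0)

df2 = pd.read_csv(buf, index_col=False)
assert df.equals(df2)  # same columns, nothing extra
```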
df.reset_index(drop=True, inplace=True)
DataFrames and Series always have an index. Although it displays alongside the column(s), it is not a column, which is why del df['index'] did not work.
If you want to replace the index with simple sequential numbers, use df.reset_index().
To get a sense for why the index is there and how it is used, see e.g. 10 minutes to Pandas.
You can set one of the columns as an index in case it is an "id" for example.
In this case the index column will be replaced by one of the columns you have chosen.
df.set_index('id', inplace=True)
If your problem is the same as mine, where you just want to reset the column headers to 0 through the number of columns, do:
df = pd.DataFrame(df.values)
EDIT: this is not a good idea if you have heterogeneous data types. Better to just use:
df.columns = range(len(df.columns))
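For example:

```python
import pandas as pd

df = pd.DataFrame({'foo': [1], 'bar': [2]})
df.columns = range(len(df.columns))
print(df.columns.tolist())  # [0, 1]
```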
You can specify which column to use as the index with the index_col parameter of from_csv (since deprecated in favor of read_csv).
If this doesn't solve your problem, please provide an example of your data.
One thing I do is df = df.reset_index(), then df = df.drop(['index'], axis=1).
To avoid creating the default index column, you can set index_col to False and keep header as 0. Here is an example of how you can do it:
recording = pd.read_excel("file.xls",
                          sheet_name="sheet1",
                          header=0,
                          index_col=False)
header=0 turns the first row into the column headers, which you can use later to refer to the columns.
It works for me this way:
df = data.set_index("name of the column header to use as the index column")

How to read the same column header, but spelt differently, in new files with pandas

So I am writing in Python using pandas. The code that I wrote extracts specific column headers from an Excel file, which works, but I don't want to have to go into the code every time to change the names of the column headers to extract when working on new files with the same data.
Here is my extraction method:
xlsx = pd.ExcelFile('filepath')
df = pd.read_excel(xlsx, 'Tabelle1')
df2 = df[['ZONE_NAME', 'POLYGONSTRING']]
df2.to_csv(r'filepath\name', sep=';', index=False, header=True)
So When I run this code to another excel file I want it to accept any possible name for "ZONE_NAME" which could be "zonename", "Zone Name" etc...
If your problem is just limited to different ways to write a column like "ZONE_NAME" (e.g., "zone_name", "ZONENAME", "ZONE_name", etc.) then why not just use some type of filter on the column names:
import re

import pandas as pd

xlsx = pd.ExcelFile('filepath')
df = pd.read_excel(xlsx, 'Tabelle1')
# Strip any non-alphabetical characters from each column name and
# lower-case it (so "ZONE_NAME", "ZONENAME" and "zone_NAME" all
# become "zonename")
filtered_columns = [re.sub('[^a-zA-Z]', '', c).lower() for c in df.columns]
df.columns = filtered_columns
df2 = df[filtered_columns]
df2.to_csv(r'filepath\name', sep=';', index=False, header=True)
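The normalization step on its own (the spellings below are hypothetical variants):

```python
import re

variants = ['ZONE_NAME', 'Zone Name', 'zonename', 'ZONE-name']
normalized = [re.sub('[^a-zA-Z]', '', v).lower() for v in variants]
print(normalized)  # ['zonename', 'zonename', 'zonename', 'zonename']
```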
Hope this helps.
pandas read_csv will automatically detect column headers; there is no need to specify anything ahead of time. Your post is missing the links to any image; it would be better to post samples of the data inline, though.

Pandas read in table without headers

Using pandas, how do I read in only a subset of the columns (say 4th and 7th columns) of a .csv file with no headers? I cannot seem to be able to do so using usecols.
In order to read a CSV that doesn't have a header, and only certain columns of it, you need to pass header=None and usecols=[3, 6] for the 4th and 7th columns:
df = pd.read_csv(file_path, header=None, usecols=[3,6])
See the docs
The previous answers are good and correct, but in my opinion an extra names parameter makes them perfect, and it should be the recommended way, especially when the CSV has no headers.
Solution
Use usecols and names parameters
df = pd.read_csv(file_path, usecols=[3,6], names=['colA', 'colB'])
Additional reading
or use header=None to explicitly tell people that the CSV has no headers (both lines are identical anyway):
df = pd.read_csv(file_path, usecols=[3,6], names=['colA', 'colB'], header=None)
So that you can retrieve your data by
# with `names` parameter
df['colA']
df['colB']
instead of
# without `names` parameter
df[0]
df[1]
Explanation
Per the read_csv documentation: when names is passed explicitly, header behaves like None instead of 0, so header=None can be omitted when names is present.
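That behavior can be checked quickly with an in-memory CSV (made-up data):

```python
import io

import pandas as pd

raw = "h1,h2\n1,2\n"
df = pd.read_csv(io.StringIO(raw), names=['a', 'b'])

# With names given, header behaves like None: the first line
# is kept as data, not consumed as a header
print(df.columns.tolist())   # ['a', 'b']
print(df.iloc[0].tolist())   # ['h1', 'h2']
```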
Make sure you pass header=None and add usecols=[3, 6] for the 4th and 7th columns.
As per the documentation (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html):
header : int, list of int, default 'infer'
Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
names : array-like, optional
List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
cols = ['Day', 'PLMN', 'RNCname']
tempo = pd.read_csv("info.csv", sep=';', header=0, names=cols, index_col=False)
You can also call read_table() with header=None (to read the first row of the file as the first row of the data):
df = pd.read_table('test.tsv', usecols=[3, 6], header=None)
This function is more useful when the separator is \t (a .tsv file, etc.), because its default delimiter is \t (unlike read_csv, whose default delimiter is ,).
