Reading a multi-header Excel sheet in pandas - Python

I have a multi-header Excel sheet without any index column. When I read the file in pandas, it treats the first column as the index. I want pandas to create a default index instead of treating the first column as the index. Any help would be appreciated.
I tried the code below:
df = pd.read_excel(file, header=[1, 2], sheet_name="Ratings Inputs", usecols="A:AA", index_col=None)

From my tests, read_excel seems broken with a multi-row header: when index_col is absent or None, it behaves as if it were 0.
You have 2 possible workarounds here:
reset_index as suggested by @mounaim:
df = pd.read_excel(file, header=[1, 2], sheet_name="Ratings Inputs",
                   usecols="A:AA", index_col=None).reset_index()
It is almost correct, except that the headers of the first column are used to name the MultiIndex df.columns, and the first column is named ('index', ''). So you must re-create the columns:
df.columns = pd.MultiIndex.from_tuples([tuple(df.columns.names)]
                                       + list(df.columns)[1:])
Read the headers separately:
head = pd.read_excel(file, header=None, sheet_name="Ratings Inputs",
                     usecols="A:AA", skiprows=1, nrows=2)
df = pd.read_excel(file, header=2, sheet_name="Ratings Inputs",
                   usecols="A:AA", index_col=None)
df.columns = pd.MultiIndex.from_tuples(list(head.transpose().to_records(index=False)))

Have you tried reset_index()?
your_data_frame.reset_index(drop=True,inplace=True)

Concat dataframes - One with no column name

So I have 2 CSV files with the same number of columns. The first CSV file has its columns named (age, sex). The second file doesn't name its columns like the first one, but its data corresponds to the matching columns of the first CSV file. How can I concat them properly?
(Screenshots of the first and second CSV files omitted.)
This is how I read my files:
df1 = pd.read_csv("input1.csv")
df2 = pd.read_csv("input2.csv", header=None)
I tried using concat() like this, but I get 4 columns as a result:
df = pd.concat([df1, df2])
You can also use the append function (note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat is the replacement). Be careful to have the same column names for both, otherwise you will end up with 4 columns.
df1 = pd.read_csv("input1.csv")
df2 = pd.read_csv("input2.csv", header = None)
df2.columns = df1.columns
df = df1.append(df2, ignore_index=True)
I found a solution. After reading the second file, I added:
df2.columns = df1.columns
It works just like I wanted. I guess I'd better research more next time :). Thanks!
Final code:
df1 = pd.read_csv("input1.csv")
df2 = pd.read_csv("input2.csv", header = None)
df2.columns = df1.columns
df = pd.concat([df1, df2])
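For reference, an equivalent sketch that assigns the names at read time (assuming the same input1.csv and input2.csv files), so no separate column assignment is needed:
import pandas as pd

df1 = pd.read_csv("input1.csv")
# Reuse df1's header as the column names for the headerless second file.
df2 = pd.read_csv("input2.csv", header=None, names=df1.columns)

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
df = pd.concat([df1, df2], ignore_index=True)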

How to read a column header with a line break into Pandas?

I have a csv column heading:
"Submission S
tatus"
csv headers:
Unit,Publication ID,Title,"Submission S
tatus",Notes,Name,User ID
How can I refer to this when reading into the dataframe with the usecols parameter (or alternatively when renaming at a later stage)?
I have tried:
df = pd.read_csv('myfile.csv', usecols=['Submission S\ntatus'])
error: Usecols do not match columns, columns expected but not found
df = pd.read_csv('myfile.csv', usecols=['Submission S\rtatus'])
error: Usecols do not match columns, columns expected but not found
df = pd.read_csv('myfile.csv', usecols=['Submission S
tatus'])
error: SyntaxError: EOL while scanning string literal
How should I be referring to this column?
This is not the answer you wanted, but I hope it helps as a workaround.
df = pd.read_csv('myfile.csv', usecols=[n])
df.rename(columns={df.columns[0]: "new column name"}, inplace=True)
# n is the zero-based position of the column in the file; after usecols=[n],
# the frame holds a single column, hence df.columns[0]
You can read a CSV file in the usual way:
import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name
You can save the column names with:
colnames = df.columns
Then replace the name of the problem column with a meaningful one, as sketched below.
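A rough sketch (assuming the myfile.csv name from the question): inspect the raw header strings to see exactly which line-break characters they contain, then strip them:
import pandas as pd

df = pd.read_csv('myfile.csv')

# repr() makes the embedded whitespace visible, e.g. 'Submission S\ntatus' or 'Submission S\r\ntatus'.
print([repr(c) for c in df.columns])

# Strip carriage returns and newlines from every column name.
df = df.rename(columns=lambda c: c.replace('\r', '').replace('\n', ''))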

Skipping a range of rows after the header with pandas.read_excel

I know the argument usecols in pandas.read_excel() allows you to select specific columns.
Say I read an Excel file with pandas.read_excel(). My spreadsheet has 1161 rows. I want to keep the 1st row (with index 0) and skip rows 2:337. It seems the skiprows argument works only with 0-based indexing. I tried several different ways, but my code always reads all 1161 rows rather than only the header plus the rows from the 338th on. Such as this:
documentationscore_dataframe = pd.read_excel("Documentation Score Card_17DEC2015 Rev 2 17JAN2017.xlsx",
                                              sheet_name="Sheet1",
                                              skiprows="336",
                                              usecols="H:BD")
Here is another attempt:
documentationscore_dataframe = pd.read_excel("Documentation Score Card_17DEC2015 Rev 2 17JAN2017.xlsx",
                                              sheet_name="Sheet1",
                                              skiprows="1:336",
                                              usecols="H:BD")
I would like the dataframe to exclude rows 2 through 337 in the original Excel import.
As per the documentation for pandas.read_excel, skiprows must be list-like, an integer, or a callable, not a string.
Try this instead to exclude rows 1 to 336 inclusive:
df = pd.read_excel("file.xlsx",
sheet_name = "Sheet1",
skiprows = range(1, 337),
usecols = "H:BD")
Note: range constructor is considered list-like for this purpose, so no explicit list conversion is necessary.
You can also pass a function to skiprows=. For example, to skip the first 336 rows (after the header row):
df = pd.read_excel('Book1.xlsx', sheet_name='Sheet1', skiprows=lambda x: 1<=x<=336)
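A quick sanity check you could run (using the hypothetical file.xlsx name from above): reading with and without skiprows should differ by exactly the 336 skipped rows:
import pandas as pd

# Read the sheet once in full and once with the 336 rows skipped.
full = pd.read_excel("file.xlsx", sheet_name="Sheet1", usecols="H:BD")
trimmed = pd.read_excel("file.xlsx", sheet_name="Sheet1",
                        skiprows=range(1, 337), usecols="H:BD")

print(len(full) - len(trimmed))  # expected: 336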

Dropping a column with the pandas drop() function is not working

I have a file with time series data. From this file I want to remove the first column (containing the dates).
However, the following code:
from pandas import read_csv
dataset = read_csv('USrealGDPGrowthPred_Quarterly.txt', header=0)
dataset.drop('DATE', axis=1)
results in this error message:
ValueError: labels ['DATE'] not contained in axis
But the label is contained in the file, as the screenshot showed.
What is going on here? How can I get rid of that column?
UPDATE:
the following code:
dataset = read_csv('USrealGDPGrowthPred_Quarterly.txt', header=0, sep='\t')
dataset.drop('DATE', axis=1)
print(dataset.head(5))
does not raise an error, but it doesn't drop the column either. The data looks as if nothing happened.
So there are 2 problems:
First, you need to change the separator to tab, because read_csv defaults to sep=',', as @cᴏʟᴅsᴘᴇᴇᴅ commented:
dataset = read_csv('USrealGDPGrowthPred_Quarterly.txt', header=0, sep='\t')
Or use read_table, whose default separator is '\t':
dataset = pd.read_table('USrealGDPGrowthPred_Quarterly.txt', header=0)
Then assign the output back, or use inplace=True in drop:
dataset = dataset.drop('DATE', axis=1)
Or:
dataset.drop('DATE', axis=1, inplace=True)
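Putting the two fixes together, a minimal end-to-end sketch (using the same file name as the question):
import pandas as pd

# The file is tab-separated, so sep='\t' is required; read_csv defaults to sep=','.
dataset = pd.read_csv('USrealGDPGrowthPred_Quarterly.txt', sep='\t')

# drop() returns a new DataFrame, so assign the result back (or pass inplace=True).
dataset = dataset.drop(columns=['DATE'])
print(dataset.head())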
I had a similar issue using df.drop(columns=['column']).
Adding inplace=True, i.e. df.drop(columns=['column'], inplace=True), fixed it for me. Thank you!

pandas: add column names when reading from a CSV file

I want to read from a CSV file using pandas read_csv. The CSV file doesn't have column names. When I read the CSV with pandas, the first data row is used as the column names by default. And when I set df.columns = ['ID', 'CODE'], that first row is gone. I want to add column names, not replace the first row.
df = pd.read_csv(CSV)
df
a 55000G707270
0 b 5l0000D35270
1 c 5l0000D63630
2 d 5l0000G45630
3 e 5l000G191200
4 f 55000G703240
df.columns=['ID','CODE']
df
ID CODE
0 b 5l0000D35270
1 c 5l0000D63630
2 d 5l0000G45630
3 e 5l000G191200
4 f 55000G703240
I think you need the names parameter in read_csv:
df = pd.read_csv(CSV, names=['ID','CODE'])
names : array-like, default None
List of column names to use. If file contains no header row, then you should explicitly pass header=None. Duplicates in this list are not allowed unless mangle_dupe_cols=True, which is the default.
You may pass the column names at the time of reading the CSV file itself:
df = pd.read_csv(csv_path, names = ["ID", "CODE"])
Use the names argument in the function call to supply the column names yourself:
df = pd.read_csv(CSV, names=['ID','CODE'])
You need both header=None and names=['ID','CODE'], because there are no column names/labels/headers in your CSV file:
df = pd.read_csv(CSV, header=None, names=['ID','CODE'])
The reason extra index columns appear is that to_csv() writes the index by default, so you can either disable the index when saving your CSV:
df.to_csv('file.csv', index=False)
or you can specify an index column when reading:
df = pd.read_csv('file.csv', index_col=0)
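A short round-trip sketch of the two options (hypothetical file names, sample values taken from the question):
import pandas as pd

df = pd.DataFrame({'ID': ['a', 'b'], 'CODE': ['55000G707270', '5l0000D35270']})

# Option 1: don't write the index at all.
df.to_csv('file.csv', index=False)
df_no_index = pd.read_csv('file.csv')

# Option 2: write the index, then treat the first column as the index when reading back.
df.to_csv('file_with_index.csv')
df_from_index = pd.read_csv('file_with_index.csv', index_col=0)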
