Pandas read_csv function not aligning column headers correctly

I have a CSV file with no headers. The first column is the ID, and so on. Here is how I read that file in pandas:
rss_content=pd.read_csv("rss_content.csv",header=None,names=["id","feedId","url","imageUrl","title","desc","author","createTimestamp"])
However, when the file is imported, the first two columns of the data become the index, the id header is assigned to the third column, and so on. Basically, the headers are shifted two columns to the right and the first two columns have no header.
Why is that, and how do I fix it?

Assuming you have the following CSV file:
1,2,3,4,5
11,22,33,44,55
If you pass fewer column names than the file has columns, the extra leading columns become index columns:
In [1]: fn = r'D:\temp\.data\41066716.csv'
In [2]: df = pd.read_csv(fn, header=None, names=['a','b','c'])
In [3]: df
Out[3]:
       a   b   c
1  2   3   4   5
11 22  33  44  55
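A minimal sketch of the fix, assuming the same two-row file held in memory: check how many columns the file really has, then pass exactly that many names. The names "d" and "e" are hypothetical placeholders added for illustration.

```python
import io
import pandas as pd

csv_text = "1,2,3,4,5\n11,22,33,44,55\n"

# Read a single row without names to find out how many columns the file has.
n_cols = pd.read_csv(io.StringIO(csv_text), header=None, nrows=1).shape[1]

# Hypothetical names, one per column, so nothing is left over for the index.
names = ["a", "b", "c", "d", "e"]
assert len(names) == n_cols

df = pd.read_csv(io.StringIO(csv_text), header=None, names=names)
print(df)
```

With one name per column, pandas keeps its default integer index and every name lines up with its own column.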

Related

Python pandas drop issue on a date header

I have an Excel document that has three rows ahead of the main header (the column names).
Excel document
When loading the data into a pandas DataFrame using:
import pandas
df = pandas.read_excel('output/tracker.xlsx')
print(df)
I get this data(which is fine):
Date/Time:13/06/2022 Unnamed: 1 Unnamed: 2 Unnamed: 3
0 NaN NaN NaN NaN
1 NaN 2763 2763 NaN
2 NaN Site ID Company Site ID Region
3 203990318_700670803 203990318 689179 Nord-Ost
I do not need the first three rows, so I run:
df = df.iloc[2:]
It removes the rows with index 0 and 1.
But it doesn't remove the Date/Time:13/06/2022 Unnamed: 1 etc. row.
How do I remove that top row?

Instead, load the data directly without the unneeded rows by using the skiprows parameter of pandas.read_excel:
df = pandas.read_excel('output/tracker.xlsx', skiprows=3)
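To see the effect of skiprows without an Excel file at hand, here is a rough sketch that uses pandas.read_csv, which accepts the same skiprows parameter. The CSV text is an assumed rendering of the sheet from the question, not the real file.

```python
import io
import pandas as pd

# Assumed CSV rendering of the sheet: three unwanted rows, then the real header.
csv_text = (
    "Date/Time:13/06/2022,,,\n"
    ",,,\n"
    ",2763,2763,\n"
    ",Site ID,Company Site ID,Region\n"
    "203990318_700670803,203990318,689179,Nord-Ost\n"
)

# Skip the first three lines so the fourth line is used as the header.
df = pd.read_csv(io.StringIO(csv_text), skiprows=3)
print(df.columns.tolist())
print(df)
```

The first header cell is empty in this sample, so pandas gives that column an automatic "Unnamed: 0" label.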

Regarding "I get this data (which is fine)":
By default, pandas.read_excel assumes the first row is the header, i.e. that it holds the column names. Judging by the snippet of your data, that is not the case here. Use header=None to tell pandas that the first row holds data rather than column names:
import pandas
df = pandas.read_excel('output/tracker.xlsx',header=None)
print(df)
Then you should be able to remove those rows as you already did.

Formatting Excel to DataFrame

excel sheet snapshot
Please take a look at my excel sheet snapshot attached on the top-left end. When I create a DataFrame from this sheet my first column and row are filled with NaN. I need to skip this blank row and column to select the second row and column for DataFrame creation.
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3
0 NaN ID SCOPE TASK
1 NaN 34 XX something_1
2 NaN 534 SS something_2
3 NaN 43 FF something_3
4 NaN 32 ZZ something_4
I want my DataFrame to look like this
0 ID SCOPE TASK
1 34 XX something_1
2 534 SS something_2
3 43 FF something_3
4 32 ZZ something_4
I tried this code but didn't get what I expected
df = pd.read_excel("Book1.xlsx")
df.columns = df.iloc[0]
df.drop(df.index[1])
df.head()
NaN ID SCOPE TASK
0 NaN ID SCOPE TASK
1 NaN 34 XX something_1
2 NaN 534 SS something_2
3 NaN 43 FF something_3
4 NaN 32 ZZ something_4
I still need to drop the first column and the row at index 0 from here.
Can anyone help?

Specify the row number which will be the header (column names) of the dataframe using header parameter; in your case it is 1. Also, specify the column names using usecols parameter, in your case, they are 'ID', 'SCOPE', and 'TASK'.
df = pd.read_excel('your_excel_file.xlsx', header=1, usecols=['ID','SCOPE', 'TASK'])
See the header and usecols parameters in the pandas.read_excel documentation.
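The same header and usecols combination can be tried on in-memory data. This is only a sketch: it uses pandas.read_csv as a stand-in for read_excel (both accept header and a list of column names for usecols), and the CSV text is an assumed rendering of the sheet with its blank first row and blank first column.

```python
import io
import pandas as pd

# Assumed CSV rendering of the sheet: a blank first row and a blank first column.
csv_text = (
    ",,,\n"
    ",ID,SCOPE,TASK\n"
    ",34,XX,something_1\n"
    ",534,SS,something_2\n"
    ",43,FF,something_3\n"
    ",32,ZZ,something_4\n"
)

# header=1 uses the second line as column names; usecols keeps only the named columns.
df = pd.read_csv(io.StringIO(csv_text), header=1, usecols=["ID", "SCOPE", "TASK"])
print(df)
```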

If it's an entire column you wish to delete, try this:
del df["name of the column"]
Here's an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,2),columns=['a','b'])
# created a random dataframe 'df' with 'a' and 'b' as columns
del df['a'] # deleted column 'a' using 'del'
print(df) # no column 'a' in 'df' now

You can actually do it all while reading your Excel file with pandas. You want to:
skip the first line: header=1 (below) already does this, because rows above the header row are discarded; note that skiprows=0 skips nothing
use the columns from B to D: use the argument usecols="B:D"
use row #2 as the header (I assumed here): use the argument header=1 (0-indexed)
Answer:
df = pd.read_excel("Book1.xlsx", skiprows=0, usecols="B:D", header=1)
Edit: you don't even need skiprows when using header (skiprows=0 is a no-op anyway).
df = pd.read_excel("Book1.xlsx", usecols="B:D", header=1)
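If the sheet has already been loaded with the default settings, the frame can also be cleaned up afterwards. A minimal sketch, assuming a frame shaped like the one in the question (shortened to two data rows); the variable names raw and clean are made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the frame produced by a plain pd.read_excel("Book1.xlsx").
raw = pd.DataFrame(
    [
        [np.nan, "ID", "SCOPE", "TASK"],
        [np.nan, 34, "XX", "something_1"],
        [np.nan, 534, "SS", "something_2"],
    ],
    columns=["Unnamed: 0", "Unnamed: 1", "Unnamed: 2", "Unnamed: 3"],
)

# Drop the first row and first column, then promote the old first row to header.
clean = raw.iloc[1:, 1:].copy()
clean.columns = raw.iloc[0, 1:].tolist()
clean = clean.reset_index(drop=True)
print(clean)
```

Note that drop and similar methods return a new frame, so the result has to be assigned back, which the attempt in the question did not do.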

Set the first column of pandas dataframe as header

This is my output DataFrame from reading an excel file
I would like my first column to be index/header
one Entity
0 two v1
1 three Prod
2 four 2015-05-27 00:00:00
3 five 2018-04-27 00:00:00
4 six Both
5 seven id
6 eight hello

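To set the header of a pandas DataFrame, you can use one of the following (note that each option promotes a row, not a column, to the header):
Set header=1 while reading the file, to use the second row as the header:
e.g. df = pd.read_csv(inputfilepath, header=1)
Set skiprows=1 while reading the file, to skip the first row and use the next one as the header:
e.g. df = pd.read_csv(inputfilepath, skiprows=1)
Assign the first row to the column names with iloc[0]:
e.g. df.columns = df.iloc[0]
I hope this helps.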

One way is to transpose twice with T, which promotes the row labelled 0 to the header:
df=df.T.set_index(0).T
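If the goal is instead to turn the values of the first column into the header, one possible sketch is to set that column as the index and transpose once. The frame below is an assumed, shortened copy of the one in the question.

```python
import pandas as pd

# Assumed, shortened copy of the frame shown in the question.
df = pd.DataFrame(
    {
        "one": ["two", "three", "four"],
        "Entity": ["v1", "Prod", "Both"],
    }
)

# The first column becomes the index, and the transpose turns it into the header.
wide = df.set_index("one").T
print(wide)
```

Stopping after set_index("one") gives the first column as the row index rather than the header, which may be what "index/header" means in the question.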

Prevent pandas read_csv treating first row as header of column names

I'm reading in a pandas DataFrame using pd.read_csv. I want to keep the first row as data, however it keeps getting converted to column names.
I tried header=False but this just deleted it entirely.
(Note on my input data: I have a string (st = '\n'.join(lst)) that I convert to a file-like object (io.StringIO(st)), then build the csv from that file object.)

You want header=None. The False gets type-promoted to int, i.e. to 0; see the docs (emphasis mine):
header : int or list of ints, default ‘infer’ Row number(s) to use as
the column names, and the start of the data. Default behavior is as if
set to 0 if no names passed, otherwise None. Explicitly pass header=0
to be able to replace existing names. The header can be a list of
integers that specify row locations for a multi-index on the columns
e.g. [0,1,3]. Intervening rows that are not specified will be skipped
(e.g. 2 in this example is skipped). Note that this parameter ignores
commented lines and empty lines if skip_blank_lines=True, so header=0
denotes the first line of data rather than the first line of the file.
You can see the difference in behaviour, first with header=0:
In [95]:
import io
import pandas as pd
t="""a,b,c
0,1,2
3,4,5"""
pd.read_csv(io.StringIO(t), header=0)
Out[95]:
   a  b  c
0  0  1  2
1  3  4  5
Now with None:
In [96]:
pd.read_csv(io.StringIO(t), header=None)
Out[96]:
   0  1  2
0  a  b  c
1  0  1  2
2  3  4  5
Note that in version 0.19.1 (the latest at the time of writing), this now raises a TypeError:
In [98]:
pd.read_csv(io.StringIO(t), header=False)
TypeError: Passing a bool to header is invalid. Use header=None for no
header or header=int or list-like of ints to specify the row(s) making
up the column names

I think you need the parameter header=None in read_csv.
Sample:
import pandas as pd
# pandas.compat.StringIO was removed from pandas; io.StringIO is the standard replacement
from io import StringIO
temp = """a,b
2,1
1,1"""
df = pd.read_csv(StringIO(temp), header=None)
print(df)
   0  1
0  a  b
1  2  1
2  1  1

If you're using pd.ExcelFile to read the sheets of an Excel file, then:
xls = pd.ExcelFile("path_to_file.xlsx")
xls.sheet_names # lists the sheet names in the Excel file
df = xls.parse(2, header=None) # parse the sheet at index 2 (0-based, so the third sheet) with header=None
df
Output:
   0  1
0  a  b
1  1  1
2  0  1
3  5  2

You can set custom column names to prevent this.
For example, if your dataset has two columns:
df = pd.read_csv(your_file_path, names=['first column', 'second column'])
If there are more columns, you can also generate the names programmatically and pass the resulting list to the names parameter.
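A small sketch of generating the names programmatically, assuming comma-separated text held in memory. The "col_" prefix is an arbitrary choice for illustration.

```python
import io
import pandas as pd

csv_text = "a,b,c\n0,1,2\n3,4,5\n"

# Count the fields on the first line, then build one placeholder name per column.
n_cols = len(csv_text.splitlines()[0].split(","))
names = [f"col_{i}" for i in range(n_cols)]

# Passing names without header means the first line is kept as data.
df = pd.read_csv(io.StringIO(csv_text), names=names)
print(df)
```

Counting fields with split(",") is only safe when no field contains a quoted comma; for such files, reading one row with header=None and checking its shape is a more robust way to get the column count.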

Pandas indexing and accessing columns by names

I am trying to access pandas dataframe by column names after indexing the df with a specific column and it returns incorrect column values.
import pandas as pd
rs =pd.read_csv('rs.txt', header="infer", sep="\t", names=['id', 'exp','fov','cycle', 'color', 'values'], index_col=2)
rs.cycle.head()
I am indexing the df here with 'fov' and I want to access the 'cycle' column, but it gives me the color column instead. I think I am missing something here?
EDIT
The first few lines of the input file are:
6 3 1 G 0.96593
6 3 1 O 0.88007
6 3 1 R 0.94305
6 3 2 B 0.90554
6 3 2 G 0.93146

I think the problem arises because your data file has 5 columns and your names list has 6 elements. To verify, check the first few values in the id column: these will all be set to 6 if I am right. The first few items in the exp column will have the value 3.
To fix this, drop 'id' from the names list. With five names, 'fov' is the second column, so index on it by name (or with index_col=1) rather than with index_col=2:
rs = pd.read_csv('rs.txt', header="infer", sep="\t", names=['exp','fov','cycle', 'color', 'values'], index_col='fov')
Pandas will automatically insert row identifiers.
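The following sketch checks this on the sample rows from the question, read from memory instead of rs.txt. It assumes the file has exactly the five tab-separated columns shown, and it selects the index column by name so the position of 'fov' in the list does not matter.

```python
import io
import pandas as pd

# The five sample rows from the question, tab-separated.
tsv_text = (
    "6\t3\t1\tG\t0.96593\n"
    "6\t3\t1\tO\t0.88007\n"
    "6\t3\t1\tR\t0.94305\n"
    "6\t3\t2\tB\t0.90554\n"
    "6\t3\t2\tG\t0.93146\n"
)

# One name per column in the file; index on 'fov' by name.
names = ["exp", "fov", "cycle", "color", "values"]
rs = pd.read_csv(io.StringIO(tsv_text), sep="\t", names=names, index_col="fov")
print(rs["cycle"].head())
```

Because "values" is also a DataFrame attribute, that column is best accessed as rs["values"] rather than rs.values.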
