remove Unwanted lines in csv before loading into DF [duplicate]

remove Unwanted lines in csv before loading into DF [duplicate] - python

I'm trying to import a .csv file using pandas.read_csv(), however, I don't want to import the 2nd row of the data file (the row with index = 1 for 0-indexing).
I can't see how not to import it because the arguments used with the command seem ambiguous:
From the pandas website:
skiprows : list-like or integer
Row numbers to skip (0-indexed) or number of rows to skip (int) at the
start of the file."
If I put skiprows=1 in the arguments, how does it know whether to skip the first row or skip the row with index 1?

You can try yourself:
>>> import pandas as pd
>>> from StringIO import StringIO
>>> s = """1, 2
... 3, 4
... 5, 6"""
>>> pd.read_csv(StringIO(s), skiprows=[1], header=None)
0 1
0 1 2
1 5 6
>>> pd.read_csv(StringIO(s), skiprows=1, header=None)
0 1
0 3 4
1 5 6

I don't have reputation to comment yet, but I want to add to alko answer for further reference.
From the docs:
skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows

I got the same issue while running the skiprows while reading the csv file.
I was doning skip_rows=1 this will not work
Simple example gives an idea how to use skiprows while reading csv file.
import pandas as pd
#skiprows=1 will skip first line and try to read from second line
df = pd.read_csv('my_csv_file.csv', skiprows=1) ## pandas as pd
#print the data frame
df

All of these answers miss one important point -- the n'th line is the n'th line in the file, and not the n'th row in the dataset. I have a situation where I download some antiquated stream gauge data from the USGS. The head of the dataset is commented with '#', the first line after that are the labels, next comes a line that describes the date types, and last the data itself. I never know how many comment lines there are, but I know what the first couple of rows are. Example:
> # ----------------------------- WARNING ----------------------------------
> # Some of the data that you have obtained from this U.S. Geological Survey database
> # may not have received Director's approval. ... agency_cd site_no datetime tz_cd 139719_00065 139719_00065_cd
> 5s 15s 20d 6s 14n 10s USGS 08041780 2018-05-06 00:00 CDT 1.98 A
It would be nice if there was a way to automatically skip the n'th row as well as the n'th line.
As a note, I was able to fix my issue with:
import pandas as pd
ds = pd.read_csv(fname, comment='#', sep='\t', header=0, parse_dates=True)
ds.drop(0, inplace=True)

Indices in read_csv refer to line/row numbers in your csv file (the first line has the index 0). You have the following options to skip rows:
from io import StringIO
csv = \
"""col1,col2
1,a
2,b
3,c
4,d
"""
pd.read_csv(StringIO(csv))
# Output:
col1 col2 # index 0
0 1 a # index 1
1 2 b # index 2
2 3 c # index 3
3 4 d # index 4
Skip two lines at the start of the file (index 0 and 1). Column names are skipped as well (index 0) and the top line is used for column names. To add column names use names = ['col1', 'col2'] parameter:
pd.read_csv(StringIO(csv), skiprows=2)
# Output:
2 b
0 3 c
1 4 d
Skip second and fourth lines (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=[1, 3])
# Output:
col1 col2
0 2 b
1 4 d
Skip last two lines:
pd.read_csv(StringIO(csv), engine='python', skipfooter=2)
# Output:
col1 col2
0 1 a
1 2 b
Use a lambda function to skip every second line (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=lambda x: (x % 2) != 0)
# Output:
col1 col2
0 2 b
1 4 d

skip[1] will skip second line, not the first one.

Related

Make Pandas figure out how many rows to skip in pd.read_excel

I'm trying to automate reading in hundreds of excel files into a single dataframe. Thankfully the layout of the excel files is fairly constant. They all have the same header (the casing of the header may vary) and then of course the same number of columns, and the data I want to read is always stored in the first spreadsheet.
However, in some files a number of rows have been skipped before the actual data begins. There may or may not be comments and such in the rows before the actual data. For instance, in some files the header is in row 3 and then the data starts in row 4 and down.
I would like pandas to figure out on its own, how many rows to skip. Currently I use a somewhat complicated solution...I first read the file into a dataframe, check if the header is correct, if no search to find the row containing the header, and then re-read the file now knowing how many rows to skip..
def find_header_row(df, my_header):
"""Find the row containing the header."""
for idx, row in df.iterrows():
row_header = [str(t).lower() for t in row]
if len(set(my_header) - set(row_header)) == 0:
return idx + 1
raise Exception("Cant find header row!")
my_header = ['col_1', 'col_2',..., 'col_n']
df = pd.read_excel('my_file.xlsx')
# Make columns lower case (case may vary)
df.columns = [t.lower() for t in df.columns]
# Check if the header of the dataframe mathces my_header
if len(set(my_header) - set(df.columns)) != 0:
# If no... use my function to find the row containing the header
n_rows_to_skip = find_header_row(df, kolonner)
# Re-read the dataframe, skipping the right number of rows
df = pd.read_excel(fil, skiprows=n_rows_to_skip)
Since I know what the header row looks like is there a way to let pandas figure out on its own where the data begins? Or is can anyone think of a better solution?

Let's know if this work for you
import pandas as pd
df = pd.read_excel("unamed1.xlsx")
df
Unnamed: 0 Unnamed: 1 Unnamed: 2
0 NaN bad row1 badddd row111 NaN
1 baaaa NaN NaN
2 NaN NaN NaN
3 id name age
4 1 Roger 17
5 2 Rosa 23
6 3 Rob 31
7 4 Ives 15
first_row = (df.count(axis = 1) >= df.shape[1]).idxmax()
df.columns = df.loc[first_row]
df = df.loc[first_row+1:]
df
3 id name age
4 1 Roger 17
5 2 Rosa 23
6 3 Rob 31
7 4 Ives 15

Data cleaning: Remove 0 value from my dataset having a header and index_col

I have a dataset showing below.
What I would like to do is three things.
Step 1: AA to CC is an index, however, happy to keep in the dataset for the future purpose.
Step 2: Count 0 value to each row.
Step 3: If 0 is more than 20% in the row, which means more than 2 in this case because DD to MM is 10 columns, remove the row.
So I did a stupid way to achieve above three steps.
df = pd.read_csv("dataset.csv", header=None)
df_bool = (df == "0")
print(df_bool.sum(axis=1))
then I got an expected result showing below.
0 0
1 0
2 1
3 0
4 1
5 8
6 1
7 0
So removed the row #5 as I indicated below.
df2 = df.drop([5], axis=0)
print(df2)
This works well even this is not an elegant, kind of a stupid way to go though.
However, if I import my dataset as header=0, then this approach did not work at all.
df = pd.read_csv("dataset.csv", header=0)
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
How come this happens?
Also, if I would like to write a code with loop, count and drop functions, what does the code look like?

You can just continue using boolean_indexing:
First we calculate number of columns and number of zeroes per row:
n_columns = len(df.columns) # or df.shape[1]
zeroes = (df == "0").sum(axis=1)
We then select only rows that have less than 20 % zeroes.
proportion_zeroes = zeroes / n_columns
max_20 = proportion_zeroes < 0.20
df[max_20] # This will contain only rows that have less than 20 % zeroes
One liner:
df[((df == "0").sum(axis=1) / len(df.columns)) < 0.2]

It would have been great if you could have posted how the dataframe looks in pandas rather than a picture of an excel file. However, constructing a dummy df
df = pd.DataFrame({'index1':['a','b','c'],'index2':['b','g','f'],'index3':['w','q','z']
,'Col1':[0,1,0],'Col2':[1,1,0],'Col3':[1,1,1],'Col4':[2,2,0]})
Step1, assigning the index can be done using the .set_index() method as per below
df.set_index(['index1','index2','index3'],inplace=True)
instead of doing everything manually when it comes fo filtering out, you can use the return you got from df_bool.sum(axis=1) in the filtering of the dataframe as per below
df.loc[(df==0).sum(axis=1) / (df.shape[1])>0.6]
index1 index2 index3 Col1 Col2 Col3 Col4
c f z 0 0 1 0
and using that you can drop those rows, assuming 20% then you would use
df = df.loc[(df==0).sum(axis=1) / (df.shape[1])<0.2]
Ween it comes to the header issue it's a bit difficult to answer without seeing the what the file or dataframe looks like

Pandas Read csv function not alleging column headers well

I have a csv file with no headers. The first column is ID and so on... Here is how I read that file in pandas.
rss_content=pd.read_csv("rss_content.csv",header=None,names=["id","feedId","url","imageUrl","title","desc","author","createTimestamp"])
However when the file gets imported I see the first two columns of the data become index and the Id column gets assigned to third column and so on. Basically the headers are shifted by two columns to right and first two columns have no header.
Why is that and how to fix it?

Assuming you have the following CSV file:
1,2,3,4,5
11,22,33,44,55
If you specify too less column names the rest columns will become index columns:
In [1]: fn = r'D:\temp\.data\41066716.csv'
In [2]: df = pd.read_csv(fn, header=None, names=['a','b','c'])
In [3]: df
Out[3]:
a b c
1 2 3 4 5
11 22 33 44 55

Prevent pandas read_csv treating first row as header of column names

I'm reading in a pandas DataFrame using pd.read_csv. I want to keep the first row as data, however it keeps getting converted to column names.
I tried header=False but this just deleted it entirely.
(Note on my input data: I have a string (st = '\n'.join(lst)) that I convert to a file-like object (io.StringIO(st)), then build the csv from that file object.)

You want header=None the False gets type promoted to int into 0 see the docs emphasis mine:
header : int or list of ints, default ‘infer’ Row number(s) to use as
the column names, and the start of the data. Default behavior is as if
set to 0 if no names passed, otherwise None. Explicitly pass header=0
to be able to replace existing names. The header can be a list of
integers that specify row locations for a multi-index on the columns
e.g. [0,1,3]. Intervening rows that are not specified will be skipped
(e.g. 2 in this example is skipped). Note that this parameter ignores
commented lines and empty lines if skip_blank_lines=True, so header=0
denotes the first line of data rather than the first line of the file.
You can see the difference in behaviour, first with header=0:
In [95]:
import io
import pandas as pd
t="""a,b,c
0,1,2
3,4,5"""
pd.read_csv(io.StringIO(t), header=0)
Out[95]:
a b c
0 0 1 2
1 3 4 5
Now with None:
In [96]:
pd.read_csv(io.StringIO(t), header=None)
Out[96]:
0 1 2
0 a b c
1 0 1 2
2 3 4 5
Note that in latest version 0.19.1, this will now raise a TypeError:
In [98]:
pd.read_csv(io.StringIO(t), header=False)
TypeError: Passing a bool to header is invalid. Use header=None for no
header or header=int or list-like of ints to specify the row(s) making
up the column names

I think you need parameter header=None to read_csv:
Sample:
import pandas as pd
from pandas.compat import StringIO
temp=u"""a,b
2,1
1,1"""
df = pd.read_csv(StringIO(temp),header=None)
print (df)
0 1
0 a b
1 2 1
2 1 1

If you're using pd.ExcelFile to read all the excel file sheets then:
df = pd.ExcelFile("path_to_file.xlsx")
df.sheet_names # Provide the sheet names in the excel file
df = df.parse(2, header=None) # Parsing the 2nd sheet in the file with header = None
df
Output:
0 1
0 a b
1 1 1
2 0 1
3 5 2

You can set custom column name in order to prevent this:
Let say if you have two columns in your dataset then:
df = pd.read_csv(your_file_path, names = ['first column', 'second column'])
You can also generate programmatically column names if you have more than and can pass a list in front of names attribute.

Skip rows during csv import pandas

I'm trying to import a .csv file using pandas.read_csv(), however, I don't want to import the 2nd row of the data file (the row with index = 1 for 0-indexing).
I can't see how not to import it because the arguments used with the command seem ambiguous:
From the pandas website:
skiprows : list-like or integer
Row numbers to skip (0-indexed) or number of rows to skip (int) at the
start of the file."
If I put skiprows=1 in the arguments, how does it know whether to skip the first row or skip the row with index 1?

You can try yourself:
>>> import pandas as pd
>>> from StringIO import StringIO
>>> s = """1, 2
... 3, 4
... 5, 6"""
>>> pd.read_csv(StringIO(s), skiprows=[1], header=None)
0 1
0 1 2
1 5 6
>>> pd.read_csv(StringIO(s), skiprows=1, header=None)
0 1
0 3 4
1 5 6

I don't have reputation to comment yet, but I want to add to alko answer for further reference.
From the docs:
skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows

I got the same issue while running the skiprows while reading the csv file.
I was doning skip_rows=1 this will not work
Simple example gives an idea how to use skiprows while reading csv file.
import pandas as pd
#skiprows=1 will skip first line and try to read from second line
df = pd.read_csv('my_csv_file.csv', skiprows=1) ## pandas as pd
#print the data frame
df

All of these answers miss one important point -- the n'th line is the n'th line in the file, and not the n'th row in the dataset. I have a situation where I download some antiquated stream gauge data from the USGS. The head of the dataset is commented with '#', the first line after that are the labels, next comes a line that describes the date types, and last the data itself. I never know how many comment lines there are, but I know what the first couple of rows are. Example:
> # ----------------------------- WARNING ----------------------------------
> # Some of the data that you have obtained from this U.S. Geological Survey database
> # may not have received Director's approval. ... agency_cd site_no datetime tz_cd 139719_00065 139719_00065_cd
> 5s 15s 20d 6s 14n 10s USGS 08041780 2018-05-06 00:00 CDT 1.98 A
It would be nice if there was a way to automatically skip the n'th row as well as the n'th line.
As a note, I was able to fix my issue with:
import pandas as pd
ds = pd.read_csv(fname, comment='#', sep='\t', header=0, parse_dates=True)
ds.drop(0, inplace=True)

Indices in read_csv refer to line/row numbers in your csv file (the first line has the index 0). You have the following options to skip rows:
from io import StringIO
csv = \
"""col1,col2
1,a
2,b
3,c
4,d
"""
pd.read_csv(StringIO(csv))
# Output:
col1 col2 # index 0
0 1 a # index 1
1 2 b # index 2
2 3 c # index 3
3 4 d # index 4
Skip two lines at the start of the file (index 0 and 1). Column names are skipped as well (index 0) and the top line is used for column names. To add column names use names = ['col1', 'col2'] parameter:
pd.read_csv(StringIO(csv), skiprows=2)
# Output:
2 b
0 3 c
1 4 d
Skip second and fourth lines (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=[1, 3])
# Output:
col1 col2
0 2 b
1 4 d
Skip last two lines:
pd.read_csv(StringIO(csv), engine='python', skipfooter=2)
# Output:
col1 col2
0 1 a
1 2 b
Use a lambda function to skip every second line (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=lambda x: (x % 2) != 0)
# Output:
col1 col2
0 2 b
1 4 d

skip[1] will skip second line, not the first one.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

remove Unwanted lines in csv before loading into DF [duplicate] - python

You can try yourself: >>> import pandas as pd >>> from StringIO import StringIO >>> s = """1, 2 ... 3, 4 ... 5, 6""" >>> pd.read_csv(StringIO(s), skiprows=[1], header=None) 0 1 0 1 2 1 5 6 >>> pd.read_csv(StringIO(s), skiprows=1, header=None) 0 1 0 3 4 1 5 6

I don't have reputation to comment yet, but I want to add to alko answer for further reference. From the docs: skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows

skip[1] will skip second line, not the first one.

Related

Make Pandas figure out how many rows to skip in pd.read_excel

Data cleaning: Remove 0 value from my dataset having a header and index_col

Pandas Read csv function not alleging column headers well

Prevent pandas read_csv treating first row as header of column names

Skip rows during csv import pandas

Categories

Resources