There is a weird .csv file, something like:
header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33
So far so good, but after these lines there is always a blank line followed by lots of useless lines. The whole thing looks something like:
header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33
dhjsakfjkldsa
fasdfggfhjhgsdfgds
gsdgffsdgfdgsdfgs
gsdfdgsg
The number of lines at the bottom is completely random; the only reliable marker is the empty line before them.
Pandas has a parameter "skipfooter" for ignoring a known number of rows in the footer.
Any idea how to ignore these rows without actually opening (open()...) the file and removing them by hand?
There is no option to make read_csv stop at the first blank line. The parser cannot accept or reject lines based on arbitrary conditions; it can only skip blank lines (optionally) or reject rows that break the expected shape of the data (rows with extra separators).
You can clean the data with the approaches below (without parsing the file yourself, pure pandas):
Knowing the number of desired/trash data rows. [Manual]
pd.read_csv('file.csv', nrows=3) or pd.read_csv('file.csv', skipfooter=4, engine='python') (skipfooter requires the Python engine)
Keeping the desired data by dropping everything else from the DataFrame. [Automatic]
df.dropna(axis=0, how='any', inplace=True)
The results will be:
header1 header2 header3
0 val11 val12 val13
1 val21 val22 val23
2 val31 val32 val33
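Putting the automatic variant together end to end (a minimal sketch; the file name file.csv is an assumption):
import pandas as pd

# skip_blank_lines is True by default, so the empty line disappears and the
# junk lines parse as rows that are mostly NaN (they have too few fields)
df = pd.read_csv('file.csv')

# dropping rows with any NaN removes the junk; note this would also drop
# legitimate rows that happen to have missing values
df = df.dropna(axis=0, how='any').reset_index(drop=True)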
The best way to do this using pandas native functions is a combination of arguments and function calls - a bit messy, but definitely possible!
First, call read_csv with skip_blank_lines=False, since the default is True.
df = pd.read_csv(<filepath>, skip_blank_lines=False)
Then, create a dataframe that only contains the blank rows, using the isnull or isna method. This works by locating (.loc) the indices where all values are null/blank.
blank_df = df.loc[df.isnull().all(axis=1)]
By utilizing the fact that this dataframe preserves the original indices, you can get the index of the first blank row.
Because this uses indexing, you will also want to check that there actually is a blank line in the csv. And finally, you simply slice the original dataframe in order to remove the unwanted lines.
if len(blank_df) > 0:
    first_blank_index = blank_df.index[0]
    df = df[:first_blank_index]
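Put together as one runnable sketch (the file name is an assumption):
import pandas as pd

df = pd.read_csv('file.csv', skip_blank_lines=False)

# rows where every field is null correspond to the blank lines in the file
blank_df = df.loc[df.isnull().all(axis=1)]
if len(blank_df) > 0:
    df = df[:blank_df.index[0]]   # keep only what comes before the first blank line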
If you're using the csv module, it's fairly trivial to detect an empty row.
import csv

with open(filename, newline='') as f:
    r = csv.reader(f)
    for l in r:
        if not l:
            break
        # otherwise, process data
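If you want a DataFrame at the end anyway, you can stop at the blank line yourself and hand the clean part to pandas (a sketch; the file name is an assumption):
import io
import pandas as pd

good_lines = []
with open('file.csv', newline='') as f:
    for line in f:
        if not line.strip():        # the first blank line marks the end of the real data
            break
        good_lines.append(line)

df = pd.read_csv(io.StringIO(''.join(good_lines)))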
I have a number of .xls datasheets which I am looking to clean and merge.
Each data sheet is generated by a larger system which cannot be changed.
The method that generates the data sets displays the selected parameters for the data set. (E.G 1) I am looking to automate the removal of these.
The number of rows that this takes up varies, so I am unable to blanket remove x rows from each sheet. Furthermore, the system that generates the report arbitrarily merges cells in the blank sections to the right of the information.
Currently I am attempting what feels like a very inelegant solution: I convert the file to a CSV, read it in as a string, and remove everything before the first column header.
data_xls = pd.read_excel('InputFile.xls', index_col=None)
data_xls.to_csv('Change1.csv', encoding='utf-8')
with open("Change1.csv") as f:
s = f.read() + '\n'
a=(s[s.index("Col1"):])
df = pd.DataFrame([x.split(',') for x in a.split('\n')])
This works but it seems wildly inefficient:
Multiple format conversions
Reading every line in the file when the only rows being altered occur within first ~20
The DataFrame ends up with column headers shifted over by one and must be re-aligned (less of a concern)
With some of the files being around 20 MB, merging a batch of 8 can take close to 10 minutes.
A little hacky, but here is an idea to speed up your process by doing some operations directly on your DataFrame. Given that you know your first column name is Col1, you could try something like this:
df = pd.read_excel('InputFile.xls', index_col=None)
# Find the first occurrence of "Col1"
column_row = df.index[df.iloc[:, 0] == "Col1"][0]
# Use this row as header
df.columns = df.iloc[column_row]
# Remove the columns' name (currently a useless index number)
df.columns.name = None
# Keep only the data after the (old) column row
df = df.iloc[column_row + 1:]
# And tidy it up by resetting the index
df.reset_index(drop=True, inplace=True)
This should work for any dynamic number of header rows in your Excel (xls & xlsx) files, as long as you know the title of the first column...
If you know the number of junk rows, you can skip them using skiprows:
data_xls = pd.read_excel('InputFile.xls', index_col=None, skiprows=2)
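If the number of junk rows varies, here is a sketch that finds the header row in a quick first pass and then lets read_excel use it directly (the column title Col1 and the 30-row peek window are assumptions):
import pandas as pd

# first pass: peek at the top of the sheet without assigning a header
peek = pd.read_excel('InputFile.xls', header=None, nrows=30)
header_row = peek.index[peek.iloc[:, 0] == 'Col1'][0]

# second pass: use that row as the header; everything above it is skipped
data_xls = pd.read_excel('InputFile.xls', header=header_row)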
I imported a CSV file into Python (using a pandas DataFrame) and there are some missing values in the file. In the DataFrame I have rows like the following:
08,63.40,86.21,63.12,72.78,,
I have tried everything to remove the rows containing elements like the last ones in the row above (the empty trailing fields). Nothing works. I do not know whether they count as whitespace, empty strings, or something else.
Here is what I have:
result = pandas.read_csv(file,sep='delimiter')
result[result!=',,']
This did not work. Then I did the following:
result.replace(' ', np.nan, inplace=True)
result.dropna(inplace=True)
This also did not work.
result = result.replace(r'\s+', np.nan, regex=True)
This also did not work. I still see the row containing the ,, element.
Also, my DataFrame is 100 by 1. When I import it from the CSV file, all the columns are collapsed into one. (I do not know if this helps.)
Can anyone tell me how to remove rows containing ,, elements?
Also, my DataFrame is 100 by 1. When I import it from the CSV file, all the columns are collapsed into one.
This is probably the key, and IMHO it is weird. When you import a CSV into a pandas DataFrame you normally want each field to go into its own column, precisely so you can later process the column values individually. So (still IMHO) the correct solution is to fix that.
Now, to directly answer your (probably XY) question: you do not want to remove rows containing blank or empty columns, because each row only contains one single column; you want to remove rows containing consecutive commas (,,). So you could use:
df = df.drop(df[df.iloc[:, 0].str.contains(',,')].index)
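And, as argued above, the cleaner long-term fix is to let pandas split the fields into columns in the first place; a minimal sketch (the file name is an assumption):
import pandas as pd

result = pd.read_csv('data.csv', header=None)   # the default sep=',' splits the fields

# the trailing ",," fields now parse as NaN, so the offending rows can be dropped
result = result.dropna(how='any')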
I think your code should work with a minor change:
result.replace('', np.nan, inplace=True)
result.dropna(inplace=True)
In case you have several rows in your CSV file, you can avoid the extra conversion step to NaN:
result = pandas.read_csv(file)
result = result[result.notnull().all(axis = 1)]
This will remove any row where there is an empty element.
However, your added comment explains that there is just one row in the CSV file, and it seems that the CSV reader shows some special behavior. Since you need to select the columns without NaN, I suggest these lines:
result = pandas.read_csv(file, header = None)
selected_columns = result.columns[result.notnull().any()]
result = result[selected_columns]
Note the option header = None with read_csv.
I'm trying to create a very human-readable file that will be multi-indexed. It looks like this:
A
    one : some data
    two : some other data
B
    one : foo
    three : bar
I'd like to use pandas' read_csv to automatically read this in as a multi-indexed file, with both \t and : used as delimiters, so that I can easily slice by section (i.e., A and B). I understand that something like header=[0,1] and perhaps tupleize_cols may be used to this end, but I can't get that far since it doesn't seem to want to read both the tabs and colons properly. If I use sep='[\t:]', it consumes the leading tabs. If I don't use the regexp and read with sep='\t', it gets the tabs right but doesn't handle the colons. Is this possible using read_csv? I could do it line by line, but there must be an easier way :)
This is the output I had in mind. I added labels to the indices and column, which could hopefully be applied when reading it in:
                  value
index_1 index_2
A       one       some data
        two       some other data
B       one       foo
        three     bar
EDIT: I used part of Ben.T's answer to get what I needed. I'm not in love with my solution since I'm writing to a temp file, but it does work:
with open('temp.csv', 'w') as outfile:
    for line in open(reader.filename, 'r'):
        if line[0] != '\t' or not line.strip():
            index1 = line.split('\n')[0]
        else:
            outfile.write(index1 + ':' + re.sub('[\t]+', '', line))

pd.read_csv('temp.csv', sep=':', header=None,
            names=['index_1', 'index_2', 'Value']).set_index(['index_1', 'index_2'])
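For what it's worth, the same transformation can be done in memory with io.StringIO instead of a temp file (a sketch; reader.filename is kept from the snippet above):
import io
import re
import pandas as pd

buffer = io.StringIO()
index1 = None
with open(reader.filename, 'r') as infile:
    for line in infile:
        if not line.strip():
            continue                          # ignore blank lines
        if line[0] != '\t':
            index1 = line.rstrip('\n')        # a section header such as A or B
        else:
            buffer.write(index1 + ':' + re.sub('[\t]+', '', line))

buffer.seek(0)
df = pd.read_csv(buffer, sep=':', header=None,
                 names=['index_1', 'index_2', 'Value']).set_index(['index_1', 'index_2'])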
You can use two delimiters in read_csv such as:
pd.read_csv( path_file, sep=':|\t', engine='python')
Note the engine='python' to prevent a warning.
EDIT: with your input format it seems difficult, but with input like:
A one : some data
A two : some other data
B one : foo
B three : bar
with a \t as the delimiter after A or B, you get a MultiIndex with:
pd.read_csv(path_file, sep=':|\t', header=None, engine='python',
            names=['index_1', 'index_2', 'Value']).set_index(['index_1', 'index_2'])
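Assuming the result of the call above is assigned to a variable df, slicing by section then works as asked; note that the index and value strings keep the spaces around ' : ', which you may want to strip:
df.loc['A']                              # every row of section A
df.loc['B']                              # every row of section B
df['Value'] = df['Value'].str.strip()    # clean up the spaces left by ' : '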
I have a CSV file. If the values of the 1st, 5th and 13th attributes are the same, the rows are considered duplicates, and in that case the duplicates should be removed. How do I do that in Python?
I wrote some code, but it seems to fall into an infinite loop:
import csv

rows = csv.reader(open("items4.csv", "r"))
newrows = []
i = 0
for row in rows:
    if(i==0):
        newrows.append(row)
        i=i+1
        continue
    for row1 in newrows:
        if(row[1]!=row1[1] and row[5]!=row1[5] and row[13]!=row1[13]):
            newrows.append(row)

writer = csv.writer(open("items5.csv", "w"))
writer.writerows(newrows)
I would change your logic ever so slightly, using a for-else instead of a flag, like this:
for row1 in newrows:
    if row[1]==row1[1] and row[5]==row1[5] and row[13]==row1[13]:
        break
else:
    newrows.append(row)
The problem with your initial code was that you appended the row to newrows once for every stored row that did not match it, and you did this while still iterating over newrows, so the list kept growing under the loop and the output blew up. Every comparison that satisfied row[1]!=row1[1] and row[5]!=row1[5] and row[13]!=row1[13] triggered another append.
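For context, here is a sketch of the whole script with that change folded in (file names taken from the question; the header row is kept automatically because newrows starts out empty):
import csv

with open('items4.csv', 'r', newline='') as infile:
    newrows = []
    for row in csv.reader(infile):
        for row1 in newrows:
            if row[1] == row1[1] and row[5] == row1[5] and row[13] == row1[13]:
                break                     # duplicate of a row we already kept
        else:
            newrows.append(row)           # no duplicate found, keep this row

with open('items5.csv', 'w', newline='') as outfile:
    csv.writer(outfile).writerows(newrows)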
@Clarence already gave a great answer. Just as an alternative, pandas makes these things much easier when they get more complicated.
Let's say you have the columns you want to consider in a list called col_list:
import pandas as pd

# --- About read_csv ---
# header and delimiter are two arguments to consider for read_csv
df = pd.read_csv('path/to/your/file.csv')

# --- About drop_duplicates ---
# inplace=True changes df itself rather than creating a new DataFrame.
# subset takes the labels of the columns to consider; since you call them
# through df.columns, df.columns[col_list] gives you the desired column labels.
df.drop_duplicates(subset=df.columns[col_list], inplace=True)

# --- Important reminder!!! ---
# Python indices start with 0, not 1, so the first column is denoted as 0 in col_list.

# --- Write your file back ---
df.to_csv('path/to/your/new_file.csv')
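A usage note: for the 1st, 5th and 13th attributes from the question, the zero-based list would be:
col_list = [0, 4, 12]   # 1st, 5th and 13th columns, counted from zero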
I'm having a tough time correctly loading a CSV file into a pandas DataFrame. The file is a CSV saved from MS Excel, where the rows look like this:
Montservis, s.r.o.;"2 012";"-14.98";"-34.68";"- 11.7";"0.02";"0.09";"0.16";"284.88";"10.32";"
I am using
filep="file_name.csv"
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter=";")
(I have tried several combinations and alternatives of read_csv arguments, but without any success... I have also tried read_table.)
What I want is for each semicolon-separated value to end up in its own column (I understand that read_csv is supposed to work this way?).
Unfortunately, I always end up with the whole row placed in the first column of the DataFrame. So basically after loading I have many rows, but only one column (two if I also count the index).
I have placed sample here:
datafile
Any ideas are welcome.
Add quoting=3; 3 stands for csv.QUOTE_NONE.
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter=";", quoting = 3)
This will give a [7 rows x 23 columns] DataFrame.
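If you would rather get rid of the literal quote characters that QUOTE_NONE leaves in the values, here is a small post-processing sketch (the column handling is an assumption, not part of the answer above):
import pandas as pd

raw_data = pd.read_csv(filep, engine="python", index_col=False,
                       header=None, delimiter=";", quoting=3)

# strip the stray double quotes and try to recover numeric columns
for col in raw_data.columns:
    if raw_data[col].dtype == object:
        raw_data[col] = raw_data[col].str.strip('"')
        converted = pd.to_numeric(raw_data[col], errors="coerce")
        if converted.notna().all():
            raw_data[col] = converted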
The problem is the enclosing characters, which can be dealt with by escaping the delimiter with a \ character:
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter='\;')