I imported a CSV file into Python (using a pandas DataFrame) and there are some missing values in it. In the data frame I have rows like the following:
> 08,63.40,86.21,63.12,72.78,,
I have tried everything to remove the rows containing elements like the trailing ones in the data above, but nothing works. I do not know whether those count as whitespace, empty strings, or something else.
Here is what I have:
result = pandas.read_csv(file,sep='delimiter')
result[result!=',,']
This did not work. Then I did the following:
result.replace(' ', np.nan, inplace=True)
result.dropna(inplace=True)
This also did not work.
result = result.replace(r'\s+', np.nan, regex=True)
This also did not work. I still see the row containing the ,, element.
Also, my dataframe is 100 by 1: when I import it from the CSV file, all the columns collapse into one. (I do not know if this helps.)
Can anyone tell me how to remove rows containing ,, elements?
> Also my dataframe is 100 by 1. When I import it from CSV file all the columns become 1
This is probably the key, and IMHO it is weird. When you import a CSV into a pandas DataFrame you normally want each field to go into its own column, precisely so that you can later process each column's values individually. So (still IMHO) the correct solution is to fix that.
Now, to directly answer your (probably XY problem) question: you do not want to remove rows containing blank or empty columns, because each of your rows only contains one single column, but rather rows containing consecutive commas (,,). So you should use:
df = df.drop(df[df.iloc[:, 0].str.contains(',,')].index)
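If you do fix the import instead, a minimal sketch (assuming the file is plain comma-separated with no header row; the filename is hypothetical) lets pandas split the fields itself and then drops the incomplete rows:

import pandas as pd

# Let read_csv split on commas (the default), one field per column
result = pd.read_csv('data.csv', header=None)
# The trailing ",," parses as NaN cells, so dropping rows with missing values removes those rows
result = result.dropna()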
I think your code should work with a minor change:
result.replace('', np.nan, inplace=True)
result.dropna(inplace=True)
In case you have several rows in your CSV file, you can avoid the extra conversion step to NaN:
result = pandas.read_csv(file)
result = result[result.notnull().all(axis = 1)]
This will remove any row where there is an empty element.
However, your added comment explains that there is just one row in the CSV file, and it seems that the CSV reader shows some special behavior. Since you need to select the columns without NaN, I suggest these lines:
result = pandas.read_csv(file, header = None)
selected_columns = result.columns[result.notnull().any()]
result = result[selected_columns]
Note the option header = None with read_csv.
Related
I have simple code that reads a CSV file. After that I change the names of the columns and print them. I found one weird issue: for some numeric columns it adds an extra .0. Here is my code:
v_df = pd.read_csv('csvfile', delimiter=';')
v_df = v_df.rename(columns={'Order No.': 'Order_Id'})
for index, csv_row in v_df.iterrows():
    print(csv_row.Order_Id)
Output is:
149545961155429.0
149632391661184.0
If I remove the empty row (2nd one in the above output) from the csv file, .0 does not appear in the ORDER_ID.
After doing some searching, I found that converting this column to string will solve the problem. It does work if I change the first line of the above code to:
v_df = pd.read_csv('csvfile', delimiter=';', dtype={'Order No.': 'str'})
However, the issue is that the column name 'Order No.' changes to Order_Id because of the rename, so I cannot use 'Order No.'. For this reason I tried the following:
v_df[['Order_Id']] = v_df[['Order_Id']].values.astype('str')
But unfortunately it seems that astype is not changing the datatype and .0 is still appearing. My questions are:
1. Why is .0 appearing in the first place if there is an empty row in the CSV file?
2. Why is the datatype change not happening after the rename?
My aim is just to get rid of the .0; I don't want to change the datatype if the .0 can go away using any other method.
I am trying to emulate your df here; although it has some differences, I think this will work for you:
import pandas as pd
import numpy as np
v_df = pd.DataFrame(
    [['13-Oct-22', '149545961155429.0', '149545961255429', 'Delivered'],
     ['12-Oct-22', None, None, 'delivered'],
     ['15-Oct-22', '149632391661184.0', '149632391761184', 'Delivered']],
    columns=['Transaction Date', 'Order_Id', 'Order Item No.', 'Order Item Status'])
v_df[['Order_Id']] = v_df[['Order_Id']].fillna(np.nan).values.astype('float').astype('int').astype('str')
Try it and let me know
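Since read_csv applies dtype= before the rename ever runs, another option is to read the column as a string up front and rename afterwards, so pandas never converts it to float and no trailing .0 appears. A minimal sketch, reusing the ';'-separated file from the question:

import pandas as pd

# Read 'Order No.' as text so pandas never casts it to float
v_df = pd.read_csv('csvfile', delimiter=';', dtype={'Order No.': 'str'})
# Rename afterwards; the string dtype stays with the renamed column
v_df = v_df.rename(columns={'Order No.': 'Order_Id'})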
I have a dataframe with multiple headers and column indexes, and would like to retrieve the list of entries that are non-zero. The dataframe is constructed from a .csv file provided by another party.
It's hard to include the data as it's sensitive, but I read in the data and remove NaNs to make it smaller, keeping only non-zero rows and columns.
df = pd.read_csv('Example.csv', header=[0,1,2,3], index_col=[0,1])
a = df.where(df==1).dropna(how='all').dropna(axis=1)
x = [(df[col][df[col].eq(1)].index[i], df.columns.get_loc(col)) for col in df.columns for i in range(len(df[col][df[col].eq(1)].index))]
for i in range(len(x)):
    print(x[i])
I am hoping for the output
((index col1, index col2), (header 3))
So I guess the hypothetical would be
If I listed every iteration of comic book characters under header I would have:
Brand: Marvel/DC/Etc
Hero: Spiderman/Captain America/...
Person: Parker/Riley/Morales
Then my column indexes would be the comic name, and the next column the number of that comic.
Each entry would be 1 if the character is present, and nothing otherwise in the .csv read from Excel.
I would like the output to be
((Amazing Spiderman, 1), (Parker, Spiderman))
etc.
I hope that makes sense.
I resolved this by removing the rows not being used in the query at that time. It is not an ideal solution, but it will make your version operational, though it can be fiddly if you need both (or N) headers in the output.
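If you want the label pairs directly rather than the workaround above, here is a minimal sketch (not the original approach; it assumes df is the frame from the question's read_csv call and that the cells of interest contain 1):

import numpy as np

# Positions of every cell equal to 1 (NaN cells compare as False)
rows, cols = np.where(df.eq(1).to_numpy())
# Map the positions back to the MultiIndex labels on both axes
pairs = [(df.index[r], df.columns[c]) for r, c in zip(rows, cols)]
for pair in pairs:
    print(pair)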
I have a number of .xls datasheets which I am looking to clean and merge.
Each data sheet is generated by a larger system which cannot be changed.
The method that generates the data sets displays the selected parameters for the data set. (E.G 1) I am looking to automate the removal of these.
The number of rows that this takes up varies, so I am unable to blanket remove x rows from each sheet. Furthermore, the system that generates the report arbitrarily merges cells in the blank sections to the right of the information.
Currently I am attempting what feels like a very inelegant solution where I convert the file to a CSV, read it as a string and remove everything before the first column.
data_xls = pd.read_excel('InputFile.xls', index_col=None)
data_xls.to_csv('Change1.csv', encoding='utf-8')
with open("Change1.csv") as f:
s = f.read() + '\n'
a=(s[s.index("Col1"):])
df = pd.DataFrame([x.split(',') for x in a.split('\n')])
This works but it seems wildly inefficient:
Multiple format conversions
Reading every line in the file when the only rows being altered occur within first ~20
Dataframe ends up with column headers shifted over by one and must be re-aligned (Less concern)
With some of the files being around 20mb, merging a batch of 8 can take close to 10 minutes.
A little hacky, but here is an idea to speed up your process by doing some operations directly on your dataframe. Considering you know your first column name to be Col1, you could try something like this:
df = pd.read_excel('InputFile.xls', index_col=None)
# Find the first occurrence of "Col1"
column_row = df.index[df.iloc[:, 0] == "Col1"][0]
# Use this row as header
df.columns = df.iloc[column_row]
# Remove the columns' name (currently a useless index number)
df.columns.name = None
# Keep only the data after the (old) column row
df = df.iloc[column_row + 1:]
# And tidy it up by resetting the index
df.reset_index(drop=True, inplace=True)
This should work for any dynamic number of header rows in your Excel (xls & xlsx) files, as long as you know the title of the first column...
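Building on that, here is a sketch of how the same logic could be wrapped per file and the results concatenated for the merge step (it reads with header=None rather than the default, and the filenames are hypothetical):

import pandas as pd

def load_report(path):
    # Read without a header, find the "Col1" row, and drop the preamble above it
    df = pd.read_excel(path, header=None)
    header_row = df.index[df.iloc[:, 0] == "Col1"][0]
    df.columns = df.iloc[header_row]
    df.columns.name = None
    return df.iloc[header_row + 1:].reset_index(drop=True)

merged = pd.concat([load_report(p) for p in ['Report1.xls', 'Report2.xls']], ignore_index=True)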
If you know the number of junk rows, you can skip them using skiprows:
data_xls = pd.read_excel('InputFile.xls', index_col=None, skiprows=2)
I'm using Python's csv reader to read from a CSV. One of my columns has multiple variables separated by a "|", and I would like to write them into separate columns.
For example, the 3rd column contains different modes of transportation, such as "cars|trains|airplanes|cabs", and I would like to write these to a new CSV with four separate columns. Each cell can have anywhere from 1-5 variables. Thanks for your help!
If the column doesn't need to be actually parsed like a sub-CSV (e.g., there's no quoting or escaping in it), you can just use split to split it:
col2vals = row[2].split('|')
If you're looking to replace each row[2] with the multiple resulting values, use slice assignment:
row[2:3] = row[2].split('|')
Or, if you're looking to break it into separate rows instead, just loop over the results and append/yield/newcsv.writerow/whatever you were doing with each original row.
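Putting the slice-assignment idea together with the question's goal of writing a new CSV, a minimal sketch (the input/output filenames are hypothetical):

import csv

with open('input.csv', newline='') as src, open('output.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # Expand the 3rd column's "a|b|c" value into separate columns in place
        row[2:3] = row[2].split('|')
        writer.writerow(row)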
If the column might have quoting or escaping, you can use csv again, by passing it a StringIO file-like object:
import io
import csv

col2 = row[2]
fakefile = io.StringIO(col2)
subreader = csv.reader(fakefile, delimiter='|')
col2vals = next(subreader)
You can of course put the whole thing together into a single line, but I'm not sure the result is all that readable; probably somewhere between the two extremes is best.
row[2:3] = next(csv.reader(io.StringIO(row[2]), delimiter='|'))
There is a weird .csv file, something like:
header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33
Pretty fine, but after these lines there is always a blank line followed by lots of useless lines. The whole thing is something like:
header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33

dhjsakfjkldsa
fasdfggfhjhgsdfgds
gsdgffsdgfdgsdfgs
gsdfdgsg
The number of lines at the bottom is totally random; the only marker is the empty line before them.
Pandas has a parameter "skipfooter" for ignoring a known number of rows in the footer.
Any idea how to ignore these rows without actually opening (open()...) the file and removing them?
There is no option to make read_csv stop at the first blank line. The parser isn't capable of accepting or rejecting lines based on arbitrary conditions; it can only ignore blank lines (optionally) or rows which disobey the expected shape of the data (rows with too many separators).
You can normalize the data with the approaches below (without parsing the file yourself - pure pandas):
Knowing the number of desired/trash data rows. [Manual]
pd.read_csv('file.csv', nrows=3) or pd.read_csv('file.csv', skipfooter=4)
Preserving the desired data by eliminating the others in the DataFrame. [Automatic]
df = pd.read_csv('file.csv')
df.dropna(axis=0, how='any', inplace=True)
The results will be:
header1 header2 header3
0 val11 val12 val13
1 val21 val22 val23
2 val31 val32 val33
The best way to do this using pandas native functions is a combination of arguments and function calls - a bit messy, but definitely possible!
First, call read_csv with skip_blank_lines=False, since the default is True.
df = pd.read_csv(<filepath>, skip_blank_lines=False)
Then, create a dataframe that only contains the blank rows, using the isnull or isna method. This works by locating (.loc) the indices where all values are null/blank.
blank_df = df.loc[df.isnull().all(axis=1)]
By utilizing the fact that this dataframe preserves the original indices, you can get the index of the first blank row.
Because this uses indexing, you will also want to check that there actually is a blank line in the csv. And finally, you simply slice the original dataframe in order to remove the unwanted lines.
if len(blank_df) > 0:
    first_blank_index = blank_df.index[0]
    df = df[:first_blank_index]
If you're using the csv module, it's fairly trivial to detect an empty row.
import csv

with open(filename, newline='') as f:
    r = csv.reader(f)
    for l in r:
        if not l:
            break
        # Otherwise, process the row's data
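If the end goal is still a DataFrame, a sketch that collects only the rows above the first blank line and hands them to pandas (the filename is hypothetical):

import csv
import pandas as pd

rows = []
with open('file.csv', newline='') as f:
    for fields in csv.reader(f):
        if not fields:  # stop at the first blank line
            break
        rows.append(fields)

# The first collected row is the header, the rest are data
df = pd.DataFrame(rows[1:], columns=rows[0])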