Rename multiple cells in a large CSV file using the Python csv module - python

I have a large CSV file. A lot of cells in different columns contain the value "NIL", and I need to set those cells to blank, i.e. they should end up holding an empty value.
To be clear, I don't want to rename the header or any one specific cell; I need to change every cell matching "NIL" to a blank " ".

If I understand correctly, you want to handle missing data. How you do that depends on what result you want; pandas provides several mechanisms to deal with such data cells.
You can also refer to the pandas documentation on handling missing data.
The fillna function can "fill in" NA values with non-null data in a couple of ways:
Replace NA with a scalar value:
fillna(0)
Propagate the last valid value forward (padding):
fillna(method='pad')
Beyond fillna, you can also:
Drop the rows or columns containing NA:
dropna(axis=0)
dropna(axis=1)
dropna()
Interpolate the data:
interpolate()
Replace specific values:
replace()
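For the original question (turning every "NIL" cell into a blank), replace() is the most direct fit. A minimal sketch, assuming a pandas-based approach; the file names data.csv and data_clean.csv are placeholders:

import pandas as pd

# Placeholder file names; adjust to your actual paths.
df = pd.read_csv("data.csv")

# Replace every cell whose value is exactly "NIL" with an empty string.
df = df.replace("NIL", "")

df.to_csv("data_clean.csv", index=False)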

Related

Splitting column of dataframe based on text characters in cells

I imported a .csv file with a single column of data into a dataframe that I am trying to clean up by splitting the column based on various string occurrences within the cells. I've tried numerous means to split the column, but can't seem to get it to work. My latest attempt was using the following:
df.loc[:,'DataCol'] = df.DataCol.str.split(pat=':\n',expand=True)
df
The result is a dataframe that is still one column and completely unchanged. What am I doing wrong? This is my first time doing anything like this so please forgive the simple question.
df.loc can hand you a copy of the column you've selected - try replacing the expression below with df['DataCol'], which references the actual column in the original dataframe.
df.loc[:,'DataCol']
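As a concrete sketch of that suggestion (the sample data and the new column names below are made up): with expand=True the split comes back as a DataFrame of parts, so you either assign those parts to new columns or drop expand=True to keep the pieces as a list in a single column.

import pandas as pd

# Hypothetical sample data standing in for the single-column CSV.
df = pd.DataFrame({'DataCol': ['name:\nAlice', 'name:\nBob']})

# Keep the split pieces as a list in a column...
df['Split'] = df['DataCol'].str.split(pat=':\n')

# ...or expand the pieces into separate (hypothetical) columns.
df[['label', 'value']] = df['DataCol'].str.split(pat=':\n', expand=True)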

How to deal with NaN values in pandas (from csv file)?

I have a fairly large csv file filled with data obtained from a machine for material testing (compression test). The headers of the data are Time, Force, Stroke and they are repeated 10 times because of the sample size, so the last set of headers is Time.10, Force.10, Stroke.10.
Because of the nature of the experiment not all columns are equally long (some are approx. 2000 rows longer than others). When I import the file into my IDE (Spyder or Jupyter) using pandas, all the cells within the rows that are empty in the csv file are labeled as NaN.
The problem is that I can't do any mathematical operations within or between columns that have NaN values, as they are treated as str. I have tried the most recommended solutions on pretty much all forums: .fillna(), dropna(), replace() and interpolate(). The mentioned methods work, but only visually: e.g. df.fillna(0) replaces all NaN values with 0, but when I then try to e.g. find the max value in the column, I still get an error saying that there are strings in my column (TypeError: '>' not supported between instances of 'float' and 'str').
The problem is caused 100% by the NaN values that result from the empty cells in the csv file, as I have imported a csv file in which all columns were the same length (with no empty cells) and there were no problems. If anyone has any solution to this problem (it doesn't need to be within pandas, just within Python) that I've been stuck on for over 2 weeks, I would be grateful.
Try read_csv() with na_filter=False.
This should at least prevent the "empty" source cells from being set to NaN.
But note that:
such "empty" cells will then contain an empty string,
the type of each column containing at least one such cell will be object (not a number),
so, for the time being, those columns can't take part in any numeric operations.
So probably, after read_csv(), you should:
replace such empty strings with e.g. 0 (or whatever numeric value),
call to_numeric(...) to change the type of each affected column from object to whatever numeric type is appropriate in each case.
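A minimal sketch of those two steps, assuming a placeholder file name test.csv and that all non-empty cells are numeric:

import pandas as pd

# Placeholder file name; the real file has repeated Time/Force/Stroke column groups.
df = pd.read_csv("test.csv", na_filter=False)

# The "empty" cells are now empty strings; turn them into a numeric placeholder...
df = df.replace("", 0)

# ...and convert each column from object to a numeric dtype
# (this assumes the remaining cells are all numeric).
df = df.apply(pd.to_numeric)

print(df["Force"].max())  # numeric operations work again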

Check the missing values in an Excel table

I am working on my data visualization assignment. First, I have to check the dataset I found and do the data wrangling, if necessary. The data consists of several particle indices for air quality in Madrid, collected by different stations.
I found some values are missing in the table. How can I quickly check for those missing values with a tool (Python, R, or Tableau) and replace them?
In Python, you can use the pandas module to load the Excel file as a DataFrame. Post this, it is easy to substitute the NaN/missing values.
Let's say your excel is named madrid_air.xlsx
import pandas as pd
df = pd.read_excel('madrid_air.xlsx')
Post this, you will have what is called a DataFrame, which holds the data from the excel file in the same tabular format with column names and an index. In the DataFrame the missing values will be loaded as NaN values. So in order to get the rows which contain NaN values:
df_nan = df[df.isna().any(axis=1)]
df_nan will contain the rows which have NaN values in them.
Now if you want to fill all those NaN values with let's say 0.
df_zerofill = df.fillna(0)
df_zerofill will have the whole DataFrame with all the NaNs substituted with 0.
In order to fill specific columns, use the column names.
df[['NO','NO_2']] = df[['NO','NO_2']].fillna(0)
This will fill the NO and NO_2 columns' missing values with 0.
To read up more about DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
To read up more about handling missing data in DataFrames : https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
There are several Python libraries for processing excel spreadsheets. My favorite one is openpyxl. It loads the spreadsheet as a workbook object in which you can address a specific cell by its coordinates, and it conveniently recognizes the labels of rows and columns. Of course you can also update your tables with it. But be careful: if your code misbehaves, your xlsx files might get permanently damaged.
Edit1:
import openpyxl

wb = openpyxl.load_workbook('filename.xlsx')
# take the first worksheet in the workbook
ws = wb[wb.sheetnames[0]]
# walk over columns G to I and replace empty cells with 0
for row in ws.iter_rows(min_row=ws.min_row, max_row=ws.max_row,
                        min_col=7, max_col=9):
    for cell in row:
        if cell.value is None:
            cell.value = 0
Well, in Tableau you can create a worksheet, drag and drop the lowest level of granularity from the dimension table (blue pill) and put the columns (as measures) into the same chart.
If your table is truly atomic, then you will get a note at the bottom right of the worksheet telling you about the null values. Clicking on it allows you to clear or replace these specific values in the data of the workbook.
Just to clarify, it's not the "high-end", coding way, but it is the simplest one.
PS: You can also check for missing values in Tableau's data input window by filtering the columns by "null" values.
PS2: If you want to do it dynamically, then you will need to use formulas like:
IF ISNULL([Measure1])
THEN [Measure2] // or another formula
ELSE NULL
END

Import table to DataFrame and set a group of columns as a list

I have a table (Tab delimited .txt file) in the following form:
each row is an entry;
the first row contains the headers;
the first 5 columns are simple numeric parameters;
all columns after the 7th column are supposed to be a list of values.
My problem is: how can I import this and create a DataFrame where the last column contains a list of values?
-----Problem 1 ----
The header (first row) is "shorter", containing simply the names of some columns. All the values after the 7th do not have a header (because they are supposed to be a list). If I import the file as is, this appears to confuse the import functions.
If, for example, I import as follows,
df = pd.read_table( path , sep="\t")
the DataFrame created has only as many columns as there are elements in the first row. Moreover, the data values assigned are mismatched.
---- Problem 2 -----
What is really confusing to me is that if I open the .txt in Excel and save it as Tab-delimited (without changing anything), I can then import it without problems, with headers too: columns with no header are simply given an “Unnamed XYZ” tag.
Why would saving in Excel change it? Using Notepad++ I can see only one difference: the original .txt uses "Unix (LF)" line endings, while the one saved from Excel uses "Windows (CR LF)". Both are UTF-8, so I do not understand how this would be an issue.
Nevertheless, from here I could manipulate the data and try to gather all columns I wish and make them into a list. However, I hope that there is a more elegant and faster way to do it.
Here is a screen-shot of the .txt file
Thank you,
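One possible sketch, under the assumptions stated in the question; the file path, the count of fixed columns, and the new column name "values" are placeholders, not a definitive recipe. The idea is to ignore the short header row, size the frame from the widest data row, and then gather the trailing columns into a list per row.

import pandas as pd

path = "data.txt"   # placeholder path
n_fixed = 7         # number of leading scalar columns, per the question

# First pass: find the widest row so read_csv does not choke on the short header row.
with open(path) as fh:
    max_cols = max(len(line.rstrip("\n").split("\t")) for line in fh)

# Read without using the short first row as the header.
raw = pd.read_csv(path, sep="\t", header=None, skiprows=1, names=range(max_cols))

# Keep the fixed columns, then gather everything after them into one list per row,
# dropping the NaN padding that shorter rows receive.
df = raw.iloc[:, :n_fixed].copy()
df["values"] = raw.iloc[:, n_fixed:].apply(lambda r: r.dropna().tolist(), axis=1)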

Pandas: reading excel column values, but stopping when no more values are present in that column

I want to import some values from an excel sheet with Pandas.
When I read values with Pandas, I would like to read column by column, but stop reading values when the rows of each column are empty.
Since in my excel file different columns have different numbers of rows, what I am getting now are arrays with some numbers, but then filled up with "nan" values until they reach the maximum length (i.e., the number of rows of the excel column having the greatest number of rows).
I hope the explanation was not too confusing.
The code snippet is not a great example and is not reproducible, but hopefully it will help in understanding what I am trying to do.
In the second part of the snippet (below #Trying to remove nan) I was trying to remove the "nan" values after having already imported them, but that was not working either; I was getting this error:
ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
The same happened with np.isfinite
import numpy as np
import pandas

df = pandas.read_excel(file_name)
for i in range(number_of_average_points):
    # Reading column values (includes nan)
    force_excel_col = df[df.columns[index_force]].values[13:]
    acceleration1_excel_col = df[df.columns[index_acceleration1]].values[13:]
    acceleration2_excel_col = df[df.columns[index_acceleration2]].values[13:]
    # Trying to remove nan
    force.append(force_excel_col[np.logical_not(np.isnan(force_excel_col))])
    acceleration1.append(acceleration1_excel_col[np.isfinite(acceleration1_excel_col)])
    acceleration2.append(acceleration2_excel_col[np.isfinite(acceleration2_excel_col)])
This might be doable, but it is inefficient and bad practice. Having NaN data in a dataframe is a regular part of any data analysis in Pandas (and in general).
I'd encourage you to read in the entire excel file instead. Then, to get rid of all the NaNs, you can either replace them (with 0s, for example) using Pandas' builtin fillna() method, or drop all rows from your dataframe that contain NaN values with dropna(). Happy to expand on this if you are interested.
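A small sketch of that suggestion; the file name measurements.xlsx and the "Force" column label are placeholders:

import pandas as pd

# Placeholder file name; the real sheet has repeated measurement columns.
df = pd.read_excel("measurements.xlsx")

# Either replace the NaN padding with a value...
filled = df.fillna(0)

# ...or, per column, keep only the rows that actually contain data.
force = df["Force"].dropna().to_numpy()
print(force.max())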
