I am working on my data visualization assignment. First, I have to check the dataset I found and do the data wrangling, if necessary. The data consist of several particle/pollutant indices for air quality in Madrid, collected by different stations.
I found that some values are missing in the table. How can I quickly check for those missing values with a tool (Python, R or Tableau) and replace them?
In Python, you can use the pandas library to load the Excel file as a DataFrame. After that, it is easy to substitute the NaN/missing values.
Let's say your Excel file is named madrid_air.xlsx:
import pandas as pd
df = pd.read_excel('madrid_air.xlsx')
After this you will have a DataFrame, which holds the data from the Excel file in the same tabular format, with column names and an index. In the DataFrame the missing values are loaded as NaN values. So, to get the rows which contain NaN values:
df_nan = df[df.isna().any(axis=1)]
df_nan will contain the rows that have at least one NaN value in them.
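If you just want a quick overview of how many values are missing in each column, you can also count them (using the df loaded above):
df.isna().sum()        # number of missing values in each column
df.isna().sum().sum()  # total number of missing cells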
Now if you want to fill all those NaN values with let's say 0.
df_zerofill = df.fillna(0)
df_zerofill will have the whole DataFrame with all the NaNs substituted with 0.
In order to fill specific columns, use the column names.
df[['NO','NO_2']] = df[['NO','NO_2']].fillna(0)
This will fill the NO and NO_2 columns' missing values with 0.
To read up more about DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
To read up more about handling missing data in DataFrames : https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
There are several Python libraries for processing Excel spreadsheets. My favorite is openpyxl. It loads a spreadsheet as a workbook object in which you can address a specific cell by its coordinates. What comes in quite handy is that it also recognizes the labels of rows and columns. Of course you can also update your tables
with it. But be careful: if your code is buggy, your xlsx files might get permanently damaged.
Edit1:
import openpyxl
wb = openpyxl.load_workbook('filename.xlsx')
# if your worksheet is the first one in the workbook
ws = wb[wb.sheetnames[0]]

# walk over columns G to I and replace empty cells with 0
for row in ws.iter_rows(min_row=ws.min_row, max_row=ws.max_row, min_col=7, max_col=9):
    for cell in row:
        if cell.value is None:
            cell.value = 0

wb.save('filename.xlsx')
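As mentioned above, individual cells can also be addressed by their coordinates or by row and column numbers, for example (using the ws from the snippet above):
print(ws['G2'].value)                   # Excel-style coordinate
print(ws.cell(row=2, column=7).value)   # row/column numbers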
Well, in Tableau you can create a worksheet, drag and drop the lowest level of granularity from the dimension table (the blue pill) into it, and put the columns (as measures) in the same chart.
If your table is truly atomic, you will get an indicator at the bottom right of the worksheet telling you about the null values. Clicking on it allows you to filter out or replace these specific values in the data of the workbook.
Just to clarify, this is not the "high-end" coding way, but it is the simplest one.
PS: You can also check for missing values in Tableau's data input window by filtering the columns for "null" values.
PS2: If you want to change it dynamically, you will need to use formulas like:
IF ISNULL([Measure1])
THEN [Measure2] // or another formula
ELSE NULL
END
I have a few small data frames that I'm outputting to Excel on one sheet. To make them fit better, I need to merge some cells in one table, but to write this with XlsxWriter I need to specify the data parameter. I want to keep the data that has already been written to the left cell by the to_excel() call. Is there a way to do this without having to specify the data parameter, or do I need to look up the value in the dataframe to put in there?
For example:
df.to_excel(writer, 'sheet') gives output similar to the following:
Then I want to merge across C:D for this table without having to specify what data should be there (because it is already in column C), using something like:
worksheet.merge_range('C1:D1', cell_format = fmat) etc.
to get below:
Is this possible? Or will I need to lookup the values in the dataframe?
You will need to look up the data from the dataframe. There is no way in XlsxWriter to write formatting on top of existing data; the data and the formatting need to be written at the same time (apart from conditional formatting, which can't be used for merging anyway).
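For illustration, that lookup could look roughly like this. The frame, file name, format and the C/D merge position below are made up; the point is only that the value merge_range needs is fetched from the dataframe rather than retyped:
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2], 'C': ['label spanning C:D'], 'D': ['']})

with pd.ExcelWriter('output.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='sheet', index=False)
    workbook = writer.book
    worksheet = writer.sheets['sheet']
    fmat = workbook.add_format({'align': 'center'})

    # merge_range always rewrites the cell, so look the value up in the
    # dataframe and pass it back in together with the format
    value = df.iloc[0, df.columns.get_loc('C')]
    worksheet.merge_range('C2:D2', value, fmat)  # row 2 because row 1 holds the headers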
I am trying to extract a table from a PDF document with the Python package pdfplumber. The table has four columns and multiple rows. The first row contains the headers, the second row has only one merged cell, and after that the values are stored normally (example).
pdfplumber was able to retrieve the table, but it made six columns out of four and did not assign the values to the right columns.
Table as shown in PDF document
I tried to use various table settings, including "vertical strategy": "lines", but this yields the same result.
# Python 2.7.16
import pandas as pd
import pdfplumber

path = 'file_path'
pdf = pdfplumber.open(path)
first_page = pdf.pages[7]  # the page that contains the table (0-based index)
df5 = pd.DataFrame(first_page.extract_table())
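For reference, the table-settings attempt mentioned above looked roughly like this (the exact settings dict is my reconstruction, not the original code):
df6 = pd.DataFrame(first_page.extract_table(table_settings={"vertical_strategy": "lines"}))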
Either way, I am getting six columns instead of four, with values in the wrong columns.
Output example:
Table as output in jupyter notebooks
I would be happy to hear any suggestions or solutions.
Did you get an answer? I want to replace the \n that appears in the text of a column.
This is not exactly what you're looking for, but you could load the output into a dataframe and iterate over it, using the non-null values in the first row as the column names of another dataframe. After that it is easy: you can collate all the data between two header columns of the output dataframe and insert it into the new dataframe, effectively merging those cells.
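A rough sketch of that idea, where raw stands in for pd.DataFrame(first_page.extract_table()) and the sample values are invented:
import pandas as pd

raw = pd.DataFrame([
    ['Name', None, 'Unit', None, 'Value', 'Comment'],  # only some header cells are real
    ['a', '1', 'kg', None, '2.5', 'ok'],
])

header = raw.iloc[0]
anchors = [i for i, v in enumerate(header) if v is not None]  # positions of real headers

rows = []
for _, r in raw.iloc[1:].iterrows():
    merged = {}
    for j, start in enumerate(anchors):
        stop = anchors[j + 1] if j + 1 < len(anchors) else len(header)
        # collate everything between two header columns into one value
        parts = [str(x) for x in r.iloc[start:stop] if x not in (None, '')]
        merged[header.iloc[start]] = ' '.join(parts)
    rows.append(merged)

df = pd.DataFrame(rows)  # four columns again: Name, Unit, Value, Comment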
I have a table (Tab delimited .txt file) in the following form:
each row is an entry;
the first row contains the headers;
the first 5 columns are simple numeric parameters;
all columns after the 7th column are supposed to form a list of values.
My problem is: how can I import this and create a data frame where the last column contains a list of values?
---- Problem 1 ----
The header (first row) is "shorter", containing simply the names of some columns. All the values after the 7th do not have a header (because they are supposed to be a list). If I import the file as is, this appears to confuse the import functions.
If, for example, I import as follow
df = pd.read_table( path , sep="\t")
the DataFrame created has only as many columns as there are elements in the first row. Moreover, the data values assigned are mismatched.
---- Problem 2 ----
What is really confusing to me is that if I open the .txt in Excel and save it as Tab-delimited (without changing anything), I can then import it without problems, with headers too: columns with no header are simply given an “Unnamed XYZ” tag.
Why would saving in Excel change it? Using Notepad++ I can see only one difference: the original .txt has "Unix (LF)" line endings, while the one saved in Excel has "Windows (CR LF)". Both are UTF-8, so I do not understand how this could be an issue.
Nevertheless, from here I could manipulate the data and try to gather all columns I wish and make them into a list. However, I hope that there is a more elegant and faster way to do it.
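For illustration, the kind of gathering I mean would look roughly like the sketch below; the file name is a placeholder, the 7-column cut-off comes from the description above, and the rest is only a guess at an approach:
import pandas as pd

path = 'data.txt'  # placeholder path

# find the widest row so pandas gets enough column names
with open(path) as fh:
    width = max(len(line.rstrip('\n').split('\t')) for line in fh)

# skip the short header row and name the columns 0..width-1 ourselves
df = pd.read_table(path, sep='\t', header=None, skiprows=1, names=range(width))

# keep the first 7 columns as-is and fold the rest into one list per row
values = df.iloc[:, 7:].apply(lambda r: r.dropna().tolist(), axis=1)
df = df.iloc[:, :7].assign(values=values)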
Here is a screen-shot of the .txt file
Thank you,
I have been using pandas but am open to all suggestions. I'm not an expert at scripting and am at a complete loss. My goal is the following:
Merge multiple CSV files. I was able to do this in pandas and have a dataframe with the merged dataset.
Screenshot of how merged dataset looks like
Delete the duplicated "GEO" columns after the first set. This part doesn't let me use df = df.loc[:,~df.columns.duplicated()] because they are not technically duplicated: the repeated column names end with .1, .2, etc., which I am guessing the concat adds. The other problem is that some columns have a duplicated column name but hold different datasets. I have been using the first row as the index since it always contains the same coded values, but this row is unnecessary and will be deleted later in the script. This is my biggest problem right now.
Delete certain columns, such as the ones with the "Margins". I use ~df2.columns.str.startswith for this and have no trouble with it.
Replace spaces, ":" and ";" with underscores in the first row. I have no clue how to do this.
Insert a new column, write the '=TEXT(B1,0)' formula, apply it to the whole column (the formula changing to B2, B3, etc.), then copy the column and paste it as values. I was able to do this in openpyxl, although I was having trouble and was not able to check the final output because of Excel trouble.
import win32com.client as win32
from win32com.client import constants
excel = win32.gencache.EnsureDispatch('Excel.Application')  # running Excel COM instance
source = excel.Workbooks.Open(filename)
excel.Range("C1:C1337").Select()
excel.Selection.Copy()
excel.Selection.PasteSpecial(Paste=constants.xlPasteValues)
I'm not sure if it works, and I was wondering whether it is possible in pandas or win32com, or whether I should stay with openpyxl. Thanks all!
I have a large CSV file. There are a lot of cells with the value "NIL" in different columns. I need to set those values to blank, i.e. those cells should contain a blank space.
To repeat: I don't want to rename the header or any specific cell. I want to change every cell matching "NIL" to " ".
If I am correct, you want to handle missing data, and it depends on how you want to handle it. pandas provides mechanisms to deal with such cells.
You can refer to the pandas documentation on handling missing data.
The fillna function can “fill in” NA values with non-null data in various ways, and there are related methods for dropping, interpolating and replacing values:
Replace NA with a scalar value:
fillna(0)
Pad (forward-fill) the values:
fillna(method='pad')
Drop the row or column:
dropna(axis=0)
dropna(axis=1)
dropna()
Interpolate the data:
interpolate()
Replace values:
replace()
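A small sketch of these options on a made-up frame (the column names are invented; for the "NIL" cells from your question, replace() is the relevant one):
import pandas as pd
import numpy as np

df = pd.DataFrame({'NO': [1.0, np.nan, 3.0], 'NO_2': [np.nan, 5.0, 6.0]})

df.fillna(0)               # replace NA with a scalar
df.fillna(method='pad')    # pad (forward-fill) the previous value
df.dropna(axis=0)          # drop rows containing NA
df.dropna(axis=1)          # drop columns containing NA
df.interpolate()           # interpolate between the known values
df.replace('NIL', ' ')     # replace literal values, e.g. "NIL" with a blank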