Pandas read excel file and fill missing values - python

I have imported this excel file into Pandas as follows:
xlsnist = pd.ExcelFile(path+'framework-for-improving-critical-infrastructure-cybersecurity-core.xlsx')
df3 = pd.read_excel(xlsnist, "CSF Core")
The screenshot below shows that this file has merged cells. I want to fill the empty rows with the relevant values for Function, Category, Subcategory. For example the NaN cells of Function should have "IDENTIFY (ID)" until it changes to "PROTECT (PR)" at row 82. I want to do this for all columns so there are no "NaN" values but I'm not sure how to do this.

You can try:
import pandas as pd
file = 'framework-for-improving-critical-infrastructure-cybersecurity-core.xlsx'
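# read the same sheet as in the question
df = pd.read_excel(file, sheet_name='CSF Core')
# forward-fill each NaN with the last value above it; ffill() returns a new DataFrame
df = df.ffill()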
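A minimal sketch of what forward filling does, using a small hand-made DataFrame instead of the real workbook. The Category and Subcategory values and the names merged_cols and filled are made up for illustration; only the two Function labels come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a sheet with merged cells: only the first row of
# each merged block carries a value, the remaining rows are read as NaN.
df = pd.DataFrame({
    "Function": ["IDENTIFY (ID)", np.nan, np.nan, "PROTECT (PR)", np.nan],
    "Category": ["Category A", np.nan, "Category B", "Category C", np.nan],
    "Subcategory": ["S1", "S2", "S3", "S4", "S5"],
})

# Forward-fill only the columns that came from merged cells, so a genuinely
# empty cell in another column is not silently overwritten.
merged_cols = ["Function", "Category"]
filled = df.copy()
filled[merged_cols] = filled[merged_cols].ffill()

print(filled)
```

Restricting ffill() to the merged columns is a design choice; calling df.ffill() on the whole frame, as in the answer, is fine when every NaN is the result of a merged cell.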

Related

Reset labels in Pandas DataFrame, Python

I have a CSV file whose first row contains wrong data. The real column names are in row 2, so when I load the file into a DataFrame the labels are incorrect and the correct names end up as the values of row 0. Is there a function similar to reset_index() but for columns? PS: I cannot change the CSV file.
Hello, let's suppose your CSV file is data.csv.
Try this code:
import pandas as pd
# read the CSV file (the wrong first row becomes the header)
df = pd.read_csv('data.csv')
# replace the wrong header names with integers
df.columns = range(df.shape[1])
# save the data to another CSV file without writing the header
df.to_csv('data_without_header.csv', header=False, index=False)
# read the new CSV file; its first line is now the row with the correct names
new_df = pd.read_csv('data_without_header.csv')
# show the first rows of the new data
new_df.head()
If you do not care about the rows preceding your column names, you can pass the "header" argument with the position of the correct row. The position is zero-indexed, so if the proper column names are in the row at index 2 (the third line of the file):
df = pd.read_csv('my_csv.csv', header=2)
Keep in mind that this discards the rows above the header. If you still want to keep them, read the file without a header and then assign the column names from that row:
df = pd.read_csv('my_csv.csv', header=None)
df.columns = df.iloc[2, :] # replace columns with values in row 2
Cheers.
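For the case described in the question (names on the second line of the file), a small sketch with in-memory CSV text. The file contents and the names raw, full and promoted are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents: line 1 is junk, line 2 holds the real column names.
raw = "x,y,z\nname,age,city\nAnn,31,Oslo\nBob,27,Rome\n"

# header is zero-indexed, so header=1 means "use the second line as the header".
# Everything above that line is dropped.
df = pd.read_csv(io.StringIO(raw), header=1)

# Alternative: load every line as data, then promote one row to column names.
full = pd.read_csv(io.StringIO(raw), header=None)
promoted = full.iloc[2:].reset_index(drop=True)
promoted.columns = full.iloc[1].tolist()

print(df)
print(promoted)
```

Note that the second approach leaves all values as strings, because the column initially mixed names and data; the header=1 route lets pandas infer numeric types.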

Read Excel file with blank cells as Pandas dataframe with multiindex

Suppose there is an Excel file whose first column has blank cells.
Is there a way to read it directly as a Pandas dataframe with multiindex, without filling blank spaces in the first column?
Read the file:
df = pd.read_excel('test.xlsx')
Forward-fill the first index column with .ffill():
df['i0'] = df['i0'].ffill()
Then build the multiindex with set_index():
df = df.set_index(['i0', 'i1'])
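The same two steps can be tried without an Excel file. This is a sketch on a hand-made DataFrame; the column names i0 and i1 come from the answer, while the value column and the data are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame as it would look right after read_excel: the outer level
# "i0" is only present on the first row of each group.
df = pd.DataFrame({
    "i0": ["a", np.nan, "b", np.nan],
    "i1": ["x", "y", "x", "y"],
    "value": [1, 2, 3, 4],
})

# Fill the blanks in the outer level, then promote both columns to the index.
df["i0"] = df["i0"].ffill()
df = df.set_index(["i0", "i1"])

# Rows can now be addressed by (outer, inner) tuples.
print(df.loc[("b", "y"), "value"])
```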

Pandas dataframe read_excel does not consider blank upper left cells as columns?

I'm trying to read an Excel or CSV file into pandas dataframe. The file will read the first two columns only, and the top row of the first two columns will be the column names. The problem is when I have the first column of the top row empty in the Excel file.
IDs
2/26/2010 2
3/31/2010 4
4/31/2010 2
5/31/2010 2
Then, the last line of the following code fails:
uploaded_file = request.FILES['file-name']
if uploaded_file.name.endswith('.csv'):
    df = pd.read_csv(uploaded_file, usecols=[0,1])
else:
    df = pd.read_excel(uploaded_file, usecols=[0,1])
ref_date = 'ref_date'
regime_tag = 'regime_tag'
df.columns = [ref_date, regime_tag]
Apparently, read_excel only reads one column (i.e. the IDs). With read_csv, however, it reads both columns, with the first column being unnamed. I want read_excel to behave the same way and read both columns regardless of whether the top cells are empty or filled. How do I go about doing that?
What's happening is the first "column" in the Excel file is being read in as an index, while in the CSV file it's being treated as a column / series.
I recommend you work the other way and amend pd.read_csv to read the first column as an index. Then use reset_index to elevate the index to a series:
if uploaded_file.name.endswith('.csv'):
    df = pd.read_csv(uploaded_file, usecols=[0,1], index_col=0)
else:
    df = pd.read_excel(uploaded_file, header=[0,1], usecols=[0,1])
df = df.reset_index() # this will elevate index to a column called 'index'
This will give consistent output, i.e. first series will have label 'index' and the index of the dataframe will be the regular pd.RangeIndex.
You could potentially use a dispatcher to get rid of the unwieldy if / else construct:
file_flag = {True: pd.read_csv, False: pd.read_excel}
read_func = file_flag[uploaded_file.name.endswith('.csv')]
df = read_func(uploaded_file, usecols=[0,1], index_col=0).reset_index()
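A self-contained sketch of the same idea, wrapped in a function so it can be tried without a web request. The helper name read_two_columns, the file name upload.csv and the CSV text are assumptions for illustration; only the CSV branch is exercised here, and the Excel branch would additionally need an Excel engine installed.

```python
import io
import pandas as pd


def read_two_columns(file_obj, filename):
    """Hypothetical helper: pick the reader by extension, then normalise columns."""
    reader = pd.read_csv if filename.endswith(".csv") else pd.read_excel
    # Read the first column as the index, then move it back into a regular column
    # so both readers produce the same two-column layout.
    df = reader(file_obj, usecols=[0, 1], index_col=0).reset_index()
    df.columns = ["ref_date", "regime_tag"]
    return df


# Sample data from the question: the top-left header cell is empty.
csv_text = ",IDs\n2/26/2010,2\n3/31/2010,4\n4/31/2010,2\n5/31/2010,2\n"
df = read_two_columns(io.StringIO(csv_text), "upload.csv")
print(df)
```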

To Re arrange the columns of dataframe from csv and add format to empty cells

I need to read a CSV file in Python, rearrange its columns, and build a new dataframe from the rearranged columns.
I tried using a list, but it might be slow.
Is there an alternative using numpy or pandas?
Edit:
I am rearranging the columns using df.reindex()
I am currently doing this and then exporting the df, leaving the top 4 rows blank:
df_reindexed.to_excel(writer, sheet_name='Sheet1',startrow=4, index=False)
I need to add formatting and text to the cells in those top 4 rows, corresponding to the column names in the rows below.
I know I can use iloc, but is there any way to select a cell above a cell with a specified name?
import pandas as pd
# read a CSV with pandas
src = "your/path"
old_df = pd.read_csv(src, sep=",")
# the columns that you want
desired_cols = ['c1','c2']
# pandas will return a new df only with the columns that you want
new_df = old_df[desired_cols]
Another way to do it is:
desired_cols = ['c1', 'c2', 'c3']
# reindex returns a new df with the columns in the requested order
df_final = old_df.reindex(columns=desired_cols)
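The two approaches differ when a requested column does not exist. A short sketch with made-up data (the column c4 and the names selected and reindexed are assumptions):

```python
import pandas as pd

# Hypothetical input with three columns.
old_df = pd.DataFrame({"c1": [1, 2], "c2": [3, 4], "c4": [5, 6]})

# Selecting with a list reorders the columns; every name must already exist,
# otherwise pandas raises a KeyError.
selected = old_df[["c2", "c1"]]

# reindex also reorders, but a name that is missing from the input ("c3" here)
# is added as a column of NaN instead of raising.
reindexed = old_df.reindex(columns=["c2", "c1", "c3"])

print(selected)
print(reindexed)
```

reindex is convenient when the output layout is fixed and some input files may lack columns; list selection is the stricter choice when a missing column should be treated as an error.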

Not getting back the column names after reading into an xlsx file

Hello, I have xlsx files and merged them into one dataframe using pandas. It worked, but instead of getting back the column names from the xlsx files I got numbers as columns, and the column titles became a row, like this:
Output: 1 2 3
COLTITLE1 COLTITLE2 COLTITLE3
When they should be like this:
Output: COLTITLE1 COLTITLE2 COLTITLE3
How can I get back the proper column names that I had in the xlsx files? For clarity, the column names are the same in both xlsx files. Help would be appreciated; here's my code:
# import modules
from IPython.display import display
import pandas as pd
import numpy as np
pd.set_option("display.max_rows", 999)
pd.set_option('max_colwidth',100)
%matplotlib inline
# filenames
file_names = ["data/OrderReport.xlsx", "data/OrderReport2.xlsx"]
# read them in
excels = [pd.ExcelFile(name) for name in file_names]
# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in excels]
# concatenate them
atlantic_data = pd.concat(frames)
# write it out
atlantic_data.to_excel("c.xlsx", header=False, index=False)
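I hope I understood your question correctly. The cause is header=None, not index_col=None (which is the default and has no effect here). Drop header=None and pandas will use the first row of each sheet as the column names:
frames = [x.parse(x.sheet_names[0]) for x in excels]
With header=None, pandas treats the title row as an ordinary row of data and labels the columns with integers. If you also want the names written to the output file, remove header=False from the to_excel call.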
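The effect of header=None can be reproduced without the workbooks. This sketch uses in-memory CSV text as a stand-in for the two sheets, on the assumption that the header argument behaves the same way in read_csv as in ExcelFile.parse; the data values and the names sheet1, sheet2, no_header and with_header are made up.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the first sheet of each workbook.
sheet1 = "COLTITLE1,COLTITLE2,COLTITLE3\n1,2,3\n"
sheet2 = "COLTITLE1,COLTITLE2,COLTITLE3\n4,5,6\n"

# header=None: the title row is kept as data and the columns are numbered.
no_header = pd.concat(
    [pd.read_csv(io.StringIO(s), header=None) for s in (sheet1, sheet2)]
)

# Default header=0: the first row becomes the column names.
with_header = pd.concat(
    [pd.read_csv(io.StringIO(s)) for s in (sheet1, sheet2)],
    ignore_index=True,
)

print(no_header)
print(with_header)
```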
