Python pandas drop issue on a date header

I have an Excel document that has three rows ahead of the main header (the column names).
(screenshot of the Excel document)
When loading the data into a pandas DataFrame using:
import pandas
df = pandas.read_excel('output/tracker.xlsx')
print(df)
I get this data (which is fine):
Date/Time:13/06/2022 Unnamed: 1 Unnamed: 2 Unnamed: 3
0 NaN NaN NaN NaN
1 NaN 2763 2763 NaN
2 NaN Site ID Company Site ID Region
3 203990318_700670803 203990318 689179 Nord-Ost
I do not need the first three rows, so I run:
df = df.iloc[2:]
It removes the rows with index 0 and 1, but it doesn't remove the Date/Time:13/06/2022 Unnamed: 1 etc. row at the top.
How do I remove that top row?

Rather than dropping rows after loading, directly load the data without the useless rows using the skiprows parameter of pandas.read_excel:
df = pandas.read_excel('output/tracker.xlsx', skiprows=3)
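Equivalently, you can point the header parameter at the row that holds the real column names; a minimal sketch, assuming they sit in the fourth row of the sheet (0-indexed row 3):
df = pandas.read_excel('output/tracker.xlsx', header=3)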

pandas.read_excel by default assumes the 1st row is the header, i.e. that it holds the column names. Looking at the snippet of your data, that is not the case, so use header=None to tell pandas that the first row is not column names but data:
import pandas
df = pandas.read_excel('output/tracker.xlsx',header=None)
print(df)
Then you should be able to remove those rows as you already did.
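If you then want the real column names as the header, a minimal sketch, assuming they sit in the fourth row of the sheet (index 3 once header=None is used):
df.columns = df.iloc[3]  # promote that row to the header
df = df.iloc[4:].reset_index(drop=True)  # keep only the data below it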

Related

How to extract column title from a dataframe and add it to another dataframe?

My goal is to have the column titles from the small df added to an existing large dataframe without me manually typing the names in.
This is the small dataframe.
veddra_term_code veddra_version veddra_term_name number_of_animals_affected accuracy
335 11 Emesis NaN NaN
142 11 Anaemia NOS NaN NaN
The large dataframe is similar to the above but has forty columns.
This is the code I used to extract the small dataframe from the dict.
df = pd.DataFrame(reaction for result in d['results'] for reaction in result['reaction']) #get reaction data
df
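For context, a hypothetical sketch of the nested structure that comprehension expects (field names taken from the output above, None for the empty fields):
import pandas as pd
d = {"results": [
    {"reaction": [
        {"veddra_term_code": 335, "veddra_version": 11, "veddra_term_name": "Emesis",
         "number_of_animals_affected": None, "accuracy": None},
        {"veddra_term_code": 142, "veddra_version": 11, "veddra_term_name": "Anaemia NOS",
         "number_of_animals_affected": None, "accuracy": None},
    ]}
]}
df = pd.DataFrame(reaction for result in d["results"] for reaction in result["reaction"])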
You can pass dataframe.reindex a list of columns consisting of the existing columns plus new ones. Any column that does not exist yet in the dataframe will be filled with NaN.
Assume df is the big dataframe you want to extend with columns. Create a list of column names to add (columns_to_add) from your small dataframe, combine it with the existing columns, and then call reindex on the big dataframe.
import pandas as pd
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
existing_columns = df.columns.tolist()
columns_to_add = ["C", "D"] # or use small_df.columns.tolist()
new_columns = existing_columns + columns_to_add
df = df.reindex(columns=new_columns)
This will produce:
A B C D
0 1 2 NaN NaN
1 2 3 NaN NaN
2 3 4 NaN NaN
If you do not like NaN, you can use a different value by passing the keyword fill_value (e.g. df.reindex(columns=new_columns, fill_value=0)).
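For instance, with fill_value=0 the same reindex should produce:
df = df.reindex(columns=new_columns, fill_value=0)
   A  B  C  D
0  1  2  0  0
1  2  3  0  0
2  3  4  0  0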
df.columns will give you an array of the names of the columns:
import numpy as np

# loop over the small dataframe's headers
for i in small_df.columns:
    # if the large df doesn't have the header, create it
    if i not in large_df.columns:
        # creates a new all-NaN column
        large_df.loc[:, i] = np.nan
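A quick hypothetical check with toy frames (names borrowed from the question):
import pandas as pd
import numpy as np
small_df = pd.DataFrame(columns=["veddra_term_code", "accuracy"])
large_df = pd.DataFrame({"veddra_term_code": [335, 142]})
for i in small_df.columns:
    if i not in large_df.columns:
        large_df.loc[:, i] = np.nan
print(large_df)  # 'accuracy' added as an all-NaN column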

Formatting Excel to DataFrame

(screenshot of the Excel sheet)
Please take a look at the Excel sheet snapshot above. When I create a DataFrame from this sheet, my first column and row are filled with NaN. I need to skip this blank row and column and start the DataFrame from the second row and column.
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3
0 NaN ID SCOPE TASK
1 NaN 34 XX something_1
2 NaN 534 SS something_2
3 NaN 43 FF something_3
4 NaN 32 ZZ something_4
I want my DataFrame to look like this
0 ID SCOPE TASK
1 34 XX something_1
2 534 SS something_2
3 43 FF something_3
4 32 ZZ something_4
I tried this code but didn't get what I expected:
df = pd.read_excel("Book1.xlsx")
df.columns = df.iloc[0]
df.drop(df.index[1])
df.head()
NaN ID SCOPE TASK
0 NaN ID SCOPE TASK
1 NaN 34 XX something_1
2 NaN 534 SS something_2
3 NaN 43 FF something_3
4 NaN 32 ZZ something_4
I still need to drop the first column and the row at index 0 from here.
Can anyone help?
Specify the row number which will be the header (column names) of the dataframe using the header parameter; in your case it is 1. Also, specify the column names using the usecols parameter; in your case they are 'ID', 'SCOPE', and 'TASK'.
df = pd.read_excel('your_excel_file.xlsx', header=1, usecols=['ID','SCOPE', 'TASK'])
Check out header and usecols in the pandas.read_excel documentation.
If it's an entire column you wish to delete, try this:
del df["name of the column"]
Here's an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,2),columns=['a','b'])
# created a random dataframe 'df' with 'a' and 'b' as columns
del df['a'] # deleted column 'a' using 'del'
print(df) # no column 'a' in 'df' now
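A non-mutating alternative is DataFrame.drop, which returns a new frame instead of deleting in place:
df = df.drop(columns=['a'])  # same effect as del df['a'], but returns a copy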
You can actually do it all while reading your Excel file with pandas. You want to:
skip the first line: use the argument skiprows=0 (but see the edit below)
use the columns from B to D: use the argument usecols="B:D"
use row #2 as the header (I assumed here): use the argument header=1 (0-indexed)
Answer:
df = pd.read_excel("Book1.xlsx", skiprows=0, usecols="B:D", header=1)
Edit: you don't even need to use skiprows when using header (skiprows=0 skips zero rows anyway).
df = pd.read_excel("Book1.xlsx", usecols="B:D", header=1)

Make Pandas figure out how many rows to skip in pd.read_excel

I'm trying to automate reading hundreds of Excel files into a single dataframe. Thankfully the layout of the Excel files is fairly constant: they all have the same header (though its casing may vary) and the same number of columns, and the data I want to read is always stored in the first sheet.
However, in some files a number of rows have been skipped before the actual data begins. There may or may not be comments and such in the rows before the actual data. For instance, in some files the header is in row 3 and then the data starts in row 4 and down.
I would like pandas to figure out on its own how many rows to skip. Currently I use a somewhat complicated solution: I first read the file into a dataframe, check if the header is correct, if not search for the row containing the header, and then re-read the file, now knowing how many rows to skip:
def find_header_row(df, my_header):
    """Find the row containing the header."""
    for idx, row in df.iterrows():
        row_header = [str(t).lower() for t in row]
        if len(set(my_header) - set(row_header)) == 0:
            return idx + 1
    raise Exception("Can't find header row!")
my_header = ['col_1', 'col_2',..., 'col_n']
df = pd.read_excel('my_file.xlsx')
# Make columns lower case (case may vary)
df.columns = [t.lower() for t in df.columns]
# Check if the header of the dataframe matches my_header
if len(set(my_header) - set(df.columns)) != 0:
    # If not, use my function to find the row containing the header
    n_rows_to_skip = find_header_row(df, my_header)
    # Re-read the dataframe, skipping the right number of rows
    df = pd.read_excel('my_file.xlsx', skiprows=n_rows_to_skip)
Since I know what the header row looks like, is there a way to let pandas figure out on its own where the data begins? Or can anyone think of a better solution?
Let us know if this works for you:
import pandas as pd
df = pd.read_excel("unamed1.xlsx")
df
Unnamed: 0 Unnamed: 1 Unnamed: 2
0 NaN bad row1 badddd row111 NaN
1 baaaa NaN NaN
2 NaN NaN NaN
3 id name age
4 1 Roger 17
5 2 Rosa 23
6 3 Rob 31
7 4 Ives 15
# index of the first row where every column is non-null: treat it as the header row
first_row = (df.count(axis=1) >= df.shape[1]).idxmax()
df.columns = df.loc[first_row]  # promote that row to the header
df = df.loc[first_row + 1:]     # keep only the rows below it
df
3 id name age
4 1 Roger 17
5 2 Rosa 23
6 3 Rob 31
7 4 Ives 15
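One small follow-up, under the same assumptions: after promoting the row, the columns index keeps the old row label as its name (the 3 printed above the header), and the row index still starts at 4. A sketch to tidy both:
df.columns.name = None  # drop the leftover '3' label
df = df.reset_index(drop=True)  # renumber the rows from 0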

Set the first column of pandas dataframe as header

This is my output DataFrame from reading an Excel file.
I would like my first column to be the index/header.
one Entity
0 two v1
1 three Prod
2 four 2015-05-27 00:00:00
3 five 2018-04-27 00:00:00
4 six Both
5 seven id
6 eight hello
To set the first column of the pandas DataFrame as the header:
set header=1 while reading the file,
e.g. df = pd.read_csv(inputfilePath, header=1)
or set skiprows=1 while reading the file,
e.g. df = pd.read_csv(inputfilepath, skiprows=1)
or set the header from iloc[0] on the DataFrame,
e.g. df.columns = df.iloc[0]
I hope this will help.
One way is using T twice:
df = df.T.set_index(0).T
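A step-by-step sketch of what the double transpose does, on hypothetical toy data:
import pandas as pd
df = pd.DataFrame([["id", "name"], [1, "Roger"], [2, "Rosa"]])
df = (df.T             # rows become columns
        .set_index(0)  # use the original first row as the new index
        .T)            # transpose back: that row is now the header
print(df)  # columns are 'id' and 'name'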

Only getting relevant data from Pandas Dataframe

Brief background: I recently started using Pandas to read in a csv file of data. I'm able to create a dataframe from the csv, but now I want to do some calculations using only specific columns of the dataset.
Is there a way to create a new dataframe where I only use rows where the relevant columns are not NA or 0? For example, imagine an array that looks like:
blah blah1 blah2 blah3
0 1 1 1 1
1 NA NA 1 NA
2 1 1 1 1
So say I want to do things with data under columns "blah1" and "blah2", but I want to only use rows 0 and 2 because 1 has an NA under the column "blah".
Is there a simple way of doing this? Thanks!
Edit (Clarifications):
- I don't know ahead of time that I want to drop row 1, so I need to be able to check for an NA value (and possibly other placeholder values beyond just whether it is null).
Yes, you can use dropna:
df = df.dropna(axis=0)  # axis=0 drops rows that contain NA
and to select columns use this:
df = df[["blah1", "blah2"]]
Now df contains only columns "blah1" and "blah2", and rows 0 and 2.
EDIT 1
To limit the NaN check to some columns, you can use isnull():
mask = df[["blah1", "blah2"]].isnull().all(axis=1)
df = df[~mask]
This drops only the rows where both blah1 and blah2 are null.
EDIT 2
To filter out rows holding some placeholder value other than NaN (here assuming a column B):
mask = df.B == 'placeholder'
df = df[~mask]
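Putting it together on the example frame, a minimal sketch (NA entered as np.nan):
import pandas as pd
import numpy as np
df = pd.DataFrame({"blah": [1, np.nan, 1],
                   "blah1": [1, np.nan, 1],
                   "blah2": [1, 1, 1],
                   "blah3": [1, np.nan, 1]})
# keep rows where 'blah' is not null, then take just the relevant columns
out = df.loc[df["blah"].notna(), ["blah1", "blah2"]]
print(out)  # rows 0 and 2 only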
