Formatting Excel to DataFrame - python

[excel sheet snapshot]
Please take a look at my Excel sheet snapshot attached at the top left. When I create a DataFrame from this sheet, my first column and row are filled with NaN. I need to skip this blank row and column and build the DataFrame from the second row and column onward.
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3
0 NaN ID SCOPE TASK
1 NaN 34 XX something_1
2 NaN 534 SS something_2
3 NaN 43 FF something_3
4 NaN 32 ZZ something_4
I want my DataFrame to look like this
0 ID SCOPE TASK
1 34 XX something_1
2 534 SS something_2
3 43 FF something_3
4 32 ZZ something_4
I tried this code, but it didn't give what I expected:
df = pd.read_excel("Book1.xlsx")
df.columns = df.iloc[0]
df.drop(df.index[1])
df.head()
NaN ID SCOPE TASK
0 NaN ID SCOPE TASK
1 NaN 34 XX something_1
2 NaN 534 SS something_2
3 NaN 43 FF something_3
4 NaN 32 ZZ something_4
From here I still need to drop the first column and the row at index 0.
Can anyone help?

Specify the row number which will be the header (column names) of the dataframe using the header parameter; in your case it is 1. Also, specify the column names using the usecols parameter; in your case they are 'ID', 'SCOPE', and 'TASK'.
df = pd.read_excel('your_excel_file.xlsx', header=1, usecols=['ID','SCOPE', 'TASK'])
Check out the header and usecols parameters in the pandas.read_excel documentation.
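For reference, with the sheet from the question that call should come back roughly like this (a sketch; the exact output depends on your file):
import pandas as pd
# header=1 -> use the second row (0-indexed row 1) for the column names
# usecols  -> keep only the named columns, dropping the blank first one
df = pd.read_excel('your_excel_file.xlsx', header=1, usecols=['ID', 'SCOPE', 'TASK'])
print(df.head())
#     ID SCOPE         TASK
# 0   34    XX  something_1
# 1  534    SS  something_2
# 2   43    FF  something_3
# 3   32    ZZ  something_4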

If it's an entire column you wish to delete, try this:
del df["name of the column"]
Here's an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,2),columns=['a','b'])
# created a random dataframe 'df' with 'a' and 'b' as columns
del df['a'] # deleted column 'a' using 'del'
print(df) # no column 'a' in 'df' now
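If you'd rather not mutate the dataframe in place, df.drop works too (pandas 0.21+ accepts the columns keyword); a minimal sketch using the same toy dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])
df = df.drop(columns=['a'])  # returns a new dataframe; pass a list to drop several columns
print(df)  # only column 'b' remains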

You can actually do it all while reading your Excel file with pandas. You want to:
skip the first line : use the argument skiprows=0 (note this skips zero rows, i.e. nothing; see the edit below)
use the columns from B to D : use the argument usecols="B:D"
use row #2 as the header (I assumed here) : use the argument header=1 (0-indexed)
Answer :
df = pd.read_excel("Book1.xlsx", skiprows=0, usecols="B:D", header=1)
Edit: you don't even need to use skiprows when using header.
df = pd.read_excel("Book1.xlsx", usecols="B:D", header=1)

Related

Python pandas drop issue on a date header

I have an Excel document that has three rows ahead of the main header (the column names).
[Excel document screenshot]
When loading the data in pandas data frame using :
import pandas
df = pandas.read_excel('output/tracker.xlsx')
print(df)
I get this data (which is fine):
Date/Time:13/06/2022 Unnamed: 1 Unnamed: 2 Unnamed: 3
0 NaN NaN NaN NaN
1 NaN 2763 2763 NaN
2 NaN Site ID Company Site ID Region
3 203990318_700670803 203990318 689179 Nord-Ost
I do not need the first three rows, so I run:
df = df.iloc[2:]
It removes the rows that have ID of 0 and 1.
But it doesn't remove the Date/Time:13/06/2022 Unnamed: 1 etc. header row.
How do I remove that top row?
Rather, directly load the data without the useless rows using the skiprows parameter of pandas.read_excel:
df = pandas.read_excel('output/tracker.xlsx', skiprows=3)
pandas.read_excel by default assumes the first row is the header, i.e. that it holds the names for the columns. Looking at the snippet of your data, that is not the case here, so use header=None to inform pandas that the first row holds data rather than column names, that is:
import pandas
df = pandas.read_excel('output/tracker.xlsx', header=None)
print(df)
Then you should be able to remove those rows as you already did.
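Putting the two answers together, a minimal sketch (assuming the row holding Site ID, Company Site ID, Region lands at index 3 once header=None shifts everything down):
import pandas
df = pandas.read_excel('output/tracker.xlsx', header=None)
df.columns = df.iloc[3]                   # promote the row with the real column names
df = df.iloc[4:].reset_index(drop=True)   # keep only the data below it
print(df)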

Make Pandas figure out how many rows to skip in pd.read_excel

I'm trying to automate reading in hundreds of excel files into a single dataframe. Thankfully the layout of the excel files is fairly constant. They all have the same header (the casing of the header may vary) and then of course the same number of columns, and the data I want to read is always stored in the first spreadsheet.
However, in some files a number of rows have been skipped before the actual data begins. There may or may not be comments and such in the rows before the actual data. For instance, in some files the header is in row 3 and then the data starts in row 4 and down.
I would like pandas to figure out on its own how many rows to skip. Currently I use a somewhat complicated solution... I first read the file into a dataframe and check if the header is correct; if not, I search for the row containing the header, and then re-read the file, now knowing how many rows to skip.
def find_header_row(df, my_header):
    """Find the row containing the header."""
    for idx, row in df.iterrows():
        row_header = [str(t).lower() for t in row]
        if len(set(my_header) - set(row_header)) == 0:
            return idx + 1
    raise Exception("Can't find header row!")
my_header = ['col_1', 'col_2',..., 'col_n']
df = pd.read_excel('my_file.xlsx')
# Make columns lower case (case may vary)
df.columns = [t.lower() for t in df.columns]
# Check if the header of the dataframe matches my_header
if len(set(my_header) - set(df.columns)) != 0:
    # If not... use my function to find the row containing the header
    n_rows_to_skip = find_header_row(df, my_header)
    # Re-read the dataframe, skipping the right number of rows
    df = pd.read_excel('my_file.xlsx', skiprows=n_rows_to_skip)
Since I know what the header row looks like, is there a way to let pandas figure out on its own where the data begins? Or can anyone think of a better solution?
Let us know if this works for you:
import pandas as pd
df = pd.read_excel("unamed1.xlsx")
df
Unnamed: 0 Unnamed: 1 Unnamed: 2
0 NaN bad row1 badddd row111 NaN
1 baaaa NaN NaN
2 NaN NaN NaN
3 id name age
4 1 Roger 17
5 2 Rosa 23
6 3 Rob 31
7 4 Ives 15
# find the first row in which every column has a value - that's the header row
first_row = (df.count(axis=1) >= df.shape[1]).idxmax()
df.columns = df.loc[first_row]  # promote that row to column names
df = df.loc[first_row+1:]       # keep only the rows below it
df
3 id name age
4 1 Roger 17
5 2 Rosa 23
6 3 Rob 31
7 4 Ives 15
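If the header names are known up front, another hedged sketch is to read the file once with header=None and search for the first row that contains all of them:
import pandas as pd
my_header = ['id', 'name', 'age']  # the expected column names, lower-cased
raw = pd.read_excel('my_file.xlsx', header=None)
# first row whose lower-cased cells contain every expected name
mask = raw.apply(lambda r: set(my_header) <= {str(v).lower() for v in r}, axis=1)
header_row = mask.idxmax()  # caution: idxmax returns 0 if no row matches, so validate in real use
df = raw.iloc[header_row + 1:].reset_index(drop=True)
df.columns = raw.iloc[header_row]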

NaNs after merging two dataframes

I have two dataframes like the following:
df1
id name
-------------------------
0 43 c
1 23 t
2 38 j
3 9 s
df2
user_id id
--------------------------------------------------
0 222087 27,26
1 1343649 6,47,17
2 404134 18,12,23,22,27,43,38,20,35,1
3 1110200 9,23,2,20,26,47,37
I want to split all the ids in df2 into multiple rows and join the resultant dataframe to df1 on "id".
I do the following:
b = pd.DataFrame(df2['id'].str.split(',').tolist(), index=df2.user_id).stack()
b = b.reset_index()[[0, 'user_id']] # the split values are in the column currently labeled 0
b.columns = ['Item_id', 'user_id']
When I try to merge, I get NaNs in the resultant dataframe.
pd.merge(b, df1, on = "id", how="left")
id user name
-------------------------------------
0 27 222087 NaN
1 26 222087 NaN
2 6 1343649 NaN
3 47 1343649 NaN
4 17 1343649 NaN
So, I tried doing the following:
b['name'] = np.nan
for i in range(0, len(df1)):
    b['name'][(b['id'] == df1['id'][i])] = df1['name'][i]
It still gives the same result as above. I am confused as to what could cause this because I am sure both of them should work!
Any help would be much appreciated!
I read similar posts on SO but none seemed to have a concrete answer. I am also not sure whether this is related to my code at all.
Thanks in advance!
The problem is that you need to convert the id column of b (the output of the split) to int, because the output of string functions is always string, even when the values look numeric:
b.id = b.id.astype(int)
Another solution is to convert df1.id to string:
df1.id = df1.id.astype(str)
You get NaNs because there is no match: str values don't match int values.
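On newer pandas (0.25+), the split-and-stack step can also be written with Series.explode, which sidesteps the index juggling; a sketch using a subset of the data from the question:
import pandas as pd
df1 = pd.DataFrame({'id': [43, 23, 38, 9], 'name': ['c', 't', 'j', 's']})
df2 = pd.DataFrame({'user': [222087, 404134], 'id': ['27,26', '18,23,43,38']})
b = df2.assign(id=df2['id'].str.split(',')).explode('id')
b['id'] = b['id'].astype(int)  # str.split yields strings, so cast before merging
print(b.merge(df1, on='id', how='left'))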

openpyxl Python Iterating Through Large Data List

I have a large Excel workbook with 1 sheet with roughly 45,000 rows and 45 columns. I want to iterate through the columns looking for duplicates and unique items, and it's taking a very long time to go through individual columns. Is there any way to optimize my code or make this go faster? I either want to print the information or save it to a txt file. I'm on Windows 10 and Python 2.7, using the openpyxl module:
from openpyxl import load_workbook, worksheet, Workbook
import os
#read work book to get data
wb = load_workbook(filename='file.xlsx', use_iterators=True)
ws = wb.get_sheet_by_name(name='file')
wb = load_workbook(filename='file.xlsx', read_only=True)
count = 0
seen = set()
uniq = []
for cell in ws.columns[0]:
    if cell not in seen:
        uniq.append(cell)
        seen.add(cell)
print("Unique: " + str(uniq))  # cast to str so the concatenation doesn't raise TypeError
print("Doubles: " + str(seen))
EDIT: Let's say I have 5 columns A, B, C, D, E and 10 entries, so 10 rows, 5x10. In column A I want to extract all the duplicates and separate them from the unique values.
As VedangMehta mentioned, Pandas will do it very quickly for you.
Run this code:
import pandas as pd
#read in the dataset:
df = pd.read_excel('file.xlsx', sheetname = 'file')
df_dup = df.groupby(axis=1, level=0).apply(lambda x: x.duplicated())
#save duplicated values from first column
df[df_dup].iloc[:,0].to_csv("file_duplicates_col1.csv")
#save unique values from first column
df[~df_dup].iloc[:,0].to_csv("file_unique_col1.csv")
#save duplicated values from all columns:
df[df_dup].to_csv("file_duplicates.csv")
#save unique values from all columns:
df[df_dup].to_csv("file_unique.csv")
For details, see below:
Suppose your dataset looks as follows:
df = pd.DataFrame({'a':[1,3,1,13], 'b':[13,3,5,3]})
df.head()
Out[24]:
a b
0 1 13
1 3 3
2 1 5
3 13 3
You can find which values are duplicated in each column:
df_dup = df.groupby(axis=1, level=0).apply(lambda x: x.duplicated())
the result:
df_dup
Out[26]:
a b
0 False False
1 False False
2 True False
3 False True
You can find the duplicated values by subsetting the df using the boolean dataframe df_dup:
df[df_dup]
Out[27]:
a b
0 NaN NaN
1 NaN NaN
2 1.0 NaN
3 NaN 3.0
Again, you can save that using:
#save the above using:
df[df_dup].to_csv("duplicated_values.csv")
To see the duplicated values in the first column, use:
df[df_dup].iloc[:,0]
to get
Out[11]:
0 NaN
1 NaN
2 1.0
3 NaN
Name: a, dtype: float64
For unique values, use ~, which is Python's not operator. So you're essentially subsetting df by values that are not duplicates:
df[~df_dup]
Out[29]:
a b
0 1.0 13.0
1 3.0 3.0
2 NaN 5.0
3 13.0 NaN
When working with read-only mode, don't use the columns property to read a worksheet: the data is stored in rows, so reading by column requires the parser to continually re-read the file.
This is an example of using openpyxl to convert worksheets into pandas dataframes. It requires openpyxl 2.4 or higher, which at the time of writing must be checked out from source.
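A hedged sketch of the row-wise route into a dataframe (iter_rows(values_only=True) needs openpyxl 2.6+, newer than the version discussed above):
import pandas as pd
from openpyxl import load_workbook
wb = load_workbook(filename='file.xlsx', read_only=True)  # streams rows, low memory
ws = wb.worksheets[0]                  # the data is in the first sheet
rows = ws.iter_rows(values_only=True)  # generator of plain value tuples
header = next(rows)                    # first row holds the column names
df = pd.DataFrame(rows, columns=header)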

Removing header column from pandas dataframe

I have the following dataframe:
df
A B
0 23 12
1 21 44
2 98 21
How do I remove the column names A and B from this dataframe? One way might be to write the dataframe to a csv file and then read it back in specifying header=None. Is there a way to do that without writing out to csv and re-reading?
I think you can't remove the column names, only reset them with a range based on the dataframe's shape:
print df.shape[1]
2
print range(df.shape[1])
[0, 1]
df.columns = range(df.shape[1])
print df
0 1
0 23 12
1 21 44
2 98 21
This is the same as using to_csv and read_csv:
print df.to_csv(header=None,index=False)
23,12
21,44
98,21
print pd.read_csv(io.StringIO(u""+df.to_csv(header=None,index=False)), header=None)
0 1
0 23 12
1 21 44
2 98 21
Another solution, with skiprows:
print df.to_csv(index=False)
A,B
23,12
21,44
98,21
print pd.read_csv(io.StringIO(u""+df.to_csv(index=False)), header=None, skiprows=1)
0 1
0 23 12
1 21 44
2 98 21
How to get rid of a header(first row) and an index(first column).
To write to CSV file:
df = pandas.DataFrame(your_array)
df.to_csv('your_array.csv', header=False, index=False)
To read from CSV file:
df = pandas.read_csv('your_array.csv')
a = df.values
If you want to read a CSV file that doesn't contain a header, pass the additional parameter header:
df = pandas.read_csv('your_array.csv', header=None)
I had the same problem but solved it in this way:
df = pd.read_csv('your-array.csv', skiprows=[0])
Haven't seen this solution yet so here's how I did it without using read_csv:
df.rename(columns={'A':'','B':''})
If you rename all your column names to empty strings, your table will display without a header.
And if you have a lot of columns in your table you can just create a dictionary first instead of renaming manually:
df_dict = dict.fromkeys(df.columns, '')
df = df.rename(columns=df_dict)
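A quick end-to-end sketch of that dictionary approach, using the dataframe from the question; note that rename returns a copy, so assign it back (or pass inplace=True):
import pandas as pd
df = pd.DataFrame({'A': [23, 21, 98], 'B': [12, 44, 21]})
df_dict = dict.fromkeys(df.columns, '')  # {'A': '', 'B': ''}
df = df.rename(columns=df_dict)
print(df)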
You can first convert the DataFrame to a NumPy array, using this:
s1 = df.iloc[:, 0:2].values
Then convert the NumPy array back to a DataFrame:
s2 = pd.DataFrame(s1)
This returns a DataFrame with default integer column labels instead of the original names.
This works perfectly:
To get the dataframe without the header use:
totalRow = len(df.index)
df.iloc[1:totalRow]
Or you can use the second method like this:
totalRow = df.index.stop
df.iloc[1:totalRow]
