I have a large Excel workbook with 1 sheet of roughly 45,000 rows and 45 columns. I want to iterate through the columns looking for duplicates and unique items, and it's taking a very long time to go through individual columns. Is there any way to optimize my code or make this go faster? I either want to print the information or save it to a txt file. I'm on Windows 10 and Python 2.7, using the openpyxl module:
from openpyxl import load_workbook, worksheet, Workbook
import os
#read work book to get data
wb = load_workbook(filename = 'file.xlsx', use_iterators = True)
ws = wb.get_sheet_by_name(name = 'file')
wb = load_workbook(filename='file.xlsx', read_only=True)
count = 0
seen = set()
uniq = []
for cell in ws.columns[0]:
    if cell not in seen:
        uniq.append(cell)
        seen.add(cell)
print("Unique: " + str(uniq))
print("Doubles: " + str(seen))
EDIT: Let's say I have 5 columns A,B,C,D,E and 10 entries, so 10 rows, 5x10. In column A I want to extract all the duplicates and separate them from the unique values.
As VedangMehta mentioned, Pandas will do it very quickly for you.
Run this code:
import pandas as pd
#read in the dataset:
df = pd.read_excel('file.xlsx', sheet_name = 'file')
df_dup = df.groupby(axis=1, level=0).apply(lambda x: x.duplicated())
#save duplicated values from first column
df[df_dup].iloc[:,0].to_csv("file_duplicates_col1.csv")
#save unique values from first column
df[~df_dup].iloc[:,0].to_csv("file_unique_col1.csv")
#save duplicated values from all columns:
df[df_dup].to_csv("file_duplicates.csv")
#save unique values from all columns:
df[~df_dup].to_csv("file_unique.csv")
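If you would rather keep only the actual duplicated values from the first column, without the NaN padding that the boolean subset leaves behind, a small variation of the line above (a sketch, using the same df_dup as before):

#drop the NaN placeholders so only the duplicated values remain
df[df_dup].iloc[:,0].dropna().to_csv("file_duplicates_col1.csv")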
For details, see below:
Suppose your dataset looks as follows:
df = pd.DataFrame({'a':[1,3,1,13], 'b':[13,3,5,3]})
df.head()
Out[24]:
a b
0 1 13
1 3 3
2 1 5
3 13 3
You can find which values are duplicated in each column:
df_dup = df.groupby(axis=1, level=0).apply(lambda x: x.duplicated())
the result:
df_dup
Out[26]:
a b
0 False False
1 False False
2 True False
3 False True
you can find the duplicated values by subsetting the df using the boolean dataframe df_dup
df[df_dup]
Out[27]:
a b
0 NaN NaN
1 NaN NaN
2 1.0 NaN
3 NaN 3.0
Again, you can save that using:
#save the above using:
df[df_dup].to_csv("duplicated_values.csv")
to see the duplicated values in the first column, use:
df[df_dup].iloc[:,0]
to get
Out[11]:
0 NaN
1 NaN
2 1.0
3 NaN
Name: a, dtype: float64
For unique values, use ~, which is Python's bitwise NOT operator (pandas applies it element-wise). So you're essentially subsetting df by the values that are not duplicates:
df[~df_dup]
Out[29]:
a b
0 1.0 13.0
1 3.0 3.0
2 NaN 5.0
3 13.0 NaN
When working in read-only mode, don't use the columns property to read a worksheet. Because the data is stored in the file row by row, reading by columns forces the parser to re-read the file over and over.
Below is an example of using openpyxl to convert a worksheet into a Pandas dataframe. It requires openpyxl 2.4 or higher, which at the time of writing had to be installed from a source checkout.
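A minimal sketch (assuming the file and sheet names from the question, that openpyxl >= 2.4 is installed, and that the first row holds the column names):

import pandas as pd
from openpyxl import load_workbook

#read-only mode streams the file row by row instead of loading it all into memory
wb = load_workbook('file.xlsx', read_only=True)
ws = wb['file']

#ws.values yields one tuple of cell values per row
rows = ws.values
header = next(rows)
df = pd.DataFrame(list(rows), columns=header)

#duplicates and uniques in the first column
first_col = df.iloc[:, 0]
dup_mask = first_col.duplicated()
print("Unique: " + str(first_col[~dup_mask].tolist()))
print("Duplicates: " + str(first_col[dup_mask].tolist()))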
Given the following test file:
https://docs.google.com/spreadsheets/d/1rRUZirjPj2cBeaukUG8ngEowv80Nqg6N/edit?usp=sharing&ouid=100016243141159098340&rtpof=true&sd=true
I need to import the .xlsx file, which has 4 sheets (this is only an example; my original file has many more sheets), add a column to each dataframe with the name of the sheet it belongs to, and then concatenate the resulting dataframes that have the same number of columns.
In this example I have two sheets with 2 columns (I want those in the same dataframe), and another two sheets with one column each (which I want in only one dataframe).
What have I done so far?
my_dict = pd.read_excel('test.xlsx', header=0, sheet_name=None) #the output is a dictionary

for key, df in my_dict.items():
    df['sheet_name'] = key # This creates a new column in each dataframe with the name of the sheet.
I don't know how to concatenate the dataframes inside the dictionary so that they are grouped by the number of columns each one has. The result here would be two different dataframes.
Read in the data:
xlsx = pd.read_excel('test.xlsx', sheet_name = None)
Create two variables, one containing the dataframes that have two columns, the other containing the dataframes that have only one column:
two = {key:value for key,value in xlsx.items() if value.columns.size == 2}
one = {key:value for key,value in xlsx.items() if value.columns.size == 1}
Concatenate two and one individually:
two = pd.concat(two, names = ['sheet_name', None]).droplevel(-1).reset_index()
two
sheet_name A B C D
0 JFK 1.0 2.0 NaN NaN
1 JFK 5.0 6.0 NaN NaN
2 MIA NaN NaN 1.0 1.0
3 MIA NaN NaN 2.0 2.0
one = pd.concat(one, names = ['sheet_name', None]).droplevel(-1).reset_index()
one
sheet_name z
0 SJU 1
1 SJU 2
2 BCN 3
3 BCN 4
If you want the dataframes with two columns to share the same column names, you can do the renaming during the dictionary filtering phase:
two = {key: value.set_axis(['A', 'B'], axis='columns')
       for key, value in xlsx.items()
       if value.columns.size == 2}
# concatenation will result in only three columns:
two = pd.concat(two, names = ['sheet_name', None]).droplevel(-1).reset_index()
two
sheet_name A B
0 JFK 1 2
1 JFK 5 6
2 MIA 1 1
3 MIA 2 2
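If the real workbook has sheets with many different column counts, the same idea can be applied generically. A sketch (assuming the xlsx dictionary read in above) that buckets the sheets by column count and concatenates each bucket:

from collections import defaultdict
import pandas as pd

xlsx = pd.read_excel('test.xlsx', sheet_name=None)

#bucket the sheets by their number of columns
groups = defaultdict(dict)
for name, frame in xlsx.items():
    groups[frame.columns.size][name] = frame

#concatenate each bucket into one dataframe, keyed by column count
result = {ncols: pd.concat(frames, names=['sheet_name', None]).droplevel(-1).reset_index()
          for ncols, frames in groups.items()}

#result[2] and result[1] correspond to `two` and `one` above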
I have a dataframe like this:

id  name  emails
1   a     a#e.com,b#e.com,c#e.com,d#e.com
2   f     f#gmail.com
And I need to iterate over the emails: if a row has more than one, create additional rows in the dataframe for the extra emails, with no corresponding name. The result should look like this:

id  name  emails
1   a     a#e.com
2   f     f#gmail.com
3   NaN   b#e.com
4   NaN   c#e.com
5   NaN   d#e.com
What is the best way to do it, apart from iterrows with append or concat? Is it OK to modify the dataframe while iterating over it?
Thanks.
First split the values with Series.str.split and expand them into rows with DataFrame.explode, then compare the part before # with name and set name to a missing value where there is no match. Finally, sort so that the missing names end up at the end of the DataFrame and assign a fresh range to the id column:
import numpy as np

df = df.assign(emails = df['emails'].str.split(',')).explode('emails')
mask = df['name'].eq(df['emails'].str.split('#').str[0])
df['name'] = np.where(mask, df['name'], np.nan)
df = df.sort_values('name', key=lambda x: x.isna(), ignore_index=True)
df['id'] = range(1, len(df) + 1)
print (df)
id name emails
0 1 a a#e.com
1 2 f f#gmail.com
2 3 NaN b#e.com
3 4 NaN c#e.com
4 5 NaN d#e.com
Consider that I have a dataframe with a few columns
and a list ['salary','gross exp'].
Now I want to perform a column-sum operation only on the columns from the list, and save the result back to the dataframe.
To put this in perspective, the columns in the list ['salary','gross exp'] are money related, so it makes sense to sum them and not touch any of the other columns.
P.S.: I have several Excel workbooks to work on and each consists of a few tens of sheets, so doing it manually is out of the question.
An Excel macro would also work fine, if that's possible.
TIA
Working with the following example:
import pandas as pd
list_ = ['salary', 'gross exp']
d = {'salary': [1,2,3], 'gross exp': [2,2,2], 'another column': [10,10,10]}
df = pd.DataFrame(d)
df

   salary  gross exp  another column
0       1          2              10
1       2          2              10
2       3          2              10
You can then add a new empty row and insert the sums of the listed columns into that same row:
df = df.append(pd.Series(dtype = float), ignore_index=True)
df.loc[df.index[-1],list_] = df[list_].sum()
df

   salary  gross exp  another column
0     1.0        2.0            10.0
1     2.0        2.0            10.0
2     3.0        2.0            10.0
3     6.0        6.0             NaN
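Note that DataFrame.append is deprecated (and removed in pandas 2.0), so on newer pandas the same idea can be written with pd.concat. A sketch, reusing df and list_ from above:

#build a one-row frame holding the sums of the listed columns,
#then concatenate it below the original dataframe
totals = pd.DataFrame([df[list_].sum()])
df = pd.concat([df, totals], ignore_index=True)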
I have initialized an empty pandas dataframe that I am now trying to fill but I keep running into the same error. This is the (simplified) code I am using
import pandas as pd
cols = list("ABC")
df = pd.DataFrame(columns=cols)
# set the values for the first two rows
df.loc[0:2,:] = [[1,2],[3,4],[5,6]]
On running the above code I get the following error:
ValueError: cannot copy sequence with size 3 to array axis with dimension 0
I am not sure what's causing this. I tried the same thing one row at a time and it works (df.loc[0,:] = [1,2,3]). I thought this should be the logical extension when I want to handle more than one row, but clearly I am wrong. What's the correct way to do this? I need to enter values for multiple rows and columns at once. I can do it using a loop, but that's not what I am looking for.
Any help would be great. Thanks
Since you already have the columns from the empty dataframe, use them in the DataFrame constructor, i.e.
import numpy as np
import pandas as pd

cols = list("ABC")
df = pd.DataFrame(columns=cols)
df = pd.DataFrame(np.array([[1,2],[3,4],[5,6]]).T, columns=df.columns)
A B C
0 1 3 5
1 2 4 6
Well, if you want to use loc specifically, then reindex the dataframe first and then assign, i.e.
arr = np.array([[1,2],[3,4],[5,6]]).T
df = df.reindex(np.arange(arr.shape[0]))
df.loc[0:arr.shape[0],:] = arr
A B C
0 1 3 5
1 2 4 6
How about adding data by index as below? You can call the function as and when you receive new data.
def add_to_df(index, data):
    for idx, i in zip(index, zip(*data)):
        df.loc[idx] = i

#Set values for first two rows
data1 = [[1,2],[3,4],[5,6]]
index1 = [0,1]
add_to_df(index1, data1)
print(df)
print("")

#Set values for next three rows
data2 = [[7,8,9],[10,11,12],[13,14,15]]
index2 = [2,3,4]
add_to_df(index2, data2)
print(df)
Result
>>>
A B C
0 1.0 3.0 5.0
1 2.0 4.0 6.0
A B C
0 1.0 3.0 5.0
1 2.0 4.0 6.0
2 7.0 10.0 13.0
3 8.0 11.0 14.0
4 9.0 12.0 15.0
>>>
Looking through the documentation and from some experiments, my guess is that loc only allows you to insert one new key at a time. However, you can create multiple keys first with reindex, as @Dark shows.
The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.
http://pandas-docs.github.io/pandas-docs-travis/indexing.html#setting-with-enlargement
Also, with loc[0:2, :] you mean to select the first rows. However, there is nothing in the empty df for you to select: it has no rows, yet you are trying to insert 3 rows. That is why the message says
ValueError: cannot copy sequence with size 3 to array axis with dimension 0
BTW, [[1,2],[3,4],[5,6]] will be 3 rows rather than 2.
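A small sketch of setting-with-enlargement one row at a time, which does work on an empty frame (reusing cols from the question):

import pandas as pd

cols = list("ABC")
df = pd.DataFrame(columns=cols)

#each assignment to a non-existent label enlarges the frame by one row
for i, row in enumerate([[1, 2, 3], [4, 5, 6]]):
    df.loc[i] = row

print(df)
#   A  B  C
#0  1  2  3
#1  4  5  6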
Does this get the output you're looking for?
import pandas as pd
df=pd.DataFrame({'A':[1,2],'B':[3,4],'C':[5,6]})
Output :
A B C
0 1 3 5
1 2 4 6
As part of a data profiling exercise, I'm reading excel sheets into pandas dataframes.
df = pd.ExcelFile('file.xlsx').parse(0)
nullcounts = df.isnull().sum().to_frame('null_records')
Produces a nice frame with the null count for every series in my dataframe. But if the string 'NA' appears in a row of data, I don't want the isnull operation to return True.
Is there a simple way to do this without hard coding a rule for a specific column/dataframe?
Edit: It appears that the 'NA' strings in my source data are being converted as they are read into pandas, since when I load the data and compare it visually, I see NaN where the Excel file had NA.
If you use read_excel, you can define which values are converted to NaN with the parameters keep_default_na and na_values:
df = pd.read_excel('file.xlsx')
print (df)
a b
0 NaN NaN
1 3.0 6.0
nullcounts = df.isnull().sum().to_frame('null_records')
print (nullcounts)
null_records
a 1
b 1
df = pd.read_excel('file.xlsx',keep_default_na=False,na_values=['NaN'])
print (df)
a b
0 NA NaN
1 3 6.0
nullcounts = df.isnull().sum().to_frame('null_records')
print (nullcounts)
null_records
a 0
b 1
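If the sheets also contain other strings that should be treated as missing (for example 'NULL'), they can simply be added to na_values as well, e.g. (a hypothetical extension of the call above):

df = pd.read_excel('file.xlsx', keep_default_na=False, na_values=['NaN', 'NULL'])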