As part of a data profiling exercise, I'm reading Excel sheets into pandas dataframes.
df = pd.ExcelFile('file.xlsx').parse(0)
nullcounts = df.isnull().sum().to_frame('null_records')
This produces a nice frame with the null count for every series in my dataframe. But if the string 'NA' appears in a row of data, I don't want the isnull operation to return True.
Is there a simple way to do this without hard coding a rule for a specific column/dataframe?
Edit: it appears that the NAs in my source data are being converted to missing values when read into pandas, since when I load the data and compare visually I see NaN where the Excel file had NA.
If you use read_excel, you can define which values are converted to NaN with the parameters keep_default_na and na_values:
df = pd.read_excel('file.xlsx')
print (df)
a b
0 NaN NaN
1 3.0 6.0
nullcounts = df.isnull().sum().to_frame('null_records')
print (nullcounts)
null_records
a 1
b 1
df = pd.read_excel('file.xlsx',keep_default_na=False,na_values=['NaN'])
print (df)
a b
0 NA NaN
1 3 6.0
nullcounts = df.isnull().sum().to_frame('null_records')
print (nullcounts)
null_records
a 0
b 1
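To avoid hard-coding a rule per file, the read parameters can be wrapped in a small helper that works for any workbook. A minimal sketch (the path argument and the list of markers treated as missing are assumptions, not part of the original answer):
import pandas as pd

def null_counts(path, na_markers=('NaN',)):
    # Treat only the given markers as missing, so literal 'NA' strings survive as data.
    df = pd.read_excel(path, keep_default_na=False, na_values=list(na_markers))
    return df.isnull().sum().to_frame('null_records')

print(null_counts('file.xlsx'))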
Sorry if this seems repetitive; I've found a lot of close answers using groupby and size, but none that return the column header as the index.
I have the following df (which actually has 340 columns and many rows):
import pandas as pd
data = {'Name_Clean_40_40_Correct':['0','1','0','0'], 'Name_Clean_40_80_Correct':['0','1','1','N/A'],'Name_Clean_40_60_Correct':['N/A','N/A','0','1']}
df_third = pd.DataFrame(data)
I am trying to count the instances of '0', '1', and 'N/A' for each column. So I'd like the index to be the column names and the columns to be '0', '1', and 'N/A'.
I was trying this, but I'm afraid it is very inefficient or incorrect, since it won't complete.
def countx(x, colname):
    df_thresholds = df_third.groupby(colname).count()

for col in df_thresholds.columns:
    df_thresholds[col + '_Count'] = df_third.apply(countx, axis=1, args=(col,))
I can do it for one column but that would be a pain:
df_thresholds=df_third.groupby('Name_Clean_100_100_Correct').count()
df_thresholds=df_thresholds[['Name_Raw']]
df_thresholds=df_thresholds.T
If I understand correctly this should work:
df_third.apply(pd.Series.value_counts)
result:
Name_Clean_40_40_Correct ... Name_Clean_40_60_Correct
0 3.0 ... 1
1 1.0 ... 1
N/A NaN ... 2
BTW: to select only columns containing 'Correct':
df_third.filter(like='Correct')
Transposed form (df_third.apply(pd.Series.value_counts).T):
0 1 N/A
Name_Clean_40_40_Correct 3.0 1.0 NaN
Name_Clean_40_80_Correct 1.0 2.0 1.0
Name_Clean_40_60_Correct 1.0 1.0 2.0
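If the goal is exactly the layout described in the question (column names as the index, '0', '1' and 'N/A' as the columns), the two steps above can be chained; a minimal sketch, assuming the example data:
import pandas as pd

data = {'Name_Clean_40_40_Correct': ['0','1','0','0'],
        'Name_Clean_40_80_Correct': ['0','1','1','N/A'],
        'Name_Clean_40_60_Correct': ['N/A','N/A','0','1']}
df_third = pd.DataFrame(data)

# Count '0', '1' and 'N/A' per 'Correct' column, with the column names as the index.
counts = (df_third.filter(like='Correct')
                  .apply(pd.Series.value_counts)
                  .T
                  .fillna(0)
                  .astype(int))
print(counts)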
I have a dataframe like the following
df = [[1,'NaN',3],[4,5,'NaN'],[7,8,9]]
df = pd.DataFrame(df)
and I would like to remove all columns that have in their first row a NaN value.
So the output should be:
df = [[1,3],[4,'NaN'],[7,9]]
df = pd.DataFrame(df)
So in this case, only the second column is removed since the first element was a NaN value.
Hence, dropna() would need to be based on a condition... any idea how to handle this? Thanks!
If the values are np.nan and not the string 'NaN' (otherwise replace them first), you can do:
Input:
import numpy as np
import pandas as pd

df = [[1,np.nan,3],[4,5,np.nan],[7,8,9]]
df = pd.DataFrame(df)
Solution:
df.loc[:,df.iloc[0].notna()] #assign back to your desired variable
0 2
0 1 3.0
1 4 NaN
2 7 9.0
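And for the string case mentioned above ('NaN' stored as text rather than a real missing value), a small sketch that replaces the strings first:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 'NaN', 3], [4, 5, 'NaN'], [7, 8, 9]])

# Turn the literal strings into real missing values, then keep only
# the columns whose first row is not NaN.
df = df.replace('NaN', np.nan)
df = df.loc[:, df.iloc[0].notna()]
print(df)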
I tried to insert a new row into a dataframe named 'my_df1' using the my_df1.loc indexer. But in the result, the new row that was added has NaN values.
my_data = {'A':pd.Series([1,2,3]),'B':pd.Series([4,5,6]),'C':('a','b','c')}
my_df1 = pd.DataFrame(my_data)
print(my_df1)
my_df1.loc[3] = pd.Series([5,5,5])
The result displayed is as below:
A B C
0 1.0 4.0 a
1 2.0 5.0 b
2 3.0 6.0 c
3 NaN NaN NaN
The reason the new row is all NaN is that my_df1.loc[3] has the index (A, B, C) while pd.Series([5,5,5]) has the index (0, 1, 2). When you assign one Series to another, pandas only copies values at common index labels, hence the result.
To fix this, do as #anky_91 says, or, if you already have a Series, use its values:
my_df1.loc[3] = my_series.values
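For example, a minimal sketch of that fix with the data from the question:
import pandas as pd

my_data = {'A': pd.Series([1,2,3]), 'B': pd.Series([4,5,6]), 'C': ('a','b','c')}
my_df1 = pd.DataFrame(my_data)

# .values drops the Series' 0, 1, 2 index, so the assignment is positional.
my_df1.loc[3] = pd.Series([5, 5, 5]).values
print(my_df1)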
Finally I found out how to add a Series as a row or column to a dataframe
my_data = {'A':pd.Series([1,2,3]),'B':pd.Series([4,5,6]),'C':('a','b','c')}
my_df1 = pd.DataFrame(my_data)
print(my_df1)
Code 1 adds a new column 'D' with the values 5, 5, 5 to the dataframe:
my_df1.loc[:,'D'] = pd.Series([5,5,5],index = my_df1.index)
print(my_df1)
Code 2 adds a new row with index 3 and the values 3, 4, 3, 4 to the dataframe from code 1:
my_df1.loc[3] = pd.Series([3,4,3,4],index = ('A','B','C','D'))
print(my_df1)
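For completeness, a plain list also assigns positionally and sidesteps the label alignment entirely; a small sketch (not part of the original answer):
import pandas as pd

my_df1 = pd.DataFrame({'A': [1,2,3], 'B': [4,5,6], 'C': ['a','b','c']})

# A plain list is matched to the columns left to right, no index labels involved.
my_df1.loc[3] = [5, 5, 'd']
print(my_df1)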
I have initialized an empty pandas dataframe that I am now trying to fill but I keep running into the same error. This is the (simplified) code I am using
import pandas as pd
cols = list("ABC")
df = pd.DataFrame(columns=cols)
# set the values for the first two rows
df.loc[0:2,:] = [[1,2],[3,4],[5,6]]
On running the above code I get the following error:
ValueError: cannot copy sequence with size 3 to array axis with dimension 0
I am not sure what's causing this. I tried the same thing one row at a time and it works (df.loc[0,:] = [1,2,3]). I thought this should be the logical extension when I want to handle more than one row, but clearly I am wrong. What's the correct way to do this? I need to enter values for multiple rows and columns at once. I can do it with a loop, but that's not what I am looking for.
Any help would be great. Thanks
Since you have the columns from the empty dataframe, use them in the DataFrame constructor, i.e.
import numpy as np
import pandas as pd

cols = list("ABC")
df = pd.DataFrame(columns=cols)
df = pd.DataFrame(np.array([[1,2],[3,4],[5,6]]).T, columns=df.columns)
A B C
0 1 3 5
1 2 4 6
Well, if you want to use loc specifically, then reindex the dataframe first and then assign, i.e.
arr = np.array([[1,2],[3,4],[5,6]]).T
df = df.reindex(np.arange(arr.shape[0]))
df.loc[0:arr.shape[0],:] = arr
A B C
0 1 3 5
1 2 4 6
How about adding the data by index as below? You can call the function externally as and when you receive data.
def add_to_df(index, data):
    for idx, i in zip(index, zip(*data)):
        df.loc[idx] = i

# Set values for the first two rows
data1 = [[1,2],[3,4],[5,6]]
index1 = [0,1]
add_to_df(index1, data1)
print(df)
print("")

# Set values for the next three rows
data2 = [[7,8,9],[10,11,12],[13,14,15]]
index2 = [2,3,4]
add_to_df(index2, data2)
print(df)
Result
>>>
A B C
0 1.0 3.0 5.0
1 2.0 4.0 6.0
A B C
0 1.0 3.0 5.0
1 2.0 4.0 6.0
2 7.0 10.0 13.0
3 8.0 11.0 14.0
4 9.0 12.0 15.0
>>>
Looking through the documentation and some experiments, my guess is that loc only allows you to insert one key at a time. However, you can insert multiple keys first with reindex, as #Dark shows.
The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.
http://pandas-docs.github.io/pandas-docs-travis/indexing.html#setting-with-enlargement
Also, when you use loc[0:2, :], you mean to select the first two rows. However, there is nothing in the empty df to select: there are no rows, yet you are trying to insert 3 rows. Thus the message:
ValueError: cannot copy sequence with size 3 to array axis with dimension 0
BTW, [[1,2],[3,4],[5,6]] will be 3 rows rather than 2.
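A minimal sketch of what one-key-at-a-time enlargement looks like in practice, using the same values as above:
import pandas as pd

cols = list("ABC")
df = pd.DataFrame(columns=cols)

# Each new index label assigned via loc enlarges the frame by one row.
for i, row in enumerate([[1, 3, 5], [2, 4, 6]]):
    df.loc[i] = row
print(df)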
Does this get the output you are looking for?
import pandas as pd
df=pd.DataFrame({'A':[1,2],'B':[3,4],'C':[5,6]})
Output :
A B C
0 1 3 5
1 2 4 6
I have a large Excel workbook with 1 sheet with roughly 45,000 rows and 45 columns. I want to iterate through the columns looking for duplicates and unique items, and it's taking a very long time to go through individual columns. Is there any way to optimize my code or make this go faster? I either want to print the information or save it to a txt file. I'm on Windows 10 and Python 2.7, using the openpyxl module:
from openpyxl import load_workbook, worksheet, Workbook
import os
#read work book to get data
wb = load_workbook(filename = 'file.xlsx', use_iterators = True)
ws = wb.get_sheet_by_name(name = 'file')
wb = load_workbook(filename='file.xlsx', read_only=True)
count = 0
seen = set()
uniq = []
for cell in ws.columns[0]:
    if cell.value not in seen:
        uniq.append(cell.value)
        seen.add(cell.value)

print("Unique: " + str(uniq))
print("Doubles: " + str(seen))
EDIT: Let's say I have 5 columns A, B, C, D, E and 10 entries, so 10 rows (5x10). In column A I want to extract all the duplicates and separate them from the unique values.
As VedangMehta mentioned, Pandas will do it very quickly for you.
Run this code:
import pandas as pd
#read in the dataset:
df = pd.read_excel('file.xlsx', sheetname = 'file')
df_dup = df.groupby(axis=1, level=0).apply(lambda x: x.duplicated())
#save duplicated values from first column
df[df_dup].iloc[:,0].to_csv("file_duplicates_col1.csv")
#save unique values from first column
df[~df_dup].iloc[:,0].to_csv("file_unique_col1.csv")
#save duplicated values from all columns:
df[df_dup].to_csv("file_duplicates.csv")
#save unique values from all columns:
df[df_dup].to_csv("file_unique.csv")
For details, see below:
Suppose your dataset looks as follows:
df = pd.DataFrame({'a':[1,3,1,13], 'b':[13,3,5,3]})
df.head()
Out[24]:
a b
0 1 13
1 3 3
2 1 5
3 13 3
You can find which values are duplicated in each column:
df_dup = df.groupby(axis=1, level=0).apply(lambda x: x.duplicated())
the result:
df_dup
Out[26]:
a b
0 False False
1 False False
2 True False
3 False True
you can find the duplicated values by subsetting the df using the boolean dataframe df_dup
df[df_dup]
Out[27]:
a b
0 NaN NaN
1 NaN NaN
2 1.0 NaN
3 NaN 3.0
Again, you can save that using:
#save the above using:
df[df_dup].to_csv("duplicated_values.csv")
to see the duplicated values in the first column, use:
df[df_dup].iloc[:,0]
to get
Out[11]:
0 NaN
1 NaN
2 1.0
3 NaN
Name: a, dtype: float64
For unique values, use ~, which is Python's not operator (it inverts the boolean mask). So you're essentially subsetting df by the values that are not duplicates:
df[~df_dup]
Out[29]:
a b
0 1.0 13.0
1 3.0 3.0
2 NaN 5.0
3 13.0 NaN
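If you want just the duplicated or unique entries themselves, without the NaN placeholders, dropping the NaNs per column afterwards should work; a small follow-up sketch using the same setup:
import pandas as pd

df = pd.DataFrame({'a':[1,3,1,13], 'b':[13,3,5,3]})
df_dup = df.groupby(axis=1, level=0).apply(lambda x: x.duplicated())

# Drop the placeholder NaNs to get plain per-column lists.
print(df[df_dup]['a'].dropna().tolist())   # duplicated entries in 'a' -> [1.0]
print(df[~df_dup]['a'].dropna().tolist())  # unique entries in 'a' -> [1.0, 3.0, 13.0]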
When working in read-only mode, don't use the columns property to read a worksheet. The data is stored in rows, so reading by column requires the parser to continually re-read the file.
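For example, a sketch of collecting the first column by streaming rows instead, assuming the sheet is named 'file' as in the question:
from openpyxl import load_workbook

wb = load_workbook(filename='file.xlsx', read_only=True)
ws = wb['file']  # assumption: the sheet is named 'file'

# Stream the rows once; read-only mode is optimised for this access pattern.
seen = set()
uniq = []
doubles = set()
for row in ws.iter_rows(min_col=1, max_col=1):
    value = row[0].value
    if value in seen:
        doubles.add(value)
    else:
        seen.add(value)
        uniq.append(value)

print("Unique: " + str(uniq))
print("Doubles: " + str(doubles))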
This is an example of using openpyxl to convert worksheets into pandas dataframes. It requires openpyxl 2.4 or higher, which at the time of writing must be checked out from source.
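A sketch of that conversion, following the pattern described in openpyxl's documentation (the sheet name is an assumption):
import pandas as pd
from openpyxl import load_workbook

wb = load_workbook('file.xlsx', read_only=True)
ws = wb['file']  # assumption: the sheet is named 'file'

# Stream the values row by row and use the first row as the header.
rows = ws.values
header = next(rows)
df = pd.DataFrame(rows, columns=header)
print(df.head())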