Counting rows with NaN - python

I have the following DataFrame:
dur wage1 wage2 wage3 cola hours pension stby_pay shift_diff
6 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
8 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0
9 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
13 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
17 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
31 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
43 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN
44 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0
47 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
What I need to do is count the rows that are exactly the same, including their NaN values.
The problem is that groupby ignores NaN values: any row containing a NaN is dropped from the groups, so the count of exact repetitions comes out wrong.
My code is the following one:
from pandas import DataFrame

def detect_duplicates(data):
    x = DataFrame(columns=data.columns.tolist() + ["num_reps"])
    aux = data[data.duplicated(keep=False)]
    x = data[data.duplicated(keep=False)].drop_duplicates()
    # This line should count my repeated rows
    s = aux.groupby(data.columns.tolist(), as_index=False).transform('size')
    return x
If I print the x variable, I get this result, showing all the repeated rows:
dur wage1 wage2 wage3 cola hours pension stby_pay shift_diff
6 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
8 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0
9 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
13 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
17 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
31 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
43 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN
44 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0
47 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
51 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
53 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN
Now I have to count those rows from my x result that are exactly the same.
This should be my correct output:
dur wage1 wage2 wage3 cola hours pension stby_pay shift_diff num_reps
6 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN 4
8 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0 2
9 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0 3
43 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN 2
Here is my problem: groupby ignores NaN values, which is why other similar posts about this problem don't help me.
Thanks

If your DataFrame is named df, you can count the number of duplicate rows with a single line of code:
sum(df.duplicated(keep = False))
If you want to drop duplicate rows, use the drop_duplicates method (documentation).
Example:
#data.csv (the "#duplicate" notes are annotations, not part of the file)
col1,col2,col3
a,3,NaN #duplicate
b,9,4 #duplicate
c,12,5
a,3,NaN #duplicate
b,9,4 #duplicate
d,19,20
a,3,NaN #duplicate - 5 duplicate rows
Importing data.csv and dropping duplicate rows (by default the first instance of a duplicated row is kept)
import pandas as pd
df = pd.read_csv("data.csv")
print(df.drop_duplicates())
#Output
col1 col2 col3
0 a 3 NaN
1 b 9 4.0
2 c 12 5.0
5 d 19 20.0
To count the number of duplicate rows, use the DataFrame's duplicated method with keep set to False (documentation). As mentioned above, you can do this with sum(df.duplicated(keep = False)). Here's a longer way that demonstrates what the duplicated method does:
duplicate_rows = df.duplicated(keep = False)
print(duplicate_rows)
#count the number of duplicates (i.e. the number of True values in
#the duplicate_rows boolean series)
number_of_duplicates = sum(duplicate_rows)
print("Number of duplicate rows:")
print(number_of_duplicates)
#Output
#The duplicate_rows pandas series from df.duplicated(keep = False)
0 True
1 True
2 False
3 True
4 True
5 False
6 True
dtype: bool
#The number of rows from sum(df.duplicated(keep = False))
Number of duplicate rows:
5

I just solved it.
The problem, as I said, is that groupby does not accept NaN values.
To solve it, I replace all NaN values using the fillna(0) function, so every NaN becomes 0 and the comparison works properly.
Here is my new function working properly:
from pandas import DataFrame

def detect_duplicates(data):
    x = DataFrame(columns=data.columns.tolist() + ["num_reps"])
    aux = data[data.duplicated(keep=False)]
    x = data[data.duplicated(keep=False)].drop_duplicates()
    # fillna(0) turns every NaN into 0 so groupby no longer drops those rows
    s = aux.fillna(0).groupby(data.columns.tolist()).size().reset_index().rename(columns={0: 'count'})
    x['num_reps'] = s['count'].tolist()[::-1]
    return x
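If you're on pandas 1.1 or later, an alternative worth considering is groupby's dropna=False option, which keeps NaN as a regular group key and avoids the fillna(0) workaround (which could conflate NaN with genuine zeros). A minimal sketch on a cut-down frame shaped like the question's data (not its exact columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dur":   [3.0, 1.0, 1.0, 1.0, 3.0],
    "wage1": [2.0, 2.8, 5.7, 5.7, 2.0],
    "cola":  [np.nan, "none", "none", "none", np.nan],
})

# Keep only rows that occur more than once, then count each distinct
# row; dropna=False makes NaN behave like any other group key.
dups = df[df.duplicated(keep=False)]
counts = (dups.groupby(df.columns.tolist(), dropna=False)
              .size()
              .reset_index(name="num_reps"))
print(counts)
```

Here both duplicated rows (one of which contains NaN) get num_reps = 2, with no need to rewrite the data first.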

Related

Convert two pandas rows into one

I want to convert the DataFrame below,
ID TYPE A B
0 1 MISSING 0.0 0.0
1 2 1T 1.0 2.0
2 2 2T 3.0 4.0
3 3 MISSING 0.0 0.0
4 4 2T 10.0 4.0
5 5 CBN 15.0 20.0
6 5 DSV 25.0 35.0
to:
ID MISSING_A MISSING_B 1T_A 1T_B 2T_A 2T_B CBN_A CBN_B DSV_A DSV_B
0 1 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 2 NaN NaN 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
3 3 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
4 4 10.0 4.0 NaN NaN 10.0 4.0 NaN NaN NaN NaN
5 5 NaN NaN NaN NaN NaN NaN 15.0 20.0 25.0 35.0
For IDs with multiple types, the multiple A and B rows should merge into a single row, as shown above.
You are looking for a pivot, which will give you a MultiIndex on the columns. You'll need to join those column levels to get the suffixes you are looking for.
df = df.pivot(index='ID', columns='TYPE', values=['A', 'B'])
# Flatten the (value, TYPE) MultiIndex into TYPE_value column names
df.columns = ['_'.join(reversed(col)).strip() for col in df.columns.values]
df = df.reset_index()
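As a sketch, here are those three lines run against the question's sample data (note that pivot sorts the TYPE level alphabetically, so the column order may differ from the layout shown in the question):

```python
import pandas as pd

df = pd.DataFrame({
    "ID":   [1, 2, 2, 3, 4, 5, 5],
    "TYPE": ["MISSING", "1T", "2T", "MISSING", "2T", "CBN", "DSV"],
    "A":    [0.0, 1.0, 3.0, 0.0, 10.0, 15.0, 25.0],
    "B":    [0.0, 2.0, 4.0, 0.0, 4.0, 20.0, 35.0],
})

# Pivot produces a (value, TYPE) MultiIndex on the columns ...
wide = df.pivot(index="ID", columns="TYPE", values=["A", "B"])
# ... which we flatten into TYPE_value names such as "MISSING_A".
wide.columns = ["_".join(reversed(col)) for col in wide.columns.values]
wide = wide.reset_index()
print(wide)
```

Each ID ends up on one row, with NaN in the TYPE columns it never had.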

How to append to individual columns in a Pandas DataFrame

So I want to add/append data to a specific pandas DataFrame column without causing NaN values in the remaining columns.
I.e.
import pandas as pd

DataFrame = pd.DataFrame(columns=["column1", "column2", "column3"])
for i in range(3):
    DataFrame = DataFrame.append({"column1": int(i)}, ignore_index=True)
    DataFrame = DataFrame.append({"column2": float(i * 2)}, ignore_index=True)
    DataFrame = DataFrame.append({"column3": int(i * 5)}, ignore_index=True)
print(DataFrame)
This will return:
column1 column2 column3
0 0.0 NaN NaN
1 NaN 0.0 NaN
2 NaN NaN 0.0
3 1.0 NaN NaN
4 NaN 2.0 NaN
5 NaN NaN 5.0
6 2.0 NaN NaN
7 NaN 4.0 NaN
8 NaN NaN 10.0
What we want returned:
column1 column2 column3
0 0.0 0.0 0.0
1 1.0 2.0 5.0
2 2.0 4.0 10.0
I know that in this case I could use one .append call for all three columns. But I have cases where the data to be appended varies based on multiple conditions, so I'd like to know whether it's possible to append to a single column in a DataFrame without producing NaN values in the remaining columns; that would let me avoid writing hundreds of if/else statements.
Alternatively, I'd be happy with a good way to 'collapse' the NaN values: removing them without removing the entire row, so that if there is a NaN at index 0 in column 3 and an integer 5 at index 1 in the same column, the 5 moves up to index 0.
Happy to hear any ideas.
IIUC for your current example you can try this:
DataFrame[['column2','column3']]=DataFrame[['column2','column3']].bfill()
Output:
column1 column2 column3
0 0.0 0.0 0.0
1 NaN 0.0 0.0
2 NaN 2.0 0.0
3 1.0 2.0 5.0
4 NaN 2.0 5.0
5 NaN 4.0 5.0
6 2.0 4.0 10.0
7 NaN 4.0 10.0
8 NaN 6.0 10.0
9 3.0 6.0 15.0
10 NaN 6.0 15.0
11 NaN 8.0 15.0
12 4.0 8.0 20.0
13 NaN 8.0 20.0
14 NaN NaN 20.0
then remove the NaN :
DataFrame.dropna(inplace=True)
Output:
column1 column2 column3
0 0.0 0.0 0.0
3 1.0 2.0 5.0
6 2.0 4.0 10.0
9 3.0 6.0 15.0
12 4.0 8.0 20.0
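Another pattern that sidesteps the NaN padding entirely (a sketch, using the question's column names): accumulate plain Python lists inside the loop and build the DataFrame once at the end.

```python
import pandas as pd

col1, col2, col3 = [], [], []
for i in range(5):
    # Append to plain lists instead of the DataFrame, so no NaN
    # padding is ever created in the other columns.
    col1.append(int(i))
    col2.append(float(i * 2))
    col3.append(int(i * 5))

df = pd.DataFrame({"column1": col1, "column2": col2, "column3": col3})
print(df)
```

This is also much faster than repeated .append calls, since each append copies the whole frame.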

Add single field on df end

Is it possible, after building a DataFrame with 20+ rows and a number of columns, to add a single field with the total count of a certain value? Users will add different values to the df, and before pandas.DataFrame.to_excel it is necessary to add a single field with some specific data, like in the attached picture. So: is it possible to add a single field to an already structured df?
This can work for you:
Df:
A B output
0 a 1.0 1.0
1 a 2.0 1.0
2 a 3.0 1.0
3 a 4.0 1.0
4 a 5.0 1.0
import numpy as np

for i in range(df.iloc[-1].name + 1, 25):  # add empty rows up to index 24 (adjust as needed)
    df.loc[i, :] = np.nan
# For this example we just store the sum of column B; change as needed.
df.loc[df.iloc[-1].name + 1, 'A'] = 'Result: ' + str(df['B'].sum())
print(df)
A B output
0 a 1.0 1.0
1 a 2.0 1.0
2 a 3.0 1.0
3 a 4.0 1.0
4 a 5.0 1.0
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
13 NaN NaN NaN
14 NaN NaN NaN
15 NaN NaN NaN
16 NaN NaN NaN
17 NaN NaN NaN
18 NaN NaN NaN
19 NaN NaN NaN
20 NaN NaN NaN
21 NaN NaN NaN
22 NaN NaN NaN
23 NaN NaN NaN
24 NaN NaN NaN
25 Result: 15.0 NaN NaN
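If the blank spacer rows aren't needed, a shorter alternative (a sketch on the same toy frame, without the output column) is to append just one summary row with pd.concat:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a"] * 5, "B": [1.0, 2.0, 3.0, 4.0, 5.0]})

# One summary row holding the total of column B; the other cells stay NaN.
summary = pd.DataFrame({"A": [f"Result: {df['B'].sum()}"]}, index=[len(df)])
df = pd.concat([df, summary])
print(df)
```

The summary lands directly below the data, which to_excel will write as the last row of the sheet.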

Pandas combine two columns

I have the following DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Buy': [10, np.nan, 2, np.nan, np.nan, 4], 'Sell': [np.nan, 7, np.nan, 9, np.nan, np.nan]})
Out[37]:
Buy Sell
0 10.0 NaN
1 NaN 7.0
2 2.0 NaN
3 NaN 9.0
4 NaN NaN
5 4.0 NaN
I want to create two more columns, called Quant and B/S.
For Quant it works fine as follows:
df['Quant'] = df['Buy'].fillna(df['Sell'])  # take whichever value is available; if both are NaN the output is NaN
Output is:
df
Out[39]:
Buy Sell Quant
0 10.0 NaN 10.0
1 NaN 7.0 7.0
2 2.0 NaN 2.0
3 NaN 9.0 9.0
4 NaN NaN NaN
5 4.0 NaN 4.0
But I want to create B/S based on which column the value in Quant was taken from.
You can perform an equality test and feed into numpy.where:
df['B/S'] = np.where(df['Quant'] == df['Buy'], 'B', 'S')
For the case where both values are null, you can use an additional step:
df.loc[df[['Buy', 'Sell']].isnull().all(1), 'B/S'] = np.nan
Example
from io import StringIO
import numpy as np
import pandas as pd
mystr = StringIO("""Buy Sell
10 nan
nan 8
4 nan
nan 5
nan 7
3 nan
2 nan
nan nan""")
df = pd.read_csv(mystr, delim_whitespace=True)
df['Quant'] = df['Buy'].fillna(df['Sell'])
df['B/S'] = np.where(df['Quant'] == df['Buy'], 'B', 'S')
df.loc[df[['Buy', 'Sell']].isnull().all(1), 'B/S'] = np.nan
Result
print(df)
Buy Sell Quant B/S
0 10.0 NaN 10.0 B
1 NaN 8.0 8.0 S
2 4.0 NaN 4.0 B
3 NaN 5.0 5.0 S
4 NaN 7.0 7.0 S
5 3.0 NaN 3.0 B
6 2.0 NaN 2.0 B
7 NaN NaN NaN NaN

Duplicated rows from .csv and count- python

I need to get the number of times each complete line appears repeated in my DataFrame, then show only the lines that have repetitions, with a final column giving how many times each of those lines appears.
Input values for creating the correct output table:
dur,wage1,wage2,wage3,cola,hours,pension,stby_pay,shift_diff,educ_allw,holidays,vacation,ldisab,dntl,ber,hplan,agr
2,4.5,4.0,?,?,40,?,?,2,no,10,below average,no,half,?,half,bad
2,2.0,2.0,?,none,40,none,?,?,no,11,average,yes,none,yes,full,bad
3,4.0,5.0,5.0,tc,?,empl_contr,?,?,?,12,generous,yes,none,yes,half,good
1,2.0,?,?,tc,40,ret_allw,4,0,no,11,generous,no,none,no,none,bad
1,6.0,?,?,?,38,?,8,3,?,9,generous,?,?,?,?,good
2,2.5,3.0,?,tcf,40,none,?,?,?,11,below average,?,?,yes,?,bad
3,2.0,3.0,?,tcf,?,empl_contr,?,?,yes,?,?,yes,half,yes,?,good
1,2.1,?,?,tc,40,ret_allw,2,3,no,9,below average,yes,half,?,none,bad
1,2.8,?,?,none,38,empl_contr,2,3,no,9,below average,yes,half,?,none,bad
1,5.7,?,?,none,40,empl_contr,?,4,?,11,generous,yes,full,?,?,good
2,4.3,4.4,?,?,38,?,?,4,?,12,generous,?,full,?,full,good
1,2.8,?,?,?,35,?,?,2,?,12,below average,?,?,?,?,good
2,2.0,2.5,?,?,35,?,?,6,yes,12,average,?,?,?,?,good
1,5.7,?,?,none,40,empl_contr,?,4,?,11,generous,yes,full,?,?,good
2,4.5,4.0,?,none,40,?,?,4,?,12,average,yes,full,yes,half,good
3,3.5,4.0,4.6,none,36,?,?,3,?,13,generous,?,?,yes,full,good
3,3.7,4.0,5.0,tc,?,?,?,?,yes,?,?,?,?,yes,?,good
3,2.0,3.0,?,tcf,?,empl_contr,?,?,yes,?,?,yes,half,yes,?,good
I just have to keep those rows that are completely identical.
This is the table result:
dur wage1 wage2 wage3 cola hours pension stby_pay shift_diff num_reps
6 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN 4
8 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0 2
9 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0 3
43 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN 2
As you can see in this table, we keep, for example, the line with index 6, because lines 6 and 17 of the input table are identical.
With my current code:
from pandas import DataFrame

def detect_duplicates(data):
    x = DataFrame(columns=data.columns.tolist() + ["num_reps"])
    x = data[data.duplicated(keep=False)].drop_duplicates()
    return x
I get the correct rows, but I do not know how to count the repeated rows and add the count in the 'num_reps' column at the end of the table.
This is my result, without the last column that counts the number of repeated rows:
dur wage1 wage2 wage3 cola hours pension stby_pay shift_diff
6 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
8 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0
9 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
43 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN
How can I perform a correct count based on the equality of all the data in the columns, and then add it and show it in the 'num_reps' column?
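One way to get that count, assuming pandas 1.1 or later (whose groupby accepts dropna=False so NaN is kept as a group key), is to group on all columns and keep the groups of size greater than one. A minimal sketch with a cut-down version of the input:

```python
import io
import pandas as pd

csv = io.StringIO(
    "dur,wage1,cola\n"
    "1,5.7,none\n"
    "1,5.7,none\n"
    "2,2.5,?\n"
    "2,2.5,?\n"
    "3,2.0,tcf\n"
)
data = pd.read_csv(csv, na_values=["?"])  # '?' marks missing values in the file

# dropna=False keeps NaN as a valid group key, so rows containing NaN
# are counted too; then keep only the rows that repeat.
counts = (data.groupby(data.columns.tolist(), dropna=False)
              .size()
              .reset_index(name="num_reps"))
x = counts[counts["num_reps"] > 1]
print(x)
```

This yields one row per distinct duplicated line with its repetition count in num_reps (the original index labels are not preserved, unlike the table above).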
