I want to add/append data to a specific pandas DataFrame column, but without causing NaN values in the remaining columns.
For example:
import pandas as pd

DataFrame = pd.DataFrame(columns=["column1", "column2", "column3"])
for i in range(3):
    DataFrame = DataFrame.append({"column1": int(i)}, ignore_index=True)
    DataFrame = DataFrame.append({"column2": float(i * 2)}, ignore_index=True)
    DataFrame = DataFrame.append({"column3": int(i * 5)}, ignore_index=True)
print(DataFrame)
This will return:
column1 column2 column3
0 0.0 NaN NaN
1 NaN 0.0 NaN
2 NaN NaN 0.0
3 1.0 NaN NaN
4 NaN 2.0 NaN
5 NaN NaN 5.0
6 2.0 NaN NaN
7 NaN 4.0 NaN
8 NaN NaN 10.0
What we want returned:
column1 column2 column3
0 0.0 0.0 0.0
1 1.0 2.0 5.0
2 2.0 4.0 10.0
I know that in this case I could use a single .append for all three columns. But I have cases where the data to be appended varies based on multiple conditions, so I'd like to know whether it's possible to append to single columns of a DataFrame without producing NaN values in the remaining columns, so that I can avoid writing hundreds of if/else statements.
Alternatively, I'd welcome any good idea for how to 'collapse' the NaN values: removing the NaN values without removing the entire row, so that if there is a NaN at index 0 in column 3 and an integer 5 at index 1 in the same column, the 5 gets moved up to index 0.
Happy to hear any ideas.
IIUC for your current example you can try this:
DataFrame[['column2', 'column3']] = DataFrame[['column2', 'column3']].bfill()
Output:
column1 column2 column3
0 0.0 0.0 0.0
1 NaN 0.0 0.0
2 NaN 2.0 0.0
3 1.0 2.0 5.0
4 NaN 2.0 5.0
5 NaN 4.0 5.0
6 2.0 4.0 10.0
7 NaN 4.0 10.0
8 NaN NaN 10.0
Then remove the NaN rows:
DataFrame.dropna(inplace=True)
Output:
column1 column2 column3
0 0.0 0.0 0.0
3 1.0 2.0 5.0
6 2.0 4.0 10.0
Finally, DataFrame.reset_index(drop=True, inplace=True) renumbers the rows as 0, 1, 2, matching the desired output.
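Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A minimal sketch of the same loop without append, collecting one dict per iteration and building the frame once at the end; this also avoids the staggered-NaN problem entirely, and the dict can be filled from your condition branches before being added to the list:

import pandas as pd

rows = []
for i in range(3):
    # build one complete row per iteration instead of three partial ones
    row = {"column1": int(i), "column2": float(i * 2), "column3": int(i * 5)}
    rows.append(row)
df = pd.DataFrame(rows, columns=["column1", "column2", "column3"])
print(df)

And for the 'collapse' idea in the question, a sketch applied to the original staggered frame: dropping the NaNs in each column and rebuilding it shifts the surviving values up (shorter columns are padded with NaN at the bottom):

collapsed = DataFrame.apply(lambda col: pd.Series(col.dropna().values))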
Related
I have two dataframes. I used the groupby and count() functions to produce this dataframe (df1), but when I used groupby to count the total number of each category, it filtered out the categories whose count is 0. How can I get the outcome below in Python?
Original dataframe:
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 Beam Sensor 27.0 12.0 13.0 14.0
3 CLPS 1.0 NaN NaN 1.0
However, I would like a dataframe that also includes the required categories.
(required categories: ATIDS, BasicCrane, LLP, Beam Sensor, CLPS, SPR)
Expected dataframe (the counts for 'LLP' and 'SPR' are 0):
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 LLP NaN NaN NaN NaN
3 Beam Sensor 27.0 12.0 13.0 14.0
4 CLPS 1.0 NaN NaN 1.0
5 SPR NaN NaN NaN NaN
>>> categories
['ATIDS', 'BasicCrane', 'LLP', 'Beam Sensor', 'CLPS', 'SPR']
>>> pd.merge(pd.DataFrame({'Cat': categories}), df, how='outer')
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 LLP NaN NaN NaN NaN
3 Beam Sensor 27.0 12.0 13.0 14.0
4 CLPS 1.0 NaN NaN 1.0
5 SPR NaN NaN NaN NaN
One thing you could easily do is fill the NaN values with 0 before doing the groupby; all of the previously-NaN data will then be counted as zero.
df.fillna(0)
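An alternative sketch, assuming the Cat values in df are unique: set Cat as the index and reindex with the full category list, which inserts all-NaN rows for the missing categories.

out = df.set_index('Cat').reindex(categories).reset_index()
out = out.fillna(0)  # optional: show the missing categories as zero counts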
I want to convert the dataframe below:
ID TYPE A B
0 1 MISSING 0.0 0.0
1 2 1T 1.0 2.0
2 2 2T 3.0 4.0
3 3 MISSING 0.0 0.0
4 4 2T 10.0 4.0
5 5 CBN 15.0 20.0
6 5 DSV 25.0 35.0
to:
ID MISSING_A MISSING_B 1T_A 1T_B 2T_A 2T_B CBN_A CBN_B DSV_A DSV_B
0 1 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 2 NaN NaN 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
2 3 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
3 4 NaN NaN NaN NaN 10.0 4.0 NaN NaN NaN NaN
4 5 NaN NaN NaN NaN NaN NaN 15.0 20.0 25.0 35.0
For IDs with multiple types, the multiple rows for A and B should merge into one row, as shown above.
You are looking for a pivot, which will leave you with a MultiIndex on the columns. You'll need to join those column levels to get the suffixes you are looking for.
df = df.pivot(index='ID', columns='TYPE', values=['A', 'B'])
# flatten the (value, TYPE) MultiIndex columns into names like 'MISSING_A'
df.columns = ['_'.join(reversed(col)) for col in df.columns.values]
df = df.reset_index()
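A quick usage sketch reproducing the example above; note that the real column order groups all the A columns before the B columns, unlike the hand-written expected output:

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 2, 3, 4, 5, 5],
                   'TYPE': ['MISSING', '1T', '2T', 'MISSING', '2T', 'CBN', 'DSV'],
                   'A': [0.0, 1.0, 3.0, 0.0, 10.0, 15.0, 25.0],
                   'B': [0.0, 2.0, 4.0, 0.0, 4.0, 20.0, 35.0]})
df = df.pivot(index='ID', columns='TYPE', values=['A', 'B'])
df.columns = ['_'.join(reversed(col)) for col in df.columns.values]
print(df.reset_index())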
I have a DataFrame where I want to replace only the rows that are entirely NaN with the row below them. I tried solutions from multiple threads and used ffill, but that only filled a few cells and not the entire row.
ss s h b sb
0 NaN NaN NaN NaN NaN
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 1.0 6.0 7.0 11.0 3.0
Expected output:
ss s h b sb
0 3.0 NaN 14.0 NaN 8.0
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 1.0 6.0 7.0 11.0 3.0
4 1.0 6.0 7.0 11.0 3.0
5 1.0 6.0 7.0 11.0 3.0
You can create groups by flagging the rows that have at least one non-missing value, taking the cumulative sum in reversed row order, and then passing the result to GroupBy.bfill:
df = df.groupby((df.notna().any(axis=1)).iloc[::-1].cumsum().iloc[::-1]).bfill()
print (df)
ss s h b sb
0 3.0 NaN 14.0 NaN 8.0
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 1.0 6.0 7.0 11.0 3.0
4 1.0 6.0 7.0 11.0 3.0
5 1.0 6.0 7.0 11.0 3.0
Detail:
print ((df.notna().any(axis=1)).iloc[::-1].cumsum().iloc[::-1])
0 3
1 3
2 2
3 1
4 1
5 1
dtype: int32
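An alternative sketch that reads more directly: flag the all-NaN rows, keep only the others, and reindex back to the full index with method='bfill', which copies the next surviving row wholesale into each all-NaN slot (this requires a monotonically increasing index, as here):

mask = df.isna().all(axis=1)  # rows where every column is NaN
out = df.loc[~mask].reindex(df.index, method='bfill')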
I have the following DataFrame:
dur wage1 wage2 wage3 cola hours pension stby_pay shift_diff
6 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
8 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0
9 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
13 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
17 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
31 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
43 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN
44 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0
47 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
What I have to do is count the rows that are exactly the same, including the NaN values.
The problem is that groupby ignores NaN values: it does not take them into account when counting, which is why I am not getting a correct count of the exact repetitions among those rows.
My code is the following:
def detect_duplicates(data):
    x = DataFrame(columns=data.columns.tolist() + ["num_reps"])
    aux = data[data.duplicated(keep=False)]
    x = data[data.duplicated(keep=False)].drop_duplicates()
    # this line should count my repeated rows
    s = aux.groupby(data.columns.tolist(), as_index=False).transform('size')
    return x
If I print the "x" variable, I get this result, which shows all the repeated rows:
dur wage1 wage2 wage3 cola hours pension stby_pay shift_diff
6 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
8 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0
9 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
13 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
17 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
31 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
43 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN
44 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0
47 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
51 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
53 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN
Now I have to count those rows from my x result that are exactly the same.
This should be my correct output:
dur wage1 wage2 wage3 cola hours pension stby_pay shift_diff num_reps
6 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN 4
8 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0 2
9 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0 3
43 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN 2
Here is my problem: groupby ignores NaN values, which is why other similar posts about this problem can't help me.
Thanks
If your dataframe's name is df, you can count the number of duplicates using just one line of code:
sum(df.duplicated(keep = False))
If you want to drop duplicate rows, use the drop_duplicates method (see the documentation).
Example:
#data.csv
col1,col2,col3
a,3,NaN #duplicate
b,9,4 #duplicate
c,12,5
a,3,NaN #duplicate
b,9,4 #duplicate
d,19,20
a,3,NaN #duplicate - 5 duplicate rows
Importing data.csv and dropping duplicate rows (by default, the first instance of a duplicated row is kept):
import pandas as pd
df = pd.read_csv("data.csv")
print(df.drop_duplicates())
#Output
col1 col2 col3
0 a 3 NaN
1 b 9 4.0
2 c 12 5.0
5 d 19 20.0
To count the number of duplicate rows, use the dataframe's duplicated method with "keep" set to False (see the documentation). As mentioned above, you can simply do this using sum(df.duplicated(keep = False)). Here's a more verbose way to do it that demonstrates what the duplicated method does:
duplicate_rows = df.duplicated(keep = False)
print(duplicate_rows)
#count the number of duplicates, i.e. the number of 'True' values in
#the duplicate_rows boolean series
number_of_duplicates = sum(duplicate_rows)
print("Number of duplicate rows:")
print(number_of_duplicates)
#Output
#The duplicate_rows pandas series from df.duplicated(keep = False)
0 True
1 True
2 False
3 True
4 True
5 False
6 True
dtype: bool
#The number of rows from sum(df.duplicated(keep = False))
Number of duplicate rows:
5
I just solved it.
The problem, as I said, was that groupby does not accept NaN values.
So what I did to solve it was change all the NaN values with the fillna(0) function, turning every NaN into 0, so the comparison now works properly.
Here is my new function, working properly:
def detect_duplicates(data):
    x = DataFrame(columns=data.columns.tolist() + ["num_reps"])
    aux = data[data.duplicated(keep=False)]
    x = data[data.duplicated(keep=False)].drop_duplicates()
    # fill NaN with 0 so groupby keeps those rows, then count each group
    s = aux.fillna(0).groupby(data.columns.tolist()).size().reset_index().rename(columns={0: 'count'})
    x['num_reps'] = s['count'].tolist()[::-1]
    return x
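For completeness, recent pandas (1.1+) lets groupby keep NaN keys directly, which avoids both the fillna(0) workaround and the fragile reversed-order matching of the counts. A sketch:

num_reps = (data.groupby(data.columns.tolist(), dropna=False)
                .size()
                .reset_index(name='num_reps'))
# keep only the rows that occur more than once
num_reps = num_reps[num_reps['num_reps'] > 1]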
Here is my DataFrame:
0 1 2
0 0 0.0 20.0 NaN
1 1.0 21.0 NaN
2 2.0 22.0 NaN
ID NaN NaN 11111.0
Year NaN NaN 2011.0
1 0 3.0 23.0 NaN
1 4.0 24.0 NaN
2 5.0 25.0 NaN
3 6.0 26.0 NaN
ID NaN NaN 11111.0
Year NaN NaN 2012.0
I want to convert the 'ID' and 'Year' rows into the dataframe index, with 'ID' at level=0 and 'Year' at level=1. I tried using stack() but still cannot figure it out.
Edit: my desired output should look like the following:
0 1
11111 2011 0 0.0 20.0
1 1.0 21.0
2 2.0 22.0
2012 0 3.0 23.0
1 4.0 24.0
2 5.0 25.0
3 6.0 26.0
This should work:
# pull the ID/Year rows out of column '2'
df1 = df.loc[pd.IndexSlice[:, ['ID', 'Year']], '2']
# one row per outer group, with 'ID' and 'Year' as columns
dfs = df1.unstack()
dfi = df1.index
# drop the ID/Year rows and column '2', leaving only the data
dfn = df.drop(dfi).drop('2', axis=1).unstack()
# swap in the (ID, Year) index, then stack the inner row numbers back;
# stack drops the all-NaN rows created by the groups' unequal lengths
result = dfn.set_index([dfs.ID, dfs.Year]).stack()