I want to convert the DataFrame below,
ID TYPE A B
0 1 MISSING 0.0 0.0
1 2 1T 1.0 2.0
2 2 2T 3.0 4.0
3 3 MISSING 0.0 0.0
4 4 2T 10.0 4.0
5 5 CBN 15.0 20.0
6 5 DSV 25.0 35.0
to:
ID MISSING_A MISSING_B 1T_A 1T_B 2T_A 2T_B CBN_A CBN_B DSV_A DSV_B
0 1 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 2 NaN NaN 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
3 3 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
4 4 NaN NaN NaN NaN 10.0 4.0 NaN NaN NaN NaN
5 5 NaN NaN NaN NaN NaN NaN 15.0 20.0 25.0 35.0
For IDs with multiple types, the multiple rows for A and B should be merged into a single row, as shown above.
You are looking for a pivot, which will end up giving you a multi-index. You'll need to join those columns to get the suffix you are looking for.
df = df.pivot(index='ID', columns='TYPE', values=['A', 'B'])
df.columns = ['_'.join(reversed(col)) for col in df.columns]
df = df.reset_index()
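For reference, a minimal self-contained sketch of that pivot-and-flatten approach on the sample data above (nothing here beyond what the answer already uses):

import pandas as pd

df = pd.DataFrame({
    'ID':   [1, 2, 2, 3, 4, 5, 5],
    'TYPE': ['MISSING', '1T', '2T', 'MISSING', '2T', 'CBN', 'DSV'],
    'A':    [0.0, 1.0, 3.0, 0.0, 10.0, 15.0, 25.0],
    'B':    [0.0, 2.0, 4.0, 0.0, 4.0, 20.0, 35.0],
})

# One row per ID; the columns become a (value, TYPE) MultiIndex.
wide = df.pivot(index='ID', columns='TYPE', values=['A', 'B'])

# Flatten ('A', '1T') -> '1T_A', ('B', 'CBN') -> 'CBN_B', and so on.
wide.columns = ['_'.join(reversed(col)) for col in wide.columns]
wide = wide.reset_index()
print(wide)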
I have a DataFrame where I want to replace only the rows that are entirely NaN with the values of the row below them. I tried solutions from multiple threads and used ffill, but that resulted in filling a few cells rather than the entire row.
ss s h b sb
0 NaN NaN NaN NaN NaN
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 1.0 6.0 7.0 11.0 3.0
Expected output:
ss s h b sb
0 3.0 NaN 14.0 NaN 8.0
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 1.0 6.0 7.0 11.0 3.0
4 1.0 6.0 7.0 11.0 3.0
5 1.0 6.0 7.0 11.0 3.0
You can create groups by testing which rows have any non-missing value, taking the cumulative sum in reversed row order (so each all-NaN row joins the group of the first non-empty row below it), and passing the result to GroupBy.bfill:
df = df.groupby((df.notna().any(axis=1)).iloc[::-1].cumsum().iloc[::-1]).bfill()
print (df)
ss s h b sb
0 3.0 NaN 14.0 NaN 8.0
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 1.0 6.0 7.0 11.0 3.0
4 1.0 6.0 7.0 11.0 3.0
5 1.0 6.0 7.0 11.0 3.0
Detail:
print ((df.notna().any(axis=1)).iloc[::-1].cumsum().iloc[::-1])
0 3
1 3
2 2
3 1
4 1
5 1
dtype: int32
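As a self-contained run of that one-liner on the question's data (np.nan assumed for the blanks), with the grouping step written out separately:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ss': [np.nan, 3.0, 9.0, np.nan, np.nan, 1.0],
    's':  [np.nan, np.nan, 8.0, np.nan, np.nan, 6.0],
    'h':  [np.nan, 14.0, 23.0, np.nan, np.nan, 7.0],
    'b':  [np.nan, np.nan, np.nan, np.nan, np.nan, 11.0],
    'sb': [np.nan, 8.0, 2.0, np.nan, np.nan, 3.0],
})

# True for rows with at least one value; the reversed cumulative sum assigns
# every all-NaN row to the same group as the first non-empty row below it.
groups = df.notna().any(axis=1).iloc[::-1].cumsum().iloc[::-1]

print(df.groupby(groups).bfill())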
I want to add/append data to a specific pandas DataFrame column without it causing NaN values in the remaining columns. For example:
DataFrame = pd.DataFrame(columns=["column1", "column2", "column3"])
for i in range(3):
    DataFrame = DataFrame.append({"column1": int(i)}, ignore_index=True)
    DataFrame = DataFrame.append({"column2": float(i * 2)}, ignore_index=True)
    DataFrame = DataFrame.append({"column3": int(i * 5)}, ignore_index=True)
print(DataFrame)
This will return:
column1 column2 column3
0 0.0 NaN NaN
1 NaN 0.0 NaN
2 NaN NaN 0.0
3 1.0 NaN NaN
4 NaN 2.0 NaN
5 NaN NaN 5.0
6 2.0 NaN NaN
7 NaN 4.0 NaN
8 NaN NaN 10.0
What we want returned:
column1 column2 column3
0 0.0 0.0 0.0
1 1.0 2.0 5.0
2 2.0 4.0 10.0
I know I can, in this case, use one .append for all the columns at once, but I have cases where the data to be appended varies based on multiple conditions. So I'd like to know whether it's possible to append to single columns of a DataFrame without producing NaN values in the remaining columns, so that I can avoid writing hundreds of if/else statements.
Alternatively, I'd welcome any good idea for how to 'collapse' the NaN values: removing a NaN without removing the entire row, so that if there is a NaN at index 0 in column 3 and an integer 5 at index 1 in the same column, the 5 gets moved up to index 0.
Happy to hear any ideas.
IIUC for your current example you can try this:
DataFrame[['column2','column3']]=DataFrame[['column2','column3']].bfill()
Output:
column1 column2 column3
0 0.0 0.0 0.0
1 NaN 0.0 0.0
2 NaN 2.0 0.0
3 1.0 2.0 5.0
4 NaN 2.0 5.0
5 NaN 4.0 5.0
6 2.0 4.0 10.0
7 NaN 4.0 10.0
8 NaN 6.0 10.0
9 3.0 6.0 15.0
10 NaN 6.0 15.0
11 NaN 8.0 15.0
12 4.0 8.0 20.0
13 NaN 8.0 20.0
14 NaN NaN 20.0
Then drop the rows that still contain NaN:
DataFrame.dropna(inplace=True)
Output:
column1 column2 column3
0 0.0 0.0 0.0
3 1.0 2.0 5.0
6 2.0 4.0 10.0
9 3.0 6.0 15.0
12 4.0 8.0 20.0
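For reference, a self-contained sketch of the same bfill-then-dropna idea. It builds the frame with pd.concat (DataFrame.append has since been deprecated and removed in pandas 2.0) and assumes range(3), matching the question's output:

import pandas as pd

frames = []
for i in range(3):
    # Each append in the question becomes a one-row frame with a single column filled.
    frames.append(pd.DataFrame({"column1": [int(i)]}))
    frames.append(pd.DataFrame({"column2": [float(i * 2)]}))
    frames.append(pd.DataFrame({"column3": [int(i * 5)]}))

df = pd.concat(frames, ignore_index=True)

# Pull the later column2/column3 values up into the column1 rows, then drop the leftovers.
df[["column2", "column3"]] = df[["column2", "column3"]].bfill()
df = df.dropna().reset_index(drop=True)
print(df)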
I have a pandas dataframe that summarises sales by calendar month & outputs something like:
Month level_0 UNIQUE_ID 102018 112018 12018 122017 122018 22018 32018 42018 52018 62018 72018 82018 92018
0 SOLD_QUANTITY 01 3692.0 5182.0 3223.0 1292.0 2466.0 2396.0 2242.0 2217.0 3590.0 2593.0 1665.0 3371.0 3069.0
1 SOLD_QUANTITY 011 3.0 6.0 NaN NaN 7.0 5.0 2.0 1.0 5.0 NaN 1.0 1.0 3.0
2 SOLD_QUANTITY 02 370.0 130.0 NaN NaN 200.0 NaN NaN 269.0 202.0 NaN 201.0 125.0 360.0
3 SOLD_QUANTITY 03 2.0 6.0 NaN NaN 2.0 1.0 NaN 6.0 11.0 9.0 2.0 3.0 5.0
4 SOLD_QUANTITY 08 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 175.0 NaN NaN
I want to be able to programmatically re-arrange the column headers in ascending date order (e.g. starting 122017, 12018, 22018...). It has to be programmatic because the list of months is different every time the report runs; it runs every month over the last 365 days.
The columns Index looks like this:
Index(['level_0', 'UNIQUE_ID', '102018', '112018', '12018', '122017', '122018',
'22018', '32018', '42018', '52018', '62018', '72018', '82018', '92018'],
dtype='object', name='Month')
Use set_index so that only the date columns remain, convert those columns to datetimes and get their order positions with argsort, then change the ordering with iloc:
df = df.set_index(['level_0','UNIQUE_ID'])
df = df.iloc[:, pd.to_datetime(df.columns, format='%m%Y').argsort()].reset_index()
print (df)
level_0 UNIQUE_ID 122017 12018 22018 32018 42018 52018 \
0 SOLD_QUANTITY 1 1292.0 3223.0 2396.0 2242.0 2217.0 3590.0
1 SOLD_QUANTITY 11 NaN NaN 5.0 2.0 1.0 5.0
2 SOLD_QUANTITY 2 NaN NaN NaN NaN 269.0 202.0
3 SOLD_QUANTITY 3 NaN NaN 1.0 NaN 6.0 11.0
4 SOLD_QUANTITY 8 NaN NaN NaN NaN NaN NaN
62018 72018 82018 92018 102018 112018 122018
0 2593.0 1665.0 3371.0 3069.0 3692.0 5182.0 2466.0
1 NaN 1.0 1.0 3.0 3.0 6.0 7.0
2 NaN 201.0 125.0 360.0 370.0 130.0 200.0
3 9.0 2.0 3.0 5.0 2.0 6.0 2.0
4 NaN 175.0 NaN NaN NaN NaN NaN
Another idea is to create a monthly PeriodIndex with DatetimeIndex.to_period, which makes it possible to use sort_index:
df = df.set_index(['level_0','UNIQUE_ID'])
df.columns = pd.to_datetime(df.columns, format='%m%Y').to_period('m')
#alternative for convert to datetimes
#df.columns = pd.to_datetime(df.columns, format='%m%Y')
df = df.sort_index(axis=1).reset_index()
print (df)
level_0 UNIQUE_ID 2017-12 2018-01 2018-02 2018-03 2018-04 \
0 SOLD_QUANTITY 1 1292.0 3223.0 2396.0 2242.0 2217.0
1 SOLD_QUANTITY 11 NaN NaN 5.0 2.0 1.0
2 SOLD_QUANTITY 2 NaN NaN NaN NaN 269.0
3 SOLD_QUANTITY 3 NaN NaN 1.0 NaN 6.0
4 SOLD_QUANTITY 8 NaN NaN NaN NaN NaN
2018-05 2018-06 2018-07 2018-08 2018-09 2018-10 2018-11 2018-12
0 3590.0 2593.0 1665.0 3371.0 3069.0 3692.0 5182.0 2466.0
1 5.0 NaN 1.0 1.0 3.0 3.0 6.0 7.0
2 202.0 NaN 201.0 125.0 360.0 370.0 130.0 200.0
3 11.0 9.0 2.0 3.0 5.0 2.0 6.0 2.0
4 NaN NaN 175.0 NaN NaN NaN NaN NaN
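A small self-contained sketch of the argsort trick on a hand-made column list, just to show the reordering in isolation:

import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=['102018', '12018', '122017', '22018'],
)

# argsort of the parsed dates gives the positions that put the columns in date order.
order = pd.to_datetime(df.columns, format='%m%Y').argsort()
df = df.iloc[:, order]
print(df.columns.tolist())   # ['122017', '12018', '22018', '102018']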
I am trying to convert a DataFrame from long to wide, but I'm not sure how to convert it to the format below. What am I missing?
import pandas as pd

d = {'vote': [100, 50, 1, 23, 55, 67, 89, 44],
     'vote2': [10, 2, 18, 26, 77, 99, 9, 40],
     'ballot1': ['a', 'b', 'a', 'a', 'b', 'a', 'a', 'b'],
     'voteId': [1, 2, 3, 4, 5, 6, 7, 8]}
df1 = pd.DataFrame(d)
#########################################################
dftemp=df1
#####FORMATTING DATA
dftemp=pd.DataFrame(dftemp.reset_index())
dflw= dftemp.set_index(['voteId','vote','ballot1'])
dflw=dflw.unstack()
dflw.columns = dflw.columns.droplevel(0).rename('')
dflw=pd.DataFrame(dflw)
print(dflw)
MY CURRENT OUTPUT:
a b a b
voteId vote
1 100 0.0 NaN 10.0 NaN
2 50 NaN 1.0 NaN 2.0
GOAL:
voteid (ballot1=a)vote (ballot1=b)vote (ballot1=a)vote2 (ballot1=b)vote2
1 100 NaN 10 NaN
2 NaN 50 NaN 2
I am starting from df1:
s=df1.set_index(['voteId','ballot1']).unstack()
s.columns=s.columns.map('(ballot1={0[1]}){0[0]}'.format)
s
Out[1120]:
(ballot1=a)vote (ballot1=b)vote (ballot1=a)vote2 (ballot1=b)vote2
voteId
1 100.0 NaN 10.0 NaN
2 NaN 50.0 NaN 2.0
3 1.0 NaN 18.0 NaN
4 23.0 NaN 26.0 NaN
5 NaN 55.0 NaN 77.0
6 67.0 NaN 99.0 NaN
7 89.0 NaN 9.0 NaN
8 NaN 44.0 NaN 40.0
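The .columns.map('(ballot1={0[1]}){0[0]}'.format) line flattens each (value, ballot1) column tuple into a single string. If that reads opaquely, an equivalent spelling with an f-string comprehension (shown on an abbreviated df1) might be:

import pandas as pd

df1 = pd.DataFrame({'vote': [100, 50], 'vote2': [10, 2],
                    'ballot1': ['a', 'b'], 'voteId': [1, 2]})

s = df1.set_index(['voteId', 'ballot1']).unstack()
# Each column is a (value, ballot1) tuple such as ('vote', 'a').
s.columns = [f'(ballot1={ballot}){value}' for value, ballot in s.columns]
print(s)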
I have the following DataFrame:
dur wage1 wage2 wage3 cola hours pension stby_pay shift_diff
6 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
8 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0
9 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
13 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
17 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
31 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
43 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN
44 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0
47 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
What I have to do is count the rows that are exactly the same, including the NaN values.
The problem is that I use groupby, but groupby ignores NaN values: rows containing NaN are dropped from the groups, which is why I am not getting a correct count of the exact repetitions between those rows.
My code is the following:
from pandas import DataFrame

def detect_duplicates(data):
    x = DataFrame(columns=data.columns.tolist() + ["num_reps"])
    aux = data[data.duplicated(keep=False)]
    x = data[data.duplicated(keep=False)].drop_duplicates()
    # This line should count my repeated rows
    s = aux.groupby(data.columns.tolist(), as_index=False).transform('size')
    return x
If I print the "x" variable, I get this result, which shows all the repeated rows:
dur wage1 wage2 wage3 cola hours pension stby_pay shift_diff
6 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
8 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0
9 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
13 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
17 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
31 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
43 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN
44 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0
47 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
51 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
53 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN
Now I have to count those rows from my x result that are exactly the same.
This should be my correct output:
dur wage1 wage2 wage3 cola hours pension stby_pay shift_diff num_reps
6 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN 4
8 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0 2
9 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0 3
43 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN 2
My problem is that groupby ignores NaN values, which is why other similar posts about this problem can't help me.
Thanks
If your dataframe's name is df, you can count the number of duplicates using just one line of code:
sum(df.duplicated(keep = False))
If you want to drop duplicate rows, use the drop_duplicates method (see the pandas documentation).
Example:
#data.csv
col1,col2,col3
a,3,NaN #duplicate
b,9,4 #duplicate
c,12,5
a,3,NaN #duplicate
b,9,4 #duplicate
d,19,20
a,3,NaN #duplicate - 5 duplicate rows
Importing data.csv and dropping duplicate rows (by default the first instance of a duplicated row is kept)
import pandas as pd
df = pd.read_csv("data.csv")
print(df.drop_duplicates())
#Output
col1 col2 col3
0 a 3 NaN
1 b 9 4.0
2 c 12 5.0
5 d 19 20.0
To count the number of duplicate rows, use the DataFrame's duplicated method with "keep" set to False. As mentioned above, you can simply do this using sum(df.duplicated(keep = False)). Here's a longer way to do it that demonstrates what the "duplicated" method returns:
duplicate_rows = df.duplicated(keep = False)
print(duplicate_rows)
#count the number of duplicates (i.e. count the number of 'True' values in
#the duplicate_rows boolean series.
number_of_duplicates = sum(duplicate_rows)
print("Number of duplicate rows:")
print(number_of_duplicates)
#Output
#The duplicate_rows pandas series from df.duplicated(keep = False)
0 True
1 True
2 False
3 True
4 True
5 False
6 True
dtype: bool
#The number of rows from sum(df.duplicated(keep = False))
Number of duplicate rows:
5
I just solved it.
The problem, as I said, was that groupby does not accept NaN values.
To solve it, I replaced all NaN values with fillna(0), which turns every NaN into 0, so the comparison now works properly.
Here is my new function working properly:
from pandas import DataFrame

def detect_duplicates(data):
    x = DataFrame(columns=data.columns.tolist() + ["num_reps"])
    aux = data[data.duplicated(keep=False)]
    x = data[data.duplicated(keep=False)].drop_duplicates()
    # Replace NaN with 0 so groupby keeps every row, then count each group.
    s = aux.fillna(0).groupby(data.columns.tolist()).size().reset_index().rename(columns={0: 'count'})
    x['num_reps'] = s['count'].tolist()[::-1]
    return x
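A side note not in the original answer: since pandas 1.1, groupby accepts dropna=False, which keeps groups whose keys contain NaN and makes the fillna(0) workaround unnecessary. A minimal sketch of counting identical rows that way (the function name is just illustrative):

import pandas as pd

def count_duplicate_rows(data):
    # dropna=False keeps NaN keys, so rows are compared exactly as they are.
    counts = (data.groupby(data.columns.tolist(), dropna=False)
                  .size()
                  .reset_index(name='num_reps'))
    # Keep only the rows that actually repeat.
    return counts[counts['num_reps'] > 1]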