Re-size the dataframe - python

I have a dataframe that is 5252 rows x 3 columns.
The data looks something like this:
X Y Z
1 1 2
1 2 4
1 3 3.5
2 13 4
1 4 3
2 14 3.5
3 14 2
3 15 1
4 16 .5
4 18 2
. . .
. . .
. . .
1508 751 1
1508 669 1
1508 686 2.5
I want to convert it so that X (the user id) becomes the rows and Y (the item id) becomes the columns, with Z as the value corresponding to each (X, Y) pair. Something like this:
1 2 3 4 5 6 13 14 15 16 17 18 669 686
1 2 4 3.5 3 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 4 3.5 0 0 0 0 0 0
3 0 0 0 0 0 0 0 2 1 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 .5 0 2 0 0
.
.
.
1508 0 0 0 0 0 0 0 0 0 0 0 0 1 2.5

I assume you're using the pandas library.
You need the pd.pivot_table function. If the dataframe is called df, then you need:
pd.pivot_table(data=df, index="X", columns="Y", values="Z", aggfunc="sum")

You need pd.pivot_table() combined with fillna(0). Recreating your sample dataframe:
import pandas as pd
df = pd.DataFrame({'X': [1,1,1,1,2,2,3,3,4], 'Y': [1,2,3,4,13,14,14,15,16], 'Z': [2,4,3.5,3,4,3.5,2,1,.5]})
Gives:
X Y Z
0 1 1 2.0
1 1 2 4.0
2 1 3 3.5
3 1 4 3.0
4 2 13 4.0
5 2 14 3.5
6 3 14 2.0
7 3 15 1.0
8 4 16 0.5
Then using pd.pivot_table():
pd.pivot_table(df, values='Z', index=['X'], columns=['Y']).fillna(0)
Yields:
Y 1 2 3 4 13 14 15 16
X
1 2.0 4.0 3.5 3.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 4.0 3.5 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 2.0 1.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5
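
If every (X, Y) pair occurs at most once, so no aggregation is needed, a plain reshape is an alternative; a minimal sketch:
# unstack raises on duplicate (X, Y) pairs; fall back to pivot_table then
df.set_index(['X', 'Y'])['Z'].unstack(fill_value=0)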

How to find the cumulative count between value changes?

I want to find the cumulative count before there is a change in value, i.e. how many rows since the last change. For illustration:
Value  diff  #row since last change (how do I create this column?)
6      na    na
5      -1    0
5      0     1
5      0     2
4      -1    0
4      0     1
4      0     2
4      0     3
4      0     4
5      1     0
5      0     1
5      0     2
5      0     3
6      1     0
7      1     0
I tried to use cumsum, but it does not reset after each change.
IIUC, use a cumcount per group:
df['new'] = df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
Output:
Value diff new
0 6 na 0
1 5 -1 0
2 5 0 1
3 5 0 2
4 4 -1 0
5 4 0 1
6 4 0 2
7 4 0 3
8 4 0 4
9 5 1 0
10 5 0 1
11 5 0 2
12 5 0 3
13 6 1 0
14 7 1 0
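To see why this resets, note that the grouper passed to groupby is a group id that increments at every value change; a quick sketch of the intermediate step:
# Group id increments whenever Value differs from the previous row:
group_id = df['Value'].ne(df['Value'].shift()).cumsum()
# Value:    6  5  5  5  4  4  4  4  4  5  5  5  5  6  7
# group_id: 1  2  2  2  3  3  3  3  3  4  4  4  4  5  6
# cumcount then numbers the rows within each run from 0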
If you want NaN where diff is NaN, you can mask the output:
df['new'] = (df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
.mask(df['diff'].isna())
)
Output:
Value diff new
0 6 NaN NaN
1 5 -1.0 0.0
2 5 0.0 1.0
3 5 0.0 2.0
4 4 -1.0 0.0
5 4 0.0 1.0
6 4 0.0 2.0
7 4 0.0 3.0
8 4 0.0 4.0
9 5 1.0 0.0
10 5 0.0 1.0
11 5 0.0 2.0
12 5 0.0 3.0
13 6 1.0 0.0
14 7 1.0 0.0
If performance is important, count consecutive 0 values from the difference column:
m = df['diff'].eq(0)  # True while the value is unchanged
b = m.cumsum()        # running count of unchanged rows
# b.mask(m) keeps the running count only at change rows; ffill carries it
# forward, so the subtraction restarts from 0 after every change
df['out'] = b.sub(b.mask(m).ffill().fillna(0)).astype(int)
print(df)
Value diff need out
0 6 NaN na 0
1 5 -1.0 0 0
2 5 0.0 1 1
3 5 0.0 2 2
4 4 -1.0 0 0
5 4 0.0 1 1
6 4 0.0 2 2
7 4 0.0 3 3
8 4 0.0 4 4
9 5 1.0 0 0
10 5 0.0 1 1
11 5 0.0 2 2
12 5 0.0 3 3
13 6 1.0 0 0
14 7 1.0 0 0

Sum values of columns that start with the same text string

I want to take the row-wise sum of values in columns that start with the same text string. Underneath is my original df with fails on courses.
Original df:
ID P_English_2 P_English_3 P_German_1 P_Math_1 P_Math_3 P_Physics_2 P_Physics_4
56 1 3 1 2 0 0 3
11 0 0 0 1 4 1 0
6 0 0 0 0 0 1 0
43 1 2 1 0 0 1 1
14 0 1 0 0 1 0 0
Desired df:
ID P_English P_German P_Math P_Physics
56 4 1 2 3
11 0 0 5 1
6 0 0 0 1
43 3 1 0 2
14 1 0 1 0
Tried code:
import pandas as pd

df = pd.DataFrame({"ID": [56, 11, 6, 43, 14],
                   "P_Math_1": [2, 1, 0, 0, 0],
                   "P_English_3": [3, 0, 0, 2, 1],
                   "P_English_2": [1, 0, 0, 1, 0],
                   "P_Math_3": [0, 4, 0, 0, 1],
                   "P_Physics_2": [0, 1, 1, 1, 0],
                   "P_Physics_4": [3, 0, 0, 1, 0],
                   "P_German_1": [1, 0, 0, 1, 0]})
print(df)

categories = ['P_Math', 'P_English', 'P_Physics', 'P_German']

def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.startswith(cat)]

result = df.groupby(correct_categories(df.columns), axis=1).sum()
print(result)
Let's try groupby with axis=1:
# extract the subjects
subjects = [x[0] for x in df.columns.str.rsplit('_',n=1)]
df.groupby(subjects, axis=1).sum()
Output:
ID P_English P_German P_Math P_Physics
0 56 4 1 2 3
1 11 0 0 5 1
2 6 0 0 0 1
3 43 3 1 0 2
4 14 1 0 1 0
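Note that groupby(..., axis=1) is deprecated in pandas 2.x; a sketch of the same aggregation via a transpose, assuming all columns are numeric:
# Group the transposed frame by subject, sum, then transpose back.
subjects = [x[0] for x in df.columns.str.rsplit('_', n=1)]
result = df.T.groupby(subjects).sum().T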
Or you can use wide_to_long, assuming ID values are unique:
(pd.wide_to_long(df, stubnames=categories,
i=['ID'], j='count', sep='_')
.groupby('ID').sum()
)
Output:
P_Math P_English P_Physics P_German
ID
56 2.0 4.0 3.0 1.0
11 5.0 0.0 1.0 0.0
6 0.0 0.0 1.0 0.0
43 0.0 3.0 2.0 1.0
14 1.0 1.0 0.0 0.0
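The float dtype comes from the NaNs that wide_to_long inserts for subject/number combinations that don't exist; since the sums are whole numbers here, the result can be cast back if integers are wanted:
(pd.wide_to_long(df, stubnames=categories,
                 i=['ID'], j='count', sep='_')
   .groupby('ID').sum()
   .astype(int))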

How to conditionally merge across two columns

I have a dataframe like this:
df_ex_A = pd.DataFrame({'X':['r','r','t','t','v','w'],
'A':[3,4,1,2,1,1],
'A_val':[25,25,100,20,10,90]})
Out[115]:
X A A_val
0 r 3 25
1 r 4 25
2 t 1 100
3 t 2 20
4 v 1 10
5 w 1 90
and another df like this:
df_ex_B = pd.DataFrame({ 'X':['r','r','t','t','v','w'],
'B':[4,5,2,3,2,2],
'B_val':[75,65,30,0,0,0]})
Out[117]:
X B B_val
0 r 4 75
1 r 5 65
2 t 2 30
3 t 3 0
4 v 2 0
5 w 2 0
I want to create a df by merging on equal values of A and B, like this:
X (A==B) A_val B_val
0 r 3 25 0
1 r 4 25 75
2 r 5 0 65
3 t 1 100 0
4 t 2 20 30
5 t 3 0 0
6 v 1 10 0
7 v 2 0 0
8 w 1 90 0
9 w 2 0 0
How do I execute a merge to get this df?
Thanks
Let's try using set_index and pd.concat:
dfA = df_ex_A.set_index(['X','A']).rename_axis(['X','A==B'])
dfB = df_ex_B.set_index(['X','B']).rename_axis(['X','A==B'])
pd.concat([dfA,dfB], axis=1).fillna(0).reset_index()
Output:
X A==B A_val B_val
0 r 3 25.0 0.0
1 r 4 25.0 75.0
2 r 5 0.0 65.0
3 t 1 100.0 0.0
4 t 2 20.0 30.0
5 t 3 0.0 0.0
6 v 1 10.0 0.0
7 v 2 0.0 0.0
8 w 1 90.0 0.0
9 w 2 0.0 0.0
Or you can use join after setting indexes and renaming axis:
dfA.join(dfB, how='outer').fillna(0).reset_index()
Output:
X A==B A_val B_val
0 r 3 25.0 0.0
1 r 4 25.0 75.0
2 r 5 0.0 65.0
3 t 1 100.0 0.0
4 t 2 20.0 30.0
5 t 3 0.0 0.0
6 v 1 10.0 0.0
7 v 2 0.0 0.0
8 w 1 90.0 0.0
9 w 2 0.0 0.0
I think what you want is an outer join, which can be done with merge by specifying how='outer':
df_ex_A.merge(df_ex_B.rename(columns={'B':'A'}), how='outer').fillna(0).rename(columns={'A':'A==B'})
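Note that the row order of an outer merge is not guaranteed to match the ordering in the question; to make it explicit, a sort can be appended:
(df_ex_A.merge(df_ex_B.rename(columns={'B': 'A'}), how='outer')
        .fillna(0)
        .rename(columns={'A': 'A==B'})
        .sort_values(['X', 'A==B'])
        .reset_index(drop=True))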

Pandas/Numpy group value changes and derivative value changes above/below 0

I have a series of values (Pandas DF or Numpy Arr):
vals = [0,1,3,4,5,5,4,2,1,0,-1,-2,-3,-2,3,5,8,4,2,0,-1,-3,-8,-20,-10,-5,-2,-1,0,1,2,3,5,6,8,4,3]
df = pd.DataFrame({'val': vals})
I want to classify/group the values into 4 categories:
Increasing above 0
Decreasing above 0
Increasing below 0
Decreasing below 0
My current approach with pandas is to categorize values into above/below 0, and then into increasing/decreasing by checking whether the diff values are above/below 0.
df['above_zero'] = np.where(df['val'] >= 0, 1, 0)
df['below_zero'] = np.where(df['val'] < 0, 1, 0)
df['diffs'] = df['val'].diff()
df['diff_above_zero'] = np.where(df['diffs'] >= 0, 1, 0)
df['diff_below_zero'] = np.where(df['diffs'] < 0, 1, 0)
This produces the desired output, but now I am trying to find a way to assign an ascending group number that increments as soon as one of the 4 conditions changes.
Desired output would look like this (the group column is typed manually and may contain errors):
id val above_zero below_zero diffs diff_above_zero diff_below_zero group
0 0 1 0 0.0 1 0 0
1 1 1 0 1.0 1 0 0
2 3 1 0 2.0 1 0 0
3 4 1 0 1.0 1 0 0
4 5 1 0 1.0 1 0 0
5 5 1 0 0.0 1 0 0
6 4 1 0 -1.0 0 1 1
7 2 1 0 -2.0 0 1 1
8 1 1 0 -1.0 0 1 1
9 0 1 0 -1.0 0 1 1
10 -1 0 1 -1.0 0 1 2
11 -2 0 1 -1.0 0 1 2
12 -3 0 1 -1.0 0 1 2
13 -2 0 1 1.0 1 0 3
14 3 1 0 5.0 1 0 4
15 5 1 0 2.0 1 0 4
16 8 1 0 3.0 1 0 4
17 4 1 0 -4.0 0 1 5
18 2 1 0 -2.0 0 1 5
19 0 1 0 -2.0 0 1 5
20 -1 0 1 -1.0 0 1 6
21 -3 0 1 -2.0 0 1 6
22 -8 0 1 -5.0 0 1 6
23 -20 0 1 -12.0 0 1 6
24 -10 0 1 10.0 1 0 7
25 -5 0 1 5.0 1 0 7
26 -2 0 1 3.0 1 0 7
27 -1 0 1 1.0 1 0 7
28 0 1 0 1.0 1 0 8
29 1 1 0 1.0 1 0 8
30 2 1 0 1.0 1 0 8
31 3 1 0 1.0 1 0 8
32 5 1 0 2.0 1 0 8
33 6 1 0 1.0 1 0 8
34 8 1 0 2.0 1 0 8
35 4 1 0 -4.0 0 1 9
36 3 1 0 -1.0 0 1 9
Would appreciate any help on how to solve this efficiently. Thanks!
Setup
g1 = ['above_zero', 'below_zero', 'diff_above_zero', 'diff_below_zero']
You can simply index all of your boolean columns, and use shift:
c = df.loc[:, g1]
(c != c.shift().fillna(c)).any(axis=1).cumsum()
0 0
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 2
11 2
12 2
13 3
14 4
15 4
16 4
17 5
18 5
19 5
20 6
21 6
22 6
23 6
24 7
25 7
26 7
27 7
28 8
29 8
30 8
31 8
32 8
33 8
34 8
35 9
36 9
dtype: int32
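To attach this as the group column from the question, the same expression can be assigned back:
df['group'] = (c != c.shift().fillna(c)).any(axis=1).cumsum()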
The following code will produce two columns: c1 and c2.
The values of c1 correspond to the following 4 categories:
0 means below zero and increasing
1 means above zero and increasing
2 means below zero and decreasing
3 means above zero and decreasing
And c2 corresponds to an ascending group number that increments as soon as the condition (i.e. c1) changes, as you wanted. Credit to user3483203 for the shift-with-cumsum idea.
# calculate difference
df["diff"] = df['val'].diff()
# set first value in column 'diff' to 0 (as previous step sets it to NaN)
df.loc[0, 'diff'] = 0
df["c1"] = (df['val'] >= 0).astype(int) + (df["diff"] < 0).astype(int) * 2
df["c2"] = (df["c1"] != df["c1"].shift().fillna(df["c1"])).astype(int).cumsum()
Result:
val diff c1 c2
0 0 0.0 1 0
1 1 1.0 1 0
2 3 2.0 1 0
3 4 1.0 1 0
4 5 1.0 1 0
5 5 0.0 1 0
6 4 -1.0 3 1
7 2 -2.0 3 1
8 1 -1.0 3 1
9 0 -1.0 3 1
10 -1 -1.0 2 2
11 -2 -1.0 2 2
12 -3 -1.0 2 2
13 -2 1.0 0 3
14 3 5.0 1 4
15 5 2.0 1 4
16 8 3.0 1 4
17 4 -4.0 3 5
18 2 -2.0 3 5
19 0 -2.0 3 5
20 -1 -1.0 2 6
21 -3 -2.0 2 6
22 -8 -5.0 2 6
23 -20 -12.0 2 6
24 -10 10.0 0 7
25 -5 5.0 0 7
26 -2 3.0 0 7
27 -1 1.0 0 7
28 0 1.0 1 8
29 1 1.0 1 8
30 2 1.0 1 8
31 3 1.0 1 8
32 5 2.0 1 8
33 6 1.0 1 8
34 8 2.0 1 8
35 4 -4.0 3 9
36 3 -1.0 3 9
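If readable labels are preferred over the integer codes in c1, the scheme above can be applied with map; the label strings here are illustrative:
# Hypothetical labels following the c1 coding scheme above
labels = {0: 'increasing below 0', 1: 'increasing above 0',
          2: 'decreasing below 0', 3: 'decreasing above 0'}
df['category'] = df['c1'].map(labels)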

Pandas creates a new column each time I save a dataframe as csv in Python

I am baffled by the following behavior of pandas: a new column is added each time I save a dataframe as a CSV file.
For a reproducible example:
print(df)
Medical_Keyword_17 Product_Info_2_A5 Medical_History_27 Family_Hist_2
1 0 0 3 0.0
2 0 0 3 0.0
3 0 0 3 0.0
4 0 0 3 0.0
5 0 0 3 0.0
6 0 0 3 0.0
7 0 1 3 NaN
8 0 0 3 0.0
9 0 0 3 0.0
df.to_csv('toy_data.csv')
df1 = pd.read_csv('toy_data.csv')
print(df1)
Unnamed: 0 Medical_Keyword_17 Product_Info_2_A5 Medical_History_27 \
0 1 0 0 3
1 2 0 0 3
2 3 0 0 3
3 4 0 0 3
4 5 0 0 3
5 6 0 0 3
6 7 0 1 3
7 8 0 0 3
8 9 0 0 3
Family_Hist_2
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 NaN
7 0.0
8 0.0
How can I understand this behavior and avoid it?
The first column is the index.
To avoid writing it to the file, use index=False:
df.to_csv('toy_data.csv', index=False)
df1 = pd.read_csv('toy_data.csv')
Or use the index_col parameter in read_csv to read that column back in as the index:
df.to_csv('toy_data.csv')
df1 = pd.read_csv('toy_data.csv', index_col=[0])
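As a quick round-trip check (a sketch, reusing the file name from the question), the index=False variant reads back with exactly the original columns:
df.to_csv('toy_data.csv', index=False)
df1 = pd.read_csv('toy_data.csv')
assert list(df1.columns) == list(df.columns)  # no 'Unnamed: 0' column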
