Sum values of columns that start with the same text string - python

I want to take the row-wise sum of columns whose names start with the same text string. Below is my original df, which holds the number of fails per course.
Original df:
ID  P_English_2  P_English_3  P_German_1  P_Math_1  P_Math_3  P_Physics_2  P_Physics_4
56            1            3           1         2         0            0            3
11            0            0           0         1         4            1            0
 6            0            0           0         0         0            1            0
43            1            2           1         0         0            1            1
14            0            1           0         0         1            0            0
Desired df:
ID  P_English  P_German  P_Math  P_Physics
56          4         1       2          3
11          0         0       5          1
 6          0         0       0          1
43          3         1       0          2
14          1         0       1          0
Tried code:
import pandas as pd

df = pd.DataFrame({"ID": [56, 11, 6, 43, 14],
                   "P_Math_1": [2, 1, 0, 0, 0],
                   "P_English_3": [3, 0, 0, 2, 1],
                   "P_English_2": [1, 0, 0, 1, 0],
                   "P_Math_3": [0, 4, 0, 0, 1],
                   "P_Physics_2": [0, 1, 1, 1, 0],
                   "P_Physics_4": [3, 0, 0, 1, 0],
                   "P_German_1": [1, 0, 0, 1, 0]})
print(df)

categories = ['P_Math', 'P_English', 'P_Physics', 'P_German']

def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.startswith(cat)]

result = df.groupby(correct_categories(df.columns), axis=1).sum()
print(result)

As written, this raises an error (presumably "Grouper and axis must be same length"), because ID matches none of the categories, so the grouper list is one element shorter than the number of columns.

Let's try groupby with axis=1:
# extract the subjects
subjects = [x[0] for x in df.columns.str.rsplit('_',n=1)]
df.groupby(subjects, axis=1).sum()
Output:
ID P_English P_German P_Math P_Physics
0 56 4 1 2 3
1 11 0 0 5 1
2 6 0 0 0 1
3 43 3 1 0 2
4 14 1 0 1 0
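A caveat not in the original answer: groupby(..., axis=1) is deprecated in pandas 2.x. A sketch of the same idea that avoids it, grouping the transposed frame instead (assuming the df defined above):

# group the transposed frame on the subject prefix, then transpose back
out = df.set_index('ID').T.groupby(lambda c: c.rsplit('_', 1)[0]).sum().T
print(out.reset_index())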
Or you can use wide_to_long, assuming the ID values are unique:
(pd.wide_to_long(df, stubnames=categories,
                 i=['ID'], j='count', sep='_')
   .groupby('ID').sum()
)
Output:
P_Math P_English P_Physics P_German
ID
56 2.0 4.0 3.0 1.0
11 5.0 0.0 1.0 0.0
6 0.0 0.0 1.0 0.0
43 0.0 3.0 2.0 1.0
14 1.0 1.0 0.0 0.0
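The float dtype comes from wide_to_long inserting NaN where a stub/count combination is missing (there is no P_Math_2 column, for example). If you want integers back, a small optional tidy-up, assuming the expression above is assigned to a name such as out:

out = out.astype(int)  # safe: the groupby-sum has already dropped the NaNs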

Fill column with nan if sum of multiple columns is 0

Task
I have a df where I compute some ratios, grouped by date and id. I want to fill column c with NaN if the sum of a and b is 0. Any help would be awesome!
df
date id a b c
0 2001-09-06 1 3 1 1
1 2001-09-07 1 3 1 1
2 2001-09-08 1 4 0 1
3 2001-09-09 2 6 0 1
4 2001-09-10 2 0 0 2
5 2001-09-11 1 0 0 2
6 2001-09-12 2 1 1 2
7 2001-09-13 2 0 0 2
8 2001-09-14 1 0 0 2
Try this:
df['new_c'] = df.c.where(df[['a','b']].sum(1).ne(0))
Out[75]:
date id a b c new_c
0 2001-09-06 1 3 1 1 1.0
1 2001-09-07 1 3 1 1 1.0
2 2001-09-08 1 4 0 1 1.0
3 2001-09-09 2 6 0 1 1.0
4 2001-09-10 2 0 0 2 NaN
5 2001-09-11 1 0 0 2 NaN
6 2001-09-12 2 1 1 2 2.0
7 2001-09-13 2 0 0 2 NaN
8 2001-09-14 1 0 0 2 NaN
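For reference, a sketch of the same operation spelled with NumPy instead of Series.where (an equivalent alternative, not the answerer's code):

import numpy as np

# NaN where a + b == 0, the original c elsewhere
df['new_c'] = np.where(df[['a', 'b']].sum(axis=1).eq(0), np.nan, df['c'])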
It is better to build a new dataframe with the same shape, and then do the following:

import numpy as np

new_df = df.copy()
for i, line in df.iterrows():
    if line['a'] + line['b'] == 0:
        new_df.loc[i, 'c'] = np.nan

Percentage of events before and after a sequence of zeros in pandas rows

I have a dataframe like the following:
ID 0 1 2 3 4 5 6 7 8 ... 81 82 83 84 85 86 87 88 89 90 total
-----------------------------------------------------------------------------------------------------
0 A 2 21 0 18 3 0 0 0 2 ... 0 0 0 0 0 0 0 0 0 0 156
1 B 0 20 12 2 0 8 14 23 0 ... 0 0 0 0 0 0 0 0 0 0 231
2 C 0 38 19 3 1 3 3 7 1 ... 0 0 0 0 0 0 0 0 0 0 78
3 D 3 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 5
and I want to know the % of events (the numbers in the cells) before and after the first sequence of zeros of length n appears in each row. This problem started as another question found here: Length of first sequence of zeros of given size after certain column in pandas dataframe, and I am trying to modify the code to do what I need, but I keep getting errors and can't seem to find the right way. This is what I have tried:
import numpy as np

def func(row, n):
    """Returns the number of events before the
    first sequence of 0s of length n is found
    """
    idx = np.arange(0, 91)
    a = row[idx]
    b = (a != 0).cumsum()
    c = b[a == 0]
    d = c.groupby(c).count()
    # in case there is no sequence of 0s with length n
    try:
        e = c[c >= d.index[d >= n][0]]
        f = str(e.index[0])
    except IndexError:
        e = [90]
        f = str(e[0])
    idx_sliced = np.arange(0, int(f) + 1)
    a = row[idx_sliced]
    if int(f) + n > 90:
        perc_before = 100
    else:
        perc_before = a.cumsum().tail(1).values[0] / row['total']
    return perc_before
As is, the error I get is:
---> perc_before = a.cumsum().tail(1).values[0]/row['total']
TypeError: ('must be str, not int', 'occurred at index 0')
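The likely cause of this TypeError (my reading; the question doesn't state it): the frame's column labels are the strings '0' through '90', while np.arange(0, 91) produces integers, so row[idx] mixes label types. Converting the positions to strings, as the first answer below does, sidesteps it:

idx = np.arange(0, 91).astype(str)  # match the string column labels
a = row[idx]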
Finally, I would apply this function to a dataframe and return a new column with the % of events before the first sequence of n 0s in each row, like this:
ID 0 1 2 3 4 5 6 7 8 ... 81 82 83 84 85 86 87 88 89 90 total %_before
---------------------------------------------------------------------------------------------------------------
0 A 2 21 0 18 3 0 0 0 2 ... 0 0 0 0 0 0 0 0 0 0 156 43
1 B 0 20 12 2 0 8 14 23 0 ... 0 0 0 0 0 0 0 0 0 0 231 21
2 C 0 38 19 3 1 3 3 7 1 ... 0 0 0 0 0 0 0 0 0 0 78 90
3 D 3 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 5 100
If trying to solve this, you can test by using this sample input:
a = pd.Series([1,1,13,0,0,0,4,0,0,0,0,0,12,1,1])
b = pd.Series([1,1,13,0,0,0,4,12,1,12,3,0,0,5,1])
c = pd.Series([1,1,13,0,0,0,4,12,2,0,5,0,5,1,1])
d = pd.Series([1,1,13,0,0,0,4,12,1,12,4,50,0,0,1])
e = pd.Series([1,1,13,0,0,0,4,12,0,0,0,54,0,1,1])
df = pd.DataFrame({'0':a, '1':b, '2':c, '3':d, '4':e})
df = df.transpose()
Give this a try:
def percent_before(row, n, ncols):
    """Return the percentage of activities that happen before
    the first sequence of at least `n` consecutive 0s.
    """
    start_index, i, size = 0, 0, 0
    for i in range(ncols):
        if row[i] == 0:
            # increase the size of the island
            size += 1
        elif size >= n:
            # found the island we want
            break
        else:
            # start a new island
            # row[start_index] is always non-zero
            start_index = i
            size = 0
    if size < n:
        # didn't find the island we want
        return 1
    else:
        # get the sum of activities that happen
        # before the island
        idx = np.arange(0, start_index + 1).astype(str)
        return row.loc[idx].sum() / row['total']
df['percent_before'] = df.apply(percent_before, n=3, ncols=15, axis=1)
Result:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 total percent_before
0 1 1 13 0 0 0 4 0 0 0 0 0 12 1 1 33 0.454545
1 1 1 13 0 0 0 4 12 1 12 3 0 0 5 1 53 0.283019
2 1 1 13 0 0 0 4 12 2 0 5 0 5 1 1 45 0.333333
3 1 1 13 0 0 0 4 12 1 12 4 50 0 0 1 99 0.151515
4 1 1 13 0 0 0 4 12 0 0 0 54 0 1 1 87 0.172414
For the full frame, call apply with ncols=91.
Another possible solution:
def get_vals(df, n):
    df, out = df.T, []
    for col in df.columns:
        diff_to_previous = df[col] != df[col].shift(1)
        g = df.groupby(diff_to_previous.cumsum())[col].agg(['idxmin', 'size'])
        vals = df.loc[g.loc[g['size'] >= n, 'idxmin'].values, col]
        if len(vals):
            out.append(df.loc[np.arange(0, vals[vals == 0].index[0]), col].sum() / df[col].sum())
        else:
            out.append(1.0)
    return out

df['percent_before'] = get_vals(df, n=3)
print(df)
Prints:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 percent_before
0 1 1 13 0 0 0 4 0 0 0 0 0 12 1 1 0.454545
1 1 1 13 0 0 0 4 12 1 12 3 0 0 5 1 0.283019
2 1 1 13 0 0 0 4 12 2 0 5 0 5 1 1 0.333333
3 1 1 13 0 0 0 4 12 1 12 4 50 0 0 1 0.151515
4 1 1 13 0 0 0 4 12 0 0 0 54 0 1 1 0.172414
As one of the comments on the previous question was about speed, I guess you can try to vectorize the problem. I used this dataframe to try it (slightly different from your original input):
ID 0 1 2 3 4 5 6 7 8 total
0 A 2 21 0 18 3 0 0 0 2 46
1 B 0 0 12 2 0 8 14 23 0 59
2 C 0 38 19 3 1 3 3 7 1 75
3 D 3 0 0 1 0 0 0 0 0 4
The idea is to chain commands to create a mask: find where the data is not equal to 0, take the cumsum along the column axis, and see where the diff along the columns is equal to 0. To catch everything from the first such island onward, use cummax so that all the columns after it (row-wise) are set to True. Mask the original dataframe with the opposite of this mask, sum along the columns, and divide by total. For example, with n=2:
n = 2
df['%_before'] = df[~(df.ne(0).cumsum(axis=1).diff(n, axis=1)[range(9)]
                        .eq(0).cummax(axis=1))].sum(axis=1) / df.total
print(df)
ID 0 1 2 3 4 5 6 7 8 total %_before
0 A 2 21 0 18 3 0 0 0 2 46 0.956522
1 B 0 0 12 2 0 8 14 23 0 59 0.000000
2 C 0 38 19 3 1 3 3 7 1 75 1.000000
3 D 3 0 0 1 0 0 0 0 0 4 0.750000
In your case, replace range(9) with range(91) to cover all of your columns.
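For readability, here is the same chain unpacked step by step (a sketch; it assumes the nine event columns are labeled 0 through 8, as in the sample frame above):

n = 2
nonzero_running = df.ne(0).cumsum(axis=1)                 # running count of non-zero cells per row
window_flat = nonzero_running.diff(n, axis=1)[range(9)]   # 0 where the last n cells added nothing, i.e. n zeros
in_or_after = window_flat.eq(0).cummax(axis=1)            # True from the end of the first such window onward
df['%_before'] = df[~in_or_after].sum(axis=1) / df['total']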
You can do this using the rolling method.
For your example input, with a run of 5 zeros, we can use
df.rolling(window=5, axis=1).apply(lambda x: np.sum(x))
The output would look like:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 \
0 NaN NaN NaN NaN 15.0 14.0 17.0 4.0 4.0 4.0 4.0 0.0 12.0 13.0
1 NaN NaN NaN NaN 15.0 14.0 17.0 16.0 17.0 29.0 32.0 28.0 16.0 20.0
2 NaN NaN NaN NaN 15.0 14.0 17.0 16.0 18.0 18.0 23.0 19.0 12.0 11.0
3 NaN NaN NaN NaN 15.0 14.0 17.0 16.0 17.0 29.0 33.0 79.0 67.0 66.0
4 NaN NaN NaN NaN 15.0 14.0 17.0 16.0 16.0 16.0 16.0 66.0 54.0 55.0
14
0 14.0
1 9.0
2 12.0
3 55.0
4 56.0
Looking at the output, it's very easy to see that in the first row, column 11 holds the value 0, which means that starting at position 7 there are 5 zeros.
Since no 0 appears anywhere else in the rolling sums, none of the other rows contain 5 contiguous zeros.
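A small simplification, not from the original answer: rolling objects have a built-in sum, which avoids the Python-level lambda and is faster:

df.rolling(window=5, axis=1).sum()  # same output as apply(lambda x: np.sum(x))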

Re-size the dataframe

I have a dataframe of 5252 rows × 3 columns.
The data looks something like this:
X Y Z
1 1 2
1 2 4
1 3 3.5
2 13 4
1 4 3
2 14 3.5
3 14 2
3 15 1
4 16 .5
4 18 2
. . .
. . .
. . .
1508 751 1
1508 669 1
1508 686 2.5
I want to convert it so that X (the user id) becomes the rows, Y (the item id) becomes the columns, and Z is the value corresponding to each X, Y pair. Something like this:
1 2 3 4 5 6 13 14 15 16 17 18 669 686
1 2 4 3.5 3 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 4 4.5 0 0 0 0 0 0
3 0 0 0 0 0 0 0 2 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 .5 0 2 0 0
.
.
.
1508 0 0 0 0 0 0 0 0 0 0 0 0 1 1
I assume you're using the pandas library.
You need the pd.pivot_table function. If the dataframe is called df, then you need:
pd.pivot_table(data=df, index="X", columns="Y", values="Z", aggfunc=sum)
You need to use pd.pivot_table() and use fillna(0). Recreating your sample dataframe:
import pandas as pd
df = pd.DataFrame({'X': [1,1,1,1,2,2,3,3,4], 'Y': [1,2,3,4,13,14,14,15,16], 'Z': [2,4,3.5,3,4,3.5,2,1,.5]})
Gives:
X Y Z
0 1 1 2.0
1 1 2 4.0
2 1 3 3.5
3 1 4 3.0
4 2 13 4.0
5 2 14 3.5
6 3 14 2.0
7 3 15 1.0
8 4 16 0.5
Then using pd.pivot_table():
pd.pivot_table(df, values='Z', index=['X'], columns=['Y']).fillna(0)
Yields:
Y 1 2 3 4 13 14 15 16
X
1 2.0 4.0 3.5 3.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 4.0 3.5 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 2.0 1.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5
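If every (X, Y) pair occurs at most once, the same reshape can also be sketched with set_index/unstack, which raises on duplicates instead of silently aggregating:

df.set_index(['X', 'Y'])['Z'].unstack(fill_value=0)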

Pandas creates a new column each time I save a dataframe as csv in Python

I am baffled by the following behavior of Pandas: a new column is added each time I save a dataframe as a CSV file.
For a reproducible example:
print(df)
Medical_Keyword_17 Product_Info_2_A5 Medical_History_27 Family_Hist_2
1 0 0 3 0.0
2 0 0 3 0.0
3 0 0 3 0.0
4 0 0 3 0.0
5 0 0 3 0.0
6 0 0 3 0.0
7 0 1 3 NaN
8 0 0 3 0.0
9 0 0 3 0.0
df.to_csv('toy_data.csv')
df1 = pd.read_csv('toy_data.csv')
print(df1)
Unnamed: 0 Medical_Keyword_17 Product_Info_2_A5 Medical_History_27 \
0 1 0 0 3
1 2 0 0 3
2 3 0 0 3
3 4 0 0 3
4 5 0 0 3
5 6 0 0 3
6 7 0 1 3
7 8 0 0 3
8 9 0 0 3
Family_Hist_2
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 NaN
7 0.0
8 0.0
How can I understand this behavior and avoid it?
The first column is the index.
To avoid writing it to the file, use index=False:
df.to_csv('toy_data.csv', index=False)
df1 = pd.read_csv('toy_data.csv')
Or use the index_col parameter in read_csv:
df.to_csv('toy_data.csv')
df1 = pd.read_csv('toy_data.csv', index_col=[0])
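A quick round-trip check, as a sketch (it assumes the dtypes survive the CSV round trip, which they do for this example):

from pandas.testing import assert_frame_equal

df.to_csv('toy_data.csv', index=False)
df1 = pd.read_csv('toy_data.csv')
assert_frame_equal(df.reset_index(drop=True), df1)  # raises if the frames differ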

Merging row data using pandas in python

I am trying to write a small Python application that creates a CSV file containing data for a recipe system.
Imagine the following structure of Excel data:
Manufacturer Product Data 1 Data 2 Data 3
Test 1 Product 1 1 2 3
Test 1 Product 2 4 5 6
Test 2 Product 1 1 2 3
Test 3 Product 1 1 2 3
Test 3 Product 1 4 5 6
Test 3 Product 1 7 8 9
When merged, I would like the data to be displayed in the following format:
Test 1 Product 1 1 2 3 0 0 0 0 0 0
Test 1 Product 2 4 5 6 0 0 0 0 0 0
Test 2 Product 1 1 2 3 0 0 0 0 0 0
Test 3 Product 1 1 2 3 4 5 6 7 8 9
Any help would be gratefully received; so far I can read the pandas dataset and convert it to a CSV.
Regards
Lee
Use melt, groupby, pd.Series, and unstack:
(df.melt(['Manufacturer', 'Product'])
   .groupby(['Manufacturer', 'Product'])['value']
   .apply(lambda x: pd.Series(x.tolist()))
   .unstack(fill_value=0)
   .reset_index())
Output:
Manufacturer Product 0 1 2 3 4 5 6 7 8
0 Test 1 Product 1 1 2 3 0 0 0 0 0 0
1 Test 1 Product 2 4 5 6 0 0 0 0 0 0
2 Test 2 Product 1 1 2 3 0 0 0 0 0 0
3 Test 3 Product 1 1 4 7 2 5 8 3 6 9
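Note that the Test 3 values come out column-wise (1 4 7 2 5 8 3 6 9) rather than in the requested row-wise order, because melt stacks one source column at a time. A sketch that restores row-wise order by keeping the original row index and sorting on it stably (ignore_index requires pandas >= 1.1):

(df.melt(['Manufacturer', 'Product'], ignore_index=False)
   .sort_index(kind='stable')
   .groupby(['Manufacturer', 'Product'])['value']
   .apply(lambda x: pd.Series(x.tolist()))
   .unstack(fill_value=0)
   .reset_index())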
With groupby
df.groupby(['Manufacturer','Product']).agg(tuple).sum(1).apply(pd.Series).fillna(0)
Out[85]:
0 1 2 3 4 5 6 7 8
Manufacturer Product
Test1 Product1 1.0 2.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0
Product2 4.0 5.0 6.0 0.0 0.0 0.0 0.0 0.0 0.0
Test2 Product1 1.0 2.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0
Test3 Product1 1.0 4.0 7.0 2.0 5.0 8.0 3.0 6.0 9.0
cols = ['Manufacturer', 'Product']
d = df.set_index(cols + [df.groupby(cols).cumcount()]).unstack(fill_value=0)
d
Gets me
Data 1 Data 2 Data 3
0 1 2 0 1 2 0 1 2
Manufacturer Product
Test 1 Product 1 1 0 0 2 0 0 3 0 0
Product 2 4 0 0 5 0 0 6 0 0
Test 2 Product 1 1 0 0 2 0 0 3 0 0
Test 3 Product 1 1 4 7 2 5 8 3 6 9
Followed up with:
d.sort_index(1, 1).pipe(lambda d: d.set_axis(range(d.shape[1]), 1, False).reset_index())
Manufacturer Product 0 1 2 3 4 5 6 7 8
0 Test 1 Product 1 1 2 3 0 0 0 0 0 0
1 Test 1 Product 2 4 5 6 0 0 0 0 0 0
2 Test 2 Product 1 1 2 3 0 0 0 0 0 0
3 Test 3 Product 1 1 2 3 4 5 6 7 8 9
Or
cols = ['Manufacturer', 'Product']
pd.Series({
    n: d.values.ravel() for n, d in df.set_index(cols).groupby(cols)
}).apply(pd.Series).fillna(0, downcast='infer').rename_axis(cols).reset_index()
Manufacturer Product 0 1 2 3 4 5 6 7 8
0 Test 1 Product 1 1 2 3 0 0 0 0 0 0
1 Test 1 Product 2 4 5 6 0 0 0 0 0 0
2 Test 2 Product 1 1 2 3 0 0 0 0 0 0
3 Test 3 Product 1 1 2 3 4 5 6 7 8 9
With defaultdict and itertools.count
from itertools import count
from collections import defaultdict
c = defaultdict(count)
pd.Series({
    (m, p, next(c[(m, p)])): v
    for _, m, p, *V in df.itertuples()
    for v in V
}).unstack(fill_value=0)
0 1 2 3 4 5 6 7 8
Test 1 Product 1 1 2 3 0 0 0 0 0 0
Product 2 4 5 6 0 0 0 0 0 0
Test 2 Product 1 1 2 3 0 0 0 0 0 0
Test 3 Product 1 1 2 3 4 5 6 7 8 9
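As a follow-up, to flatten the MultiIndex back into ordinary columns and match the requested layout exactly (assuming the expression above is assigned to a name such as out):

out.rename_axis(['Manufacturer', 'Product']).reset_index()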
