Let's say we have this sample data.
| mem_id | main_title | sub_title |
|--------|------------|-----------|
|      1 |          1 |         1 |
|     10 |          3 |         2 |
|      3 |          3 |         2 |
|     45 |          1 |         2 |
|    162 |          2 |         2 |
...
1) Summary of the data
mem_id : unique id for each of 200 people
main_title : 3 unique labels (1, 2, 3)
sub_title : 6 unique labels (1, 2, 3, 4, 5, 6); each main_title can be paired with any of these sub_title values.
Repetition is possible: one mem_id can appear in multiple rows with the same combination, e.g. (main = 1, sub = 1). A small reconstruction of this sample is sketched below.
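For anyone who wants to run the examples, a minimal reconstruction of the rows shown above (only a sketch; the real data has 200 unique mem_id values, which are not shown here):

import pandas as pd

# Hypothetical reconstruction of the visible sample rows only.
data = pd.DataFrame({
    'mem_id':     [1, 10, 3, 45, 162],
    'main_title': [1, 3, 3, 1, 2],
    'sub_title':  [1, 2, 2, 2, 2],
})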
2) Question
I'd like to reproduce the result of R's table function in Python.
The R result looks like this: it builds every possible combination of main_title and sub_title and gives the count of each case for every mem_id.
count.data <- table(data$mem_id, data$main_title, data$sub_title)
count.table <- as.data.frame(count.data)
===============================================
mem_id main_title sub_title value
1 1 1 1 0
2 2 1 1 0
3 3 1 1 0
4 4 1 1 0
5 5 1 1 0
6 6 1 1 0
7 7 1 1 0
.
.
.
I've tried to get this result in Python, and the output below is what I have so far.
cross_table1 = pd.melt(data, id_vars=['main_title', 'sub_title'], value_vars='mem_id', value_name='mem_id')
==================================================
main_title sub_title variable mem_id
1 1 1 mem_id 10
2 1 1 mem_id 10
3 3 1 mem_id 10
4 4 2 mem_id 10
5 1 4 mem_id 132
6 4 1 mem_id 65
7 4 3 mem_id 88
.
.
.
cross_table2 = cross_table1.pivot_table(index=['main_title', 'sub_title', 'mem_id'], values='variable', aggfunc='count')
cross_table2.reset_index().sort_values('value')
==============================================
main_title sub_title mem_id value
1 1 1 1 4
2 1 1 2 3
3 3 1 3 1
4 4 2 3 10
5 1 4 3 2
6 1 1 4 5
7 3 2 5 2
.
.
.
I realize this only shows the combinations with a positive count in the value column.
What I need is to include every possible combination of main_title and sub_title, so that, for example, the (main = 1, sub = 1) case has 200 rows, with zeros in the count column where a mem_id never had that combination.
I would be grateful for any help or advice!
Thanks :)
In pandas you can do this with groupby + reindex:
# count every observed (mem_id, main_title, sub_title) combination
s = df.groupby(df.columns.tolist()).size()
# build the full index of all possible combinations of the three columns
idx = pd.MultiIndex.from_product(list(map(set, df.values.T)))
# reindex so missing combinations show up with a count of 0
s = s.reindex(idx, fill_value=0)
s
Out[15]:
162 1 1 0
2 0
2 1 0
2 1
3 1 0
2 0
1 1 1 1
2 0
2 1 0
2 0
3 1 0
2 0
10 1 1 0
2 0
2 1 0
2 0
3 1 0
2 1
3 1 1 0
2 0
2 1 0
2 0
3 1 0
2 1
45 1 1 0
2 1
2 1 0
2 0
3 1 0
2 0
dtype: int64
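If you want the result in the same long format as R's as.data.frame(table(...)), with named columns and a value column, one option (a sketch, assuming the frame is called df and has exactly the three columns named as in the question) is to name the index levels and reset the index:

idx = pd.MultiIndex.from_product(
    [sorted(df['mem_id'].unique()),
     sorted(df['main_title'].unique()),
     sorted(df['sub_title'].unique())],
    names=['mem_id', 'main_title', 'sub_title'])

count_table = (df.groupby(['mem_id', 'main_title', 'sub_title'])
                 .size()                      # count each observed combination
                 .reindex(idx, fill_value=0)  # add the missing combinations with count 0
                 .reset_index(name='value'))  # flat DataFrame like R's as.data.frame result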
Related
I have the following dataframe:
p l w s_w v
1 1 1 1 2
1 1 2 1 2
1 1 3 0 5
1 1 4 1 5
1 1 5 1 5
2 1 1 1 1
2 1 2 0 2
2 1 3 0 3
2 1 4 0 4
2 1 5 1 5
2 1 6 1 4
I want to add a new column where, for each row with s_w == 1, the value is the sum of v over the two previous rows with s_w == 1 (not necessarily consecutive) plus the sum of v over the two following rows with s_w == 1 (again not necessarily consecutive).
Any number of zero rows in between does not matter.
So the resulting dataframe looks like this:
p l w s_w v c_s
1 1 1 1 2 Null
1 1 2 1 2 Null
1 1 3 0 5 Null
1 1 4 1 5 10
1 1 5 1 5 13
2 1 1 1 1 19
2 1 2 0 2 Null
2 1 3 0 3 Null
2 1 4 0 4 Null
2 1 5 1 5 Null
2 1 6 1 4 Null
The last two rows are Null because there are no two 1s after them. In other words, the sum is computed only when there are two s_w == 1 rows both before and after (not necessarily consecutive); otherwise the value is Null.
A new edit to the original question:
For each group of (p, l), and only where the value in the check column is 1, find the pattern described above in the s_w column: sum v over the two previous rows where s_w == 1 (not necessarily consecutive) and also over the two following rows where s_w == 1 (not necessarily consecutive).
p l w s_w check v
1 1 1 1 0 2
1 1 2 1 0 2
1 1 3 0 0 5
1 1 4 1 0 5
1 1 5 1 1 5
2 1 1 1 0 1
2 1 2 0 0 2
2 1 3 0 0 3
2 1 4 0 0 4
2 1 5 1 0 5
2 1 6 1 0 4
The idea is to filter the rows where s_w == 1 and use a rolling sum, with shifted values for correct alignment:
s = df.loc[df['s_w'].eq(1), 'v']
df['c_s'] = s.rolling(2).sum().shift().add(s.iloc[::-1].rolling(2).sum().shift())
print (df)
p l w s_w v c_s
0 1 1 1 1 2 NaN
1 1 1 2 1 2 NaN
2 1 1 3 0 5 NaN
3 1 1 4 1 5 10.0
4 1 1 5 1 5 13.0
5 2 1 1 1 1 19.0
6 2 1 2 0 2 NaN
7 2 1 3 0 3 NaN
8 2 1 4 0 4 NaN
9 2 1 5 1 5 NaN
10 2 1 6 1 4 NaN
Another idea:
df['c_s'] = s.shift(-1).add(s.shift(-2)).add(s.shift(2)).add(s.shift(1))
EDIT:
Solution per groups:
# per-group version of the rolling-sum approach
s = df[df['s_w'].eq(1)]
f = lambda x: x.rolling(2).sum().shift()
df['c_s'] = s.groupby(['p','l'])['v'].apply(f).add(s.iloc[::-1].groupby(['p','l'])['v'].apply(f))

# per-group version of the shift approach
g = df[df['s_w'].eq(1)].groupby(['p','l'])['v']
df['c_s'] = g.shift(-1).add(g.shift(-2)).add(g.shift(2)).add(g.shift(1))
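If, as in the edited question, the value should be kept only for rows where the check column is 1, the computed c_s can be masked afterwards; a small sketch under that assumption:

# keep c_s only where check == 1, otherwise NaN
df['c_s'] = df['c_s'].where(df['check'].eq(1))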
So I am trying to count the number of consecutive equal values in a dataframe and put that information into a new column, but I want the count to grow row by row within each run rather than show the final run length on every row.
Here is what I have so far:
df = pd.DataFrame(np.random.randint(0,3, size=(15,4)), columns=list('ABCD'))
df['subgroupA'] = (df.A != df.A.shift(1)).cumsum()
dfg = df.groupby(by='subgroupA', as_index=False).apply(lambda grp: len(grp))
dfg.rename(columns={None: 'numConsec'}, inplace=True)
df = df.merge(dfg, how='left', on='subgroupA')
df
Here is the result:
A B C D subgroupA numConsec
0 2 1 1 1 1 1
1 1 2 1 0 2 2
2 1 0 2 1 2 2
3 0 1 2 0 3 1
4 1 0 0 1 4 1
5 0 2 2 1 5 2
6 0 2 1 1 5 2
7 1 0 0 1 6 1
8 0 2 0 0 7 4
9 0 0 0 2 7 4
10 0 2 1 1 7 4
11 0 2 2 0 7 4
12 1 2 0 1 8 1
13 0 1 1 0 9 1
14 1 1 1 0 10 1
The problem is that the numConsec column shows the full run length on every row. I want it to reflect the running count as you scan the dataframe row by row. My dataframe is too large to loop over iteratively, as that would be too slow, so I need a vectorized, pythonic way to make it look like this:
A B C D subgroupA numConsec
0 2 1 1 1 1 1
1 1 2 1 0 2 1
2 1 0 2 1 2 2
3 0 1 2 0 3 1
4 1 0 0 1 4 1
5 0 2 2 1 5 1
6 0 2 1 1 5 2
7 1 0 0 1 6 1
8 0 2 0 0 7 1
9 0 0 0 2 7 2
10 0 2 1 1 7 3
11 0 2 2 0 7 4
12 1 2 0 1 8 1
13 0 1 1 0 9 1
14 1 1 1 0 10 1
Any ideas?
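One vectorized possibility (a sketch, reusing the subgroupA column already computed above) is GroupBy.cumcount, which numbers the rows inside each consecutive run:

# running position of each row inside its run of equal A values, starting at 1
df['numConsec'] = df.groupby('subgroupA').cumcount() + 1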
How do I get the data frame below?
dd = pd.DataFrame({'val':[0,0,1,1,1,0,0,0,0,1,1,0,1,1,1,1,0,0],
'groups':[1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,'ignore','ignore']})
val groups
0 0 1
1 0 1
2 1 1
3 1 1
4 1 1
5 0 2
6 0 2
7 0 2
8 0 2
9 1 2
10 1 2
11 0 3
12 1 3
13 1 3
14 1 3
15 1 3
16 0 ignore
17 0 ignore
I have a series dd.val with the values [0,0,1,1,1,0,0,0,0,1,1,0,1,1,1,1,0,0].
How do I create dd.groups from dd.val?
The first 0,0,1,1,1 forms group 1 (i.e. from the beginning up to the next occurrence of 0 after the 1s).
The next 0,0,0,0,1,1 forms group 2 (an incremental group number, starting where the previous group ended, up to the next occurrence of 0 after the 1s), and so on.
Can anyone please help?
First test whether a value is 0 and the previous value is 1, and create groups with cumulative sums via Series.cumsum:
s = (dd['val'].eq(0) & dd['val'].shift().eq(1)).cumsum().add(1)
Then convert the last group to ignore, if the last value of the data is 0, with numpy.where:
mask = s.eq(s.max()) & (dd['val'].iat[-1] == 0)
dd['new'] = np.where(mask, 'ignore', s)
print (dd)
val groups new
0 0 1 1
1 0 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 0 2 2
6 0 2 2
7 0 2 2
8 0 2 2
9 1 2 2
10 1 2 2
11 0 3 3
12 1 3 3
13 1 3 3
14 1 3 3
15 1 3 3
16 0 ignore ignore
17 0 ignore ignore
IIUC, first we do diff and cumsum, then we use np.where to mark the groups that should be ignored (those that never contain a 1):
s = dd.val.diff().eq(-1).cumsum() + 1
dd['New'] = np.where(dd['val'].eq(1).groupby(s).transform('any'), s, 'ignore')
dd
val groups New
0 0 1 1
1 0 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 0 2 2
6 0 2 2
7 0 2 2
8 0 2 2
9 1 2 2
10 1 2 2
11 0 3 3
12 1 3 3
13 1 3 3
14 1 3 3
15 1 3 3
16 0 ignore ignore
17 0 ignore ignore
Here is my Python code:
from fractions import gcd
print "| 2 3 4 5 6 7 8 9 10 11 12 13 14 15"
print "-----------------------------------"
xlist = range(2,16)
ylist = range(2,51)
for b in ylist:
    print b, " | "
    for a in xlist:
        print gcd(a,b)
I'm having trouble printing a table that displays the values 2-15 across the top row and 2-50 down the left column, with the gcd of each row/column pair filling the body.
Here is a sample of what I'm getting:
| 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2 |
2
1
2
You can make it much more concise with a list comprehension:
from fractions import gcd
print(" | 2 3 4 5 6 7 8 9 10 11 12 13 14 15")
print("-----------------------------------------------")
xlist = range(2,16)
ylist = range(2,51)
print("\n".join(" ".join(["%2d | " % b] + [("%2d" % gcd(a, b)) for a in xlist]) for b in ylist))
Output:
| 2 3 4 5 6 7 8 9 10 11 12 13 14 15
-----------------------------------------------
2 | 2 1 2 1 2 1 2 1 2 1 2 1 2 1
3 | 1 3 1 1 3 1 1 3 1 1 3 1 1 3
4 | 2 1 4 1 2 1 4 1 2 1 4 1 2 1
5 | 1 1 1 5 1 1 1 1 5 1 1 1 1 5
6 | 2 3 2 1 6 1 2 3 2 1 6 1 2 3
7 | 1 1 1 1 1 7 1 1 1 1 1 1 7 1
8 | 2 1 4 1 2 1 8 1 2 1 4 1 2 1
9 | 1 3 1 1 3 1 1 9 1 1 3 1 1 3
10 | 2 1 2 5 2 1 2 1 10 1 2 1 2 5
11 | 1 1 1 1 1 1 1 1 1 11 1 1 1 1
12 | 2 3 4 1 6 1 4 3 2 1 12 1 2 3
13 | 1 1 1 1 1 1 1 1 1 1 1 13 1 1
14 | 2 1 2 1 2 7 2 1 2 1 2 1 14 1
15 | 1 3 1 5 3 1 1 3 5 1 3 1 1 15
16 | 2 1 4 1 2 1 8 1 2 1 4 1 2 1
17 | 1 1 1 1 1 1 1 1 1 1 1 1 1 1
18 | 2 3 2 1 6 1 2 9 2 1 6 1 2 3
19 | 1 1 1 1 1 1 1 1 1 1 1 1 1 1
20 | 2 1 4 5 2 1 4 1 10 1 4 1 2 5
21 | 1 3 1 1 3 7 1 3 1 1 3 1 7 3
22 | 2 1 2 1 2 1 2 1 2 11 2 1 2 1
23 | 1 1 1 1 1 1 1 1 1 1 1 1 1 1
24 | 2 3 4 1 6 1 8 3 2 1 12 1 2 3
25 | 1 1 1 5 1 1 1 1 5 1 1 1 1 5
26 | 2 1 2 1 2 1 2 1 2 1 2 13 2 1
27 | 1 3 1 1 3 1 1 9 1 1 3 1 1 3
28 | 2 1 4 1 2 7 4 1 2 1 4 1 14 1
29 | 1 1 1 1 1 1 1 1 1 1 1 1 1 1
30 | 2 3 2 5 6 1 2 3 10 1 6 1 2 15
31 | 1 1 1 1 1 1 1 1 1 1 1 1 1 1
32 | 2 1 4 1 2 1 8 1 2 1 4 1 2 1
33 | 1 3 1 1 3 1 1 3 1 11 3 1 1 3
34 | 2 1 2 1 2 1 2 1 2 1 2 1 2 1
35 | 1 1 1 5 1 7 1 1 5 1 1 1 7 5
36 | 2 3 4 1 6 1 4 9 2 1 12 1 2 3
37 | 1 1 1 1 1 1 1 1 1 1 1 1 1 1
38 | 2 1 2 1 2 1 2 1 2 1 2 1 2 1
39 | 1 3 1 1 3 1 1 3 1 1 3 13 1 3
40 | 2 1 4 5 2 1 8 1 10 1 4 1 2 5
41 | 1 1 1 1 1 1 1 1 1 1 1 1 1 1
42 | 2 3 2 1 6 7 2 3 2 1 6 1 14 3
43 | 1 1 1 1 1 1 1 1 1 1 1 1 1 1
44 | 2 1 4 1 2 1 4 1 2 11 4 1 2 1
45 | 1 3 1 5 3 1 1 9 5 1 3 1 1 15
46 | 2 1 2 1 2 1 2 1 2 1 2 1 2 1
47 | 1 1 1 1 1 1 1 1 1 1 1 1 1 1
48 | 2 3 4 1 6 1 8 3 2 1 12 1 2 3
49 | 1 1 1 1 1 7 1 1 1 1 1 1 7 1
50 | 2 1 2 5 2 1 2 1 10 1 2 1 2 5
This works in Python 2 and Python 3. If you want zeros at the beginning of each one-digit number, replace each occurrence of %2d with %02d. You probably shouldn't hard-code the header like that, but build it more like this:
from fractions import gcd
xlist = range(2, 16)
ylist = range(2, 51)
string = " | " + " ".join(("%2d" % x) for x in xlist)
print(string)
print("-" * len(string))
print("\n".join(" ".join(["%2d | " % b] + [("%2d" % gcd(a, b)) for a in xlist]) for b in ylist))
This way, even if you change xlist or ylist, the table will still look good.
Your problem is that the Python print statement adds a newline by itself.
One solution is to build up your own output string piece by piece and use only one print statement per line of the table, like so:
from fractions import gcd
print "| 2 3 4 5 6 7 8 9 10 11 12 13 14 15"
print "-----------------------------------"
xlist = range(2,16)
ylist = range(2,51)
for b in ylist:
    output = str(b) + " | "                    # For each number in ylist, start a new string with this number
    for a in xlist:
        output = output + str(gcd(a,b)) + " "  # Append the gcd for each number in xlist
    print output                               # Print the string you've built up
Example output, by the way:
| 2 3 4 5 6 7 8 9 10 11 12 13 14 15
-----------------------------------
2 | 2 1 2 1 2 1 2 1 2 1 2 1 2 1
3 | 1 3 1 1 3 1 1 3 1 1 3 1 1 3
4 | 2 1 4 1 2 1 4 1 2 1 4 1 2 1
5 | 1 1 1 5 1 1 1 1 5 1 1 1 1 5
6 | 2 3 2 1 6 1 2 3 2 1 6 1 2 3
7 | 1 1 1 1 1 7 1 1 1 1 1 1 7 1
8 | 2 1 4 1 2 1 8 1 2 1 4 1 2 1
9 | 1 3 1 1 3 1 1 9 1 1 3 1 1 3
You can specify what character ends the line using the end parameter of print().
from fractions import gcd
print("| 2 3 4 5 6 7 8 9 10 11 12 13 14 15")
print("-----------------------------------")
xlist = range(2,16)
ylist = range(2,51)
for b in ylist:
    print(str(b) + " | ", end="")   # row label, without a newline
    for a in xlist:
        print(gcd(a, b), end=" ")   # gcds on the same line, separated by spaces
    print("")                       # newline at the end of the row
If you are using Python 2.x, you need to add from __future__ import print_function at the top for this to work.
I have a data frame that represents fail-data for a series of parts, showing which of 3 tests (A, B, C) pass (0) or fail (1).
A B C
1 0 1 1
2 0 0 0
3 1 0 0
4 0 0 1
5 0 0 0
6 0 1 0
7 1 1 0
8 1 1 1
I'd like to add a final column to the dataframe showing the First Fail (FF) of each part, or a default (P) if there are no fails.
A B C | FF
1 0 1 1 | B
2 0 0 0 | P
3 1 0 0 | A
4 0 0 1 | C
5 0 0 0 | P
6 0 1 0 | B
7 1 1 0 | A
8 1 1 1 | A
Any easy way to do this in pandas? Does it require iterating over each row?
maybe:
>>> df['FF'] = df.dot(df.columns).str.slice(0, 1).replace('', 'P')
>>> df
A B C FF
1 0 1 1 B
2 0 0 0 P
3 1 0 0 A
4 0 0 1 C
5 0 0 0 P
6 0 1 0 B
7 1 1 0 A
8 1 1 1 A
alternatively:
>>> df['FF'] = np.where(df.any(axis=1), df.idxmax(axis=1), 'P')
>>> df
A B C FF
1 0 1 1 B
2 0 0 0 P
3 1 0 0 A
4 0 0 1 C
5 0 0 0 P
6 0 1 0 B
7 1 1 0 A
8 1 1 1 A
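For what it's worth, a quick way to see why the first answer works (a sketch, assuming the same 0/1 frame as above): df.dot(df.columns) repeats each column name by the 0/1 in that column and concatenates them, so each row becomes the string of its failing tests.

fails = df.dot(df.columns)                          # e.g. row 1 (0,1,1) -> 'BC', row 8 (1,1,1) -> 'ABC'
df['FF'] = fails.str.slice(0, 1).replace('', 'P')   # first failing test, or 'P' when nothing failed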