Tabulate data frame with groupby and size methods - python

I have a Pandas dataframe, out, on which I am computing counts by the columns city and raingarden using the following series:
out.groupby(['city','raingarden']).size()
with the output:
city raingarden
55405 True 3
Edina True 7
MInneapolis True 8
Minneapolis False 2482
True 847
Minneapolis False 2
True 1
Minneapolis, True 1
Minneapolis, False 2
Minneapolsi False 5
True 3
Minnepolis False 4
Minnespolis False 4
Minnetonka False 1
True 2
Minnneapolis False 5
Mpla True 3
Mpls False 22
True 20
Mpls. False 8
True 17
NE Mpls True 6
Richfield True 1
SLP True 3
St Louis Park True 2
St. Louis Park False 1
Victoria False 1
Wayzata False 2
True 1
minneapolis False 3
mpls True 2
dtype: int64
I want to take this and output it to a tabulate table.
To do this, I did the following:
headers = ['city','has rain garden', 'n']
print tabulate(out.groupby(['city','raingarden']).size().to_frame(), headers, tablefmt="simple")
Issue 1: I need to get a column name on the counts, but have not had any luck;
Issue 2 (which is probably related to issue 1), the output looks like this:
city has rain garden
-------------------------- -----------------
(u'55405', True) 3
(u'Edina', True) 7
(u'MInneapolis', True) 8
(u'Minneapolis', False) 2482
(u'Minneapolis', True) 847
(u'Minneapolis ', False) 2
(u'Minneapolis ', True) 1
(u'Minneapolis,', True) 1
(u'Minneapolis, ', False) 2
(u'Minneapolsi', False) 5
(u'Minneapolsi', True) 3
(u'Minnepolis', False) 4
(u'Minnespolis', False) 4
(u'Minnetonka', False) 1
(u'Minnetonka', True) 2
(u'Minnneapolis', False) 5
(u'Mpla', True) 3
(u'Mpls', False) 22
(u'Mpls', True) 20
(u'Mpls.', False) 8
(u'Mpls.', True) 17
(u'NE Mpls', True) 6
(u'Richfield', True) 1
(u'SLP', True) 3
(u'St Louis Park', True) 2
(u'St. Louis Park', False) 1
(u'Victoria', False) 1
(u'Wayzata', False) 2
(u'Wayzata', True) 1
(u'minneapolis', False) 3
(u'mpls', True) 2
The first two columns are given as a tuple? Thus, how do I split these into separate columns, and how do I add a label for my counts? I am sure what I am trying to achieve is much simpler than what I have attempted.

By grouping by two columns, you are creating a multi-level index Series, which I believe is not what you want. I am not sure what the original data looks like (it would be nice to provide out.head() in the question), but I believe what you are looking for is:
out.groupby('city').sum()['raingarden']
Here's an example with some randomly generated data:
import random
import string
import pandas as pd
import numpy as np
city = random.sample(string.lowercase*500,100)
raingarden = np.random.randint(0,10,100)
out = pd.DataFrame({'city':city, 'raingarden':raingarden})
Output:
In [30]: out.groupby('city').sum()['raingarden']
Out[30]:
city
a 17
b 7
c 16
d 8
e 24
f 28
g 16
h 49
i 29
j 24
k 4
l 5
m 17
n 29
p 22
q 14
r 19
s 6
t 21
u 8
v 18
w 25
x 11
y 9
z 40
Name: raingarden, dtype: int64
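As for the original tabulate issues: a common approach (a sketch, not taken from the answer above) is to flatten the MultiIndex with reset_index, which splits the (city, raingarden) tuples into separate columns and lets you name the counts at the same time:
# name the size() column 'n' and turn the MultiIndex levels into regular columns
counts = out.groupby(['city', 'raingarden']).size().reset_index(name='n')
# showindex=False (available in recent tabulate versions) hides the row index
print tabulate(counts, headers=['city', 'has rain garden', 'n'], tablefmt='simple', showindex=False)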

Related

Detect presence of inverse pairs in two columns of a DataFrame

I have a dataframe with two columns; source, and target. I would like to detect inverse rows, i.e. for a pair of values (source, target), if there exists a pair of values (target, source) then assign True to a new column.
My attempt:
cols = ['source', 'target']
_cols = ['target', 'source']
sub_edges = edges[cols]
sub_edges['oneway'] = sub_edges.apply(lambda x: True if x[x.isin(x[_cols])] else False, axis=1)
You can apply a lambda function using similar logic to that in your example. We check if there are any rows in the dataframe with a reversed source/target pair.
Incidentally, the column name 'oneway' indicates to me the opposite of the logic described in your question, but to change this we can just remove the not in the lambda function.
Code
import pandas as pd
import random

edges = {"source": random.sample(range(20), 20),
         "target": random.sample(range(20), 20)}
df = pd.DataFrame(edges)

# True if some other row holds the reversed (target, source) pair
df["oneway"] = df.apply(
    lambda x: not df[
        (df["source"] == x["target"]) & (df["target"] == x["source"]) & (df.index != x.name)
    ].empty,
    axis=1,
)
Output
source target oneway
0 9 11 False
1 16 1 True
2 1 16 True
3 11 14 False
4 4 13 False
5 18 15 False
6 14 17 False
7 13 12 False
8 19 19 False
9 12 3 False
10 10 6 False
11 15 5 False
12 3 18 False
13 17 0 False
14 6 7 False
15 5 10 False
16 7 2 False
17 8 9 False
18 0 4 False
19 2 8 False
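A vectorized alternative (a sketch along the same lines, not part of the answer above) is to build a set of (source, target) tuples once and test each reversed pair against it, avoiding a full dataframe scan per row:
pairs = set(zip(df["source"], df["target"]))
# membership test per row; note that a row with source == target matches
# itself here, whereas the apply version above excludes the row's own index
df["oneway"] = [(t, s) in pairs for s, t in zip(df["source"], df["target"])]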

Showing Before and After version of a dataframe using Pandas

------Original Data----
Before:
Speak English Length currentCode currentName
0 True 1 $A USA
1 True 2 $AM Massachusetts
2 True 3 $AMB Boston
3 True 3 $AMS Springfield
4 True 3 $AMA Amherst
5 True 3 $AMP Plymouth
6 False 1 $D Germany
7 False 2 $DB Brandenburg
8 False 3 $DBB Berlin
9 False 3 $DBD Dresden
After:
Speak English Length futureCode futureName
0 True 1 $A America
1 True 2 $AM Maine
2 True 3 $AMC Brockton
3 True 3 $AMM Main
4 False 1 $D Denmark
5 False 2 $DC Copenhagen
6 False 3 $DCC Copper
7 False 3 $DCD Dresden
Goal:
Note: the goal is in the form of a pivot table in Excel. My code:
import pandas as pd
before = pd.read_excel(r'/Users/BoBoMann/Desktop/Sequence.xlsx',sheet_name='Before')
after = pd.read_excel(r'/Users/BoBoMann/Desktop/Sequence.xlsx',sheet_name='After')
Attempt #1: concatenate, but I do not know how to set the index to Speak English and Length afterward:
pd.concat([before,after],axis = 1,keys=['Before','After'],join='outer')
Attempt #2: set the index for each data frame, but I cannot concatenate along the columns as Pandas raises ValueError: cannot handle a non-unique multi-index!
before = before.set_index(['Speak English','Length']).sort_index(axis = 0)
after = after.set_index(['Speak English','Length']).sort_index(axis = 0)
pd.concat([before,after],axis = 1,keys=['Current','Future'],join='outer')
Thank you so much for your help!
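The ValueError in Attempt #2 arises because several rows share the same (Speak English, Length) pair. One possible fix, sketched here rather than taken from an accepted answer, is to append a per-group counter with cumcount() so every index entry becomes unique before concatenating:
before = before.set_index(['Speak English', 'Length']).sort_index()
after = after.set_index(['Speak English', 'Length']).sort_index()
# cumcount() numbers the duplicates within each (Speak English, Length) group,
# which makes each three-level index entry unique
before = before.set_index(before.groupby(level=[0, 1]).cumcount(), append=True)
after = after.set_index(after.groupby(level=[0, 1]).cumcount(), append=True)
combined = pd.concat([before, after], axis=1, keys=['Current', 'Future'], join='outer')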

Identify random input characters among the certain numerical format in Python df

I had to clean the membership_id column, but there are lots of random input values like '0000000', '99999', '*', 'na'.
The membership IDs are serial numbers ranging from 4 to 12 digits, in which:
IDs with 4 to 9 digits start with any non-zero digit, while IDs with 10 to 12 digits start with 1000xxxxxxxx.
Sorry for not describing the format clearly at the beginning; any ID that fails to meet these criteria is invalid. I would like to mark all values that do not match the membership ID format as 0. Thanks for your help.
member_id
1 176828287
2 176841791
3 202142958
4 222539874
5 223565464
6 224721631
7 227675081
8 30235355118
9 %
10 ---
11 .
12 .215694985
13 0
14 00
15 000
16 00000000000000
17 99999999999999
18 999999999999999
19 : 211066980
20 D5146159
21 JulieGreen
22 N/a
23 NONE
24 None
25 PP - Premium Pr
26 T0000
27 T0000019
28 T0000022
If I understood correctly, the regex \A((1000\d{8})|([1-9]\d{3,10}))\Z will meet your requirements.
The regex above matches:
12 digits beginning with 1000
4 to 11 digits beginning with any non-zero digit
Below is one demo:
import pandas as pd
import re

df = pd.DataFrame(['176828287', '176841791', '202142958', '222539874', '223565464',
                   '224721631', '227675081', '30235355118', '%', '---', '.',
                   '.215694985', '0', '00', '000', '00000000000000',
                   '99999999999999', '999999999999999', ':211066980', 'D5146159',
                   'JulieGreen', 'N/a', 'NONE', 'None', 'PP - PremiumPr',
                   'T0000', 'T0000019', 'T0000022'], columns=['member_id'])
r = re.compile(r'\A((1000\d{8})|([1-9]\d{3,10}))\Z')
df['valid'] = df['member_id'].apply(lambda x: bool(r.match(x)))
# you can use df['member_id'] = df['member_id'].apply(lambda x: x if r.match(x) else 0)
# to replace invalid ids with 0
print(df)
Output:
member_id valid
0 176828287 True
1 176841791 True
2 202142958 True
3 222539874 True
4 223565464 True
5 224721631 True
6 227675081 True
7 30235355118 True
8 % False
9 --- False
10 . False
11 .215694985 False
12 0 False
13 00 False
14 000 False
15 00000000000000 False
16 99999999999999 False
17 999999999999999 False
18 :211066980 False
19 D5146159 False
20 JulieGreen False
21 N/a False
22 NONE False
23 None False
24 PP - PremiumPr False
25 T0000 False
26 T0000019 False
27 T0000022 False
Do you have a regex already made that satisfies the criteria for the data you want to replace with 0? If not, you either have to create one, or make a dictionary terms = {'N/a':0, '---':0} of the individual items you want to replace and then call .map(terms) on the series.
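A small sketch of that dictionary approach; note that .replace leaves values without a dictionary entry untouched, whereas .map would turn them into NaN:
terms = {'N/a': 0, '---': 0}
# substitutes only the listed values and keeps everything else as-is
df['member_id'] = df['member_id'].replace(terms)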
pandas has built-in string functions, which include pattern matching. So you can easily create a boolean mask that distinguishes valid from invalid IDs:
pattern = r'1000\d{6,8}$|[1-9]\d{3,8}$'
mask = df.member_id.str.match(pattern)
To print only the valid rows, just use the mask as index:
print(df[mask])
member_id
1 176828287
2 176841791
3 202142958
4 222539874
5 223565464
6 224721631
7 227675081
To set invalid data to 0, just use the complement of the mask:
df.loc[~mask] = 0
print(df)
member_id
1 176828287
2 176841791
3 202142958
4 222539874
5 223565464
6 224721631
7 227675081
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0

Pandas isin() function for continuous intervals

Let's say I want to construct a dummy variable that is true if a number is between 1 and 10, I can do:
df['numdum'] = df['number'].isin(range(1,11))
Is there a way to do that for a continuous interval? So, create a dummy variable that is true if a number is in a range, allowing for non-integers.
Series objects (including dataframe columns) have a between method:
>>> s = pd.Series(np.linspace(0, 20, 8))
>>> s
0 0.000000
1 2.857143
2 5.714286
3 8.571429
4 11.428571
5 14.285714
6 17.142857
7 20.000000
dtype: float64
>>> s.between(1, 14.5)
0 False
1 True
2 True
3 True
4 True
5 True
6 False
7 False
dtype: bool
This works:
df['numdum'] = (df.number >= 1) & (df.number <= 10)
You could also do the same thing with cut(). No real advantage if there are just two categories:
>>> df['numdum'] = pd.cut( df['number'], [-99,10,99], labels=[1,0] )
number numdum
0 8 1
1 9 1
2 10 1
3 11 0
4 12 0
5 13 0
6 14 0
But it's nice if you have multiple categories:
>>> df['numdum'] = pd.cut( df['number'], [-99,8,10,99], labels=[1,2,3] )
number numdum
0 8 1
1 9 2
2 10 2
3 11 3
4 12 3
5 13 3
6 14 3
Labels can be True and False if that is preferred, or you can omit the labels entirely, in which case each label will show its interval's cutoff points.
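For instance, a quick sketch with boolean labels, assuming the same df as above:
# cut() returns a Categorical; astype(bool) makes it a plain boolean column
df['numdum'] = pd.cut(df['number'], [-99, 10, 99], labels=[True, False]).astype(bool)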

Assign value to subset of rows in Pandas dataframe

I want to assign values based on a condition on index in Pandas DataFrame.
import numpy as np
import pandas as pd

class test():
    def __init__(self):
        self.l = 1396633637830123000
        self.dfa = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['A', 'B'],
                                index=np.arange(self.l, self.l + 10))
        self.dfb = pd.DataFrame([[self.l + 1, self.l + 3], [self.l + 6, self.l + 9]],
                                columns=['beg', 'end'])
    def update(self):
        self.dfa['true'] = False
        self.dfa['idx'] = np.nan
        for i, beg, end in zip(self.dfb.index, self.dfb['beg'], self.dfb['end']):
            self.dfa.ix[beg:end]['true'] = True
            self.dfa.ix[beg:end]['idx'] = i
    def do(self):
        self.update()
        print self.dfa

t = test()
t.do()
Result:
A B true idx
1396633637830123000 0 1 False NaN
1396633637830123001 2 3 True NaN
1396633637830123002 4 5 True NaN
1396633637830123003 6 7 True NaN
1396633637830123004 8 9 False NaN
1396633637830123005 10 11 False NaN
1396633637830123006 12 13 True NaN
1396633637830123007 14 15 True NaN
1396633637830123008 16 17 True NaN
1396633637830123009 18 19 True NaN
The true column is correctly assigned, while the idx column is not. Futhermore, this seems to depend on how the columns are initialized because if I do:
def update(self):
    self.dfa['true'] = False
    self.dfa['idx'] = False
also the true column does not get properly assigned.
What am I doing wrong?
p.s. the expected result is:
A B true idx
1396633637830123000 0 1 False NaN
1396633637830123001 2 3 True 0
1396633637830123002 4 5 True 0
1396633637830123003 6 7 True 0
1396633637830123004 8 9 False NaN
1396633637830123005 10 11 False NaN
1396633637830123006 12 13 True 1
1396633637830123007 14 15 True 1
1396633637830123008 16 17 True 1
1396633637830123009 18 19 True 1
Edit: I tried assigning using both loc and iloc but it doesn't seem to work:
loc:
self.dfa.loc[beg:end]['true'] = True
self.dfa.loc[beg:end]['idx'] = i
iloc:
self.dfa.loc[self.dfa.index.get_loc(beg):self.dfa.index.get_loc(end)]['true'] = True
self.dfa.loc[self.dfa.index.get_loc(beg):self.dfa.index.get_loc(end)]['idx'] = i
You are chain indexing, see here. The warning is not guaranteed to happen.
You should prob just do this. No real need to actually track the index in b, btw.
In [44]: dfa = pd.DataFrame(np.arange(20).reshape(10,2), columns = ['A', 'B'], index = np.arange(l,l+10))
In [45]: dfb = pd.DataFrame([[l+1,l+3], [l+6,l+9]], columns = ['beg', 'end'])
In [46]: dfa['in_b'] = False
In [47]: for i, s in dfb.iterrows():
   ....:     dfa.loc[s['beg']:s['end'], 'in_b'] = True
   ....:
or this if you have non-integer dtypes
In [36]: for i, s in dfb.iterrows():
   ....:     dfa.loc[(dfa.index >= s['beg']) & (dfa.index <= s['end']), 'in_b'] = True
In [48]: dfa
Out[48]:
A B in_b
1396633637830123000 0 1 False
1396633637830123001 2 3 True
1396633637830123002 4 5 True
1396633637830123003 6 7 True
1396633637830123004 8 9 False
1396633637830123005 10 11 False
1396633637830123006 12 13 True
1396633637830123007 14 15 True
1396633637830123008 16 17 True
1396633637830123009 18 19 True
[10 rows x 3 columns]
If b is HUGE this might not be THAT performant.
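If you also want the idx column from the expected result, the same non-chained .loc pattern works for it; a sketch:
dfa['idx'] = np.nan
for i, s in dfb.iterrows():
    # a single .loc call selects rows and column together, so the assignment
    # reaches the original frame rather than a temporary copy
    dfa.loc[s['beg']:s['end'], 'idx'] = i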
As an aside, these look like nanosecond times. They can be made friendlier to work with by converting them.
In [49]: pd.to_datetime(dfa.index)
Out[49]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-04-04 17:47:17.830123, ..., 2014-04-04 17:47:17.830123009]
Length: 10, Freq: None, Timezone: None
