Showing Before and After versions of a dataframe using Pandas - python

------Original Data----
Before:
   Speak English  Length currentCode    currentName
0           True       1          $A            USA
1           True       2         $AM  Massachusetts
2           True       3        $AMB         Boston
3           True       3        $AMS    Springfield
4           True       3        $AMA        Amherst
5           True       3        $AMP       Plymouth
6          False       1          $D        Germany
7          False       2         $DB    Brandenburg
8          False       3        $DBB         Berlin
9          False       3        $DBD        Dresden
After:
   Speak English  Length futureCode  futureName
0           True       1         $A     America
1           True       2        $AM       Maine
2           True       3       $AMC    Brockton
3           True       3       $AMM        Main
4          False       1         $D     Denmark
5          False       2        $DC  Copenhagen
6          False       3       $DCC      Copper
7          False       3       $DCD     Dresden
Goal:
Note: The goal is in the form of a pivot table in Excel. My code:
import pandas as pd
before = pd.read_excel(r'/Users/BoBoMann/Desktop/Sequence.xlsx',sheet_name='Before')
after = pd.read_excel(r'/Users/BoBoMann/Desktop/Sequence.xlsx',sheet_name='After')
Attempt #1: Concatenate, but I do not know how to set the index to Speak English and Length afterward:
pd.concat([before, after], axis=1, keys=['Before', 'After'], join='outer')
Attempt #2: Set the index for each dataframe, but then I cannot concatenate along the columns because Pandas raises ValueError: cannot handle a non-unique multi-index!
before = before.set_index(['Speak English', 'Length']).sort_index(axis=0)
after = after.set_index(['Speak English', 'Length']).sort_index(axis=0)
pd.concat([before, after], axis=1, keys=['Current', 'Future'], join='outer')
Thank you so much for your help!
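One way around the non-unique MultiIndex error in Attempt #2 is to make the index unique before concatenating, e.g. by appending a within-group counter from cumcount() as a third index level. A minimal sketch with made-up stand-in frames (the keyed helper and the sample rows are illustrative, not read from the question's Excel file):

```python
import pandas as pd

# Hypothetical stand-ins for the 'Before' and 'After' sheets
before = pd.DataFrame({
    'Speak English': [True, True, False],
    'Length': [1, 2, 1],
    'currentCode': ['$A', '$AM', '$D'],
    'currentName': ['USA', 'Massachusetts', 'Germany'],
})
after = pd.DataFrame({
    'Speak English': [True, True, False],
    'Length': [1, 2, 1],
    'futureCode': ['$A', '$AM', '$D'],
    'futureName': ['America', 'Maine', 'Denmark'],
})

def keyed(df):
    # cumcount() numbers the duplicates within each (Speak English, Length)
    # group, so adding it as a third index level makes the index unique
    out = df.copy()
    out['n'] = out.groupby(['Speak English', 'Length']).cumcount()
    return out.set_index(['Speak English', 'Length', 'n'])

result = pd.concat([keyed(before), keyed(after)],
                   axis=1, keys=['Current', 'Future'], join='outer')
```

With a unique three-level index on both sides, the column-wise outer concat no longer raises, and rows that exist on only one side simply show NaN under the other key.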

Related

Checking if values of a pandas Dataframe are between two lists. Adding a boolean column

I am trying to add a new boolean column (False/True) to a pandas DataFrame, which reflects whether the value is between two data points from another file.
I have two files which give the following info:
File A: (x)          File B: (y)
   'time'               'time_A'  'time_B'
0       1            0         1         3
1       3            1         5         6
2       5            2         8        10
3       7
4       9
5      11
6      13
I tried to do it with the .map function; however, it gives True and False for each event, not one column.
x['Event'] = x['time'].map((lambda x: x < y['time_A']), (lambda x: x > y['time_B']))
This would be the expected result:
File A:
   'time'  'Event'
0       1     True
1       3     True
2       5     True
3       7    False
4       9     True
5      11    False
6      13    False
However, what I get is something like this:
File A:
   'time'
0       1    0    True
             1    True
             2    True
             Name: 1, dtype: bool
2       3    0    True
             1    True
             2    True
             Name: 1, dtype: bool
This should do it:
(x.assign(key=1)
  .merge(y.assign(key=1), on='key')
  .drop('key', axis=1)
  .assign(Event=lambda v: (v['time_A'] <= v['time']) &
                          (v['time'] <= v['time_B']))
  .groupby('time', as_index=False)['Event']
  .any())
time Event
0 1 True
1 3 True
2 5 True
3 7 False
4 9 True
5 11 False
6 13 False
Use pd.IntervalIndex here:
idx = pd.IntervalIndex.from_arrays(B['time_A'], B['time_B'], closed='both')
# output -> IntervalIndex([[1, 3], [5, 6], [8, 10]], closed='both', dtype='interval[int64]')
A['Event'] = B.set_index(idx).reindex(A['time']).notna().all(axis=1).to_numpy()
print(A)
time Event
0 1 True
1 3 True
2 5 True
3 7 False
4 9 True
5 11 False
6 13 False
One liner:
A['Event'] = sum(A.time.between(b.time_A, b.time_B) for _, b in B.iterrows()) > 0
Explanation:
For each row b of the B dataframe, A.time.between(b.time_A, b.time_B) returns a boolean Series indicating whether time is between time_A and time_B.
sum(list_of_boolean_series) > 0 is an elementwise OR across those Series.
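The same interval test can also be sketched with plain NumPy broadcasting, comparing every time against every interval at once (a sketch assuming the small A/B frames from the question):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({'time': [1, 3, 5, 7, 9, 11, 13]})
B = pd.DataFrame({'time_A': [1, 5, 8], 'time_B': [3, 6, 10]})

# Compare every time (column vector) against every interval (row vectors);
# a time is an Event if it falls inside any interval
t = A['time'].to_numpy()[:, None]                              # shape (7, 1)
inside = (t >= B['time_A'].to_numpy()) & (t <= B['time_B'].to_numpy())
A['Event'] = inside.any(axis=1)
```

This avoids both the cross-join and the Python-level loop over B's rows, at the cost of materializing a (len(A), len(B)) boolean array.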

Using python df.replace with dict does not permanently change values

I generated a DataFrame that includes a column called "pred_categories" with numerical values of 0, 1, 2, and 3. See below:
fileids pred_categories
0 /Saf/DA192069.txt 3
1 /Med/DA000038.txt 2
2 /Med/DA000040.txt 2
3 /Saf/DA191905.txt 3
4 /Med/DA180730.txt 2
I wrote a dict:
di = {3: "SAF", 2: "MED", 1: "FAC", 0: "ENV"}
And it works at first:
df.replace({'pred_categories': di})
Out[16]:
fileids pred_categories
0 /Saf/DA192069.txt SAF
1 /Med/DA000038.txt MED
2 /Med/DA000040.txt MED
3 /Saf/DA191905.txt SAF
4 /Med/DA180730.txt MED
5 /Saf/DA192307.txt SAF
6 /Env/DA178021.txt ENV
7 /Fac/DA358334.txt FAC
8 /Env/DA178049.txt ENV
9 /Env/DA178020.txt ENV
10 /Env/DA178031.txt ENV
11 /Med/DA000050.txt MED
12 /Med/DA180720.txt MED
13 /Med/DA000010.txt MED
14 /Fac/DA358391.txt FAC
but then when checking
df.head()
it doesn't seem to permanently "save" it in the DataFrame. Any pointers on what I'm doing wrong?
print(df)
fileids pred_categories
0 /Saf/DA192069.txt 3
1 /Med/DA000038.txt 2
2 /Med/DA000040.txt 2
3 /Saf/DA191905.txt 3
4 /Med/DA180730.txt 2
5 /Saf/DA192307.txt 3
6 /Env/DA178021.txt 0
7 /Fac/DA358334.txt 1
8 /Env/DA178049.txt 0
9 /Env/DA178020.txt 0
10 /Env/DA178031.txt 0
11 /Med/DA000050.txt 2
12 /Med/DA180720.txt 2
13 /Med/DA000010.txt 2
14 /Fac/DA358391.txt 1
By default, .replace() returns the changed DataFrame but does not modify it in place, so you have to do it this way:
df = df.replace({'pred_categories': di})
or
df.replace({'pred_categories': di}, inplace=True)
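An equivalent alternative, shown here as a sketch on a small stand-in frame, is Series.map with the same dict, assigning the result back to the column:

```python
import pandas as pd

# Hypothetical stand-in for the question's DataFrame
df = pd.DataFrame({'pred_categories': [3, 2, 2, 3, 2]})
di = {3: "SAF", 2: "MED", 1: "FAC", 0: "ENV"}

# map() translates each value through the dict; assigning the result
# back to the column is what persists the change
df['pred_categories'] = df['pred_categories'].map(di)
```

One caveat: unlike replace(), map() turns any value missing from the dict into NaN, so it only behaves identically when the dict covers every value in the column.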

Adding a count to prior cell value in Pandas

In Pandas, I am looking to add a value in one column 'B' depending on the boolean values in another column 'A': when 'A' is True, start counting (i.e. add one on each new line) for as long as 'A' is False; when 'A' is True again, reset and start counting anew. I managed to do this with a for loop, but it is very time consuming. Is there a more time-efficient solution?
the result should look like this:
Date A B
01.2010 False 0
02.2010 True 1
03.2010 False 2
04.2010 False 3
05.2010 True 1
06.2010 False 2
You can use cumsum with groupby and cumcount:
print(df)
Date A
0 1.201 False
1 1.201 True
2 1.201 False
3 2.201 True
4 3.201 False
5 4.201 False
6 5.201 True
7 6.201 False
roll = df.A.cumsum()
print(roll)
0 0
1 1
2 1
3 2
4 2
5 2
6 3
7 3
Name: A, dtype: int32
df['B'] = df.groupby(roll).cumcount() + 1
#if in first values are False, output is 0
df.loc[roll == 0 , 'B'] = 0
print(df)
Date A B
0 1.201 False 0
1 1.201 True 1
2 1.201 False 2
3 2.201 True 1
4 3.201 False 2
5 4.201 False 3
6 5.201 True 1
7 6.201 False 2
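Applied to the exact example table from the question, the same cumsum/cumcount approach reproduces the expected B column (a sketch; the Date column is omitted since it does not affect the logic):

```python
import pandas as pd

df = pd.DataFrame({'A': [False, True, False, False, True, False]})

group = df['A'].cumsum()                # a new group id starts at each True
df['B'] = df.groupby(group).cumcount() + 1
df.loc[group == 0, 'B'] = 0             # rows before the first True stay 0
```

The whole computation is vectorized, so it avoids the slow Python-level for loop the question mentions.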
Thanks, I got the solution from another post similar to this:
rolling_count = 0

def set_counter(val):
    global rolling_count
    if val == False:
        rolling_count += 1
    else:
        rolling_count = 1
    return rolling_count

df['B'] = df['A'].map(set_counter)

Tabulate data frame with groupby and size methods

I have a Pandas dataframe, out, on which I am computing counts by the columns city and raingarden using the following:
out.groupby(['city','raingarden']).size()
with the output:
city raingarden
55405 True 3
Edina True 7
MInneapolis True 8
Minneapolis False 2482
True 847
Minneapolis False 2
True 1
Minneapolis, True 1
Minneapolis, False 2
Minneapolsi False 5
True 3
Minnepolis False 4
Minnespolis False 4
Minnetonka False 1
True 2
Minnneapolis False 5
Mpla True 3
Mpls False 22
True 20
Mpls. False 8
True 17
NE Mpls True 6
Richfield True 1
SLP True 3
St Louis Park True 2
St. Louis Park False 1
Victoria False 1
Wayzata False 2
True 1
minneapolis False 3
mpls True 2
dtype: int64
I want to take this and output it to a tabulate table.
To do this, I did the following:
headers = ['city','has rain garden', 'n']
print(tabulate(out.groupby(['city','raingarden']).size().to_frame(), headers, tablefmt="simple"))
Issue 1: I need to get a column name on the counts, but have not had any luck;
Issue 2 (which is probably related to issue 1), the output looks like this:
city has rain garden
-------------------------- -----------------
(u'55405', True) 3
(u'Edina', True) 7
(u'MInneapolis', True) 8
(u'Minneapolis', False) 2482
(u'Minneapolis', True) 847
(u'Minneapolis ', False) 2
(u'Minneapolis ', True) 1
(u'Minneapolis,', True) 1
(u'Minneapolis, ', False) 2
(u'Minneapolsi', False) 5
(u'Minneapolsi', True) 3
(u'Minnepolis', False) 4
(u'Minnespolis', False) 4
(u'Minnetonka', False) 1
(u'Minnetonka', True) 2
(u'Minnneapolis', False) 5
(u'Mpla', True) 3
(u'Mpls', False) 22
(u'Mpls', True) 20
(u'Mpls.', False) 8
(u'Mpls.', True) 17
(u'NE Mpls', True) 6
(u'Richfield', True) 1
(u'SLP', True) 3
(u'St Louis Park', True) 2
(u'St. Louis Park', False) 1
(u'Victoria', False) 1
(u'Wayzata', False) 2
(u'Wayzata', True) 1
(u'minneapolis', False) 3
(u'mpls', True) 2
The first two columns are given as a tuple. How do I split these into separate columns, and how do I add a label for my counts? I am sure what I am trying to achieve should be simpler than this.
By grouping by two columns, you are creating a Series with a multi-level index, which I believe is not what you want. I am not sure how the original data looks (it would be nice to provide out.head() in the question), but I believe what you are looking for is:
out.groupby('city').sum()['raingarden']
Here's an example with some randomly generated data:
import random
import string
import pandas as pd
import numpy as np

city = random.sample(string.ascii_lowercase * 500, 100)
raingarden = np.random.randint(0, 10, 100)
out = pd.DataFrame({'city': city, 'raingarden': raingarden})
Output:
In [30]: out.groupby('city').sum()['raingarden']
Out[30]:
city
a 17
b 7
c 16
d 8
e 24
f 28
g 16
h 49
i 29
j 24
k 4
l 5
m 17
n 29
p 22
q 14
r 19
s 6
t 21
u 8
v 18
w 25
x 11
y 9
z 40
Name: raingarden, dtype: int64
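Alternatively, to keep the size() counts from the question and fix the tuple problem directly, the multi-level index can be flattened with reset_index, naming the count column in the process. A sketch on a small made-up out frame (the real data comes from elsewhere):

```python
import pandas as pd

# Hypothetical stand-in for the question's out frame
out = pd.DataFrame({
    'city': ['Edina', 'Minneapolis', 'Minneapolis', 'Mpls'],
    'raingarden': [True, False, True, True],
})

# reset_index(name='n') turns the grouped Series back into a flat frame
# with ordinary columns, which tabulate can render as separate columns
counts = out.groupby(['city', 'raingarden']).size().reset_index(name='n')
```

counts now has the flat columns city, raingarden, and n, so something like print(tabulate(counts, headers=['city', 'has rain garden', 'n'], tablefmt='simple', showindex=False)) should print three proper columns instead of index tuples.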

Pandas isin() function for continuous intervals

Let's say I want to construct a dummy variable that is true if a number is between 1 and 10. I can do:
df['numdum'] = df['number'].isin(range(1,11))
Is there a way to do that for a continuous interval? So, create a dummy variable that is true if a number is in a range, allowing for non-integers.
Series objects (including dataframe columns) have a between method:
>>> s = pd.Series(np.linspace(0, 20, 8))
>>> s
0 0.000000
1 2.857143
2 5.714286
3 8.571429
4 11.428571
5 14.285714
6 17.142857
7 20.000000
dtype: float64
>>> s.between(1, 14.5)
0 False
1 True
2 True
3 True
4 True
5 True
6 False
7 False
dtype: bool
This works:
df['numdum'] = (df.number >= 1) & (df.number <= 10)
You could also do the same thing with cut(). No real advantage if there are just two categories:
>>> df['numdum'] = pd.cut(df['number'], [-99, 10, 99], labels=[1, 0])
number numdum
0 8 1
1 9 1
2 10 1
3 11 0
4 12 0
5 13 0
6 14 0
But it's nice if you have multiple categories:
>>> df['numdum'] = pd.cut(df['number'], [-99, 8, 10, 99], labels=[1, 2, 3])
number numdum
0 8 1
1 9 2
2 10 2
3 11 3
4 12 3
5 13 3
6 14 3
Labels can be True and False if that is preferred, or you can omit the labels entirely, in which case the labels will contain info on the cutoff points.
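Putting the thread's suggestions together, a minimal runnable sketch of the between() approach on non-integer data (the sample numbers are made up):

```python
import pandas as pd

df = pd.DataFrame({'number': [0.5, 3.7, 10.0, 12.2]})

# between() is inclusive on both ends by default and handles
# non-integer bounds and values directly, unlike isin(range(...))
df['numdum'] = df['number'].between(1, 10)
```

The inclusive parameter of between() can be set to 'left', 'right', or 'neither' if one or both endpoints should be excluded.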
