Python DataFrame: calculate distance via an intermediate point

I have a pandas DataFrame with distances listed as follows:
import pandas as pd

data = {'from': ['A', 'A', 'A', 'B', 'B', 'D', 'D', 'D'],
        'to': ['B', 'C', 'D', 'C', 'E', 'B', 'C', 'E'],
        'distances': [4, 3, 1, 1, 3, 4, 2, 9]}
df = pd.DataFrame.from_dict(data)
I want to enumerate all of the distances for:
from point1 ==> point2
where distance(point1 ==> point2) = distance(point1 ==> B) + distance(B ==> point2),
i.e. every route that passes through B.
How can I do this efficiently in Python? I assume some kind of pd.merge?
I would then like to reformat the dataframe into the following columns:
columns = ['From', 'To', 'Distance', 'Distance via B']

If you're looking for routes of length 3 (two legs), here's a solution. Note that in some cases the direct route is shorter than the route via an intermediate point (e.g. A to C directly is 3, while A-B-C is 5):
# self-join: rows where the first leg's "to" matches the second leg's "from"
three_route = pd.merge(df, df, left_on="to", right_on="from")
# the total distance is the sum of the two legs
three_route["distance"] = three_route.distances_x + three_route.distances_y
three_route = (three_route[["from_x", "to_x", "to_y", "distance"]]
               .rename(columns={"from_x": "from", "to_x": "via", "to_y": "to"}))
The result is:
from via to distance
0 A B C 5
1 A B E 7
2 D B C 5
3 D B E 7
4 A D B 5
5 A D C 3
6 A D E 10
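To get the layout the question asks for (columns ['From', 'To', 'Distance', 'Distance via B']), one possible follow-up is to keep the via-B routes and merge them back onto the direct distances. A minimal sketch building on three_route above (this part is my addition, not from the original answer; pairs with no direct edge get NaN in Distance):
via_b = three_route[three_route["via"] == "B"][["from", "to", "distance"]]
via_b = via_b.rename(columns={"distance": "Distance via B"})

# outer merge keeps pairs that only exist via B (e.g. A to E)
result = (df.rename(columns={"from": "From", "to": "To", "distances": "Distance"})
            .merge(via_b.rename(columns={"from": "From", "to": "To"}),
                   on=["From", "To"], how="outer"))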


Replace contents of cell with another cell if condition on a separate cell is met

I have the following data frame:
import pandas as pd

A = [1, 2, 5, 4, 3, 1]
B = ['yes', 'No', 'hello', 'yes', 'no', 'why']
C = [1, 0, 1, 1, 0, 0]
D = ['y', 'n', 'y', 'y', 'n', 'n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D': D})
We can see 4 columns A, B, C, D. The intended outcome is to replace the contents of B with the contents of D whenever a condition on C is met; for this example, the condition is C == 1. The intended output is:
A = [1, 2, 5, 4, 3, 1]
B = ['y', 'No', 'y', 'y', 'no', 'why']
C = [1, 0, 1, 1, 0, 0]
D = ['y', 'n', 'y', 'y', 'n', 'n']
output_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D': D})
output_df = output_df.drop('D', axis=1)
What is the best way to apply this logic to a data frame?
There are many ways to solve this; here is one using Series.mask:
test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
This can be done with np.where:
import numpy as np

test_df['B'] = np.where(test_df['C'] == 1, test_df['D'], test_df['B'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
The desired output is achieved using .loc with column 'C' as the mask.
test_df.loc[test_df['C']==1,'B'] = test_df.loc[test_df['C']==1,'D']
UPDATE: just found out a similar answer was posted by @QuangHoang. This answer is slightly different in that it does not require numpy.
I don't know if "inverse" is the right word here, but I noticed recently that mask and where are inverses of each other: if you negate the condition of a .where statement with ~, you get the same result as mask:
A = [1,2,5,4,3,1]
B = ["yes","No","hello","yes","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
test_df['B'] = test_df['B'].where(~(test_df['C'] == 1), test_df['D'])
# test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D']) - Scott Boston's answer
test_df
Out[1]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
You can also use where the other way around, selecting from D where the condition holds and from B elsewhere:
test_df['B'] = test_df['D'].where(test_df.C.eq(1), test_df.B)
Output:
In [875]: test_df
Out[875]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n

Count re-occurrence of a value in Python

I have a data set which contains something like this:
SNo Cookie
1 A
2 A
3 A
4 B
5 C
6 D
7 A
8 B
9 D
10 E
11 D
12 A
So let's say we have 5 cookies 'A, B, C, D, E'. Now I want to count how often a cookie re-occurs after a different cookie was encountered. For example, in the data above, cookie A is encountered again at the 7th place and then at the 12th place. NOTE: we wouldn't count A at the 2nd or 3rd place, as those are consecutive occurrences, but at positions 7 and 12 we had seen other cookies before seeing A again, hence we count those instances. So essentially I want something like this:
Sno Cookie Count
1 A 2
2 B 1
3 C 0
4 D 2
5 E 0
Can anyone give me the logic or Python code for this?
One way to do this would be to first get rid of consecutive cookies, then find where each cookie has been seen before using duplicated, and finally group by cookie and take the sum:
no_doubles = df[df.Cookie != df.Cookie.shift()].copy()  # drop consecutive repeats
no_doubles['dups'] = no_doubles.Cookie.duplicated()     # True if this cookie was seen before
no_doubles.groupby('Cookie').dups.sum()
This gives you:
Cookie
A 2.0
B 1.0
C 0.0
D 2.0
E 0.0
Name: dups, dtype: float64
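If you want this in the tabular Sno/Cookie/Count layout from the question, with integer counts, a small follow-up sketch (my addition, building on no_doubles above):
counts = (no_doubles.groupby('Cookie').dups.sum()
          .astype(int)                    # boolean sums may come back as floats
          .reset_index(name='Count'))
counts.insert(0, 'Sno', range(1, len(counts) + 1))  # running row number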
Start by removing consecutive duplicates, then count the survivors:
no_dups = df[df.Cookie != df.Cookie.shift()]  # borrowed from @sacul
no_dups.groupby('Cookie').count() - 1
# SNo
#Cookie
#A 2
#B 1
#C 0
#D 2
#E 0
pandas.factorize and numpy.bincount
The idea: if immediately repeated values are not to be counted, remove them first; then do a normal count of the values on what's left. That is one more than what is asked for, so subtract one. In short: factorize the cookies, filter out immediate repeats with a mask, bincount the rest, and produce a pandas.Series:
import numpy as np

i, r = pd.factorize(df.Cookie)            # integer codes and the unique cookies
mask = np.append(True, i[:-1] != i[1:])   # True where not an immediate repeat
cnts = np.bincount(i[mask]) - 1           # occurrence count minus the first sighting
pd.Series(cnts, r)
A 2
B 1
C 0
D 2
E 0
dtype: int64
pandas.value_counts
Zip the cookie list with its lagged self, keeping only the non-repeats:
c = df.Cookie.tolist()
pd.value_counts([a for a, b in zip(c, [None] + c) if a != b]).sort_index() - 1
A 2
B 1
C 0
D 2
E 0
dtype: int64
defaultdict
from collections import defaultdict

def count(s):
    d = defaultdict(lambda: -1)  # start at -1 so the first sighting is not counted
    x = None
    for y in s:
        d[y] += y != x           # add 1 only when different from the previous value
        x = y
    return pd.Series(d)

count(df.Cookie)
A 2
B 1
C 0
D 2
E 0
dtype: int64

Change the values of a column after having used groupby on another column (pandas dataframe)

I have two data frames, one with the coordinates of places:
import numpy as np
import pandas as pd

coord = pd.DataFrame()
coord['Index'] = ['A', 'B', 'C']
coord['x'] = np.random.random(coord.shape[0])
coord['y'] = np.random.random(coord.shape[0])
coord
Index x y
0 A 0.888025 0.376416
1 B 0.052976 0.396243
2 C 0.564862 0.30138
and one with several values measured in the places
df = pd.DataFrame()
df['Index'] = ['A','A','B','B','B','C','C','C','C']
df['Value'] = np.random.random(df.shape[0])
df
Index Value
0 A 0.930298
1 A 0.144550
2 B 0.393952
3 B 0.680941
4 B 0.657807
5 C 0.704954
6 C 0.733328
7 C 0.099785
8 C 0.871678
I want to find an efficient way of assigning the coordinates to the df data frame. For the moment I have tried:
df['x'] = np.zeros(df.shape[0])
df['y'] = np.zeros(df.shape[0])
for i in df.Index.unique():
    df.loc[df.Index == i, 'x'] = coord.loc[coord.Index == i, 'x'].values
    df.loc[df.Index == i, 'y'] = coord.loc[coord.Index == i, 'y'].values
which works and yields
Index Value x y
0 A 0.220323 0.983739 0.121289
1 A 0.115075 0.983739 0.121289
2 B 0.432688 0.809586 0.639811
3 B 0.106178 0.809586 0.639811
4 B 0.259465 0.809586 0.639811
5 C 0.804018 0.827192 0.156095
6 C 0.552053 0.827192 0.156095
7 C 0.412345 0.827192 0.156095
8 C 0.235106 0.827192 0.156095
but this is quite sloppy and highly inefficient. I tried to use a groupby operation like this:
df['x'] = np.zeros(df.shape[0])
df['y'] = np.zeros(df.shape[0])
gb = df.groupby('Index')
for k in gb.groups.keys():
    gb.get_group(k)['x'] = coord.loc[coord.Index == k, 'x']
    gb.get_group(k)['y'] = coord.loc[coord.Index == k, 'y']
but I get this error here
/anaconda/lib/python2.7/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I understand the problem, but I don't know how to overcome it. Any suggestions?
merge is what you're looking for.
df
Index Value
0 A 0.930298
1 A 0.144550
2 B 0.393952
3 B 0.680941
4 B 0.657807
5 C 0.704954
6 C 0.733328
7 C 0.099785
8 C 0.871678
coord
Index x y
0 A 0.888025 0.376416
1 B 0.052976 0.396243
2 C 0.564862 0.301380
df.merge(coord, on='Index')
Index Value x y
0 A 0.930298 0.888025 0.376416
1 A 0.144550 0.888025 0.376416
2 B 0.393952 0.052976 0.396243
3 B 0.680941 0.052976 0.396243
4 B 0.657807 0.052976 0.396243
5 C 0.704954 0.564862 0.301380
6 C 0.733328 0.564862 0.301380
7 C 0.099785 0.564862 0.301380
8 C 0.871678 0.564862 0.301380
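If you'd rather avoid building a merged copy, an alternative sketch uses Series.map with coord indexed by Index (my addition, not part of the original answer):
lookup = coord.set_index('Index')
df['x'] = df['Index'].map(lookup['x'])  # look up each row's x by its Index label
df['y'] = df['Index'].map(lookup['y'])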

Converting a paired table to a binary table

I wouldn't be posting this if I hadn't done extensive research in an attempt to find the answer. Alas, I have not been able to find one. I have a paired dataset that looks something like this:
PERSON, ATTRIBUTE
person1, a
person1, b
person1, c
person1, d
person2, c
person2, d
person2, x
person3, a
person3, b
person3, e
person3, f
What I want to do is: 1) drop attributes that don't appear more than 10 times, 2) turn it into a binary table that would look something like this:
a b c
person1 1 1 1
person2 0 0 1
person3 1 1 0
So far, I have put together a script to drop the attributes that appear fewer than 10 times; however, it is painfully slow, as it has to go through each attribute, determine its frequency, and find the corresponding x and y values to append to new variables.
import pandas as pd
from collections import Counter
import time

df = pd.read_csv(filepath_or_buffer='sample.csv', sep=',')
x = df.iloc[:, 1].values
y = df.iloc[:, 0].values
x_vals = []
y_vals = []
counter = Counter(x)
start_time = time.time()
for each in counter:
    if counter[each] >= 10:
        for i, j in enumerate(x):
            if j == each:
                print("Adding position: " + str(i))
                x_vals.append(each)
                y_vals.append(y[i])
print("Time taken: %s" % (time.time() - start_time))
I would love some help with 1) finding a faster way to match attributes that appear at least 10 times and append the values to new variables,
OR
2) an alternative method entirely to get the final binary table. I feel like converting a paired table to a binary table is probably a common task in the data world, yet I couldn't find any code, module, etc. that could help with doing that.
Thanks a million!
I would probably add a dummy column and then call pivot_table:
>>> df = pd.DataFrame({"PERSON": ["p1", "p2", "p3"] * 10, "ATTRIBUTE": np.random.choice(["a","b","c","d","e","f","x"], 30)})
>>> df.head()
ATTRIBUTE PERSON
0 d p1
1 b p2
2 x p3
3 b p1
4 f p2
>>> df["count"] = 1
>>> p = df.pivot_table(index="PERSON", columns="ATTRIBUTE", values="count",
aggfunc=sum, fill_value=0)
>>> p
ATTRIBUTE a b c d e f x
PERSON
p1 1 3 1 1 1 0 3
p2 2 1 1 2 1 2 1
p3 0 4 1 1 2 0 2
And then we can select only the attributes with at least a minimum number of occurrences (5 here, to suit my smaller random example):
>>> p.loc[:, p.sum() >= 5]
ATTRIBUTE b x
PERSON
p1 3 3
p2 1 1
p3 4 2
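Note the pivot holds counts, while the question asks for a 0/1 table. A hedged follow-up sketch (my addition): clip the filtered counts to binary, or build the counts directly with pd.crosstab instead of the dummy column:
binary = (p.loc[:, p.sum() >= 5] > 0).astype(int)  # 1 if the person has the attribute at all

# same counts without the dummy column
ct = pd.crosstab(df["PERSON"], df["ATTRIBUTE"])
binary2 = (ct.loc[:, ct.sum() >= 5] > 0).astype(int)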

Calculate within categories: Equivalent of R's ddply in Python?

I have some R code I need to port to Python. However, R's magic data.frame and ddply are keeping me from finding a good way to do this in Python.
Sample data (R):
x <- data.frame(d=c(1,1,1,2,2,2),c=c(rep(c('a','b','c'),2)),v=1:6)
Sample computation:
y <- ddply(x, 'd', transform, v2=(v-min(v))/(max(v)-min(v)))
Sample output:
d c v v2
1 1 a 1 0.0
2 1 b 2 0.5
3 1 c 3 1.0
4 2 a 4 0.0
5 2 b 5 0.5
6 2 c 6 1.0
So here's my question for the pythonistas out there: how would you do the same? You have a data structure with a couple of important dimensions.
For each (d) group, compute (v - min(v)) / (max(v) - min(v)) and associate the result with the corresponding (d, c) pair.
Feel free to use whatever data structures you want, so long as they're quick on reasonably large datasets (those that fit in memory).
Indeed pandas is the right (and only, I believe) tool for this in Python. It's a bit less magical than plyr, but here's how to do it using the groupby functionality:
import numpy as np
import pandas as pd

df = pd.DataFrame({'d': [1., 1., 1., 2., 2., 2.],
                   'c': np.tile(['a', 'b', 'c'], 2),
                   'v': np.arange(1., 7.)})
# in IPython
In [34]: df
Out[34]:
c d v
0 a 1 1
1 b 1 2
2 c 1 3
3 a 2 4
4 b 2 5
5 c 2 6
Now write a small transform function:
def f(group):
    v = group['v']
    group['v2'] = (v - v.min()) / (v.max() - v.min())
    return group
Note that this also handles NAs since the v variable is a pandas Series object.
Now group by the d column and apply f:
In [36]: df.groupby('d').apply(f)
Out[36]:
c d v v2
0 a 1 1 0
1 b 1 2 0.5
2 c 1 3 1
3 a 2 4 0
4 b 2 5 0.5
5 c 2 6 1
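In current pandas the same result can be written more concisely with groupby(...).transform, which returns an aligned column directly; a small sketch (my addition, not part of the original answer):
df['v2'] = df.groupby('d')['v'].transform(lambda v: (v - v.min()) / (v.max() - v.min()))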
Sounds like you want pandas and groupby or aggregate.
You can also achieve better performance using numpy and scipy.
Despite some ugly code it will be faster: the pandas way will be slow if the number of groups is very large, and may even be worse than R. This will always be faster than R:
import numpy as np
import numpy.lib.recfunctions
from scipy import ndimage

x = np.rec.fromarrays(([1, 1, 1, 2, 2, 2], ['a', 'b', 'c'] * 2, list(range(1, 7))),
                      names='d,c,v')
unique, groups = np.unique(x['d'], False, True)  # unique labels and per-row group ids
uniques = list(range(unique.size))
mins = ndimage.minimum(x['v'], groups, uniques)[groups]  # per-group min, broadcast back to rows
maxs = ndimage.maximum(x['v'], groups, uniques)[groups]  # per-group max, broadcast back to rows
x2 = np.lib.recfunctions.append_fields(x, 'v2', (x['v'] - mins) / (maxs - mins + 0.0))
# save as csv (fmt='%s' handles the mixed numeric/string fields)
np.savetxt('file.csv', x2, delimiter=';', fmt='%s')
