Join elements by iterating through the data - python

I have some data in the form:
ID A B VALUE EXPECTED RESULT
1 1 2 5 GROUP1
2 2 3 5 GROUP1
3 3 4 6 GROUP2
4 3 5 5 GROUP1
5 6 4 5 GROUP3
What I want to do is iterate through the data (thousands of rows) and create a common field so that I can join the data easily. (A is the start node, B is the end node, and VALUE is the order. The data form something like a chain in which only neighbors share a common A or B.)
Rules for joining:
1. All elements of a group have an equal value.
2. A of one element equals B of another (or the opposite, but NOT A=A' or B=B').
3. The most difficult one: assign to the same group all sequential data that form a series of intersecting nodes. That is, the first element [1 1 2 5] has to be joined with [2 2 3 5] and then with [4 3 5 5].
Any idea how to accomplish this robustly when iterating through a large amount of data? I have a problem with rule number 3; the others are easily applied. For limited data I have some success, but the result depends on the order in which I start examining the data, and it doesn't work for the large dataset.
I can use arcpy (preferably), or even Python, R, or Matlab, to solve this. I have tried arcpy with no success, so I am checking alternatives.
In ArcPy this code works, but only to a limited extent (i.e. in large features with many segments I get 3-4 groups instead of 1):
import arcpy

TheShapefile="c:/Temp/temp.shp"
desc = arcpy.Describe(TheShapefile)
flds = desc.fields
fldin = 'no'
for fld in flds: #Check if new field exists
    if fld.name == 'new':
        fldin = 'yes'
if fldin!='yes': #If not create
    arcpy.AddField_management(TheShapefile, "new", "SHORT")
arcpy.CalculateField_management(TheShapefile,"new",'!FID!', "PYTHON_9.3") # Copy FID to new
with arcpy.da.SearchCursor(TheShapefile, ["FID","NODE_A","NODE_B","ORDER_","new"]) as TheSearch:
    for SearchRow in TheSearch:
        if SearchRow[1]==SearchRow[4]:
            Outer_FID=SearchRow[0]
        else:
            Outer_FID=SearchRow[4]
        Outer_NODEA=SearchRow[1]
        Outer_NODEB=SearchRow[2]
        Outer_ORDER=SearchRow[3]
        Outer_NEW=SearchRow[4]
        with arcpy.da.UpdateCursor(TheShapefile, ["FID","NODE_A","NODE_B","ORDER_","new"]) as TheUpdate:
            for UpdateRow in TheUpdate:
                Inner_FID=UpdateRow[0]
                Inner_NODEA=UpdateRow[1]
                Inner_NODEB=UpdateRow[2]
                Inner_ORDER=UpdateRow[3]
                if Inner_ORDER==Outer_ORDER and (Inner_NODEA==Outer_NODEB or Inner_NODEB==Outer_NODEA):
                    UpdateRow[4]=Outer_FID
                    TheUpdate.updateRow(UpdateRow)
And some data in shapefile form and dbf form

Using matlab:
A = [1 1 2 5
     2 2 3 5
     3 3 4 6
     4 3 5 5
     5 6 4 5]
%% Initialization
% index of matrix line sharing the same group
ind = 1
% length of the index
len = length(ind)
% the group array
g = []
% group counter
c = 1
% Start the small algorithm
while 1
    % Check if another line with the same "Value" share some common node
    ind = find(any(ismember(A(:,2:3),A(ind,2:3)) & A(:,4) == A(ind(end),4),2));
    % If there is no new line, we create a group with the discovered line
    if length(ind) == len
        %group assignment
        g(A(ind,1)) = c
        c = c+1
        % delete the already discovered line (or node...)
        A(ind,:) = []
        % break if no more node
        if isempty(A)
            break
        end
        % reset the index for the next group
        ind = 1;
    end
    len = length(ind);
end
And here is the output:
g =
1 1 2 1 3
As expected
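Since the question also allows plain Python, here is a minimal sketch of the same grouping using a union-find (disjoint-set) structure, which does not depend on the order in which rows are examined. The function name group_segments and the (id, a, b, value) tuple layout are assumptions made for illustration; they do not come from the question's shapefile. It relies on the rule that two rows link when they have the same value and the A of one equals the B of the other.

```python
from collections import defaultdict

def group_segments(rows):
    """rows: iterable of (id, a, b, value). Returns {id: group number}."""
    rows = list(rows)
    parent = {rid: rid for rid, _, _, _ in rows}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    # Index rows by (value, node): once as start node A, once as end node B.
    starts = defaultdict(list)
    ends = defaultdict(list)
    for rid, a, b, value in rows:
        starts[(value, a)].append(rid)
        ends[(value, b)].append(rid)

    # A node links rows only if some row starts there AND some row ends there
    # (A == B'). Rows sharing only A == A' or only B == B' are not linked.
    for key, start_ids in starts.items():
        end_ids = ends.get(key)
        if not end_ids:
            continue
        for rid in start_ids + end_ids:
            union(start_ids[0], rid)

    # Number the groups in order of first appearance.
    labels, result = {}, {}
    for rid, _, _, _ in rows:
        root = find(rid)
        if root not in labels:
            labels[root] = len(labels) + 1
        result[rid] = labels[root]
    return result

data = [(1, 1, 2, 5), (2, 2, 3, 5), (3, 3, 4, 6), (4, 3, 5, 5), (5, 6, 4, 5)]
groups = group_segments(data)
print(groups)
```

For the sample data this assigns rows 1, 2 and 4 to one group and rows 3 and 5 to groups of their own, matching the EXPECTED RESULT column. The resulting numbers could then be written back to the "new" field with a single UpdateCursor pass.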

Related

Pandas - Equal occurrences of unique type for a column

I have a pandas DataFrame called "DF". Given a row count N = 100 and column = "Type", I would like to sample a total of 100 rows from the population in such a way that each type occurs an equal number of times.
SNo Type Difficulty
1 Single 5
2 Single 15
3 Single 4
4 Multiple 2
5 Multiple 14
6 None 7
7 None 4323
For instance, if I specify N = 3, the output must be:
SNo Type Difficulty
1 Single 5
3 Multiple 4
6 None 7
If, for a given N, some types do not have enough rows to fill their equal share, I can randomly increase the count of another type.
I am wondering how to approach this programmatically. Thanks!
Use groupby.sample (pandas ≥ 1.1) with N divided by the number of types.
NB. For strict equality, this assumes that N is a multiple of the number of types.
N = 3
N2 = N//df['Type'].nunique()
out = df.groupby('Type').sample(n=N2)
Handling an N that is not a multiple of the number of types
Use the same as above and complete to N with random rows excluding those already selected.
N = 5
N2, R = divmod(N, df['Type'].nunique())
out = df.groupby('Type').sample(n=N2)
out = pd.concat([out, df.drop(out.index).sample(n=R)])
As there is still a chance that you complete with items of the same group, if you really want to ensure sampling from different groups replace the last step with:
out = pd.concat([out, df.drop(out.index).groupby('Type').sample(n=1).sample(n=R)])
Example output:
SNo Type Difficulty
4 5 Multiple 14
6 7 None 4323
2 3 Single 4
3 4 Multiple 2
5 6 None 14
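The steps above can be gathered into one self-contained sketch. The helper name balanced_sample and its seed parameter are assumptions added for illustration; the DataFrame reproduces the question's table. It assumes pandas 1.1 or later and that every type has at least N // number_of_types rows.

```python
import pandas as pd

df = pd.DataFrame({
    "SNo": [1, 2, 3, 4, 5, 6, 7],
    "Type": ["Single", "Single", "Single", "Multiple", "Multiple", "None", "None"],
    "Difficulty": [5, 15, 4, 2, 14, 7, 4323],
})

def balanced_sample(df, column, n, seed=None):
    """Sample n rows with (as near as possible) equal counts per value of column."""
    per_type, remainder = divmod(n, df[column].nunique())
    # Equal share for every type.
    out = df.groupby(column).sample(n=per_type, random_state=seed)
    if remainder:
        # Top up with at most one extra row per type, taken from the unused rows.
        extra = (
            df.drop(out.index)
            .groupby(column)
            .sample(n=1, random_state=seed)
            .sample(n=remainder, random_state=seed)
        )
        out = pd.concat([out, extra])
    return out

out = balanced_sample(df, "Type", 5, seed=0)
print(out)
```

With N = 5 and three types, each type contributes one row and two different types contribute one extra row each.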

Assign list as new columns based on a condition

I have a dataframe df that looks like this:
ID Sequence
0 A->A
1 C->C->A
2 C->B->A
3 B->A
4 A->C->A
5 A->C->C
6 A->C
7 A->C->C
8 B->B
9 C->C and so on ....
I want to create a column called 'Outcome', which is binomial in nature.
Its value essentially depends on three lists that I am generating from below
Whenever 'A' occurs in a sequence, probability of "Outcome" being 1 is 2%
Whenever 'B' occurs in a sequence, probability of "Outcome" being 1 is 6%
Whenever 'C' occurs in a sequence, probability of "Outcome" being 1 is 1%
so here is the code which is generating these 3 (bi_A, bi_B, bi_C) lists -
A=0.02
B=0.06
C=0.01
count_A=0
count_B=0
count_C=0
for i in range(0,len(df)):
    if('A' in df.sequence[i]):
        count_A+=1
    if('B' in df.sequence[i]):
        count_B+=1
    if('C' in df.sequence[i]):
        count_C+=1
bi_A = np.random.binomial(1, A, count_A)
bi_B = np.random.binomial(1, B, count_B)
bi_C = np.random.binomial(1, C, count_C)
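As a side note, the three counts can be computed without an explicit loop. This is only a sketch of the counting step, using the ten sample sequences shown above and the lowercase column name sequence from the generator code; it does not address how to combine the three lists.

```python
import pandas as pd

df = pd.DataFrame({"sequence": [
    "A->A", "C->C->A", "C->B->A", "B->A", "A->C->A",
    "A->C->C", "A->C", "A->C->C", "B->B", "C->C",
]})

# For each letter, count the rows whose sequence contains it at least once.
counts = {
    letter: int(df["sequence"].str.contains(letter, regex=False).sum())
    for letter in "ABC"
}
print(counts)
```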
What I am trying to do is combine these 3 lists into an "Output" column so that the probability of Outcome being 1 when "A" is in the sequence is 2%, and so on. How do I solve this? As I understand it, there will be overlap in the data, where bi_A says a sequence is 0 and bi_B says it is 1, so how should that be resolved?
End data should look like -
ID Sequence Output
0 A->A 0
1 C->C->A 1
2 C->B->A 0
3 B->A 0
4 A->C->A 0
5 A->C->C 1
6 A->C 0
7 A->C->C 0
8 B->B 0
9 C->C 0
and so on ....
Such that when I find probability of Outcome = 1 when A is in string, it should be 2%
EDIT -
you can generate the sequence data using this code-
import pandas as pd
import itertools
import numpy as np
import random
alphabets=['A','B','C']
combinations=[]
for i in range(1,len(alphabets)+1):
    combinations.append(['->'.join(i) for i in itertools.product(alphabets, repeat = i)])
combinations=(sum(combinations, []))
weights=np.random.normal(100,30,len(combinations))
weights/=sum(weights)
weights=weights.tolist()
#weights=np.random.dirichlet(np.ones(len(combinations))*1000.,size=1)
'''n = len(combinations)
weights = [random.random() for _ in range(n)]
sum_weights = sum(weights)
weights = [w/sum_weights for w in weights]'''
df=pd.DataFrame(random.choices(
population=combinations,weights=weights,
k=10000),columns=['sequence'])

How to explode pandas dataframe with lists to label the ones in the same row with same id?

For example, I have a pandas dataframe like this :
Ignoring the "Name" column, I want a dataframe that looks like this, labelling the Hashes of the same group with their "ID"
Here, we traverse each row, we encounter "8a43", and assign ID 1 to it, and wherever we find the same hash value, we assign ID as 1. Then we move on to the next row, and encounter 79e2 and b183. We then traverse all the rows and wherever we find these values, we store their IDs as 2. Now the issue will arise when we reach "abc7". It will be assigned ID=5 as it was previously encountered in "abc5". But I also want that in rows after the current one, wherever I find "26ea", assign the ID=5 to those as well.
I hope all this makes sense. If not, feel free to reach out to me via comments or message. I will clear it out quickly.
Solution using dict
import numpy as np
import pandas as pd
hashvalues = list(df['Hash_Value'])
dic, i = {}, 1
id_list = []
for hashlist in hashvalues:
    # convert to list
    if isinstance(hashlist, str):
        hashlist = hashlist.replace('[','').replace(']', '')
        hashlist = hashlist.split(',')
        # check if the hash is unknown
        if hashlist[0] not in dic:
            # Assign a new id
            dic[hashlist[0]] = i
            k = i
            i += 1
        else:
            # if known use existing id
            k = dic[hashlist[0]]
        for h in hashlist[1:]:
            # set id of the rest of the list hashes
            # equal to the first hashes's id
            dic[h] = k
        id_list.append(k)
    else:
        id_list.append(np.nan)
# store the collected ids so that the ID column appears in the output
df['ID'] = id_list
print(df)
Hash Name ID
0 [8a43] abc1 1
1 [79e2,b183] abc2 2
2 [f82a] abc3 3
3 [b183] abc4 2
4 [eaa7,5ea9,1cee] abc5 4
5 [5ea9] abc6 4
6 [1cee,26ea] abc7 4
7 [79e2] abc8 2
8 [8a43] abc9 1
9 [26ea] abc10 4
Use networkx to build a dictionary that maps each hash to its connected component, select the first value of each list in Hash_Value with .str[0], and use Series.map:
#if necessary convert to lists
#df['Hash_Value'] = df['Hash_Value'].str.strip('[]').str.split(', ')
import networkx as nx
G=nx.Graph()
for l in df['Hash_Value']:
    nx.add_path(G, l)
new = list(nx.connected_components(G))
print (new)
[{'8a43'}, {'79e2', 'b183'}, {'f82a'}, {'5ea9', '1cee', '26ea', 'eaa7'}]
mapped = {node: cid for cid, component in enumerate(new) for node in component}
df['ID'] = df['Hash_Value'].str[0].map(mapped) + 1
print (df)
Hash_Value Name ID
0 [8a43] abc1 1
1 [79e2, b183] abc2 2
2 [f82a] abc3 3
3 [b183] abc4 2
4 [eaa7, 5ea9, 1cee] abc5 4
5 [5ea9] abc6 4
6 [1cee, 26ea] abc7 4
7 [79e2] abc8 2
8 [8a43] abc9 1
9 [26ea] abc10 4
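If adding networkx as a dependency is not an option, the same connected-component labelling can be sketched in plain Python with a small union-find. The function name label_hash_groups is hypothetical, and the sketch assumes every row holds a non-empty list of hashes. Unlike the dict solution, it does not depend on the order in which linked hashes are first encountered.

```python
def label_hash_groups(hash_lists):
    """Return one group ID per row; rows linked through shared hashes get the same ID."""
    parent = {}

    def find(x):
        # Register unseen hashes as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All hashes listed in the same row belong to the same component.
    for hashes in hash_lists:
        first_root = find(hashes[0])
        for h in hashes[1:]:
            parent[find(h)] = first_root

    # Number the components in order of first appearance.
    labels, ids = {}, []
    for hashes in hash_lists:
        root = find(hashes[0])
        ids.append(labels.setdefault(root, len(labels) + 1))
    return ids

rows = [
    ["8a43"], ["79e2", "b183"], ["f82a"], ["b183"], ["eaa7", "5ea9", "1cee"],
    ["5ea9"], ["1cee", "26ea"], ["79e2"], ["8a43"], ["26ea"],
]
ids = label_hash_groups(rows)
print(ids)
```

With a DataFrame whose Hash_Value column already holds lists, the result could be assigned with df['ID'] = label_hash_groups(df['Hash_Value'].tolist()).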

How not to use loop in a df when access previous lines

I use pandas to process transport data; I study attendance on bus lines. I have two columns that count the people getting on and off the bus at each stop. I want to create a column that counts the people currently on board. At the moment I loop through the df, and for row n it computes current[n] = on[n] - off[n] + current[n-1], as shown in the following example:
for index,row in df.iterrows():
    if index == 0:
        df.loc[index,'current']=df.loc[index,'on']
    else :
        df.loc[index,'current']=df.loc[index,'on']-df.loc[index,'off']+df.loc[index-1,'current']
Is there a way to avoid using a loop ?
Thanks for your time !
You can use Series.cumsum(), which accumulates the numbers in a given Series.
a = pd.DataFrame([[3,4],[6,4],[1,2],[4,5]], columns=["off", "on"])
a["current"] = a["on"].cumsum() - a["off"].cumsum()
off on current
0 3 4 1
1 6 4 -1
2 1 2 0
3 4 5 1
If I've understood the problem properly, you could calculate the difference between people getting on and off, then have a running total using Series.cumsum():
import pandas as pd
# Create dataframe for demo
d = {'Stop':['A','B','C','D'],'On':[3,2,3,2],'Off':[2,1,0,1]}
df = pd.DataFrame(data=d)
# Get difference between 'On' and 'Off' columns.
df['current'] = df['On']-df['Off']
# Get cumulative sum of column
df['Total'] = df['current'].cumsum()
# Same thing in one line
df['Total'] = (df['On']-df['Off']).cumsum()
Stop On Off Total
A 3 2 1
B 2 1 2
C 3 0 5
D 2 1 6
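As a quick sanity check, the following sketch compares the question's row-by-row recurrence with the cumsum version. The column names on and off follow the question; the passenger numbers are made up for illustration, and nobody gets off at the first stop, which is the case where the question's special handling of row 0 and the cumsum formula agree.

```python
import pandas as pd

# Illustrative data: nobody gets off at the first stop.
df = pd.DataFrame({"on": [4, 4, 2, 5], "off": [0, 3, 1, 4]})

# Row-by-row recurrence: current[n] = on[n] - off[n] + current[n-1]
current = []
for n in range(len(df)):
    previous = current[-1] if current else 0
    current.append(int(df.loc[n, "on"] - df.loc[n, "off"]) + previous)
df["current_loop"] = current

# Vectorised equivalent: running total of the net change at each stop.
df["current"] = (df["on"] - df["off"]).cumsum()
print(df)
```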

get a random item from a group of rows in a xlsx file in python

I have a xlsx file, for example:
A B C D E F G
1 5 2 7 0 1 8
3 4 0 7 8 5 9
4 2 9 7 0 6 2
1 6 3 2 8 8 0
4 3 5 2 5 7 9
5 2 3 2 6 9 1
being my values (that are actually on an excel file).
I need to get random rows from it, but separated by the values in column D.
You can note that column D has values that are 7 and values that are 2.
I need to get 1 random row of all the rows that have 7 on column D and 1 random row of all the rows that have 2 on column D.
And put the results on another xlsx file.
My expected output needs to be the content of line 0, 1 or 2 and the content of line 3, 4 or 5.
Can someone help me with that?
Thanks!
I've written code for that. The code below assumes that the Excel file is named test.xlsx and resides in the folder where you run your code. It samples NrandomLines rows for each unique value in column D and prints them out.
import pandas as pd
import numpy as np
import random
df = pd.read_excel('test.xlsx') # read the excel
vals = df.D.unique() # all unique values in column D, in your case its only 2 and 7
idx = []
N = []
for i in vals: # loop over unique values in column D
    locs = (df.D==i).values.nonzero()[0]
    idx = idx + [locs] # save row index of every unique value in column D
    N = N + [len(locs)] # save how many rows contain specific value in D
NrandomLines = 1 # how many random samples you want
for i in np.arange(len(vals)): # loop over unique values of D
    for k in np.arange(NrandomLines): # loop how many random samples you want
        randomRow = random.randint(0,N[i]-1) # create random sample
        print(df.iloc[idx[i][randomRow],:]) # print out random row
With OpenPyXl, you can use Worksheet.iter_rows to iterate the worksheet rows.
You can use itertools.groupby to group the row according to the "D" column values.
To do that, you can create a small function to pick-up this value in a row:
def get_d(row):
    return row[3].value
Then, you can use random.choice to choose a row randomly.
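Putting it all together, you can have:
import itertools
import random
from openpyxl import load_workbook

def get_d(row):
    return row[3].value

# 'test.xlsx' is an assumed file name; min_row=2 skips the header row
ws = load_workbook('test.xlsx').active
# itertools.groupby only groups adjacent rows, so sort by column D first
rows = sorted(ws.iter_rows(min_row=2), key=get_d)
for key, group in itertools.groupby(rows, key=get_d):
    row = random.choice(list(group))
    print([cell.value for cell in row])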
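With pandas 1.1 or later, picking one random row per value of column D can also be sketched in a single groupby call. The DataFrame below reproduces the question's values inline so the sketch runs without an Excel file; in practice it would come from pd.read_excel, and the output file name result.xlsx in the comment is hypothetical.

```python
import pandas as pd

# Stand-in for pd.read_excel('test.xlsx'), using the question's values.
df = pd.DataFrame(
    [
        [1, 5, 2, 7, 0, 1, 8],
        [3, 4, 0, 7, 8, 5, 9],
        [4, 2, 9, 7, 0, 6, 2],
        [1, 6, 3, 2, 8, 8, 0],
        [4, 3, 5, 2, 5, 7, 9],
        [5, 2, 3, 2, 6, 9, 1],
    ],
    columns=list("ABCDEFG"),
)

# One random row for each distinct value in column D.
picked = df.groupby("D").sample(n=1)
print(picked)

# To write the result to another workbook (requires openpyxl):
# picked.to_excel("result.xlsx", index=False)
```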
