I have a pandas DataFrame where each value is a node and each row (two columns) represents an edge, as follows:
import pandas as pd
df = pd.DataFrame({'node1': ['2', '4', '17', '17', '205', '208'],
                   'node2': ['4', '13', '25', '38', '208', '300']})
All edges are undirected, i.e. you can get from one node to the other in either direction.
I would like to group the nodes into connected groups (connected components), as follows:
df = pd.DataFrame({'node1': ['2', '4', '17', '17', '205', '208'],
                   'node2': ['4', '13', '25', '38', '208', '300'],
                   'desired_group': ['1', '1', '2', '2', '3', '3']})
For example, the reason the first two rows were grouped is that it's possible to get from node 2 to node 13 (through 4).
The closest question that I managed to find is this one:
pandas - reshape dataframe to edge list according to column values, but to my understanding it's a different question.
Any help on this would be great, thanks in advance.
Using networkx connected_components
import networkx as nx

G = nx.from_pandas_edgelist(df, 'node1', 'node2')
l = list(nx.connected_components(G))                # one set of nodes per component
L = [dict.fromkeys(y, x) for x, y in enumerate(l)]  # per component: node -> component index
d = {k: v for d in L for k, v in d.items()}         # merge into a single lookup dict
# df['New'] = df.node1.map(d)
df.node1.map(d)
0 0
1 0
2 1
3 1
4 2
5 2
Name: node1, dtype: int64
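If you want the group written back as a column and numbered from 1 like the desired output, a small follow-up (a sketch that reuses the d mapping built above) could be:

# d maps node -> 0-based component index, so shift by 1
df['desired_group'] = df['node1'].map(d) + 1
print(df)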
If for some reason you cannot use an external library, you can implement the algorithm yourself:
import pandas as pd

def bfs(graph, start):
    visited, queue = set(), [start]
    while queue:
        vertex = queue.pop(0)
        if vertex not in visited:
            visited.add(vertex)
            queue.extend(graph[vertex] - visited)
    return visited

def connected_components(G):
    seen = set()
    for v in G:
        if v not in seen:
            c = set(bfs(G, v))
            yield c
            seen.update(c)

def graph(edge_list):
    result = {}
    for source, target in edge_list:
        result.setdefault(source, set()).add(target)
        result.setdefault(target, set()).add(source)
    return result

df = pd.DataFrame({'node1': ['2', '4', '17', '17', '205', '208'],
                   'node2': ['4', '13', '25', '38', '208', '300']})

G = graph(df[['node1', 'node2']].values)
components = connected_components(G)
lookup = {i: component for i, component in enumerate(components, 1)}
df['group'] = [label for node in df.node1 for label, component in lookup.items() if node in component]
print(df)
Output
node1 node2 group
0 2 4 1
1 4 13 1
2 17 25 3
3 17 38 3
4 205 208 2
5 208 300 2
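On larger frames the list comprehension above rescans every component for every row; a quicker variant (a sketch reusing the graph and connected_components helpers above) builds a node-to-label dict once and maps it:

node_to_label = {node: label
                 for label, component in enumerate(connected_components(G), 1)
                 for node in component}
df['group'] = df['node1'].map(node_to_label)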
I have a string of 1s and 0s that I need to insert into a 4-by-4 matrix, which I can then use for other things.
This is my attempt at it:
b = '0110000101101000'
m = [[], [], [], []]
for i in range(4):
    for j in range(4):
        m[i].append(b[i * j])
But where I expected to get
[['0', '1', '1', '0'], ['0', '0', '0', '1'], ['0', '1', '1', '0'], ['1', '0', '0', '0']]
I got [['0', '0', '0', '0'], ['0', '1', '1', '0'], ['0', '1', '0', '0'], ['0', '0', '0', '1']].
Could someone point me in the right direction here?
Get paper and a pencil and write a table of what you have now vs what you want:
i j i*j desired
0 0 0 0
0 1 0 1
0 2 0 2
0 3 0 3
1 0 0 4
1 1 1 5
... up to i=3, j=3
Now you can see that i * j is not the correct index in b. Can you see what the desired index formula is?
I'd agree with @John Zwinck that you can easily figure it out, but if you hate math, simply do:
counter = 0
for i in range(4):
    for j in range(4):
        m[i].append(b[counter])
        counter += 1  # keep track of the overall iterations
Otherwise you have to find where the current row starts (i * columns) and add the current column index:
m[i].append(b[i * 4 + j])  # i * 4 gives the overall index of the 0th element of the current row
Here is a hint: range(4) starts from 0 and ends at 3.
See the python documentation: https://docs.python.org/3.9/library/stdtypes.html#typesseq
First of all, the rule to convert coordinates to index is index = row * NRO_COLS + col. You should use i * 4 + j.
Second, you can use list comprehension:
m = [[b[i * 4 + j] for j in range(4)] for i in range(4)]
then, it can be rewritten as:
m = [[b[i + j] for j in range(4)] for i in range(0, len(b), 4)]
or
m = [list(b[i:i+4]) for i in range(0, len(b), 4)]
Another alternative is to use numpy, which is a great library, especially for handling multidimensional arrays:
import numpy as np
m = np.array(list(b)).reshape(4,4)
or:
print(np.array(list(b)).reshape(4, -1))
print(np.array(list(b)).reshape(-1, 4))
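If the matrix should hold integers rather than one-character strings (an assumption about how it will be used "for other things"), a small variation of the numpy approach is:

import numpy as np

b = '0110000101101000'
m_int = np.array(list(b), dtype=int).reshape(4, 4)  # cast each character to an int while reshaping
print(m_int)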
I wish to merge multiple files with a single file (f1.txt) based on a 2-column match after comparison with that file. I can do it in pandas, but it reads everything into memory, which can get big really fast. I am thinking a line-by-line read will not load everything into memory, and pandas is also not an option now. How do I perform the operation while filling in null for cells where a match with f1.txt does not occur?
Here I used a dictionary, which I am not sure will fit in memory, and I also can't find a way to add null where there is no match in the other files with f1.txt. There could be as many as 1000 other files. Runtime does not matter as long as I do not read everything into memory.
FILES (tab-delimited)
f1.txt
A B num val scol
1 a1 1000 2 3
2 a2 456 7 2
3 a3 23 2 7
4 a4 800 7 3
5 a5 10 8 7
a1.txt
A B num val scol fcol dcol
1 a1 1000 2 3 0.2 0.77
2 a2 456 7 2 0.3 0.4
3 a3 23 2 7 0.5 0.6
4 a4 800 7 3 0.003 0.088
a2.txt
A B num val scol fcol2 dcol1
2 a2 456 7 2 0.7 0.8
4 a4 800 7 3 0.9 0.01
5 a5 10 8 7 0.03 0.07
Current Code
import os
import csv

m1 = os.getcwd() + '/f1.txt'
files_to_compare = [i for i in os.listdir('dir')]
dictionary = dict()
dictionary1 = dict()

with open(m1, 'rt') as a:
    reader1 = csv.reader(a, delimiter='\t')
    for x in files_to_compare:
        with open(os.getcwd() + '/dir/' + x, 'rt') as b:
            reader2 = csv.reader(b, delimiter='\t')
            for row1 in list(reader1):
                dictionary[row1[0]] = list()
                dictionary1[row1[0]] = list(row1)
            for row2 in list(reader2):
                try:
                    dictionary[row2[0]].append(row2[5:])
                except KeyError:
                    pass
print(dictionary)
print(dictionary1)
What I am trying to achieve is similar to using: df.merge(df1, on=['A','B'], how='left').fillna('null')
Current result
{'A': [['fcol1', 'dcol1'], ['fcol', 'dcol']], '1': [['0.2', '0.77']], '2': [['0.7', '0.8'], ['0.3', '0.4']], '3': [['0.5', '0.6']], '4': [['0.9', '0.01'], ['0.003', '0.088']], '5': [['0.03', '0.07']]}
{'A': ['A', 'B', 'num', 'val', 'scol'], '1': ['1', 'a1', '1000', '2', '3'], '2': ['2', 'a2', '456', '7', '2'], '3': ['3', 'a3', '23', '2', '7'], '4': ['4', 'a4', '800', '7', '3'], '5': ['5', 'a5', '10', '8', '7']}
Desired result
{'A': [['fcol1', 'dcol1'], ['fcol', 'dcol']], '1': [['0.2', '0.77'],['null', 'null']], '2': [['0.7', '0.8'], ['0.3', '0.4']], '3': [['0.5', '0.6'],['null', 'null']], '4': [['0.9', '0.01'], ['0.003', '0.088']], '5': [['null', 'null'],['0.03', '0.07']]}
{'A': ['A', 'B', 'num', 'val', 'scol'], '1': ['1', 'a1', '1000', '2', '3'], '2': ['2', 'a2', '456', '7', '2'], '3': ['3', 'a3', '23', '2', '7'], '4': ['4', 'a4', '800', '7', '3'], '5': ['5', 'a5', '10', '8', '7']}
My final intent is to write the dictionary to a text file. I do not know how much memory will be used or whether it will even fit in memory. If there is a better way without using pandas, that would be nice; otherwise, how do I make the dictionary approach work?
DASK ATTEMPT:
import dask.dataframe as dd

directory = 'input_dir/'
first_file = dd.read_csv('f1.txt', sep='\t')
df = dd.read_csv(directory + '*.txt', sep='\t')
df2 = dd.merge(first_file, df, on=['A', 'B'])
I kept getting ValueError: Metadata mismatch found in 'from_delayed'
+--------+-------+----------+
| Column | Found | Expected |
+--------+-------+----------+
| fcol   | int64 | float64  |
+--------+-------+----------+
I googled and found similar complaints but could not fix it, which is why I decided to try this. I checked my files and all dtypes seem to be consistent. My version of dask was 2.9.1.
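As a side note, and only as an untested sketch: a common workaround for this kind of dask metadata mismatch is to pin the ambiguous column's dtype when reading, so the inferred metadata agrees across files, e.g.:

import dask.dataframe as dd

# hypothetical fix: force 'fcol' to float64 in every partition
df = dd.read_csv('input_dir/*.txt', sep='\t', dtype={'fcol': 'float64'})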
If you want a hand-made solution, you can look at heapq.merge and itertools.groupby. This assumes your files are sorted by the first two columns (the key).
I made a simple example that merges and groups the files and produces two output files instead of dictionaries (so (almost) nothing is stored in memory; everything is read from and written to disk):
from heapq import merge
from itertools import groupby

first_file_name = 'f1.txt'
other_files = ['a1.txt', 'a2.txt']

def get_lines(filename):
    with open(filename, 'r') as f_in:
        for line in f_in:
            yield [filename, *line.strip().split()]

def get_values(lines):
    for line in lines:
        yield line
    while True:
        yield ['null']

opened_files = [get_lines(f) for f in [first_file_name] + other_files]

# save headers
headers = [next(f) for f in opened_files]

with open('out1.txt', 'w') as out1, open('out2.txt', 'w') as out2:
    # print headers to files
    print(*headers[0][1:6], sep='\t', file=out1)
    new_header = []
    for h in headers[1:]:
        new_header.extend(h[6:])
    print(*(['ID'] + new_header), sep='\t', file=out2)

    for v, g in groupby(merge(*opened_files, key=lambda k: (k[1], k[2])), lambda k: (k[1], k[2])):
        lines = [*g]
        print(*lines[0][1:6], sep='\t', file=out1)

        out_line = [lines[0][1]]
        iter_lines = get_values(lines[1:])
        current_line = next(iter_lines)
        for current_file in other_files:
            if current_line[0] == current_file:
                out_line.extend(current_line[6:])
                current_line = next(iter_lines)
            else:
                out_line.extend(['null', 'null'])
        print(*out_line, sep='\t', file=out2)
Produces two files:
out1.txt:
A B num val scol
1 a1 1000 2 3
2 a2 456 7 2
3 a3 23 2 7
4 a4 800 7 3
5 a5 10 8 7
out2.txt:
ID fcol dcol fcol2 dcol1
1 0.2 0.77 null null
2 0.3 0.4 0.7 0.8
3 0.5 0.6 null null
4 0.003 0.088 0.9 0.01
5 null null 0.03 0.07
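To make the mechanics clearer, here is a toy illustration (not part of the answer's code) of how heapq.merge and itertools.groupby cooperate on rows that are already sorted by the key:

from heapq import merge
from itertools import groupby

f1 = [('1', 'a1', 'x'), ('2', 'a2', 'x')]   # already sorted by the (A, B) key
a1 = [('1', 'a1', 'y')]
for key, rows in groupby(merge(f1, a1, key=lambda r: r[:2]), key=lambda r: r[:2]):
    print(key, list(rows))
# ('1', 'a1') [('1', 'a1', 'x'), ('1', 'a1', 'y')]
# ('2', 'a2') [('2', 'a2', 'x')]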
One crucial step in my project is to track the absolute difference of values in a column of a pandas DataFrame for subsamples.
I managed to write a for-loop to create my subsamples: I select every person and go through every year that person is observed. I also accessed the index of each group's first element and compared it to each group's second element.
Here is my MWE data:
import pandas as pd

df = pd.DataFrame({'year': ['2001', '2004', '2005', '2006', '2007', '2008', '2009',
                            '2003', '2004', '2005', '2006', '2007', '2008', '2009',
                            '2003', '2004', '2005', '2006', '2007', '2008', '2009'],
                   'id': ['1', '1', '1', '1', '1', '1', '1',
                          '2', '2', '2', '2', '2', '2', '2',
                          '5', '5', '5', '5', '5', '5', '5'],
                   'money': ['15', '15', '15', '21', '21', '21', '21',
                             '17', '17', '17', '20', '17', '17', '17',
                             '25', '30', '22', '25', '8', '7', '12']}).astype(int)
Here is my code:
# do it for all IDs in my dataframe
for i in df.id.unique():
    # now check every given year for that particular ID
    for j in df[df['id']==i].year:
        # access the index of the first element of that ID, as integer
        index = df[df['id']==i].index.values.astype(int)[0]
        # use that index to calculate absolute difference of the first and second element
        abs_diff = abs(df['money'].iloc[index] - df['money'].iloc[index+1])
        # print all the changes, before further calculations
        index =+1
        print(abs_diff)
My index is not updating. It yields 0000000 0000000 5555555 (3 x 7 changes), but it should show 0,0,0,6,0,0,0 0,0,0,3,-3,0,0 0,5,-8,3,-17,-1,5 (3 x 7 changes). Since either the first or the last element has no change, I added a 0 in front of each group.
Solution: I changed the second loop from for to while:
for i in df.id.unique():
    first = df[df['id']==i].index.values.astype(int)[0]   # ID1 = 0
    last = df[df['id']==i].index.values.astype(int)[-1]   # ID1 = 6
    while first < last:
        abs_diff = abs(df['money'][first] - df['money'][first+1])
        print(abs_diff)
        first += 1
for i in df.id.unique():
    for j in df[df['id']==i].year:
        index = df[(df['id']==i) & (df['year']==j)].index.values[0].astype(int)
        try:
            abs_diff = abs(df['money'].iloc[index] - df['money'].iloc[index+1])
        except:
            pass
        print(abs_diff)
Output:
0
0
6
0
0
0
4
0
0
3
3
0
0
8
5
8
3
17
1
5
You're currently always checking the first value of each batch, so you'd need to do:
# do it for all IDs in my dataframe
for i in df.id.unique():
    # now check every given year for that particular ID
    for idx, j in enumerate(df[df['id']==i].year):
        # access the index of the current element of that ID, as integer
        index = df[df['id']==i].index.values.astype(int)[idx]
        # use that index to calculate the absolute difference of this element and the next
        try:
            abs_diff = abs(df['money'][index] - df['money'][index+1])
        except:
            continue
        # print all the changes, before further calculations
        print(abs_diff)
Which outputs:
0
0
6
0
0
0
4
0
0
3
3
0
0
8
5
8
3
17
1
5
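For reference, the same per-id differences can also be computed without explicit loops; a vectorized sketch (not from the answers above) using groupby and diff, filling the first row of each id with 0 as in the question:

df['abs_diff'] = df.groupby('id')['money'].diff().abs().fillna(0).astype(int)
print(df)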
I have a text file with multiple matrices like this:
4 5 1
4 1 5
1 2 3
[space]
4 8 9
7 5 6
7 4 5
[space]
2 1 3
5 8 9
4 5 6
I want to read this input file in Python and store it in multiple matrices like:
matrixA = [...] # first matrix
matrixB = [...] # second matrix
and so on. I know how to read external files in Python but don't know how to divide this input file into multiple matrices. How can I do this?
Thank you
You can write code like this:
all_matrices = []  # holds matrixA, matrixB, ...
matrix = []        # holds the current matrix
with open('file.txt', 'r') as f:
    for line in f:
        values = line.split()
        if values:  # the line contains numbers
            matrix.append(values)
        else:       # a blank line: the current matrix is complete
            all_matrices.append(matrix)
            matrix = []
if matrix:  # catch the last matrix if the file does not end with a blank line
    all_matrices.append(matrix)
# do whatever you want with all_matrices ...
I am sure the algorithm could be optimized somewhere, but the answer I found is quite simple:
file = open('matrix_list.txt').read()  # open the file and read all of it
matrix_list = file.split("\n\n")       # split the file into a list of matrices
for i, m in enumerate(matrix_list):
    matrix_list[i] = m.split("\n")     # split each matrix into its rows
    for j, r in enumerate(matrix_list[i]):
        matrix_list[i][j] = r.split()  # split each row into its values
This will result in the following format:
[[['4', '5', '1'], ['4', '1', '5'], ['1', '2', '3']], [['4', '8', '9'], ['7', '5', '6'], ['7', '4', '5']], [['2', '1', '3'], ['5', '8', '9'], ['4', '5', '6']]]
Example on how to use the list:
print(matrix_list) #prints all matrices
print(matrix_list[0]) #prints the first matrix
print(matrix_list[0][1]) #prints the second row of the first matrix
print(matrix_list[0][1][2]) #prints the value from the second row and third column of the first matrix
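If you really want the separate names from the question, you can unpack the list afterwards (this assumes the file contains exactly three matrices):

matrixA, matrixB, matrixC = matrix_list  # hypothetical names; requires exactly three matrices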
I have the following matrix:
import pandas as pd

df_test = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                        'Snack': ['1', '0', '1', '1', '0', '0'],
                        'Trans': ['1', '1', '1', '0', '0', '1'],
                        'Dop': ['1', '0', '1', '0', '1', '1']}).set_index('TFD')
df_test = df_test.astype(int)
matrix = df_test.T.dot(df_test)
print(matrix)
Output:
       Dop  Snack  Trans
Dop      4      2      3
Snack    2      3      2
Trans    3      2      4
What I would like it to yield:
Dop-Snack 2
Snack-Trans 2
Trans-Dop 3
Thanks in advance!
Assuming there are no special requirements regarding the order of the pairs:
import itertools
for c, r in itertools.combinations(matrix.columns, 2):
    print("{}-{}\t{}".format(c, r, matrix.loc[c, r]))
# Dop-Snack 2
# Dop-Trans 3
# Snack-Trans 2
https://docs.python.org/2/library/itertools.html#itertools.combinations
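An alternative sketch (not from the answer above) that stays in pandas: mask everything except the strict upper triangle of the co-occurrence matrix, then stack the remaining values into a Series of pairs:

import numpy as np

upper = np.triu(np.ones(matrix.shape, dtype=bool), k=1)  # True strictly above the diagonal
pairs = matrix.where(upper).stack().astype(int)          # stack drops the masked NaNs
print(pairs)
# Dop    Snack    2
#        Trans    3
# Snack  Trans    2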