Pandas DataFrame: Complex problem with selecting rows linked to each other - python

I have a dataframe df
Col1 Col2
A B
C A
B D
E F
G D
G H
K J
and a Series id of IDs
ID
A
F
What I want is, for every letter in id, to select all the other letters it is linked to through at most 2 intermediates.
Let's work through the example for A (much easier to understand with an example):
There are 2 rows containing A, linking it to B and C, so the direct links to A are [B, C] (no matter whether A is in Col1 or Col2):
A B
C A
But B is also linked to D, and D is linked to G:
B D
G D
So links to A are [B, C, D, G].
Even though G and H are linked, that would mean more than 2 intermediates from A (A > B > D > G > H makes B, D and G the intermediates), so I don't include H in A's links list:
G H
I'm looking for a way to build, for each ID in id, its links list, and save it in id:
ID LinksList
A [B, C, D, G]
F [E]
I don't mind the type of LinksList (it can be String) as long as I can get the info for a specific ID and work with it. I also don't mind the order of IDs in LinksList, as long as it's complete.
I already found a way to solve the problem, but it uses 3 for loops, so it takes a really long time.
(For k1 in ID, for k2 in range(0, 3), select the direct links for each element of LinksList + the starting element, and put them in LinksList if they're not already in it.)
Can someone please help me do it using only pandas?
Thanks a lot in advance!!
==== EDIT: Here are the "3 loops", after Karl's comment ====
i = 0
for k in id:
    # direct links of k (whether k appears in Col1 or Col2)
    linklist = list(df[df['Col1'] == k]['Col2']) + list(df[df['Col2'] == k]['Col1'])
    new = linklist.copy()
    intermediate_count = 1
    while len(new) > 0 and intermediate_count <= 2:
        nn = new.copy()
        new = []
        for n in nn:
            toadd = list(df[df['Col1'] == n]['Col2']) + list(df[df['Col2'] == n]['Col1'])
            toadd = list(set(toadd).difference(linklist + [k]))
            linklist = linklist + toadd
            new = new + toadd
        intermediate_count += 1
    if i == 0:
        df_result = pd.DataFrame(data={'Id': k, 'Linked': linklist})
        i = 1
    else:
        df_result = df_result.append(pd.DataFrame(data={'Id': k, 'Linked': linklist}))

I would first append the reciprocal of the dataframe to be able to always go from Col1 to Col2. Then I would use merges to compute the possible results with 1 and 2 intermediate steps. Finally, I would aggregate all those values into sets. Code could be:
# append the symmetric (Col2 -> Col1) to the end of the dataframe
df2 = df.append(df.reindex(columns=reversed(df.columns)).rename(
    columns={df.columns[len(df.columns)-i]: col
             for i, col in enumerate(df.columns, 1)}), ignore_index=True
    ).drop_duplicates()
# add one step on Col3
df3 = df2.merge(df2, 'left', left_on='Col2', right_on='Col1',
                suffixes=('', '_')).drop(columns='Col1_').rename(
    columns={'Col2_': 'Col3'})
# add a second step on Col4
df4 = df3.merge(df2, 'left', left_on='Col3', right_on='Col1',
                suffixes=('', '_')).drop(columns='Col1_').rename(
    columns={'Col2_': 'Col4'})
# aggregate Col2 to Col4 into a set
df4['Links'] = df4.iloc[:, 1:].agg(set, axis=1)
# aggregate that new column grouped by Col1
result = df4.groupby('Col1')['Links'].agg(lambda x: set.union(*x)).reset_index()
# remove the initial value if present in Links
result['Links'] = result['Links'] - result['Col1'].apply(set)
# and display the result restricted to id
print(result[result['Col1'].isin(id)])
With the sample data, it gives as expected:
Col1 Links
0 A {D, C, B, G}
5 F {E}
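If a plain string is preferred for LinksList (the question allows either), the sets can be joined afterwards; a minimal sketch:
result['Links'] = result['Links'].apply(lambda s: ', '.join(sorted(s)))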

We can use the NetworkX library:
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
# Read in pandas dataframe using copy and paste
df = pd.read_clipboard()
# Create graph network from pandas dataframe
G = nx.from_pandas_edgelist(df, 'Col1', 'Col2')
# Create the id Series
id = pd.Series(['A', 'F'])
# Move the values into the index of the Series
id.index = id
# Use nx's `single_source_shortest_path` for each value in the id Series;
# the cutoff of 3 keeps nodes reachable through at most 2 intermediates
id.apply(lambda x: list(nx.single_source_shortest_path(G, x, 3).keys())[1:])
Output:
A [B, C, D, G]
F [E]
dtype: object
(A matplotlib drawing of the graph was shown here.)

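For reproducibility without the clipboard, here is the same approach with the sample frame built inline, as a minimal sketch (only the inline construction of df and id is new):
import networkx as nx
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'C', 'B', 'E', 'G', 'G', 'K'],
                   'Col2': ['B', 'A', 'D', 'F', 'D', 'H', 'J']})
G = nx.from_pandas_edgelist(df, 'Col1', 'Col2')
id = pd.Series(['A', 'F'], index=['A', 'F'])
# nodes reachable within 3 hops (2 intermediates), excluding the source itself
print(id.apply(lambda x: list(nx.single_source_shortest_path(G, x, 3).keys())[1:]))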

Creating several graphs using 'for' loop

I have a dataset that looks like
Node Target Group
A B 1
A C 1
B D 2
F E 3
F A 3
G M 3
I would like to create a graph for each distinct value in Group. There are 5 groups in total.
file_num = 1
for item <= 5:  # this is wrong
    item.plot(title='Iteration n.' + item)
    plt.savefig(f'{file}_{file_num}.png')
    file_num += 1
    G = nx.from_pandas_edgelist(df, source='Node', target='Target')
    cytoscapeobj = ipycytoscape.CytoscapeWidget()
    cytoscapeobj.graph.add_graph_from_networkx(G)
    cytoscapeobj
This code, however, does not generate the individual graphs (G and the object from Cytoscape), meaning that something does not work within the loop (at least).
Any help would be extremely useful.
Let's call the DataFrame df. Then you iterate over the unique values in the Group column, subset df by this Group value, and save a new figure each iteration of the loop.
file_num = 1
for group_val in df.Group.unique():
    df_group = df[df['Group'] == group_val]
    df_group.plot(title='Iteration n.' + str(group_val))
    plt.savefig(f'{file}_{file_num}.png')
    file_num += 1
    G = nx.from_pandas_edgelist(df_group, source='Node', target='Target')
    cytoscapeobj = ipycytoscape.CytoscapeWidget()
    cytoscapeobj.graph.add_graph_from_networkx(G)
    cytoscapeobj
You should call the display() function on cytoscapeobj:
for group_val in df.Group.unique():
    <...>
    display(cytoscapeobj)

Group and sum the columns that are not in the lists

I have a DataFrame df whose columns follow the pattern #number - letter, and I want to add a new column others that holds the sum of the columns that are not in letter_table1 and letter_table2:
TEXT, A, B, C, D, E, F, G, H, I
a,1,1,1,2,2,2,3,3,3
b,1,1,1,2,2,2,3,3,3
c,1,1,1,2,2,2,3,3,3
d,1,1,1,2,2,2,3,3,3
e,1,1,1,2,2,2,3,3,3
f,1,1,1,2,2,2,3,3,3
g,1,1,1,2,2,2,3,3,3
h,1,1,1,2,2,2,3,3,3
i,1,1,1,2,2,2,3,3,3
j,1,1,1,2,2,2,3,3,3
for instance:
tableau_lettres1 = ['H']
tableau_lettres2 = ['I', 'J']
How can I do that? For the moment I have tried:
df_sum['others'] = df.loc[:,~df.isin(tableau_lettres1, tableau_lettres2)].sum(axis=1)
as well as:
df_sum['others'] = df.loc[:,df.drop(tableau_lettres1, tableau_lettres2)].sum(axis=1)
As tableau_lettres1 and tableau_lettres2 are lists, you need to join them into one list and select the other column names, like:
df_sum['others'] = df[[col for col in df.columns.tolist() if col not in tableau_lettres1 + tableau_lettres2]].sum(axis=1)
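Alternatively, along the lines of the drop attempt in the question, something like this sketch should also work (errors='ignore' covers listed letters that are not actual columns, such as 'J' in the sample, and numeric_only=True skips the TEXT column):
df_sum['others'] = df.drop(columns=tableau_lettres1 + tableau_lettres2, errors='ignore').sum(axis=1, numeric_only=True)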

Retrieve a certain value located in any row or column of a dataframe and keep it in a separate column, without a for loop

I have a dataframe like below
df
           A          B          C
0          0          1  TRANSIT_1
1  TRANSIT_3       None       None
2          0  TRANSIT_5       None
And I want to change it to below:
Resulting DF
           A          B          C          D
0          0          1  TRANSIT_1  TRANSIT_1
1  TRANSIT_3       None       None  TRANSIT_3
2          0  TRANSIT_5       None  TRANSIT_5
So I tried to use str.contains and, once I receive the Series with True or False, put it into the eval function to somehow get me the table I want.
Code I tried:
series_index = pd.DataFrame()
series_index = df.columns.str.contains("^TRANSIT_", case=True, regex=True)
print(type(series_index))
series_index.index[series_index].tolist()
I thought to use the eval function to write it to a separate column, like:
df = eval(df[result]=the index)  # I don't know, but the eval function does the evaluation and puts it in a separate column
I couldn't find a simple one-liner, but this works:
idx = list(df1[df1.where(df1.applymap(lambda x: 'TRA' in x if isinstance(x, str) else False)).notnull()].stack().index)
a, b = [], []
for sublist in idx:
    a.append(sublist[0])
    b.append(sublist[1])
df1['ans'] = df1.lookup(a, b)
Output
           A          B          C        ans
0          0          1  TRANSIT_1  TRANSIT_1
1  TRANSIT_3       None       None  TRANSIT_3
2          0  TRANSIT_5       None  TRANSIT_5
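Note that DataFrame.lookup is deprecated in newer pandas versions. A rough alternative sketch that masks the non-matching cells and takes the first remaining value in each row (it assumes the values of interest all start with 'TRANSIT_'):
mask = df1.applymap(lambda x: isinstance(x, str) and x.startswith('TRANSIT_'))
df1['ans'] = df1.where(mask).bfill(axis=1).iloc[:, 0]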

Compound inequality in if statement

This is a generalized function I want to use to check if each row of a dataframe follows a specific trend in column values.
def follows_trend(row):
    trend = None
    if row[("col_5" < "col_6" < "col_4" < "col_1" < "col_2" < "col_3")]:
        trend = True
    else:
        trend = False
    return trend
I'll apply it like this
df_trend = df.apply(follows_trend, axis=1)
When I do, it returns all True when there are clearly some rows that should return False. I'm not sure if there is something wrong with the inequality I used or the function itself.
The compound comparisons don't "expand out of" the dict lookup. "col_5" < "col_6" < "col_4" < "col_1" < "col_2" < "col_3" will be evaluated first, producing False because the strings aren't sorted - so your if statement is actually if row[(False)]:. You need to do this:
if row["col_5"] < row["col_6"] < row["col_4"] < row["col_1"] < row["col_2"] < row["col_3"]:
If you have a lot of these expressions, you should probably extract this to a method that takes row and a list of the column names, and uses a loop for the comparisons. If you only have one, but want a somewhat more nice-looking version, try this:
a, b, c, d, e, f = (row[c] for c in ("col_5", "col_6", "col_4", "col_1", "col_2", "col_3"))
if a < b < c < d < e < f:
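And here is one possible sketch of the loop-based helper mentioned above (the function name, its signature and the use of apply keyword arguments are assumptions):
def follows_trend(row, ordered_cols):
    # True only if the row's values are strictly increasing across the given columns
    return all(row[a] < row[b] for a, b in zip(ordered_cols, ordered_cols[1:]))

df_trend = df.apply(follows_trend, axis=1,
                    ordered_cols=["col_5", "col_6", "col_4", "col_1", "col_2", "col_3"])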
Also, you can reorder the column names, use the diff function to take differences along the rows, and compare the result with 0:
(df[["col_5", "col_6", "col_4", "col_1", "col_2", "col_3"]]
 .diff(axis=1).drop('col_5', 1).gt(0).all(1))
Example:
import pandas as pd
df = pd.DataFrame({"A": [1,2], "B": [3,1], "C": [4,2]})
df
# A B C
#0 1 3 4
#1 2 1 2
df.diff(axis=1).drop('A', 1).gt(0).all(1)
#0 True
#1 False
#dtype: bool
You could use query for this. See the example below:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
print(df)
print(df.query('col2 > col3 > col1'))  # query accepts a string with multiple comparisons
results in
col1 col2 col3
0 -0.788909 1.591521 1.709402
1 -1.563310 1.188993 2.295683
2 -1.572323 -0.600015 -1.518411
3 1.786051 0.303291 -0.344720
4 0.756029 -0.393941 1.059874
col1 col2 col3
2 -1.572323 -0.600015 -1.518411

Operating on a huge table: group of rows at a time using python

I have a huge table file that looks like the following. In order to work on individual products (name), I tried to use pandas groupby, but it seems to put the whole table (~10G) in memory, which I cannot afford.
name index change
A Q QQ
A Q QQ
A Q QQ
B L LL
C Q QQ
C L LL
C LL LL
C Q QQ
C L LL
C LL LL
The name column is well sorted and I will only care about one name at a time. I hope to use the following criteria on column "change" to filter each name:
Check whether the number of "QQ" overwhelms the number of "LL". Basically, if the number of rows containing "QQ" minus the number of rows containing "LL" is >= 2, then discard/ignore the "LL" rows for this name from now on. If "LL" overwhelms "QQ", then discard the rows with "QQ". (E.g. A has 3 QQ and 0 LL, and C has 4 LL and 2 QQ; both are fine.)
Resulting table:
name index change
A Q QQ
A Q QQ
A Q QQ
C L LL
C LL LL
C L LL
C LL LL
Comparing "change" to "index", if no change occurs (e.g. LL in both columns), the row is not valid. Further, for the valid changes, the remaining QQ or LL has to be continuous for >=3 times. Therefore C only has 2 valid changes, and it will be filtered out.
Resulting table:
name index change
A Q QQ
A Q QQ
A Q QQ
I wonder if there is a way to just work on the table name by name, and release the memory after each name (and not have to do the two criteria step by step). Any hint or suggestion will be appreciated!
Because the file is sorted by "name", you can read the file row-by-row:
def process_name(name, data, output_file):
    group_by = {}
    for index, change in data:
        if index not in group_by:
            group_by[index] = []
        group_by[index].append(change)

    # do the step 1 filter logic here

    # do the step 2 filter logic here
    for index in group_by:
        if index in group_by[index]:
            # Because there is at least one "no change" this
            # whole "name" can be thrown out, so return here.
            return

    for index in group_by:
        output_file.write("%s\t%s\t%s\n" % (name, index, group_by[index]))


current_name = None
current_data = []

input_file = open(input_filename, "r")
output_file = open(output_filename, "w")
header = input_file.readline()

for row in input_file:
    cols = row.strip().split("\t")
    name = cols[0]
    index = cols[1]
    change = cols[2]
    if name != current_name:
        if current_name is not None:
            process_name(current_name, current_data, output_file)
        current_name = name
        current_data = []
    current_data.append((index, change))

# process what's left in the buffer
if current_name is not None:
    process_name(current_name, current_data, output_file)

input_file.close()
output_file.close()
I don't totally understand the logic you've explained in #1, so I left that blank. I also feel like you probably want to do step #2 first as that will quickly rule out entire "name"s.
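For what it's worth, here is one possible reading of the two criteria for the rows of a single name, as a rough sketch (the helper name, operating on a small per-name DataFrame, and reading "continuous for >= 3 times" simply as "at least 3 valid rows remain" are all assumptions):
import pandas as pd

def filter_name(group):
    # group: the rows of a single name, with columns 'name', 'index', 'change'
    qq = (group['change'] == 'QQ').sum()
    ll = (group['change'] == 'LL').sum()
    # criterion 1: if one change type overwhelms the other by >= 2, drop the minority rows
    if qq - ll >= 2:
        group = group[group['change'] != 'LL']
    elif ll - qq >= 2:
        group = group[group['change'] != 'QQ']
    # criterion 2: rows where 'index' already equals 'change' are not valid changes
    valid = group[group['index'] != group['change']]
    # keep the name only if at least 3 valid changes remain, otherwise drop it entirely
    return valid if len(valid) >= 3 else group.iloc[0:0]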
Since your file is sorted and you only seem to be operating on the sub segments by name, perhaps just use Python's groupby and create a table for each name segment as you go:
from itertools import groupby
import pandas as pd

with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_data = {col: [] for col in header}
        for e in segment:
            for key, v in zip(header, e.split()):
                seg_data[key].append(v)
        seg_fram = pd.DataFrame.from_dict(seg_data)
        print(k)
        print(seg_fram)
        print()
Prints:
A
change index name
0 QQ Q A
1 QQ Q A
2 QQ Q A
B
change index name
0 LL L B
C
change index name
0 QQ Q C
1 LL L C
2 LL LL C
3 QQ Q C
4 LL L C
5 LL LL C
Then the largest piece of memory you will have will be dictated by the largest contiguous group and not the size of the file.
You can use 1/2 the memory of that method by appending to the data frame row by row instead of building the intermediate dict:
with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_fram = pd.DataFrame(columns=header)
        for idx, e in enumerate(segment):
            df = pd.DataFrame({col: v for col, v in zip(header, e.split())}, index=[idx])
            seg_fram = seg_fram.append(df)
(might be slower though...)
If that does not work, consider using a disk database.
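If you do go the disk-database route, a rough sketch with the standard-library sqlite3 could look like this (the file, database and table names are assumptions):
import sqlite3
import pandas as pd

con = sqlite3.connect('changes.db')
# load the big file into SQLite in chunks so it never sits fully in memory
for chunk in pd.read_csv('huge_table.tsv', sep='\t', chunksize=100_000):
    chunk.to_sql('changes', con, if_exists='append', index=False)
# then pull one name at a time and apply the two criteria to each small frame
names = pd.read_sql('SELECT DISTINCT name FROM changes', con)['name']
for name in names:
    group = pd.read_sql('SELECT * FROM changes WHERE name = ?', con, params=(name,))
    # ... filter `group` here and append the surviving rows to the output ...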
