I have a data frame with two columns of integers. The second column contains numbers that are linked to the first column; where there is no link, the value in the second column is zero. Here is an example of the table.
The expected output is a list of connections between the two columns. Using the attached table as an example, the output will be
[[2, 3, 4, 5], [6, 7, 8]]
This question is similar but not the same as finding transitive relation between two columns in pandas.
You could approach this as a graph, treating the dataframe as an edge list. You can then retrieve the connected nodes with networkx:
import pandas as pd
import networkx as nx

df = pd.DataFrame({'a': range(1, 11), 'b': [0, 4, 2, 5, 0, 7, 8, 0, 0, 0]})

# build the graph from the linked rows only (b == 0 means "no link")
g = nx.from_pandas_edgelist(df[df['b'] != 0], source='a', target='b')
print(list(nx.connected_components(g)))
Output:
[{2, 3, 4, 5}, {8, 6, 7}]
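connected_components yields sets; if you want the list-of-lists output shown in the question, a small follow-up (reusing g from above):

print([sorted(c) for c in nx.connected_components(g)])
# [[2, 3, 4, 5], [6, 7, 8]]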
Not really a Pandas answer, but here's one approach (with help from here for finding runs of consecutive integers):
from itertools import groupby
from operator import itemgetter

df = pd.DataFrame({'a': range(1, 11),
                   'b': [0, 4, 2, 5, 0, 7, 8, 0, 0, 0]})

# positions of the linked rows (b != 0)
nonzero_locs = df['b'].to_numpy().nonzero()[0]

connections = []
# position - counter is constant within a run of consecutive positions,
# so it works as a group key
for k, g in groupby(enumerate(nonzero_locs), lambda x: x[0] - x[1]):
    group = list(map(int, map(itemgetter(1), g)))
    group.append(group[-1] + 1)  # the last link's target closes the chain
    connections.append(list(df['a'][group]))

connections  # [[2, 3, 4, 5], [6, 7, 8]]
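As a side note, a more pandas-native sketch (my addition; it relies on the same assumption as above, that each chain occupies consecutive rows) labels runs of nonzero b values with a cumulative sum:

nz = df['b'].ne(0)
# a new chain starts wherever nz flips from False to True
chain_id = (nz & ~nz.shift(fill_value=False)).cumsum()
out = [g['a'].tolist() + [int(g['b'].iloc[-1])]
       for _, g in df[nz].groupby(chain_id[nz])]
print(out)  # [[2, 3, 4, 5], [6, 7, 8]]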
Hello!
I have 400 csv files, each with exactly 2 columns and around 50,000 rows (this varies from file to file). The goal is to find the files that are exactly the same (there may be multiple distinct groups of identical files), but the ultimate goal is to find the group of identical files that occurs most often.
The steps I'm trying to implement are listed as follows:
importing csv files as pandas df
checking the shape of the files/dataframes; if two df's have the same shape, I then compare their elements for equality (the ones with a different shape drop out of consideration)
sorting each df based on the first column, carrying along its corresponding second column
taking the difference of the sorted dataframes (if the difference is all zeros, the df's are exactly the same, which is what is needed)
storing the variable names of the identical dataframes in a list
Here is a dummy setup I'm working on:
import pandas as pd
import numpy as np
## step 1.
# creating random dataframes (implying importing csv files as df)
# keeping these three as same files
df_0 = pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]})
df_1 = pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]})
df_3 = pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]})
# taking these two as same files
df_2 = pd.DataFrame({'a': [3, 2, 2, 1, 1], 'b': [1, 3, 4, 3, 4]})
df_4 = pd.DataFrame({'a': [3, 2, 2, 1, 1], 'b': [1, 3, 4, 3, 4]})
df_5 = pd.DataFrame({'a': [1, 1, 2, 1, 2], 'b': [2, 3, 4, 2, 1]})
# taking a couple of files with a different shape
df_6 = pd.DataFrame({'a': [1, 1, 2, 1, 2, 3], 'b': [2, 3, 4, 2, 1, 2]})
df_7 = pd.DataFrame({'a': [1, 2, 2, 1, 2, 3], 'b': [2, 3, 4, 2, 1, 2]})
### Here there are two different sets of identical df's; however, as described in the
### ultimate goal, the first set (df_0, df_1, df_3) is the one to pick, since it has the
### most (3) identical df's and the other set has fewer (2).
## step 2. pending!! (will need it for the original data with 400 files)
## step 3.
# function to sort all the df in the list
def sort_df(df_list):
    for df in df_list:
        df.sort_values(by=['a'], inplace=True)
    return df_list
#print(sort_df([df_0, df_1, df_2, df_3, df_4, df_5]))
# save the sorted df in a list
sorted_df_list = sort_df([df_0, df_1, df_2, df_3, df_4]) # this performs: 0-1, 0-2, 0-3, 0-4, 1-2, 1-3, 1-4, 2-3, 2-4, 3-4
#sorted_df_list = sort_df([df_0, df_1,df_2, df_3,df_4,df_5,df_6,df_7]) # 0-1, 0-2, 0-3, 0-4, 0-5, 0-6, 0-7, 1-2, 1-3, 1-4, 1-5, 1-6, 1-7, 2-3, 2-4, 2-5, 2-6, 2-7, 3-4, 3-5, 3-6, 3-7, 4-5, 4-6, 4-7, 5-6, 5-7, 6-7
## step 4.
# script to take difference of all the df in the sorted_df_list
def diff_df(df_list):
    diff_df_list = []
    for i in range(len(df_list)):
        for j in range(i+1, len(df_list)):
            diff_df_list.append(df_list[i].subtract(df_list[j]))
            # if the difference result is 0, then print that the df's are the same
            if df_list[i].subtract(df_list[j]).equals(df_list[i].subtract(df_list[j])*0):
                print('df_{} and df_{} are same'.format(i, j))
    return diff_df_list
## step 5.
#### major help is needed here!!!! #####
# if the difference result is 0, then print that the df are same and store the df variable name in a list
## or some way to store the df which are same aka diff is 0
print(diff_df(sorted_df_list))
# # save the difference of all the df in a list
# diff_df_list = diff_df(sorted_df_list)
# print('------------')
# # script to make a list of all df names with all the values as 0
# def zero_df(df_list):
#     zero_df_list = []
#     for df in df_list:
#         if df.equals(df*0):
#             zero_df_list.append(df)
#     return zero_df_list
# print(zero_df(diff_df_list))
As tested on the first few df's, the defined functions work well and report df_0, df_1 and df_3 as equal (their differences are all 0's).
I am seeking help to store the variable names of the df's that are the same.
Also, the logic should hold up for possible edge cases, which can be checked by running it on all 8 of the created df's.
If anyone may have feedback or suggestions for these issues, that would be greatly appreciated. Cheers!
An efficient method could be to hash the DataFrames, then identify the duplicates:
def df_hash(df):
    s = pd.util.hash_pandas_object(df, index=False)  # ignore labels
    return hash(tuple(s))

dfs = [df_0, df_1, df_2, df_3, df_4, df_5, df_6, df_7]  # the (sorted) frames
hashes = [df_hash(d) for d in dfs]
dups = pd.Series(hashes).duplicated(keep=False)
out = pd.Series(dfs)[dups]
len(out)  # or dups.sum()
# 5  (df_0/df_1/df_3 and df_2/df_4 are each duplicated)
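To then get the most frequent group (the stated ultimate goal), you could count the hashes and keep the indices of the most common one; a small sketch I'm adding, reusing the hashes list from above:

from collections import Counter

# the hash that occurs most often marks the largest group of identical files
most_common_hash, count = Counter(hashes).most_common(1)[0]
largest_group = [i for i, h in enumerate(hashes) if h == most_common_hash]
print(largest_group, count)  # [0, 1, 3] 3  ->  df_0, df_1, df_3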
I have a large pandas Series in which each row is a list of numbers.
I want to detect rows that are subsets of other rows and delete them from the Series.
My current solution uses 2 for loops, but it is very slow. Can anyone suggest a faster way?
For example, we must delete rows 2 and 4 in the sample below because they are subsets of rows 1 and 3, respectively.
import pandas as pd
cycles = pd.Series([[1, 2, 3, 4], [3, 4], [5, 6, 9, 7], [5, 9]])
First, you could sort the lists (since they contain numbers) and convert them to strings. Then, for every string, simply check whether it is a substring of any of the other rows; if so, it is a subset. Since everything is sorted, the order of the numbers will not affect this step.
Finally, filter out only the rows that are not identified as a subset.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'cycles': [[9, 5, 4, 3], [9, 5, 4], [2, 4, 3], [2, 3]],
    'members': [4, 3, 3, 2]
})
print(df)
cycles members
0 [9, 5, 4, 3] 4
1 [9, 5, 4] 3
2 [2, 4, 3] 3
3 [2, 3] 2
df['cycles'] = df['cycles'].map(np.sort)
df['cycles_str'] = [','.join(map(str, c)) for c in df['cycles']]
# Here we check if matches are >1, because it will match with itself once!
df['is_subset'] = [df['cycles_str'].str.contains(c_str).sum() > 1 for c_str in df['cycles_str']]
df = df.loc[df['is_subset'] == False]
df = df.drop(['cycles_str', 'is_subset'], axis=1)
cycles members
0 [3, 4, 5, 9] 4
2 [2, 3, 4] 3
Edit - The above doesn't work for [1, 2, 4] & [1, 2, 3, 4]: the first is a subset, but '1,2,4' is not a substring of '1,2,3,4'.
Rewrote the code. This uses 2 loops and sets to check for subsets, via a list comprehension:
# check if >1 True, as it will match with itself once!
df['is_subset'] = [[set(y).issubset(set(x)) for x in df['cycles']].count(True)>1 for y in df['cycles']]
df = df.loc[df['is_subset'] == False]
df = df.drop('is_subset', axis=1)
print(df)
cycles members
0 [9, 5, 4, 3] 4
2 [2, 4, 3] 3
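The same check can be applied straight to the Series from the question; a short sketch using the cycles Series defined there:

import pandas as pd

cycles = pd.Series([[1, 2, 3, 4], [3, 4], [5, 6, 9, 7], [5, 9]])

# a row is a subset of another if it is contained in more than one row,
# since it always matches itself once
is_subset = [sum(set(y).issubset(set(x)) for x in cycles) > 1 for y in cycles]
print(cycles[[not s for s in is_subset]])
# 0    [1, 2, 3, 4]
# 2    [5, 6, 9, 7]
# dtype: object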
I have two pandas df's that do not have the same length. df1 has unique id's in column id, and these id's occur (multiple times) in df2.colA. For each df1.id, I'd like to add a list of all its occurrences in df2.colA, paired with the value of another column at the matching rows (where df1.id == df2.colA), as a new column in df1: either with the index of the match in df2.colA or, additionally, with other row entries of all matches.
Example:
df1.id = [1, 2, 3, 4]
df2.colA = [3, 4, 4, 2, 1, 1]
df2.colB = [5, 9, 6, 5, 8, 7]
So that my operation creates something like:
df1.colAB = [ [[1,8],[1,7]], [[2,5]], [[3,5]], [[4,9],[4,6]] ]
I've tried a bunch of approaches with mapping, explicit looping (super slow), checking with isin, etc.
You could use pandas apply to iterate over each row of df1, building a list of all the matches in df2. Boolean indexing finds the rows of df2 where colA equals the current df1.id, and loc over df2.colB pulls out the matching values. A list comprehension inside the apply then builds the list of [id, value] pairs.
import pandas as pd
# setup
df1 = pd.DataFrame({'id':[1,2,3,4]})
print(df1)
df2 = pd.DataFrame({
    'colA': [3, 4, 4, 2, 1, 1],
    'colB': [5, 9, 6, 5, 8, 7]
})
print(df2)
#code
df1['colAB'] = df1['id'].apply(lambda row:
    [[row, val] for val in df2.loc[df2.colA == row, 'colB']])
print(df1)
Output from df1
id colAB
0 1 [[1, 8], [1, 7]]
1 2 [[2, 5]]
2 3 [[3, 5]]
3 4 [[4, 9], [4, 6]]
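As a side note, a groupby-based variant (a sketch of mine, not part of the original answer) collects the matches once instead of rescanning df2 for every id:

# gather all colB values per colA key in one pass, then map each id to its pairs
lists = df2.groupby('colA')['colB'].apply(list)
df1['colAB'] = df1['id'].map(lambda i: [[i, v] for v in lists.get(i, [])])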
Can't really wrap my head around this problem I'm having:
say I have 2 arrays
A = [2, 7, 4, 3, 9, 4, 2, 6]
B = [1, 1, 1, 4, 4, 7, 7, 7]
What I'm trying to do: if a value is repeated in array B (like how 1 is repeated 3 times), the corresponding values in array A are summed up and appended to another array (say C),
so C would look like this (from the above two arrays):
C = [13, 12, 12]
Also, a side note: the application I'd be using this code for takes timestamps from a database as array B (so once a day has passed, that value in the array obviously won't be repeated).
Any help is appreciated!!
Here is a solution without pandas, using only itertools.groupby:
from itertools import groupby
C = [sum(a for a, _ in g) for _, g in groupby(zip(A, B), key=lambda x: x[1])]
yields:
[13, 12, 12]
I would use pandas for this
Say you put those arrays in a DataFrame. This does the job:
import pandas as pd

df = pd.DataFrame({
    'A': [2, 7, 4, 3, 9, 4, 2, 6],
    'B': [1, 1, 1, 4, 4, 7, 7, 7]
})
df.groupby('B').sum()
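df.groupby('B').sum() returns a DataFrame indexed by B; to get C as a plain list (a small addition, assuming the df above):

C = df.groupby('B')['A'].sum().tolist()
print(C)  # [13, 12, 12]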
If you want a pure Python solution, you can use itertools.groupby:
from itertools import groupby
A = [2, 7, 4, 3, 9, 4, 2, 6]
B = [1, 1, 1, 4, 4, 7, 7, 7]
out = []
for _, g in groupby(zip(A, B), lambda k: k[1]):
    out.append(sum(v for v, _ in g))
print(out)
Prints:
[13, 12, 12]
Considering "b" defined below as a list of dictionaries. How can I remove element 6 from the 'index' in second element of b (b[1]['index'][6]) and save the new list to b?
import pandas as pd
import numpy as np
a = pd.DataFrame(np.random.randn(10))
b = [{'color':'red','index':a.index},{'color':'blue','index':a.index}]
output:
[{'color': 'red', 'index': Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')}, {'color': 'blue', 'index': Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')}]
I tried np.delete as well as the list operations .pop and del (no success), but I do not know what the best way to do it is.
I think this will work for you
import pandas as pd
import numpy as np

a = pd.DataFrame(np.random.randn(10))
print(a)

b = [{'color': 'red', 'index': a.index}, {'color': 'blue', 'index': a.index}]
d = b[1]['index']
# Index.delete returns a new Index with the element at position 6 removed
b[1]['index'] = d.delete(6)
print(b[1]['index'])
Int64Index([0, 1, 2, 3, 4, 5, 7, 8, 9], dtype='int64')
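Worth noting (an added observation, not from the original answer): pandas Index objects are immutable, so d.delete(6) returns a new Index, and the first dict, which still refers to the original a.index, is left untouched:

print(b[0]['index'])
# Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')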