Hello!
I have 400 CSV files, each with exactly 2 columns and around 50,000 rows (this varies from file to file). The goal is to find the files that are exactly the same (there might be multiple distinct groups of identical files), but the ultimate goal is to find the largest group of files with the same data.
The steps I'm trying to implement are as follows:
1. import the CSV files as pandas DataFrames
2. check the shape of the files/DataFrames; only DataFrames with the same shape can be equal, so the ones with different shapes drop out of consideration
3. sort each DataFrame by the first column, keeping the second column aligned
4. take the difference of the sorted DataFrames (if the difference is all zeros, the DataFrames are exactly the same, which is what's needed)
5. store the variable names of the identical DataFrames in a list
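For the real data, steps 1 and 2 might be sketched like this (the file pattern, the two tiny demo files, and headerless CSVs are assumptions of this sketch):

```python
import glob
from collections import defaultdict
import pandas as pd

# create two tiny CSV files so the sketch is self-contained
pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).to_csv('demo_0.csv', index=False, header=False)
pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]}).to_csv('demo_1.csv', index=False, header=False)

# step 1: read every CSV into a dict keyed by filename
dfs = {path: pd.read_csv(path, header=None) for path in sorted(glob.glob('demo_*.csv'))}

# step 2: bucket the frames by shape; only frames with the same shape can be identical
by_shape = defaultdict(list)
for name, df in dfs.items():
    by_shape[df.shape].append(name)

print(dict(by_shape))   # {(2, 2): ['demo_0.csv'], (3, 2): ['demo_1.csv']}
```

Only frames inside the same shape bucket then need the sort-and-compare treatment of steps 3 and 4.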
Here is a dummy setup I'm working on:
import pandas as pd
import numpy as np
## step 1.
# creating random dataframes (implying importing csv files as df)
# keeping these three as same files
df_0 = pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]})
df_1 = pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]})
df_3 = pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]})
# taking these two as same files
df_2 = pd.DataFrame({'a': [3, 2, 2, 1, 1], 'b': [1, 3, 4, 3, 4]})
df_4 = pd.DataFrame({'a': [3, 2, 2, 1, 1], 'b': [1, 3, 4, 3, 4]})
df_5 = pd.DataFrame({'a': [1, 1, 2, 1, 2], 'b': [2, 3, 4, 2, 1]})
# taking a couple of files with a different shape
df_6 = pd.DataFrame({'a': [1, 1, 2, 1, 2, 3], 'b': [2, 3, 4, 2, 1, 2]})
df_7 = pd.DataFrame({'a': [1, 2, 2, 1, 2, 3], 'b': [2, 3, 4, 2, 1, 2]})
### here there are two different sets of identical df's; as described in the ultimate
### goal, the first set, i.e. df_0, df_1, df_3, is the one to pick, since it has the
### most (3) identical df's while the other set has fewer (2)
## step 2. pending!! (will need it for the original data with 400 files)
## step 3.
# function to sort all the df in the list
def sort_df(df_list):
    for df in df_list:
        # ignore_index=True resets the index, so that the later element-wise
        # comparison aligns rows by position rather than by the original labels
        df.sort_values(by=['a'], inplace=True, ignore_index=True)
    return df_list
#print(sort_df([df_0, df_1, df_2, df_3, df_4, df_5]))
# save the sorted df in a list
sorted_df_list = sort_df([df_0, df_1, df_2, df_3, df_4]) # diff_df below compares pairs: 0-1, 0-2, 0-3, 0-4, 1-2, 1-3, 1-4, 2-3, 2-4, 3-4
#sorted_df_list = sort_df([df_0, df_1,df_2, df_3,df_4,df_5,df_6,df_7]) # 0-1, 0-2, 0-3, 0-4, 0-5, 0-6, 0-7, 1-2, 1-3, 1-4, 1-5, 1-6, 1-7, 2-3, 2-4, 2-5, 2-6, 2-7, 3-4, 3-5, 3-6, 3-7, 4-5, 4-6, 4-7, 5-6, 5-7, 6-7
## step 4.
# script to take difference of all the df in the sorted_df_list
def diff_df(df_list):
    diff_df_list = []
    for i in range(len(df_list)):
        for j in range(i + 1, len(df_list)):
            diff = df_list[i].subtract(df_list[j])
            diff_df_list.append(diff)
            # if the difference is all zeros, the two df's are the same
            if (diff == 0).all().all():
                print('df_{} and df_{} are same'.format(i, j))
    return diff_df_list
## step 5.
#### major help is needed here!!!! #####
# if the difference result is 0, then print that the df are same and store the df variable name in a list
## or some way to store the df which are same aka diff is 0
print(diff_df(sorted_df_list))
# # save the difference of all the df in a list
# diff_df_list = diff_df(sorted_df_list)
# print('------------')
# # script to make a list of all df names with all the values as 0
# def zero_df(df_list):
#     zero_df_list = []
#     for df in df_list:
#         if df.equals(df*0):
#             zero_df_list.append(df)
#     return zero_df_list
# print(zero_df(diff_df_list))
As tested on the first few df's, the defined functions work well and report df_0, df_1 and df_3 as the same (their differences are all 0's).
I am seeking help to store the variable names of the df's that are the same.
The logic should also hold for possible exceptions, which can be checked by running it on all 8 of the created df's.
If anyone may have feedback or suggestions for these issues, that would be greatly appreciated. Cheers!
An efficient method could be to hash the DataFrames, then identify the duplicates:
def df_hash(df):
    s = pd.util.hash_pandas_object(df, index=False)  # ignore labels
    return hash(tuple(s))
hashes = [df_hash(d) for d in dfs]
dups = pd.Series(hashes).duplicated(keep=False)
out = pd.Series(dfs)[dups]
len(out) # or dups.sum()
# 4
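Building on that idea, the "most occurring" part can be solved by grouping the variable names by hash; sorting each frame first makes the comparison insensitive to row order. The dict of frames below mirrors the question's dummy setup and is an assumption of this sketch:

```python
from collections import defaultdict
import pandas as pd

def df_hash(df):
    s = pd.util.hash_pandas_object(df, index=False)  # ignore labels
    return hash(tuple(s))

dfs = {
    'df_0': pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]}),
    'df_1': pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]}),
    'df_2': pd.DataFrame({'a': [3, 2, 2, 1, 1], 'b': [1, 3, 4, 3, 4]}),
    'df_3': pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]}),
    'df_4': pd.DataFrame({'a': [3, 2, 2, 1, 1], 'b': [1, 3, 4, 3, 4]}),
}

groups = defaultdict(list)
for name, df in dfs.items():
    # sort by all columns so identical data in a different row order hashes alike
    key = df_hash(df.sort_values(by=list(df.columns), ignore_index=True))
    groups[key].append(name)

largest = max(groups.values(), key=len)
print(largest)   # ['df_0', 'df_1', 'df_3']
```

With 400 files this stays O(n): each frame is hashed once, instead of being compared pairwise against every other frame.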
I have three dataframes that have the same format and I want to simply add the three respective values on top of each other, so that df_new = df1 + df2 + df3. The new df would have the same amount of rows and columns as each old df.
But doing so only appends the columns. I have searched through the docs and there is a lot on merging etc but nothing on adding values. I suppose there must be a one liner for such a basic operation?
Possible solution is the following:
# pip install pandas
import pandas as pd
# set up test dataframes with the same structure but different values
df1 = pd.DataFrame({"col1": [1, 1, 1], "col2": [1, 1, 1]})
df2 = pd.DataFrame({"col1": [2, 2, 2], "col2": [2, 2, 2]})
df3 = pd.DataFrame({"col1": [3, 3, 3], "col2": [3, 3, 3]})
df_new = pd.DataFrame()
for col in df1.columns:
    # numeric addition; mapping to str here would concatenate "123" instead of summing
    df_new[col] = df1[col] + df2[col] + df3[col]
df_new
Returns
   col1  col2
0     6     6
1     6     6
2     6     6
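Since the three frames share the same index and columns, plain element-wise addition also works as a one-liner:

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 1, 1], "col2": [1, 1, 1]})
df2 = pd.DataFrame({"col1": [2, 2, 2], "col2": [2, 2, 2]})
df3 = pd.DataFrame({"col1": [3, 3, 3], "col2": [3, 3, 3]})

# + aligns on index and columns, then adds element-wise
df_new = df1 + df2 + df3
print(df_new)
```

Columns only get "appended" when the frames are combined with pd.concat; with matching labels, + sums the values.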
I have a data frame with two columns. The two columns contain integer numbers. The second column contains numbers that are linked to the first column. In case there is no link between the two columns, the number in the second column will have zero value. Here is an example of the table.
The expected output is a list of connections between the two columns. Using the attached table as an example, the output will be
[[2, 3, 4, 5], [6, 7, 8]]
This question is similar but not the same as finding transitive relation between two columns in pandas.
You could approach this as a graph, treating the dataframe as an edge list. You can then retrieve the connected nodes with networkx:
import pandas as pd
import networkx as nx
df = pd.DataFrame({'a': range(1, 11), 'b': [0, 4, 2, 5, 0, 7, 8, 0, 0, 0]})
g = nx.from_pandas_edgelist(df[df['b'] != 0], source='a', target='b')
print(list(nx.connected_components(g)))
Output:
[{2, 3, 4, 5}, {8, 6, 7}]
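If pulling in networkx is undesirable, here is a sketch of the same idea using scipy's connected_components; the manual node-index mapping is an assumption of this sketch:

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

df = pd.DataFrame({'a': range(1, 11), 'b': [0, 4, 2, 5, 0, 7, 8, 0, 0, 0]})
edges = df[df['b'] != 0]

# map every node id that appears in an edge to a dense index
nodes = pd.unique(edges[['a', 'b']].to_numpy().ravel())
idx = {n: i for i, n in enumerate(nodes)}

row = edges['a'].map(idx).to_numpy()
col = edges['b'].map(idx).to_numpy()
adj = coo_matrix((np.ones(len(row)), (row, col)), shape=(len(nodes), len(nodes)))

# directed=False treats each edge as an undirected connection
_, labels = connected_components(adj, directed=False)
groups = [sorted(nodes[labels == lab].tolist()) for lab in np.unique(labels)]
print(sorted(groups))   # [[2, 3, 4, 5], [6, 7, 8]]
```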
Not really a Pandas answer, but here's one approach (with help from here for finding runs of consecutive integers):
df = pd.DataFrame({'a': range(1, 11),
'b': [0, 4, 2, 5, 0, 7, 8, 0, 0, 0]})
from itertools import groupby
from operator import itemgetter

nonzero_locs = df['b'].to_numpy().nonzero()[0]  # row positions with a non-zero link
connections = []
for k, g in groupby(enumerate(nonzero_locs), lambda x: x[0] - x[1]):
    group = list(map(int, map(itemgetter(1), g)))
    group.append(group[-1] + 1)
    connections.append(list(df['a'][group]))
connections # [[2, 3, 4, 5], [6, 7, 8]]
I have two pandas df's and they do not have the same length. df1 has unique id's in column id. These id's occur (multiple times) in df2.colA. I'd like to add a new column to df1 listing all occurrences of df1.id in df2.colA, together with the value of another column at each matching index where df1.id == df2.colA. Either with the index of the match in df2.colA, or additionally with other row entries of all matches.
Example:
df1.id = [1, 2, 3, 4]
df2.colA = [3, 4, 4, 2, 1, 1]
df2.colB = [5, 9, 6, 5, 8, 7]
So that my operation creates something like:
df1.colAB = [ [[1,8],[1,7]], [[2,5]], [[3,5]], [[4,9],[4,6]] ]
I've tried a bunch of approaches with mapping, looping explicitly (super slow), checking with isin, etc.
You could use pandas apply to iterate over each value of df1.id. For each value, select the rows of df2 where colA matches it and take the corresponding colB entries; a list comprehension inside the apply then builds the list of [id, colB] pairs.
import pandas as pd
# setup
df1 = pd.DataFrame({'id':[1,2,3,4]})
print(df1)
df2 = pd.DataFrame({
'colA' : [3, 4, 4, 2, 1, 1],
'colB' : [5, 9, 6, 5, 8, 7]
})
print(df2)
#code
df1['colAB'] = df1['id'].apply(
    lambda val: [[val, b] for b in df2.loc[df2.colA == val, 'colB']])
print(df1)
Output from df1
id colAB
0 1 [[1, 8], [1, 7]]
1 2 [[2, 5]]
2 3 [[3, 5]]
3 4 [[4, 9], [4, 6]]
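A sketch of an apply-free alternative for the same data: pair up colA/colB row by row, then collect the pairs per colA value. The intermediate 'pair' column name is an assumption of this sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4]})
df2 = pd.DataFrame({'colA': [3, 4, 4, 2, 1, 1],
                    'colB': [5, 9, 6, 5, 8, 7]})

# build one [colA, colB] pair per row, then group the pairs by their colA value
pairs = (df2.assign(pair=df2[['colA', 'colB']].values.tolist())
            .groupby('colA')['pair'].apply(list))
df1['colAB'] = df1['id'].map(pairs)
print(df1)
```

The groupby does the matching once for all ids, instead of scanning df2 once per row of df1.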
Is there a way to sort a dataframe by a combination of different columns, so that rows whose values match in those columns are clustered together? An example is below. Any help is greatly appreciated!
Original DataFrame
Transformed DataFrame
One way to sort pandas dataframe is to use .sort_values().
The code below replicates your sample dataframe:
df= pd.DataFrame({'v1': [1, 3, 2, 1, 4, 3],
'v2': [2, 2, 4, 2, 3, 2],
'v3': [3, 3, 2, 3, 2, 3],
'v4': [4, 5, 1, 4, 2, 5]})
Using the code below, you can sort the dataframe by both columns v1 and v2; in this case, v2 is only used to break ties.
df.sort_values(by=['v1', 'v2'], ascending=True)
The "by" parameter is not limited to two columns, so you can extend the list with more columns in the desired order.
This is the best to match your sort pattern shown in the image.
import pandas as pd
df = pd.DataFrame(dict(
v1=[1,3,2,1,4,3],
v2=[2,2,4,2,3,2],
v3=[3,3,2,3,2,3],
v4=[4,5,1,4,2,5],
))
# Make a temp column to sort the df by (this string-concatenates each row;
# note it is only reliable when all values have the same number of digits)
df['sort'] = df.astype(str).values.sum(axis=1)
# Sort the df by that column, drop it and reset the index
df = df.sort_values(by='sort').drop(columns='sort').reset_index(drop=1)
print(df)
Link you can refer to - Code in Python Tutor
Edit: Zolzaya Luvsandorj's recommendation is better:
import pandas as pd
df = pd.DataFrame(dict(
v1=[1,3,2,1,4,3],
v2=[2,2,4,2,3,2],
v3=[3,3,2,3,2,3],
v4=[4,5,1,4,2,5],
))
df = df.sort_values(by=list(df.columns)).reset_index(drop=1)
print(df)
Link you can refer to - Better code in Python Tutor
I have a dataframe that I am trying to drop some columns from, based on their content. If all of the rows in a column have the same value as one of the items in a list, then I want to drop that column. I am having trouble doing this without messing up the loops. Is there a better way to do this, or some error I can fix? I am getting an error that says:
IndexError: index 382 is out of bounds for axis 0 with size 382
Code:
def trimADAS(df):
    notList = ["Word Recall Test","Result"]
    print("START")
    print(len(df.columns))
    numCols = len(df.columns)
    for h in range(numCols): # for every column
        for i in range(len(notList)): # for every list item
            if df[df.columns[h]].all() == notList[i]: # if all column entries == list item
                print(notList[i]) # print list item
                print(df[df.columns[h]]) # print column
                print(df.columns[h]) # print column name
                df.drop([df.columns[h]], axis = 1, inplace = True) # drop this column
                numCols -= 1
    print("END")
    print(len(df.columns))
    print(df.columns)
    return()
Loopy isn't usually the way to go with pandas. Here's one solution.
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 1],
'B': [2, 2, 2, 2],
'C': [3, 3, 3, 3],
'D': [4, 4, 4, 4],
'E': [5, 5, 5, 5]})
lst = [2, 3, 4]
df = df.drop(columns=[x for x in df if any((df[x] == i).all() for i in lst)])
# A E
# 0 1 5
# 1 1 5
# 2 1 5
# 3 1 5
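The comprehension above can also be spelled out step by step, which may be easier to adapt; the helper list name is an assumption:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1],
                   'B': [2, 2, 2, 2],
                   'C': [3, 3, 3, 3],
                   'D': [4, 4, 4, 4],
                   'E': [5, 5, 5, 5]})
lst = [2, 3, 4]

# a column is dropped when it holds a single value and that value is in lst
to_drop = [c for c in df.columns
           if df[c].nunique() == 1 and df[c].iloc[0] in lst]
df = df.drop(columns=to_drop)
print(df.columns.tolist())   # ['A', 'E']
```

Collecting the names first and dropping once avoids mutating the dataframe while iterating over its columns, which is what caused the IndexError in the original loop.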