Hello!
I have 400 CSV files, each with around 50,000 rows (this varies from file to file) and exactly 2 columns. The goal is to find the files that are exactly the same (there may be multiple distinct groups of identical files), but the ultimate goal is to find the group of identical files that occurs most often.
The steps I'm trying to implement are as follows:
1. import the CSV files as pandas DataFrames
2. check the shape of the files/DataFrames; only if two DataFrames have the same shape do their elements need to be checked for equality (the ones with different shapes drop out of consideration right away)
3. sort each DataFrame by the first column, carrying the corresponding second column along
4. take the difference of the sorted DataFrames (if the difference is all zeros, the DataFrames are exactly the same, which is what is needed)
5. store the variable names of the identical DataFrames in a list
Here is a dummy setup I'm working on:
import pandas as pd
import numpy as np
## step 1.
# creating random dataframes (implying importing csv files as df)
# keeping these three as same files
df_0 = pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]})
df_1 = pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]})
df_3 = pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]})
# taking these two as same files
df_2 = pd.DataFrame({'a': [3, 2, 2, 1, 1], 'b': [1, 3, 4, 3, 4]})
df_4 = pd.DataFrame({'a': [3, 2, 2, 1, 1], 'b': [1, 3, 4, 3, 4]})
df_5 = pd.DataFrame({'a': [1, 1, 2, 1, 2], 'b': [2, 3, 4, 2, 1]})
# taking a couple of files with a different shape
df_6 = pd.DataFrame({'a': [1, 1, 2, 1, 2, 3], 'b': [2, 3, 4, 2, 1, 2]})
df_7 = pd.DataFrame({'a': [1, 2, 2, 1, 2, 3], 'b': [2, 3, 4, 2, 1, 2]})
### here there are two different sets of identical df's; however, as described in the
### ultimate goal, the first set, i.e. df_0, df_1, df_3, is the one to report, since it
### has the most identical df's (3) while the other set has fewer (2).
## step 2. pending!! (will need it for the original data with 400 files)
## step 3.
# function to sort all the df in the list
def sort_df(df_list):
    for df in df_list:
        # sort by 'a' (with 'b' breaking ties); ignore_index=True resets the index
        # so that the later element-wise subtraction compares rows positionally
        # instead of realigning on the original index
        df.sort_values(by=['a', 'b'], inplace=True, ignore_index=True)
    return df_list
#print(sort_df([df_0, df_1, df_2, df_3, df_4, df_5]))
# save the sorted df in a list
sorted_df_list = sort_df([df_0, df_1, df_2, df_3, df_4]) # diff_df below compares the pairs: 0-1, 0-2, 0-3, 0-4, 1-2, 1-3, 1-4, 2-3, 2-4, 3-4
#sorted_df_list = sort_df([df_0, df_1,df_2, df_3,df_4,df_5,df_6,df_7]) # 0-1, 0-2, 0-3, 0-4, 0-5, 0-6, 0-7, 1-2, 1-3, 1-4, 1-5, 1-6, 1-7, 2-3, 2-4, 2-5, 2-6, 2-7, 3-4, 3-5, 3-6, 3-7, 4-5, 4-6, 4-7, 5-6, 5-7, 6-7
## step 4.
# script to take the pairwise difference of all the df in the sorted_df_list
def diff_df(df_list):
    diff_df_list = []
    for i in range(len(df_list)):
        for j in range(i + 1, len(df_list)):
            diff = df_list[i].subtract(df_list[j])
            diff_df_list.append(diff)
            # if the difference is all zeros, print that the df are the same
            # (comparing against 0 directly also rejects the NaNs produced by
            # subtracting df's of different shapes)
            if (diff == 0).all().all():
                print('df_{} and df_{} are same'.format(i, j))
    return diff_df_list
## step 5.
#### major help is needed here!!!! #####
# if the difference result is 0, then print that the df are same and store the df variable name in a list
## or some way to store the df which are same aka diff is 0
print(diff_df(sorted_df_list))
# # save the difference of all the df in a list
# diff_df_list = diff_df(sorted_df_list)
# print('------------')
# # script to make a list of all df names with all the values as 0
# def zero_df(df_list):
#     zero_df_list = []
#     for df in df_list:
#         if df.equals(df*0):
#             zero_df_list.append(df)
#     return zero_df_list
# print(zero_df(diff_df_list))
As tested on the first few df's, the defined functions work well: the pairs among df_0, df_1 and df_3 all produce zero differences.
I am seeking help to store the variable names of the df's that are the same.
Also, the logic should handle possible edge cases, which can be checked by running it on all 8 of the created df's.
If anyone has feedback or suggestions on these issues, it would be greatly appreciated. Cheers!
An efficient method could be to hash the DataFrames, then identify the duplicates:
def df_hash(df):
    s = pd.util.hash_pandas_object(df, index=False)  # ignore labels
    return hash(tuple(s))

# dfs is the list of DataFrames to compare, e.g. [df_0, df_1, ..., df_7]
hashes = [df_hash(d) for d in dfs]
dups = pd.Series(hashes).duplicated(keep=False)
out = pd.Series(dfs)[dups]
len(out)  # or dups.sum()
# 4
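To also get the ultimate goal from the question (the most common group of identical files), one option is to group the DataFrames by their hash. This is a minimal sketch reusing df_hash and dfs from above, and assuming the DataFrames have already been sorted as in the question:
from collections import defaultdict

# group the list positions (stand-ins for file/variable names) by content hash
groups = defaultdict(list)
for i, d in enumerate(dfs):
    groups[df_hash(d)].append('df_{}'.format(i))

# keep only groups with more than one member, largest group first
same = sorted((g for g in groups.values() if len(g) > 1), key=len, reverse=True)
print(same)     # with the dummy data: [['df_0', 'df_1', 'df_3'], ['df_2', 'df_4']]
print(same[0])  # the most frequent group of identical DataFrames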
I have the following dataframe and list:
df = [[[1,2,3],'a'],[[4,5],'b'],[[6,7,8],'c']]
list = [[1,2,3],[4,5]]
And I want to do an inner merge between them, so I can keep the items in common. This would be my result:
df = [[[1,2,3],'a'],[[4,5],'b']]
I have been thinking of converting both to strings, but even after converting my list to a string I haven't been able to merge them, since the merge function requires the inputs to be Series or DataFrames (not strings). Any help would be greatly appreciated!
Thanks
If I understand you correctly, you want to keep only the rows from the dataframe whose values (lists) also appear in the list:
lst = [[1, 2, 3], [4, 5]]
print(df[df["col1"].isin(lst)])
Prints:
col1 col2
0 [1, 2, 3] a
1 [4, 5] b
DataFrame used:
col1 col2
0 [1, 2, 3] a
1 [4, 5] b
2 [6, 7, 8] c
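Note: depending on the pandas version, .isin can raise TypeError for unhashable values such as lists. Under that assumption, a sketch of a workaround is to compare tuples instead (same df and lst as above):
# tuples are hashable, so isin works on them in every pandas version
mask = df["col1"].apply(tuple).isin([tuple(x) for x in lst])
print(df[mask])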
Thanks for your answer!
This is what worked for me:
Convert my list to a Series (the DB I'm matching with):
match = pd.Series(','.join(map(str,match)))
Convert the list of my master DB into a string:
df_temp2['match_s'].loc[m] = ','.join(map(str, df_temp2['match'].loc[m]))
Applied an inner merge on both DB:
df_temp3 = df_temp2.merge(match.rename('match'), how='inner', left_on='match_s', right_on='match')
Hope it also works for somebody else :)
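For reference, a similar inner merge can be done without the string conversion by using tuples as the join key. This is only a sketch with made-up names (df, lst) mirroring the toy data from the question:
import pandas as pd

df = pd.DataFrame({"col1": [[1, 2, 3], [4, 5], [6, 7, 8]], "col2": ["a", "b", "c"]})
lst = [[1, 2, 3], [4, 5]]

# tuples are hashable, so they can act as merge keys where lists cannot
keys = pd.DataFrame({"key": [tuple(x) for x in lst]})
df_matched = (df.assign(key=df["col1"].map(tuple))
                .merge(keys, on="key", how="inner")
                .drop(columns="key"))
print(df_matched)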
I want to select digits 5:8 of a 10-digit number stored as a string, for each row of one column. I have tried indexing in loops, but that seems very tedious. Is there a simpler method?
Small example of the data:
import pandas as pd
data = [[1, 2, '12345678910'], [1, 2, '10987654321'], [1, 2, '11029384756']]
df = pd.DataFrame(data, columns=['Var1', 'Var2', 'Var3'])
Var3 should be manipulated and the outcome should be strings of shorter length:
data = [[1, 2, '5678'], [1, 2, '6543'], [1, 2, '3847']]
df = pd.DataFrame(data, columns = ['Var1', 'Var2', 'Var3'])
You can apply a function to the specific column to take the substring of each value:
import pandas as pd
data = [[1, 2, '12345678910'], [1, 2, '10987654321'], [1, 2, '11029384756']]
df = pd.DataFrame(data, columns=['Var1', 'Var2', 'Var3'])
df['Var3'] = df['Var3'].apply(lambda x: x[4:8])
Note: the slices in your expected output are inconsistent (the first row corresponds to x[4:8], the other two to x[5:9]), but I hope you get the idea.
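For larger frames, the same slice can be done without a Python-level lambda via pandas' string accessor; a one-line sketch on the same column:
df['Var3'] = df['Var3'].str[4:8]  # vectorized; equivalent to .str.slice(4, 8)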
Is there a way to sort a dataframe by a combination of different columns? That is, if specific columns match among rows, those rows get clustered together. An example is below. Any help is greatly appreciated!
Original DataFrame and transformed DataFrame (shown as screenshots in the original post).
One way to sort a pandas dataframe is to use .sort_values().
The code below replicates your sample dataframe:
df = pd.DataFrame({'v1': [1, 3, 2, 1, 4, 3],
                   'v2': [2, 2, 4, 2, 3, 2],
                   'v3': [3, 3, 2, 3, 2, 3],
                   'v4': [4, 5, 1, 4, 2, 5]})
Using the code below, you can sort the dataframe by both columns v1 and v2. In this case, v2 is only used to break ties.
df.sort_values(by=['v1', 'v2'], ascending=True)
"by" parameter here is not limited to any number of variables, so could extend the list to include more variables in desired order.
This should best match the sort pattern shown in your image.
import pandas as pd
df = pd.DataFrame(dict(
    v1=[1, 3, 2, 1, 4, 3],
    v2=[2, 2, 4, 2, 3, 2],
    v3=[3, 3, 2, 3, 2, 3],
    v4=[4, 5, 1, 4, 2, 5],
))
# Make a temp column to sort the df by: concatenate each row's values as strings
# (note: string concatenation sorts lexicographically, so this matches numeric
# order only while all values are single-digit)
df['sort'] = df.astype(str).values.sum(axis=1)
# Sort the df by that column, drop it and reset the index
df = df.sort_values(by='sort').drop(columns='sort').reset_index(drop=True)
print(df)
Link you can refer to: Code in Python Tutor
Edit: Zolzaya Luvsandorj's recommendation is better:
import pandas as pd
df = pd.DataFrame(dict(
    v1=[1, 3, 2, 1, 4, 3],
    v2=[2, 2, 4, 2, 3, 2],
    v3=[3, 3, 2, 3, 2, 3],
    v4=[4, 5, 1, 4, 2, 5],
))
df = df.sort_values(by=list(df.columns)).reset_index(drop=True)
print(df)
Link you can refer to: Better code in Python Tutor
I have a dataframe with one of its columns containing a list at each index. I want to concatenate these lists into one list. I am using
ids = df.loc[0:index, 'User IDs'].values.tolist()
However, this results in ['[1,2,3,4......]'], i.e. each value in my list column is of type str. I have tried converting using list() and literal_eval(), but neither works. list() splits each string into individual characters, e.g. from '[12,13,14,...]' to ['[', '1', '2', ',', '1', '3', ...].
How do I concatenate a pandas column with list values into one list? Kindly help out; I have been banging my head on this for several hours.
consider the dataframe df
df = pd.DataFrame(dict(col1=[[1, 2, 3]] * 2))
print(df)
col1
0 [1, 2, 3]
1 [1, 2, 3]
pandas simplest answer
df.col1.sum()
[1, 2, 3, 1, 2, 3]
numpy.concatenate
np.concatenate(df.col1)
array([1, 2, 3, 1, 2, 3])
chain
from itertools import chain
list(chain(*df.col1))
[1, 2, 3, 1, 2, 3]
response to comments:
I think your columns are strings
from ast import literal_eval
df.col1 = df.col1.apply(literal_eval)
If instead your column is string values that look like lists
df = pd.DataFrame(dict(col1=['[1, 2, 3]'] * 2))
print(df) # will look the same
col1
0 [1, 2, 3]
1 [1, 2, 3]
However pd.Series.sum does not work the same.
df.col1.sum()
'[1, 2, 3][1, 2, 3]'
We need to evaluate the strings as if they are literals and then sum
df.col1.apply(literal_eval).sum()
[1, 2, 3, 1, 2, 3]
If you want to flatten the list, this is a pythonic way to do it:
import pandas as pd

df = pd.DataFrame({'A': [[1, 2, 3], [4, 5, 6]]})
a = df['A'].tolist()
# nested list comprehension: iterate over the inner lists, then over their items
a = [i for j in a for i in j]
print(a)
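As a side note, pandas 0.25+ can flatten such a column with Series.explode; a small sketch on the same frame:
flat = df['A'].explode().tolist()
print(flat)  # [1, 2, 3, 4, 5, 6]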