Holla!
I have 400 csv files, each with around 50,000 rows (this varies from file to file) and exactly 2 columns. The goal is to find the files that are exactly the same (there might be several distinct groups of identical files), and ultimately to find the largest group of files containing the same data.
The steps I'm trying to implement are listed as follows:
1. importing the csv files as pandas df
2. checking the shape of the files/dataframes: if the shapes of two df are the same, their elements can be checked for equality (the ones with different shapes drop out of consideration right away)
3. sorting each df by its first column, keeping the corresponding second-column values with their rows
4. taking the difference of the sorted dataframes (if the difference is all 0, the df are exactly the same, which is what is needed)
5. storing the variable names of the identical dataframes in a list
Here is a dummy setup I'm working on:
import pandas as pd
import numpy as np
## step 1.
# creating random dataframes (implying importing csv files as df)
# keeping these three as same files
df_0 = pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]})
df_1 = pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]})
df_3 = pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]})
# taking these two as same files
df_2 = pd.DataFrame({'a': [3, 2, 2, 1, 1], 'b': [1, 3, 4, 3, 4]})
df_4 = pd.DataFrame({'a': [3, 2, 2, 1, 1], 'b': [1, 3, 4, 3, 4]})
df_5 = pd.DataFrame({'a': [1, 1, 2, 1, 2], 'b': [2, 3, 4, 2, 1]})
#taking a couple of files as different shape
df_6 = pd.DataFrame({'a': [1, 1, 2, 1, 2,3], 'b': [2, 3, 4, 2, 1,2]})
df_7 = pd.DataFrame({'a': [1, 2, 2, 1, 2,3], 'b': [2, 3, 4, 2, 1,2]})
###here there are two different sets of same df's, however as described in ultimate
###goal, the first set i.e. df_0, df_1, df_3 is to be considered since it has most number
###of (3) same df's and the other set has less (2).
## step 2. pending!! (will need it for the original data with 400 files)
## step 3.
# function to sort all the df in the list
def sort_df(df_list):
    for df in df_list:
        df.sort_values(by=['a'], inplace=True)
    return df_list
#print(sort_df([df_0, df_1, df_2, df_3, df_4, df_5]))
# save the sorted df in a list
sorted_df_list = sort_df([df_0, df_1,df_2, df_3,df_4]) # this performs: 0-1, 0-2, 0-3, 0-4, 1-2, 1-3, 1-4, 2-3, 2-4, 3-4
#sorted_df_list = sort_df([df_0, df_1,df_2, df_3,df_4,df_5,df_6,df_7]) # 0-1, 0-2, 0-3, 0-4, 0-5, 0-6, 0-7, 1-2, 1-3, 1-4, 1-5, 1-6, 1-7, 2-3, 2-4, 2-5, 2-6, 2-7, 3-4, 3-5, 3-6, 3-7, 4-5, 4-6, 4-7, 5-6, 5-7, 6-7
## step 4.
# script to take difference of all the df in the sorted_df_list
def diff_df(df_list):
    diff_df_list = []
    for i in range(len(df_list)):
        for j in range(i+1, len(df_list)):
            diff_df_list.append(df_list[i].subtract(df_list[j]))
            # if the difference result is 0, then print that the df are same and store the df variable name in a list
            if df_list[i].subtract(df_list[j]).equals(df_list[i].subtract(df_list[j])*0):
                print('df_{} and df_{} are same'.format(i,j))
    return diff_df_list
## step 5.
#### major help is needed here!!!! #####
# if the difference result is 0, then print that the df are same and store the df variable name in a list
## or some way to store the df which are same aka diff is 0
print(diff_df(sorted_df_list))
# # save the difference of all the df in a list
# diff_df_list = diff_df(sorted_df_list)
# print('------------')
# # script to make a list of all df names with all the values as 0
# def zero_df(df_list):
#     zero_df_list = []
#     for df in df_list:
#         if df.equals(df*0):
#             zero_df_list.append(df)
#     return zero_df_list
# print(zero_df(diff_df_list))
When tested on the first few df's, the defined functions work well: the differences among df_0, df_1 and df_3 come out as all 0's.
I am seeking help with storing the variable names of the df's that are the same.
Also, the logic should hold up for possible exceptions, which can be checked by including all 8 of the created df's.
If anyone may have feedback or suggestions for these issues, that would be greatly appreciated. Cheers!
An efficient method could be to hash the DataFrames, then identify the duplicates:
def df_hash(df):
    s = pd.util.hash_pandas_object(df, index=False) # ignore labels
    return hash(tuple(s))
# dfs is the list of DataFrames to compare
hashes = [df_hash(d) for d in dfs]
dups = pd.Series(hashes).duplicated(keep=False)
out = pd.Series(dfs)[dups]
len(out) # or dups.sum()
# 4
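To also keep the names of the identical DataFrames and pick out the largest group, the hashing idea can be extended along the following lines. This is only a sketch: `canonical_hash`, `frames`, `groups`, `duplicate_groups` and `largest_group` are names made up for illustration, and it assumes that two files count as "the same" when they hold the same rows regardless of row order, which is why every frame is sorted on all of its columns first. The shape is included in the key so that frames of different shapes never match. Hash collisions are very unlikely but possible, so a final `DataFrame.equals` check on the sorted frames within a group may be worth adding.

```python
from collections import defaultdict

import pandas as pd


def canonical_hash(df):
    """Return a key that is equal for frames holding the same rows (hypothetical helper)."""
    # Sort on every column so that row order does not influence the hash,
    # and drop the old index labels so they do not interfere.
    ordered = df.sort_values(by=list(df.columns)).reset_index(drop=True)
    row_hashes = pd.util.hash_pandas_object(ordered, index=False)
    # Including the shape means frames of different shapes can never share a key.
    return (ordered.shape, hash(tuple(row_hashes)))


# Stand-ins for the imported csv files, keyed by name (same data as in the question).
frames = {
    'df_0': pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]}),
    'df_1': pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]}),
    'df_2': pd.DataFrame({'a': [3, 2, 2, 1, 1], 'b': [1, 3, 4, 3, 4]}),
    'df_3': pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]}),
    'df_4': pd.DataFrame({'a': [3, 2, 2, 1, 1], 'b': [1, 3, 4, 3, 4]}),
    'df_5': pd.DataFrame({'a': [1, 1, 2, 1, 2], 'b': [2, 3, 4, 2, 1]}),
    'df_6': pd.DataFrame({'a': [1, 1, 2, 1, 2, 3], 'b': [2, 3, 4, 2, 1, 2]}),
    'df_7': pd.DataFrame({'a': [1, 2, 2, 1, 2, 3], 'b': [2, 3, 4, 2, 1, 2]}),
}

# Collect the names of all frames that share the same key.
groups = defaultdict(list)
for name, df in frames.items():
    groups[canonical_hash(df)].append(name)

# Keep only groups with more than one member, largest group first.
duplicate_groups = sorted(
    (names for names in groups.values() if len(names) > 1),
    key=len,
    reverse=True,
)
largest_group = duplicate_groups[0] if duplicate_groups else []

print(duplicate_groups)  # [['df_0', 'df_1', 'df_3'], ['df_2', 'df_4']]
print(largest_group)     # ['df_0', 'df_1', 'df_3']
```

With real files, the dictionary could be filled from `pd.read_csv` using the file name as the key, so the result lists file names directly.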
I have three dataframes that have the same format and I want to simply add the three respective values on top of each other, so that df_new = df1 + df2 + df3. The new df would have the same amount of rows and columns as each old df.
But doing so only appends the columns. I have searched through the docs and there is a lot on merging etc but nothing on adding values. I suppose there must be a one liner for such a basic operation?
Possible solution is the following:
# pip install pandas
import pandas as pd
#set test dataframes with same structure but diff values
df1 = pd.DataFrame({"col1": [1, 1, 1], "col2": [1, 1, 1],})
df2 = pd.DataFrame({"col1": [2, 2, 2], "col2": [2, 2, 2],})
df3 = pd.DataFrame({"col1": [3, 3, 3], "col2": [3, 3, 3],})
df_new = pd.DataFrame()
for col in list(df1.columns):
df_new[col] = df1[col].map(str) + df2[col].map(str) + df3[col].map(str)
df_new
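This returns a DataFrame in which each cell holds the three source values joined together as a string ('123' in every cell here), rather than their numeric sum.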
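If the goal is instead the element-wise numeric sum, a minimal sketch could look like the following. It assumes the three frames hold numeric values and share the same index and column labels; `frames` and `df_sum_alt` are names introduced here for illustration. If `+` seems to glue values together rather than add them, the columns may be stored as strings and would need converting (for example with `astype`) first.

```python
from functools import reduce

import pandas as pd

# Test frames with the same structure but different values.
df1 = pd.DataFrame({"col1": [1, 1, 1], "col2": [1, 1, 1]})
df2 = pd.DataFrame({"col1": [2, 2, 2], "col2": [2, 2, 2]})
df3 = pd.DataFrame({"col1": [3, 3, 3], "col2": [3, 3, 3]})

# Element-wise addition; pandas aligns on index and column labels.
df_new = df1 + df2 + df3

# The same idea for an arbitrary list of frames.
frames = [df1, df2, df3]
df_sum_alt = reduce(pd.DataFrame.add, frames)

print(df_new)
```

Every cell of `df_new` is 6 here (1 + 2 + 3), and the result has the same rows and columns as each input.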
I want to select digits 5:8 of a 10-digit number stored as a string, for each row of one column. I have tried indexing in loops but that seems very tedious. Is there a simpler method?
Small example of the data:
import pandas as pd
data = [[1, 2, '12345678910'], [1, 2, '10987654321'], [1, 2, '11029384756']]
df = pd.DataFrame(data, columns = ['Var1', 'Var2', 'Var3'])
Var3 should be manipulated and the outcome should be strings of shorter length:
data = [[1, 2, '5678'], [1, 2, '6543'], [1, 2, '3847']]
df = pd.DataFrame(data, columns = ['Var1', 'Var2', 'Var3'])
You can use apply on the specific column to take the substring of each value:
import pandas as pd
data = [[1, 2, '12345678910'], [1, 2, '10987654321'], [1, 2, '11029384756']]
df = pd.DataFrame(data, columns=['Var1', 'Var2', 'Var3'])
df['Var3'] = df['Var3'].apply(lambda x: x[4:8])
Note: In your result you are not extracting the 5:8 characters, but I hope you get the idea.
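As an alternative sketch, the vectorized `.str` accessor can slice every string in the column without a lambda. The column names `Var3_4_8` and `Var3_5_9` are made up for this example, and the slice bounds are zero-based with an exclusive end, so they may need adjusting to whichever digits are actually wanted.

```python
import pandas as pd

data = [[1, 2, '12345678910'], [1, 2, '10987654321'], [1, 2, '11029384756']]
df = pd.DataFrame(data, columns=['Var1', 'Var2', 'Var3'])

# Characters at zero-based positions 4..7 of every string.
df['Var3_4_8'] = df['Var3'].str[4:8]

# The same kind of slice written with str.slice, shifted by one position.
df['Var3_5_9'] = df['Var3'].str.slice(5, 9)

print(df[['Var3_4_8', 'Var3_5_9']])
#   Var3_4_8 Var3_5_9
# 0     5678     6789
# 1     7654     6543
# 2     9384     3847
```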
I have a pandas dataframe df. Let's say I wanted to share df with you guys here so that you could easily recreate df in your own notebook.
Is there a command or function that will generate the pandas dataframe create statement? I realize that for a lot of data the statement would be quite large seeing that it must include the actual data, so a header would be ideal.
Essentially, a command that I can run on df and get something like this:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
or
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])
I'm not sure how to even phrase this question. Like taking a dataframe and deconstructing the top 5 rows or something?
We usually use read_clipboard:
pd.read_clipboard()
Out[328]:
   col1  col2
0     1     3
1     2     4
Or, if you have the df, save it as a dict so that it can easily be converted back into the sample that is needed:
df.head(5).to_dict()
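A short sketch of the round trip, assuming a small frame with a default integer index: the dictionary produced by `to_dict()` can be pasted into a `pd.DataFrame(...)` call to rebuild the sample. The names `sample_dict`, `create_statement` and `rebuilt` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Nested dict of the form {column: {index_label: value}} for the first rows.
sample_dict = df.head(5).to_dict()

# Text that can be pasted into a post or notebook to recreate the sample.
create_statement = f"df = pd.DataFrame({sample_dict!r})"
print(create_statement)
# df = pd.DataFrame({'col1': {0: 1, 1: 2}, 'col2': {0: 3, 1: 4}})

# Rebuilding from the dict gives back the same values.
rebuilt = pd.DataFrame(sample_dict)
print(rebuilt)
```

For values such as timestamps, the printed dict may need extra imports on the receiving side before it can be evaluated.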
Ideally, I want to be able to do something like:
cols = ['A', 'B', 'C']
df = pandas.DataFrame(index=range(5), columns=cols)
df.get_column(cols[0]) = [1, 2, 3, 4, 5]
What is the pythonic/pandonic way to do this?
Edit: I know that I can access the column 'A' by df.A, but in general I do not know what the column names are.
You do not need to store what columns a DataFrame has separately.
You can find out what columns exist in a pandas DataFrame by accessing the DataFrame.columns attribute.
To access the Series attached to a particular column, you can use the __getitem__ method of the DataFrame, i.e. the [] operator.
Tiny example:
col = df.columns[0]
df[col] = [1, 2, 3, 4, 5]
Okay, this is particularly straightforward.
df[cols[0]] = [1, 2, 3, 4, 5]
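For the case where the column names are not known in advance, a small sketch is to loop over `df.columns` and use each label with `[]`. The values assigned here are arbitrary placeholders chosen for the example.

```python
import pandas as pd

cols = ['A', 'B', 'C']
df = pd.DataFrame(index=range(5), columns=cols)

# Assign to each column by its label without hard-coding any name.
for position, name in enumerate(df.columns):
    df[name] = [position * 10 + k for k in range(5)]

# Read a column back the same way, again using only the label.
first_column = df[df.columns[0]]

print(df)
```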