I want to select digits 5:8 of a 10-digit string number for each row of one column. I have tried indexing in loops, but that seems very tedious. Is there a simpler method?
Small example of the data:
import pandas as pd
data = [[1, 2, '12345678910'], [1, 2, '10987654321'], [1, 2, '11029384756']]
df = pd.DataFrame(data, columns=['Var1', 'Var2', 'Var3'])
Var3 should be manipulated and the outcome should be strings of shorter length:
data = [[1, 2, '5678'], [1, 2, '6543'], [1, 2, '3847']]
df = pd.DataFrame(data, columns = ['Var1', 'Var2', 'Var3'])
You can use the apply function on the specific column and take the substring of each value:
import pandas as pd
data = [[1, 2, '12345678910'], [1, 2, '10987654321'], [1, 2, '11029384756']]
df = pd.DataFrame(data, columns=['Var1', 'Var2', 'Var3'])
df['Var3'] = df['Var3'].apply(lambda x: x[4:8])  # 0-based slice: characters 5 through 8
Note: in your expected output you are not extracting exactly characters 5:8, but I hope you get the idea.
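As a side note, if the column is guaranteed to hold strings, the .str accessor should give the same result without a Python-level lambda (a minimal sketch):
# vectorized alternative: slice each string with the .str accessor
df['Var3'] = df['Var3'].str[4:8]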
Holla!
I have 400 CSV files, each with around 50,000 rows (this varies from file to file) and exactly 2 columns. The goal is to find the files that are exactly the same (there might be multiple distinct groups of identical files), but the ultimate goal is to find the largest group of files with the same data.
The steps I'm trying to implement are listed as follows:
importing the CSV files as pandas dataframes
checking the shape of the files/dataframes: if the shapes of two dataframes are the same, I may then compare their elements for equality (the ones with different shapes drop out of consideration immediately)
sorting each dataframe by its first column, keeping the corresponding second-column values with their rows
taking the difference of the sorted dataframes (if the difference is all zeros, the dataframes are exactly the same, which is what is needed)
storing the variable names of the identical dataframes in a list
Here is a dummy setup I'm working on:
import pandas as pd
import numpy as np
## step 1.
# creating random dataframes (implying importing csv files as df)
# keeping these three as same files
df_0 = pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]})
df_1 = pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]})
df_3 = pd.DataFrame({'a': [4, 1, 2, 1, 1], 'b': [1, 3, 4, 2, 4]})
# taking these two as same files
df_2 = pd.DataFrame({'a': [3, 2, 2, 1, 1], 'b': [1, 3, 4, 3, 4]})
df_4 = pd.DataFrame({'a': [3, 2, 2, 1, 1], 'b': [1, 3, 4, 3, 4]})
df_5 = pd.DataFrame({'a': [1, 1, 2, 1, 2], 'b': [2, 3, 4, 2, 1]})
#taking a couple of files as different shape
df_6 = pd.DataFrame({'a': [1, 1, 2, 1, 2,3], 'b': [2, 3, 4, 2, 1,2]})
df_7 = pd.DataFrame({'a': [1, 2, 2, 1, 2,3], 'b': [2, 3, 4, 2, 1,2]})
### Here there are two different sets of identical df's; as described in the ultimate
### goal, the first set (df_0, df_1, df_3) is the one to be picked, since it has the most
### (3) identical df's, while the other set has fewer (2).
## step 2. pending!! (will need it for the original data with 400 files)
## step 3.
# function to sort all the df in the list
def sort_df(df_list):
    for df in df_list:
        df.sort_values(by=['a'], inplace=True)
    return df_list
#print(sort_df([df_0, df_1, df_2, df_3, df_4, df_5]))
# save the sorted df in a list
sorted_df_list = sort_df([df_0, df_1,df_2, df_3,df_4]) # this performs: 0-1, 0-2, 0-3, 1-2, 1-3, 2-3
#sorted_df_list = sort_df([df_0, df_1,df_2, df_3,df_4,df_5,df_6,df_7]) # 0-1, 0-2, 0-3, 0-4, 0-5, 0-6, 0-7, 1-2, 1-3, 1-4, 1-5, 1-6, 1-7, 2-3, 2-4, 2-5, 2-6, 2-7, 3-4, 3-5, 3-6, 3-7, 4-5, 4-6, 4-7, 5-6, 5-7, 6-7
## step 4.
# script to take difference of all the df in the sorted_df_list
def diff_df(df_list):
    diff_df_list = []
    for i in range(len(df_list)):
        for j in range(i + 1, len(df_list)):
            diff_df_list.append(df_list[i].subtract(df_list[j]))
            # if the difference result is 0, then print that the df are the same and store the df variable name in a list
            if df_list[i].subtract(df_list[j]).equals(df_list[i].subtract(df_list[j]) * 0):
                print('df_{} and df_{} are same'.format(i, j))
    return diff_df_list
## step 5.
#### major help is needed here!!!! #####
# if the difference result is 0, then print that the df are same and store the df variable name in a list
## or some way to store the df which are same aka diff is 0
print(diff_df(sorted_df_list))
# # save the difference of all the df in a list
# diff_df_list = diff_df(sorted_df_list)
# print('------------')
# # script to make a list of all df names with all the values as 0
# def zero_df(df_list):
#     zero_df_list = []
#     for df in df_list:
#         if df.equals(df*0):
#             zero_df_list.append(df)
#     return zero_df_list
# print(zero_df(diff_df_list))
As tested on the first four df's, the defined functions work well and identify df_0, df_1 and df_3 as giving all-zero differences.
I am seeking help with storing the variable names of the df's that are the same.
Also, the logic should handle possible edge cases, which can be checked by including all 8 of the created df's.
If anyone has feedback or suggestions on these issues, that would be greatly appreciated. Cheers!
An efficient method could be to hash the DataFrames, then identify the duplicates:
def df_hash(df):
    s = pd.util.hash_pandas_object(df, index=False)  # ignore index labels
    return hash(tuple(s))
hashes = [df_hash(d) for d in dfs]
dups = pd.Series(hashes).duplicated(keep=False)
out = pd.Series(dfs)[dups]
len(out) # or dups.sum()
# 4
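To also get the ultimate goal (the largest group of identical files), one possible sketch is to count the hashes and pull out the positions that share the most common one. This assumes dfs and hashes from above, and that identical files also have identical row order:
from collections import Counter

# count how many dataframes share each hash
counts = Counter(hashes)
most_common_hash, group_size = counts.most_common(1)[0]

# positions in dfs that belong to the largest group of identical dataframes
largest_group = [i for i, h in enumerate(hashes) if h == most_common_hash]
print(largest_group, group_size)  # e.g. the positions of df_0, df_1, df_3 and a count of 3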
I have three dataframes that have the same format, and I want to simply add the three respective values on top of each other, so that df_new = df1 + df2 + df3. The new df would have the same number of rows and columns as each old df.
But doing so only appends the columns. I have searched through the docs and there is a lot on merging etc., but nothing on adding values. I suppose there must be a one-liner for such a basic operation?
A possible solution is the following:
# pip install pandas
import pandas as pd
# set up test dataframes with the same structure but different values
df1 = pd.DataFrame({"col1": [1, 1, 1], "col2": [1, 1, 1],})
df2 = pd.DataFrame({"col1": [2, 2, 2], "col2": [2, 2, 2],})
df3 = pd.DataFrame({"col1": [3, 3, 3], "col2": [3, 3, 3],})
df_new = pd.DataFrame()
for col in list(df1.columns):
    # combine the three values column by column as strings
    df_new[col] = df1[col].map(str) + df2[col].map(str) + df3[col].map(str)
df_new
Returns:
  col1 col2
0  123  123
1  123  123
2  123  123
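Note that this concatenates the values as strings. If what is wanted is the numeric sum per cell, element-wise addition should work directly, assuming the three frames share the same index and columns (a minimal sketch):
# element-wise numeric sum; requires matching indexes and columns
df_new = df1 + df2 + df3
# or, equivalently
df_new = df1.add(df2).add(df3)
print(df_new)  # every cell becomes 1 + 2 + 3 = 6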
Is there a way to sort a dataframe by a combination of different columns? That is, if specific column values match among rows, those rows will be clustered together. An example is below. Any help is greatly appreciated!
Original DataFrame
Transformed DataFrame
One way to sort a pandas dataframe is to use .sort_values().
The code below replicates your sample dataframe:
df = pd.DataFrame({'v1': [1, 3, 2, 1, 4, 3],
                   'v2': [2, 2, 4, 2, 3, 2],
                   'v3': [3, 3, 2, 3, 2, 3],
                   'v4': [4, 5, 1, 4, 2, 5]})
Using the code below, you can sort the dataframe by both columns v1 and v2. In this case, v2 is only used to break ties.
df.sort_values(by=['v1', 'v2'], ascending=True)
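For the sample dataframe above, this should cluster the matching rows together, giving something like (the relative order of tied index labels may vary):
   v1  v2  v3  v4
0   1   2   3   4
3   1   2   3   4
2   2   4   2   1
1   3   2   3   5
5   3   2   3   5
4   4   3   2   2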
"by" parameter here is not limited to any number of variables, so could extend the list to include more variables in desired order.
This should best match the sort pattern shown in your image.
import pandas as pd
df = pd.DataFrame(dict(
    v1=[1, 3, 2, 1, 4, 3],
    v2=[2, 2, 4, 2, 3, 2],
    v3=[3, 3, 2, 3, 2, 3],
    v4=[4, 5, 1, 4, 2, 5],
))
# Make a temp column to sort the df by
df['sort'] = df.astype(str).values.sum(axis=1)
# Sort the df by that column, drop it and reset the index
df = df.sort_values(by='sort').drop(columns='sort').reset_index(drop=1)
print(df)
A link you can refer to: Code in Python Tutor.
Edit: Zolzaya Luvsandorj's recommendation is better:
import pandas as pd
df = pd.DataFrame(dict(
    v1=[1, 3, 2, 1, 4, 3],
    v2=[2, 2, 4, 2, 3, 2],
    v3=[3, 3, 2, 3, 2, 3],
    v4=[4, 5, 1, 4, 2, 5],
))
df = df.sort_values(by=list(df.columns)).reset_index(drop=1)
print(df)
A link you can refer to: Better code in Python Tutor.
I have a dataframe with one of its columns holding a list at each index. I want to concatenate these lists into one list. I am using
ids = df.loc[0:index, 'User IDs'].values.tolist()
However, this results in
['[1,2,3,4......]'], i.e. a list containing one string. Somehow each value in my list column is of type str. I have tried converting with list() and literal_eval(), but it does not work. list() splits each string into its individual characters, e.g. from '[12,13,14...]' to ['[', '1', '2', ',', '1', '3', ...].
How do I concatenate a pandas column with list values into one list? Kindly help out; I have been banging my head on this for several hours.
consider the dataframe df
df = pd.DataFrame(dict(col1=[[1, 2, 3]] * 2))
print(df)
col1
0 [1, 2, 3]
1 [1, 2, 3]
pandas simplest answer
df.col1.sum()
[1, 2, 3, 1, 2, 3]
numpy.concatenate
np.concatenate(df.col1)
array([1, 2, 3, 1, 2, 3])
chain
from itertools import chain
list(chain(*df.col1))
[1, 2, 3, 1, 2, 3]
response to comments:
I think your column values are strings; convert them with literal_eval:
from ast import literal_eval
df.col1 = df.col1.apply(literal_eval)
If instead your column contains string values that look like lists:
df = pd.DataFrame(dict(col1=['[1, 2, 3]'] * 2))
print(df) # will look the same
col1
0 [1, 2, 3]
1 [1, 2, 3]
However, pd.Series.sum does not work the same way here:
df.col1.sum()
'[1, 2, 3][1, 2, 3]'
We need to evaluate the strings as if they are literals and then sum
df.col1.apply(literal_eval).sum()
[1, 2, 3, 1, 2, 3]
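On pandas 0.25 and later, Series.explode is another option worth mentioning; a small sketch for the list-valued column (for the string case, apply literal_eval first):
# explode turns each list into one row per element, then collect the values
df.col1.explode().tolist()
# [1, 2, 3, 1, 2, 3]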
If you want to flatten the list, this is a Pythonic way to do it:
import pandas as pd
df = pd.DataFrame({'A': [[1,2,3], [4,5,6]]})
a = df['A'].tolist()            # list of lists
a = [i for j in a for i in j]   # flatten into a single list
print(a)
Ideally, I want to be able to do something like:
cols = ['A', 'B', 'C']
df = pandas.DataFrame(index=range(5), columns=cols)
df.get_column(cols[0]) = [1, 2, 3, 4, 5]
What is the pythonic/pandonic way to do this?
Edit: I know that I can access the column 'A' by df.A, but in general I do not know what the column names are.
You do not need to store what columns a DataFrame has separately.
You can find out what columns exist in a pandas DataFrame by accessing the DataFrame.columns variable.
To access the Series attached to a particular column, you can use the DataFrame's item access ([]).
Tiny example:
col = df.columns[0]
df[col] = [1, 2, 3, 4, 5]
Okay, this is particularly straightforward.
df[cols[0]] = [1, 2, 3, 4, 5]
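Putting it together with the setup from the question, a small runnable sketch:
import pandas as pd

cols = ['A', 'B', 'C']
df = pd.DataFrame(index=range(5), columns=cols)

# assign to the first column by position, without knowing its name in advance
df[df.columns[0]] = [1, 2, 3, 4, 5]
print(df)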