Efficiently split a pandas dataframe based on combinations of one column's values - python

Let's say I have a dataframe with one column that has 3 unique values:
import pandas as pd
df = pd.DataFrame(['a', 'b', 'c'], columns=['string'])
df
I want to split this dataframe into smaller dataframes such that each one contains 2 unique values. In the above case I need C(3, 2) = 3 dataframes: df1 - [a b], df2 - [a c], df3 - [b c]. My current implementation:
import itertools
for i in itertools.combinations(df.string.values, 2):
    print(df[df.string.isin(i)], '\n')
I am looking for something like groupby in pandas, because subsetting the data inside the loop is time-consuming. In one sample case I had 609 unique values and the loop took around 3 minutes to complete. So I am looking for an optimized way to perform the same operation, as the number of unique values may shoot up into the thousands in real scenarios.

It will be slow because you're creating 370k dataframes. If all of them are supposed to hold only two values, why do they need to be dataframes?
df = pd.DataFrame({'x': range(100)})
df['key'] = 1  # constant key turns the merge into a cross join
# cross join df with itself, then turn each row into a dict
records = df.merge(df, on='key').drop('key', axis=1).to_dict('records')  # 'r' shorthand is deprecated
[pd.Series(x) for x in records]
You will see that records is computed quite fast, but it then takes a few minutes to generate all of these Series objects.
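If grouped access to each pair is the actual goal, rather than thousands of separate DataFrame objects, one possible alternative is to build a single "long" frame with a group id per combination and iterate it with groupby, so the full dataframe is never re-scanned per pair. A sketch (pair_id and long_df are illustrative names, not from the question):
import itertools
import pandas as pd

df = pd.DataFrame(['a', 'b', 'c'], columns=['string'])

pairs = itertools.combinations(df['string'].unique(), 2)
long_df = pd.DataFrame(
    [(pair_id, value) for pair_id, pair in enumerate(pairs) for value in pair],
    columns=['pair_id', 'string'],
)
# each group holds one 2-value combination
for pair_id, sub in long_df.groupby('pair_id'):
    print(sub, '\n')
Unlike the isin filter, which scans the whole dataframe once per combination, this touches each row of the long frame only once.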

Related

Combine duplicate rows in Pandas

I have a dataframe where some rows contain almost duplicate values. I'd like to combine these rows as much as possible to reduce the row count. Let's say I have the following dataframe:
One  Two  Three
A    B    C
B    B    B
C    A    B
In this example I'd like the output to be:
One  Two  Three
ABC  AB   CB
The real dataframe has thousands of rows and eight columns.
The CSV from a sample of the dataframe:
Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8
A,A,A,A,A,A,A,A
A,A,A,A,A,A,A,B
A,A,A,A,A,A,A,C
A,A,A,A,A,A,B,A
A,A,A,A,A,A,B,B
A,A,A,A,A,A,B,C
A,A,A,A,A,A,C,A
A,A,A,A,A,A,C,B
A,A,A,A,A,A,C,C
C,C,C,C,C,C,A,A
C,C,C,C,C,C,A,B
C,C,C,C,C,C,A,C
C,C,C,C,C,C,B,A
C,C,C,C,C,C,B,B
C,C,C,C,C,C,B,C
C,C,C,C,C,C,C,A
C,C,C,C,C,C,C,B
C,C,C,C,C,C,C,C
To show more clearly what the desired outcome would look like:
Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8
AC,AC,AC,AC,AC,AC,ABC,ABC
I've tried some code but I end up with really long code snippets, which I doubt is the best and most natural solution. Any suggestions?
If your data are all characters, you can collapse everything to a single row with this approach:
import pandas as pd
data = pd.read_csv("path/to/data")
# sum() concatenates the strings in each column; apply() then keeps the unique characters
# (note: apply, not applymap -- sum() returns a Series; sorted() makes the order deterministic)
collapsed = data.astype(str).sum().apply(lambda x: ''.join(sorted(set(x))))
Check this answer on how to get unique characters in a string.
You can use something like this:
df = df.groupby('Two')[['One', 'Three']].agg(''.join).reset_index()
If you can provide a small bit of code that creates the first df it'd be easier to try out solutions.
Also this other post may help: pandas - Merge nearly duplicate rows based on column value
EDIT:
Does this get you the output you're looking for?
joined_df = df.apply(''.join, axis=0)  # concatenate each column's values into one string
A variation of this: Concatenate all columns in a pandas dataframe
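Putting the pieces together, a minimal end-to-end sketch, assuming data holds the sample CSV above (sorted() is used so the character order comes out deterministic):
import pandas as pd

data = pd.read_csv("path/to/data")  # the sample CSV from the question

# join each column into one string, keep the unique characters, sort for a stable order
collapsed = data.apply(lambda col: ''.join(sorted(set(''.join(col)))))
print(collapsed.tolist())
# ['AC', 'AC', 'AC', 'AC', 'AC', 'AC', 'ABC', 'ABC']
This matches the desired single-row output shown above.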

Iterate over two different dataframes efficiently on a specific column and store only the common rows

I have two dataframes as shown below.
import databricks.koalas as ks

input_data = ks.DataFrame({'code': ['123a', '345b', '678c'],
                           'id': [1, 2, 3]})
my_data = ks.DataFrame({'code': ['123a', '12a', '678c'],
                        'id': [7, 8, 9],
                        'stype': ['A', 'E', '5']})
Both dataframes have a column called code. I want to find the values in code that exist in my_data and also exist in input_data, and store them in a resulting dataframe called output. The output dataframe should contain only the code values that are present in input_data. The number of columns in each dataframe can differ; I have just shown a sample here.
Based on the sample provided in this question, the output dataframe would look as follows:
display(output)
# Result is below
code    id
123a    7
I found solutions online that mostly use for loops, but I was wondering if there is a more efficient way to approach this.
Thank you all!
You can try an inner merge on the two dataframes, then keep just the columns you want from the result.
For example,
# both frames have an 'id' column, so suffix the overlapping columns from input_data
df_new = my_data.merge(input_data, on='code', suffixes=('', '_input'))
df_new = df_new[['code', 'id']]  # 'id' is the id column from my_data
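Alternatively, a boolean isin filter avoids the merge entirely. A sketch (in koalas, Series.isin expects a plain list-like, so the codes are collected to the driver with to_numpy() first; that is fine when input_data is small):
# keep only the rows of my_data whose 'code' also occurs in input_data
codes = input_data['code'].to_numpy()
output = my_data[my_data['code'].isin(list(codes))][['code', 'id']]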

Equality in python dataframe

I'm performing some operations on a df of 4000 columns and 17520 rows. I have to repeat these operations 100 times with 5 different randomly selected columns from the df. I am using the following loop:
for i in range(100):
    rand_cols = np.random.permutation(df.columns)[:5]  # pick 5 random columns
    df2 = df[rand_cols]
    df2.loc[:, :] *= 2  # note: df2[:, :] is not valid DataFrame indexing; .loc is needed
My question is the following: does the operation on df2, which holds the 5 random columns of df, affect those columns in the original df?
Thanks
No, it doesn't. As Valentino suggested in the comments, if you try it with a dummy DataFrame, you can see it doesn't change:
df = pd.DataFrame({'c': range(50)})
df2 = df.loc[df['c'] % 2 == 0, :]
df2 *= 10
If you look at df, you can see it didn't change.
The reason is that df2 holds a copy of the selected data, not a view into df itself.
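A small sketch (plain pandas) to make the copy-versus-view point concrete: modifying the selection leaves df alone, while writing through .loc on df itself does mutate it:
import pandas as pd

df = pd.DataFrame({'c': range(6)})

df2 = df.loc[df['c'] % 2 == 0, :]  # boolean selection returns a copy
df2 *= 10
print(df['c'].tolist())  # [0, 1, 2, 3, 4, 5] -- unchanged

df.loc[df['c'] % 2 == 0, 'c'] *= 10  # assigning through .loc on df itself
print(df['c'].tolist())  # [0, 1, 20, 3, 40, 5] -- mutated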

Remove duplicates in DataFrames (peers)

I'm trying to remove duplicate elements in a dataframe. This DataFrame comes from calculating the distance between each pair in a given list of geocoordinates. As you can see in the following DataFrame, the data is duplicated (each pair appears in both orders), but I can't simply deduplicate on 'dist', because in other cases two different pairs might share a distance of 0 or 1, and important data would then be discarded.
import pandas as pd
df = pd.DataFrame({'Name_x': ['a', 'b', 'c', 'd'],
                   'Name_y': ['b', 'a', 'd', 'c'],
                   'Latitude_x': ['lat_a', 'lat_b', 'lat_c', 'lat_d'],
                   'Longitude_x': ['long_a', 'long_b', 'long_c', 'long_d'],
                   'Latitude_y': ['lat_b', 'lat_a', 'lat_d', 'lat_c'],
                   'Longitude_y': ['long_b', 'long_a', 'long_d', 'long_c'],
                   'dist': [0, 0, 1, 1]})
df
In this case I would like to keep the rows with Name_x: ['a', 'c'] and Name_y: ['b', 'd'], together with the corresponding geocoordinates Latitude_x: ['lat_a', 'lat_c'], Latitude_y: ['lat_b', 'lat_d'], Longitude_x: ['long_a', 'long_c'] and Longitude_y: ['long_b', 'long_d'].
I'm not sure if you want this:
mask = df['Name_x'].eq(df['Name_y'].shift())  # True where a row mirrors the previous one
df.loc[~mask]  # your "unique" rows (note the ~; without it you get the mirrored duplicates)
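The shift-based check only works when the mirrored rows sit next to each other. A more order-independent sketch (assuming each pair appears exactly twice, once in each direction) builds a key from the unordered name pair and drops duplicated keys:
# frozenset ignores the order of (Name_x, Name_y), so 'a,b' and 'b,a' get the same key
pair_key = df[['Name_x', 'Name_y']].apply(frozenset, axis=1)
deduped = df.loc[~pair_key.duplicated()]
# keeps the rows with Name_x = ['a', 'c'], as requested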

Apply function in a pandas dataframe

I've figured out how to apply a function to an entire column or subsection of a pandas dataframe, in lieu of writing a loop that modifies each cell one by one.
Is it possible to write a function that takes cells within the dataframe as inputs when doing the above?
E.g. a function where the current cell gets the product of the previous cell's value and the value of the cell before that one. I'm doing this line by line in a loop now, and it is unsurprisingly very inefficient. I'm quite new to Python.
For the case you mention (multiplying the two previous cells), you can do the following (which loops through each column, but not each cell):
import pandas as pd
a = pd.DataFrame({0: [1, 2, 3, 4, 5], 1: [2, 3, 4, 5, 6], 2: 0, 3: 0})
for i in range(2, a.shape[1]):  # iterate over the columns (len(a) would count rows)
    a[i] = a[i - 1] * a[i - 2]
This makes each column in a the product of the previous two columns.
If you want to perform this operation down the rows instead of the columns, you can transpose the dataframe first (and transpose it again after the loop to restore the original format).
EDIT
What's actually wanted is the product of the previous row's values in two columns and the current row's values in two other columns. This can be accomplished using shift:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [1, 2, 3, 4], "C": [2, 3, 4, 5], "D": [5, 5, 5, 5]})
df['E'] = df['A'].shift(1) * df['B'].shift(1) * df['C'] * df['D']
df['E']
Produces:
0      NaN
1     15.0
2     80.0
3    225.0
This does the trick, and shift can go both forward and backward depending on your need:
df['Column'] = df['Column'].shift(1) * df['Column'].shift(2)
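Note that shift always reads the original column values, not the ones just computed. If the recurrence is truly sequential, so that each value is the product of the two computed values before it (as in the first answer's loop), shift cannot express it; a sketch of a faster fallback is to run the loop over a plain numpy array instead of touching the DataFrame cell by cell:
import pandas as pd

s = pd.Series([1, 2, 0, 0, 0, 0], dtype=float)
x = s.to_numpy(copy=True)  # loop over the raw array; far cheaper than per-cell df access
for i in range(2, len(x)):
    x[i] = x[i - 1] * x[i - 2]
print(pd.Series(x).tolist())  # [1.0, 2.0, 2.0, 4.0, 8.0, 32.0]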
