Combine duplicate rows in Pandas - python
I have a dataframe where some rows contain almost duplicate values. I'd like to combine these rows as much as possible to reduce the row count. Let's say I have the following dataframe:
One  Two  Three
A    B    C
B    B    B
C    A    B
In this example I'd like the output to be:
One  Two  Three
ABC  AB   CB
The real dataframe has thousands of rows and eight columns. Here is a CSV sample of the dataframe:
Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8
A,A,A,A,A,A,A,A
A,A,A,A,A,A,A,B
A,A,A,A,A,A,A,C
A,A,A,A,A,A,B,A
A,A,A,A,A,A,B,B
A,A,A,A,A,A,B,C
A,A,A,A,A,A,C,A
A,A,A,A,A,A,C,B
A,A,A,A,A,A,C,C
C,C,C,C,C,C,A,A
C,C,C,C,C,C,A,B
C,C,C,C,C,C,A,C
C,C,C,C,C,C,B,A
C,C,C,C,C,C,B,B
C,C,C,C,C,C,B,C
C,C,C,C,C,C,C,A
C,C,C,C,C,C,C,B
C,C,C,C,C,C,C,C
To show more easily what the desired outcome would look like:
Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8
AC,AC,AC,AC,AC,AC,ABC,ABC
I've tried some code, but I end up with really long code snippets that I doubt are the best and most natural solution. Any suggestions?
If your data are all characters, you can use this solution and collapse everything to a single row:

import pandas as pd

data = pd.read_csv("path/to/data")
# sum() concatenates the strings in each column; the result is a Series,
# so the element-wise function is map() rather than applymap()
collapsed = data.astype(str).sum().map(lambda x: ''.join(set(x)))
Check this answer on how to get unique characters in a string.
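To illustrate, here is a self-contained sketch of that approach on a few rows of the eight-column sample above; the sorted() call is an addition to make the character order deterministic, since set iteration order is not:

import pandas as pd
from io import StringIO

csv = """Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8
A,A,A,A,A,A,A,A
A,A,A,A,A,A,B,B
C,C,C,C,C,C,C,C"""

data = pd.read_csv(StringIO(csv))
# One string per column, holding that column's unique characters
collapsed = data.astype(str).sum().map(lambda x: ''.join(sorted(set(x))))
print(collapsed)
# Column_1    AC
# ...
# Column_7    ABC
# Column_8    ABC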
You can use something like this:

df = df.groupby('Two')[['One', 'Three']].agg(''.join).reset_index()

Note the double brackets: selecting several columns from a groupby requires a list of column names.
If you can provide a small bit of code that creates the first df it'd be easier to try out solutions.
Also this other post may help: pandas - Merge nearly duplicate rows based on column value
EDIT:
Does this get you the output you're looking for?
joined_df = df.apply(''.join, axis=0)
It's a variation of this: Concatenate all columns in a pandas dataframe
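For a quick check, here is a self-contained sketch of that EDIT on the small example from the question; the set() step that deduplicates the characters afterwards is an extra the one-liner doesn't include, and sorted() only makes the output deterministic:

import pandas as pd

df = pd.DataFrame({'One':   ['A', 'B', 'C'],
                   'Two':   ['B', 'B', 'A'],
                   'Three': ['C', 'B', 'B']})

# Concatenate each column top to bottom, then drop repeated characters
joined_df = df.apply(''.join, axis=0)
result = joined_df.map(lambda x: ''.join(sorted(set(x))))
print(result)
# One      ABC
# Two       AB
# Three     BC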
Related
Iterate over a pandas DataFrame & Check Row Comparisons
I'm trying to iterate over a large DataFrame that has 32 fields and over a million rows. What I'm trying to do is iterate over each row and check whether any of the other rows have duplicate information in 30 of the fields, while the other two fields have different information. I'd then like to store the ID info of the rows that meet these conditions. So far I've been trying to figure out how to check two rows with the code below; it seems to work when comparing single columns, but throws an error when I try more than one column. Could anyone advise on how best to approach this?

for index in range(len(df)):
    for row in range(index, len(df)):
        if df.iloc[index][1:30] == df.iloc[row][1:30]:
            print(df.iloc[index])
As a general rule, you should always try not to iterate over the rows of a DataFrame. It seems that what you need is the pandas duplicated() method. If you have a list of the 30 columns you want to use to determine duplicate rows, the code looks something like this:

df.duplicated(subset=['col1', 'col2', 'col3'])  # etc.

Full example:

# Set up a test df
from io import StringIO
import pandas as pd

sub_df = pd.read_csv(
    StringIO("""ID;col1;col2;col3;col4
One;23;451;42;31
Two;24;451;42;54
Three;25;513;31;31"""),
    sep=";"
)

Find which rows are duplicates in col2 and col3. Note that the default is that the first instance is not marked as a duplicate, but later duplicates are. This behaviour can be changed as described in the documentation I linked to above.

mask = sub_df.duplicated(["col2", "col3"])

This looks like:

0    False
1     True
2    False
dtype: bool

Now, filter using the mask:

sub_df["ID"][mask]

which gives:

1    Two
Name: ID, dtype: object

Of course, you can do the last two steps in one line.
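For the asker's exact condition (rows that agree on the 30 shared fields but differ somewhere in the remaining two), a groupby filter may be closer; this is a hypothetical sketch with made-up column names:

import pandas as pd

df = pd.DataFrame({
    'ID':    [1, 2, 3, 4],
    'key1':  ['a', 'a', 'b', 'b'],   # stands in for the 30 shared fields
    'key2':  ['x', 'x', 'y', 'y'],
    'other': [10, 20, 5, 5],         # stands in for the two differing fields
})

# Keep groups with more than one row that also disagree in 'other'
matches = df.groupby(['key1', 'key2']).filter(
    lambda g: len(g) > 1 and g['other'].nunique() > 1
)
print(matches['ID'].tolist())  # [1, 2] -- rows 3 and 4 agree everywhere, so they drop out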
Using Python with Pandas to output random rows from two columns
I have a spreadsheet with three columns. I want to output some number n of random rows, and this works for outputting any number of random rows from one column:

df = pandas.read_excel(filename, header=0, names=["Speaker","Time","Message"])
random.choices(df["Message"], k=10)

From what I've read, you should be able to select multiple columns by doing this:

df = pandas.read_excel(filename, header=0, names=["Speaker","Time","Message"])
random.choices(df[["Speaker","Message"]], k=10)

But this gives me a KeyError. I'm not sure what I'm missing. Other examples seem to make it pretty straightforward, but I must be missing something, probably extremely simple. Thanks.
random.choices is for list-like one-dimensional data (i.e. list, tuple, etc.). It won't work for dataframes, where you have two-dimensional data (rows x columns). If you'd like random picks from a dataframe, you can use the pandas sample function:

df.sample(10)

or, to get specific columns:

df[['Speaker', 'Message']].sample(10)
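One difference worth noting: random.choices draws with replacement, while sample does not by default, so pass replace=True to match the original behaviour. A minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'Speaker': ['Ann', 'Bob', 'Cy'],
                   'Time':    [1, 2, 3],
                   'Message': ['hi', 'yo', 'hey']})

# k picks with replacement, like random.choices; random_state makes it reproducible
picks = df[['Speaker', 'Message']].sample(n=10, replace=True, random_state=0)
print(picks)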
How to append rows to a Pandas dataframe, and have it turn multiple overlapping cells (with the same index) into a single value, instead of a series?
I am appending different dataframes to make one set. Occasionally, some values have the same index, so it stores the value as a series. Is there a quick way within Pandas to just overwrite the value instead of storing all the values as a series?
Your question isn't very clear. If you want to resolve the duplicated-index problem, the pd.DataFrame.reset_index() method will probably be enough. But if you have duplicate rows when you concat the DataFrames, just use the pd.DataFrame.drop_duplicates() method. Otherwise, share a bit of your code or be clearer.
I'm not sure the code below is what you're searching for. Say we have two dataframes with one column each, the same index, and different values, and you want to overwrite the values in one dataframe with the other. You can do it with a simple loop and the loc indexer:

import pandas as pd

df_1 = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd']})
df_2 = pd.DataFrame({'col_1': ['q', 'w', 'e', 'r']})

rows = df_1.shape[0]
for idx in range(rows):
    # .loc avoids the chained-assignment pitfall of df_1['col_1'].iloc[idx] = ...
    df_1.loc[idx, 'col_1'] = df_2.loc[idx, 'col_1']

Then check df_1; you should get this:

  col_1
0     q
1     w
2     e
3     r

Either way, let me know whether this is what you want so I can help you further.
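As a side note, the loop can be replaced with a single vectorized call: pandas' DataFrame.update overwrites values wherever the row index and column labels align, which seems to be what the question describes. A minimal sketch:

import pandas as pd

df_1 = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd']})
df_2 = pd.DataFrame({'col_1': ['q', 'w', 'e', 'r']})

# Overwrite df_1's values in place wherever (index, column) pairs match df_2
df_1.update(df_2)
print(df_1)  # col_1 is now q, w, e, r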
Renaming similar columns from different DataFrames using for loop and Regex in Python
Today I've been working with five DataFrames that are almost the same, but for different courses. They are named df2b2015, df4b2015, df6b2015, df8b2015, df2m2015. Each of those DataFrames has a column named prom_lect2b_rbd for df2b2015, prom_lect4b_rbd for df4b2015, and so on. I want to append those DataFrames, but because every column has a different name, they don't go together. I'm trying to turn every one of those columns into a prom_lect_rbd column, so I can then append them without problems. Is there a way I can do that with a for loop and regex? Else, is there a way I can do it by other means? Thanks!

PS: I know some things, like that I can turn the column names into what I want using:

re.sub('\d(b|m)', '', a)

where a is the column name. But I can't find a way to mix that with loops and column renaming.

Edit: The DataFrames look like this:

df2b2015:

rbd  prom_lect2b_rbd
1    5
2    6

df4b2015:

rbd  prom_lect4b_rbd
1    8
2    9

etc.
Managed to do it. Probably not the most Pythonic way, but it does what I wanted:

import re

dfs = [df2b2015, df4b2015, df6b2015, df8b2015, df2m2015]
cols_lect = ['prom_lect2b_rbd', 'prom_lect4b_rbd', 'prom_lect6b_rbd',
             'prom_lect8b_rbd', 'prom_lect2m_rbd']

for j, k in zip(dfs, cols_lect):
    j.rename(columns={k: re.sub(r'\d(b|m)', '', k)}, inplace=True)
Something like this, with .filter(regex=)? It does assume there is only one matching column per dataframe, but your example permits that.

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.rand(10, 3), columns=['prom_lect2b_rbd', 'foo', 'bar'])
df2 = pd.DataFrame(np.random.rand(10, 3), columns=['prom_lect4b_rbd', 'foo', 'bar'])

for df in [df1, df2]:
    colname = df.filter(regex='prom_lect').columns.tolist()
    # inplace=True matters here: a plain rename() returns a copy and leaves df unchanged
    df.rename(columns={colname[0]: 'prom_lect_rbd'}, inplace=True)

print(df1)
print(df2)
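Once the columns share a name, the append itself is one line; a sketch with made-up frames shaped like the question's samples, after renaming:

import pandas as pd

# Stand-ins for two of the renamed DataFrames
df2b2015 = pd.DataFrame({'rbd': [1, 2], 'prom_lect_rbd': [5, 6]})
df4b2015 = pd.DataFrame({'rbd': [1, 2], 'prom_lect_rbd': [8, 9]})

# Stack the renamed frames on top of each other; ignore_index renumbers the rows
combined = pd.concat([df2b2015, df4b2015], ignore_index=True)
print(combined)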
Python pandas: fill a dataframe with data from another
I have an empty pandas dataframe, as displayed in the first picture [image: first dataframe]: many, many Pfam IDs as columns and many different gene IDs as the index. Then I have a second dataframe like this [image: second dataframe], listing which Pfam IDs each gene has. Now what I would like to do is get the data from the second into the first. Doing this, I simply want to write a 0 in each Pfam column that has no entry for a particular gene ID, and a 1 wherever a gene has that Pfam. Any help would be highly appreciated.
Assume the first dataframe is named d1 and the second is d2:

d1.fillna(d2.groupby([d2.index, 'Pfam']).size().clip(upper=1).unstack()).fillna(0)

size() counts each (gene, Pfam) pair, clip(upper=1) turns those counts into 1s, unstack() pivots the Pfam values into columns, and the final fillna(0) zeroes every cell with no match.
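An alternative sketch with pd.crosstab, using made-up gene and Pfam names since the pictures aren't available; it builds the 0/1 membership table directly and then aligns it to d1's shape with reindex:

import pandas as pd

# Hypothetical stand-in for the second dataframe: one row per (gene, Pfam) pair
d2 = pd.DataFrame({'gene': ['g1', 'g1', 'g2'],
                   'Pfam': ['PF001', 'PF002', 'PF001']})

genes = ['g1', 'g2', 'g3']            # d1's index
pfams = ['PF001', 'PF002', 'PF003']   # d1's columns

table = (pd.crosstab(d2['gene'], d2['Pfam'])   # counts per (gene, Pfam) pair
           .clip(upper=1)                      # counts -> 0/1 membership
           .reindex(index=genes, columns=pfams, fill_value=0))
print(table)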