I have a df like the following:
import pandas as pd
df = pd.DataFrame(
{'number_1': ['1', '2', None, None, '5', '6', '7', '8'],
'fruit_1': ['apple', 'banana', None, None, 'watermelon', 'peach', 'orange', 'lemon'],
'name_1': ['tom', 'jerry', None, None, 'paul', 'edward', 'reggie', 'nicholas'],
'number_2': [None, None, '3', None, None, None, None, None],
'fruit_2': [None, None, 'blueberry', None, None, None, None, None],
'name_2': [None, None, 'anthony', None, None, None, None, None],
'number_3': [None, None, '3', '4', None, None, None, None],
'fruit_3': [None, None, 'blueberry', 'strawberry', None, None, None, None],
'name_3': [None, None, 'anthony', 'terry', None, None, None, None],
}
)
Here is what I'd like to do:
1. Find the columns which share the same item prefix (name_1, name_2, name_3, for example).
2. Combine those columns to get rid of the None values.
The desired result is
number fruit name
0 1 apple tom
1 2 banana jerry
2 3 blueberry anthony
3 4 strawberry terry
4 5 watermelon paul
5 6 peach edward
6 7 orange reggie
7 8 lemon nicholas
Here is how I do it.
# Get the first column
merge_df = pd.DataFrame(df.iloc[:, 0])
merge_df.columns = [merge_df.columns[0].split('_')[0]]
column_list = df.columns.to_list()
item_list = [column_list[0].split('_')[0]]
for i in range(len(column_list)):
    for j in range(i + 1, len(column_list)):
        first_item = column_list[i].split('_')[0]
        second_item = column_list[j].split('_')[0]
        # change the series name to the bare prefix
        df_series = df.iloc[:, j]
        df_series.name = second_item
        if first_item != second_item and second_item not in item_list:
            merge_df = pd.concat([merge_df, df_series], axis=1)
            item_list.append(second_item)
        if first_item == second_item:
            # combine the existing column and the series, keeping non-None values
            if second_item in merge_df.columns:
                merge_df = merge_df.assign(
                    **{f'{second_item}': merge_df[second_item].combine(
                        df_series, lambda x, y: x if x is not None else y)})
print(merge_df)
The problem is that this is very slow if df has many columns.
Does anyone have advice on how to optimize it?
Edit:
The accepted answer gives a perfect way to do this with a regex. I also had a more complicated issue which is similar, so I am putting it here instead of creating a new question.
Here the df is
import pandas as pd
df = pd.DataFrame(
{'number_C1_E1': ['1', '2', None, None, '5', '6', '7', '8'],
'fruit_C11_E1': ['apple', 'banana', None, None, 'watermelon', 'peach', 'orange', 'lemon'],
'name_C111_E1': ['tom', 'jerry', None, None, 'paul', 'edward', 'reggie', 'nicholas'],
'number_C2_E2': [None, None, '3', None, None, None, None, None],
'fruit_C22_E2': [None, None, 'blueberry', None, None, None, None, None],
'name_C222_E2': [None, None, 'anthony', None, None, None, None, None],
'number_C3_E1': [None, None, '3', '4', None, None, None, None],
'fruit_C33_E1': [None, None, 'blueberry', 'strawberry', None, None, None, None],
'name_C333_E1': [None, None, 'anthony', 'terry', None, None, None, None],
}
)
Here the rule is: if a column name becomes equal to another after removing a _C segment (_C followed by one, two or three digits, i.e. _C{0~9}, _C{0~9}{0~9} or _C{0~9}{0~9}{0~9}), those two columns can be combined. Take number_C1_E1, number_C2_E2 and number_C3_E1 as an example: number_C1_E1 and number_C3_E1 can be combined because both become number_E1 after removing the _C{0~9} part (a small sketch of this renaming follows the table below). In this way, the desired result is
number_E1 fruit_E1 name_E1 number_E2 fruit_E2 name_E2
0 1 apple tom None None None
1 2 banana jerry None None None
2 3 blueberry anthony 3 blueberry anthony
3 4 strawberry terry None None None
4 5 watermelon paul None None None
5 6 peach edward None None None
6 7 orange reggie None None None
7 8 lemon nicholas None None None
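To make the renaming rule concrete, here is a minimal sketch with re; the exact pattern is my assumption based on the rule stated above:
import re

# remove an internal '_C' + 1-3 digits segment (assumed pattern)
print(re.sub(r'_C\d{1,3}(?=_)', '', 'number_C1_E1'))  # number_E1
print(re.sub(r'_C\d{1,3}(?=_)', '', 'name_C333_E1'))  # name_E1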
Extract the first word of the column names with a regex and use groupby.first on the columns:
out = df.groupby(df.columns.str.extract('([^_]+)', expand=False),
axis=1, sort=False).first()
Output:
number fruit name
0 1 apple tom
1 2 banana jerry
2 3 blueberry anthony
3 4 strawberry terry
4 5 watermelon paul
5 6 peach edward
6 7 orange reggie
7 8 lemon nicholas
Second example: use the same logic with str.replace to remove the internal part
# remove internal _xxx_
out = df.groupby(df.columns.str.replace(r'_[^_]+(?=_)', '', regex=True),
axis=1, sort=False).first()
# remove second to last xxx
out = df.groupby(df.columns.str.replace(r'(_[^_]+)(?=_[^_]+$)', '', regex=True),
axis=1, sort=False).first()
Output:
number_E1 fruit_E1 name_E1 number_E2 fruit_E2 name_E2
0 1 apple tom None None None
1 2 banana jerry None None None
2 3 blueberry anthony 3 blueberry anthony
3 4 strawberry terry None None None
4 5 watermelon paul None None None
5 6 peach edward None None None
6 7 orange reggie None None None
7 8 lemon nicholas None None None
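Note that these snippets rely on groupby(..., axis=1), which is deprecated since pandas 2.1. A minimal sketch of an equivalent for newer versions, shown with the second example's str.replace (the same transpose trick applies to the str.extract variant):
# group the transposed frame by the cleaned labels, then transpose back
key = df.columns.str.replace(r'_[^_]+(?=_)', '', regex=True)
out = df.T.groupby(key, sort=False).first().T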
Related
I have a dataframe with a specific string that I want to pull out and partially delete. The string repeats throughout the file with different endings. I want to find each occurrence of the string, delete part of it, and write the part I want to keep into a currently empty column, repeating this until there are no occurrences of the string left.
As long as you have a way of identifying the values you want to turn into the group data, and a way of manipulating those values into what you want, you can do something like this.
import pandas as pd
data = [
[None, 'Group: X', None, None],
[None, 1, 'A1', 20],
[None, 1, 'A1', None],
[None, 2, 'B1', 40],
[None, 2, 'B1', None],
[None, 'Group: Y', None, None],
[None, 1, 'A1', 30],
[None, 1, 'A1', None],
[None, 2, 'B1', 60],
[None, 2, 'B1', None],
]
columns = ['Group', 'Sample', 'Well', 'DiluationFactor']
def identifying_function(value):
    return isinstance(value, str) and 'Group: ' in value

def manipulating_function(value):
    return value.replace('Group: ', '')
df = pd.DataFrame(data=data, columns=columns)
print(df)
# identify which rows contain the group data
mask = df['Sample'].apply(identifying_function)
# manipulate the data from those rows and write them to the Group column
df.loc[mask, 'Group'] = df.loc[mask, 'Sample'].apply(manipulating_function)
# forward fill the Group column (plain assignment avoids chained-assignment warnings)
df['Group'] = df['Group'].ffill()
# eliminate the no longer needed rows
df = df.loc[~mask]
print(df)
DataFrame Before:
Group Sample Well DiluationFactor
0 None Group: X None NaN
1 None 1 A1 20.0
2 None 1 A1 NaN
3 None 2 B1 40.0
4 None 2 B1 NaN
5 None Group: Y None NaN
6 None 1 A1 30.0
7 None 1 A1 NaN
8 None 2 B1 60.0
9 None 2 B1 NaN
DataFrame After:
Group Sample Well DiluationFactor
1 X 1 A1 20.0
2 X 1 A1 NaN
3 X 2 B1 40.0
4 X 2 B1 NaN
6 Y 1 A1 30.0
7 Y 1 A1 NaN
8 Y 2 B1 60.0
9 Y 2 B1 NaN
I have a dataframe as follows:
ID Sr. No Col_1 None Col_2
1 10 Abc None XUZ_09
2 20 Xyz None Abc_227
I want to discard everything after the None column, i.e. everything after Col_1.
One way to do it is as follows:
df_final = df_final.iloc[:,:-3]
But I want to make the -3 dynamic, so that the resultant DataFrame would be
ID Sr. No Col_1
1 10 Abc
2 20 Xyz
Any clue on this please?
Or, selecting by the column name:
df = df.iloc[:, :(df.columns == 'None').argmax()]
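One caveat: (df.columns == 'None').argmax() returns 0 when no column matches, which would drop every column. A minimal sketch of a guard, assuming the sentinel column is literally named 'None':
mask = df.columns == 'None'
# keep all columns when there is no sentinel column
stop = mask.argmax() if mask.any() else len(df.columns)
df = df.iloc[:, :stop]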
Creating the DataFrame
d = {
'ID': [1,2,34],
'Sr. No': [13,23,343],
'Col_1': [1,23,4345],
'None': [None, None, None],
'Col_3': [None, None, None]
}
df = pd.DataFrame(d)
df
ID Sr. No Col_1 None Col_3
0 1 13 1 None None
1 2 23 23 None None
2 34 343 4345 None None
Splitting the dataframe at the "None" column
columns = df.columns.to_list()
split_index = columns.index('None')
df = df[columns[:split_index]]
Result
df
ID Sr. No Col_1
0 1 13 1
1 2 23 23
2 34 343 4345
If the column name is None (the NoneType object) or the string 'None', compare the column names against both and take Index.cumsum of the matches; comparing the cumsum to 0 gives a mask that is True only for the columns before the first match. Finally, pass the mask to DataFrame.loc with : to select all rows and only the masked columns:
d = {
'ID': [1,2,34],
'Sr. No': [13,23,343],
'Col_1': [1,23,4345],
'None': [None, None, None]
}
df = pd.DataFrame(d)
mask = (df.columns.isna() | (df.columns == 'None')).cumsum() == 0
df1 = df.loc[:, mask]
print (df1)
ID Sr. No Col_1
0 1 13 1
1 2 23 23
2 34 343 4345
d = {
'ID': [1,2,34],
'Sr. No': [13,23,343],
'Col_1': [1,23,4345],
None: [None, None, None]
}
df = pd.DataFrame(d)
mask = (df.columns.isna() | (df.columns == 'None')).cumsum() == 0
df1 = df.loc[:, mask]
print (df1)
ID Sr. No Col_1
0 1 13 1
1 2 23 23
2 34 343 4345
d = {
'ID': [1,2,34],
'Sr. No': [13,23,343],
'Col_1': [1,23,4345],
'col_z': [None, None, None]
}
df = pd.DataFrame(d)
mask = (df.columns.isna() | (df.columns == 'None')).cumsum() == 0
df1 = df.loc[:, mask]
print (df1)
ID Sr. No Col_1 col_z
0 1 13 1 None
1 2 23 23 None
2 34 343 4345 None
EDIT:
d = {
'ID': [1,2,34],
'Sr. No': [13,23,343],
'Col_1': [1,23,4345]
}
df = pd.DataFrame(d, index=[1,None,5])
mask = (df.index.isna() | (df.index== 'None')).cumsum() == 0
df1 = df.loc[mask]
print (df1)
ID Sr. No Col_1
1 1 13 1
Hi, I want to compare the same keys of dicts stored in a pandas dataframe.
     car  values (dict)
0  audi1  {'colour': 'black', 'PS': '3', 'owner': 'peter'}
1  audi2  {'owner': 'fred', 'colour': 'black', 'PS': '230', 'number': '3'}
2   ford  {'windows': '3', 'PS': '3', 'owner': 'peter'}
3    bmw  {'colour': 'black', 'windows': 'no', 'owner': 'peter', 'number': '3'}
Wanted solution:
       colour  owner  PS  number  windows
black       3      0   0       0        0
peter       0      3   0       0        0
3           0      0   2       2        1
fred        0      1   0       0        0
no          0      0   0       0        1
I hope my problem is understandable
d = {'audi1': {'colour': 'black', 'PS': '3', 'owner': 'peter'}, 'audi2': {'owner': 'fred', 'colour': 'black', 'PS': '230', 'number': '3'}, 'ford': {'windows': '3', 'PS': '3', 'owner': 'peter'}, 'bmw': {'colour': 'black', 'windows': 'no', 'owner': 'peter', 'number': '3'}}
df = pd.DataFrame(d.items(), columns=['car', 'values'])
You can create a new dataframe from the dictionaries in the values column, then stack the frame to reshape it; finally, use crosstab to create a frequency table:
s = pd.DataFrame(df['values'].tolist()).stack()
table = pd.crosstab(s, s.index.get_level_values(1))
An alternate but similar approach uses groupby + value_counts followed by unstack to reshape:
s = pd.DataFrame(df['values'].tolist()).stack()
table = s.groupby(level=1).value_counts().unstack(level=0, fill_value=0)
>>> table
PS colour number owner windows
230 1 0 0 0 0
3 2 0 2 0 1
black 0 3 0 0 0
fred 0 0 0 1 0
no 0 0 0 0 1
peter 0 0 0 3 0
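If you want the columns in the order shown in the wanted solution rather than the alphabetical order crosstab produces, a small reindex sketch (column names taken from the question):
table = table.reindex(columns=['colour', 'owner', 'PS', 'number', 'windows'])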
I have a sample dataframe:
sample_df = pd.DataFrame({'id': [1, 2], 'fruits' :[
[{'name': u'mango', 'cost': 100, 'color': u'yellow', 'size': 12}],
[{'name': u'mango', 'cost': 150, 'color': u'yellow', 'size': 21},
{'name': u'banana', 'cost': 200, 'color': u'green', 'size': 10} ]
]})
I would like to flatten the fruits column to get new columns: name, cost, color and size. One id can have more than one fruit entry; for example, id 2 has information for 2 fruits, mango and banana.
print(sample_df)
fruits id
0 [{'name': 'mango', 'cost': 100, 'color': 'yell... 1
1 [{'name': 'mango', 'cost': 150, 'color': 'yell... 2
In the output I would like to have 3 records: 1 record with fruit information for id 1, and 2 records with fruit information for id 2.
Is there a way to parse this structure using pandas?
First unnest the fruits column, then build a DataFrame from the dict values and concat:
s = unnesting(sample_df, ['fruits']).reset_index(drop=True)
df = pd.concat([s.drop('fruits', axis=1), pd.DataFrame(s.fruits.tolist())], axis=1)
df
Out[149]:
id color cost name size
0 1 yellow 100 mango 12
1 2 yellow 150 mango 21
2 2 green 200 banana 10
import numpy as np

def unnesting(df, explode):
    # repeat each index entry once per list element
    idx = df.index.repeat(df[explode[0]].str.len())
    # flatten every exploded column into one long column
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, axis=1), how='left')
Method 2
sample_df.set_index('id').fruits.apply(pd.Series).stack().apply(pd.Series).reset_index(level=0)
Out[159]:
id color cost name size
0 1 yellow 100 mango 12
0 2 yellow 150 mango 21
1 2 green 200 banana 10
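On pandas 1.1+ there is also a shorter route with DataFrame.explode plus pd.json_normalize; a minimal sketch, assuming every list entry is a flat dict:
# one row per fruit dict, then expand the dicts into columns
s = sample_df.explode('fruits', ignore_index=True)
out = pd.concat([s.drop(columns='fruits'), pd.json_normalize(s['fruits'].tolist())], axis=1)
print(out)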
Suppose a dataframe which has an index like this one:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1,2,3,4],[4,5,6,1],['A','B','C','A'],['a','b','a','b']]).T,
                  columns=['d1','d2','type','subtype'])
df = df.set_index(['type', 'subtype', 'd1']).unstack('d1')
df.index
MultiIndex(levels=[['A', 'B', 'C'], ['a', 'b']],
labels=[[0, 0, 1, 2], [0, 1, 1, 0]],
names=['type', 'subtype'])
I use the values of the dataframe for some analysis (e.g. PCA). Afterwards, I would like to plot the results and name the points according to the index. I know the information for the row names is provided by the levels and labels of the MultiIndex. How can I produce a list which gives me the name of each sample (e.g. ['Aa', 'Ab', 'Bb', 'Ca'])?
Do I really have to do this?
l1 = df.index.get_level_values(0).values.tolist()
l2 = df.index.get_level_values(1).values.tolist()
[i1 + i2 for i1, i2 in zip(l1,l2)]
Which produces:
['Aa', 'Ab', 'Bb', 'Ca']
Or is there a more elegant solution ?
You can use map:
df.index = df.index.map(''.join)
print (df)
d2
d1 1 2 3 4
Aa 4 None None None
Ab None None None 1
Bb None 5 None None
Ca None None 6 None
Or list comprehension:
df.index = [''.join(idx) for idx in df.index]
print (df)
d2
d1 1 2 3 4
Aa 4 None None None
Ab None None None 1
Bb None 5 None None
Ca None None 6 None
Solution with str.join:
df.index = df.index.to_series().str.join('')
print (df)
d2
d1 1 2 3 4
Aa 4 None None None
Ab None None None 1
Bb None 5 None None
Ca None None 6 None
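One caveat: ''.join (and str.join) raises a TypeError if any index level holds non-strings; a minimal sketch that casts each level first:
# cast every level to str before joining
df.index = df.index.map(lambda levels: ''.join(map(str, levels)))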