I have a dataframe df (see image below) which I need to merge with N dataframes.
In this post, for the sake of clarity, N=3.
The goal is to check whether every value of the column Id exists in the three other dataframes and has the same Value there as well. If so, the row has to be highlighted in green. That's it!
Code:
import pandas as pd
import numpy as np
### --- Dataframes
df = pd.DataFrame({'Id': ['AA', 'BB', 'CC', 'DD', 'EE'],
                   'Value': ['three', 'two', 'five', 'four', 'one']})
df1 = pd.DataFrame({'Id1': [np.nan, 'CC', 'BB', 'DD', np.nan],
                    'Value1': ['one', 'four', 'two', np.nan, np.nan]})
df2 = pd.DataFrame({'Id2': ['AA', 'BB', 'CC', 'DD', 'JJ'],
                    'Value2': [np.nan, 'two', 'five', np.nan, 'six']})
df3 = pd.DataFrame({'Id3': ['FF', 'HH', 'CC', 'GG', 'BB'],
                    'Value3': ['seven', 'five', 'one', 'three', 'two']})
### --- Joining df to df1, df2 and df3
df_df1 = df.merge(df1, left_on='Id', right_on='Id1', how='left')
df_df1_df2 = df_df1.merge(df2, left_on='Id', right_on='Id2', how='left')
df_df1_df2_df3 = df_df1_df2.merge(df3, left_on='Id', right_on='Id3', how='left')
### --- Creating a function to highlight the aligned rows
def highlight_aligned_row(x):
    m1 = (x['Id'] == x['Id1']) & (x['Id'] == x['Id2']) & (x['Id'] == x['Id3'])
    m2 = (x['Value'] == x['Value1']) & (x['Value'] == x['Value2']) & (x['Value'] == x['Value3'])
    df = pd.DataFrame('background-color: ', index=x.index, columns=x.columns)
    df['Id'] = np.where(m1 & m2, 'background-color: green', df['Id'])
    return df

df_df1_df2_df3.style.apply(highlight_aligned_row, axis=None)
My questions are:
How do we highlight the entire row when a condition is fulfilled?
Is there a more efficient way to merge 10 dataframes?
How can we check that every value/row of the original dataframe is aligned with the values of the final dataframe (after the merge)?
Thank you in advance for your suggestions and your help!
I would do it like this. Hope the comments in between make clear what I am doing. Hopefully, they also answer your questions, but let me know if anything remains unclear.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Id': ['AA', 'BB', 'CC', 'DD', 'EE'],
                   'Value': ['three', 'two', 'five', 'four', 'one']})
df1 = pd.DataFrame({'Id1': [np.nan, 'BB', 'CC', 'DD', np.nan],
                    'Value1': ['one', 'two', 'four', np.nan, np.nan]})
df2 = pd.DataFrame({'Id2': ['AA', 'BB', 'CC', 'DD', 'JJ'],
                    'Value2': [np.nan, 'two', 'five', np.nan, 'six']})
df3 = pd.DataFrame({'Id3': ['FF', 'BB', 'CC', 'GG', 'HH'],
                    'Value3': ['seven', 'two', 'one', 'v4', 'v5']})
# *IF* your dfs (like above) are all the same shape with the same index, then the
# easiest is to collect your dfs in a list and use pd.concat along axis 1 to merge:
# dfs = [df, df1, df2, df3]
# df_all = pd.concat(dfs, axis=1)
# *BUT* from your comments, this does not appear to be the case. Then instead,
# again collect the dfs in a list, and merge them with df in a loop:
dfs = [df, df1, df2, df3]
for idx, list_df in enumerate(dfs):
    if idx == 0:
        df_all = list_df
    else:
        df_all = df_all.merge(list_df, left_on='Id',
                              right_on=[col for col in list_df.columns
                                        if col.startswith('Id')][0],
                              how='left')
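# As an aside on your second question (merging 10 dataframes more efficiently):
# the loop above already scales linearly with the number of frames, but it can be
# written more compactly with functools.reduce. A sketch, under the same assumption
# that each frame's join key is its first column whose name starts with 'Id':
# from functools import reduce
#
# def merge_on_id(left, right):
#     right_key = [col for col in right.columns if col.startswith('Id')][0]
#     return left.merge(right, left_on='Id', right_on=right_key, how='left')
#
# df_all = reduce(merge_on_id, dfs[1:], dfs[0])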
def highlight_aligned_row(x):
    n1 = x.loc[:, [col for col in x.columns
                   if col.startswith('Id')]].eq(x.loc[:, 'Id'], axis=0).all(axis=1)
    m1 = x.loc[:, [col for col in x.columns
                   if col.startswith('Value')]].eq(x.loc[:, 'Value'], axis=0).all(axis=1)
    eval_bool = n1 & m1
    # Just for x['Id']: [False, True, False, False, False]
    # repeat 8 times (== len(x.columns)) will lead to .shape == (40,).
    # reshape to 5 rows (== len(x)) and 8 cols. Second row will be [8x True] now,
    # other rows all 8x False
    rows, cols = len(eval_bool), len(x.columns)  # 5, 8
    eval_bool_repeated = eval_bool.to_numpy().repeat(cols).reshape(rows, cols)
    # setup your styling df
    df = pd.DataFrame('background-color: ', index=x.index, columns=x.columns)
    # now apply eval_bool_repeated to the entire df, not just df['Id']
    df = np.where(eval_bool_repeated, 'background-color: green', df)
    return df
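To render the highlighting, the function is then passed to the Styler exactly as in your question, with axis=None (df_all being the merged frame built above):
df_all.style.apply(highlight_aligned_row, axis=None)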
Result:
Problem:
Given a large data set (3 million rows x 6 columns), what's the fastest way to join the values of several columns in a single pandas dataframe, based on the rows where a mask is true?
My current solution:
import pandas as pd
import numpy as np
# Note: real data will be 3 million rows x 6 columns.
df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})
#print(df)
msg_text_filter = ['msg0', 'msg2']
columns = df.columns.drop(df.columns[:3])
column_join = ["d0"]
mask = df['msg'].isin(msg_text_filter)
df.replace(np.nan,'',inplace=True)
# THIS IS SLOW, HOW TO SPEED UP?
df['d0'] = np.where(
    mask,
    df[['d0', 'd1', 'd2']].agg(''.join, axis=1),
    df['d0']
)
df.loc[mask, columns] = np.nan
print(df)
IMHO you can save a lot of time by using
df[['d0', 'd1', 'd2']].sum(axis=1)
instead of
df[['d0', 'd1', 'd2']].agg(''.join, axis=1)
And I think instead of using np.where you could just do:
df.loc[mask, 'd0'] = df.loc[mask, ['d0', 'd1', 'd2']].sum(axis=1)
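Putting both suggestions together on the toy frame from the question gives a sketch like the following (column names and the mask come from the question; whether it is fast enough on 3 million rows has to be measured on the real data):
import pandas as pd
import numpy as np

df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})

mask = df['msg'].isin(['msg0', 'msg2'])
df = df.replace(np.nan, '')

# Summing string columns row-wise concatenates them, which avoids the
# slow Python-level ''.join in agg; only the masked rows are touched.
df.loc[mask, 'd0'] = df.loc[mask, ['d0', 'd1', 'd2']].sum(axis=1)
df.loc[mask, ['d1', 'd2']] = np.nan
print(df)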
I am trying to combine hundreds of CSVs together in python using the following code:
import os
import pandas as pd
import glob
path = '/Users/parkerbolstad/Downloads/'
all_files = glob.glob(os.path.join(path, "*.csv"))
df_from_each_file = (pd.read_csv(f, sep=',') for f in all_files)
df_merged = pd.concat(df_from_each_file, axis=1, ignore_index=False)
df_merged.to_csv( "merged.csv")
But this combines the files in their entirety. The first column of each file contains dates; I want to keep the dates from the first file and skip them for the rest.
As of now, I get a new column with dates in it every four columns.
Simply run a for-loop to drop this column from every dataframe except the first one ([1:]). Note that df_from_each_file has to be a list (not the generator expression from your code) for the slicing to work:
for df in df_from_each_file[1:]:
    df.drop('date', axis=1, inplace=True)
import pandas as pd
df1 = pd.DataFrame({
    'date': ['2021.08.01', '2021.08.02', '2021.08.03'],
    'value': ['A', 'B', 'C']
})
df2 = pd.DataFrame({
    'date': ['2021.08.01', '2021.08.02', '2021.08.03'],
    'value': ['X', 'Y', 'Z']
})
df3 = pd.DataFrame({
    'date': ['2021.08.01', '2021.08.02', '2021.08.03'],
    'value': ['1', '2', '3']
})
df_from_each_file = [df1, df2, df3]
for df in df_from_each_file[1:]:
    df.drop('date', axis=1, inplace=True)
result = pd.concat(df_from_each_file, axis=1)
print(result)
Result:
         date value value value
0  2021.08.01     A     X     1
1  2021.08.02     B     Y     2
2  2021.08.03     C     Z     3
Or, in every dataframe, convert the date column into the index and reset the index after concatenating.
This correctly aligns the rows even if the dates appear in a different order or some dates are missing.
import pandas as pd
df1 = pd.DataFrame({
    'date': ['2021.08.01', '2021.08.02', '2021.08.03'],
    'value': ['A', 'B', 'C']
})
df2 = pd.DataFrame({
    'date': ['2021.08.01', '2021.08.02', '2021.08.03'],
    'value': ['X', 'Y', 'Z']
})
df3 = pd.DataFrame({
    'date': ['2021.08.01', '2021.08.02', '2021.08.03'],
    'value': ['1', '2', '3']
})
df_from_each_file = [df1, df2, df3]
for df in df_from_each_file:
    df.index = df['date']
    df.drop('date', axis=1, inplace=True)
result = pd.concat(df_from_each_file, axis=1)
#result = result.sort_index()
result = result.reset_index()
print(result)
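Applied to the original glob-based loop, the same index idea can be done at read time by passing index_col=0 to pd.read_csv. A sketch, not tested against the real files:
import glob
import os
import pandas as pd

path = '/Users/parkerbolstad/Downloads/'  # path taken from the question
all_files = glob.glob(os.path.join(path, "*.csv"))

# Use the first (date) column of every file as the index, so concat
# aligns the rows on the dates instead of on their position.
df_from_each_file = (pd.read_csv(f, sep=',', index_col=0) for f in all_files)
df_merged = pd.concat(df_from_each_file, axis=1)
df_merged.to_csv("merged.csv")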
I have a CSV file with only a single line, but with many recurring column headers (NOT duplicates). My final goal is to analyze the value of a given column depending on the value of the previous column with the same name (which is not the column adjacent to it).
My data might look like this:
| ***start block*** | stimulus | words.RT | words.ACC | ***end block*** | ***start block*** | stimulus | words.RT | words.ACC | ***end block*** |
|                   | pic1.png | 2300     | 1         |                 |                   | pic2.png | 2401     | 0         |                 |
and so forth.
Now, I would like to be able to analyze the values of e.g. words.RT depending on the value of words.ACC in the previous block.
I'm not sure what the best approach to this is. I tried loading the CSV into a pandas-dataframe:
import pandas as pd
file = "01.csv"
df = pd.read_csv(file, delimiter=";")
df.columns = df.columns.str.strip("\t")
df.columns = df.columns.str.strip(".34")
df = df.iloc[[0]]
which basically gives me a datatable looking like the one I showed before. Is it possible to split the row into multiple rows according to the blocks? To me, it looks like I would need a three-dimensional array in order to encode the blocks? Is that even possible with pandas?
You can create
df1 = df.iloc[:, 0:4]
df2 = df.iloc[:, 4:8]
and append them
df = df1.append(df2)
import pandas as pd
data = {
    'A1': [1,2],
    'B1': [3,4],
    'C1': [5,6],
    'D1': [7,8],
    'A2': [1,2],
    'B2': [3,4],
    'C2': [5,6],
    'D2': [7,8],
}
df = pd.DataFrame(data)
print(df)
df1 = df.iloc[:, 0:4]
df1.columns = ['A', 'B', 'C', 'D']
df2 = df.iloc[:, 4:8]
df2.columns = ['A', 'B', 'C', 'D']
df = df1.append(df2)
df = df.reset_index(drop=True)
print(df)
If you have more blocks, then you can use a for-loop and
df.iloc[:, i:i+4]
import pandas as pd
data = {
    'A1': [1,2],
    'B1': [3,4],
    'C1': [5,6],
    'D1': [7,8],
    'A2': [1,2],
    'B2': [3,4],
    'C2': [5,6],
    'D2': [7,8],
    'A3': [1,2],
    'B4': [3,4],
    'C5': [5,6],
    'D6': [7,8],
}
df = pd.DataFrame(data)
print(df)
# get first block
new_df = df.iloc[:, 0:4]
new_df.columns = ['A', 'B', 'C', 'D']
# get other blocks
for i in range(4, len(df.columns), 4):
    temp_df = df.iloc[:, i:i+4]
    temp_df.columns = ['A', 'B', 'C', 'D']
    new_df = new_df.append(temp_df)
new_df = new_df.reset_index(drop=True)
print(new_df)
EDIT:
The same, but with a variable block_size and numbers as column names.
import pandas as pd
data = {
    'A1': [1,2],
    'B1': [3,4],
    'C1': [5,6],
    'D1': [7,8],
    'A2': [1,2],
    'B2': [3,4],
    'C2': [5,6],
    'D2': [7,8],
    'A3': [1,2],
    'B3': [3,4],
    'C3': [5,6],
    'D3': [7,8],
    'A4': [1,2],
    'B4': [3,4],
    'C4': [5,6],
    'D4': [7,8],
}
df = pd.DataFrame(data)
print(df)
block_size = 4
# get first block
new_df = df.iloc[:, 0:block_size]
# set numbers for columns
new_df.columns = list(range(block_size))
# get other blocks
for i in range(block_size, len(df.columns), block_size):
    temp_df = df.iloc[:, i:i+block_size]
    # set the same numbers for columns
    temp_df.columns = list(range(block_size))
    new_df = new_df.append(temp_df)
# after loop reset rows numbers (indexes)
new_df = new_df.reset_index(drop=True)
print(new_df)
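Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On recent versions the same loop can be written with a list comprehension and pd.concat, reusing df and block_size from above:
blocks = [df.iloc[:, i:i + block_size].set_axis(list(range(block_size)), axis=1)
          for i in range(0, len(df.columns), block_size)]
# ignore_index=True replaces the reset_index(drop=True) step
new_df = pd.concat(blocks, ignore_index=True)
print(new_df)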
I have a Series of Labels
pd.Series(['L1', 'L2', 'L3'], ['A', 'B', 'A'])
and a dataframe
pd.DataFrame([[1,2], [3,4]], ['I1', 'I2'], ['A', 'B'])
I'd like to have a dataframe with columns ['L1', 'L2', 'L3'] with the column data from 'A', 'B', 'A' respectively. Like so...
pd.DataFrame([[1,2,1], [3,4,3]], ['I1', 'I2'], ['L1', 'L2', 'L3'])
in a nice pandas way.
Since you mention reindex
#s=pd.Series(['L1', 'L2', 'L3'], ['A', 'B', 'A'])
#df=pd.DataFrame([[1,2], [3,4]], ['I1', 'I2'], ['A', 'B'])
df.reindex(s.index,axis=1).rename(columns=s.to_dict())
Out[598]:
L3 L2 L3
I1 1 2 1
I2 3 4 3
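The header reads L3 L2 L3 because s.to_dict() keeps only the last value for the duplicated 'A' key. To get the labels L1 L2 L3 exactly as asked, a small variation (assigning the labels positionally with set_axis) would be:
res = df.reindex(s.index, axis=1).set_axis(s.values, axis=1)
print(res)
#     L1  L2  L3
# I1   1   2   1
# I2   3   4   3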
This will produce the dataframe you described:
import pandas as pd
import numpy as np
data = [['A','B','A','A','B','B'],
['B','B','B','A','B','B'],
['A','B','A','B','B','B']]
columns = ['L1', 'L2', 'L3', 'L4', 'L5', 'L6']
pd.DataFrame(data, columns = columns)
You can use loc accessor:
s = pd.Series(['L1', 'L2', 'L3'], ['A', 'B', 'A'])
df = pd.DataFrame([[1,2], [3,4]], ['I1', 'I2'], ['A', 'B'])
res = df.loc[:, s.index]
print(res)
A B A
I1 1 2 1
I2 3 4 3
Or iloc accessor with columns.get_loc:
res = df.iloc[:, s.index.map(df.columns.get_loc)]
Both methods allow accessing duplicate labels/locations, in the same vein as NumPy arrays.
I have a csv file separated by tabs:
I need to focus only on the first two columns and find, for example, whether the pair A-B appears again in the document as B-A, and print A-B if B-A appears. The same for the rest of the pairs.
For the example proposed, the output is:
· A-B
· C-D
import sys
import os
import pandas as pd
import numpy as np
import csv

dic = {}
colnames = ['col1', 'col2', 'col3', 'col4', 'col5']
data = pd.read_csv('koko.csv', names=colnames, delimiter='\t')
col1 = data.col1.tolist()
col2 = data.col2.tolist()
dataset = list(zip(col1, col2))
for a, b in dataset:
    if (a, b) and (b, a) in dataset:
        dic[a] = b
print(dic)
output = {'A': 'B', 'B': 'A', 'D': 'C', 'C':'D'}
How can I avoid duplicated (or swapped) results in the dictionary?
Does this work?:
import pandas as pd
import numpy as np
col_1 = ['A', 'B', 'C', 'B', 'D']
col_2 = ['B', 'C', 'D', 'A', 'C']
df = pd.DataFrame(np.column_stack([col_1,col_2]), columns = ['Col1', 'Col2'])
df['combined'] = list(zip(df['Col1'], df['Col2']))
final_set = set(tuple(sorted(t)) for t in df['combined'])
final_set looks like this:
{('C', 'D'), ('A', 'B'), ('B', 'C')}
The output contains more than A-B and C-D because of the second row, which has B-C.
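To keep only the pairs whose reversed form also appears, you could additionally check for the swapped tuple. A sketch on the same df:
pairs = set(zip(df['Col1'], df['Col2']))
# keep a pair only if its reverse also occurs, and store it in sorted form
both_ways = {tuple(sorted(p)) for p in pairs if p[::-1] in pairs}
print(both_ways)
# {('A', 'B'), ('C', 'D')}  (set order may vary)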
The below should work. Example df used:
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'C', 'D', 'B', 'D', 'A'],
                   'Col2': ['B', 'D', 'C', 'A', 'C', 'B']})
This is the function I used:
temp = df[['Col1', 'Col2']].apply(lambda row: sorted(row), axis=1, result_type='expand')
temp.columns = ['Col1', 'Col2']
print(temp.drop_duplicates())
useful links:
checking if a string is in alphabetical order in python
Difference between map, applymap and apply methods in Pandas
Here is one way.
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'C', 'D', 'B', 'D', 'A', 'E'],
                   'Col2': ['B', 'D', 'C', 'A', 'C', 'B', 'F']})
df = df.drop_duplicates()\
       .apply(sorted, axis=1, result_type='expand')\
       .set_axis(['Col1', 'Col2'], axis=1)\
       .loc[lambda x: x.duplicated(keep=False)]\
       .drop_duplicates()
# Col1 Col2
# 0 A B
# 1 C D
Explanation
The steps are:
Remove duplicate rows.
Sort dataframe by row.
Remove unique rows by keeping only duplicates.
Remove duplicate rows again.
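The row-wise apply can also be replaced with a vectorised np.sort, which is typically faster on larger frames. A sketch of the same four steps on the example df:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'C', 'D', 'B', 'D', 'A', 'E'],
                   'Col2': ['B', 'D', 'C', 'A', 'C', 'B', 'F']})

unique_rows = df[['Col1', 'Col2']].drop_duplicates()
# sort each pair so that (B, A) becomes (A, B)
pairs = pd.DataFrame(np.sort(unique_rows, axis=1),
                     columns=['Col1', 'Col2'], index=unique_rows.index)
# keep only pairs that now occur more than once, then report each once
result = pairs[pairs.duplicated(keep=False)].drop_duplicates()
print(result)
#   Col1 Col2
# 0    A    B
# 1    C    D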