I have a list of data frames and want to loop over them, building a new dataframe by appending the columns whose names match a provided list. Below is my code:
#df_ls contains list of dataframes
df_ls = [A, B, C, D]
match_ls = ['card', 'brave', 'wellness']
for j in df_ls:
    for i in match_ls:
        if i in j.columns:
            print(i)
            df1 = j[i]
            df2 = df1
            df_full = df2.append(df1)
I need the result as a new dataframe with a single column containing all the values from the columns named in match_ls.
A:
card  banner
 rex      23
 fex      45
 jex      66

B:
brave  laminate
  max       ste
  vax       pre
  jox       lex
Expected output:
rex
fex
jex
max
vax
jox
Use a list comprehension, filtering the column names with Index.intersection, and finally use concat to join everything together:
df_ls = [A, B, C, D]
match_ls = ['card', 'brave', 'wellness']
dfs = [j[j.columns.intersection(match_ls)].stack().reset_index(drop=True) for j in df_ls]
df_full = pd.concat(dfs, ignore_index=True)
Loop version:
dfs = []
for j in df_ls:
    df = j[j.columns.intersection(match_ls)].stack().reset_index(drop=True)
    print(df)
    dfs.append(df)

df_full = pd.concat(dfs, ignore_index=True)
print(df_full)
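A minimal runnable sketch of the approach above, using two small frames modeled on the question's A and B (the values are taken from the samples shown):

```python
import pandas as pd

# sample frames modeled on the question's A and B
A = pd.DataFrame({'card': ['rex', 'fex', 'jex'], 'banner': [23, 45, 66]})
B = pd.DataFrame({'brave': ['max', 'vax', 'jox'], 'laminate': ['ste', 'pre', 'lex']})

df_ls = [A, B]
match_ls = ['card', 'brave', 'wellness']

# keep only the columns listed in match_ls, flatten each frame, then join
dfs = [j[j.columns.intersection(match_ls)].stack().reset_index(drop=True) for j in df_ls]
df_full = pd.concat(dfs, ignore_index=True)
print(df_full.tolist())  # ['rex', 'fex', 'jex', 'max', 'vax', 'jox']
```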
Related
I have a dictionary of dataframes df_dict. I then have a substring "blue". I want to identify the name of the dataframe in my dictionary of dataframes that has at least one column that has a name containing the substring "blue".
I am thinking of trying something like:
for df in df_dict:
    if df.columns.contains('blue'):
        return df
    else:
        pass
However, I am not sure if a for loop is necessary here. How can I find the name of the dataframe I am looking for in my dictionary of dataframes?
I think a loop is necessary to iterate over the items of the dictionary:
df1 = pd.DataFrame({"aa_blue": [1,2,3],
'col':list('abc')})
df2 = pd.DataFrame({"f": [1,2,3],
'col':list('abc')})
df3 = pd.DataFrame({"g": [1,2,3],
'bluecol':list('abc')})
df_dict = {'df1_name' : df1, 'df2_name' : df2, 'df3_name' : df3}
out = [name for name, df in df_dict.items() if df.columns.str.contains('blue').any()]
print (out)
['df1_name', 'df3_name']
Or:
out = [name for name, df in df_dict.items() if any('blue' in y for y in df.columns)]
print (out)
['df1_name', 'df3_name']
To get a list of the matching DataFrames instead of their names, use either of:
out = [df for name, df in df_dict.items() if df.columns.str.contains('blue').any()]
out = [df for name, df in df_dict.items() if any('blue' in y for y in df.columns)]
print (out)
[ aa_blue col
0 1 a
1 2 b
2 3 c, g bluecol
0 1 a
1 2 b
2 3 c]
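A further variant (a sketch, not from the original answer) uses DataFrame.filter(like=...), which selects the columns whose names contain a substring, so the check reduces to testing whether that selection is empty:

```python
import pandas as pd

df1 = pd.DataFrame({"aa_blue": [1, 2, 3], 'col': list('abc')})
df2 = pd.DataFrame({"f": [1, 2, 3], 'col': list('abc')})
df3 = pd.DataFrame({"g": [1, 2, 3], 'bluecol': list('abc')})
df_dict = {'df1_name': df1, 'df2_name': df2, 'df3_name': df3}

# filter(like=...) keeps columns whose name contains the substring;
# the result is empty when no column matches
out = [name for name, df in df_dict.items() if not df.filter(like='blue').empty]
print(out)  # ['df1_name', 'df3_name']
```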
After processing some data I got df; now I need to get the max 3 values from the data frame together with each value's column name.
data=[[4.12,3,2],[1.0123123,-6.12312,5.123123],[-3.123123,-8.512323,6.12313]]
df = pd.DataFrame(data,columns =['a','b','c'],index=['aa','bb','cc'])
df
output:
a b c
aa 4.120000 3.000000 2.000000
bb 1.012312 -6.123120 5.123123
cc -3.123123 -8.512323 6.123130
Now I appended each value's column name to it:
df1 = df.astype(str).apply(lambda x:x+'='+x.name)
a b c
aa 4.12=a 3.0=b 2.0=c
bb 1.0123123=a -6.12312=b 5.123123=c
cc -3.123123=a -8.512323=b 6.12313=c
I need to get the max values in order; I tried sorting the data frame but could not get the output I need. What I need is:
final_df
max=1 max=2 max=3
aa 4.12=a 3.0=b 2.0=c
bb 5.123123=c 1.0123123=a -6.12312=b
cc 6.12313=c -3.123123=a -8.512323=b
I suggest you proceed as follows:
import pandas as pd
data=[[4.12,3,2],[1.0123123,-6.12312,5.123123],[-3.123123,-8.512323,6.12313]]
df = pd.DataFrame(data,columns =['a','b','c'],index=['aa','bb','cc'])
# first sort values in descending order
df.values[:, ::-1].sort(axis=1)
# then rename row values
df1 = df.astype(str).apply(lambda x: x + '=' + x.name)
# rename columns
df1.columns = [f"max={i}" for i in range(1, len(df.columns)+1)]
Result as desired:
max=1 max=2 max=3
aa 4.12=a 3.0=b 2.0=c
bb 5.123123=a 1.0123123=b -6.12312=c
cc 6.12313=a -3.123123=b -8.512323=c
As the solution proposed by @GuglielmoSanchini does not give the expected result, here is one that does:
# Imports
import pandas as pd
import numpy as np
# Data
data=[[4.12,3,2],[1.0123123,-6.12312,5.123123],[-3.123123,-8.512323,6.12313]]
df = pd.DataFrame(data,columns =['a','b','c'],index=['aa','bb','cc'])
df1 = df.astype(str).apply(lambda x:x+'='+x.name)
data = []
for index, row in df1.iterrows():
    # the indices of the numbers sorted in descending order
    indices_max = np.argsort([float(item[:-2]) for item in row])[::-1]
    # add the row's values in sorted order
    data.append(row.iloc[indices_max].values.tolist())

# create the new dataframe with the sorted values
df = pd.DataFrame(data, columns=[f"max={i}" for i in range(1, len(df1.columns) + 1)])
df.index = df1.index
print(df)
Here is the result:
max=1 max=2 max=3
aa 4.12=a 3.0=b 2.0=c
bb 5.123123=c 1.0123123=a -6.12312=b
cc 6.12313=c -3.123123=a -8.512323=b
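The row-wise loop can also be vectorized (a sketch of my own, not from either answer): argsort the numeric values per row in descending order, then reorder the labeled strings with the same index array via np.take_along_axis:

```python
import numpy as np
import pandas as pd

data = [[4.12, 3, 2], [1.0123123, -6.12312, 5.123123], [-3.123123, -8.512323, 6.12313]]
df = pd.DataFrame(data, columns=['a', 'b', 'c'], index=['aa', 'bb', 'cc'])

# per-row column order for descending values (negate for descending argsort)
order = np.argsort(-df.values, axis=1)

# the "value=column" strings, reordered row by row with the same indices
labeled = df.astype(str).apply(lambda x: x + '=' + x.name).values
sorted_vals = np.take_along_axis(labeled, order, axis=1)

out = pd.DataFrame(sorted_vals, index=df.index,
                   columns=[f"max={i}" for i in range(1, df.shape[1] + 1)])
print(out)
```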
I have two data frames that I would like to compare for equality in a row-wise manner. I am interested in computing the number of rows that have the same values for non-joined attributes.
For example,
import pandas as pd
df1 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,10,30]})
df2 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,40,40]})
I will be joining these two data frames on column a and b. There are two rows (first two) that have the same values for c and d in both the data frames.
I am currently using the following approach, where I first join the two data frames and then check each row's values for equality:
df = df1.merge(df2, on=['a','b'])
cols1 = [c for c in df.columns.tolist() if c.endswith("_x")]
cols2 = [c for c in df.columns.tolist() if c.endswith("_y")]
num_rows_equal = 0
for index, row in df.iterrows():
    not_equal = False
    for col1, col2 in zip(cols1, cols2):
        if row[col1] != row[col2]:
            not_equal = True
            break
    if not not_equal:  # row values are equal
        num_rows_equal += 1
num_rows_equal
Is there a more efficient (pythonic) way to achieve the same result?
A shorter way of achieving that:
import pandas as pd
df1 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,10,30]})
df2 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,40,40]})
df = df1.merge(df2, on=['a','b'])
# use c[:-2] rather than strip('_x'), which removes characters, not a suffix
comparison_cols = [c[:-2] for c in df.columns.tolist() if c.endswith("_x")]
num_rows_equal = (df1[comparison_cols][df1[comparison_cols] == df2[comparison_cols]].isna().sum(axis=1) == 0).sum()
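An equivalent vectorized count (a sketch of my own) compares the suffixed columns directly on the merged frame, which avoids the NaN bookkeeping:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [2, 3, 4, 6], 'c': [60, 20, 40, 30], 'd': [50, 90, 10, 30]})
df2 = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [2, 3, 4, 6], 'c': [60, 20, 40, 30], 'd': [50, 90, 40, 40]})

df = df1.merge(df2, on=['a', 'b'])
# sorting keeps the _x and _y column lists aligned pairwise
cols1 = sorted(c for c in df.columns if c.endswith('_x'))
cols2 = sorted(c for c in df.columns if c.endswith('_y'))

# rows where every non-join column agrees between the two frames
num_rows_equal = (df[cols1].values == df[cols2].values).all(axis=1).sum()
print(num_rows_equal)  # 2
```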
Use pandas merge_ordered, merging with how='inner'. From there you can get the dataframe's shape and, by extension, the number of rows:
df_r = pd.merge_ordered(df1,df2,how='inner')
a b c d
0 1 2 60 50
1 2 3 20 90
no_of_rows = df_r.shape[0]
#print(no_of_rows)
#2
I have a dataframe with comma-separated values in the Language column:
Name Language
0 A French,Espanol
1 B Deutsch,English
I wish to transform the above dataframe as below
Name Language
0 A French
1 A Espanol
2 B Deutsch
3 B English
I tried the code below but couldn't accomplish it:
df = df.join(df.pop('Language').str.extractall(',$')[0]
             .reset_index(level=1, drop=True)
             .rename('Language')).reset_index(drop=True)
pandas.DataFrame.explode should be suited for that task. Combine it with pandas.DataFrame.assign to get the desired column:
import pandas as pd
df = pd.DataFrame({'Name':['A', 'B'], 'Language': ['French,Espanol', 'Deutsch,English']})
df = df.assign(Language=df['Language'].str.split(',')).explode('Language')
# Name Language
# 0 A French
# 0 A Espanol
# 1 B Deutsch
# 1 B English
First build a list of rows by splitting the Language values, then construct the new dataframe from that list (DataFrame.append is deprecated, so the rows are collected in a plain list and passed to the constructor):
import pandas as pd

csv_df = pd.DataFrame([['1', '2,3'], ['2', '4,5']], columns=['Name', 'Language'])

rows = []
for index, row in csv_df.iterrows():
    name = row['Name']
    for lang in row['Language'].split(','):
        rows.append({'Name': name, 'Language': lang})

df = pd.DataFrame(rows)
print(df)
I have two DataFrame objects:
df1: columns = [a, b, c]
df2: columns = [d, e]
I want to merge df1 with df2 using the equivalent of sql in pandas:
select * from df1 inner join df2 on df1.b=df2.e and df1.b <> df2.d and df1.c = 0
The following sequence of steps should get you there:
filtered = df1[df1.c == 0]                               # df1.c = 0
merged = filtered.merge(df2, left_on='b', right_on='e')  # inner join on df1.b = df2.e
result = merged[merged.b != merged.d]                    # df1.b <> df2.d
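A quick check of those three steps with small made-up frames (the data here is a hypothetical example, not from the question):

```python
import pandas as pd

# hypothetical sample data
df1 = pd.DataFrame({'a': [1, 2], 'b': [5, 6], 'c': [0, 1]})
df2 = pd.DataFrame({'d': [9, 7], 'e': [5, 6]})

filtered = df1[df1.c == 0]                               # keep rows with c = 0
merged = filtered.merge(df2, left_on='b', right_on='e')  # inner join on b = e
result = merged[merged.b != merged.d]                    # drop rows where b = d
print(result)  # one row: a=1, b=5, c=0, d=9, e=5
```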