How to replace column names in pandas based on a dictionary? - python

I have 4 different dfs.
Example of column names:
df1 = a1, b1, c1, d1
df2 = b1, c1, e1
df3 = a1, b1, c1
And I created a dictionary like this:
d = {'a1': 'art1', 'b1': 'base1', 'c1': 'cell1', 'd1': 'dan1', 'e1': 'el1'}
Is it possible to rename the columns without using a for loop? I mean, I tried the rename function in a for loop, but then I need to loop through all the dataframes, which doesn't look good in the code, and I don't think it's as fast as it should be.
My expected result is:
df1 = art1, base1, cell1, dan1
df2 = base1, cell1, el1
df3 = art1, base1, cell1
I found some answers on Stack Overflow, but nothing fits my problem, where I have one dictionary and a few dfs whose column names are not unique.

A comprehension:
d = {'a1': 'art1', 'b1': 'base1', 'c1': 'cell1', 'd1': 'dan1', 'e1': 'el1'}
df1, df2, df3 = [df.rename(columns=d) for df in [df1, df2, df3]]
A simple loop:
for df in [df1, df2, df3]:
    df.rename(columns=d, inplace=True)
Using map:
df1, df2, df3 = list(map(lambda df: df.rename(columns=d), [df1, df2, df3]))
At the end, IMHO the simple loop with inplace=True is the most elegant.
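Worth noting why a single dictionary works across frames with different column subsets: rename simply ignores mapping keys that don't appear in a given dataframe. A quick sanity check on a toy frame:
import pandas as pd

d = {'a1': 'art1', 'b1': 'base1', 'c1': 'cell1', 'd1': 'dan1', 'e1': 'el1'}
df2 = pd.DataFrame(columns=['b1', 'c1', 'e1'])
# 'a1' and 'd1' are absent from df2 and are simply skipped
print(df2.rename(columns=d).columns.tolist())  # ['base1', 'cell1', 'el1']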

Related

Pandas merge multiple dataframes on one temporal index, with latest value from all others

I'm merging some dataframes which have a time index.
import pandas as pd
df1 = pd.DataFrame(['a', 'b', 'c'],
                   columns=pd.MultiIndex.from_product([['target'], ['key']]),
                   index=[
                       '2022-04-15 20:20:20.000000',
                       '2022-04-15 20:20:21.000000',
                       '2022-04-15 20:20:22.000000'])
df2 = pd.DataFrame(['a2', 'b2', 'c2', 'd2', 'e2'],
                   columns=pd.MultiIndex.from_product([['feature2'], ['keys']]),
                   index=[
                       '2022-04-15 20:20:20.100000',
                       '2022-04-15 20:20:20.500000',
                       '2022-04-15 20:20:20.900000',
                       '2022-04-15 20:20:21.000000',
                       '2022-04-15 20:20:21.100000'])
df3 = pd.DataFrame(['a3', 'b3', 'c3', 'd3', 'e3'],
                   columns=pd.MultiIndex.from_product([['feature3'], ['keys']]),
                   index=[
                       '2022-04-15 20:20:19.000000',
                       '2022-04-15 20:20:19.200000',
                       '2022-04-15 20:20:20.000000',
                       '2022-04-15 20:20:20.200000',
                       '2022-04-15 20:20:23.100000'])
then I use this merge procedure:
def merge(dfs: list[pd.DataFrame], targetColumn: 'str|tuple[str]'):
    from functools import reduce
    if len(dfs) == 0:
        return None
    if len(dfs) == 1:
        return dfs[0]
    for df in dfs:
        df.index = pd.to_datetime(df.index)
    merged = reduce(
        lambda left, right: pd.merge(
            left,
            right,
            how='outer',
            left_index=True,
            right_index=True),
        dfs)
    for col in merged.columns:
        if col != targetColumn:
            merged[col] = merged[col].fillna(method='ffill')
    return merged[merged[targetColumn].notna()]
like this:
merged = merge([df1, df2, df3], targetColumn=('target', 'key'))
which produces the merged frame I'm after: only rows where the target is observed, with the latest preceding values from the other frames.
And it all works great. The problem is efficiency - notice that in the merge() procedure I use reduce and an outer merge to join the dataframes together, which can make a HUGE interim dataframe that then gets filtered down. But what if my PC doesn't have enough RAM to handle that huge dataframe in memory? Well, that's the problem I'm trying to avoid.
I'm wondering if there's a way to avoid expanding the data out into a huge dataframe while merging.
Of course a regular old merge isn't sufficient because it only merges on exactly matching indexes rather than the latest temporal index before the target variable's observation:
df1.merge(df2, how='left', left_index=True, right_index=True)
Has this kind of thing been solved efficiently? It seems like a common data science issue, since no one wants to leak future information into their models, and everyone has various inputs to merge together...
You're in luck: pandas.merge_asof does exactly what you need!
We use the default direction='backward' argument:
A “backward” search selects the last row in the right DataFrame whose
‘on’ key is less than or equal to the left’s key.
Using your three example DataFrames:
import pandas as pd
from functools import reduce
# Convert all indexes to datetime
for df in [df1, df2, df3]:
    df.index = pd.to_datetime(df.index)
# Perform as-of merges
res = reduce(lambda left, right:
             pd.merge_asof(left, right, left_index=True, right_index=True),
             [df1, df2, df3])
print(res)
                    target feature2 feature3
                       key     keys     keys
2022-04-15 20:20:20      a      NaN       c3
2022-04-15 20:20:21      b       d2       d3
2022-04-15 20:20:22      c       e2       d3
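One caveat to keep in mind: pd.merge_asof requires both sides to be sorted on the merge key, so if your real indexes aren't already in order, sort them first, for example:
# merge_asof raises a ValueError on unsorted keys, so sort the indexes up front
df1, df2, df3 = (df.sort_index() for df in (df1, df2, df3))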
Here's some code that works for your example. I'm not sure about more general cases of multi-indexed columns, but in any event it contains the basic ideas for merging on a single temporal index.
merged = df1.copy(deep=True)
for df in [df2, df3]:
    # For each timestamp in merged, find the position of the last row in df at or before it
    idxNew = df.index.get_indexer(merged.index, method='pad')
    idxMerged = [i for i, x in enumerate(idxNew) if x != -1]
    idxNew = [x for x in idxNew if x != -1]
    n = len(merged.columns)
    merged[df.columns] = None
    merged.iloc[idxMerged, n:] = df.iloc[idxNew, :].set_index(merged.index[idxMerged])
print(merged)
Output:
                            target feature2 feature3
                               key     keys     keys
2022-04-15 20:20:20.000000       a     None       c3
2022-04-15 20:20:21.000000       b       d2       d3
2022-04-15 20:20:22.000000       c       e2       d3

rename columns according to list

I have 3 lists of data frames, and I want to add a suffix to each column according to which list of data frames it belongs to. It's all in order, so the first item in the suffix list should be appended to the columns of the data frames in the first list, and so on. My attempt is below, but it's adding each item in the suffix list to each column.
In the expected output
all columns in dfs in cat_a need group1 appended
all columns in dfs in cat_b need group2 appended
all columns in dfs in cat_c need group3 appended
data and code are here
import numpy as np
import pandas as pd

df1, df2, df3, df4 = (pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('a', 'b')),
                      pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('c', 'd')),
                      pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('e', 'f')),
                      pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('g', 'h')))
cat_a = [df1, df2]
cat_b = [df3, df4, df2]
cat_c = [df1]
suffix = ['group1', 'group2', 'group3']
dfs = [cat_a, cat_b, cat_c]
for x, y in enumerate(dfs):
    for i in y:
        suff = suffix
        i.columns = i.columns + '_' + suff[x]
thanks for taking a look!
Brian Joseph's answer is great, but I'd like to point out that you were very close; you just weren't renaming the columns correctly. Your last line should be like this:
i.columns = [col + '_' + suff[x] for col in i.columns]
instead of this:
i.columns = i.columns + '_' + suff[x]
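As a side note, pandas also has DataFrame.add_suffix for exactly this pattern; it returns a renamed copy rather than mutating in place:
df1 = df1.add_suffix('_group1')  # 'a' -> 'a_group1', 'b' -> 'b_group1'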
Assuming you want multiple suffixes for some dataframes, I think this is what you want:
suffix_mapper = {
    'group1': [df1, df2],
    'group2': [df3, df4, df2],
    'group3': [df1]
}

for suffix, dfs in suffix_mapper.items():
    for df in dfs:
        df.columns = [f"{col}_{suffix}" for col in df.columns]
I think the issue is that you're not taking a copy of the dataframes, so each cat list references the same df objects multiple times.
Try:
cat_a = [df1.copy(), df2.copy()]
cat_b = [df3.copy(), df4.copy(), df2.copy()]
cat_c = [df1.copy()]
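To see why the copies matter, here is a minimal sketch (toy data) of the aliasing problem: without copies, df2 is the same object in both cat_a and cat_b, so the suffixes stack:
import pandas as pd

df2 = pd.DataFrame({'c': [1], 'd': [2]})
cat_a = [df2]  # both lists reference the same object
cat_b = [df2]
for suffix, cat in [('group1', cat_a), ('group2', cat_b)]:
    for df in cat:
        df.columns = [f"{col}_{suffix}" for col in df.columns]
print(df2.columns.tolist())  # ['c_group1_group2', 'd_group1_group2']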

pandas merge by excluding certain columns from merge

I want to merge two dataframes like:
df1.columns = A, B, C, E, ..., D
df2.columns = A, B, C, F, ..., D
If I merge them, it merges on all the common columns. Also, since the number of columns is high, I don't want to specify them in on. I'd prefer to exclude the columns which I don't want to be merged on. How can I do that?
mdf = pd.merge(df1, df2, exclude D)
I expect the result to be like:
mdf.columns = A, B, C, E, F ..., D_x, D_y
You mentioned you don't want to use on since the number of columns is high.
You could still use on this way even if there are a lot of columns:
mdf = pd.merge(df1, df2, on=[i for i in df1.columns if i != 'D'])
Or
By using pd.Index.difference
mdf = pd.merge(df1, df2, on=df1.columns.difference(['D']).tolist())
Another solution can be (note that list.remove mutates the list in place and returns None, so it can't be chained inline):
cols = df1.columns.tolist()
cols.remove('D')
mdf = pd.merge(df1, df2, on=cols)
What about dropping the unwanted column after the merge?
You can use pandas.DataFrame.drop:
mdf = pd.merge(df1, df2).drop('D', axis=1)
or dropping before the merge:
mdf = pd.merge(df1.drop('D', axis=1), df2.drop('D', axis=1))
One solution is using intersection and then difference on df1 and df2 columns:
mdf = pd.merge(df1, df2, on=df1.columns.intersection(df2.columns).difference(['D']).tolist())
The other solution could be renaming columns you want to exclude from merge:
df2.rename(columns={"D":"D_y"}, inplace=True)
mdf = pd.merge(df1, df2)

Python pandas search for value in a df from another df

I've got two data frames:
Df1
Time   V1    V2
02:00  D3F3  0041
02:01  DD34  0040
Df2
FileName  V1    V2
1111.txt  D3F3  0041
2222.txt  0000  0040
Basically I want to compare the V1 and V2 columns, and if they match, print that row's Time from df1 and the FileName from df2. So far all I can find is isin(), which simply gives you a boolean output.
So the output would be :
1111.txt 02:00
I started using dataframes because I thought I could query the two dfs on the V1/V2 values, but I can't see a way. Any pointers would be much appreciated.
Use merge on the dataframe columns that you want to have the same values. You can then drop the rows with NaN values, as those will not have matching values. From there, you can print the merged dataframe's values however you see fit.
df1 = pd.DataFrame({'Time': ['8a', '10p'], 'V1': [1, 2], 'V2': [3, 4]})
df2 = pd.DataFrame({'fn': ['8.txt', '10.txt'], 'V1': [3, 2], 'V2': [3, 4]})
df1.merge(df2, on=['V1', 'V2'], how='outer').dropna()
=== Output: ===
  Time  V1  V2      fn
1  10p   2   4  10.txt
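If you only want the matching rows, a plain inner merge (merge's default how) gives the same result without the dropna step:
df1.merge(df2, on=['V1', 'V2'])  # inner join by default: keeps only rows where V1 and V2 match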
The most intuitive solution is:
1) iterate the V1 column in DF1;
2) for each item in this column, check if this item exists in the V1 column of DF2;
3) if the item exists in DF2's V1, find the index of that item in DF2, and from that index you can look up the file name.
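A minimal sketch of those three steps on the example frames (using a boolean mask in place of the manual index lookup):
import pandas as pd

df1 = pd.DataFrame({'Time': ['02:00', '02:01'], 'V1': ['D3F3', 'DD34'], 'V2': ['0041', '0040']})
df2 = pd.DataFrame({'FileName': ['1111.txt', '2222.txt'], 'V1': ['D3F3', '0000'], 'V2': ['0041', '0040']})
for _, row in df1.iterrows():
    # steps 2-3: find rows in df2 whose V1/V2 pair matches this df1 row
    matches = df2[(df2['V1'] == row['V1']) & (df2['V2'] == row['V2'])]
    for fname in matches['FileName']:
        print(fname, row['Time'])  # prints: 1111.txt 02:00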
You can try using pd.concat.
In this case it would be like:
pd.concat([df1, df2.reindex(df1.index)], axis=1)
It will create a new dataframe with all the values, but where some values don't match in both dataframes it'll return NaN. If you don't want this to happen, you must use this:
pd.concat([df1, df2], axis=1, join='inner')
If you want to learn a bit more, see the pandas user guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
You can use merge with an inner join:
df2.merge(df1,how="inner",on=["V1","V2"])[["FileName","Time"]]
While I think Eric's solution is more pythonic, if your only aim is to print the rows on which df1 and df2 have the same V1 and V2 values, then, provided the two dataframes are of the same length, you can do the following:
for row in range(len(df1)):
    if (df1.iloc[row, 1:] == df2.iloc[row, 1:]).all():
        print(df1.iloc[row], df2.iloc[row])
Try this:
import io
import boto3
import pandas as pd

client = boto3.client('s3')
obj = client.get_object(Bucket='', Key='')
data = obj['Body'].read()
df1 = pd.read_excel(io.BytesIO(data), sheet_name='0')
df2 = pd.read_excel(io.BytesIO(data), sheet_name='1')
head = df2.columns[0]
print(head)
data = df1.iloc[[8],[0]].values[0]
print(data)
print(df2)
df2.columns = df2.iloc[0]
df2 = df2.drop(labels=0, axis=0)
df2['Head'] = head
df2['ID'] = pd.Series([data,data])
print(df2)
df2.to_csv('test.csv',index=False)

Intersection of pandas dataframe with multiple columns

I have a list of dataframes as:
[df1, df2, df3, ..., df100, oddDF]
Each dataframe dfi has DateTime as column 1 and Temperature as column 2, except the dataframe oddDF, which has DateTime as column 1 and temperature columns in columns 2 and 3.
I am looking to create a list of dataframes, or one dataframe, which has the common temperatures from each of df1, ..., df100 and oddDF.
I am trying the following:
dfs = [df0, df1, df2, .., df100, oddDF]
df_final = reduce(lambda left,right: pd.merge(left,right,on='DateTime'), dfs)
But it produces an empty df_final.
If however I do just:
dfs = [df0, df1, df2, .., df100]
df_final = reduce(lambda left,right: pd.merge(left,right,on='DateTime'), dfs)
df_final produces the right answer.
How do I incorporate oddDF in the code as well? I have checked to make sure that oddDF's DateTime column has dates in common with
df1, df2, .., df100
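One thing worth checking - a guess, not a confirmed diagnosis: if oddDF's DateTime column has a different dtype from the others (for example strings versus datetime64), the merge keys never compare equal and the reduce comes out empty. Normalizing the key column before merging is a cheap test:
import pandas as pd
from functools import reduce

# hypothetical check: coerce every frame's DateTime key to datetime64 first
for df in dfs:
    df['DateTime'] = pd.to_datetime(df['DateTime'])
df_final = reduce(lambda left, right: pd.merge(left, right, on='DateTime'), dfs)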
