I have a dataframe in pandas which has a group of suffixed columns (there are several, but I'll use two years as an example: the bare name and _1), where each suffix depicts a different year.
df = pd.DataFrame({'A': ['BP','Virgin'],
'B(LY)': ['A','C'],
'B(LY_1)': ['B', 'D'],
'C': [1, 3],
'C_1': [2,4],
'D': ['W','Y'],
'D_1': ['X','Z']})
I'm trying to reorganise the table by pivoting it, so that it looks like this:
df = pd.DataFrame({'A': ['BP','BP', 'Virgin', 'Virgin'],
'Year': ['A','B','C','D'],
'C': [1,2,3,4],
'D': ['W','X','Y','Z']})
But I can't work out how to do it. The problem is, I need each suffixed column to be matched with the columns carrying the equivalent suffix on the other variables. Any help is appreciated, thanks.
EDIT
Here is a real-life example of the data:
df = pd.DataFrame({'Company': ['BP','Virgin'],
'Account_date(LY)': ['May','Apr'],
'Account_date(LY_1)': ['Apr', 'Mar'],
'Account_date(LY_2)': ['Mar', 'Feb'],
'Account_date(LY_3)': ['Feb', 'Jan'],
'Acc_day': [1, 5],
'Acc_day_1': [2,6],
'Acc_day_2': [3,7],
'Acc_day_3': [4,8],
'D': ['W','A'],
'D_1': ['X','B'],
'D_2': ['Y','C'],
'D_3': ['Z','D']})
desired output:
df = pd.DataFrame({'Company': ['BP','BP','BP','BP', 'Virgin', 'Virgin','Virgin', 'Virgin'],
'Year': ['May','Apr','Mar','Feb','Apr','Mar','Feb','Jan'],
'Acc_day': [1,2,3,4,5,6,7,8],
'D': ['W','X','Y','Z','A','B','C','D']})
You can use the following (this assumes plain _ suffixes such as C/C_1; the mapper version further down normalises the (LY) names first):
# set A aside
df2 = df.set_index('A')
# split columns to MultiIndex on "_"
df2.columns = df2.columns.str.split('_', expand=True)
# reshape
out = df2.stack().droplevel(1).rename(columns={'B': 'Year'}).reset_index()
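To see what stack is pairing, it helps to print the intermediate column index. A minimal sketch, assuming the columns use plain _ suffixes (e.g. B/B_1):
import pandas as pd

df2 = pd.DataFrame({'B': ['A', 'C'], 'C': [1, 3], 'D': ['W', 'Y'],
                    'B_1': ['B', 'D'], 'C_1': [2, 4], 'D_1': ['X', 'Z']},
                   index=pd.Index(['BP', 'Virgin'], name='A'))
print(df2.columns.str.split('_', expand=True))
# MultiIndex([('B', nan), ('C', nan), ('D', nan),
#             ('B', '1'), ('C', '1'), ('D', '1')])
# stack moves the second level into the row index, so the un-suffixed
# columns form one output row per company and the "_1" columns another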
Or using janitor's pivot_longer:
import janitor
out = (df.pivot_longer(index='A', names_sep='_', names_to=('.value', '_drop'), sort_by_appearance=True)
.rename(columns={'B': 'Year'}).drop(columns='_drop')
)
Output:
A Year C D
0 BP A 1 W
1 BP B 2 X
2 Virgin C 3 Y
3 Virgin D 4 Z
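In the pivot_longer call, '.value' (a convention pyjanitor borrows from tidyr's pivot_longer) marks the part of each split column name that should stay as a column header, while the part matched to '_drop' becomes an ordinary column that is discarded afterwards.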
updated example
Using a mapper to normalise the (LY) suffixes, i.e. (LY) -> '' and (LY_1) -> _1, so they line up with the bare and _1 columns:
import re
# you can generate this mapper programmatically if needed
mapper = {'(LY)': '', '(LY_1)': '_1'}
# set A aside
df2 = df.set_index('A')
# split columns to MultiIndex on "_"
pattern = '|'.join(map(re.escape, mapper))
df2.columns = df2.columns.str.replace(pattern, lambda m: mapper[m.group()], regex=True).str.split('_', expand=True)
# reshape
out = df2.stack().droplevel(1).rename(columns={'B': 'Year'}).reset_index()
Output:
A Year C D
0 BP A 1 W
1 BP B 2 X
2 Virgin C 3 Y
3 Virgin D 4 Z
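As the comment above notes, the mapper can be generated programmatically. A minimal sketch, assuming the suffixes always follow the (LY)/(LY_n) pattern from the question:
import re

def build_mapper(columns):
    # map '(LY)' to '' and '(LY_n)' to '_n' for every suffix found
    mapper = {}
    for col in columns:
        m = re.search(r'\(LY(?:_(\d+))?\)$', col)
        if m:
            mapper[m.group(0)] = f'_{m.group(1)}' if m.group(1) else ''
    return mapper

print(build_mapper(df.columns))  # {'(LY)': '', '(LY_1)': '_1'}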
Related
I have a pandas data frame like this:
df = pd.DataFrame({"Id": [1, 1, 1, 2, 2, 2, 2],
"Letter": ['A', 'B', 'C', 'A', 'D', 'B', 'C']})
How can I efficiently add a new column, "Merge", that concatenates all the values from the column "Letter" by "Id", so that the final data frame looks like this:
Id Letter Merge
0 1 A A-B-C
1 1 B A-B-C
2 1 C A-B-C
3 2 A A-D-B-C
4 2 D A-D-B-C
5 2 B A-D-B-C
6 2 C A-D-B-C
You can group by the Id column and then transform the Letter column:
df['Merge'] = df.groupby('Id')['Letter'].transform(lambda x: '-'.join(x))
print(df)
Id Letter Merge
0 1 A A-B-C
1 1 B A-B-C
2 1 C A-B-C
3 2 A A-D-B-C
4 2 D A-D-B-C
5 2 B A-D-B-C
6 2 C A-D-B-C
Thanks to sammywemmy for pointing out that the lambda is needless here, so you can use a simpler form:
df['Merge'] = df.groupby('Id')['Letter'].transform('-'.join)
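If you only want one row per Id instead of the joined string repeated on every row, an agg variant of the same idea (a sketch) is:
merged = df.groupby('Id')['Letter'].agg('-'.join).reset_index(name='Merge')
# merged holds one row per Id: (1, 'A-B-C') and (2, 'A-D-B-C')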
I have a dataframe with words as the index and a corresponding sentiment score in another column. I also have a second dataframe with one column holding lists of words (token lists), one list per row. I want to find the average sentiment score for each list. This has to be done for a huge number of rows, so efficiency is important.
One method I have in mind is given below:
import pandas as pd
a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
'''
df
  tokens
0 [a, b, c]
1 [hi, this, is, a, sample]
'''
def find_score(tokenlist, ref_df):
    # ref_df contains two cols, 'tokens' and 'sentiment_score'
    temp_df = pd.DataFrame()
    temp_df['tokens'] = tokenlist
    # this returns the mean score for the list
    return temp_df.merge(ref_df, on='tokens', how='inner')['sentiment_score'].mean()

df['score'] = df['tokens'].apply(find_score, args=(ref_df,))
# each input for find_score will be a list
Is there any more efficient way to do it without creating dataframe for each list?
You can create a dictionary for mapping from the reference dataframe ref_df and then use .map() on each token list on each row of dataframe df, as follows:
import numpy as np

ref_dict = dict(zip(ref_df['tokens'], ref_df['sentiment_score']))
df['score'] = df['tokens'].map(lambda x: np.mean([ref_dict[y] for y in x if y in ref_dict]))
Demo
Test Data Construction
a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
ref_df = pd.DataFrame({'tokens': ['a', 'b', 'c', 'd', 'hi', 'this', 'is', 'sample', 'example'],
'sentiment_score': [1, 2, 3, 4, 11, 12, 13, 14, 15]})
print(df)
tokens
0 [a, b, c]
1 [hi, this, is, a, sample]
print(ref_df)
tokens sentiment_score
0 a 1
1 b 2
2 c 3
3 d 4
4 hi 11
5 this 12
6 is 13
7 sample 14
8 example 15
Run New Code
ref_dict = dict(zip(ref_df['tokens'], ref_df['sentiment_score']))
df['score'] = df['tokens'].map(lambda x: np.mean([ref_dict[y] for y in x if y in ref_dict]))
Output
print(df)
tokens score
0 [a, b, c] 2.0
1 [hi, this, is, a, sample] 10.2
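One caveat: if a row's list contains no scored tokens, np.mean([]) returns nan and emits a RuntimeWarning. A plain-Python guard (a sketch) avoids the warning:
def safe_mean(tokens):
    # average only the tokens that have a score; nan when none match
    scores = [ref_dict[t] for t in tokens if t in ref_dict]
    return sum(scores) / len(scores) if scores else float('nan')

df['score'] = df['tokens'].map(safe_mean)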
Let's try explode, merge, and agg:
import pandas as pd
a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
ref_df = pd.DataFrame({'sentiment_score': {'a': 1, 'b': 2,
'c': 3, 'hi': 4,
'this': 5, 'is': 6,
'sample': 7}})
# Explode Tokens into rows (Preserve original index)
new_df = df.explode('tokens').reset_index()
# Merge sentiment_scores
new_df = new_df.merge(ref_df, left_on='tokens',
right_index=True,
how='inner')
# Group By Original Index and agg back to lists and take mean
new_df = new_df.groupby('index') \
.agg({'tokens': list, 'sentiment_score': 'mean'}) \
.reset_index(drop=True)
print(new_df)
Output:
tokens sentiment_score
0 [a, b, c] 2.0
1 [a, hi, this, is, sample] 4.6
After Explode:
index tokens
0 0 a
1 0 b
2 0 c
3 1 hi
4 1 this
5 1 is
6 1 a
7 1 sample
After Merge
index tokens sentiment_score
0 0 a 1
1 1 a 1
2 0 b 2
3 0 c 3
4 1 hi 4
5 1 this 5
6 1 is 6
7 1 sample 7
(The one-liner)
new_df = df.explode('tokens') \
.reset_index() \
.merge(ref_df, left_on='tokens',
right_index=True,
how='inner') \
.groupby('index') \
.agg({'tokens': list, 'sentiment_score': 'mean'}) \
.reset_index(drop=True)
If the order of the tokens in the list matters, the scores can be calculated and merged back to the original df instead of using list aggregation:
mean_scores = df.explode('tokens') \
.reset_index() \
.merge(ref_df, left_on='tokens',
right_index=True,
how='inner') \
.groupby('index')['sentiment_score'].mean() \
.reset_index(drop=True)
new_df = df.merge(mean_scores,
left_index=True,
right_index=True)
print(new_df)
Output:
tokens sentiment_score
0 [a, b, c] 2.0
1 [hi, this, is, a, sample] 4.6
I have two dataframes like this:
import pandas as pd
df1 = pd.DataFrame(
{
'A': list('abcaewar'),
'B': list('ghjglmgb'),
'C': list('lkjlytle'),
'ignore': ['stuff'] * 8
}
)
df2 = pd.DataFrame(
{
'A': list('abfu'),
'B': list('ghio'),
'C': list('lkqw'),
'stuff': ['ignore'] * 4
}
)
and I would like to remove all rows in df1 where the values of A, B and C match a row in df2, so in the above case the expected outcome is
A B C ignore
0 c j j stuff
1 e l y stuff
2 w m t stuff
3 r b e stuff
One way of achieving this would be
comp_columns = ['A', 'B', 'C']
df1 = df1.set_index(comp_columns)
df2 = df2.set_index(comp_columns)
keep_ind = [
ind for ind in df1.index if ind not in df2.index
]
new_df1 = df1.loc[keep_ind].reset_index()
Does anyone see a more straightforward way of doing this which avoids the reset_index() operations and the loop to identify non-overlapping indices, e.g. by some smart way of masking? Ideally, I don't want to hardcode the columns but define them in a list as above, as I sometimes need 2, sometimes 3 and sometimes 4 or more columns for the removal.
Use DataFrame.merge with optional parameter indicator=True, then use boolean masking to filter the rows in df1:
df3 = df1.merge(df2[['A', 'B', 'C']], on=['A', 'B', 'C'], indicator=True, how='left')
df3 = df3[df3.pop('_merge').eq('left_only')]
Result:
# print(df3)
A B C ignore
2 c j j stuff
4 e l y stuff
5 w m t stuff
7 r b e stuff
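Since the question asks not to hardcode the columns, the same merge works with the comparison columns held in a list (a small variation on the code above):
comp_columns = ['A', 'B', 'C']  # 2, 3, 4 or more columns, as needed
df3 = df1.merge(df2[comp_columns], on=comp_columns, indicator=True, how='left')
df3 = df3[df3.pop('_merge').eq('left_only')]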
I am looking for the fastest way to join columns with the same names using a separator.
my dataframes:
df1:
A,B,C,D
my,he,she,it
df2:
A,B,C,D
dog,cat,elephant,fish
expected output:
df:
A,B,C,D
my:dog,he:cat,she:elephant,it:fish
As you can see, I want to merge columns with the same name, joining the two cells into one.
I can use this code for the A column:
df=df1.merge(df2)
df['A'] = df[['A_x','A_y']].apply(lambda x: ':'.join(x), axis = 1)
In my real dataset I have over 30 columns, and I don't want to write the same lines for each of them. Is there any faster way to get my expected output?
How about concat and groupby?
df3 = pd.concat([df1,df2],axis=0)
df3 = df3.groupby(df3.index).transform(lambda x : ':'.join(x)).drop_duplicates()
print(df3)
A B C D
0 my:dog he:cat she:elephant it:fish
How about this?
df3 = df1 + ':' + df2
print(df3)
A B C D
0 my:dog he:cat she:elephant it:fish
This is good because if there are columns that don't match, you get NaN, so you can filter them out later if you want:
df1 = pd.DataFrame({'A': ['my'], 'B': ['he'], 'C': ['she'], 'D': ['it'], 'E': ['another'], 'F': ['and another']})
df2 = pd.DataFrame({'A': ['dog'], 'B': ['cat'], 'C': ['elephant'], 'D': ['fish']})
df1 + ':' + df2
A B C D E F
0 my:dog he:cat she:elephant it:fish NaN NaN
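For example, to drop the unmatched columns afterwards (note this drops any column containing a NaN):
df3 = (df1 + ':' + df2).dropna(axis=1)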
You can do this by simply adding the two dataframes with a separator:
import pandas as pd
df1 = pd.DataFrame(columns=["A", "B", "C", "D"], index=[0])
df2 = pd.DataFrame(columns=["A", "B", "C", "D"], index=[0])
df1["A"] = "my"
df1["B"] = "he"
df1["C"] = "she"
df1["D"] = "it"
df2["A"] = "dog"
df2["B"] = "cat"
df2["C"] = "elephant"
df2["D"] = "fish"
print(df1)
print(df2)
df3 = df1 + ':' + df2
print(df3)
This will give you a result like:
A B C D
0 my he she it
A B C D
0 dog cat elephant fish
A B C D
0 my:dog he:cat she:elephant it:fish
Is this what you are trying to achieve? Note that this only works if you have the same columns in both dataframes; the extra columns will have NaNs. What do you want to do with the columns that are not shared between df1 and df2? Please comment below to help me understand your problem better.
You can simply do:
df = df1 + ':' + df2
print(df)
which is simple and effective.
I have a dataframe with repeated column names which account for repeated measurements.
import pandas as pd
from numpy.random import randn

df = pd.DataFrame({'A': randn(5), 'B': randn(5)})
df2 = pd.DataFrame({'A': randn(5), 'B': randn(5)})
df3 = pd.concat([df,df2], axis=1)
df3
A B A B
0 -0.875884 -0.298203 0.877414 1.282025
1 1.605602 -0.127038 -0.286237 0.572269
2 1.349540 -0.067487 0.126440 1.063988
3 -0.142809 1.282968 0.941925 -1.593592
4 -0.630353 1.888605 -1.176436 -1.623352
I'd like to take the mean of the 'A' columns and of the 'B' columns, such that the dataframe shrinks to
A B
0 0.000765 0.491911
1 0.659682 0.222616
2 0.737990 0.498251
3 0.399558 -0.155312
4 -0.903395 0.132627
If I do the typical
df3['A'].mean(axis=1)
I get a Series (with no column name), and I would then have to build a new dataframe from the means of each column group. Also, the .groupby() method apparently doesn't let you group by column name; rather, you give it the columns and it groups along the index. Is there a fancy way to do this?
Side question: why does
df = pd.DataFrame({'A': randn(5), 'B': randn(5), 'A': randn(5), 'B': randn(5)})
not generate a 4-column dataframe but instead merge the same-name cols?
You can use the level keyword (treating your columns as the first level, level 0, of a column MultiIndex which here has only one level):
In [11]: df3
Out[11]:
A B A B
0 -0.367326 -0.422332 2.379907 1.502237
1 -1.060848 0.083976 0.619213 -0.303383
2 0.805418 -0.109793 0.257343 0.186462
3 2.419282 -0.452402 0.702167 0.216165
4 -0.464248 -0.980507 0.823302 0.900429
In [12]: df3.mean(axis=1, level=0)
Out[12]:
A B
0 1.006291 0.539952
1 -0.220818 -0.109704
2 0.531380 0.038334
3 1.560725 -0.118118
4 0.179527 -0.040039
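Note that the level argument of DataFrame reductions was removed in pandas 2.0, so mean(axis=1, level=0) no longer works there; as far as I know, the equivalent on current pandas is to group the transposed frame by the first level of its index:
# same result on pandas >= 2.0
out = df3.T.groupby(level=0).mean().T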
You've created df3 in a strange way; for this simple case, the following would work:
In [86]:
df = pd.DataFrame({'A': randn(5), 'B': randn(5)})
df2 = pd.DataFrame({'A': randn(5), 'B': randn(5)})
print(df)
print(df2)
A B
0 -0.732807 -0.571942
1 -1.546377 -1.586371
2 0.638258 0.569980
3 -1.017427 1.395300
4 0.666853 -0.258473
[5 rows x 2 columns]
A B
0 0.589185 1.029062
1 -1.447809 -0.616584
2 -0.506545 0.432412
3 -1.168424 0.312796
4 1.390517 1.074129
[5 rows x 2 columns]
In [87]:
(df+df2)/2
Out[87]:
A B
0 -0.071811 0.228560
1 -1.497093 -1.101477
2 0.065857 0.501196
3 -1.092925 0.854048
4 1.028685 0.407828
[5 rows x 2 columns]
To answer your side question: this is nothing to do with pandas and more to do with the dict constructor:
In [88]:
{'A': randn(5), 'B': randn(5), 'A': randn(5), 'B': randn(5)}
Out[88]:
{'B': array([-0.03087831, -0.24416885, -2.29924624, 0.68849978, 0.41938536]),
'A': array([ 2.18471335, 0.68051101, -0.35759988, 0.54023489, 0.49029071])}
dict keys must be unique, so in the constructor the later duplicates simply reassign the values of the pre-existing keys, and the last value wins.
EDIT
If you insist on having duplicate columns then you have to create a new dataframe from this, because if you were to update the columns 'A' and 'B' in place, the mean would still be duplicated as the columns are repeated:
In [92]:
df3 = pd.concat([df,df2], axis=1)
new_df = pd.DataFrame()
new_df['A'], new_df['B'] = df3['A'].sum(axis=1)/df3['A'].shape[1], df3['B'].sum(axis=1)/df3['B'].shape[1]
new_df
Out[92]:
A B
0 -0.071811 0.228560
1 -1.497093 -1.101477
2 0.065857 0.501196
3 -1.092925 0.854048
4 1.028685 0.407828
[5 rows x 2 columns]
So the above works with df3, and in fact for an arbitrary number of repeated columns, which is why I am using shape; you could hardcode this to 2 if you knew the columns were only ever duplicated once.
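A sketch that handles any number of repeats without hardcoding the column names, building on df3 from above:
# boolean masking keeps the selection a DataFrame even when a
# name occurs only once
new_df = pd.DataFrame({col: df3.loc[:, df3.columns == col].mean(axis=1)
                       for col in df3.columns.unique()})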