Length mismatch while applying a logic to Dataframe


I'm trying to uppercase the alternate column names of a DataFrame that has 6 columns.
input :
df.columns[::2].str.upper()
Output :
Index(['FIRST_NAME', 'AGE_VALUE', 'MOB_#'], dtype='object')
Now I want to apply this to the DataFrame.
input : df.columns = df.columns[::2].str.upper()
ValueError: Length mismatch: Expected axis has 6 elements, new values have 3 elements

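Assigning to df.columns requires one name per column, which is why three names for six columns raises the error. You can use rename with a mapping from the old names to the new ones instead: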
df
a b c d e f
0 a b c d e f
column_names = dict(zip(df.columns[::2], df.columns[::2].str.upper()))
column_names
{'a': 'A', 'c': 'C', 'e': 'E'}
df = df.rename(columns=column_names)
df
A b C d E f
0 a b c d e f
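Another option is to build a full-length list so that the assignment to df.columns has one name per column. A minimal sketch, assuming the same single-row frame with columns a to f as in the answer (the variable names position and name are only illustrative):

```python
import pandas as pd

# Sample frame with six columns, mirroring the example in the answer.
df = pd.DataFrame([list("abcdef")], columns=list("abcdef"))

# Uppercase the names at even positions and keep the others unchanged,
# so the new list has the same length as the existing columns.
df.columns = [
    name.upper() if position % 2 == 0 else name
    for position, name in enumerate(df.columns)
]

print(list(df.columns))
```

Unlike rename, this replaces every label by position, so it also behaves predictably if two columns happen to share a name.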

Related

Group by using agg and count

I have the following group by in pandas
df = df.groupby("pred_text1").agg({
    "rouge_score": "first", "pred_text1": "first",
    "input_text": "first", "target_text": "first", "prefix": "first"})
I want another column that counts the number of rows in each group (the repetitions of each pred_text1).
I know that using transform('count') after groupby can add such a column, but I still need the agg function too.

import pandas as pd
df = pd.DataFrame({'Group': ['G1', 'G1', 'G2', 'G3', 'G2', 'G1'], 'ValueLabel': [0, 1, 1, 1, 0, 2]})
You can do this in a few steps:
First aggregate.
df = df.groupby('Group').agg({'ValueLabel': ['first', 'count']})
The new columns are a pd.MultiIndex type which we will flatten. After that we can create a mapping of the names to the labels we want and rename the columns.
df.columns = df.columns.to_flat_index()
mapper = {
    ('ValueLabel', 'first'): 'ValueFirstLabel',
    ('ValueLabel', 'count'): 'ValueCountLabel',
}
df.rename(mapper, axis=1, inplace=True)
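If the exact output names do not matter, the MultiIndex can also be flattened by joining the two levels of each label. This is only a sketch on the same sample frame; the name summary and the "_" separator are assumptions, not part of the answer:

```python
import pandas as pd

df = pd.DataFrame({
    'Group': ['G1', 'G1', 'G2', 'G3', 'G2', 'G1'],
    'ValueLabel': [0, 1, 1, 1, 0, 2],
})

# Aggregating with a list of functions yields two-level column labels.
summary = df.groupby('Group').agg({'ValueLabel': ['first', 'count']})

# Each label is a (column, function) tuple; join the parts with "_".
summary.columns = ['_'.join(label) for label in summary.columns]

print(summary)
```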

You can use pd.NamedAgg (pandas 0.25+).
Code:
import pandas as pd
# A sample dataframe
df = pd.DataFrame({
    'pred_text1': [chr(ord('A')+i%3) for i in range(10)],
    'rouge_score': [chr(ord('A')+i%5) for i in range(10)],
    'input_text': [chr(ord('A')+i%7) for i in range(10)],
    'target_text': [chr(ord('A')+i%9) for i in range(10)],
    'prefix': [chr(ord('A')+i%11) for i in range(10)],
})
# Aggregation
df = df.groupby("pred_text1").agg(
    rouge_score=pd.NamedAgg("rouge_score", "first"),
    pred_text1=pd.NamedAgg("pred_text1", "first"),
    input_text=pd.NamedAgg("input_text", "first"),
    target_text=pd.NamedAgg("target_text", "first"),
    prefix=pd.NamedAgg("prefix", "first"),
    pred_count=pd.NamedAgg("pred_text1", "count"),
)
Input:
(index)  pred_text1  rouge_score  input_text  target_text  prefix
0        A           A            A           A            A
1        B           B            B           B            B
2        C           C            C           C            C
3        A           D            D           D            D
4        B           E            E           E            E
5        C           A            F           F            F
6        A           B            G           G            G
7        B           C            A           H            H
8        C           D            B           I            I
9        A           E            C           A            J
Output:
(index)  rouge_score  pred_text1  input_text  target_text  prefix  pred_count
A        A            A           A           A            A       4
B        B            B           B           B            B       3
C        C            C           C           C            C       3
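Since the question mentions transform('count'), here is a hedged sketch of that route: add the count as a column first, then keep the first row of each group. It assumes there are no missing values (agg with 'first' skips NaN per column, whereas drop_duplicates keeps whole rows), and the name first_rows is made up for illustration:

```python
import pandas as pd

# A reduced version of the sample frame (two of the five columns).
df = pd.DataFrame({
    'pred_text1': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A'],
    'rouge_score': ['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E'],
})

# transform('count') returns one value per original row, so it can be
# assigned straight back as a new column.
df['pred_count'] = df.groupby('pred_text1')['pred_text1'].transform('count')

# Keeping the first row of each group plays the role of agg(..., 'first').
first_rows = df.drop_duplicates('pred_text1').reset_index(drop=True)

print(first_rows)
```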

Compare consecutive rows and delete based on condition

I would like to compare consecutive rows in column one and delete rows based on this condition:
If 2 or more consecutive rows are the same, keep them.
If a row differs from both the previous and the next row, delete it.
Example df:
a = [['A', 'B', 'C'], ['A', 'B', 'C'], ['B', 'B', 'C'],['C', 'B', 'C'],['C', 'B', 'C'],['C', 'B', 'C']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
Printing it gives:
  one two three
0   A   B     C
1   A   B     C
2   B   B     C
3   C   B     C
4   C   B     C
5   C   B     C
Expected output would be:
  one two three
0   A   B     C
1   A   B     C
3   C   B     C
4   C   B     C
5   C   B     C
So the row at index 2 will be deleted.
I've tried using shift but I am stuck, because the way I am doing it now also deletes the first and last row. Can someone please tell me a better way of doing this? Or maybe how to apply shift but ignore the first and last row?
# First I take only column one
df = df['one']
# Then apply shift
df.loc[df.shift(-1) == df]
With the above code I get this, which is not correct because it also deletes the first and last row:
0 A
3 C
4 C

Try shifting up and down:
mask = (df.one == df.one.shift(-1)) | (df.one == df.one.shift(1))
adj_df = df[mask]
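An alternative way to express the same rule is to label each run of equal consecutive values and keep only the runs longer than one row. This is a sketch, not the answer's method; run_id and run_length are illustrative names:

```python
import pandas as pd

rows = [['A', 'B', 'C'], ['A', 'B', 'C'], ['B', 'B', 'C'],
        ['C', 'B', 'C'], ['C', 'B', 'C'], ['C', 'B', 'C']]
df = pd.DataFrame(rows, columns=['one', 'two', 'three'])

# A new run starts wherever the value differs from the previous row;
# the cumulative sum turns those starts into a run label per row.
run_id = (df['one'] != df['one'].shift()).cumsum()

# Length of the run that each row belongs to.
run_length = df['one'].groupby(run_id).transform('size')

# Keep only rows that sit in a run of two or more equal values.
result = df[run_length > 1]

print(result)
```

The run length also makes it easy to change the threshold later, for example to keep only runs of three or more.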

You could use shift in both directions (and you need an all condition to check that all the columns are the same):
df[(df.shift(1) == df).all(axis=1) | (df.shift(-1) == df).all(axis=1)]

Change this row and previous row vectorised in pandas

I have a dataframe that encodes the previous row's value of column 'this' in column 'last'. I want to match column 'this' against the values in a list, e.g. ['b', 'c'], and on each match change the preceding row's 'this', as well as this row's 'last', to the value 'd'.
For example, I want to change this:
this  last
a     NaN
b     a
a     b
c     a
a     c
Into this:
this  last
d     NaN
b     d
d     b
c     d
a     c
This is straightforward if iterating, but too slow:
# iteritems() was removed in pandas 2.0; items() is the replacement
for i, v in df['this'].items():
    if v in ['b', 'c']:
        df['this'].iloc[i - 1] = 'd'
        df['last'].iloc[i] = 'd'
I believe this can be done by assigning df.this.shift(-1) to column 'last', however I'm not sure how to do this when I'm matching values in the list ['b', 'c']. How can I do this without iterating?

df
this last
0 a NaN
1 b a
2 a b
3 c a
4 a c
You can use isin to get a boolean index marking where the values belong to the list (l1), then set the corresponding last values to d. Next, shift the boolean index upward to set the required this values to d.
l1 = ['b', 'c']
this_in_l1 = df['this'].isin(l1)
df.loc[this_in_l1, 'last'] = 'd'
df.loc[this_in_l1.shift(-1, fill_value=False), 'this'] = 'd'
df
this last
0 d NaN
1 b d
2 d b
3 c d
4 a c
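The same idea can be written without in-place .loc assignment by using Series.mask, which replaces values where a condition is true. A minimal sketch on the sample frame from this answer; the names targets, hit and previous_of_hit are assumptions made for readability:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'this': ['a', 'b', 'a', 'c', 'a'],
    'last': [np.nan, 'a', 'b', 'a', 'c'],
})

targets = ['b', 'c']

# Compute both conditions before changing anything, so the second
# replacement is not affected by the first.
hit = df['this'].isin(targets)
previous_of_hit = hit.shift(-1, fill_value=False)

df['last'] = df['last'].mask(hit, 'd')
df['this'] = df['this'].mask(previous_of_hit, 'd')

print(df)
```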

How to combine two columns of different dataframes such that they have unique values?

I have two different dataframes and I want to get the sorted values of two columns.
Setup
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
    'id': range(7),
    'c': list('EDBBCCC')
})
df2 = pd.DataFrame({
    'id': range(8),
    'c': list('EBBCCCAA')
})
Desired Output
# notice that ABCDE appear in alphabetical order
c_first c_second
NAN A
B B
C C
D NAN
E E
What I've tried
pd.concat([df1.c.sort_values().drop_duplicates().rename('c_first'),
df2.c.sort_values().drop_duplicates().rename('c_second')
],axis=1)
How can I get the output in the required format?

Here is one possible way to achieve it:
t1 = df1.c.drop_duplicates()
t2 = df2.c.drop_duplicates()
tmp1 = pd.DataFrame({'id':t1, 'c_first':t1})
tmp2 = pd.DataFrame({'id':t2, 'c_second':t2})
result = pd.merge(tmp1,tmp2, how='outer').sort_values('id').drop('id', axis=1)
result
c_first c_second
4 NaN A
0 B B
1 C C
2 D NaN
3 E E

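The concat function has a sort argument (see https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.concat.html).
Try adding sort=True.
Note that sort orders the non-concatenation axis, which for axis=1 is the index, so the values need to be in the index for concat to line them up.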
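Because concat with axis=1 aligns on the index, one way to get matching letters on the same row is to use the unique values themselves as the index before concatenating. A hedged sketch on the question's setup; keyed_by_value is a hypothetical helper, not a pandas function:

```python
import pandas as pd

df1 = pd.DataFrame({'id': range(7), 'c': list('EDBBCCC')})
df2 = pd.DataFrame({'id': range(8), 'c': list('EBBCCCAA')})


def keyed_by_value(column, name):
    """Return the unique values of a column, indexed by themselves."""
    unique_values = column.drop_duplicates().to_numpy()
    return pd.Series(unique_values, index=unique_values, name=name)


result = (
    pd.concat(
        [keyed_by_value(df1['c'], 'c_first'),
         keyed_by_value(df2['c'], 'c_second')],
        axis=1,
    )
    .sort_index()             # alphabetical order of the letters
    .reset_index(drop=True)   # drop the helper index again
)

print(result)
```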

Merge pandas dataframe with overwrite of columns

What is the quickest way to merge two pandas data frames in this manner?
I have two data frames with similar structures (both have a primary key id and some value columns).
What I want to do is merge the two data frames based on id, with values from the second overwriting those from the first. Is there a way to do this with pandas operations? My current implementation is below:
import pandas as pd
a = pd.DataFrame({'id': [1,2,3], 'letter': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1,3,4], 'letter': ['A', 'C', 'D']})
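# Key each row dict by its 'id' value
a_dict = {e['id']: e for e in a.to_dict('records')}
b_dict = {e['id']: e for e in b.to_dict('records')}
c_dict = a_dict.copy()
# Rows from b overwrite rows from a that share an id
c_dict.update(b_dict)
c = pd.DataFrame(list(c_dict.values()))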
Here, c would be equivalent to
pd.DataFrame({'id': [1,2,3,4], 'letter':['A','b', 'C', 'D']})
id letter
0 1 A
1 2 b
2 3 C
3 4 D

combine_first
If 'id' is your primary key, then use it as your index.
b.set_index('id').combine_first(a.set_index('id')).reset_index()
id letter
0 1 A
1 2 b
2 3 C
3 4 D
merge with groupby
a.merge(b, 'outer', 'id').groupby(lambda x: x.split('_')[0], axis=1).last()
id letter
0 1 A
1 2 b
2 3 C
3 4 D

One way is as follows:
1. Append dataframe a to dataframe b.
2. Drop duplicates based on id.
3. Sort the remaining rows by id.
4. Reset the index and drop the old one.
You can try:
import pandas as pd
a = pd.DataFrame({'id': [1,2,3], 'letter': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1,3,4], 'letter': ['A', 'C', 'D']})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
c = pd.concat([b, a]).drop_duplicates(subset='id').sort_values('id').reset_index(drop=True)
print(c)

Try this
c = pd.concat([a, b], axis=0).sort_values('letter').drop_duplicates('id', keep='first').sort_values('id')
c.reset_index(drop=True, inplace=True)
print(c)
id letter
0 1 A
1 2 b
2 3 C
3 4 D
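A variant that does not depend on how the letters sort is to put the overriding frame last and keep the last duplicate of each id. This is a sketch under the assumption that b should always win over a:

```python
import pandas as pd

a = pd.DataFrame({'id': [1, 2, 3], 'letter': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1, 3, 4], 'letter': ['A', 'C', 'D']})

c = (
    pd.concat([a, b])                           # rows of b come after rows of a
    .drop_duplicates(subset='id', keep='last')  # so b wins on shared ids
    .sort_values('id')
    .reset_index(drop=True)
)

print(c)
```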
