select the first N elements of each row in a column - python

I am looking to select the first two elements of each row in column a and column b.
Here is an example
df = pd.DataFrame({'a': ['A123', 'A567','A100'], 'b': ['A156', 'A266666','A35555']})
>>> df
a b
0 A123 A156
1 A567 A266666
2 A100 A35555
desired output
>>> df
a b
0 A1 A1
1 A5 A2
2 A1 A3
I have been trying to use df.loc but have not been successful.

Use
In [905]: df.apply(lambda x: x.str[:2])
Out[905]:
a b
0 A1 A1
1 A5 A2
2 A1 A3
Or, element-wise (in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map):
In [908]: df.applymap(lambda x: x[:2])
Out[908]:
a b
0 A1 A1
1 A5 A2
2 A1 A3

In [107]: df.apply(lambda c: c.str.slice(stop=2))
Out[107]:
a b
0 A1 A1
1 A5 A2
2 A1 A3
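
The variants above can be checked side by side. This is a minimal sketch, assuming every cell is a string; the names n, by_str, by_slice and only_a are introduced here for illustration and do not come from the answers.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A123', 'A567', 'A100'],
                   'b': ['A156', 'A266666', 'A35555']})
n = 2  # number of leading characters to keep (illustrative name)

# Column-wise: the .str accessor slices every string in the Series.
by_str = df.apply(lambda col: col.str[:n])

# Same idea with an explicit slice call.
by_slice = df.apply(lambda col: col.str.slice(stop=n))

# To touch only one column, slice that column and leave the rest alone.
only_a = df.copy()
only_a['a'] = only_a['a'].str[:n]

print(by_str)
```

Slicing through .str keeps missing values as missing, whereas a plain x[:2] on a non-string cell would raise, so the .str form is probably the safer default when the data may contain NaN.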

Related

How to compute each cell as a function of index and column?

I have a use case where it is natural to compute each cell of a pd.DataFrame as a function of its index and column labels, e.g.:
import pandas as pd
import numpy as np
data = np.empty((3, 3))
data[:] = np.nan
df = pd.DataFrame(data=data, index=[1, 2, 3], columns=['a', 'b', 'c'])
print(df)
> a b c
>1 NaN NaN NaN
>2 NaN NaN NaN
>3 NaN NaN NaN
and I'd like (this is only a mock example) to get a result that is a function f(index, column):
> a b c
>1 a1 b1 c1
>2 a2 b2 c2
>3 a3 b3 c3
To accomplish this I need something other than apply or applymap, where the function receives the cell's coordinates, i.e. its index and column labels:
def my_cell_map(ix, col):
    return col + str(ix)
Here you can use numpy: add the index values to the column names with broadcasting and pass the result to the DataFrame constructor:
a = df.columns.to_numpy() + df.index.astype(str).to_numpy()[:, None]
df = pd.DataFrame(a, index=df.index, columns=df.columns)
print (df)
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
EDIT: To work column by column, use x.name (the column name) together with the index values:
def f(x):
    return x.name + x.index.astype(str)
df = df.apply(f)
print (df)
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
EDIT1: To use your own function, wrap it in another lambda that loops over the index values:
def my_cell_map(ix, col):
    return col + str(ix)
def f(x):
    return x.index.map(lambda y: my_cell_map(y, x.name))
df = df.apply(f)
print (df)
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
EDIT2: You can also loop over the index and column values and assign with loc, although this will be slow for a large DataFrame:
for c in df.columns:
    for i in df.index:
        df.loc[i, c] = my_cell_map(i, c)
print (df)
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
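
Another way to call an arbitrary f(index, column) for every cell is to build the frame directly from a dict of lists. This is only a sketch of an alternative, not one of the approaches in the answer above; the names index, columns and result are illustrative.

```python
import pandas as pd

def my_cell_map(ix, col):
    # Mock cell function from the question: column label followed by index label.
    return col + str(ix)

index = [1, 2, 3]
columns = ['a', 'b', 'c']

# One list per column, each entry computed from its (index, column) coordinates.
result = pd.DataFrame(
    {col: [my_cell_map(ix, col) for ix in index] for col in columns},
    index=index,
)
print(result)
```

This still calls the function once per cell in Python, so for large frames the broadcasting approach should remain much faster whenever the function can be expressed with array operations.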

How do I fill a column in ranges with values from the same column in Python?

I have dataframe df1:
col_1 col_2
A a1
A a2
B
A a3
C
A a4
A a5
D
A a6
A a7
B
For non-empty values from col_2, values from col_1 always have the value A. However, for empty values from col_2, values from col_1 are always different from A, but they can repeat.
And I need output like:
col_1 col_2
B a1
B a2
C a3
D a4
D a5
B a6
B a7
How can I do this? Thanks for the help.
Here is one way to do this:
mask = df.col_1 != 'A'
# Index col_1 based on the index of the mask's first true value
df.col_1 = df.col_1.iloc[[mask[i:].idxmax() for i in range(len(df))]].values
# Then simply drop the empty rows
df[~df.col_2.isnull()]
Output:
col_1 col_2
0 B a1
1 B a2
3 C a3
5 D a4
6 D a5
8 B a6
9 B a7
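
The list comprehension above rescans the mask for every row. A vectorized alternative, swapped in here and not part of the original answer, is to blank out the 'A' labels and back-fill from the next non-'A' row. The sketch assumes the empty col_2 cells are NaN; if they are empty strings they would need to be converted first.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_1': ['A', 'A', 'B', 'A', 'C', 'A', 'A', 'D', 'A', 'A', 'B'],
    'col_2': ['a1', 'a2', np.nan, 'a3', np.nan, 'a4', 'a5', np.nan,
              'a6', 'a7', np.nan],
})

mask = df['col_1'] != 'A'

# Keep only the non-'A' labels, then pull each one upward over the 'A' rows before it.
labels = df['col_1'].where(mask).bfill()

# Replace col_1 and drop the rows whose col_2 is empty.
result = df.assign(col_1=labels).dropna(subset=['col_2'])
print(result)
```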

How to compare two data frames with same columns but different number of rows?

df1=
A B C D
a1 b1 c1 1
a2 b2 c2 2
a3 b3 c3 4
df2=
A B C D
a1 b1 c1 2
a2 b2 c2 1
I want to compare the values of column 'D' in the two dataframes. If both dataframes had the same number of rows, I would just do this:
newDF = df1['D']-df2['D']
However, sometimes the number of rows differs. In that case I want a result dataframe like this:
resultDF=
A B C D_df1 D_df2 Diff
a1 b1 c1 1 2 -1
a2 b2 c2 2 1 1
EDIT: The values of column D should be compared for a row only if A, B and C are the same in df1 and df2 for that row. The same applies to every row.
Use merge and df.eval
df1.merge(df2, on=['A','B','C'], suffixes=['_df1','_df2']).eval('Diff=D_df1 - D_df2')
Out[314]:
A B C D_df1 D_df2 Diff
0 a1 b1 c1 1 2 -1
1 a2 b2 c2 2 1 1
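
For reference, here is the same merge as a self-contained script. It is a sketch that rebuilds the sample frames from the question and uses assign instead of eval to compute the difference, which should give the same column.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
                    'B': ['b1', 'b2', 'b3'],
                    'C': ['c1', 'c2', 'c3'],
                    'D': [1, 2, 4]})
df2 = pd.DataFrame({'A': ['a1', 'a2'],
                    'B': ['b1', 'b2'],
                    'C': ['c1', 'c2'],
                    'D': [2, 1]})

# The default inner join keeps only the (A, B, C) combinations present in both frames.
result = (
    df1.merge(df2, on=['A', 'B', 'C'], suffixes=('_df1', '_df2'))
       .assign(Diff=lambda d: d['D_df1'] - d['D_df2'])
)
print(result)
```

The row (a3, b3, c3) is dropped because it has no partner in df2. Passing how='left' to merge would keep it, with NaN in D_df2 and Diff.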

How to group by two column with swapped values in pandas?

I want to group by two columns where the order of the values does not matter.
For example, if columns 1 and 2 contain the values (a,b) in one row and (b,a) in another row, I want to treat these two rows as one group and aggregate them.
Input:
From To Count
a1 b1 4
b1 a1 3
a1 b2 2
b3 a1 12
a1 b3 6
Output:
From To Count(+)
a1 b1 7
a1 b2 2
b3 a1 18
I tried swapping the elements and then applying group by, but I could not get it to work. How can I solve this?
Thanks in advance.
Use numpy.sort for sorting each row:
cols = ['From','To']
df[cols] = pd.DataFrame(np.sort(df[cols], axis=1))
print (df)
From To Count
0 a1 b1 4
1 a1 b1 3
2 a1 b2 2
3 a1 b3 12
4 a1 b3 6
df1 = df.groupby(cols, as_index=False)['Count'].sum()
print (df1)
From To Count
0 a1 b1 7
1 a1 b2 2
2 a1 b3 18
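
Put together as a runnable script, the approach looks roughly like this. The sketch rebuilds the sample data from the question and assigns the sorted array directly rather than wrapping it in a new DataFrame, which avoids relying on index alignment; result is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'From': ['a1', 'b1', 'a1', 'b3', 'a1'],
                   'To': ['b1', 'a1', 'b2', 'a1', 'b3'],
                   'Count': [4, 3, 2, 12, 6]})
cols = ['From', 'To']

# Sort the two labels within each row so (a, b) and (b, a) become identical.
df[cols] = np.sort(df[cols].to_numpy(), axis=1)

# Rows that differed only by order now fall into the same group.
result = df.groupby(cols, as_index=False)['Count'].sum()
print(result)
```

Note that sorting normalizes the pair, so the group from the question's last line is reported as (a1, b3) and not as (b3, a1).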

Summing columns from different dataframe according to some column names

Suppose I have a main dataframe
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 4
1 B1 B2 B3 5
2 C1 C2 C3 6
I also have 3 dataframes
df_1
Cri1 Cri2 Cri3 value
0 A1 A2 A3 1
1 B1 B2 B3 2
df_2
Cri1 Cri2 Cri3 value
0 A1 A2 A3 9
1 C1 C2 C3 10
df_3
Cri1 Cri2 Cri3 value
0 B1 B2 B3 15
1 C1 C2 C3 17
I want to add value from each of these dataframes to total in main_df, matching rows on the Cri columns,
i.e. main_df will become
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
Of course I can do it with a for loop, but in the end I want to apply the method to a large amount of data, say 50000 rows in each dataframe.
Are there other ways to solve it?
Thank you!
First, align the name of the numeric column across the frames (the question's main_df is called df_main below). In this case:
df_main = df_main.rename(columns={'total': 'value'})
Then you have a couple of options.
concat + groupby
You can concatenate and then perform a groupby with sum:
res = pd.concat([df_main, df_1, df_2, df_3])\
    .groupby(['Cri1', 'Cri2', 'Cri3']).sum()\
    .reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
set_index + reduce / add
Alternatively, you can create a list of dataframes indexed by your criteria columns. Then use functools.reduce with pd.DataFrame.add to sum these dataframes.
from functools import reduce
dfs = [df.set_index(['Cri1', 'Cri2', 'Cri3']) for df in [df_main, df_1, df_2, df_3]]
res = reduce(lambda x, y: x.add(y, fill_value=0), dfs).reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14.0
1 B1 B2 B3 22.0
2 C1 C2 C3 33.0
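
As a self-contained check of the concat + groupby option, the sketch below rebuilds the sample frames from the question and renames the summed column back to total at the end. The names keys and res are illustrative, and the main frame is assumed to have a Cri3 column like the others.

```python
import pandas as pd

keys = ['Cri1', 'Cri2', 'Cri3']

df_main = pd.DataFrame({'Cri1': ['A1', 'B1', 'C1'],
                        'Cri2': ['A2', 'B2', 'C2'],
                        'Cri3': ['A3', 'B3', 'C3'],
                        'total': [4, 5, 6]})
df_1 = pd.DataFrame({'Cri1': ['A1', 'B1'], 'Cri2': ['A2', 'B2'],
                     'Cri3': ['A3', 'B3'], 'value': [1, 2]})
df_2 = pd.DataFrame({'Cri1': ['A1', 'C1'], 'Cri2': ['A2', 'C2'],
                     'Cri3': ['A3', 'C3'], 'value': [9, 10]})
df_3 = pd.DataFrame({'Cri1': ['B1', 'C1'], 'Cri2': ['B2', 'C2'],
                     'Cri3': ['B3', 'C3'], 'value': [15, 17]})

# Stack all frames under one numeric column name, then sum per key combination.
res = (
    pd.concat([df_main.rename(columns={'total': 'value'}), df_1, df_2, df_3])
      .groupby(keys, as_index=False)['value'].sum()
      .rename(columns={'value': 'total'})
)
print(res)
```

Unlike the reduce / add variant, this keeps integer dtype because no missing values are introduced along the way.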
