Add two columns of different dataframes taking into account missing values - python

How can I add columns of two dataframes (A + B), so that the result (C) takes into account missing values ('---')?
DataFrame A
a = pd.DataFrame({'A': [1, 2, 3, '---', 5]})
A
0 1
1 2
2 3
3 ---
4 5
DataFrame B
b = pd.DataFrame({'B': [3, 4, 5, 6, '---']})
B
0 3
1 4
2 5
3 6
4 ---
Desired Result of A+B
C
0 4
1 6
2 8
3 ---
4 ---

Replace the '---' with np.nan, add the columns and fillna with '---'
(a['A'].replace('---', np.nan)+b['B'].replace('---', np.nan)).fillna('---')
You can assign the result to a new dataframe or an existing one:
df = pd.DataFrame()
df.assign(C = (a['A'].replace('---', np.nan)+b['B'].replace('---', np.nan)).fillna('---'))
OR
a.assign(C = (a['A'].replace('---', np.nan)+b['B'].replace('---', np.nan)).fillna('---'))

Related

Select rows of pandas dataframe in order of a given list with repetitions and keep the original index

After looking here and here and in the documentation, I still cannot find a way to select rows from a DataFrame according to all these criteria:
Return rows in an order given from a list of values from a given column
Return repeated rows (associated with repeated values in the list)
Preserve the original indices
Ignore values of the list not present in the DataFrame
As an example, let
df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
and let
list_of_values = [3, 4, 6, 4, 3, 8]
Then I would like to get the following DataFrame:
A B
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
How can I accomplish that? Zero's answer looks promising as it is the only one I found which preserves the original index, but it does not work with repetitions. Any ideas about how to modify/generalize it?
We have to preserve the index by assigning it as a column first so we can set_index after the mering:
list_of_values = [3, 4, 6, 4, 3, 8]
df2 = pd.DataFrame({'A': list_of_values, 'order': range(len(list_of_values))})
dfn = (
df.assign(idx=df.index)
.merge(df2, on='A')
.sort_values('order')
.set_index('idx')
.drop('order', axis=1)
)
A B
idx
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
If you want to get rid of the index name (idx), use rename_axis:
dfn = dfn.rename_axis(None)
A B
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
Here's a way to do that using merge:
list_df = pd.DataFrame({"A": list_of_values, "order": range(len(list_of_values))})
pd.merge(list_df, df, on="A").sort_values("order").drop("order", axis=1)
The output is:
A B
0 3 3
2 4 5
4 6 2
3 4 5
1 3 3

Pandas: Split columns into multiple columns by two delimiters

I have data like this
ID INFO
1 A=2;B=2;C=5
2 A=3;B=4;C=1
3 A=1;B=3;C=2
I want to split the Info columns into
ID A B C
1 2 2 5
2 3 4 1
3 1 3 2
I can split columns with one delimiter by using
df['A'], df['B'], df['C'] = df['INFO'].str.split(';').str
then split again by = but this seems to not so efficient in case I have many rows and especially when there are so many field that cannot be hard-coded beforehand.
Any suggestion would be greatly welcome.
You could use named groups together with Series.str.extract. In the end concat back the 'ID'. This assumes you always have A=;B=;and C= in a line.
pd.concat([df['ID'],
df['INFO'].str.extract('A=(?P<A>\d);B=(?P<B>\d);C=(?P<C>\d)')], axis=1)
# ID A B C
#0 1 2 2 5
#1 2 3 4 1
#2 3 1 3 2
If you want a more flexible solution that can deal with cases where a single line might be 'A=1;C=2' then we can split on ';' and partition on '='. pivot in the end to get to your desired output.
### Starting Data
#ID INFO
#1 A=2;B=2;C=5
#2 A=3;B=4;C=1
#3 A=1;B=3;C=2
#4 A=1;C=2
(df.set_index('ID')['INFO']
.str.split(';', expand=True)
.stack()
.str.partition('=')
.reset_index(-1, drop=True)
.pivot(columns=0, values=2)
)
# A B C
#ID
#1 2 2 5
#2 3 4 1
#3 1 3 2
#4 1 NaN 2
Browsing a Series is much faster that iterating across the rows of a dataframe.
So I would do:
pd.DataFrame([dict([x.split('=') for x in t.split(';')]) for t in df['INFO']], index=df['ID']).reset_index()
It gives as expected:
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2
It should be faster than splitting twice dataframe columns.
values = [dict(item.split("=") for item in value.split(";")) for value in df.INFO]
df[['a', 'b', 'c']] = pd.DataFrame(values)
This will give you the desired output:
ID INFO a b c
1 a=1;b=2;c=3 1 2 3
2 a=4;b=5;c=6 4 5 6
3 a=7;b=8;c=9 7 8 9
Explanation:
The first line converts every value to a dictionary.
e.g.
x = 'a=1;b=2;c=3'
dict(item.split("=") for item in x.split(";"))
results in :
{'a': '1', 'b': '2', 'c': '3'}
DataFrame can take a list of dicts as an input and turn it into a dataframe.
Then you only need to assign the dataframe to the columns you want:
df[['a', 'b', 'c']] = pd.DataFrame(values)
Another solution is Series.str.findAll to extract values and then apply(pd.Series):
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
df = df.drop("INFO", 1)
Details:
df = pd.DataFrame([[1, "A=2;B=2;C=5"],
[2, "A=3;B=4;C=1"],
[3, "A=1;B=3;C=2"]],
columns=["ID", "INFO"])
print(df.INFO.str.findall(r'=(\d+)'))
# 0 [2, 2, 5]
# 1 [3, 4, 1]
# 2 [1, 3, 2]
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
print(df)
# ID INFO A B C
# 0 1 A=2;B=2;C=5 2 2 5
# 1 2 A=3;B=4;C=1 3 4 1
# 2 3 A=1;B=3;C=2 1 3 2
# Remove INFO column
df = df.drop("INFO", 1)
print(df)
# ID A B C
# 0 1 2 2 5
# 1 2 3 4 1
# 2 3 1 3 2
Another solution :
#split on ';'
#explode
#then split on '='
#and pivot
df_INFO = (df.INFO
.str.split(';')
.explode()
.str.split('=',expand=True)
.pivot(columns=0,values=1)
)
pd.concat([df.ID,df_INFO],axis=1)
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2

pandas.DataFrame.groupby.nunique() does not drop the groupby column/s. Is this a bug?

Although I set the parameter as_index to True, pandas.DataFrame.groupby.nunique() keeps the columns I am grouping by in the result.
The pandas version is: 0.24.1
df = pd.DataFrame(
{'a': [1, 1, 2, 3, 2],
'b': [1, 2, 3, 4, 4]}
)
df.groupby('a', as_index=True).nunique()
The output is:
# a b
# a
# 1 1 2
# 2 1 2
# 3 1 1
I expected:
# b
# a
# 1 2
# 2 2
# 3 1
As a counterexample that behaves as expected:
df.groupby('a', as_index=True).max()
results in:
# b
# a
# 1 2
# 2 4
# 3 4
If you run [print(df.to_string() + '\n') for i, df in df.groupby('a', as_index=True)], you get printed:
a b
0 1 1
1 1 2
a b
2 2 3
4 2 4
a b
3 3 4
The a column isn't set as the index for each data frame group. It is the output of the groupby which has its index set to the group indices when as_index=True (which also is the default), not the data frame groups themselves.

sum values in different rows and columns dataframe python

My Data Frame
A B C D
2 3 4 5
1 4 5 6
5 6 7 8
How do I add values of different rows and different columns
Column A Row 2 with Column B row 1
Column A Row 3 with Column B row 2
Similarly for all rows
If you only need do this with two columns (and I understand your question well), I think you can use the shift function.
Your data frame (pandas?) is something like:
d = {'A': [2, 1, 5], 'B': [3, 4, 6], 'C': [4, 5, 7], 'D':[5, 6, 8]}
df = pd.DataFrame(data=d)
So, it's possible to create a new data frame with B column shifted:
df2 = df['B'].shift(1)
which gives:
0 NaN
1 3.0
2 4.0
Name: B, dtype: float64
and then, merge this new data with the previous df and, for example, sum the values:
df = df.join(df2, rsuffix='shift')
df['out'] = df['A'] + df['Bshift']
The final output is in out column:
A B C D Bshift out
0 2 3 4 5 NaN NaN
1 1 4 5 6 3.0 4.0
2 5 6 7 8 4.0 9.0
But it's only an intuition, I'm not sure about your question!

pandas rearrange dataframe to have all values in ascending order per every column independently

The title should say it all, I want to turn this DataFrame:
A NaN 4 3
B 2 1 4
C 3 4 2
D 4 2 8
into this DataFrame:
A 2 1 2
B 3 2 3
C 4 4 4
D NaN 4 8
And I want to do it in a nice manner. The ugly solution would be to take every column and form a new DataFrame.
To test, use:
d = {'one':[None, 2, 3, 4],
'two':[4, 1, 4, 2],
'three':[3, 4, 6, 8],}
df = pd.DataFrame(d, index = list('ABCD'))
The desired sort ignores the index values, so the operation appears to be more
like a NumPy operation than a Pandas one:
import pandas as pd
d = {'one':[None, 2, 3, 4],
'two':[4, 1, 4, 2],
'three':[3, 4, 6, 8],}
df = pd.DataFrame(d, index = list('ABCD'))
# one three two
# A NaN 3 4
# B 2 4 1
# C 3 6 4
# D 4 8 2
arr = df.values
arr.sort(axis=0)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(df)
yields
one three two
A 2 3 1
B 3 4 2
C 4 6 4
D NaN 8 4

Categories

Resources