Map a Pandas Series with duplicate keys to a DataFrame - python

Env: Python 3.9.6, Pandas 1.3.5
I have a DataFrame and a Series like below:
import pandas as pd

df = pd.DataFrame({"C1": ["A", "B", "C", "D"]})
sr = pd.Series(data=[1, 2, 3, 4, 5],
               index=["A", "A", "B", "C", "D"])
"""
[DataFrame]
C1
0 A
1 B
2 C
3 D
[Series]
A 1
A 2
B 3
C 4
D 5
"""
What I tried:
df["C2"] = df["C1"].map(sr)
But an InvalidIndexError occurred because the Series has duplicate keys ("A"):
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Is there any way to build a DataFrame like the one below?
C1 C2
0 A 1
1 A 2
2 B 3
3 C 4
4 D 5
or
C1 C2
0 A 1
1 B 3
2 C 4
3 D 5
4 A 2
Row indices do not matter.

The question was heavily edited and now has a very different meaning.
You want a simple merge:
df.merge(sr.rename('C2'),
         left_on='C1', right_index=True)
Output:
C1 C2
0 A 1
0 A 2
1 B 3
2 C 4
3 D 5
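Since row indices do not matter to you, you can also tack a reset_index onto the same merge to get a plain 0..4 range (just a small addition to the call above, not a different method):
out = df.merge(sr.rename('C2'), left_on='C1', right_index=True).reset_index(drop=True)
print(out)
C1 C2
0 A 1
1 A 2
2 B 3
3 C 4
4 D 5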
old answer
First, I can't reproduce your issue (tested with 3M rows on pandas 1.3.5).
Then, why do you use slicing and not map? map has the advantage of always outputting the correct number of rows (NaN if the key is absent):
Example:
import numpy as np
import pandas as pd

sr = pd.Series({10: "A", 13: "B", 16: "C", 18: "D"})
df = pd.DataFrame({"C1": np.random.randint(10, 20, size=3000000)})
df['C2'] = df['C1'].map(sr)
print(df.head())
output:
C1 C2
0 10 A
1 18 D
2 10 A
3 13 B
4 15 NaN

Related

How to filter a dataframe by the mean of each group using a one-liner of pandas code

I'm trying to filter my dataset so that it keeps only the rows that, for a given column, have values larger than the mean (or any other function) of that column within their group.
For instance, suppose we have the following data frame:
import pandas as pd
df = pd.DataFrame({
    "Group": ["A", "A", "A", "B", "B", "C"],
    "C1": [1, 2, 3, 2, 3, 1],
    "C2": [1, 1, 5, 1, 2, 1]
})
Group C1 C2
0 A 1 1
1 A 2 1
2 A 3 5
3 B 2 1
4 B 3 2
5 C 1 1
Now, I want to create other filtered data frames subsetting the original one based on some function. For example, let's use the mean as a baseline:
df.groupby("Group").mean()
C1 C2
Group
A 2.0 2.333333
B 2.5 1.500000
C 1.0 1.000000
Now, I want all rows whose values are greater than or equal to the group mean in column C1:
Group C1 C2
1 A 2 1
2 A 3 5
4 B 3 2
5 C 1 1
Or I want a subset such that the values are less than or equal to the group mean in column C2:
Group C1 C2
0 A 1 1
1 A 2 1
3 B 2 1
5 C 1 1
To make this easier/more compact, it would be good to have a pandas one-liner, i.e., using the typical method-chaining (.) pipeline, something like:
df.groupby("Group").filter(lambda x : x["C1"] >= x["C1"].mean())
Note that the code above doesn't work, because filter requires a function that returns True/False for each group, not a data frame of rows to be combined, as I intend.
Obviously, I can iterate using groupby, filter the group, and then concatenate the results:
new_df = None
for _, group in df.groupby("Group"):
    tmp = group[group["C1"] >= group["C1"].mean()]
    new_df = pd.concat([new_df, tmp])
(Note the >=; otherwise we end up with empty data frames messing up the concatenation.)
Same thing I can do in the other case:
new_df = None
for _, group in df.groupby("Group"):
    tmp = group[group["C2"] <= group["C2"].mean()]
    new_df = pd.concat([new_df, tmp])
But do we have a pandas-idiomatic (maybe generic, short, and probably optimized) way to do that?
Just out of curiosity, I can do this very easily in R:
r$> df <- tibble(
      Group = c("A", "A", "A", "B", "B", "C"),
      C1 = c(1, 2, 3, 2, 3, 1),
      C2 = c(1, 1, 5, 1, 2, 1)
    )
r$> df
# A tibble: 6 × 3
Group C1 C2
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 5
4 B 2 1
5 B 3 2
6 C 1 1
r$> df %>% group_by(Group) %>% filter(C1 >= mean(C1))
# A tibble: 4 × 3
# Groups: Group [3]
Group C1 C2
<chr> <dbl> <dbl>
1 A 2 1
2 A 3 5
3 B 3 2
4 C 1 1
r$> df %>% group_by(Group) %>% filter(C1 <= mean(C1))
# A tibble: 4 × 3
# Groups: Group [3]
Group C1 C2
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 B 2 1
4 C 1 1
Thanks!
IIUC, you can use groupby and transform to create a boolean series for boolean-indexing your dataframe where column C1 is greater than or equal to the group mean of C1:
df[df['C1'] >= df.groupby("Group")['C1'].transform('mean')]
Output:
Group C1 C2
1 A 2 1
2 A 3 5
4 B 3 2
5 C 1 1
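If you want something closer to the generic R filter call, the same transform idea can be wrapped in a small helper. This is only a sketch; the name group_filter and its parameters are mine, not from the question:
import operator

def group_filter(df, by, col, stat="mean", op=operator.ge):
    # broadcast the per-group statistic back to each row, then boolean-index
    return df[op(df[col], df.groupby(by)[col].transform(stat))]

# usage with the df from the question
print(group_filter(df, "Group", "C1", "mean", operator.ge))
print(group_filter(df, "Group", "C2", "mean", operator.le))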

Filter dataframe based on the presence of multiple columns in another dataframe

I am curious what is the best practice to do the following:
Let's say I have 2 dataframes:
df1:
A B C D
0 1 2 3 4
1 1 3 5 5
2 1 2 3 4
3 3 5 6 7
4 9 7 6 5
df2:
A B C
0 1 2 3
1 9 7 6
I want to filter down df1 on columns A, B, C to only show records which are present in df2's A,B,C columns.
The result I want to see:
A B C D
0 1 2 3 4
1 1 2 3 4
2 9 7 6 5
As you can see, I only need records from df1 where the combination of the first 3 columns is either 1,2,3 or 9,7,6.
What I tried is a bit overkill in my opinion:
merged_df = df1.merge(df2, how="left", on=["A", "B", "C"], indicator=True)
mask = merged_df["_merge"] == "both"
result = merged_df[mask].drop(["_merge"], axis=1)
Is there any better way to do this?
Merge with how='inner':
import pandas as pd
df1 = pd.DataFrame(
    {
        "A": [1, 1, 1, 3, 9],
        "B": [2, 3, 2, 5, 7],
        "C": [3, 5, 3, 6, 6],
        "D": [4, 5, 4, 7, 5]
    }
)
df2 = pd.DataFrame(
    {
        "A": [1, 9],
        "B": [2, 7],
        "C": [3, 6],
    }
)
#print(df1)
df = pd.merge(df1, df2, on=['A', 'B', 'C'], how='inner')
print(df)
Output:
A B C D
0 1 2 3 4
1 1 2 3 4
2 9 7 6 5
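If you'd rather skip the merge and keep df1's original index and row order, a hedged alternative is to compare the three key columns as a MultiIndex and use isin (this is a sketch, not part of the answer above):
keys1 = pd.MultiIndex.from_frame(df1[["A", "B", "C"]])
keys2 = pd.MultiIndex.from_frame(df2[["A", "B", "C"]])
result = df1[keys1.isin(keys2)]
print(result)
Here the surviving rows keep their original labels 0, 2 and 4 instead of being renumbered.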

String Formatting using many pandas columns to create a new one

I would like to create a new column in a pandas DataFrame, just like I would do using Python f-strings or the format function.
Here is an example:
df = pd.DataFrame({"str": ["a", "b", "c", "d", "e"],
"int": [1, 2, 3, 4, 5]})
print(df)
str int
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
I would like to obtain:
str int concat
0 a 1 a-01
1 b 2 b-02
2 c 3 c-03
3 d 4 d-04
4 e 5 e-05
So something like:
concat = f"{str}-{int:02d}"
but applied directly to the elements of the pandas columns. I imagine the solution involves pandas map, apply, or agg, but nothing I tried was successful.
Many thanks for your help.
Use a list comprehension with f-strings:
df['concat'] = [f"{a}-{b:02d}" for a, b in zip(df['str'], df['int'])]
Or it is possible to use apply:
df['concat'] = df.apply(lambda x: f"{x['str']}-{x['int']:02d}", axis=1)
Or the solution from the comments, with Series.str.zfill:
df["concat"] = df["str"] + "-" + df["int"].astype(str).str.zfill(2)
print (df)
str int concat
0 a 1 a-01
1 b 2 b-02
2 c 3 c-03
3 d 4 d-04
4 e 5 e-05
You could use a list comprehension to build the concat column:
import pandas as pd
df = pd.DataFrame({"str": ["a", "b", "c", "d", "e"],
"int": [1, 2, 3, 4, 5]})
df['concat'] = [f"{s}-{i:02d}" for s, i in df[['str', 'int']].values]
print(df)
Output
str int concat
0 a 1 a-01
1 b 2 b-02
2 c 3 c-03
3 d 4 d-04
4 e 5 e-05
I also just discovered that array-style indexing works on DataFrame columns:
df["concat"] = df.apply(lambda x: f"{x[0]}-{x[1]:02d}", axis=1)
print(df)
str int concat
0 a 1 a-01
1 b 2 b-02
2 c 3 c-03
3 d 4 d-04
4 e 5 e-05
looks very sleek
You can use pandas' string concatenation method Series.str.cat (zero-padding the integer so it also works for two-digit values):
df['concat'] = df['str'].str.cat(df['int'].astype(str).str.zfill(2), sep='-')
str int concat
0 a 1 a-01
1 b 2 b-02
2 c 3 c-03
3 d 4 d-04
4 e 5 e-05

Shuffle rows by a column in pandas

I have the following example DataFrame:
c1 c2
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
Given a template c1 = [3, 2, 5, 4, 1], I want to change the order of the rows based on the new order of column c1, so it will look like:
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
I found the following thread, but the shuffle there is random. Correct me if I'm wrong.
Shuffle DataFrame rows
If the values are unique both in the list and in the c1 column, use reindex:
df = df.set_index('c1').reindex(c1).reset_index()
print (df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
General solution that works with duplicates both in the list and in the column:
c1 = [3, 2, 5, 4, 1, 3, 2, 3]
#create df from list
list_df = pd.DataFrame({'c1':c1})
print (list_df)
c1
0 3
1 2
2 5
3 4
4 1
5 3
6 2
7 3
# helper column to count duplicate values
df['g'] = df.groupby('c1').cumcount()
list_df['g'] = list_df.groupby('c1').cumcount()
# merge together and remove the helper column g
df = list_df.merge(df).drop('g', axis=1)
print (df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
5 3 c
merge
You can create a dataframe with the column specified in the wanted order then merge.
One advantage of this approach is that it gracefully handles duplicates in either df.c1 or the list c1. If duplicates are not wanted, then care must be taken to handle them prior to reordering.
d1 = pd.DataFrame({'c1': c1})
d1.merge(df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
searchsorted
This is less robust, but will work if df.c1 is:
already sorted
a one-to-one mapping
df.iloc[df.c1.searchsorted(c1)]
c1 c2
2 3 c
1 2 b
4 5 e
3 4 d
0 1 a
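sort_values with a key
On pandas >= 1.1 another option (a sketch, not part of the answers above) is to sort by the position of each value in the template; like reindex, it assumes the values are unique, and like searchsorted it keeps the original row labels:
order = {v: i for i, v in enumerate(c1)}
df.sort_values('c1', key=lambda s: s.map(order))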

How can I add a column to a pandas DataFrame that uniquely identifies grouped data? [duplicate]

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'B': ['a', 'a', 'b', 'a', 'a', 'a'],
                   })
df
A B
0 A a
1 A a
2 A b
3 B a
4 B a
5 B a
I'd like to create column 'C', which numbers the rows within each group in columns A and B like this:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
I've tried this so far:
df['C']=df.groupby(['A','B'])['B'].transform('rank')
...but it doesn't work!
Use groupby/cumcount:
In [25]: df['C'] = df.groupby(['A','B']).cumcount()+1; df
Out[25]:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
Use the groupby.rank function.
Here is a working example.
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
C1 C2
a 1
a 2
a 3
b 4
b 5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
C1 C2 RANK
a 1 1
a 2 2
a 3 3
b 4 1
b 5 2
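If what you actually need is one id per (A, B) group rather than a running counter within each group (the question title can be read either way), a hedged aside: with the original df (columns A and B) from the question, groupby.ngroup() gives exactly that.
df['group_id'] = df.groupby(['A', 'B']).ngroup() + 1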
