I have two dataframes:
df1 = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df2 = {'col_1': [3, 2, 1, 3]}
I want the result to be as follows:
df3 = {'col_1': [3, 2, 1, 3], 'col_2': ['a', 'b', 'c', 'a']}
Column col_2 of the new dataframe should come from df1's col_2, looked up by the matching col_1 value.
Add the new column by mapping the values from df1 after setting its first column as index:
df3 = df2.copy()
df3['col_2'] = df2['col_1'].map(df1.set_index('col_1')['col_2'])
output:
col_1 col_2
0 3 a
1 2 b
2 1 c
3 3 a
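For completeness, here is a minimal self-contained sketch of the same approach, assuming the dicts from the question are first converted to DataFrames:

import pandas as pd

df1 = pd.DataFrame({'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'col_1': [3, 2, 1, 3]})

# map each col_1 value in df2 to the matching col_2 value from df1
df3 = df2.copy()
df3['col_2'] = df2['col_1'].map(df1.set_index('col_1')['col_2'])
print(df3)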
You can do it with merge after converting the dicts to DataFrames with pd.DataFrame():
output = pd.DataFrame(df2)
output = output.merge(pd.DataFrame(df1),on='col_1',how='left')
Or in a one-liner:
output = pd.DataFrame(df2).merge(pd.DataFrame(df1),on='col_1',how='left')
Outputs:
col_1 col_2
0 3 a
1 2 b
2 1 c
3 3 a
This could be a simple way of doing it.
# use df1 to create a lookup dictionary
lookup = df1.set_index("col_1").to_dict()["col_2"]
# look up each value from df2's "col_1" in the lookup dict
df2["col_2"] = df2["col_1"].apply(lambda d: lookup[d])
I have a DataFrame where the index consists of identical strings.
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
index=['a', 'a', 'a'], columns=['A', 'B', 'C'])
>>> df
A B C
a 0 2 3
a 0 4 1
a 10 20 30
Let's say I am trying to access the value in col 'B' at the first row. I am using something like this:
>>> df.iloc[0]['B']
2
Reading the post here, it seems .at is recommended for efficiency. Is there a better way in my example to return the value by row position and column name?
Try iat with get_indexer:
df.iat[0,df.columns.get_indexer(['B'])[0]]
Out[124]: 2
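If you only need a single column position, columns.get_loc is an equivalent (and arguably simpler) way to get the integer location for iat; this is just an alternative sketch, not from the original answer:

df.iat[0, df.columns.get_loc('B')]   # -> 2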
I have two pandas dataframes:
df1:
a b c
1 1 2
2 1 2
3 1 3
df2:
a b c
4 0 2
5 5 2
1 1 2
df1 = {'a': [1, 2, 3], 'b': [1, 1, 1], 'c': [2, 2, 3]}
df2 = {'a': [4, 5, 1], 'b': [0, 5, 1], 'c': [2, 2, 2]}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
I'm looking for a function that will display whether df1 and df2 contain the same value in column a.
In the example I provided df1.a and df2.a both have a=1.
If df1 and df2 do not have an entry where the values in column a are equal, then the function should return None or False.
How do I do this? I've tried a couple of combinations of pandas.merge.
Define your own function by using isin and any
def yourf(x, y):
    if any(x.isin(y)):
        # print(x[x.isin(y)])
        return x[x.isin(y)]
    else:
        return 'No match'  # you can change this to None
yourf(df1.a, df2.a)
Out[316]:
0 1
Name: a, dtype: int64
yourf(df1.b,df2.c)
Out[318]: 'No match'
You could use set intersection:
def col_intersect(df1, df2, col='a'):
    s1 = set(df1[col])
    s2 = set(df2[col])
    common = s1 & s2
    return common if common else None
Using merge, as you tried, you could do this:
def col_match(df1, df2, col='a'):
    merged = df1.merge(df2, how='inner', on=col)
    if len(merged):
        return merged[col]
    else:
        return None
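Called on the sample data this would give the overlapping values as a Series:

col_match(df1, df2)   # Series containing the shared value 1; None if there is no overlap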
I want to replace certain values in a dataframe containing multiple categoricals.
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
If I apply .replace on a single column, the result is as expected:
>>> df.s1.replace('a', 1)
0 1
1 b
2 c
Name: s1, dtype: object
If I apply the same operation to the whole dataframe, an error is shown (short version):
>>> df.replace('a', 1)
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
During handling of the above exception, another exception occurred:
ValueError: Wrong number of dimensions
If the dataframe contains integers as categories, the following happens:
df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
>>> df.replace(1, 3)
s1 s2
0 3 3
1 2 3
2 3 4
But,
>>> df.replace(1, 2)
ValueError: Wrong number of dimensions
What am I missing?
Without digging, that seems to be buggy to me.
My Work Around
pd.DataFrame.apply with pd.Series.replace
This has the advantage that you don't need to mess with changing any types.
df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
df.apply(pd.Series.replace, to_replace=1, value=2)
s1 s2
0 2 2
1 2 3
2 3 4
Or
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.apply(pd.Series.replace, to_replace='a', value=1)
s1 s2
0 1 1
1 b c
2 c d
#cᴏʟᴅsᴘᴇᴇᴅ's Work Around
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.applymap(str).replace('a', 1)
s1 s2
0 1 1
1 b c
2 c d
The reason for this behavior is that each column has a different set of categories:
In [224]: df.s1.cat.categories
Out[224]: Index(['a', 'b', 'c'], dtype='object')
In [225]: df.s2.cat.categories
Out[225]: Index(['a', 'c', 'd'], dtype='object')
So if you replace with a value that is already a category in both columns, it will work:
In [226]: df.replace('d','a')
Out[226]:
s1 s2
0 a a
1 b c
2 c a
As a solution you might want to make your columns categorical manually, using:
pd.Categorical(..., categories=[...])
where categories would have all possible values for all columns...
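A minimal sketch of that suggestion (my own illustration, not from the original answer): give both columns the union of all values as their categories, so that replace only ever introduces values that are already valid categories in every column:

import pandas as pd

all_cats = ['a', 'b', 'c', 'd']   # union of values across both columns
df = pd.DataFrame({
    's1': pd.Categorical(['a', 'b', 'c'], categories=all_cats),
    's2': pd.Categorical(['a', 'c', 'd'], categories=all_cats),
})
df.replace('a', 'd')   # 'd' is a known category in both columns, so this works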
I am trying to find the record with the maximum value among the first records of each group (after groupby) and delete that row from the original dataframe.
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
print(df)
t = df.groupby('item_id').first() #lost track of the index
desired_row = t[t.cost == t.cost.max()]
#delete this row from df
         cost
item_id
d           5
I need to keep track of desired_row and delete this row from df and repeat the process.
What is the best way to find and delete the desired_row?
I am not sure of a general way, but this will work in your case since you are taking the first item of each group (it would also easily work on the last). In fact, because of the general nature of split-aggregate-combine, I don't think this is easily achievable without doing it yourself.
gb = df.groupby('item_id', as_index=False)
>>> gb.groups # Index locations of each group.
{'a': [0, 1], 'b': [2, 3, 4], 'c': [5], 'd': [6]}
# Get the first index location from each group using a dictionary comprehension.
subset = {k: v[0] for k, v in gb.groups.items()}
df2 = df.iloc[list(subset.values())]
# These are the first items in each groupby.
>>> df2
cost item_id
0 1 a
5 1 c
2 1 b
6 5 d
# Exclude any items from above where the cost is equal to the max cost across the first item in each group.
>>> df[~df.index.isin(df2[df2.cost == df2.cost.max()].index)]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
Try this:
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
t = df.drop_duplicates(subset=['item_id'], keep='first')
desired_row = t[t.cost == t.cost.max()]
df[~df.index.isin([desired_row.index[0]])]
Out[186]:
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
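A slightly more direct variant of the same drop_duplicates idea (my own sketch, not part of the original answer) is to use idxmax to get the index label of the row to drop:

first_rows = df.drop_duplicates(subset=['item_id'], keep='first')
drop_label = first_rows['cost'].idxmax()   # original index label of the max-cost first row
df.drop(drop_label)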
Or, using ~isin (a negated isin):
Consider this df with a few more rows:
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5, 1, 7]})
df[~df.cost.isin(df.groupby('item_id').first().max().tolist())]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
7 1 d
8 7 d
Overview: Create a dataframe from a dictionary. Group by item_id and find the max value. Enumerate over the grouped result and use the key, which is a numeric value, to return the alpha index value. Create a result_df dataframe if you desire.
df_temp = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
grouped=df_temp.groupby(['item_id'])['cost'].max()
result_df=pd.DataFrame(columns=['item_id','cost'])
for key, value in enumerate(grouped):
    index = grouped.index[key]
    result_df = result_df.append({'item_id': index, 'cost': value}, ignore_index=True)
print(result_df.head(5))
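If all you need is result_df, the same table can be built in one step without the row-by-row append (a small sketch, not part of the original answer):

result_df = df_temp.groupby('item_id', as_index=False)['cost'].max()
print(result_df.head(5))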
In pandas, I have (app_categ_events is a dataframe):
> print(app_categ_events.label_id.unique().shape)
> print(app_categ_events.category.unique().shape)
Out:
(492,)
(458,)
I want to look at the category values that have more than one label_id each (because I thought there was supposed to be a one-to-one mapping).
In r data.table, I can do:
app_categ_events[, count_rows := .N, by = list(category, label_id)]
# (or something of that sort...)
print(app_categ_events[count_rows > 1])
What’s the best way of doing that in pandas?
We use transform to create the 'count_rows' column after grouping by 'category' and 'label_id':
app_categ_events['count_rows'] = app_categ_events.groupby(['category',
'label_id'])['label_id'].transform('count')
print(app_categ_events)
# category label_id count_rows
#0 a 1 2
#1 a 1 2
#2 b 2 1
#3 b 3 1
Now, the equivalent of the data.table code shown in the OP's post would be:
print(app_categ_events[app_categ_events.count_rows>1])
# category label_id count_rows
#0 a 1 2
#1 a 1 2
data
import pandas as pd
app_categ_events = pd.DataFrame({'category': ['a', 'a', 'b', 'b'], 'label_id': [1, 1, 2, 3]})
You can use filtration to return the desired results.
df = pd.DataFrame({'label_id': [1, 1, 2, 3],
'category': ['a', 'b', 'b', 'c']})
df.groupby(['category']).filter(lambda group: len(group) > 1)
category label_id
1 b 1
2 b 2
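Note that len(group) counts rows, so duplicate (category, label_id) pairs would also pass the filter; if only distinct label_id values should count, nunique can be used inside the filter instead (a small variation, my assumption about the intent):

df.groupby('category').filter(lambda group: group['label_id'].nunique() > 1)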
Given:
app_categ_events = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
'label_id': [1, 1, 2, 3]})
Solution:
# identify categories with more than one related label_id
cat_mask = app_categ_events.groupby('category')['label_id'].nunique().gt(1)
cats = cat_mask[cat_mask]
# filter data
app_categ_events[app_categ_events.category.isin(cats.index)]
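The same mask can also be computed in a single step with transform, which avoids the intermediate cats index (just an alternative sketch):

mask = app_categ_events.groupby('category')['label_id'].transform('nunique').gt(1)
app_categ_events[mask]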