How to subset rows of df based on unique values? - python

I need to drop rows that do not meet a criterion based on nunique.
Every value in the "lot" column is associated with two values in the "shipment" column. For each "lot", the number of unique "cargotype" values may or may not differ between its shipments. I want to drop the rows of any "lot" whose two shipments have unequal counts of unique "cargotype" values. Columns col4-col6 are irrelevant to the subsetting and just need to be returned.
Code to recreate the df:
import pandas as pd

df = pd.DataFrame({"lot": ["dfg","dfg","dfg","dfg","ghj","ghj","ghj","abc","abc","abc","abc"],
                   "shipment": ["a","b","a","b","c","d","d","e","f","e","e"],
                   "cargotype": ["adam","chris","bob","tom","chris","hanna","chris","charlie","king","su","min"],
                   "col4": [777, 775, 767, 715, 772, 712, 712, 123, 122, 121, 120],
                   "col5": [13, 12, 13, 12, 14, 12, 12, 15, 16, 17, 18],
                   "col6": [4, 3, 4, 3, 5, 8, 8, 7, 7, 0, 0]})
df
lot shipment cargotype col4 col5 col6
0 dfg a adam 777 13 4
1 dfg b chris 775 12 3
2 dfg a bob 767 13 4
3 dfg b tom 715 12 3
4 ghj c chris 772 14 5
5 ghj d hanna 712 12 8
6 ghj d chris 712 12 8
7 abc e charlie 123 15 7
8 abc f king 122 16 7
9 abc e su 121 17 0
10 abc e min 120 18 0
To check uniqueness in the "cargotype" column, I use:
pd.DataFrame((df.groupby(["lot","shipment"])["cargotype"].nunique()))
lot shipment cargotype
abc e 3
f 1
dfg a 2
b 2
ghj c 1
d 2
The answer df should be:
finaldf
lot shipment cargotype col4 col5 col6
0 dfg a adam 777 13 4
1 dfg b chris 775 12 3
2 dfg a bob 767 13 4
3 dfg b tom 715 12 3
Only "dfg" lot remains because unique "cargotype" values for 2 "shipments" are equal to each other.
Thank you!

Don't ask me how, but this creates your desired outcome:
import numpy as np

def squeeze_nan(x):
    # shift each row's non-NaN values into the leftmost columns
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]
    return squeezed.reindex(original_columns, fill_value=np.nan)

# pivot the nunique counts to lot x shipment, then squeeze each lot's
# two counts into the first two columns ('a' and 'b')
df1 = df.groupby(["lot","shipment"])["cargotype"].nunique().unstack().apply(squeeze_nan, axis=1).dropna(how='all', axis=1)
lot = df1[df1['a'] == df1['b']].index
print(df[df['lot'].isin(lot)])
Caveat: not sure if this will work when a lot has more than 2 shipment values; see the sketch below for a way around that.
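A possible alternative that avoids that caveat (a sketch, not from the original answer): compute the per-shipment unique counts, then keep a lot only when all of its shipments share the same count, however many shipments it has.
counts = df.groupby(["lot", "shipment"])["cargotype"].nunique()
# a lot qualifies when its shipments' counts collapse to a single distinct value
equal_lots = counts.groupby(level="lot").nunique()
keep = equal_lots[equal_lots == 1].index
print(df[df["lot"].isin(keep)])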

Related

Python: Pandas dataframe, merge/join tables on different keys

I have 3 tables of the following form:
import pandas as pd

df1 = pd.DataFrame({'ISIN': [1, 4, 7, 10],
                    'Value1': [2012, 2014, 2013, 2014],
                    'Value2': [55, 40, 84, 31]})
df1 = df1.set_index("ISIN")

df2 = pd.DataFrame({'ISIN': [1, 4, 7, 10],
                    'Symbol': ['a', 'b', 'c', 'd']})
df2 = df2.set_index("ISIN")

df3 = pd.DataFrame({'Symbol': ['a', 'b', 'c', 'd'],
                    '01.01.2020': [1, 2, 3, 4],
                    '01.01.2021': [3, 2, 3, 2]})
df3 = df3.set_index("Symbol")
My aim now is to merge all 3 tables together. I would go about it the following way.
Step 1 (merge df1 and df2):
result1 = pd.merge(df1, df2, on=["ISIN"])
print(result1)
The result is ok and gives me the table:
Value1 Value2 Symbol
ISIN
1 2012 55 a
4 2014 40 b
7 2013 84 c
10 2014 31 d
In the next step I want to merge the result with df3, so as an intermediate step I merged df2 and df3:
result2 = pd.merge(df2, df3, on=["Symbol"])
print(result2)
My problem now, the output is:
Symbol 01.01.2020 01.01.2021
0 a 1 3
1 b 2 2
2 c 3 3
3 d 4 2
the column ISIN here is lost. And the step
result = pd.merge(result1, result2, on=["ISIN"])
result.set_index("ISIN")
produces an error.
Is there an elegant way to merge these 3 tables together (with key column ISIN), and why is the key column lost in the second merge?
Just chain the merge operations. Your second merge loses ISIN because it is df2's index rather than a column, and merging on 'Symbol' builds a fresh integer index that discards it; reset_index() turns it back into a regular column first:
result = df1.merge(df2.reset_index(), on='ISIN').merge(df3, on='Symbol')
Or using your syntax, use result1 as source for the second merge:
result1 = pd.merge(df1, df2.reset_index(), on=["ISIN"])
result2 = pd.merge(result1, df3, on=["Symbol"])
output:
ISIN Value1 Value2 Symbol 01.01.2020 01.01.2021
0 1 2012 55 a 1 3
1 4 2014 40 b 2 2
2 7 2013 84 c 3 3
3 10 2014 31 d 4 2
You should not set the index prior to joining if you wish to keep it as part of the data in your dataframe. I suggest first merging, then setting the index to your desired value. In a single line:
output = df1.merge(df2,on='ISIN').merge(df3,on='Symbol')
Outputs:
ISIN Value1 Value2 Symbol 01.01.2020 01.01.2021
0 1 2012 55 a 1 3
1 4 2014 40 b 2 2
2 7 2013 84 c 3 3
3 10 2014 31 d 4 2
You can now set the index to ISIN by adding .set_index('ISIN') to output:
Value1 Value2 Symbol 01.01.2020 01.01.2021
ISIN
1 2012 55 a 1 3
4 2014 40 b 2 2
7 2013 84 c 3 3
10 2014 31 d 4 2
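Since all three frames are already indexed, a join-based sketch (not from the answers above) also works: DataFrame.join aligns on the left frame's index by default, and on a left column when you pass on=.
# df1 and df2 share the ISIN index; df3 is then joined via the Symbol column
result = df1.join(df2).join(df3, on='Symbol')
print(result)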

Compare dataframes and only use unmatched values

I have two dataframes that I want to compare, but only want to use the values that are not in both dataframes.
Example:
DF1:
A B C
0 1 2 3
1 4 5 6
DF2:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
So, from this example I want to work with row index 2 and 3 ([7, 8, 9] and [10, 11, 12]).
The code I currently have (which only removes duplicates) is below.
df = pd.concat([di_old, di_new])
df = df.reset_index(drop=True)
df_gpby = df.groupby(list(df.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
print(df.reindex(idx))
I would do:
df_n = df2[~df2.isin(df1).all(axis=1)]
Output:
A B C
2 7 8 9
3 10 11 12
Note that isin with a DataFrame argument aligns on both index and columns, so this relies on the shared rows carrying the same index labels in both frames.
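A value-based alternative (a sketch, not part of the original answer) that does not depend on index alignment uses merge's indicator flag:
# rows of df2 with no value-for-value match in df1
diff = df2.merge(df1, how='left', indicator=True)
print(diff[diff['_merge'] == 'left_only'].drop(columns='_merge'))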

pandas drop last group element

I have a DataFrame
df = pd.DataFrame({'col1': ["a","b","c","d","e","f","g","h"], 'col2': [1,1,1,2,2,3,3,3]})
that looks like:
Input:
col1 col2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 3
6 g 3
7 h 3
I want to drop the last row of each group based on "col2", which would look like...
Expected Output:
col1 col2
0 a 1
1 b 1
3 d 2
5 f 3
6 g 3
I wrote df.groupby('col2').tail(1), which gets me what I want to delete, but when I try df.drop(df.groupby('col2').tail(1)) I get an axis error. What would be a solution to this?
Looks like duplicated would work:
df[df.duplicated('col2', keep='last') |
   (~df.duplicated('col2', keep=False))  # this is to keep all single-row groups
]
Or with your approach, you should drop the index:
# this would also drop all single-row groups
df.drop(df.groupby('col2').tail(1).index)
Output:
col1 col2
0 a 1
1 b 1
3 d 2
5 f 3
6 g 3
Try this:
df.groupby('col2', as_index=False).apply(lambda x: x.iloc[:-1, :]).reset_index(drop=True)
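Another possible approach (a sketch, not from the answers above): number each row from the end of its group with cumcount and keep everything except position 0. Like the .index approach, this drops single-row groups entirely.
# the last row of every 'col2' group gets position 0
pos_from_end = df.groupby('col2').cumcount(ascending=False)
print(df[pos_from_end != 0])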

simply put data on top of another pandas python

I have 2 sample datasets, dfa and dfb:
import pandas as pd

a = {'unit': ['A', 'B', 'C', 'D'],
     'count': [1, 12, 34, 52]}
b = {'department': ['E', 'F'],
     'count': [6, 12]}
dfa = pd.DataFrame(a)
dfb = pd.DataFrame(b)
They look like:
dfa
count unit
1 A
12 B
34 C
52 D
dfb
count department
6 E
12 F
What I want is to simply stack dfa on top of dfb, not based on any column or any index. I have checked this page: https://pandas.pydata.org/pandas-docs/stable/merging.html but couldn't find the right approach for my purpose.
My desired output is to create a dfc that looks like the dataset below; I want to keep both headers:
dfc:
count unit
1 A
12 B
34 C
52 D
count department
6 E
12 F
In [37]: pd.concat([dfa, pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns)],
ignore_index=True)
Out[37]:
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
or (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the pd.concat form above is the durable one):
In [39]: dfa.append(pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns)) \
            .reset_index(drop=True)
Out[39]:
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
UPDATE: merging 3 DFs:
pd.concat([dfa,
pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns),
pd.DataFrame(dfc.T.reset_index().T.values, columns=dfa.columns)],
ignore_index=True)
Option 1
You can construct it from scratch using np.vstack:
import numpy as np

pd.DataFrame(
    np.vstack([dfa.values, dfb.columns, dfb.values]),
    columns=dfa.columns
)
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
Option 2
You can export to csv and read it back
from io import StringIO
import pandas as pd
pd.read_csv(StringIO(
'\n'.join([d.to_csv(index=None) for d in [dfa, dfb]])
))
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
# put dfb's header into dfa as a data row, align the column names, then append
dfa.loc[len(dfa), :] = dfb.columns
dfb.columns = dfa.columns
dfa.append(dfb)
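Since DataFrame.append is gone in pandas 2.0, a concat-based sketch of the same idea (assuming fresh dfa and dfb, i.e. before the mutations above):
# one-row frame holding dfb's header, labeled with dfa's columns
header_row = pd.DataFrame([list(dfb.columns)], columns=dfa.columns)
dfc = pd.concat([dfa, header_row, dfb.set_axis(dfa.columns, axis=1)],
                ignore_index=True)
print(dfc)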

Assigning value to an observation from table of values

I have a large DataFrame of observations, i.e.:
value 1,value 2
a,1
a,1
a,2
b,3
a,3
I now have an external DataFrame of values
_ ,a,b
1 ,10,20
2 ,30,40
3 ,50,60
What would be an efficient way to add the values from the indexed table to the first DataFrame? I.e.:
value 1,value 2, new value
a,1,10
a,1,10
a,2,30
b,3,60
a,3,50
An alternative solution uses .lookup(): a one-line, vectorized solution suitable for large datasets.
import pandas as pd
import numpy as np
# generate some artificial data
# ================================
np.random.seed(0)
df1 = pd.DataFrame(dict(value1=np.random.choice('a b'.split(), 10), value2=np.random.randint(1, 10, 10)))
df2 = pd.DataFrame(dict(a=np.random.randn(10), b=np.random.randn(10)), columns=['a', 'b'], index=np.arange(1, 11))
df1
Out[178]:
value1 value2
0 a 6
1 b 3
2 b 5
3 a 8
4 b 7
5 b 9
6 b 9
7 b 2
8 b 7
9 b 8
df2
Out[179]:
a b
1 2.5452 0.0334
2 1.0808 0.6806
3 0.4843 -1.5635
4 0.5791 -0.5667
5 -0.1816 -0.2421
6 1.4102 1.5144
7 -0.3745 -0.3331
8 0.2752 0.0474
9 -0.9608 1.4627
10 0.3769 1.5350
# processing: one liner lookup function
# =======================================================
# df1.value2 is the index and df1.value1 is the column
df1['new_values'] = df2.lookup(df1.value2, df1.value1)
Out[181]:
value1 value2 new_values
0 a 6 1.4102
1 b 3 -1.5635
2 b 5 -0.2421
3 a 8 0.2752
4 b 7 -0.3331
5 b 9 1.4627
6 b 9 1.4627
7 b 2 0.6806
8 b 7 -0.3331
9 b 8 0.0474
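Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; a sketch of the documented replacement pattern using positional indexers:
# translate row/column labels to positions, then index the underlying array
rows = df2.index.get_indexer(df1['value2'])
cols = df2.columns.get_indexer(df1['value1'])
df1['new_values'] = df2.to_numpy()[rows, cols]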
Assuming your first and second dfs are df and df1 respectively, you can merge on the matching columns and then mask the 'a' and 'b' conditions:
In [9]:
df = df.merge(df1, left_on=['value 2'], right_on=['_'])
a_mask = (df['value 2'] == df['_']) & (df['value 1'] == 'a')
b_mask = (df['value 2'] == df['_']) & (df['value 1'] == 'b')
df.loc[a_mask, 'new value'] = df['a'].where(a_mask)
df.loc[b_mask, 'new value'] = df['b'].where(b_mask)
df
Out[9]:
value 1 value 2 _ a b new value
0 a 1 1 10 20 10
1 a 1 1 10 20 10
2 a 2 2 30 40 30
3 b 3 3 50 60 60
4 a 3 3 50 60 50
You can then drop the additional columns:
In [11]:
df = df.drop(['_','a','b'], axis=1)
df
Out[11]:
value 1 value 2 new value
0 a 1 10
1 a 1 10
2 a 2 30
3 b 3 60
4 a 3 50
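A possible simplification of the two-mask step above (a sketch using numpy.where, not part of the original answer): after the merge, each row needs column 'a' or 'b' depending on 'value 1', which here (with only those two lookup columns) is a single vectorized choice. Assuming df is the original, pre-merge observations frame:
import numpy as np

merged = df.merge(df1, left_on='value 2', right_on='_')
merged['new value'] = np.where(merged['value 1'] == 'a', merged['a'], merged['b'])
merged = merged.drop(['_', 'a', 'b'], axis=1)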
Another way is to define a func to perform the lookup:
In [15]:
def func(x):
    # find the lookup-table row matching this observation's 'value 2'
    row = df1[df1['_'] == x['value 2']]
    # pick the column named by 'value 1' ('a' or 'b')
    return row[x['value 1']].values[0]

df['new value'] = df.apply(func, axis=1)
df
Out[15]:
value 1 value 2 new value
0 a 1 10
1 a 1 10
2 a 2 30
3 b 3 60
4 a 3 50
EDIT
Using @Jianxun Li's lookup also works, but you have to offset the row labels because this lookup table has a 0-based index while 'value 2' starts at 1:
In [20]:
df['new value'] = df1.lookup(df['value 2'] - 1, df['value 1'])
df
Out[20]:
value 1 value 2 new value
0 a 1 10
1 a 1 10
2 a 2 30
3 b 3 60
4 a 3 50
