Subset-selection using Pandas "isin"-syntax - python

I have a question regarding a table (Table A), which contains multiple rows keyed by three columns plus some "value" columns, as shown below:
ID TIME1 TIME2 VALUE_A VALUE_B
1 201501 201501 a 1a
1 201502 201502 a 1c
1 201502 201502 b 1d
1 201501 201501 b 2e
1 201501 201501 b 6a
1 201501 201501 b 1d
1 201502 201502 b 2e
1 201502 201502 b 6a
From another table I have built a set of unique key values that reference the rows I want to extract from Table A. This table (Table B) looks like this:
ID TIME1 TIME2
1 201502 201502
2 201511 201511
I have managed to extract the values I want with a simple merge, which returns the desired rows from Table A given the references. However, I would like to achieve the same thing with the "isin" function. My syntax is below, but it gives me duplicate values. All I want is to extract the rows from Table A that are referenced in Table B. How can I make it do that?
The desired result (Table C) is below:
ID TIME1 TIME2 VALUE_A VALUE_B
1 201502 201502 a 1c
1 201502 201502 b 1d
1 201502 201502 b 2e
1 201502 201502 b 6a
Syntax ("isin" version):
subset = df[df.ID.isin(df2['ID']) & (df.TIME1.isin(df2['TIME1']) & df.TIME2.isin(df2['TIME2']))]
The code for creating Table A and Table B is below:
from pandas import DataFrame

df = DataFrame({'ID': [1, 1, 1, 1, 1, 1, 1, 1],
                'TIME1': [201501, 201502, 201502, 201501, 201501, 201501, 201502, 201502],
                'TIME2': [201501, 201502, 201502, 201501, 201501, 201501, 201502, 201502],
                'VALUE_A': ['a', 'a', 'b', 'b', 'b', 'b', 'b', 'b'],
                'VALUE_B': ['1a', '1c', '1d', '2e', '6a', '1d', '2e', '6a']})
df2 = DataFrame({'ID': [1, 2],
                 'TIME1': [201502, 201511],
                 'TIME2': [201502, 201511]})
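For reference, the merge mentioned above can be a single line (a minimal sketch, assuming an inner join on all three key columns is what was used):
subset = df.merge(df2, on=['ID', 'TIME1', 'TIME2'])
This keeps exactly the rows of Table A whose (ID, TIME1, TIME2) triple appears in Table B, which reproduces Table C.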
Many thanks in advance!

I believe you want to modify your boolean condition to this:
In [146]:
subset = df[df.ID.isin(df2['ID']) & (df.TIME1.isin(df2['TIME1']) | df.TIME2.isin(df2['TIME2'])) ]
subset
Out[146]:
   ID   TIME1   TIME2 VALUE_A VALUE_B
1   1  201502  201502       a      1c
2   1  201502  201502       b      1d
6   1  201502  201502       b      2e
7   1  201502  201502       b      6a
So this checks that the ID is present and that either Time1 or Time2 is in the other df.

You can simply achieve this using isin():
In [102]:
df[df.TIME1.isin(df2.TIME1) & df.TIME2.isin(df2.TIME2)]
Out[102]:
   ID   TIME1   TIME2 VALUE_A VALUE_B
1   1  201502  201502       a      1c
2   1  201502  201502       b      1d
6   1  201502  201502       b      2e
7   1  201502  201502       b      6a
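A caveat worth noting (my addition, not from the answers above): column-wise isin tests each column independently, so a row whose TIME1 matches one reference row while its ID matches a different reference row would still slip through. If exact key-tuple matching against Table B is required, one way is to compare MultiIndexes built from the key columns (a sketch, assuming pandas >= 0.24 for MultiIndex.from_frame):
import pandas as pd

keys = ['ID', 'TIME1', 'TIME2']
# True where a row's (ID, TIME1, TIME2) tuple appears as a tuple in df2
mask = pd.MultiIndex.from_frame(df[keys]).isin(pd.MultiIndex.from_frame(df2[keys]))
subset = df[mask]
This is equivalent to the inner merge shown in the question, but keeps df's original index.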

Related

Merge Overlapping Intervals on a column in python

My DataFrame is:
import pandas as pd

df1 = {'ID': [1, 1, 3], 'aa': [52, 52, 55], 'ab': [8285, 2490, 1000], 'type': ['A', 'B', 'C']}
df1 = pd.DataFrame(data=df1)
df1
ID aa ab type
0 1 52 8285 A
1 1 52 2490 B
2 3 55 1000 C
I want to merge overlapping intervals on the column "type" for each ID; that is, for each ID, rows whose [aa, ab] intervals overlap should have their "type" values joined.
Desired dataframe :
ID aa ab type
0 1 52 8285 A,B
1 1 52 2490 B,A
2 3 55 1000 C
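A minimal sketch of one possible approach, assuming "overlapping" means the [aa, ab] ranges of two rows intersect within the same ID group (the helper merge_types is my own, not from the original post):
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 1, 3], 'aa': [52, 52, 55],
                    'ab': [8285, 2490, 1000], 'type': ['A', 'B', 'C']})

def merge_types(group):
    # For each row, list its own type first, then the types of every other
    # row in the same ID group whose [aa, ab] interval intersects its own.
    merged = []
    for _, row in group.iterrows():
        hits = group[(group['aa'] <= row['ab']) & (group['ab'] >= row['aa'])]
        others = [t for t in hits['type'] if t != row['type']]
        merged.append(','.join([row['type']] + others))
    return pd.Series(merged, index=group.index)

df1['type'] = df1.groupby('ID', group_keys=False).apply(merge_types)
On the sample data this yields "A,B", "B,A" and "C", matching the desired output.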

Python Data frame: If Column Name is contained in the String Row of Another Column Then 1 Otherwise 0

Column A          2C  GAD  D2  6F  ABCDE
2C 1B D2 6F ABC    1    0   1   1      0
2C 1248 Bulers     1    0   0   0      0
Above is the dataframe I want to create.
The first row represents the field names. The logic I want to employ is as follows:
If the column name is in the "Column A" row, then 1 otherwise 0
I have scoured Google looking for code that answers a question similar to mine, so I could test it out and reverse-engineer a solution. Unfortunately, I have not been able to find anything. Otherwise I would post some code I had attempted, but I literally have no clue.
You can use a list comprehension to create the desired data based on the columns and rows:
In [39]: row =['2C 1B D2 6F ABC', '2C 1248 Bulers']
In [40]: columns=['2C', 'GAD', 'D2', '6F', 'ABCDE']
In [41]: df = pd.DataFrame([[int(k in r) for k in columns] for r in row],
                           index=row, columns=columns)
In [42]: df
Out[42]:
                 2C  GAD  D2  6F  ABCDE
2C 1B D2 6F ABC   1    0   1   1      0
2C 1248 Bulers    1    0   0   0      0
If you want a pure Pandas approach, you can build the rows and columns as pd.Series objects instead of lists, then use Series.apply and Series.str.contains to get the desired result:
In [71]: row = pd.Series(['2C 1B D2 6F ABC', '2C 1248 Bulers'])
In [72]: columns = pd.Series(['2C', 'GAD', 'D2', '6F', 'ABCDE'])
In [73]: data = columns.apply(row.str.contains).astype(int).transpose()
In [74]: df = pd.DataFrame(data.values, index=['2C 1B D2 6F ABC', '2C 1248 Bulers'],
                           columns=['2C', 'GAD', 'D2', '6F', 'ABCDE'])
In [75]: df
Out[75]:
                 2C  GAD  D2  6F  ABCDE
2C 1B D2 6F ABC   1    0   1   1      0
2C 1248 Bulers    1    0   0   0      0
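One caveat (my note, not part of the original answers): Series.str.contains treats its pattern as a regular expression by default, so a column name containing regex metacharacters would misbehave; passing regex=False forces plain substring matching. A minimal sketch:
import pandas as pd

row = pd.Series(['2C 1B D2 6F ABC', '2C 1248 Bulers'])
cols = ['2C', 'GAD', 'D2', '6F', 'ABCDE']

# one 0/1 column per candidate name; regex=False does literal substring matching
flags = pd.concat([row.str.contains(c, regex=False).astype(int).rename(c) for c in cols],
                  axis=1)
flags.index = row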

How to order rows within groups in the big dataframe

I have a relatively big dataframe (1.5 Gb), and I want to group rows by ID and order rows by column VAL in ascending order within each group.
df =
ID VAL COL
1A 2 BB
1A 1 AA
2B 2 CC
3C 3 SS
3C 1 YY
3C 2 XX
This is the expected result:
df =
ID VAL COL
1A 1 AA
1A 2 BB
2B 2 CC
3C 1 YY
3C 2 XX
3C 3 SS
This is what I tried, but it runs for a very long time. Is there a faster solution?
df = df.groupby("ID").apply(pd.DataFrame.sort, 'VAL')
If you have a big df and speed is important, try a little numpy
# note order of VAL first, then ID is intentional
# np.lexsort sorts by right most column first
df.iloc[np.lexsort((df.VAL.values, df.ID.values))]
ID VAL COL
1 1A 1 AA
0 1A 2 BB
2 2B 2 CC
4 3C 1 YY
5 3C 2 XX
3 3C 3 SS
super charged
v = df.values
i, j = np.searchsorted(df.columns.values, ['VAL', 'ID'])
s = np.lexsort((v[:, i], v[:, j]))
pd.DataFrame(v[s], df.index[s], df.columns)
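One caveat (my note, not from the answer): np.searchsorted assumes its first argument is sorted, which df.columns generally is not; it happens to work for these column names. A safer way to locate the column positions:
i, j = df.columns.get_indexer(['VAL', 'ID'])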
sort_values on ['ID', 'VAL'] should give you:
In [39]: df.sort_values(by=['ID', 'VAL'])
Out[39]:
ID VAL COL
1 1A 1 AA
0 1A 2 BB
2 2B 2 CC
4 3C 1 YY
5 3C 2 XX
3 3C 3 SS
Time it for your use-case
In [89]: dff.shape
Out[89]: (12000, 3)
In [90]: %timeit dff.sort_values(by=['ID', 'VAL'])
100 loops, best of 3: 2.62 ms per loop
In [91]: %timeit dff.iloc[np.lexsort((dff.VAL.values, dff.ID.values))]
100 loops, best of 3: 8.8 ms per loop
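If the original row labels are not needed after sorting, a clean 0..n-1 index can be restored by chaining reset_index (standard pandas; my addition):
df.sort_values(by=['ID', 'VAL']).reset_index(drop=True)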

How to filter pandas dataframe on multiple columns based on a dictionary?

I have 3 dictionaries:
A, B, C
and a pandas dataframe with these columns:
['id', 't1', 't2', 't3', 't4']
Now all I want to do is keep only those rows whose t1 is present in dict A, t2 in dict B and t3 in dict C
I tried dataframe['t1'] in A, but this gives an error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
You can try something like this.
df.loc[(df['t1'].isin(A.keys()) & df['t2'].isin(B.keys()) & df['t3'].isin(C.keys()))]
I hope this is what you want.
In [51]: df
Out[51]:
t1 t2 t3 t4 max_value
0 1 4 5 2 5
1 34 70 1 5 70
2 43 89 4 11 89
3 22 76 4 3 76
In [52]: A = {34: 4}
In [53]: B = {70: 5, 89: 3}
In [54]: C = {1: 3, 5:1}
In [55]: df.loc[(df['t1'].isin(A.keys()) & df['t2'].isin(B.keys()) & df['t3'].isin(C.keys()))]
Out[55]:
t1 t2 t3 t4 max_value
1 34 70 1 5 70
To answer @EdChum's comment: I have assumed the OP wants to check for the presence of values among the dictionary keys.
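If matching against the dictionary values rather than the keys were intended, the same pattern applies (a sketch, my addition):
df.loc[df['t1'].isin(A.values()) & df['t2'].isin(B.values()) & df['t3'].isin(C.values())]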

Flatten DataFrame with multi-index columns

I'd like to convert a Pandas DataFrame that is derived from a pivot table into a row representation as shown below.
This is where I'm at:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'goods': ['a', 'a', 'b', 'b', 'b'],
'stock': [5, 10, 30, 40, 10],
'category': ['c1', 'c2', 'c1', 'c2', 'c1'],
'date': pd.to_datetime(['2014-01-01', '2014-02-01', '2014-01-06', '2014-02-09', '2014-03-09'])
})
# we don't care about year in this example
df['month'] = df['date'].map(lambda x: x.month)
piv = df.pivot_table(["stock"], "month", ["goods", "category"], aggfunc="sum")
piv = piv.reindex(np.arange(piv.index[0], piv.index[-1] + 1))
piv = piv.ffill(axis=0)
piv = piv.fillna(0)
print(piv)
which results in
         stock
goods        a        b
category    c1  c2   c1  c2
month
1            5   0   30   0
2            5  10   30  40
3            5  10   10  40
And this is where I want to get to.
goods category month stock
a c1 1 5
a c1 2 0
a c1 3 0
a c2 1 0
a c2 2 10
a c2 3 0
b c1 1 30
b c1 2 0
b c1 3 10
b c2 1 0
b c2 2 40
b c2 3 0
Previously, I used
piv = piv.stack()
piv = piv.reset_index()
print(piv)
to get rid of the MultiIndex, but because I now pivot on two columns (["goods", "category"]), this results in:
month category stock
goods a b
0 1 c1 5 30
1 1 c2 0 0
2 2 c1 5 30
3 2 c2 10 40
4 3 c1 5 10
5 3 c2 10 40
Does anyone know how I can get rid of the multi-index in the column and get the result into a DataFrame of the exemplified format?
>>> piv.unstack().reset_index().drop('level_0', axis=1)
goods category month 0
0 a c1 1 5
1 a c1 2 5
2 a c1 3 5
3 a c2 1 0
4 a c2 2 10
5 a c2 3 10
6 b c1 1 30
7 b c1 2 30
8 b c1 3 10
9 b c2 1 0
10 b c2 2 40
11 b c2 3 40
then all you need to do is rename the last column from 0 to stock.
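Chaining the rename on directly (rename(columns=...) is standard pandas):
piv.unstack().reset_index().drop('level_0', axis=1).rename(columns={0: 'stock'})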
It seems to me that melt (aka unpivot) is very close to what you want to do:
In [11]: pd.melt(piv)
Out[11]:
NaN goods category value
0 stock a c1 5
1 stock a c1 5
2 stock a c1 5
3 stock a c2 0
4 stock a c2 10
5 stock a c2 10
6 stock b c1 30
7 stock b c1 30
8 stock b c1 10
9 stock b c2 0
10 stock b c2 40
11 stock b c2 40
There's a rogue column (stock) that appears here because that column header is constant in piv. If we drop it first, the melt works out of the box:
In [12]: piv.columns = piv.columns.droplevel(0)
In [13]: pd.melt(piv)
Out[13]:
goods category value
0 a c1 5
1 a c1 5
2 a c1 5
3 a c2 0
4 a c2 10
5 a c2 10
6 b c1 30
7 b c1 30
8 b c1 10
9 b c2 0
10 b c2 40
11 b c2 40
Edit: the above actually drops the index; you need to make it a column with reset_index:
In [21]: pd.melt(piv.reset_index(), id_vars=['month'], value_name='stock')
Out[21]:
month goods category stock
0 1 a c1 5
1 2 a c1 5
2 3 a c1 5
3 1 a c2 0
4 2 a c2 10
5 3 a c2 10
6 1 b c1 30
7 2 b c1 30
8 3 b c1 10
9 1 b c2 0
10 2 b c2 40
11 3 b c2 40
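For completeness, a stack-based alternative (my sketch, not from the original answers): since goods and category are named column levels, they can be stacked directly, leaving the remaining stock level as the single value column; a sort then matches the desired ordering:
piv.stack(['goods', 'category']).reset_index().sort_values(['goods', 'category', 'month'])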
I know that the question has already been answered, but for my dataset's MultiIndex-column problem, the provided solution was inefficient. So here I am posting another solution for unpivoting MultiIndex columns using pandas.
Here is the problem I had: the dataframe was composed of a 3-level row MultiIndex and two levels of MultiIndex columns, and I wanted to unpivot it into a flat row format. When I tried the options given above, the pd.melt function did not allow more than one column in the var_name attribute; therefore, every time I tried a melt, I would end up losing some attribute from my table.
The solution I found was to apply a double stacking function over my dataframe.
Before the code, it is worth noting that the desired var_name for my unpivoted table was "Populacao residente em domicilios particulares ocupados" (see the code below); all the value entries should be stacked into this newly created column.
Here is a code snippet:
import pandas as pd
# reading my table
df = pd.read_excel(r'my_table.xls', sep=',', header=[2, 3], encoding='latin3',
                   index_col=[0, 1, 2], na_values=['-', ' ', '*'],
                   squeeze=True).fillna(0)
df.index.names = ['COD_MUNIC_7', 'NOME_MUN', 'TIPO']
df.columns.names = ['sexo', 'faixa_etaria']
df.head()
# making the stacking:
df = pd.DataFrame(pd.Series(df.stack(level=0).stack(),
                            name='Populacao residente em domicilios particulares ocupados')).reset_index()
df.head()
Another solution I found was to first apply a stacking function over the dataframe and then apply the melt.
Here is an alternative code:
df = df.stack('faixa_etaria').reset_index().melt(id_vars=['COD_MUNIC_7', 'NOME_MUN','TIPO', 'faixa_etaria'],
value_vars=['Homens', 'Mulheres'],
value_name='Populacao residente em domicilios particulares ocupados',
var_name='sexo')
df.head()
Sincerely yours,
Philipe Riskalla Leal
