Python Pandas Match Vlookup columns based on header values - python

I have the following dataframe df:
Customer_ID | 2015 | 2016 |2017 | Year_joined_mailing
ABC 5 6 10 2015
BCD 6 7 3 2016
DEF 10 4 5 2017
GHI 8 7 10 2016
I would like to look up the value of the customer in the year they joined the mailing list and save it in a new column.
Output would be:
Customer_ID | 2015 | 2016 |2017 | Year_joined_mailing | Purchases_1st_year
ABC 5 6 10 2015 5
BCD 6 7 3 2016 7
DEF 10 4 5 2017 5
GHI 8 7 10 2016 7
I have found some solutions for match vlookup in python, but none that would use the headers of other columns.

Deprecation Notice: lookup was deprecated in v1.2.0 and removed in pandas 2.0, so the code below only runs on older versions.
Use pd.DataFrame.lookup
Keep in mind that I'm assuming Customer_ID is the index.
df.lookup(df.index, df.Year_joined_mailing)
array([5, 7, 5, 7])
df.assign(
Purchases_1st_year=df.lookup(df.index, df.Year_joined_mailing)
)
2015 2016 2017 Year_joined_mailing Purchases_1st_year
Customer_ID
ABC 5 6 10 2015 5
BCD 6 7 3 2016 7
DEF 10 4 5 2017 5
GHI 8 7 10 2016 7
However, be careful: the column names may be strings while the values in Year_joined_mailing are integers (or the other way around), in which case the lookup will not find a match.
The nuclear option is to cast both sides to strings so that like is compared with like:
df.assign(
Purchases_1st_year=df.rename(columns=str).lookup(
df.index, df.Year_joined_mailing.astype(str)
)
)
2015 2016 2017 Year_joined_mailing Purchases_1st_year
Customer_ID
ABC 5 6 10 2015 5
BCD 6 7 3 2016 7
DEF 10 4 5 2017 5
GHI 8 7 10 2016 7
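On pandas 2.x, where DataFrame.lookup no longer exists, the same one-cell-per-row pick can be sketched with NumPy integer indexing. This is only a sketch, not the method from the answer above: it rebuilds the question's sample frame, assumes the year columns are integers and that every Year_joined_mailing value is one of them, and the names year_cols, row_pos and col_pos are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample frame from the question, with Customer_ID as the index (assumption).
df = pd.DataFrame(
    {
        2015: [5, 6, 10, 8],
        2016: [6, 7, 4, 7],
        2017: [10, 3, 5, 10],
        "Year_joined_mailing": [2015, 2016, 2017, 2016],
    },
    index=pd.Index(["ABC", "BCD", "DEF", "GHI"], name="Customer_ID"),
)

year_cols = [2015, 2016, 2017]  # columns that hold the yearly values

# Column position of each customer's joining year (-1 would mean "not found").
col_pos = pd.Index(year_cols).get_indexer(df["Year_joined_mailing"])
row_pos = np.arange(len(df))

# Pick exactly one cell per row: (row 0, col_pos[0]), (row 1, col_pos[1]), ...
df["Purchases_1st_year"] = df[year_cols].to_numpy()[row_pos, col_pos]
print(df["Purchases_1st_year"].tolist())  # [5, 7, 5, 7]
```

Note that get_indexer returns -1 for a year that is not a column, and -1 would silently select the last column here, so this version is only safe when all years are known to be valid.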

You can use apply on each row (axis=1) and index each row by its own Year_joined_mailing value:
df.apply(lambda x: x[x['Year_joined_mailing']],axis=1)

I would do it like this, assuming that the column headers and the Year_joined_mailing are the same data type and that all Year_joined_mailing values are valid columns. If the datatypes are not the same, you could convert it by adding str() or int() where appropriate.
df['Purchases_1st_year'] = [df[df['Year_joined_mailing'][i]][i] for i in df.index]
What we're doing here is iterating over the index of the dataframe to get the 'Year_joined_mailing' value for each row, using that value to select the column we want, and then selecting the same row from that column. The results are collected in a list and assigned to our new column 'Purchases_1st_year'.
If your 'Year_joined_mailing' column will not always be a valid column name, then try:
from numpy import nan
new_col = []
for i in df.index:
    try:
        new_col.append(df[df['Year_joined_mailing'][i]][i])
    except KeyError:  # raised when the year is not one of df's columns
        new_col.append(nan)  # or whatever null value you want here
df['Purchases_1st_year'] = new_col
This longer code snippet accomplishes the same thing, but will not break if a 'Year_joined_mailing' value is not in df.columns.
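The same fallback can also be sketched without a Python loop. The example below is a hedged illustration, not part of the answer above: it assumes integer year columns, and the customer XYZ with joining year 2020 is invented purely to exercise the missing-column case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        2015: [5, 6, 10],
        2016: [6, 7, 4],
        2017: [10, 3, 5],
        "Year_joined_mailing": [2015, 2016, 2020],  # 2020 is not a column
    },
    index=pd.Index(["ABC", "BCD", "XYZ"], name="Customer_ID"),  # XYZ is hypothetical
)

year_cols = [2015, 2016, 2017]
col_pos = pd.Index(year_cols).get_indexer(df["Year_joined_mailing"])
found = col_pos != -1  # get_indexer marks unknown labels with -1

# Start from NaN and fill only the rows whose year is a real column.
picked = np.full(len(df), np.nan)
values = df[year_cols].to_numpy()
picked[found] = values[np.arange(len(df))[found], col_pos[found]]
df["Purchases_1st_year"] = picked
```

Because NaN is a float, the new column ends up with a float dtype even when all the source values are integers.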

Related

Filter individuals that don't have data for the whole period

I am using Python 3.9 on Pycharm. I have the following dataframe:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
6 C 2020 5
7 C 2021 4
I want to keep individuals that have available data for the whole period. In other words, I would like to filter the rows such that I only keep id that have data for the three years (2019, 2020, 2021). This means excluding all observations of id C and keep all observations of id A and B:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Is it feasible in Python?
As you want to include only the ids for which all three years exist, you can group the dataframe by id, then filter on set equality between the years you want and the years available for each id:
>>> years = {2019, 2020, 2021}
>>> df.groupby('id').filter(lambda x: set(x['year'].values)==years)
# df is your dataframe
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
First, count how many distinct years exist in the year column, then use a boolean mask to filter your dataframe. For that, you need pandas.DataFrame.groupby and the groupby transform method, which counts the rows of each id and broadcasts that count back onto every row of the group.
from io import StringIO
import pandas as pd
s = """id year gdp
A 2019 3
A 2020 0
A 2021 5
B 2019 4
B 2020 2
B 2021 1
C 2020 5
C 2021 4
"""
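df = pd.read_csv(StringIO(s), sep=r'\s+')  # whitespace-separated sample
# keep the ids whose row count equals the number of distinct years
mask = df.groupby('id')['year'].transform('count').eq(df['year'].nunique())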
out = df[mask]
>>> print(out)
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
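The count-based mask above counts rows per id, so it can be fooled if an id has a duplicated year. A possible variant, sketched here under the assumption that "the whole period" means every distinct year present in the data, counts distinct years per id instead. The duplicated 2021 row for id C is made up to show the difference.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
        # id C has 2021 twice and no 2019 (the duplicate is hypothetical)
        "year": [2019, 2020, 2021, 2019, 2020, 2021, 2020, 2021, 2021],
        "gdp": [3, 0, 5, 4, 2, 1, 5, 4, 4],
    }
)

n_years = df["year"].nunique()  # number of distinct years in the data

# Distinct years per id, broadcast back onto every row of that id.
years_per_id = df.groupby("id")["year"].transform("nunique")
out = df[years_per_id.eq(n_years)]
```

Here id C has three rows but only two distinct years, so it is dropped, whereas a plain row count would have kept it.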
Here is a way using pivot and dropna to automatically drop the ids that have missing years (pivot takes keyword arguments only in recent pandas versions):
keep = df.pivot(index='id', columns='year', values='gdp').dropna().index
# ['A', 'B']
out = df[df['id'].isin(keep)]
output:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1

Python pandas: add index column based on existing columns, with duplicates sharing the same index

I would like to add an index column based on existing columns. Duplicates would share the same index.
If the values for the two columns ['old_index','year'] are the same, then the new index would be same. The value in the column 'num' does not matter.
I'm wondering if anyone can help. Thank you very much!
df['new_id'] = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1
df
index year id new_id
0 1 2000 5 1
1 2 1996 3 2
2 2 1996 3 2
3 4 1994 2 3
4 4 1999 4 4
5 4 1999 4 4
6 12 1989 1 5
7 12 1989 1 5
8 12 1985 0 6
9 12 2011 6 7
Give this a try, but let me know if it isn't fully what you are looking for.
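The answer above groups on every column. Since the question says that only 'old_index' and 'year' should decide the new index and that 'num' does not matter, a variant that groups on just those two columns might look like the sketch below. The column names follow the question text, and the 'num' values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "old_index": [1, 2, 2, 4, 4, 4, 12, 12, 12, 12],
        "year": [2000, 1996, 1996, 1994, 1999, 1999, 1989, 1989, 1985, 2011],
        "num": [7, 3, 9, 1, 4, 8, 2, 6, 5, 0],  # made-up values, ignored below
    }
)

# sort=False numbers the groups in order of first appearance;
# ngroup() is zero-based, hence the + 1.
df["new_id"] = df.groupby(["old_index", "year"], sort=False).ngroup() + 1
print(df["new_id"].tolist())  # [1, 2, 2, 3, 4, 4, 5, 5, 6, 7]
```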

Grabbing data from previous year in a Pandas DataFrame

I've got this df:
d={'year':[2019,2018,2017],'B':[10,5,17]}
df=pd.DataFrame(data=d)
print(df)
year B
0 2019 10
1 2018 5
2 2017 17
I want to create a column "B_previous_year" that grabs the B value from the previous year, so that it looks like this:
year B B_previous_year
0 2019 10 5
1 2018 5 17
2 2017 17 NaN
I'm trying this:
df['B_previous_year']=df.B.loc[df.year == (df.year - 1)]
However, my B_previous_year column ends up full of NaN:
year B B_previous_year
0 2019 10 NaN
1 2018 5 NaN
2 2017 17 NaN
How could I do that?
If you want to keep the integer dtype (this relies on the rows being sorted by year in descending order, with no missing years):
df = df.convert_dtypes()
df['New'] = df.B.shift(-1)
df
Output:
year B New
0 2019 10 5
1 2018 5 17
2 2017 17 <NA>
You might want to sort the dataframe by year first, then verify that the difference from one row to the next is, indeed, one year:
df = df.sort_values(by='year')
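# take the previous row's B, but only where the previous row is exactly one year earlier
df['B_previous_year'] = df['B'].shift().where(df['year'].diff() == 1)
year B B_previous_year
2 2017 17 NaN
1 2018 5 17.0
0 2019 10 5.0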
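An alternative that does not depend on row order at all is to look up year - 1 directly. This is a minimal sketch, assuming each year appears only once in the frame; the name b_by_year is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"year": [2019, 2018, 2017], "B": [10, 5, 17]})

# Series of B values keyed by year (requires unique years).
b_by_year = df.set_index("year")["B"]

# For each row, fetch the B value stored under the previous year.
# Years with no predecessor in the data map to NaN.
df["B_previous_year"] = (df["year"] - 1).map(b_by_year)
```

Because the lookup is by value rather than by position, gaps in the years produce NaN instead of picking up the wrong neighbour.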

Split int64 Pandas column in two

I've been given a dataset that has dates as an integer using the format 52019 for May 2019. I've put it into a Pandas DataFrame, and I need to extract that date format into a month column and year column, but I can't figure out how to do that for an int64 datatype or how to handle it for the two digit months. So I want to take something like
ID Date
1 22019
2 32019
3 52019
5 102019
and make it become
ID Month Year
1 2 2019
2 3 2019
3 5 2019
5 10 2019
What should I do?
divmod
df['Month'], df['Year'] = np.divmod(df.Date, 10000)
df
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019
Without mutating original dataframe using assign
df.assign(**dict(zip(['Month', 'Year'], np.divmod(df.Date, 10000))))
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019
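divmod works here because the year always occupies the last four digits: integer division by 10000 strips them off and leaves the month, and the remainder is the year. The sketch below shows the arithmetic on a single integer and then on the question's sample data. The month range check at the end is an addition for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Scalar case: 102019 -> month 10, year 2019
month, year = divmod(102019, 10000)
assert (month, year) == (10, 2019)

df = pd.DataFrame({"ID": [1, 2, 3, 5], "Date": [22019, 32019, 52019, 102019]})

# Vectorised case: quotient is the month, remainder is the year.
months, years = np.divmod(df["Date"].to_numpy(), 10000)
df["Month"], df["Year"] = months, years

# Sanity check: flag rows whose month is outside 1..12.
valid = df["Month"].between(1, 12)
```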
Using // and %
df['Month'], df['Year'] = df.Date//10000,df.Date%10000
df
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019
Use:
s=pd.to_datetime(df.pop('Date'),format='%m%Y') #convert to datetime and pop deletes the col
df['Month'],df['Year']=s.dt.month,s.dt.year #extract month and year
print(df)
ID Month Year
0 1 2 2019
1 2 3 2019
2 3 5 2019
3 5 10 2019
str.extract can handle the tricky part of figuring out whether the Month has 1 or 2 digits.
(df['Date'].astype(str)
.str.extract(r'^(?P<Month>\d{1,2})(?P<Year>\d{4})$')
.astype(int))
Month Year
0 2 2019
1 3 2019
2 5 2019
3 10 2019
You may also use string slicing if it's guaranteed your numbers have only 5 or 6 digits (if not, use str.extract above):
u = df['Date'].astype(str)
df['Month'], df['Year'] = u.str[:-4], u.str[-4:]
df
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019

Get first index from multiindex grouping by level

I have a multiindex pandas dataframe:
SHOPPING_COUNT
CLIENT YEAR MONTH
1000063 2013 12 9
2014 1 9
2 7
3 9
2015 4 6
5 5
6 9
1001327 2014 5 1
6 1
2015 2 7
3 1
4 3
1001399 2013 8 1
And I would like to know the first index entry of each client, ordered by level 0.
I mean, I would want to get:
1000063 2013 12
1001327 2014 5
1001399 2013 8
Let df be your dataframe, you can do something like:
df = df.groupby(level=0).apply(lambda x: x.iloc[0:1])
df.index = df.index.droplevel(0)
There may be a simpler way to do this, but this method works.
It's not very programmatic, but if you look at the result of:
client = 1000063
df.loc[client].index
Then the following would work:
# MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24
year = df.loc[client].index.levels[0][df.loc[client].index.codes[0][0]]
month = df.loc[client].index.levels[1][df.loc[client].index.codes[1][0]]
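Another way to get the first index entry per client for all clients at once is groupby(...).head(1), which keeps the original MultiIndex. The sketch below uses a shortened copy of the question's data and assumes the rows are already in the desired order within each client.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        (1000063, 2013, 12),
        (1000063, 2014, 1),
        (1000063, 2014, 2),
        (1001327, 2014, 5),
        (1001327, 2014, 6),
        (1001327, 2015, 2),
        (1001399, 2013, 8),
    ],
    names=["CLIENT", "YEAR", "MONTH"],
)
df = pd.DataFrame({"SHOPPING_COUNT": [9, 9, 7, 1, 1, 7, 1]}, index=index)

# head(1) keeps the first row of every client without changing the index.
first_rows = df.groupby(level="CLIENT").head(1)

# One (CLIENT, YEAR, MONTH) tuple per client.
first_index = first_rows.index.tolist()
```

If the rows might not be sorted, calling df.sort_index() first would make "first" mean the earliest year and month for each client.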
