I have a dataframe like this-
element id year month days tmax tmin
0 MX17004 2010 1 d1 NaN NaN
1 MX17004 2010 1 d10 NaN NaN
2 MX17004 2010 1 d11 NaN NaN
3 MX17004 2010 1 d12 NaN NaN
4 MX17004 2010 1 d13 NaN NaN
where I want to further break down the days column like this
**
days
1
10
11
12
13
**
I have tried a couple of ways, but have not been successful in getting the output. Can someone please help or give me a clue?
By using str slicing:
df.days = df.days.str[1:]
df
Out[759]:
element id year month days tmax tmin
0 0 MX17004 2010 1 1 NaN NaN
1 1 MX17004 2010 1 10 NaN NaN
2 2 MX17004 2010 1 11 NaN NaN
3 3 MX17004 2010 1 12 NaN NaN
4 4 MX17004 2010 1 13 NaN NaN
Use extract with regex:
df['days'] = df.days.str.extract(r'd(\d+)', expand=False)
print(df)
Output:
element id year month days tmax tmin
0 0 MX17004 2010 1 1 NaN NaN
1 1 MX17004 2010 1 10 NaN NaN
2 2 MX17004 2010 1 11 NaN NaN
3 3 MX17004 2010 1 12 NaN NaN
4 4 MX17004 2010 1 13 NaN NaN
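Either approach leaves days as strings; if you need numeric values for sorting or arithmetic, a follow-up cast works. A minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'days': ['d1', 'd10', 'd11']})
# extract the digits after the leading 'd', then cast the strings to int
df['days'] = df['days'].str.extract(r'd(\d+)', expand=False).astype(int)
print(df['days'].tolist())  # [1, 10, 11]
```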
Related
I have pivoted the Customer ID against their year of purchase, so that I know how many times each customer purchased in different years:
Customer ID 1996 1997 ... 2019 2020
100000000000001 7 7 ... NaN NaN
100000000000002 8 8 ... NaN NaN
100000000000003 7 4 ... NaN NaN
100000000000004 NaN NaN ... 21 24
100000000000005 17 11 ... 18 NaN
My desired result is to append columns for the latest year of purchase and, from that, the number of years since the last purchase:
Customer ID 1996 1997 ... 2019 2020 Last Recency
100000000000001 7 7 ... NaN NaN 1997 23
100000000000002 8 8 ... NaN NaN 1997 23
100000000000003 7 4 ... NaN NaN 1997 23
100000000000004 NaN NaN ... 21 24 2020 0
100000000000005 17 11 ... 18 NaN 2019 1
Here is what I tried:
df_pivot["Last"] = 2020
k = 2020
while math.isnan(df_pivot2[k]):
    df_pivot["Last"] = k - 1
    k = k - 1
df_pivot["Recency"] = 2020 - df_pivot["Last"]
However what I got is "TypeError: cannot convert the series to <class 'float'>"
Could anyone help me to get the result I need?
Thanks a lot!
Dennis
You can get the last year of purchase using notna + cumsum and idxmax along axis=1, then subtract this last year of purchase from the max year to compute Recency:
c = df.filter(regex=r'\d+').columns
df['Last'] = df[c].notna().cumsum(1).idxmax(1)
df['Recency'] = c.max() - df['Last']
Customer ID 1996 1997 2019 2020 Last Recency
0 100000000000001 7.0 7.0 NaN NaN 1997 23
1 100000000000002 8.0 8.0 NaN NaN 1997 23
2 100000000000003 7.0 4.0 NaN NaN 1997 23
3 100000000000004 NaN NaN 21.0 24.0 2020 0
4 100000000000005 17.0 11.0 18.0 NaN 2019 1
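To see why this works on a small self-contained frame (a sketch with made-up values; the year columns are assumed to be integer labels in ascending order): cumsum of the notna mask stops increasing after the last non-NaN year, so idxmax returns that year's column label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Customer ID': [1, 2],
                   1996: [7.0, np.nan], 1997: [7.0, np.nan],
                   2019: [np.nan, 21.0], 2020: [np.nan, 24.0]})
years = [1996, 1997, 2019, 2020]
# cumsum peaks at the last non-NaN year; idxmax returns the first peak's label
df['Last'] = df[years].notna().cumsum(axis=1).idxmax(axis=1)
df['Recency'] = max(years) - df['Last']
print(df[['Last', 'Recency']])
```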
One idea is to apply applymap(float) to your DataFrame; see the pandas documentation for applymap.
I have several different data frames, that I need to drop certain rows from. Each data frame has the same sequence of rows but located in different areas
Summary Results Report Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3
0 DEM President NaN NaN NaN NaN
1 Vote For 1 NaN NaN NaN NaN
2 NaN NaN Ballots By NaN Election
3 TOTAL NaN NaN Early Voting NaN
4 NaN NaN Mail NaN Day
5 Tom Steyer NaN 0 0 0 0
6 Andrew Yang NaN 0 0 0 0
7 John K. Delaney NaN 0 0 0 0
8 Cory Booker NaN 0 0 0 0
9 Michael R. Bloomberg NaN 4 1 1 2
10 Julian Castro NaN 0 0 0 0
11 Elizabeth Warren NaN 1 0 1 0
12 Marianne Williamson NaN 0 0 0 0
13 Deval Patrick NaN 0 0 0 0
14 Robby Wells NaN 0 0 0 0
15 Amy Klobuchar NaN 3 1 2 0
16 Tulsi Gabbard NaN 0 0 0 0
17 Michael Bennet NaN 0 0 0 0
18 Bernie Sanders NaN 4 0 1 3
19 Pete Buttigieg NaN 0 0 0 0
20 Joseph R. Biden 21.0 0 3 18
21 Roque "Rocky" De La Fuente NaN 0 0 0 0
22 Total Votes Cast 33.0 2 8 23
Summary Results Report Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 DEM US Senator NaN NaN NaN NaN NaN NaN
1 Vote For 1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN Ballots By NaN Election NaN
3 TOTAL NaN NaN NaN Early Voting NaN NaN
4 NaN NaN NaN Mail NaN Day NaN
5 Jack Daniel Foster, Jr. 4.0 NaN 0 0 4 NaN
6 Mary (MJ) Hegar 4.0 NaN 1 3 0 NaN
7 Amanda K. Edwards 4.0 NaN 1 1 2 NaN
8 D. R. Hunter 1.0 NaN 0 0 1 NaN
9 Michael Cooper 3.0 NaN 0 0 3 NaN
10 Chris Bell 1.0 NaN 0 0 1 NaN
11 Royce West 3.0 NaN 0 0 3 NaN
12 Cristina Tzintzun Ramirez 5.0 NaN 0 3 2 NaN
13 Victor Hugo Harris 1.0 NaN 0 0 1 NaN
14 Sema Hernandez 1.0 NaN 0 0 1 NaN
15 Adrian Ocegueda 0.0 NaN 0 0 0 NaN
16 Annie "Mama" Garcia 3.0 NaN 0 1 2 NaN
17 Total Votes Cast 30 NaN NaN 2 8 20 NaN
18 DEM US Representative, Dist 1 NaN NaN NaN NaN NaN NaN
19 Vote For 1 NaN NaN NaN NaN NaN NaN
20 NaN NaN NaN Ballots By NaN Election NaN
21 TOTAL NaN NaN NaN Early Voting NaN NaN
22 NaN NaN NaN Mail NaN Day NaN
23 Hank Gilbert 26 NaN NaN 1 6 19 NaN
24 Total Votes Cast 26 NaN NaN 1 6 19 NaN
What I want to remove is the row that contains "Vote For 1" in the first column, as well as the following 3 rows. The problem is that they can show up in multiple areas, and on occasion multiple times (as in the second data frame). What I have seems to work, in that it removes the required rows; however, at the end it gives me a KeyError, which tells me it is re-looping without any data.
for x in range(len(df)):
    if 'Vote For 1' in str(df.iloc[:, 0][x]):
        y = x + 3
        df = df.drop(df.loc[x:y].index)
        df.reset_index(drop=True, inplace=True)
df.index.name = None
print(df)
the code produces the following output:
Summary Results Report Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 DEM US Senator NaN NaN NaN NaN NaN NaN
1 Jack Daniel Foster, Jr. 4.0 NaN 0 0 4 NaN
2 Mary (MJ) Hegar 4.0 NaN 1 3 0 NaN
3 Amanda K. Edwards 4.0 NaN 1 1 2 NaN
4 D. R. Hunter 1.0 NaN 0 0 1 NaN
5 Michael Cooper 3.0 NaN 0 0 3 NaN
6 Chris Bell 1.0 NaN 0 0 1 NaN
7 Royce West 3.0 NaN 0 0 3 NaN
8 Cristina Tzintzun Ramirez 5.0 NaN 0 3 2 NaN
9 Victor Hugo Harris 1.0 NaN 0 0 1 NaN
10 Sema Hernandez 1.0 NaN 0 0 1 NaN
11 Adrian Ocegueda 0.0 NaN 0 0 0 NaN
12 Annie "Mama" Garcia 3.0 NaN 0 1 2 NaN
13 Total Votes Cast 30 NaN NaN 2 8 20 NaN
14 DEM US Representative, Dist 1 NaN NaN NaN NaN NaN NaN
15 Hank Gilbert 26 NaN NaN 1 6 19 NaN
16 Total Votes Cast 26 NaN NaN 1 6 19 NaN
It errors out at the end with KeyError: 17. Any advice is greatly appreciated.
####EDIT####
I just wanted to give an update on the code that finally solved my problem. I know it is probably a little crude, but it does work.
remove_strings=['Vote For 1','TOTAL']
remove_strings_list = df.index[df['Summary Results Report'].isin(remove_strings)].tolist()
df = df.drop(df.index[remove_strings_list])
Not exactly sure what your column names are, but if the summary column contains the names and the few values you want to remove, this should work. Otherwise you may have to change the column name accordingly.
strings_to_remove = ['Vote For 1', 'TOTAL']
# isin never matches real NaN cells, so drop those with notna()
df[df.summary.notna() & ~df.summary.isin(strings_to_remove)]
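As an alternative to looping (which breaks because len(df) is computed once while the frame shrinks), the "Vote For 1" row plus the 3 rows after it can be dropped in one pass with a shifted boolean mask. A sketch on a shortened toy frame, assuming the marker text is an exact cell value:

```python
import pandas as pd

df = pd.DataFrame({'Summary Results Report':
                   ['DEM President', 'Vote For 1', 'TOTAL',
                    'Ballots By', 'Mail', 'Tom Steyer']})
# mark each 'Vote For 1' row, then extend the mark to the next 3 rows
mask = df['Summary Results Report'].eq('Vote For 1')
drop = (mask | mask.shift(1, fill_value=False)
             | mask.shift(2, fill_value=False)
             | mask.shift(3, fill_value=False))
df = df[~drop].reset_index(drop=True)
print(df['Summary Results Report'].tolist())  # ['DEM President', 'Tom Steyer']
```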
I have the following dataframe:
id value year audit
1 21 2007 NaN
1 36 2008 2011
1 7 2009 NaN
2 44 2007 NaN
2 41 2008 NaN
2 15 2009 NaN
3 51 2007 NaN
3 15 2008 2011
3 51 2009 NaN
4 10 2007 NaN
4 12 2008 NaN
4 24 2009 2011
5 30 2007 2011
5 35 2008 NaN
5 122 2009 NaN
Basically, I want to create another variable audit2 where all the cells are 2011, if at least one audit is 2011, for each id.
I tried to put an if-statement inside a loop, but I could not get any results.
I would like to get this new dataframe
id value year audit audit2
1 21 2007 NaN 2011
1 36 2008 2011 2011
1 7 2009 NaN 2011
2 44 2007 NaN NaN
2 41 2008 NaN NaN
2 15 2009 NaN NaN
3 51 2007 NaN 2011
3 15 2008 2011 2011
3 51 2009 NaN 2011
4 10 2007 NaN 2011
4 12 2008 NaN 2011
4 24 2009 2011 2011
5 30 2007 2011 2011
5 35 2008 NaN 2011
5 122 2009 NaN 2011
Could you help me please?
df.groupby('id')['audit'].transform(
    lambda s: s[s.first_valid_index()] if s.first_valid_index() is not None else np.nan)
output:
0 2011.0
1 2011.0
2 2011.0
3 NaN
4 NaN
5 NaN
6 2011.0
7 2011.0
8 2011.0
9 2011.0
10 2011.0
11 2011.0
12 2011.0
13 2011.0
14 2011.0
Name: audit, dtype: float64
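A shorter route, which also sidesteps the first_valid_index edge cases (it returns None for an all-NaN group, and an index of 0 is falsy), is transform('max'): max skips NaN, so every row of a group gets 2011 if any row in that group has it. A sketch on a small frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'audit': [np.nan, 2011, np.nan, np.nan]})
# max ignores NaN within each group; an all-NaN group stays NaN
df['audit2'] = df.groupby('id')['audit'].transform('max')
print(df)
```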
I am attempting to transpose and merge two pandas dataframes: one contains accounts, the segment in which they received their deposit, the deposit information, and the day they received the deposit; the other has the accounts and withdrawal information. The issue is that, for indexing purposes, the segment information from one dataframe should line up with that of the other, regardless of whether there was a withdrawal.
Notes:
There will always be an account for every person
There will not always be a withdrawal for every person
The accounts and data for the withdrawal dataframe only exist if a withdrawal occurs
Account Dataframe Code
accounts = DataFrame({'person': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                      'segment': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
                      'date_received': [10, 20, 30, 40, 50, 11, 21, 31, 41, 51],
                      'amount_received': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
accounts = accounts.pivot_table(index=["person"], columns=["segment"])
Account Dataframe
amount_received date_received
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
Withdrawal Dataframe Code
withdrawals = DataFrame({'person': [1, 1, 1, 2, 2],
                         'withdrawal_segment': [1, 1, 5, 2, 3],
                         'withdraw_date': [1, 2, 3, 4, 5],
                         'withdraw_amount': [10, 20, 30, 40, 50]})
withdrawals = withdrawals.reset_index().pivot_table(index = ['index', 'person'], columns = ['withdrawal_segment'])
Since segments must be unique per person, each column may contain a given segment number only once while still holding all of the data, which is why this dataframe looks so different.
Withdrawal Dataframe
withdraw_date withdraw_amount
withdrawal_segment 1 2 3 5 1 2 3 5
index person
0 1 1.0 NaN NaN NaN 10.0 NaN NaN NaN
1 1 2.0 NaN NaN NaN 20.0 NaN NaN NaN
2 1 NaN NaN NaN 3.0 NaN NaN NaN 30.0
3 2 NaN 4.0 NaN NaN NaN 40.0 NaN NaN
4 2 NaN NaN 5.0 NaN NaN NaN 50.0 NaN
Merge
merge = accounts.merge(withdrawals, on='person', how='left')
amount_received date_received withdraw_date withdraw_amount
segment 1 2 3 4 5 1 2 3 4 5 1 2 3 5 1 2 3 5
person
1 1 2 3 4 5 10 20 30 40 50 1.0 NaN NaN NaN 10.0 NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 2.0 NaN NaN NaN 20.0 NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 NaN NaN NaN 3.0 NaN NaN NaN 30.0
2 6 7 8 9 10 11 21 31 41 51 NaN 4.0 NaN NaN NaN 40.0 NaN NaN
2 6 7 8 9 10 11 21 31 41 51 NaN NaN 5.0 NaN NaN NaN 50.0 NaN
The problem with the merged dataframe is that segments from the withdrawal dataframe aren't lined up with the accounts segments.
The desired dataframe should look something like:
amount_received date_received withdraw_date withdraw_amount
segment 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50 1.0 NaN NaN NaN NaN 10.0 NaN NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 2.0 NaN NaN NaN NaN 20.0 NaN NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 NaN NaN NaN NaN 3.0 NaN NaN NaN NaN 30.0
2 6 7 8 9 10 11 21 31 41 51 NaN 4.0 NaN NaN NaN NaN 40.0 NaN NaN NaN
2 6 7 8 9 10 11 21 31 41 51 NaN NaN 5.0 NaN NaN NaN NaN 50.0 NaN NaN
My problem is that I can't seem to merge across both person and segments. I've thought about inserting a row and column, but because I don't know which segments are and aren't going to have a withdrawal this gets difficult. Is it possible to merge the dataframes so that they line up across both people and segments? Thanks!
Method 1, using reindex
withdrawals = withdrawals.reindex(
    pd.MultiIndex.from_product([withdrawals.columns.levels[0],
                                accounts.columns.levels[1]]), axis=1)
merge = accounts.merge(withdrawals, on='person', how='left')
merge
Out[79]:
amount_received date_received \
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
2 6 7 8 9 10 11 21 31 41 51
withdraw_amount withdraw_date
segment 1 2 3 4 5 1 2 3 4 5
person
1 10.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
1 20.0 NaN NaN NaN NaN 2.0 NaN NaN NaN NaN
1 NaN NaN NaN NaN 30.0 NaN NaN NaN NaN 3.0
2 NaN 40.0 NaN NaN NaN NaN 4.0 NaN NaN NaN
2 NaN NaN 50.0 NaN NaN NaN NaN 5.0 NaN NaN
Method 2, using unstack and stack
merge = accounts.merge(withdrawals, on='person', how='left')
merge.stack(dropna=False).unstack()
Out[82]:
amount_received date_received \
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
2 6 7 8 9 10 11 21 31 41 51
withdraw_amount withdraw_date
segment 1 2 3 4 5 1 2 3 4 5
person
1 10.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
1 20.0 NaN NaN NaN NaN 2.0 NaN NaN NaN NaN
1 NaN NaN NaN NaN 30.0 NaN NaN NaN NaN 3.0
2 NaN 40.0 NaN NaN NaN NaN 4.0 NaN NaN NaN
2 NaN NaN 50.0 NaN NaN NaN NaN 5.0 NaN NaN
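The key step in Method 1 is padding the withdrawal columns out to the full segment set before merging, so both frames share the same column grid. On a toy frame with made-up labels, the reindex looks like this:

```python
import pandas as pd

# one withdrawal recorded only for segment 1
w = pd.DataFrame([[1.0, 10.0]],
                 columns=pd.MultiIndex.from_tuples([('withdraw_date', 1),
                                                    ('withdraw_amount', 1)]))
# pad every first-level column out to segments 1..3, inserting NaN columns
full = pd.MultiIndex.from_product([['withdraw_date', 'withdraw_amount'],
                                   [1, 2, 3]])
w = w.reindex(full, axis=1)
print(w.shape)  # (1, 6)
```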
I have a dataframe like this:
import numpy as np
import pandas as pd

index = [0, 1, 2, 3, 4, 5]
s = pd.Series([1, 1, 1, 2, 2, 2], index=index)
t = pd.Series([2007, 2008, 2011, 2006, 2007, 2009], index=index)
f = pd.Series([2, 4, 6, 8, 10, 12], index=index)
pp = pd.DataFrame(np.c_[s, t, f], columns=["group", "year", "amount"])
pp
group year amount
0 1 2007 2
1 1 2008 4
2 1 2011 6
3 2 2006 8
4 2 2007 10
5 2 2009 12
I want to add lines in between the missing years for each group. My desired dataframe is like this:
group year amount
0 1.0 2007 2.0
1 1.0 2008 4.0
2 1.0 2009 NaN
3 1.0 2010 NaN
4 1.0 2011 6.0
5 2.0 2006 8.0
6 2.0 2007 10.0
7 2.0 2008 NaN
8 2.0 2009 12.0
Is there any way to do it for a large dataframe?
First change year to datetime:
df.year = pd.to_datetime(df.year, format='%Y')
Then set_index with groupby and resample:
df.set_index('year').groupby('group').amount.resample('Y').mean().reset_index()
group year amount
0 1 2007-12-31 2.0
1 1 2008-12-31 4.0
2 1 2009-12-31 NaN
3 1 2010-12-31 NaN
4 1 2011-12-31 6.0
5 2 2006-12-31 8.0
6 2 2007-12-31 10.0
7 2 2008-12-31 NaN
8 2 2009-12-31 12.0
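If you would rather keep year as a plain integer than a year-end timestamp, a groupby plus reindex over each group's own year range works too. A sketch on the example data:

```python
import pandas as pd

df = pd.DataFrame({'group': [1, 1, 1, 2, 2, 2],
                   'year': [2007, 2008, 2011, 2006, 2007, 2009],
                   'amount': [2, 4, 6, 8, 10, 12]})
# within each group, reindex onto the full run of years; missing years get NaN
out = (df.set_index('year')
         .groupby('group')['amount']
         .apply(lambda s: s.reindex(range(s.index.min(), s.index.max() + 1)))
         .reset_index())
print(out)
```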