Want to join last row of two dataframe on condition - python

quantity:
a b c
3 1 nan
3 2 8
7 5 9
4 8 nan
price
34
I have two dataframes, quantity and price, and I want to join the last row of quantity where c is not NaN to price.
I wrote this query but didn't get the desired output:
price = pd.concat(price,quantity["a","b","c"].tail(1).isnotnull())
What I want is:
price a b c
34 7 5 9

If your dfs are these:
df = pd.DataFrame([[3,1,np.nan], [3,2,8], [7,5,9], [4,8,np.nan]], columns=['a','b','c'])
df2 = pd.DataFrame([34], columns=['price'])
You can do it this way:
final_df = pd.concat([df.dropna(subset=['c']).tail(1).reset_index(drop=True), df2], axis=1)
Output:
a b c price
0 7 5 9.0 34

I believe you need to remove the missing values first, then select the last row with double brackets, iloc[[-1]], so the result is a one-row DataFrame:
df=pd.concat([price.reset_index(drop=True),
quantity[["a","b","c"]].dropna(subset=['c']).iloc[[-1]].reset_index(drop=True)],
axis=1)
print (df)
price a b c
0 34 7 5 9.0
Detail:
print (quantity[["a","b","c"]].dropna().iloc[[-1]])
a b c
2 7 5 9.0

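I would filter the df on not-null values, keep the last row, then simply add the price to it:
new_df = df[df['c'].notnull()].tail(1).copy() # .copy() avoids SettingWithCopyWarning on the assignment below
Where c is your column name.
new_df['price'] = 32 # or the price from your df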
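Putting the answers above together, here is a minimal self-contained sketch. It assumes the frames are named quantity and price as in the question and that price holds exactly one row; with more rows in price the alignment on index 0 would need rethinking.

```python
import numpy as np
import pandas as pd

quantity = pd.DataFrame(
    [[3, 1, np.nan], [3, 2, 8], [7, 5, 9], [4, 8, np.nan]],
    columns=["a", "b", "c"],
)
price = pd.DataFrame({"price": [34]})

# Keep only rows where c is present, then take the last of them.
last_valid = quantity.dropna(subset=["c"]).tail(1).reset_index(drop=True)

# Both frames now share index 0, so a column-wise concat lines them up.
result = pd.concat([price.reset_index(drop=True), last_valid], axis=1)
print(result)
```

The reset_index calls matter: without them the row from quantity keeps its original label 2 and concat would produce two half-empty rows instead of one.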

Related

pandas - groupby a column and get the max length of another string column with nulls

I have a pandas DataFrame like this:
source text_column
0 a abcdefghi
1 a abcde
2 b qwertyiop
3 c plmnkoijb
4 a NaN
5 c abcde
6 b qwertyiop
7 b qazxswedcdcvfr
and I would like to get the length of text_column after grouping source column, like below:
source something
a 9
b 14
c 9
Here's what I have tried so far; all of these raise an error:
>>> # first creating the group by object
>>> text_group = mydf.groupby(by=['source'])
>>> # now try to get the max length of "text_column" by each "source"
>>> text_group['text_column'].map(len).max()
>>> text_group['text_column'].len().max()
>>> text_group['text_column'].str.len().max()
How do I get the max length of text_column grouped by another column?
To avoid creating a new question: how do I also get the second-largest length and the respective values (the longest and second-longest sentences in text_column)?
The first idea is to use a lambda function with Series.str.len and max:
df = (df.groupby('source')['text_column']
.agg(lambda x: x.str.len().max())
.reset_index(name='something'))
print (df)
source something
0 a 9.0
1 b 14.0
2 c 9.0
Or you can first use Series.str.len and then aggregate max:
df = (df['text_column'].str.len()
.groupby(df['source'])
.max()
.reset_index(name='something'))
print (df)
If you need integers, first drop the missing values with DataFrame.dropna:
df = (df.dropna(subset=['text_column'])
.assign(text_column=lambda x: x['text_column'].str.len())
.groupby('source', as_index=False)['text_column']
.max())
print (df)
source text_column
0 a 9
1 b 14
2 c 9
EDIT: For the first and second top values, use DataFrame.sort_values with GroupBy.head:
df1 = (df.dropna(subset=['text_column'])
.assign(something=lambda x: x['text_column'].str.len())
.sort_values(['source','something'], ascending=[True, False])
.groupby('source', as_index=False)
.head(2))
print (df1)
source text_column something
0 a abcdefghi 9
1 a abcde 5
7 b qazxswedcdcvfr 14
2 b qwertyiop 9
3 c plmnkoijb 9
5 c abcde 5
An alternative solution uses SeriesGroupBy.nlargest, though it is slower:
df1 = (df.dropna(subset=['text_column'])
.assign(something=lambda x: x['text_column'].str.len())
.groupby('source')['something']
.nlargest(2)
.reset_index(level=1, drop=True)
.reset_index())
print (df1)
source something
0 a 9
1 a 5
2 b 14
3 b 9
4 c 9
5 c 5
The last solution creates new columns top1 and top2 (pivot takes keyword arguments in current pandas):
df=df.dropna(subset=['text_column']).assign(something=lambda x: x['text_column'].str.len())
df = df.sort_values(['source','something'], ascending=[True, False])
df['g'] = df.groupby('source').cumcount().add(1)
df = (df[df['g'].le(2)].pivot(index='source', columns='g', values='something')
.add_prefix('top')
.rename_axis(index=None, columns=None))
print (df)
top1 top2
a 9 5
b 14 9
c 9 5
Just get the lengths first with assign and str.len:
df.assign(text_column=df['text_column'].str.len()).groupby('source', as_index=False).max()
source text_column
0 a 9.0
1 b 14.0
2 c 9.0
The easiest solution to me looks something like this (tested); you do not actually need a groupby:
df['str_len'] = df.text_column.str.len()
df.sort_values(['str_len'], ascending=False)\
.drop_duplicates(['source'])\
.drop(columns='text_column')
source str_len
7 b 14.0
0 a 9.0
3 c 9.0
Regarding your second question, I think a groupby serves you well:
top_x = 2
df.groupby('source', as_index=False)\
.apply(lambda sourcedf: sourcedf.sort_values('str_len').nlargest(top_x, columns='str_len', keep='all'))\
.drop(columns='text_column')
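For reference, here is a self-contained sketch that covers both parts of the question on the sample data: the maximum length per source, and the two longest strings per source together with their text. The column name length is my own choice for illustration, not something from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "source": list("aabcacbb"),
    "text_column": ["abcdefghi", "abcde", "qwertyiop", "plmnkoijb",
                    np.nan, "abcde", "qwertyiop", "qazxswedcdcvfr"],
})

# Part 1: str.len() leaves NaN as NaN, and max() skips it within each group.
max_len = df["text_column"].str.len().groupby(df["source"]).max().astype(int)
print(max_len)

# Part 2: drop missing text, sort by length inside each source,
# and keep the first two rows of every group (text stays attached).
top2 = (
    df.dropna(subset=["text_column"])
      .assign(length=lambda d: d["text_column"].str.len())
      .sort_values(["source", "length"], ascending=[True, False])
      .groupby("source")
      .head(2)
)
print(top2)
```

The astype(int) step assumes every source has at least one non-missing string; a group that is entirely NaN would make the cast fail.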

Python: Calculate mathematical values in new row in dataframe based on few specific previous rows

I have the below pandas dataframe:
Input:
A B C
Expense 2 3
Sales 5 6
Travel 8 9
My Expected Output is:
A B C
Expense 2 3
Sales 5 6
Travel 8 9
Total Exp 10 12
The last row is basically the total of row 1 and row 3. This is a very simplified example; I actually have to perform complex calculations on a huge dataframe.
Is there a way in python to perform such calculation?
You can select rows by positions with DataFrame.iloc and sum, then assign to new row:
df.loc[len(df.index)] = df.iloc[0] + df.iloc[2]
Or:
df.loc[len(df.index)] = df.iloc[[0,2]].sum()
print (df)
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 8 10 12
EDIT: The first idea is to set column A as the index, so you can use loc with the new value of A; the last step converts the index back to a column with reset_index:
df = df.set_index('A')
df.loc['Total Exp'] = df.iloc[[0,2]].sum()
df = df.reset_index()
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
Similarly, you can select by labels with loc, here Expense and Travel:
df = df.set_index('A')
df.loc['Total Exp'] = df.loc[['Expense', 'Travel']].sum()
df = df.reset_index()
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
Or you can exclude the first column with 1: and add the value back with Series.reindex:
df.loc[len(df.index)] = df.iloc[[0,2], 1:].sum().reindex(df.columns, fill_value='Total Exp')
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
Or you can set value of A separately:
s = df.iloc[[0,2]].sum()
s.loc['A'] = 'Total Exp'
df.loc[len(df.index)] = s
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
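If you need this in several places, the label-based variant can be wrapped in a small helper. This is only a sketch: append_total and its parameters are names I made up for illustration, and it assumes every column other than the key column is numeric.

```python
import pandas as pd

df = pd.DataFrame({"A": ["Expense", "Sales", "Travel"],
                   "B": [2, 5, 8],
                   "C": [3, 6, 9]})

def append_total(frame, labels, total_label, key="A"):
    """Return a copy of frame with one extra row that sums the rows
    whose value in the key column is listed in labels."""
    # Sum only the selected rows, leaving the key column out of the sum.
    numeric = frame.loc[frame[key].isin(labels)].drop(columns=key).sum()
    total = pd.DataFrame([{key: total_label, **numeric.to_dict()}])
    return pd.concat([frame, total], ignore_index=True)

out = append_total(df, ["Expense", "Travel"], "Total Exp")
print(out)
```

Returning a new frame instead of assigning with df.loc[len(df.index)] keeps the original untouched, which makes it easier to add several computed rows one after another.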

Sum duplicated rows on a multi-index pandas dataframe

Hello, I'm having trouble with pandas. I'm trying to sum duplicated rows on a MultiIndex DataFrame.
I tried df.groupby(level=[0,1]).sum(), also df.stack().reset_index().groupby(['year', 'product']).sum() and some others, but I cannot get it to work.
I'd also like to add every unique product for each given year and give it a value of 0 if it wasn't listed.
Example: dataframe with multi-index and 3 different products (A,B,C):
volume1 volume2
year product
2010 A 10 12
A 7 3
B 7 7
2011 A 10 10
B 7 6
C 5 5
Expected output: if there are duplicated products for a given year, we sum them.
If one of the products isn't listed for a year, we create a new row full of 0s.
volume1 volume2
year product
2010 A 17 15
B 7 7
C 0 0
2011 A 10 10
B 7 6
C 5 5
Any idea ? Thanks
You can make the second level of the index a CategoricalIndex and when you use groupby it will include all of the categories.
df.index = df.index.set_levels(pd.CategoricalIndex(df.index.levels[1]), level=1) # inplace=True was removed from set_levels in pandas 2.0
df.groupby(level=[0, 1]).sum().fillna(0, downcast='infer')
volume1 volume2
year product
2010 A 17 15
B 7 7
C 0 0
2011 A 10 10
B 7 6
C 5 5
Use sum with unstack and stack:
df = df.groupby(level=[0,1]).sum().unstack(fill_value=0).stack()
#older spelling, removed in pandas 2.0:
#df = df.sum(level=[0,1]).unstack(fill_value=0).stack()
Alternative with reindex:
df = df.groupby(level=[0,1]).sum()
#older spelling, removed in pandas 2.0:
#df = df.sum(level=[0,1])
mux = pd.MultiIndex.from_product(df.index.levels, names = df.index.names)
df = df.reindex(mux, fill_value=0)
Another alternative, thanks @Wen:
df = df.groupby(level=[0,1]).sum().unstack().stack(dropna=False)
print (df)
volume1 volume2
year product
2010 A 17 15
B 7 7
C 0 0
2011 A 10 10
B 7 6
C 5 5
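Here is the reindex approach as a self-contained sketch that rebuilds the sample frame from the question. It builds the full year-by-product grid from the values that actually occur in the data; if a product never appears in any year, you would have to pass the complete product list yourself.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(2010, "A"), (2010, "A"), (2010, "B"),
     (2011, "A"), (2011, "B"), (2011, "C")],
    names=["year", "product"],
)
df = pd.DataFrame({"volume1": [10, 7, 7, 10, 7, 5],
                   "volume2": [12, 3, 7, 10, 6, 5]}, index=idx)

# Step 1: collapse duplicated (year, product) pairs.
summed = df.groupby(level=["year", "product"]).sum()

# Step 2: build every year/product combination and fill the gaps with 0.
full_index = pd.MultiIndex.from_product(
    [summed.index.get_level_values("year").unique(),
     summed.index.get_level_values("product").unique()],
    names=["year", "product"],
)
result = summed.reindex(full_index, fill_value=0)
print(result)
```

Using fill_value=0 in reindex keeps the integer dtype, whereas filling NaN afterwards would first convert the columns to float.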

Is there no syntactic sugar for dynamically creating columns in a MultiIndexed pandas dataframe?

First, I show the pandas dataframe to elucidate my problem.
import pandas as pd
mi = pd.MultiIndex.from_product([["A","B"],["c","d"]], names=['lv1', 'lv2'])
df1 = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]],columns=mi)
This Python code creates a dataframe (df1) like this:
#input dataframe
lv1 A B
lv2 c d c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
I want to create 'c*d' columns on lv2 from df1's data, like this:
#output dataframe after calculation
lv1 A B
lv2 c d c*d c d c*d
0 1 2 2 3 4 12
1 5 6 30 7 8 56
2 9 10 90 11 12 132
For this problem, I wrote the following code:
for l1 in mi.levels[0]:
    df1.loc[:, (l1, "c*d")] = df1.loc[:,(l1,"c")]*df1.loc[:,(l1,"d")]
df1.sort_index(axis=1, inplace=True)
Although this code almost solves my problem, I would really like to write it without the 'for' statement, like this:
df1.loc[:,(slice(None),"c*d")]=df1.loc[:,(slice(None),"c")]*df1.loc[:,(slice(None),"d")]
With this statement I get a KeyError saying that 'c*d' is missing.
Is there no syntactic sugar for this calculation? Or can I achieve better performance with other code?
A slightly improved version of your solution:
for l1 in mi.levels[0]:
    df1.loc[:, (l1, "c*d")] = df1.loc[:,(l1,"c")]*df1.loc[:,(l1,"d")]
mux = pd.MultiIndex.from_product([df1.columns.levels[0], ['c','d','c*d']])
df1 = df1.reindex(columns=mux)
print (df1)
A B
c d c*d c d c*d
0 1 2 2 3 4 12
1 5 6 30 7 8 56
2 9 10 90 11 12 132
Another solution with stack and unstack (the column is named c_d here so it can be passed to assign as a keyword):
mux = pd.MultiIndex.from_product([df1.columns.levels[0], ['c','d','c_d']])
df1 = (df1.stack(0)
.assign(c_d = lambda x: x.prod(axis=1))
.unstack()
.swaplevel(0, 1, axis=1)
.reindex(columns=mux))
print (df1)
A B
c d c_d c d c_d
0 1 2 2 3 4 12
1 5 6 30 7 8 56
2 9 10 90 11 12 132
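A third option multiplies the cross-sections selected with xs and joins the result back to the original df1:
df2 = df1.xs("c", axis=1, level=1).mul(df1.xs("d", axis=1, level=1))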
df2.columns = pd.MultiIndex.from_product([df2.columns, ['c*d']])
print (df2)
A B
c*d c*d
0 2 12
1 30 56
2 90 132
mux = pd.MultiIndex.from_product([df2.columns.levels[0], ['c','d','c*d']])
df = df1.join(df2).reindex(columns=mux)
print (df)
A B
c d c*d c d c*d
0 1 2 2 3 4 12
1 5 6 30 7 8 56
2 9 10 90 11 12 132
An explanation of jezrael's answer using stack, which may be the most idiomatic way in pandas:
output = (df1
# "Stack" data, by moving the top level ('lv1') of the
# column MultiIndex into row index,
# now the rows are a MultiIndex and the columns
# are a regular Index.
.stack(0)
# Since we only have 2 columns now, 'lv2' ('c' & 'd')
# we can multiply them together along the row axis.
# The assign method takes key=value pairs mapping new column
# names to the function used to calculate them. Here we're
# wrapping them in a dictionary and unpacking them using **
.assign(**{'c*d': lambda x: x.product(axis=1)})
# Undo the stack operation, moving 'lv1' back to the
# column index, but now as the bottom level of the column index
.unstack()
# This sets the order of the column index MultiIndex levels.
# Since they are named we can use the names, you can also use
# their integer positions instead. Here axis=1 references
# the column index
.swaplevel('lv1', 'lv2', axis=1)
# Sort the values in both levels of the column MultiIndex.
# This will order them as c, c*d, d which is not what you
# specified above, however having a sorted MultiIndex is required
# for indexing via .loc[:, (...)] to work properly
.sort_index(axis=1)
)
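If you do want the c, d, c*d order from the question after sorting, one option is to reindex the columns at the very end. The sketch below starts from a hand-built frame that has the sorted layout (c, c*d, d) the pipeline above ends with, so it runs on its own; the variable names wanted and reordered are mine.

```python
import pandas as pd

# Sorted layout as produced by sort_index(axis=1): c, c*d, d under each lv1.
cols = pd.MultiIndex.from_product(
    [["A", "B"], ["c", "c*d", "d"]], names=["lv1", "lv2"]
)
output = pd.DataFrame(
    [[1, 2, 2, 3, 12, 4],
     [5, 30, 6, 7, 56, 8],
     [9, 90, 10, 11, 132, 12]],
    columns=cols,
)

# Desired order inside every top-level group.
wanted = pd.MultiIndex.from_product(
    [output.columns.levels[0], ["c", "d", "c*d"]],
    names=output.columns.names,
)
reordered = output.reindex(columns=wanted)
print(reordered)
```

Keep in mind that the reordered columns are no longer lexically sorted, so it is probably best to do any .loc slicing on the sorted frame first and reorder only for display.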

merge two dataframes without repeats pandas

I am trying to merge two dataframes, one with columns: customerId, full name, and emails and the other dataframe with columns: customerId, amount, and date. I want to have the first dataframe be the main dataframe and the other dataframe information be included but only if the customerIds match up; I tried doing:
merge = pd.merge(df, df2, on='customerId', how='left')
but the dataframe that is produced contains a lot of repeats and looks wrong:
customerId full name emails amount date
0 002963338 Star shine star.shine@cdw.com $2,910.94 2016-06-14
1 002963338 Star shine star.shine@cdw.com $9,067.70 2016-05-27
2 002963338 Star shine star.shine@cdw.com $6,507.24 2016-04-12
3 002963338 Star shine star.shine@cdw.com $1,457.99 2016-02-24
4 986423367 palm tree tree.palm@snapchat.com,tree@.com $4,604.83 2016-07-16
This can't be right, please help!
The problem is that you have duplicates in the customerId column.
So the solution is to remove them, e.g. with drop_duplicates:
df2 = df2.drop_duplicates('customerId')
Sample:
df = pd.DataFrame({'customerId':[1,2,1,1,2], 'full name':list('abcde')})
print (df)
customerId full name
0 1 a
1 2 b
2 1 c
3 1 d
4 2 e
df2 = pd.DataFrame({'customerId':[1,2,1,2,1,1], 'full name':list('ABCDEF')})
print (df2)
customerId full name
0 1 A
1 2 B
2 1 C
3 2 D
4 1 E
5 1 F
merge = pd.merge(df, df2, on='customerId', how='left')
print (merge)
customerId full name_x full name_y
0 1 a A
1 1 a C
2 1 a E
3 1 a F
4 2 b B
5 2 b D
6 1 c A
7 1 c C
8 1 c E
9 1 c F
10 1 d A
11 1 d C
12 1 d E
13 1 d F
14 2 e B
15 2 e D
df2 = df2.drop_duplicates('customerId')
merge = pd.merge(df, df2, on='customerId', how='left')
print (merge)
customerId full name_x full name_y
0 1 a A
1 2 b B
2 1 c A
3 1 d A
4 2 e B
I do not see whole rows repeated, but there are repetitions in customerId. You could remove them using:
df.drop_duplicates('customerId', inplace=True)
where df could be the dataframe containing amount or the one obtained after the merge. If you want to keep a few rows per customer (say n), you could use:
df.groupby('customerId').head(n)
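A plain drop_duplicates keeps whichever row happens to come first. If you would rather choose which row survives, here is a small sketch that keeps the most recent order per customer and then lets merge check the keys. The frames, the shortened IDs and the values are made up for illustration; only the column names follow the question.

```python
import pandas as pd

customers = pd.DataFrame({
    "customerId": ["001", "002"],
    "full name": ["Star shine", "palm tree"],
})
orders = pd.DataFrame({
    "customerId": ["001", "001", "002"],
    "amount": [2910.94, 9067.70, 4604.83],
    "date": pd.to_datetime(["2016-06-14", "2016-05-27", "2016-07-16"]),
})

# Sort by date so that keep="last" retains the newest order per customer.
latest = (orders.sort_values("date")
                .drop_duplicates("customerId", keep="last"))

# validate makes merge raise a MergeError if either side still has
# duplicate keys, instead of silently multiplying rows.
merged = customers.merge(latest, on="customerId", how="left",
                         validate="one_to_one")
print(merged)
```

If you need all the amounts rather than one row per customer, the repeated rows from the original merge are actually the expected result of a one-to-many join.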
