Transforming Dataframe Columns in Python

If I have a pandas DataFrame like this
and I want to transform it so that it results in
Is there a correct, idiomatic way to achieve this? A good pattern?

Use a pivot table:
pd.pivot_table(df, index='name', columns=['property'], aggfunc='sum').fillna(0)
Output:
          price
Property   boat  dog  house
name
Bob           0    5      4
Josh          0    2      0
Sam           3    0      0
Sidenote: Pasting your DataFrames as text helps, so people can load them with pd.read_clipboard instead of recreating them by hand.
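Since the question's original data was not included, here is a minimal sketch of the pivot on a made-up frame. The rows below are an assumption, chosen only so that the result matches the output shown above; the column names name, property and price are inferred from the answer's code and output.

```python
import pandas as pd

# Hypothetical long-format input (not from the original question),
# built so that the pivot reproduces the output shown in the answer.
df = pd.DataFrame({
    "name": ["Bob", "Bob", "Josh", "Sam"],
    "property": ["dog", "house", "dog", "boat"],
    "price": [5, 4, 2, 3],
})

# One row per name, one column per property, summing price.
# fill_value=0 fills the name/property pairs that never occur.
wide = pd.pivot_table(
    df,
    index="name",
    columns="property",
    values="price",
    aggfunc="sum",
    fill_value=0,
)
print(wide)
```

Passing values="price" as a single label keeps the columns flat (boat, dog, house) rather than nested under a price level, which is usually easier to work with afterwards.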

Related

How can I merge aggregate two dataframes in Pandas while subtracting column values?

I'm working on a rudimentary inventory system and am having trouble finding a solution to this obstacle. I've got two Pandas dataframes, both sharing two columns: PLU and QTY. PLU acts as an item identifier, and QTY is the quantity of the item in one dataframe, while being the quantity sold in another. Here are two very simple examples of what the data looks like:
final_purch:
PLU QTY
12345678 12
90123456 7
78901234 2
pmix_df:
PLU QTY
12345678 9
90123456 3
78901234 1
In this case, I'd want to find any matching PLUs and subtract the pmix_df QTY from the final_purch QTY.
In an earlier part of the project, I used aggregate functions to get rid of duplicates while summing the QTY column. It worked great, but I can't find a way to do something similar here with subtraction. I'm fairly new to Python/Pandas, so any help is greatly appreciated. :)

Here is one way to do that, using assign and merge (df stands for final_purch and df2 for pmix_df):
df.assign(QTY=df['QTY'] - df.merge(df2, on='PLU', suffixes=('', '_y'), how='left')['QTY_y'].fillna(0))
PLU QTY
0 12345678 3
1 90123456 4
2 78901234 1

You may do:
df = final_purch.set_index('PLU').join(pmix_df.set_index('PLU'), lsuffix='final', rsuffix='pmix')
df['QTYdiff'] = df['QTYfinal']-df['QTYpmix']
output:
QTYfinal QTYpmix QTYdiff
PLU
12345678 12 9 3
90123456 7 3 4
78901234 2 1 1
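Another way to express the same subtraction is to index both frames by PLU and let pandas align the rows. This is only a sketch using the sample data from the question; it assumes each PLU appears at most once per frame (aggregate duplicates first otherwise), and the name remaining is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
final_purch = pd.DataFrame({"PLU": [12345678, 90123456, 78901234],
                            "QTY": [12, 7, 2]})
pmix_df = pd.DataFrame({"PLU": [12345678, 90123456, 78901234],
                        "QTY": [9, 3, 1]})

# Align on PLU and subtract. fill_value=0 treats a PLU that is missing
# from one frame as quantity 0 instead of producing NaN.
remaining = (
    final_purch.set_index("PLU")["QTY"]
    .sub(pmix_df.set_index("PLU")["QTY"], fill_value=0)
    .reset_index()
)
print(remaining)
```

Note that with fill_value=0 a PLU that only appears in pmix_df ends up with a negative quantity, which may or may not be what an inventory report should show.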

Canonical way for Pandas set value based on reference table

I have two dataframes, a reference table and a main table. I want to map the values in the reference table to the main table, overwriting if necessary. In visual form:
import pandas as pd
ref_data = {'Fruit':['Apple','Pear','Orange'],
'Price':[50,60,70]}
reference_table = pd.DataFrame(ref_data)
main_data = {'col1':[1,2,3,4,5],
'col2':[5,5,5,5,5],
'Fruit':['Durian','Pineapple','Apple','Orange','Pear'],
'Price':[40,120,454,12,43]}
main_data = pd.DataFrame(main_data)
This seems like quite a common use case. I found the following question, which seems to fit exactly, but its solution feels a bit "hacky". Just wondering if there's a proper way to do this?
Pandas -- set row values based on values in another table
Thanks!

We usually use np.where:
import numpy as np
s = reference_table.set_index('Fruit').Price.reindex(main_data.Fruit).values
main_data['Price'] = np.where(np.isnan(s), main_data['Price'], s)

You can also merge and assign, then drop the unused columns:
main_data = (
    main_data.merge(reference_table, on='Fruit', how='left')
    .assign(Price=lambda x: x['Price_y'].fillna(x['Price_x']))
    .drop(['Price_x', 'Price_y'], axis=1)
)
Result
Fruit col1 col2 Price
0 Durian 1 5 40.0
1 Pineapple 2 5 120.0
2 Apple 3 5 50.0
3 Orange 4 5 70.0
4 Pear 5 5 60.0
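A third option, not taken from the answers above, is Series.map with a lookup Series followed by fillna. The sketch below reuses the question's data; the name lookup is introduced here for illustration, and the approach assumes Fruit is unique in the reference table.

```python
import pandas as pd

# Data copied from the question.
reference_table = pd.DataFrame({"Fruit": ["Apple", "Pear", "Orange"],
                                "Price": [50, 60, 70]})
main_data = pd.DataFrame({"col1": [1, 2, 3, 4, 5],
                          "col2": [5, 5, 5, 5, 5],
                          "Fruit": ["Durian", "Pineapple", "Apple", "Orange", "Pear"],
                          "Price": [40, 120, 454, 12, 43]})

# Lookup Series: Fruit -> reference Price (index must be unique).
lookup = reference_table.set_index("Fruit")["Price"]

# map() returns NaN for fruits missing from the reference table;
# fillna() then falls back to the existing price for those rows.
main_data["Price"] = main_data["Fruit"].map(lookup).fillna(main_data["Price"])
print(main_data)
```

As in the merge version, the Price column becomes float because of the intermediate NaN values; cast it back with astype(int) if integers are needed.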

Dataframe specific transposition optimisation

I would like to transpose a pandas DataFrame from rows to columns, where the number of rows is dynamic, so the transposed DataFrame must also have a dynamic number of columns.
I succeeded using the iterrows() and concat() methods, but I would like to optimize my code.
Please find my current code:
import pandas as pd
expected_results_transposed = pd.DataFrame()
for i, r in expected_results.iterrows():
    t = pd.Series([r.get('B')], name=r.get('A'))
    expected_results_transposed = pd.concat([expected_results_transposed, t], axis=1)
print("CURRENT CASE EXPECTED RESULTS TRANSPOSED:\n{0}\n".format(expected_results_transposed))
Please find an illustration of the expected result:
[image: picture of expected result]
Do you have a solution to optimize my code using standard pandas DataFrame methods/options?
Thank you for your help :)

Use DataFrame.transpose + DataFrame.set_index:
new_df=df.set_index('A').T.reset_index(drop=True)
new_df.columns.name=None
Example
df2=pd.DataFrame({'A':'Mike Ana Jon Jenny'.split(),'B':[1,2,3,4]})
print(df2)
A B
0 Mike 1
1 Ana 2
2 Jon 3
3 Jenny 4
new_df=df2.set_index('A').T.reset_index(drop=True)
new_df.columns.name=None
print(new_df)
Mike Ana Jon Jenny
0 1 2 3 4

How to concatenate partially sequential occurring rows in data frame using pandas

I have a CSV in which each conversation is broken into multiple rows, as follows:
Names,text,conv_id
tim,hi,1234
jon,hello,1234
jon,how,1234
jon,are you,1234
tim,hey,1234
tim,i am good,1234
pam, me too,1234
jon,great,1234
jon,hows life,1234
So I want to concatenate the sequentially occurring rows from the same person into one row, as follows, to make the data more meaningful.
Expected output:
Names,text,conv_id
tim,hi,1234
jon,hello how are you,1234
tim,hey i am good,1234
pam, me too,1234
jon,great hows life,1234
I tried a couple of things but could not get it to work. Can anyone please guide me on how to do this?
Thanks in advance.

You can use Series.shift + Series.cumsum to create the appropriate groups for groupby, and then join the text of each group with groupby.apply. 'conv_id' and 'Names' are added to the grouping keys so that they can be recovered with Series.reset_index. Finally, DataFrame.reindex places the columns in their original order. Note that the code joins with ','; use ' '.join to get the space-separated text shown in the expected output.
groups=df['Names'].rename('groups').ne(df['Names'].shift()).cumsum()
new_df = (df.groupby([groups, 'conv_id', 'Names'])['text']
            .apply(lambda x: ','.join(x))
            .reset_index(level=['Names', 'conv_id'])
            .reindex(columns=df.columns))
print(new_df)
Names text conv_id
1 tim hi 1234
2 jon hello,how,are you 1234
3 tim hey,i am good 1234
4 pam me too 1234
5 jon great,hows life 1234
Detail:
print(groups)
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 5
8 5
dtype: int64
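The same run-detection idea can be sketched end to end on the question's data, this time joining with a space so the text matches the expected output. The names run_id and merged are made up for this sketch, and it assumes that consecutive rows with the same name always belong to the same conv_id.

```python
import pandas as pd

# Data from the question, typed in as a DataFrame.
df = pd.DataFrame({
    "Names": ["tim", "jon", "jon", "jon", "tim", "tim", "pam", "jon", "jon"],
    "text": ["hi", "hello", "how", "are you", "hey", "i am good",
             "me too", "great", "hows life"],
    "conv_id": [1234] * 9,
})

# A new run starts whenever Names differs from the previous row;
# the cumulative sum of those change flags labels each run.
run_id = df["Names"].ne(df["Names"].shift()).cumsum().rename("run_id")

# Collapse each run into one row: keep the first name and conv_id,
# and join the text fragments with a space.
merged = (
    df.groupby(run_id, sort=False)
    .agg({"Names": "first", "text": " ".join, "conv_id": "first"})
    .reset_index(drop=True)
)
print(merged)
```

If one speaker could continue across two different conversations, adding 'conv_id' to the groupby keys, as the answer above does, keeps those runs separate.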

Using pandas groupby just to drop repeated items

I'm sure this is a basic question, but I am unable to find the correct path here.
Let's suppose a dataframe like this, telling how many fruits each person eats per week:
Name Fruit Amount
1 Jack Lemon 3
2 Mary Banana 6
3 Sophie Lemon 1
4 Sophie Cherry 10
5 Daniel Banana 2
6 Daniel Cherry 4
Let's suppose now that I just want to create a bar plot with matplotlib to show the total amount of each fruit eaten per week in the whole town. To do that, I must group by fruit.
In his book, the author of pandas describes groupby as the first part of a split-apply-combine operation.
So, first of all, groupby transforms the DataFrame into a DataFrameGroupBy object. Then, using a method such as sum, the result is combined into a new DataFrame object. Perfect, I can create my fruit plot now.
But the problem I'm facing is what happens when I do not want to sum, diff, or apply any operation at all to the members of each group. What if I just want to use groupby to keep a DataFrame with only one row per fruit type? (Of course, for an example as simple as this one, I could just get a list of fruits with unique, but that's not the point.)
If I do that, the return value of groupby is a DataFrameGroupBy object, and many operations that work on a DataFrame do not work on a DataFrameGroupBy.
This problem, which I'm sure is pretty simple to avoid, is giving me a lot of headaches. How can I get a DataFrame from groupby without having to apply any aggregation function? Is there a different workaround, without even using groupby, that I'm missing due to being lost in translation?

If you just want one row per group, you can use a combination of groupby + first() + reset_index, which retains the first row of each group:
import pandas as pd
df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 3]})
>>> df.groupby(df.a).first().reset_index()
a b
0 1 1
1 2 3

This bit makes me think this could be the answer you are looking for:
Is there a different workaround without even using groupby
If you just want to drop duplicated rows based on Fruit, .drop_duplicates is the way to go.
df.drop_duplicates(subset='Fruit')
Name Fruit Amount
1 Jack Lemon 3
2 Mary Banana 6
4 Sophie Cherry 10
You have limited control over which rows are preserved (through the keep parameter); see the docstring.
This is faster and more readable than groupby + first.
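To make the keep parameter concrete, here is a small sketch on the question's fruit data. It also shows groupby with as_index=False, which returns an ordinary DataFrame with the per-fruit totals needed for the bar plot. The variable names first_rows, last_rows and totals are made up for this sketch.

```python
import pandas as pd

# Data from the question, with the same 1-based index.
df = pd.DataFrame({
    "Name": ["Jack", "Mary", "Sophie", "Sophie", "Daniel", "Daniel"],
    "Fruit": ["Lemon", "Banana", "Lemon", "Cherry", "Banana", "Cherry"],
    "Amount": [3, 6, 1, 10, 2, 4],
}, index=range(1, 7))

# One row per fruit: keep the first or the last occurrence.
first_rows = df.drop_duplicates(subset="Fruit", keep="first")
last_rows = df.drop_duplicates(subset="Fruit", keep="last")

# Totals per fruit as a plain DataFrame (Fruit stays a regular column).
totals = df.groupby("Fruit", as_index=False)["Amount"].sum()

print(first_rows)
print(last_rows)
print(totals)
```

drop_duplicates keeps the original row order and index labels, while the groupby result is sorted by Fruit and gets a fresh 0-based index.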

IIUC you could use pivot_table, which returns a DataFrame (by default it aggregates each group with the mean):
In [140]: df.pivot_table(index='Fruit')
Out[140]:
Amount
Fruit
Banana 4
Cherry 7
Lemon 2
In [141]: type(df.pivot_table(index='Fruit'))
Out[141]: pandas.core.frame.DataFrame
If you want to keep first element you could define your function and pass it to aggfunc argument:
In [144]: df.pivot_table(index='Fruit', aggfunc=lambda x: x.iloc[0])
Out[144]:
Amount Name
Fruit
Banana 6 Mary
Cherry 10 Sophie
Lemon 3 Jack
If you don't want your Fruit to be an index you could also use reset_index:
In [147]: df.pivot_table(index='Fruit', aggfunc=lambda x: x.iloc[0]).reset_index()
Out[147]:
Fruit Amount Name
0 Banana 6 Mary
1 Cherry 10 Sophie
2 Lemon 3 Jack
