Python Pandas Group by

I have the code below:
import pandas as pd
Orders = pd.read_excel (r"C:\Users\Bharath Shana\Desktop\Python\Sample.xls", sheet_name='Orders')
Returns = pd.read_excel (r"C:\Users\Bharath Shana\Desktop\Python\Sample.xls", sheet_name='Returns')
Sum_value = pd.DataFrame(Orders['Sales']).sum
Orders_Year = pd.DatetimeIndex(Orders['Order Date']).year
Orders.merge(Returns, how="inner", on="Order ID")
which gives the output below.
My requirement is to use groupby so that I see the output as below.
Can someone please help me use groupby in my code above? I would like to see everything in a single line by using groupby.
Regards,
Bharath

You can do this by selecting the columns and assigning them to a new dataframe:
grouped = pd.DataFrame()
groupby = ['Year','Segment','Sales']
for i in groupby:
    grouped[i] = Orders[i]
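If the goal is actually to aggregate, a groupby may be closer to what you want. A minimal sketch, assuming the aim is total Sales per Year and Segment (the Year and Segment column names are taken from the snippet above):
# assumption: Orders is the dataframe read from the Orders sheet above
Orders['Year'] = pd.DatetimeIndex(Orders['Order Date']).year
summary = Orders.groupby(['Year', 'Segment'], as_index=False)['Sales'].sum()
print(summary)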

Related

Pandas rank values within groupby, starting a new rank if diff is greater than 1

I have a sample dataframe as follows:
data = {'Store': [1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
        'Week': [1,2,3,4,5,6,19,20,21,22,1,2,50,51,52,60,61,62,70,71,72,73]}
df = pd.DataFrame.from_dict(data)
df['WeekDiff'] = df.groupby('Store')['Week'].diff().fillna(1)
I added a difference column to find the gaps in the Week column within my data.
I have been trying to group by Store and somehow use the difference column to achieve the output below, but with no success. I need the ranks to start over at each occurrence of a value greater than one and continue until the next such value. Please see a sample of the output I'd like to achieve.
result_data = {'Store': [1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
               'Week': [1,2,3,4,5,6,19,20,21,22,1,2,50,51,52,60,61,62,70,71,72,73],
               'Rank': [1,1,1,1,1,1,2,2,2,2,1,1,2,2,2,3,3,3,4,4,4,4]}
I am new to Python and pandas and I've been trying to google this all day, but couldn't find a solution. Could you please help me do this?
Thank you in advance!
You could try as follows:
import pandas as pd
data = {'Store': [1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
        'Week': [1,2,3,4,5,6,19,20,21,22,1,2,50,51,52,60,61,62,70,71,72,73]}
df = pd.DataFrame(data)
df['Rank'] = df.groupby('Store')['Week'].diff()>1
df['Rank'] = df.groupby('Store')['Rank'].cumsum().add(1)
# check with expected output:
result_data = {'Store': [1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
               'Week': [1,2,3,4,5,6,19,20,21,22,1,2,50,51,52,60,61,62,70,71,72,73],
               'Rank': [1,1,1,1,1,1,2,2,2,2,1,1,2,2,2,3,3,3,4,4,4,4]}
result_df = pd.DataFrame(result_data)
df.equals(result_df)
# True
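The idea: within each store, a gap (a diff greater than 1) marks the start of a new block, so the cumulative count of gaps per store, plus one, gives the rank.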
Or as a (lengthy) one-liner:
df['Rank'] = df.set_index('Store').groupby(level=0)\
.agg(Rank=('Week','diff')).gt(1).groupby(level=0)\
.cumsum().add(1).reset_index(drop=True)

Fuzzy matching two dataframes and joining on result

I'm trying to join two dataframes on string columns that are not identical. I realise this has been asked a lot, but I am struggling to find anything relevant to my need. The code I have is as follows:
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
xls = pd.ExcelFile(filename)
df_1 = pd.read_excel(xls, sheet_name="Sheet 1")
df_2 = pd.read_excel(xls, sheet_name="Sheet 2")
df_2['key'] = df_2['Name'].apply(lambda x : [process.extract(x, df_1['Name'], limit=1)][0][0][0])
The idea would then be to join the two dataframes based on df_2['key']. However, when I run this code it completes but does not return anything. The dataframe sizes are as follows: df_1 (3366, 8) and df_2 (1771, 6).
Is there a better way to do this?
This code returns nothing because that is exactly what it should do:
df_2['key'] = ... just adds a 'key' column to the df_2 dataframe.
If you want to merge dataframes, your code should look similar to this:
name_list_1 = df_1['Name'].tolist()
name_list_2 = df_2['Name'].tolist()
matches = list(map(lambda x: process.extractOne(
    x, name_list_1, scorer=fuzz.token_set_ratio)[:2], name_list_2))
df_keys = pd.DataFrame(matches, columns=['key', 'score'])
df_2 = pd.merge(df_2, df_keys, left_index=True, right_index=True)
df_2 = df_2[df_2['score'] > 70]
df_3 = pd.merge(df_1, df_2, left_on='Name', right_on='key', how='outer')
print(df_3)
I use the extractOne method, which I think better suits your situation. It is important to experiment with the scorer parameter, as it heavily affects the matching result.
You could instead use process.extractOne(). Your code would then look like:
name_list_1 = df_1['Name'].tolist()
name_list_2 = df_2['Name'].tolist()
df_2['key'] = list(map(lambda x: process.extractOne(x, name_list_1)[0], name_list_2))
Then you can make the join on the key column, as sketched below.
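For instance (a sketch, assuming the matched names end up in df_2['key'] as above, similar to the merge in the previous answer):
df_3 = pd.merge(df_1, df_2, left_on='Name', right_on='key', how='inner')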

How to append a dataframe without overwriting the existing dataframe using a for loop in Python

I have an empty dataframe and want to append additional dataframes to it in a for loop without overwriting what is already there. The regular append approach overwrites the existing dataframe, so the output shows only the last appended dataframe.
Use concat() from the pandas module:
import pandas as pd
df_new = pd.concat([df_empty, df_additional])
Read more about it in the pandas docs.
Regarding the question in the comment:
df = pd.DataFrame(columns=[...])  # same columns as the dataframes you will append
for i in range(10):
    df_new = function_to_get_df_new()
    df = pd.concat([df, df_new])
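If the loop runs many times, a common alternative is to collect the pieces in a list and concatenate once at the end, which avoids repeatedly copying the growing dataframe. A minimal sketch, reusing the hypothetical function_to_get_df_new() from above:
pieces = []
for i in range(10):
    pieces.append(function_to_get_df_new())
df = pd.concat(pieces, ignore_index=True)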
Suppose you have a list of dataframes, list_of_df = [df1, df2, df3], and an empty dataframe df = pd.DataFrame().
If you want to append all the dataframes in the list to that empty dataframe df:
for i in list_of_df:
    df = df.append(i)
The loop above will not change df1, df2, df3, but df will change.
Note that df.append(df1) on its own does not change df; you must assign the result back, i.e. df = df.append(df1). (Also note that DataFrame.append was deprecated and removed in pandas 2.0, so pd.concat is preferred on newer versions.)
Note that you cannot pass a set:
df_new = pd.concat({df_empty, df_additional})
This fails because pandas.DataFrame objects are not hashable, and sets require hashable elements. A tuple works, though:
df_new = pd.concat((df_empty, df_additional))
and may be marginally quicker than a list.
Update for the for loop:
df = pd.DataFrame(data)
for i in range(your_number):
    df_new = function_to_get_df_new()
    df = pd.concat((df, df_new))  # a tuple or a list works here; a set would not
The question is already well answered; my five cents is the suggestion to use the ignore_index=True option to get a continuous new index rather than duplicating the old ones.
import pandas as pd
df_to_append = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB')) # sample
df = pd.DataFrame() # this is a placeholder for the destination
for i in range(3):
    df = df.append(df_to_append, ignore_index=True)
I don't think you need a for loop here; just try concat():
import pandas
result = pandas.concat([emptydf, additionaldf])
See the pandas.concat documentation.

Extracting specific columns from pandas.dataframe

I'm trying to use Python to read my csv file, extract specific columns into a pandas DataFrame, and show that DataFrame. However, I don't see the DataFrame; I receive Series([], dtype: object) as the output. Below is the code that I'm working with:
My document consists of the following columns:
product, sub_product, issue, sub_issue, consumer_complaint_narrative, company_public_response, company, state, zipcode, tags, consumer_consent_provided, submitted_via, date_sent_to_company, company_response_to_consumer, timely_response, consumer_disputed?, complaint_id
I want to extract:
sub_product, issue, sub_issue, consumer_complaint_narrative
import pandas as pd
df=pd.read_csv("C:\\....\\consumer_complaints.csv")
df=df.stack(level=0)
df2 = df.filter(regex='[B-F]')
df[df2]
import pandas as pd
input_file = "C:\\....\\consumer_complaints.csv"
df = pd.read_csv(input_file)  # read_csv already returns a DataFrame
cols = [1, 2, 3, 4]
df = df[df.columns[cols]]
In cols, specify the positions of the columns you want to select; column numbering starts at index 0.
You can also select columns by name. Just use the following line:
df = df[["Column Name", "Column Name2"]]
A simple way to achieve this is label-based slicing with .loc, from the first to the last column name you need:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df2 = df.loc[:, 'sub_product':'consumer_complaint_narrative']
Hope that helps.
This worked for me, using positional slicing with iloc:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df1 = df.iloc[:, n1:n2]
where n1 < n2 are the positions bounding the range, e.g. for the columns at positions 3 and 4, use
df1 = df.iloc[:, 3:5]
For the first column, use
df1 = df.iloc[:, 0]
Though I'm not sure how to select a discontinuous range of columns this way.
We can also use iloc. Given data in dataset2:
dataset2.iloc[:3, [1, 2]]
will return the top 3 rows of the second and third columns (remember that numbering starts at 0).

How to filter pandas dataframe columns by partial label

I am trying to filter pandas dataframe columns (with type pandas.core.index.Index) by a partial label.
I am searching for a builtin method that achieve the same result as:
partial_label = 'partial_lab'
columns = df.columns
columns = [c for c in columns if c.startswith(partial_label)]
df = df[columns]
Is there anything built-in to obtain this?
Thanks
Possible solutions:
df.filter(regex='^partial_lab')
or
idx = df.columns.to_series().str.startswith('partial_lab')
df.loc[:, idx]
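On recent pandas versions the .str accessor also works directly on the column Index, so the to_series() step can be skipped. A small sketch of the same idea:
df.loc[:, df.columns.str.startswith('partial_lab')]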
