pandas groupby using dictionary values, applying sum - python

I have a defaultdict:
dd = defaultdict(list,
                 {'Tech': ['AAPL', 'GOOGL'],
                  'Disc': ['AMZN', 'NKE']})
and a dataframe that looks like this:
AAPL AMZN GOOGL NKE
1/1/10 100 200 500 200
1/2/10 100 200 500 200
1/3/10 100 200 500 200
and the output I'd like is to SUM the dataframe based on the values of the dictionary, with the keys as the columns:
TECH DISC
1/1/10 600 400
1/2/10 600 400
1/3/10 600 400
The pandas groupby documentation says it does this if you pass a dictionary, but all I end up with is an empty df using this code:
df.groupby(by=dd).sum() ##returns empty df

Create the dict the right way first; then you can pass it to by with axis=1:
# map each company to industry
dd_rev = {w: k for k, v in dd.items() for w in v}
# {'AAPL': 'Tech', 'GOOGL': 'Tech', 'AMZN': 'Disc', 'NKE': 'Disc'}
# group along columns
df.groupby(by=dd_rev,axis=1).sum()
Out[160]:
Disc Tech
1/1/10 400 600
1/2/10 400 600
1/3/10 400 600
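Note: on newer pandas releases (2.x) groupby(..., axis=1) has been deprecated, so if that call warns or fails for you, a small sketch of an equivalent column-wise grouping (assuming the same dd_rev mapping) is to transpose, group the former columns, and transpose back:
# same result without axis=1: group the transposed frame's index, then transpose back
df.T.groupby(dd_rev).sum().T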

You can create a new dataframe from the defaultdict with a dictionary comprehension in one line:
pd.DataFrame({x: df[dd[x]].sum(axis=1) for x in dd})
# output:
Disc Tech
1/1/10 400 600
1/2/10 400 600
1/3/10 400 600
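For a quick, self-contained check, here is a minimal sketch that rebuilds the frame and the defaultdict from the question and applies the dictionary comprehension above:
import pandas as pd
from collections import defaultdict

dd = defaultdict(list, {'Tech': ['AAPL', 'GOOGL'], 'Disc': ['AMZN', 'NKE']})
df = pd.DataFrame({'AAPL': [100, 100, 100], 'AMZN': [200, 200, 200],
                   'GOOGL': [500, 500, 500], 'NKE': [200, 200, 200]},
                  index=['1/1/10', '1/2/10', '1/3/10'])

# sum the columns listed under each dictionary key
print(pd.DataFrame({k: df[v].sum(axis=1) for k, v in dd.items()}))
# Tech = 600 and Disc = 400 on every row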

Related

How do I convert multiple columns to individual rows/values if there is no other unique column in pandas?

Following is the example: a single-row frame with eight columns (shown here as column → value):
Rates, values 2019Q01   100
Rates, values 2019Q02   150
Rates, values 2019Q03   200
Rates, values 2019Q04   300
Sales, values 2019Q01   400
Sales, values 2019Q02   450
Sales, values 2019Q03   500
Sales, values 2019Q04   600
Resultant should be:
Period    Rates, values    Sales, values
2019Q01   100              400
2019Q02   150              450
2019Q03   200              500
2019Q04   300              600
I've tried melt and wide_to_long but was unable to get the result. Thanks.
Try it via the columns attribute and then stack():
df.columns = df.columns.str.replace('values', '').str.split(', ', expand=True)
df = df.stack().droplevel(0).rename_axis(index='Period').add_suffix(', values').reset_index()
Or, as suggested by @Cytorak:
df.columns = df.columns.str.rsplit(' ', n=1, expand=True)
df = df.stack().droplevel(0).rename_axis(index='Period').reset_index()
output of df:
Period Rates, values Sales, values
0 2019Q01 100 400
1 2019Q02 150 450
2 2019Q03 200 500
3 2019Q04 300 600
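Since you mentioned melt: a sketch of a melt-based alternative, starting from the original wide frame (before the column renaming above), in case you prefer long-form reshaping over stack():
# melt to long form, split each column name into measure and period, then pivot back
long_df = df.melt(var_name='col', value_name='val')
long_df[['measure', 'Period']] = long_df['col'].str.rsplit(' ', n=1, expand=True)
out = long_df.pivot(index='Period', columns='measure', values='val').reset_index()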

How to get values in a pandas dataframe equivalent to the LIKE operator in SQL?

Can I get specific column values in a dataframe, like the SQL LIKE operator that can find any values, and then count those values to store them in a new column? Here is the code for my dataframe:
import pandas as pd
dataku = pd.DataFrame()
dataku['CIF'] = ['789', '290', '789', '789','290']
dataku['NAMA'] = ['de','ra','de','de','ra']
dataku['SALDO'] = [100,500,800,200,500]
dataku['PRODUK'] = ['tabungan','deposito','deposito','tabungan','deposito usd']
dataku.groupby(['CIF','NAMA','PRODUK']).agg({'SALDO':'sum', 'PRODUK':'count'}).rename(columns={'SALDO':'TOTAL SALDO','PRODUK':'TOTAL PRODUK'})
The result I want for the new dataframe is like this:
CIF NAMA PRODUK TOTAL_SALDO TOTAL_PRODUK GT_SALDO GT_PRODUK
290 ra deposito 500 1 1000 2
deposito usd 500 1
789 de tabungan 300 2 300 2
deposito 800 1 800 1
How can I get the values of the GT_SALDO and GT_PRODUK columns like the table above as the final result?
I am not entirely sure this is what you want, but you can group by parts of the strings stored in a column. For example, this is your original groupby, stored in df1:
df1 = dataku.groupby(['CIF','NAMA','PRODUK']).agg({'SALDO':'sum', 'PRODUK':'count'}).rename(columns={'SALDO':'TOTAL SALDO','PRODUK':'TOTAL PRODUK'})
This is a groupby that only uses the first 8 characters of the 'PRODUK' column:
df2 = dataku.groupby(['CIF','NAMA',dataku['PRODUK'].str.slice(stop=8)]).agg({'SALDO':'sum', 'PRODUK':'count'}).rename(columns={'SALDO':'GT_SALDO','PRODUK':'GT_PRODUK'})
df2 looks like this
GT_SALDO GT_PRODUK
CIF NAMA PRODUK
290 ra deposito 1000 2
789 de deposito 800 1
tabungan 300 2
You can join the two to get something that looks like your desired output:
df1.join(df2)
produces
TOTAL SALDO TOTAL PRODUK GT_SALDO GT_PRODUK
CIF NAMA PRODUK
290 ra deposito 500 1 1000.0 2.0
deposito usd 500 1 NaN NaN
789 de deposito 800 1 800.0 1.0
tabungan 300 2 300.0 2.0
You can fillna the NaNs if they bother you.
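On the LIKE part of the question: the closest pandas analogues of SQL's LIKE are the vectorised string methods, e.g. str.startswith and str.contains, which can also serve as masks or group keys. A small sketch with hypothetical patterns:
# rows whose PRODUK matches  LIKE 'deposito%'
dataku[dataku['PRODUK'].str.startswith('deposito')]
# rows whose PRODUK matches  LIKE '%usd%'
dataku[dataku['PRODUK'].str.contains('usd')]
# grouping by the first word instead of the first 8 characters
dataku.groupby(dataku['PRODUK'].str.split().str[0])['SALDO'].sum()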

Pandas Relative Time Pivot

I have the last eight months of my customers' data; however, these are not the same calendar months, just the last months each customer happened to be with us. Monthly fees and penalties are stored in rows, but I want each of the last eight months to be a column.
What I have:
Customer Amount Penalties Month
123 500 200 1/7/2017
123 400 100 1/6/2017
...
213 300 150 1/4/2015
213 200 400 1/3/2015
What I want:
Customer Month-8-Amount Month-7-Amount ... Month-1-Amount Month-1-Penalties ...
123 500 400 450 300
213 900 250 300 200
...
What I've tried:
df = df.pivot(index=num, columns=[amount,penalties])
I got this error:
ValueError: all arrays must be same length
Is there some ideal way to do this?
You can do it with unstack and set_index:
# assuming the rows are already sorted with the most recent month first, number them per customer
df['Month'] = df.groupby('Customer').cumcount() + 1
# keep only the most recent 8 months
df = df.loc[df.Month <= 8, :]
# unstack to reshape the frame
s = df.set_index(['Customer', 'Month']).unstack().sort_index(level=1, axis=1)
# flatten the MultiIndex columns into a single level
s.columns = s.columns.map('{0[0]}-{0[1]}'.format)
s.add_prefix("Month-")
Out[189]:
Month-Amount-1 Month-Penalties-1 Month-Amount-2 Month-Penalties-2
Customer
123 500 200 400 100
213 300 150 200 400
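One caveat: cumcount only makes the most recent month come out as Month-1 if the rows are already ordered newest-first within each customer. If they are not, a sketch of the sort that would need to happen before the cumcount step above (assuming the Month strings parse as day-first dates, which may not match your data):
# make sure the newest month comes first so cumcount() + 1 == 1 for the latest row
df['Month'] = pd.to_datetime(df['Month'], dayfirst=True)
df = df.sort_values(['Customer', 'Month'], ascending=[True, False])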

Python Groupby and Plot

With the following groupby, how can I ultimately group the data so that I can plot the price (x-axis) and size (y-axis) while iterating through every symbol and exchange? Thanks.
df_group = df.groupby(['symbol','exchange','price'])["size"].sum()
symbol exchange price
AAPL ARCA 154.630 800
154.640 641
154.650 100
154.660 300
154.670 400
154.675 100
154.680 300
154.690 1390
154.695 100
154.700 360
154.705 100
154.710 671
154.720 190
154.725 100
154.730 400
...
XOM PSX 80.67 1300
80.68 2721
80.69 1901
80.7 700
80.71 800
80.72 200
80.73 700
80.74 500
80.75 600
80.76 300
80.77 900
80.78 100
80.79 1000
80.8 1000
You can use aggregate functions:
fun = {'symbol': {'size': 'count'}}
df_group = df.groupby(['symbol', 'exchange', 'price']).agg(fun).reset_index()
df_group.columns = df_group.columns.droplevel(1)
df_group
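Note that the nested-dict renaming form of agg has been removed in recent pandas, so if the above errors for you, here is a minimal sketch that groups with as_index=False and then iterates to plot price (x) against total size (y), assuming df has symbol, exchange, price and size columns as in the question:
import matplotlib.pyplot as plt

grouped = df.groupby(['symbol', 'exchange', 'price'], as_index=False)['size'].sum()
for (sym, exch), g in grouped.groupby(['symbol', 'exchange']):
    plt.figure()
    plt.plot(g['price'], g['size'], marker='o')
    plt.title(f'{sym} / {exch}')
    plt.xlabel('price')
    plt.ylabel('size')
plt.show()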

Pandas Column mathematical operations No error no answer

I am trying to perform some simple mathematical operations on the files.
The columns in the first file below (master_ids.csv) are dynamic in nature; the number of columns will increase from time to time, so we cannot rely on a fixed last column.
master_ids.csv : Before any pre-processing
Ids,ref0 #the columns increase dynamically
1234,1000
8435,5243
2341,563
7352,345
master_count.csv : Before any processing
Ids,Name,lat,lon,ref1
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,100
8435,Berlin,50.5,20.2,200
2341,Austria,60.6,30.3,500
7352,China,70.7,80.8,300
master_Ids.csv : after one pre-processing
Ids,ref,00:30:00
1234,1000,500
8435,5243,300
2341,563,400
7352,345,500
master_count.csv: expected Output (Append/merge)
Ids,Name,lat,lon,ref1,00:30:00
1234,London,40.4,10.1,500,750
8435,Paris,50.5,20.2,400,550
2341,NewYork,60.6,30.3,700,900
7352,Japan,70.7,80.8,500,750
1234,Prague,40.4,10.1,100,350
8435,Berlin,50.5,20.2,200,350
2341,Austria,60.6,30.3,500,700
7352,China,70.7,80.8,300,750
E.g. Ids 1234 appears 2 times, so its value at the current time (00:30:00), which is 500, is divided by the count of occurrences (500 / 2 = 250) and then added to the corresponding ref1 values (London: 500 + 250 = 750, Prague: 100 + 250 = 350), creating a new column for the current time.
master_Ids.csv : After another pre-processing
Ids,ref,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600
master_count.csv: expected output after another execution (Merge/append)
Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750,550
8435,Paris,50.5,20.2,400,550,500
2341,NewYork,60.6,30.3,700,900,900
7352,Japan,70.7,80.8,500,750,800
1234,Prague,40.4,10.1,100,350,150
8435,Berlin,50.5,20.2,200,350,300
2341,Austria,60.6,30.3,500,700,700
7352,China,70.7,80.8,300,750,600
So here the current time is 00:45:00, and we divide the current-time value by the count of Ids occurrences and then add it to the corresponding ref1 values, creating a new column for the new current time.
Program: By Jianxun Li
import pandas as pd
import numpy as np

csv_file1 = '/Data_repository/master_ids.csv'
csv_file2 = '/Data_repository/master_count.csv'
df1 = pd.read_csv(csv_file1).set_index('Ids')
# need to sort the index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()
# df1 and df2 have a duplicated column 00:00:00, so use df1 without its 1st column
temp = df2.join(df1.iloc[:, 1:])

# do the division by the number of occurrences of each Ids
# and add the result to every time-series column
def my_func(group):
    num_obs = len(group)
    # process the columns from the next time series onwards (inclusive)
    group.iloc[:, 4:] = (group.iloc[:, 4:] / num_obs).add(group.iloc[:, 3], axis=0)
    return group

result = temp.groupby(level='Ids').apply(my_func)
The program executes with no errors but also no output. Need some fixing suggestions, please.
This program assumes updating of both master_counts.csv and master_ids.csv over time and should be robust to the timing of the updates. That is, it should produce correct results if run multiple times on the same update or if an update is missed.
# this program updates (and replaces) the original master_counts.csv with data
# in master_ids.csv, so we only want the first 5 columns when we read it in
master_counts = pd.read_csv('master_counts.csv').iloc[:,:5]
# this file is assumed to be periodically updated with the addition of new columns
master_ids = pd.read_csv('master_ids.csv')
for i in range(2, len(master_ids.columns)):
    master_counts = master_counts.merge(master_ids.iloc[:, [0, i]], on='Ids')
    count = master_counts.groupby('Ids')['ref1'].transform('count')
    master_counts.iloc[:, -1] = master_counts['ref1'] + master_counts.iloc[:, -1] / count

master_counts.to_csv('master_counts.csv', index=False)
%more master_counts.csv
Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750.0,550.0
1234,Prague,40.4,10.1,100,350.0,150.0
8435,Paris,50.5,20.2,400,550.0,500.0
8435,Berlin,50.5,20.2,200,350.0,300.0
2341,NewYork,60.6,30.3,700,900.0,900.0
2341,Austria,60.6,30.3,500,700.0,700.0
7352,Japan,70.7,80.8,500,750.0,800.0
7352,China,70.7,80.8,300,550.0,600.0
import pandas as pd
import numpy as np
csv_file1 = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/master_lac_Test.csv'
csv_file2 = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/lat_lon_master.csv'
df1 = pd.read_csv(csv_file1).set_index('Ids')
Out[53]:
00:00:00 00:30:00 00:45:00
Ids
1234 1000 500 100
8435 5243 300 200
2341 563 400 400
7352 345 500 600
# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()
Out[81]:
Name lat lon 00:00:00
Ids
1234 London 40.4 10.1 500
1234 Prague 40.4 10.1 500
2341 NewYork 60.6 30.3 700
2341 Austria 60.6 30.3 700
7352 Japan 70.7 80.8 500
7352 China 70.7 80.8 500
8435 Paris 50.5 20.2 400
8435 Berlin 50.5 20.2 400
# df1 and df2 have a duplicated column 00:00:00, so use df1 without its 1st column
temp = df2.join(df1.iloc[:, 1:])
Out[55]:
Name lat lon 00:00:00 00:30:00 00:45:00
Ids
1234 London 40.4 10.1 500 500 100
1234 Prague 40.4 10.1 500 500 100
2341 NewYork 60.6 30.3 700 400 400
2341 Austria 60.6 30.3 700 400 400
7352 Japan 70.7 80.8 500 500 600
7352 China 70.7 80.8 500 500 600
8435 Paris 50.5 20.2 400 300 200
8435 Berlin 50.5 20.2 400 300 200
# do the division by the number of occurrences of each Ids
# and add column 00:00:00
def my_func(group):
    num_obs = len(group)
    # process the columns from 00:30:00 onwards (inclusive)
    group.iloc[:, 4:] = (group.iloc[:, 4:] / num_obs).add(group.iloc[:, 3], axis=0)
    return group

result = temp.groupby(level='Ids').apply(my_func)
Out[104]:
Name lat lon 00:00:00 00:30:00 00:45:00
Ids
1234 London 40.4 10.1 500 750 550
1234 Prague 40.4 10.1 500 750 550
2341 NewYork 60.6 30.3 700 900 900
2341 Austria 60.6 30.3 700 900 900
7352 Japan 70.7 80.8 500 750 800
7352 China 70.7 80.8 500 750 800
8435 Paris 50.5 20.2 400 550 500
8435 Berlin 50.5 20.2 400 550 500
My suggestion is to reformat your data so that it's like this:
Ids,ref0,current_time,ref1
1234,1000,None,None
8435,5243,None,None
2341,563,None,None
7352,345,None,None
Then after your "first preprocess" it will become like this:
Ids,ref0,time,ref1
1234,1000,None,None
8435,5243,None,None
2341,563,None,None
7352,345,None,None
1234,1000,00:30:00,500
8435,5243,00:30:00,300
2341,563,00:30:00,400
7352,345,00:30:00,500
. . . and so on. The idea is that you should make a single column to hold the time information, and then for each preprocess, insert the new data into new rows, and give those rows a value in the time column indicating what time period they come from. You may or may not want to keep the initial rows with "None" in this table; maybe you just want to start with the "00:30:00" values and keep the "master ids" in a separate file.
I haven't totally followed exactly how you're computing the new ref1 values, but the point is that doing this is likely to greatly simplify your life. In general, instead of adding an unbounded number of new columns, it can be much nicer to add a single new column whose values will then be the values you were going to use as headers for the open-ended new columns.
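A sketch of what that long format could look like for the ids file, assuming the wide layout shown earlier in the question and the column names of the proposed table (names are illustrative):
import pandas as pd

master_ids = pd.read_csv('master_ids.csv')   # wide: Ids, ref0, 00:30:00, 00:45:00, ...
long_ids = master_ids.melt(id_vars=['Ids', 'ref0'],
                           var_name='time', value_name='ref1')
# one row per (Ids, time) pair instead of one new column per time period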
