Pandas: sum of every N columns - python

I have dataframe
ID 2016-01 2016-02 ... 2017-01 2017-02 ... 2017-10 2017-11 2017-12
111 12 34 0 12 3 0 0
222 0 32 5 5 0 0 0
I need to sum every 12 columns (one year's worth of months) and get
ID 2016 2017
111 46 15
222 32 10
I tried to use
(df.groupby((np.arange(len(df.columns)) // 31) + 1, axis=1).sum().add_prefix('s'))
but it returns a sum across all of the columns.
But when I try to use
df.groupby['ID']((np.arange(len(df.columns)) // 31) + 1, axis=1).sum().add_prefix('s'))
It returns
TypeError: 'method' object is not subscriptable
How can I fix that?

First, set all non-date columns (here only ID) as the index:
df = df.set_index('ID')
1. groupby on the column names split by '-', taking the first part (the year):
df = df.groupby(df.columns.str.split('-').str[0], axis=1).sum()
2. or use a lambda function for the split:
df = df.groupby(lambda x: x.split('-')[0], axis=1).sum()
3. or convert the columns to datetimes and groupby the years:
df.columns = pd.to_datetime(df.columns)
df = df.groupby(df.columns.year, axis=1).sum()
4. resample by years:
df.columns = pd.to_datetime(df.columns)
df = df.resample('A', axis=1).sum()
df.columns = df.columns.year
print (df)
2016 2017
ID
111 46 15
222 32 10
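Putting option 1 together as a self-contained sketch - the frame below just mirrors the sample from the question, and note that groupby(..., axis=1) is deprecated in recent pandas versions:

import pandas as pd

# sample data mirroring the question (monthly columns for two years)
df = pd.DataFrame({'ID': [111, 222],
                   '2016-01': [12, 0], '2016-02': [34, 32],
                   '2017-01': [0, 5], '2017-02': [12, 5],
                   '2017-10': [3, 0], '2017-11': [0, 0], '2017-12': [0, 0]})

df = df.set_index('ID')
# group the columns by the year part of their label and sum each group
out = df.groupby(df.columns.str.split('-').str[0], axis=1).sum()
print(out)
#      2016  2017
# ID
# 111    46    15
# 222    32    10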

The above code has a slight syntax error and throws the following error:
ValueError: No axis named 1 for object type
Basically, the groupby condition needs to be wrapped in []. So I'm rewriting the code correctly for convenience:
new_df = df.groupby([[i//n for i in range(0,m)]], axis = 1).sum()
where n is the number of columns you want to group together and m is the total number of columns being grouped. You have to rename the columns after that.

If you don't mind losing the labels, you can try this:
new_df = df.groupby([i//n for i in range(0,m)], axis = 1).sum()
As above, n is the number of columns you want to group together and m is the total number of columns being grouped; rename the columns afterwards.

Related

Pandas: Search and match based on two conditions

I am using the code below to search a .csv file, match a column in both files, and grab a different column that I want to add as a new column. However, I am trying to make the match based on two columns instead of one. Is there a way to do this?
import pandas as pd

df1 = pd.read_csv("matchone.csv")
df2 = pd.read_csv("comingfrom.csv")

def lookup_prod(ip):
    for row in df2.itertuples():
        if ip in row[1]:
            return row[3]
        else:
            return '0'

df1['want'] = df1['name'].apply(lookup_prod)
df1[df1.want != '0']
print(df1)
#df1.to_csv('file_name.csv')
The code above matches on the same-named column ('name') in both files and gets the column I request ([3]) from df2. I want the code to match on both the 'name' column and another column, 'price', and to take the value ([3]) only when both columns match in df1 and df2.
df 1 :
name price value
a 10 35
b 10 21
c 10 33
d 10 20
e 10 88
df 2 :
name price want
a 10 123
b 5 222
c 10 944
d 10 104
e 5 213
When the code is run (asking for the want column from df2, matching only on df1 name = df2 name), the produced result is:
name price value want
a 10 35 123
b 10 21 222
c 10 33 944
d 10 20 104
e 10 88 213
However, what I want is: only if both df1 name = df2 name and df1 price = df2 price, take the value from df2 want, so the desired result is:
name price value want
a 10 35 123
b 10 21 0
c 10 33 944
d 10 20 104
e 10 88 0
You need to use the pandas.DataFrame.merge() method with multiple keys:
df1.merge(df2, on=['name','price'], how='left').fillna(0)
The merge represents missing values as NaN, so the want column's dtype changes to float64, but you can cast it back after filling the missing values with 0.
Also please be aware that duplicated combinations of name and price in df2 will appear several times in the result.
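A minimal runnable sketch of this merge on the sample frames from the question (the final int cast just undoes the float64 upcast mentioned above):

import pandas as pd

df1 = pd.DataFrame({'name': list('abcde'), 'price': [10, 10, 10, 10, 10],
                    'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': list('abcde'), 'price': [10, 5, 10, 10, 5],
                    'want': [123, 222, 944, 104, 213]})

out = df1.merge(df2, on=['name', 'price'], how='left').fillna(0)
out['want'] = out['want'].astype(int)  # cast back after the NaN -> 0 fill
print(out)
#   name  price  value  want
# 0    a     10     35   123
# 1    b     10     21     0
# 2    c     10     33   944
# 3    d     10     20   104
# 4    e     10     88     0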
If you are matching the two dataframes based on the name and the price, you can use df.where and df.isin:
df1['want'] = df2['want'].where(df1[['name','price']].isin(df2).all(axis=1)).fillna('0')
df1
name price value want
0 a 10 35 123.0
1 b 10 21 0
2 c 10 33 944.0
3 d 10 20 104.0
4 e 10 88 0
Expanding on https://stackoverflow.com/a/73830294/20110802:
You can add the validate option to the merge in order to avoid duplication on one side (or both):
pd.merge(df1, df2, on=['name','price'], how='left', validate='1:1').fillna(0)
Also, if the float conversion is a problem for you, one option is to do an inner join first and then pd.concat the result with the "leftover" rows of df1, to which you have already added a constant-valued column. It would look something like:
df_inner = pd.merge(df1, df2, on=['name', 'price'], how='inner', validate='1:1')
merged_pairs = set(zip(df_inner.name, df_inner.price))
df_anti = df1.loc[~pd.Series(zip(df1.name, df1.price)).isin(merged_pairs)]
df_anti['want'] = 0
df_result = pd.concat([df_inner, df_anti]) # perhaps ignore_index=True ?
It looks complicated, but it should be quite performant because it filters by set. It might be possible to set name and price as the index, merge on the index, and then filter by index to avoid the zip-set shenanigans, but I'm no expert on MultiIndex handling; a sketch of that idea follows.
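A hypothetical sketch of that index-based variant - not part of the original answer, and it assumes the sample df1/df2 from the question with unique (name, price) pairs in df2:

import pandas as pd

df1 = pd.DataFrame({'name': list('abcde'), 'price': [10, 10, 10, 10, 10],
                    'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': list('abcde'), 'price': [10, 5, 10, 10, 5],
                    'want': [123, 222, 944, 104, 213]})

df1i = df1.set_index(['name', 'price'])
df2i = df2.set_index(['name', 'price'])

df_inner = df1i.join(df2i, how='inner')                          # rows with a match
df_anti = df1i.loc[~df1i.index.isin(df2i.index)].assign(want=0)  # rows without a match
df_result = pd.concat([df_inner, df_anti]).reset_index()
print(df_result)  # want stays integer, no float64 upcast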
Try this code; it will give you the expected results:
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 10, 10, 10, 10],
                    'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 5, 10, 10, 5],
                    'want': [123, 222, 944, 104, 213]})

new = pd.merge(df1, df2, how='left', left_on=['name', 'price'], right_on=['name', 'price'])
print(new.fillna(0))

DataFrame insert row

I have some trouble with my Python work. My steps are:
1) append the list to an ordinary DataFrame as a new row
2) delete the column(s) whose value in the list is the minimum
My list is called 'each_c' and my ordinary DataFrame is called 'df_col'.
I want it to become like this:
Hope someone can help me, thanks!
This is clearly described in the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
df_col.drop(columns=[3])
Convert each_c to a Series, append it with DataFrame.append, then get the index of the minimal value with Series.idxmin and pass it to drop - this removes only the first minimal column:
s = pd.Series(each_c)
df = df_col.append(s, ignore_index=True).drop(s.idxmin(), axis=1)
If you need to remove all minimal columns when there are multiple minimums:
import numpy as np
import pandas as pd

each_c = [-0.025,0.008,-0.308,-0.308]
s = pd.Series(each_c)
df_col = pd.DataFrame(np.random.random((10,4)))
df = df_col.append(s, ignore_index=True)
df = df.loc[:, s.ne(s.min())]
print (df)
0 1
0 0.602312 0.641220
1 0.586233 0.634599
2 0.294047 0.339367
3 0.246470 0.546825
4 0.093003 0.375238
5 0.765421 0.605539
6 0.962440 0.990816
7 0.810420 0.943681
8 0.307483 0.170656
9 0.851870 0.460508
10 -0.025000 0.008000
EDIT: If the solution raises the error:
IndexError: Boolean index has wrong length:
it means the columns are not the default RangeIndex 0,1,2,3. A possible solution is to set the index values of the Series to the column names with rename:
each_c = [-0.025,0.008,-0.308,-0.308]
df_col = pd.DataFrame(np.random.random((10,4)), columns=list('abcd'))
s = pd.Series(each_c).rename(dict(enumerate(df_col.columns)))
df = df_col.append(s, ignore_index=True)
df = df.loc[:, s.ne(s.min())]
print (df)
a b
0 0.321498 0.327755
1 0.514713 0.575802
2 0.866681 0.301447
3 0.068989 0.140084
4 0.069780 0.979451
5 0.629282 0.606209
6 0.032888 0.204491
7 0.248555 0.338516
8 0.270608 0.731319
9 0.732802 0.911920
10 -0.025000 0.008000
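Note that DataFrame.append was removed in pandas 2.0, so on current versions the same idea can be written with pd.concat - a minimal sketch, assuming the labelled example above:

import numpy as np
import pandas as pd

each_c = [-0.025, 0.008, -0.308, -0.308]
df_col = pd.DataFrame(np.random.random((10, 4)), columns=list('abcd'))

s = pd.Series(each_c, index=df_col.columns)                  # align the list with the column labels
df = pd.concat([df_col, s.to_frame().T], ignore_index=True)  # append the list as a new row
df = df.loc[:, s.ne(s.min())]                                # drop every minimal column
print(df)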

I want to replace every value in the age column with its middle value

I have a column that looks like this:
Age
[0-10)
[10-20)
[20-30)
[30-40)
[40-50)
[50-60)
[60-70)
[70-80)
and I want to remove the "[", "-" and ")". Instead of showing the range, such as 0-10, I would like to show the middle value for every row in the column.
Yet another solution:
The dataframe:
df = pd.DataFrame({'Age':['[0-10)','[10-20)','[20-30)','[30-40)','[40-50)','[50-60)','[60-70)','[70-80)']})
df
Age
0 [0-10)
1 [10-20)
2 [20-30)
3 [30-40)
4 [40-50)
5 [50-60)
6 [60-70)
7 [70-80)
The code:
df['Age'] = df.Age.str.extract(r'(\d+)-(\d+)').astype('int').mean(axis=1).astype('int')
The result:
df
Age
0 5
1 15
2 25
3 35
4 45
5 55
6 65
7 75
If you want to explode a row into multiple rows where each row carries a value from the range, you can do this:
data = '''[0-10)
[10-20)
[20-30)
[30-40)
[40-50)
[50-60)
[60-70)
[70-80)'''
df = pd.DataFrame({'Age': data.splitlines()})
df['Age'] = df['Age'].str.extract(r'\[(\d+)-(\d+)\)').astype(int).apply(lambda r: list(range(r[0], r[1])), axis=1)
df.explode('Age')
Note that I assume your Age column is string-typed, so I used extract to get the boundaries of the ranges and converted them to a real list of integers. Finally, exploding your dataframe on the modified Age column gets you a new row for each integer in the list. Values in the other columns are copied accordingly.
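For a quick check of what the explode produces (just the first few rows, assuming the snippet above has run):

print(df.explode('Age').head(3))
#   Age
# 0   0
# 0   1
# 0   2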
I tried this:
import pandas as pd
import re

data = {
    'age_range': [
        '[0-10)',
        '[10-20)',
        '[20-30)',
        '[30-40)',
        '[40-50)',
        '[50-60)',
        '[60-70)',
        '[70-80)',
    ]
}
df = pd.DataFrame(data=data)

def get_middle_age(age_range):
    pattern = r'(\d+)'
    ages = re.findall(pattern, age_range)
    return int((int(ages[0]) + int(ages[1])) / 2)

df['age'] = df.apply(lambda row: get_middle_age(row['age_range']), axis=1)
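Running the snippet above should give a frame like this, with the midpoint computed as the integer average of the two bounds:

print(df)
#   age_range  age
# 0    [0-10)    5
# 1   [10-20)   15
# 2   [20-30)   25
# 3   [30-40)   35
# 4   [40-50)   45
# 5   [50-60)   55
# 6   [60-70)   65
# 7   [70-80)   75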

Convert duplicate rows to independent columns

I have a dataframe that looks like the following:
ID,CUSTOMER_ID,ACC_NUMBER,TRANSACTION_ID,PACK_DESC,PACK_VALIDITY,PACK_NUMBER
1,ABCVRXJ,1027,1248,PackA,30,PACKA-XXXX
2,ABCVRXJ,1029,1249,PackC,32,PACKC-XXXX
3,XUVZ200,1028,12491,PackB,31,PACKB-XXXX
4,XUVZ200,1030,12421,PackD,33,PACKD-XXXX
I want the final dataframe to look something like:
ID,CUSTOMER_ID,ACC_NUMBER,TRANSACTION_ID,PACK_DESC,PACK_VALIDITY,PACK_NUMBER_1,PACK_NUMBER_2
1,ABCVRXJ,1027,1248,PackA,30,PACKA-XXXX,PACKC-XXXX
3,XUVZ200,1028,12491,PackB,31,PACKB-XXXX,PACKD-XXXX
Each CUSTOMER_ID who has opted for 2 packs should be collapsed into a single row, with the two PACK_NUMBERs becoming 2 new columns.
I tried:
df['index'] = df.groupby('CUSTOMER_ID').cumcount()
df_vchrNumber = df.pivot(index='CUSTOMER_ID', columns='index', values='PACK_NUMBER').rename(columns=lambda x: 'PACK_NUMBER_'+str(x + 1))
df_vchrNumber = df_vchrNumber.fillna('').reset_index()
but this returns,
CUSTOMER_ID,PACK_NUMBER_1,PACK_NUMBER_2
0123456789,PACKA-XXXX,PACKC-XXXX
9876543210,PACKB-XXXX,PACKD-XXXX
but this is not the expected output, as I'm not sure how to include the other columns.
Would somebody mind helping me out a bit?
If you need only the first and last value of PACK_NUMBER, use DataFrame.drop_duplicates to get the first row per group, and take the last PACK_NUMBER per group:
s = (df.drop_duplicates('CUSTOMER_ID', keep='last')
       .set_index('CUSTOMER_ID')['PACK_NUMBER']
       .rename('PACK_NUMBER_2'))
df = (df.drop_duplicates('CUSTOMER_ID')
        .rename(columns={'PACK_NUMBER':'PACK_NUMBER_1'})
        .join(s, on='CUSTOMER_ID'))
print (df)
ID CUSTOMER_ID ACC_NUMBER TRANSACTION_ID PACK_DESC PACK_VALIDITY \
0 1 ABCVRXJ 1027 1248 PackA 30
2 3 XUVZ200 1028 12491 PackB 31
PACK_NUMBER_1 PACK_NUMBER_2
0 PACKA-XXXX PACKC-XXXX
2 PACKB-XXXX PACKD-XXXX
Your solution can be fixed by removing duplicates and joining the Series:
df['index'] = df.groupby('CUSTOMER_ID').cumcount()
df_vchrNumber = (df.pivot(index='CUSTOMER_ID', columns='index', values='PACK_NUMBER')
                   .rename(columns=lambda x: 'PACK_NUMBER_' + str(x + 1)))
df = df.drop_duplicates('CUSTOMER_ID').drop('PACK_NUMBER', axis=1).join(df_vchrNumber, on='CUSTOMER_ID')
And if you need to process all columns:
df['index']= df.groupby('CUSTOMER_ID').cumcount() + 1
df = df.set_index(['CUSTOMER_ID', 'index']).unstack()
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
CUSTOMER_ID ID_1 ID_2 ACC_NUMBER_1 ACC_NUMBER_2 TRANSACTION_ID_1 \
0 ABCVRXJ 1 2 1027 1029 1248
1 XUVZ200 3 4 1028 1030 12491
TRANSACTION_ID_2 PACK_DESC_1 PACK_DESC_2 PACK_VALIDITY_1 PACK_VALIDITY_2 \
0 1249 PackA PackC 30 32
1 12421 PackB PackD 31 33
PACK_NUMBER_1 PACK_NUMBER_2
0 PACKA-XXXX PACKC-XXXX
1 PACKB-XXXX PACKD-XXXX
Use groupby with agg to select the first row of each group. Then groupby again and get the last PACK_NUMBER; finally merge the two dataframes together to get the wanted output:
a = df.groupby('CUSTOMER_ID', as_index=False).agg('first')
b = df.groupby('CUSTOMER_ID', as_index=False).agg({'PACK_NUMBER':'last'})
df_final = a.merge(b, on='CUSTOMER_ID', suffixes=['_1', '_2'])
CUSTOMER_ID ID ACC_NUMBER TRANSACTION_ID PACK_DESC PACK_VALIDITY PACK_NUMBER_1 PACK_NUMBER_2
0 ABCVRXJ 1 1027 1248 PackA 30 PACKA-XXXX PACKC-XXXX
1 XUVZ200 3 1028 12491 PackB 31 PACKB-XXXX PACKD-XXXX
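To reproduce any of the answers above, the CSV text from the question can be loaded with StringIO; a small sketch, using the unstack approach from the second answer as an example:

import pandas as pd
from io import StringIO

csv_text = """ID,CUSTOMER_ID,ACC_NUMBER,TRANSACTION_ID,PACK_DESC,PACK_VALIDITY,PACK_NUMBER
1,ABCVRXJ,1027,1248,PackA,30,PACKA-XXXX
2,ABCVRXJ,1029,1249,PackC,32,PACKC-XXXX
3,XUVZ200,1028,12491,PackB,31,PACKB-XXXX
4,XUVZ200,1030,12421,PackD,33,PACKD-XXXX"""
df = pd.read_csv(StringIO(csv_text))

# number the rows within each customer, then spread them into _1/_2 columns
df['index'] = df.groupby('CUSTOMER_ID').cumcount() + 1
wide = df.set_index(['CUSTOMER_ID', 'index']).unstack()
wide.columns = [f'{a}_{b}' for a, b in wide.columns]
print(wide.reset_index())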

Flatten / Remove hierarchical column headers

I have the following dataframe which is the result of doing a groupby + aggregate sum:
df.groupby(['id', 'category']).agg([pd.Series.sum])
supply stock
sum sum
id category
4 abc 161 -0.094
6 sde -76 0.150
23 hgv 64 -0.054
1 wcd -14 0.073
76 jhf -8 -0.057
Because of the groupby and agg, the column headings are now tuples. How do I change the column headings back into single values, i.e. so the column headings are just supply and stock? I just need to get rid of sum from the headings.
If you use sum the "agg function name" won't be created as part of the columns (as a MultiIndex):
df.groupby(['id', 'category']).sum()
To remove them, you can drop the level:
df.columns = df.columns.droplevel(1)
For example:
In [11]: df
Out[11]:
supply stock
sum sum
0 0.501176 0.482497
1 0.442689 0.008664
2 0.885112 0.512066
3 0.724619 0.418720
In [12]: df.columns.droplevel(1)
Out[12]: Index(['supply', 'stock'], dtype='object')
In [13]: df.columns = df.columns.droplevel(1)
In [14]: df
Out[14]:
supply stock
0 0.501176 0.482497
1 0.442689 0.008664
2 0.885112 0.512066
3 0.724619 0.418720
You can explicitly set the columns attribute to whatever you'd like it to be. For example:
>>> df = pd.DataFrame(np.random.random((4, 2)),
...                   columns=pd.MultiIndex.from_arrays([['supply', 'stock'],
...                                                      ['sum', 'sum']]))
>>> df
supply stock
sum sum
0 0.170950 0.314759
1 0.632121 0.147884
2 0.955682 0.127857
3 0.776764 0.318614
>>> df.columns = df.columns.get_level_values(0)
>>> df
supply stock
0 0.170950 0.314759
1 0.632121 0.147884
2 0.955682 0.127857
3 0.776764 0.318614
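If you later aggregate with more than one function per column, dropping a level would create duplicate names; one option in that case - a sketch, not taken from the answers above - is to join the level values instead:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((4, 2)),
                  columns=pd.MultiIndex.from_arrays([['supply', 'stock'],
                                                     ['sum', 'sum']]))

# join both header levels into flat strings such as 'supply_sum'
df.columns = ['_'.join(col) for col in df.columns]
print(df.columns)  # Index(['supply_sum', 'stock_sum'], dtype='object')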
