I am seeking clarity as to why my code cannot access specific column values as dummy variables, using the following example data:
df
shop category subcategory season
date
2013-09-04 abc weddings shoes winter
2013-09-04 def jewelry watches summer
2013-09-05 ghi sports sneakers spring
2013-09-05 jkl jewelry necklaces fall
Here is my basic code:
wedding_df = df[["weddings","winter","summer","spring","fall"]]
I'm using Python 2 in my notebook, so it may well be a version issue and require get_dummies(), but some guidance would be helpful. The idea is to create a dummy DataFrame with binary indicators for whether a row had the weddings category and what season it was.
Here is an example of what I'm looking to achieve:
weddings winter summer spring fall
71654 1.0 0.0 1.0 0.0 0.0
72168 1.0 0.0 1.0 0.0 0.0
72080 1.0 0.0 1.0 0.0 0.0
with corr():
weddings fall spring summer winter
weddings NaN NaN NaN NaN NaN
fall NaN 1.000000 0.054019 -0.331866 -0.012122
spring NaN 0.054019 1.000000 -0.857205 0.072420
summer NaN -0.331866 -0.857205 1.000000 -0.484578
winter NaN -0.012122 0.072420 -0.484578 1.000000
The names you are selecting are cell values, not columns, which is why the original selection fails. You can use get_dummies with prefix and prefix_sep set to empty strings; then you are able to select df[["weddings","winter","summer","spring","fall"]]:
df = pd.get_dummies(df,prefix = '', prefix_sep = '' )
df
abc def ghi jkl jewelry sports weddings necklaces shoes \
date
2013-09-04 1 0 0 0 0 0 1 0 1
2013-09-04 0 1 0 0 1 0 0 0 0
2013-09-05 0 0 1 0 0 1 0 0 0
2013-09-05 0 0 0 1 1 0 0 1 0
sneakers watches fall spring summer winter
date
2013-09-04 0 0 0 0 0 1
2013-09-04 0 1 0 0 1 0
2013-09-05 1 0 0 1 0 0
2013-09-05 0 0 1 0 0 0
Update
pd.get_dummies(df.loc[df['category']=='weddings',['category','season']],prefix = '', prefix_sep = '' )
Out[820]:
weddings winter
date
2013-09-04 1 1
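If you want the season dummies for every row together with a 0/1 weddings indicator, as in the desired output above, here is a minimal sketch (assuming df has the category and season columns shown in the question):

import pandas as pd

wedding_df = pd.get_dummies(df['season'])                            # winter/summer/spring/fall flags
wedding_df['weddings'] = (df['category'] == 'weddings').astype(int)  # 1 if the row is a wedding
wedding_df = wedding_df[['weddings', 'winter', 'summer', 'spring', 'fall']]

Note that in the corr() output shown in the question the weddings column is all NaN, most likely because it was constant (all 1.0) in that subset; correlation is undefined for a constant column.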
The objective is to multiply each column in a Pandas DataFrame by its own constant value.
For example, the columns 'a_b_c', 'dd_ee', 'ff_ff', 'abc', 'devb' are multiplied by the constants 15, 20, 15, 15, 20, respectively.
The constant values and their associated columns are stored in a dict const_val:
const_val = dict(a_b_c=15,
                 dd_ee=20,
                 ff_ff=15,
                 abc=15,
                 devb=20,)
Currently, I am using a for-loop to multiply each column by its associated constant value, as shown in the code below:
for dpair in const_val:
    df[('per_a', dpair)] = df[dpair] * const_val[dpair] / reval
However, I wonder whether there is a more elegant way of doing this.
The full code is provided below
import pandas as pd
import numpy as np

np.random.seed(0)
const_val = dict(a_b_c=15,
                 dd_ee=20,
                 ff_ff=15,
                 abc=15,
                 devb=20,)
df = pd.DataFrame(data=np.random.randint(5, size=(3, 6)),
                  columns=['id', 'a_b_c', 'dd_ee', 'ff_ff', 'abc', 'devb'])
reval = 6
for dpair in const_val:
    df[('per_a', dpair)] = df[dpair] * const_val[dpair] / reval
The expected output is as below
id a_b_c dd_ee ... (per_a, ff_ff) (per_a, abc) (per_a, devb)
0 4 0 3 ... 7.5 7.5 3.333333
1 3 2 4 ... 0.0 0.0 13.333333
2 2 1 0 ... 2.5 2.5 0.000000
Please note that
(per_a, ff_ff) (per_a, abc) (per_a, devb)
are MultiIndex columns; their representation might look different in your environment.
P.S. I am using IntelliJ IDEA.
If you only have numbers in your DataFrame:
out = df.mul(pd.Series(const_val).reindex(df.columns, fill_value=1), axis=1)
If you have a mix of numeric and non-numeric columns:
out = df.select_dtypes('number').mul(pd.Series(const_val), axis=1).combine_first(df)
Update:
out = df.join(df[list(const_val)].mul(pd.Series(const_val), axis=1)
              .div(reval).add_prefix('per_a_'))
Output
id a_b_c dd_ee ff_ff abc devb per_a_a_b_c per_a_dd_ee per_a_ff_ff per_a_abc per_a_devb
0 1 4 3 0 3 0 10.0 10.000000 0.0 7.5 0.0
1 2 3 0 1 3 3 7.5 0.000000 2.5 7.5 10.0
2 3 0 1 1 1 0 0.0 3.333333 2.5 2.5 0.0
Update for multiindex/tuple column headers:
cols = pd.Index(const_val.keys())
mi = pd.MultiIndex.from_product([['per_a'], cols])
df[mi] = df[cols] * pd.Series(const_val) / reval
print(df)
Output:
id a_b_c dd_ee ff_ff abc devb (per_a, a_b_c) (per_a, dd_ee) (per_a, ff_ff) (per_a, abc) (per_a, devb)
0 4 0 3 3 3 1 0.0 10.000000 7.5 7.5 3.333333
1 3 2 4 0 0 4 5.0 13.333333 0.0 0.0 13.333333
2 2 1 0 1 1 0 2.5 0.000000 2.5 2.5 0.000000
Try this, using pandas' intrinsic data alignment to align the data on the index:
cols = pd.Index(const_val.keys())
df[cols + '_per_a'] = df[cols] * pd.Series(const_val) / reval
Output:
id a_b_c dd_ee ff_ff abc devb a_b_c_per_a dd_ee_per_a ff_ff_per_a abc_per_a devb_per_a
0 4 0 3 3 3 1 0.0 10.000000 7.5 7.5 3.333333
1 3 2 4 0 0 4 5.0 13.333333 0.0 0.0 13.333333
2 2 1 0 1 1 0 2.5 0.000000 2.5 2.5 0.000000
df
id a_b_c dd_ee ff_ff abc devb
0 4 0 3 3 3 1
1 3 2 4 0 0 4
2 2 1 0 1 1 0
Convert const_val to a Series:
s = pd.Series(const_val)
s
a_b_c 15
dd_ee 20
ff_ff 15
abc 15
devb 20
dtype: int64
use broadcasting
out = df[['id']].join(df[df.columns[1:]].mul(s))
out
id a_b_c dd_ee ff_ff abc devb
0 4 0 60 45 45 20
1 3 30 80 0 0 80
2 2 15 0 15 15 0
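For completeness, another vectorized variant that reproduces the per_a-prefixed columns from the question (a sketch, assuming the df, const_val and reval defined above; it uses string column names rather than the tuple names in the question's loop):

out = df.assign(**{'per_a_' + k: df[k] * v / reval for k, v in const_val.items()})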
I need to pivot a column in pandas and would greatly appreciate any help.
Input:
ID  Status   Date
1   Online   2022-06-31
1   Offline  2022-07-28
2   Online   2022-08-01
3   Online   2022-07-03
3   None     2022-07-05
4   Offline  2022-05-02
5   Online   2022-04-04
5   Online   2022-04-06
Output: Pivot on Status
ID  Date        Online  Offline  None
1   2022-06-31  1       0        0
1   2022-07-28  0       1        0
2   2022-08-01  1       0        0
3   2022-07-03  1       0        0
3   2022-07-05  1       0        0
4   2022-05-02  0       0        1
5   2022-04-04  1       0        0
5   2022-04-06  1       0        0
Or even better output if I am able to merge the counts for example:
Output: Pivot on Status & merge
ID  Online  Offline  None
1   1       1        0
2   1       0        0
3   2       0        0
4   0       0        1
5   2       0        0
The main issue here is that I won't know the status values (i.e. Offline, Online, None) ahead of time.
I believe doing this in pandas might be easier, given that I don't know in advance the values of the column I want to pivot on.
df.assign(seq=1).pivot_table(index='ID', columns='Status', values='seq', aggfunc='sum').fillna(0)
Status None Offline Online
ID
1 0.0 1.0 1.0
2 0.0 0.0 1.0
3 1.0 0.0 1.0
4 0.0 1.0 0.0
5 0.0 0.0 2.0
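An alternative sketch for the merged-counts output using pd.crosstab (not from the original answer; it assumes the same df with ID and Status columns, where the statuses are strings):

out = pd.crosstab(df['ID'], df['Status'])   # counts occurrences of each status per ID

Because crosstab builds its columns from whatever values appear in Status, you do not need to know the status values in advance.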
I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
[image: piece of the DataFrame]
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you; if not, please provide additional information.
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd

fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])

holidays = df[df["StateHoliday"] != "0"].copy(deep=True)  # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))

look_back = 2  # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
    prev = dt
    holiday_look_back.append((prev, h))
    for i in range(1, look_back + 1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h))

dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"].fillna("0", inplace=True)
print(df)
The column StateHolidayNew should have the info you need to start analyzing your data.
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is group the values between the different letters that represent the holidays, and then use groupby to find the sales for each group. An improvement to this would be to backfill the group numbers with the holiday code that follows them, e.g. group 0.0 would become b_0, which would make it easier to understand which holiday each group precedes, but I am not sure how to do that (see the sketch after the final DataFrame below).
df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0
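As a sketch of the improvement mentioned above (labeling each pre-holiday group with the code of the holiday that follows it), assuming the df and groups column built above, with rows ordered by date:

# Holiday code on holiday rows, NaN elsewhere, backfilled so each group
# picks up the code of the next holiday; rows after the last holiday stay NaN.
holiday_code = df['StateHoliday'].where(df['StateHoliday'] != '0').bfill()
mask = df['StateHoliday'] == '0'
labels = df.loc[mask, 'groups'].astype(float).astype(int).astype(str)
df.loc[mask, 'groups'] = holiday_code[mask] + '_' + labels

This would turn group 0.0 into b_0, group 1.0 into a_1, and so on.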
I am dealing with a multi-index dataframe because I've used pd.pivot_table. There are two levels in my column header.
I am currently processing it and would like to sum two columns together.
I would like to make my code cleaner by processing the df in one chain using .pipe()
What I have come up with is this:
reg_cat =
1 or 0 total_orders year
0 1 2000 2011
1 0 5500 2012
2 1 6000 2013
3 0 1000 2014
4 0 3000 2015
pivot = (
    reg_cat
    .pivot_table(values=['total_orders'], index=['year'], columns=['1 or 0'], aggfunc=np.sum)
    .reset_index()
    .fillna(0)
    .pipe(lambda x: x.assign(total_orders_total = x['total_orders', 0] + x['total_orders', 1]))
)
The output is presented as such:
year total_orders total_orders total_orders_total
1 or 0 0 1
0 2011 0.0 2000.0 2000.0
1 2012 5500.0 0.0 5500.0
2 2013 0.0 6000.0 6000.0
3 2014 1000.0 0.0 1000.0
4 2015 3000.0 0.0 3000.0
How can I insert a 2nd level column name for the column 'total_orders_total' with this method? So that it will look something like this:
year total_orders total_orders total_orders_total
1 or 0 0 1 total
0 2011 0.0 2000.0 2000.0
1 2012 5500.0 0.0 5500.0
2 2013 0.0 6000.0 6000.0
3 2014 1000.0 0.0 1000.0
4 2015 3000.0 0.0 3000.0
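One possible approach (a sketch, not from the original thread) is to assign the new column with a tuple key inside the piped function, so it carries both header levels; .assign cannot take a tuple as a keyword name, hence the small helper:

def add_total(x):
    # A tuple key creates the column under both levels of the MultiIndex header
    x[('total_orders_total', 'total')] = x[('total_orders', 0)] + x[('total_orders', 1)]
    return x

pivot = (
    reg_cat
    .pivot_table(values=['total_orders'], index=['year'], columns=['1 or 0'], aggfunc='sum')
    .reset_index()
    .fillna(0)
    .pipe(add_total)
)

Here aggfunc='sum' is equivalent to np.sum and avoids the numpy import.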
Simplified situation:
I've got a file with list of some countries and I load it to dataframe df.
Then I've got data concerning those countries (and many more) in many .xls files. I try to read each of those files into df_f, subset the data I'm interested in, find the countries from the original file, and, if any of them are present, copy the data into dataframe df.
The problem is that only some of the values are assigned correctly; most of them end up as NaNs (see below).
for filename in os.listdir(os.getcwd()):
    df_f = pd.read_excel(filename, sheetname = 'Data', parse_cols = "D,F,H,J:BS", skiprows = 2, skip_footer = 2)
    df_f = df_f.fillna(0)
    df_ss = [SUBSETTING df_f here]
    countries = df_ss['Country']
    for c in countries:
        if (c in df['Country'].values):
            row_idx = df[df['Country'] == c].index
            df_h = df_ss[quarters][df_ss.Country == c]
            df.loc[row_idx, quarters] = df_h
The result I get is:
Country Q1 2000 Q2 2000 Q3 2000 Q4 2000 Q1 2001 Q2 2001 Q3 2001 \
0 Albania NaN NaN NaN NaN NaN NaN NaN
1 Algeria NaN NaN NaN NaN NaN NaN NaN
2 Argentina NaN NaN NaN NaN NaN NaN NaN
3 Armenia NaN NaN NaN NaN NaN NaN NaN
4 Australia NaN NaN NaN NaN NaN NaN NaN
5 Austria 4547431 5155839 5558963 6079089 6326217 6483130 6547780
6 Azerbaijan NaN NaN NaN NaN NaN NaN NaN
etc...
The loading and subsetting are done correctly and the data is not corrupted - I print df_h on each iteration and it shows regular numbers. The point is that after assigning them to the df dataframe they become NaNs...
Any idea?
EDIT: sample data
df:
Country Country group Population Development coefficient Q1 2000 \
0 Albania group II 2981000 -1 0
1 Algeria group I 39106000 -1 0
2 Argentina group III 42669000 -1 0
3 Armenia group II 3013000 -1 0
4 Australia group IV 23520000 -1 0
5 Austria group IV 8531000 -1 0
6 Azerbaijan group II 9538000 -1 0
7 Bangladesh group I 158513000 -1 0
8 Belarus group III 9470000 -1 0
9 Belgium group III 11200000 -1 0
(...)
Q2 2013 Q3 2013 Q4 2013 Q1 2014 Q2 2014 Q3 2014 Q4 2014 Q1 2015
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0
and df_ss of one of files:
Country Q1 2000 Q2 2000 Q3 2000 Q4 2000 Q1 2001 \
5 Guam 11257 17155 23063 29150 37098
10 Kiribati 323 342 361 380 398
15 Marshall Islands 425 428 433 440 449
17 Micronesia 0 0 0 0 0
19 Nauru 0 0 0 0 0
22 Northern Mariana Islands 2560 3386 4499 6000 8037
27 Palau 1513 1672 1828 1980 2130
(...)
Q3 2013 Q4 2013 Q1 2014 Q2 2014 Q3 2014 Q4 2014 Q1 2015
5 150028 151152 152244 153283 154310 155333 156341
10 19933 20315 20678 21010 21329 21637 21932
15 17536 19160 20827 22508 24253 26057 27904
17 18646 17939 17513 17232 17150 17233 17438
19 7894 8061 8227 8388 8550 8712 8874
22 27915 28198 28481 28753 29028 29304 29578
27 17602 17858 18105 18337 18564 18785 19001
Try setting the values like the following (see this post):
df.ix[quarters,...] = 10
By #joris:
Can you try
df.loc[row_idx, quarters] = df_h.values
for the last line (note the extra .values at the end)?
This one worked, thanks :-)
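For reference, the NaNs come from label alignment: when a DataFrame is assigned via .loc, pandas aligns it on index and column labels, and df_h keeps the row labels of the source file, which do not exist in df, so every cell becomes NaN; .values strips the labels and assigns positionally. A minimal illustration with hypothetical data (not from the thread):

import pandas as pd

df = pd.DataFrame({'Q1 2000': [0, 0]}, index=[0, 1])
src = pd.DataFrame({'Q1 2000': [11257]}, index=[5])   # source row has a different index label

df.loc[[0], ['Q1 2000']] = src          # aligns on labels: 5 is not in [0], so the cell becomes NaN
df.loc[[0], ['Q1 2000']] = src.values   # raw ndarray: assigned positionally, the cell becomes 11257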