Simplified situation:
I've got a file with a list of countries and I load it into dataframe df.
Then I've got data concerning those countries (and many more) in many .xls files. I try to read each of those files into df_f, subset the data I'm interested in, find the countries from the original file and, if any of them is present, copy the data into dataframe df.
The problem is that only some of the values are assigned correctly. Most of them are inserted as NaNs (see below):
import os
import pandas as pd

for filename in os.listdir(os.getcwd()):
    df_f = pd.read_excel(filename, sheetname='Data', parse_cols="D,F,H,J:BS",
                         skiprows=2, skip_footer=2)
    df_f = df_f.fillna(0)
    df_ss = [SUBSETTING df_f here]
    countries = df_ss['Country']
    for c in countries:
        if c in df['Country'].values:
            row_idx = df[df['Country'] == c].index          # row of that country in df
            df_h = df_ss[quarters][df_ss.Country == c]      # quarterly values for that country
            df.loc[row_idx, quarters] = df_h
The result I get is:
Country Q1 2000 Q2 2000 Q3 2000 Q4 2000 Q1 2001 Q2 2001 Q3 2001 \
0 Albania NaN NaN NaN NaN NaN NaN NaN
1 Algeria NaN NaN NaN NaN NaN NaN NaN
2 Argentina NaN NaN NaN NaN NaN NaN NaN
3 Armenia NaN NaN NaN NaN NaN NaN NaN
4 Australia NaN NaN NaN NaN NaN NaN NaN
5 Austria 4547431 5155839 5558963 6079089 6326217 6483130 6547780
6 Azerbaijan NaN NaN NaN NaN NaN NaN NaN
etc...
The loading and subsetting are done correctly and the data is not corrupted: I print df_h on each iteration and it shows regular numbers. The point is that after assigning them to the df dataframe they become NaNs...
Any idea?
EDIT: sample data
df:
Country Country group Population Development coefficient Q1 2000 \
0 Albania group II 2981000 -1 0
1 Algeria group I 39106000 -1 0
2 Argentina group III 42669000 -1 0
3 Armenia group II 3013000 -1 0
4 Australia group IV 23520000 -1 0
5 Austria group IV 8531000 -1 0
6 Azerbaijan group II 9538000 -1 0
7 Bangladesh group I 158513000 -1 0
8 Belarus group III 9470000 -1 0
9 Belgium group III 11200000 -1 0
(...)
Q2 2013 Q3 2013 Q4 2013 Q1 2014 Q2 2014 Q3 2014 Q4 2014 Q1 2015
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0
and df_ss of one of files:
Country Q1 2000 Q2 2000 Q3 2000 Q4 2000 Q1 2001 \
5 Guam 11257 17155 23063 29150 37098
10 Kiribati 323 342 361 380 398
15 Marshall Islands 425 428 433 440 449
17 Micronesia 0 0 0 0 0
19 Nauru 0 0 0 0 0
22 Northern Mariana Islands 2560 3386 4499 6000 8037
27 Palau 1513 1672 1828 1980 2130
(...)
Q3 2013 Q4 2013 Q1 2014 Q2 2014 Q3 2014 Q4 2014 Q1 2015
5 150028 151152 152244 153283 154310 155333 156341
10 19933 20315 20678 21010 21329 21637 21932
15 17536 19160 20827 22508 24253 26057 27904
17 18646 17939 17513 17232 17150 17233 17438
19 7894 8061 8227 8388 8550 8712 8874
22 27915 28198 28481 28753 29028 29304 29578
27 17602 17858 18105 18337 18564 18785 19001
Try setting the values like the following (see this post), using .loc (.ix is deprecated):
df.loc[quarters, ...] = 10
By @joris:
Can you try
df.loc[row_idx, quarters] = df_h.values
for the last line (note the extra .values at the end)?
This one worked, thanks :-)
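This works because assigning a DataFrame with .loc aligns on both the row index and the columns: df_h keeps the row labels of df_ss (which come from the Excel file), none of which match row_idx in df, so every cell is filled with NaN. Passing .values hands over a plain NumPy array, which is written positionally. A minimal sketch of the effect (my own example, with made-up labels and values):

import pandas as pd

df = pd.DataFrame({'Country': ['Albania', 'Austria'], 'Q1 2000': [0, 0]}, index=[0, 5])
df_h = pd.DataFrame({'Q1 2000': [4547431]}, index=[42])   # a different row label
row_idx = df[df['Country'] == 'Austria'].index

df.loc[row_idx, ['Q1 2000']] = df_h          # aligned on labels -> label 42 not found -> NaN
print(df)

df.loc[row_idx, ['Q1 2000']] = df_h.values   # raw array, written positionally -> 4547431
print(df)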
Related
This is what my dataframe looks like:
Year  State  Var1  Var2
2018  1      1     3
2018  1      2     NaN
2018  1      NaN   1
2018  2      NaN   1
2018  2      NaN   2
2018  3      3     NaN
2019  1      1     NaN
2019  1      3     NaN
2019  1      2     NaN
2019  1      NaN   NaN
2019  2      NaN   NaN
2019  2      3     NaN
2020  1      1     NaN
2020  2      NaN   1
2020  2      NaN   3
2020  3      3     NaN
2020  4      NaN   NaN
2020  4      1     NaN
Desired Output
Year 2018 2019 2020
Var1 Num of States w/ non-null 2 2 3
Var2 Num of States w/ non-null 2 0 1
For each variable, I want to count the number of unique values of State that have at least one non-null response in each year.
IIUC you are looking for:
out = pd.concat([
    df.dropna(subset='Var1').pivot_table(columns='Year',
                                         values='State',
                                         aggfunc='nunique'),
    df.dropna(subset='Var2').pivot_table(columns='Year',
                                         values='State',
                                         aggfunc='nunique')
]).fillna(0).astype(int)
out.index = ['Var1 Num of States w/non-null', 'Var2 Num of states w/non-null']
print(out)
Year 2018 2019 2020
Var1 Num of States w/non-null 2 2 3
Var2 Num of states w/non-null 2 0 1
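An equivalent variant (my own sketch, not from the answer above) melts both variables into long form first and then counts unique states per year with nunique:

long = df.melt(id_vars=['Year', 'State'], value_vars=['Var1', 'Var2'])
out2 = (long.dropna(subset=['value'])
            .groupby(['variable', 'Year'])['State'].nunique()
            .unstack(fill_value=0)                  # years become columns, missing combos -> 0
            .reindex(['Var1', 'Var2'], fill_value=0)
            .astype(int))
print(out2)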
I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
Piece of DF
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you, if not please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd
fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"]!="0"].copy(deep=True) # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2 # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
    prev = dt
    holiday_look_back.append((prev, h))
    for i in range(1, look_back+1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h))
dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"].fillna("0", inplace=True)
print(df)
The StateHolidayNew column should have the info you need to start analyzing your data.
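If you want the pre-holiday days marked with a distinct code such as b- (as the question suggests), rather than reusing the holiday's own letter, a small variation of the look-back loop above would do it (a sketch, reusing dictDate2Holiday and look_back from the code above; the rest of the merge stays the same):

holiday_look_back = []
for dt, h in dictDate2Holiday.items():
    holiday_look_back.append((dt, h))                                    # the holiday itself keeps its code, e.g. "b"
    for i in range(1, look_back + 1):
        holiday_look_back.append((dt - pd.Timedelta(days=i), h + "-"))   # days shortly before it get "b-"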
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is group the rows between the different letters that represent the holidays and then use groupby to find the sales for each group. An improvement would be to backfill the holiday letters onto the numbered groups, e.g. groups=0.0 would become b_0, which would make it easier to understand the groups and which holiday they precede, but I am not sure how to do that (see the sketch after the final DataFrame below).
df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0
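A possible way to do the b_0-style labelling mentioned above (my own sketch; 'end' is a made-up placeholder for the trailing group that has no holiday after it): backfill each holiday's letter over the non-holiday rows that precede it and prefix it to the numeric group id.

import numpy as np
import pandas as pd

# assumes df is the final DataFrame shown above, with the 'groups' column
is_holiday = df['StateHoliday'].astype(str).str.isalpha()
next_code = df['StateHoliday'].where(is_holiday).bfill().fillna('end')   # letter of the next holiday
grp_num = pd.to_numeric(df['groups'], errors='coerce')                   # numeric group id, NaN on holiday rows
df['groups'] = np.where(is_holiday,
                        df['StateHoliday'],                                            # holiday rows keep their letter
                        next_code + '_' + grp_num.fillna(-1).astype(int).astype(str))  # e.g. 0.0 -> "b_0"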
I have the following dataframe:
PersonID AmountPaid PaymentReceivedDate StartDate withinNYears
1 100 2017 2016
2 20 2014 2014
1 30 2017 2016
1 40 2016 2016
4 300 2015 2000
5 150 2005 2002
What I'm looking for is that AmountPaid should appear in the withinNYears column if the payment was made within n years of the start date; otherwise you get NaN.
N can be any number, but let's say 2 for this example (as I will be playing with this to see findings).
So basically the above dataframe would come out like this if the amount was paid within 2 years:
PersonID AmountPaid PaymentReceivedDate StartDate withinNYears
1 100 2017 2016 100
2 20 2014 2014 20
1 30 2017 2016 30
1 40 2016 2016 40
4 300 2015 2000 NaN
5 150 2005 2002 NaN
Does anyone know how to achieve this? Cheers.
Subtract the columns and compare with a scalar to get a boolean mask, then set the values with numpy.where, Series.where or DataFrame.loc:
import numpy as np

m = (df['PaymentReceivedDate'] - df['StartDate']) < 2
df['withinNYears'] = np.where(m, df['AmountPaid'], np.nan)
#alternatives
#df['withinNYears'] = df['AmountPaid'].where(m)
#df.loc[m, 'withinNYears'] = df['AmountPaid']
print (df)
PersonID AmountPaid PaymentReceivedDate StartDate \
0 1 100 2017 2016
1 2 20 2014 2014
2 1 30 2017 2016
3 1 40 2016 2016
4 4 300 2015 2000
5 5 150 2005 2002
withinNYears
0 100.0
1 20.0
2 30.0
3 40.0
4 NaN
5 NaN
EDIT:
If the StartDate column has datetimes:
m = (df['PaymentReceivedDate'] - df['StartDate'].dt.year) < 2
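Since n can vary, it helps to keep it as a variable; a small usage sketch of the same mask:

n = 2   # play with this to see findings
m = (df['PaymentReceivedDate'] - df['StartDate']) < n
df['withinNYears'] = df['AmountPaid'].where(m)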
Just do the assignment with loc:
df.loc[(df['PaymentReceivedDate'] - df['StartDate'] < 2), 'withinNYears'] = df.AmountPaid
df
Out[37]:
PersonID AmountPaid ... StartDate withinNYears
0 1 100 ... 2016 100.0
1 2 20 ... 2014 20.0
2 1 30 ... 2016 30.0
3 1 40 ... 2016 40.0
4 4 300 ... 2000 NaN
5 5 150 ... 2002 NaN
[6 rows x 5 columns]
I am seeking clarity as to why my code cannot access specific column values as dummy values, using the following example data:
df
shop category subcategory season
date
2013-09-04 abc weddings shoes winter
2013-09-04 def jewelry watches summer
2013-09-05 ghi sports sneakers spring
2013-09-05 jkl jewelry necklaces fall
Here is my basic code:
wedding_df = df[["weddings","winter","summer","spring","fall"]]
I'm using Python 2 with my notebook, so it very well may be a version issue and require get_dummies(), but some guidance would be helpful. The idea is to create a dummy dataframe that uses binary indicators to say whether a row had the wedding category and in which season.
Here is an example of what I'm looking to achieve:
weddings winter summer spring fall
71654 1.0 0.0 1.0 0.0 0.0
72168 1.0 0.0 1.0 0.0 0.0
72080 1.0 0.0 1.0 0.0 0.0
with corr():
weddings fall spring summer winter
weddings NaN NaN NaN NaN NaN
fall NaN 1.000000 0.054019 -0.331866 -0.012122
spring NaN 0.054019 1.000000 -0.857205 0.072420
summer NaN -0.331866 -0.857205 1.000000 -0.484578
winter NaN -0.012122 0.072420 -0.484578 1.000000
You can try using prefix and prefix_sep, assigning them to blank strings; then you are able to select df[["weddings","winter","summer","spring","fall"]]:
df = pd.get_dummies(df, prefix='', prefix_sep='')
df
abc def ghi jkl jewelry sports weddings necklaces shoes \
date
2013-09-04 1 0 0 0 0 0 1 0 1
2013-09-04 0 1 0 0 1 0 0 0 0
2013-09-05 0 0 1 0 0 1 0 0 0
2013-09-05 0 0 0 1 1 0 0 1 0
sneakers watches fall spring summer winter
date
2013-09-04 0 0 0 0 0 1
2013-09-04 0 1 0 0 1 0
2013-09-05 1 0 0 1 0 0
2013-09-05 0 0 1 0 0 0
Update
pd.get_dummies(df.loc[df['category']=='weddings',['category','season']],prefix = '', prefix_sep = '' )
Out[820]:
weddings winter
date
2013-09-04 1 1
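To get just a wedding indicator next to the season dummies (and then run corr() on it), another option is to build the small frame directly from the original category and season columns instead of subsetting the full dummy frame. A sketch, assuming df still holds the original (non-dummied) data:

wedding_df = pd.concat(
    [(df['category'] == 'weddings').astype(int).rename('weddings'),   # 1 if the row is a wedding
     pd.get_dummies(df['season'])],                                    # one 0/1 column per season
    axis=1)
print(wedding_df.corr())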
Now, my dataset looks like this:
tconst Actor1 Actor2 Actor3 Actor4 Actor5 Actor6 Actor7 Actor8 Actor9 Actor10
0 tt0000001 NaN GreaterEuropean, WestEuropean, French GreaterEuropean, British NaN NaN NaN NaN NaN NaN NaN
1 tt0000002 NaN GreaterEuropean, WestEuropean, French NaN NaN NaN NaN NaN NaN NaN NaN
2 tt0000003 NaN GreaterEuropean, WestEuropean, French GreaterEuropean, WestEuropean, French GreaterEuropean, WestEuropean, French NaN NaN NaN NaN NaN NaN
3 tt0000004 NaN GreaterEuropean, WestEuropean, French NaN NaN NaN NaN NaN NaN NaN NaN
4 tt0000005 NaN GreaterEuropean, British GreaterEuropean, British NaN NaN NaN NaN NaN NaN NaN
I used the replace and map functions to get here.
I want to create a dataframe from the above dataframe such that I get the resulting dataframe below.
tconst GreaterEuropean WestEuropean French GreaterEuropean British Arab British ............
tt0000001 2 1 0 4 1 0 2 .....
tt0000002 0 2 4 0 1 3 0 .....
GreaterEuropean, British, WestEuropean, Italian, French, ... represent the number of ethnicities of the different actors in a particular movie specified by tconst.
That would be like a count matrix: e.g. for a movie tt00001 there are 5 Arabs, 2 British, 1 WestEuropean and so on, i.e. how many actors in a movie belong to each of these ethnicities.
Link to data - https://drive.google.com/open?id=1oNfbTpmLA0imPieRxGfU_cBYVfWN3tZq
import numpy as np
import pandas as pd
df_melted = pd.melt(df, id_vars='tconst',
                    value_vars=df.columns[2:].tolist(),
                    var_name='actor',
                    value_name='ethnicities').dropna()
print(df_melted.ethnicities.str.get_dummies(sep = ',').sum())
Output:
British 169
EastAsian 9
EastEuropean 17
French 73
Germanic 9
GreaterEastAsian 13
Hispanic 9
IndianSubContinent 2
Italian 7
Japanese 4
Jewish 25
Nordic 7
WestEuropean 105
Asian 15
GreaterEuropean 316
dtype: int64
This is close to what you wanted, but not exact. To get what you wanted, without typing out the lists of columns or values, is more complicated.
From: https://stackoverflow.com/a/48120674/6672746
def change_column_order(df, col_name, index):
    cols = df.columns.tolist()
    cols.remove(col_name)
    cols.insert(index, col_name)
    return df[cols]

def split_df(dataframe, col_name, sep):
    orig_col_index = dataframe.columns.tolist().index(col_name)
    orig_index_name = dataframe.index.name
    orig_columns = dataframe.columns
    dataframe = dataframe.reset_index()  # we need a natural 0-based index for proper merge
    index_col_name = (set(dataframe.columns) - set(orig_columns)).pop()
    df_split = pd.DataFrame(
        pd.DataFrame(dataframe[col_name].str.split(sep).tolist())
        .stack().reset_index(level=1, drop=1), columns=[col_name])
    df = dataframe.drop(col_name, axis=1)
    df = pd.merge(df, df_split, left_index=True, right_index=True, how='inner')
    df = df.set_index(index_col_name)
    df.index.name = orig_index_name
    # merge adds the column to the last place, so we need to move it back
    return change_column_order(df, col_name, orig_col_index)
Using those excellent functions:
df_final = split_df(df_melted, 'ethnicities', ',')
df_final.set_index(['tconst', 'actor'], inplace=True)
df_final.pivot_table(index=['tconst'],
                     columns='ethnicities',
                     aggfunc=pd.Series.count).fillna(0).astype('int')
Output:
ethnicities British EastAsian EastEuropean French Germanic GreaterEastAsian Hispanic IndianSubContinent Italian Japanese Jewish Nordic WestEuropean Asian GreaterEuropean
tconst
tt0000001 1 0 0 1 0 0 0 0 0 0 0 0 1 0 2
tt0000002 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1
tt0000003 0 0 0 3 0 0 0 0 0 0 0 0 3 0 3
tt0000004 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1
tt0000005 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2
Pandas has it all.
title_principals["all"] = title_principals["Actor1"].astype(str)+','+title_principals["Actor2"].astype(str)+','+title_principals["Actor3"].astype(str)+','+title_principals["Actor4"].astype(str)+','+title_principals["Actor5"].astype(str)+','+title_principals["Actor6"].astype(str)+','+title_principals["Actor7"].astype(str)+','+title_principals["Actor8"].astype(str)+','+title_principals["Actor9"].astype(str)+','+title_principals["Actor10"].astype(str)
and then, from that string, compute the count per row and drop the other columns.
title_principals["GreaterEuropean"] = title_principals["all"].str.count(r'GreaterEuropean')