Creating a new column using certain conditions

I have a dataframe like so:
Year RS Team RS_target
1 1962 599 WSA
2 1962 774 STL
3 1963 747 WSA
4 1963 725 STL
5 1964 702 WSA
6 1964 800 STL
I'd like to create a new column (RS_target) that will have the RS value for the next year (i.e. index 1: Year = 1962, RS = 599, RS_target = 747). The aim is to get next year's RS for the team and place that value in the new column "RS_target".
I've been trying a combination of conditionals and apply(), but having trouble getting the output I want. Looking for an efficient alternative method, or any other way to get the desired outcome. Thanks!

You first need to group by the Team column with DataFrame.groupby() and then use shift(-1) to pull each team's next RS value up by one row.
import pandas as pd
df = pd.DataFrame({'Year':[1962,1962,1963,1963,1964,1964], 'RS':[599,774,747,725,702,800], 'Team':['WSA','STL','WSA','STL','WSA','STL']})
# shift(-1) within each team moves the following row's RS up by one
df['RS_Target'] = df.groupby('Team')['RS'].shift(-1)
print(df)
Output:
Year RS Team RS_Target
0 1962 599 WSA 747.0
1 1962 774 STL 725.0
2 1963 747 WSA 702.0
3 1963 725 STL 800.0
4 1964 702 WSA NaN
5 1964 800 STL NaN
EDIT:
If the rows are not already in chronological order by Year, sort them as shown below before applying the groupby operation:
df.sort_values(['Year'], inplace=True)
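As a minimal sketch of the sort-then-shift idea, the snippet below takes the question's rows in a shuffled order, sorts them by Team and Year, and then shifts within each team. The shuffled row order is made up for illustration. The approach assumes every team has exactly one row per consecutive year; if a year is missing, shift(-1) returns the next available year rather than strictly Year + 1.

```python
import pandas as pd

# Same values as in the question, but deliberately out of order (illustrative assumption)
df = pd.DataFrame({
    "Year": [1964, 1962, 1963, 1962, 1964, 1963],
    "RS": [702, 599, 747, 774, 800, 725],
    "Team": ["WSA", "WSA", "WSA", "STL", "STL", "STL"],
})

# Sort so that, within each team, rows run from the earliest to the latest year
df = df.sort_values(["Team", "Year"]).reset_index(drop=True)

# Within each team, take the RS value from the following row (the next year)
df["RS_target"] = df.groupby("Team")["RS"].shift(-1)

print(df)
```

The last year of each team has no following row, so its RS_target is NaN, which is also why the column becomes float.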

Related

Iterate over specific rows, sum results and store in new row

I have a DataFrame in which I have already defined rows to be summed up and store the results in a new row.
For example in Year 1990:
Category A B C D Year
E 147 78 476 531 1990
F 914 356 337 781 1990
G 117 874 15 69 1990
H 45 682 247 65 1990
I 20 255 465 19 1990
Here, rows G and H should be summed and the result stored in a new row. The same categories repeat for every year from 1990 to 2019.
I have already tried .iloc with slices such as [4:8], [50:54], [96:100] and so on, but with .iloc I cannot specify several ranges at once, and I can't manage to write a loop over the individual years.
Is there a way to sum the values in categories G and H for each year (1990 - 2019)?

I'm not sure what you mean by "multiple index".
A MultiIndex usually appears after a groupby and aggregate operation.
Your table just looks like it has multiple columns.
So, if I understand correctly, here is complete code showing how to combine multiple conditions on a DataFrame:
import io
import pandas as pd

data = """Category A B C D Year
E 147 78 476 531 1990
F 914 356 337 781 1990
G 117 874 15 69 1990
H 45 682 247 65 1990
I 20 255 465 19 1990"""
# The sample above is whitespace-separated
table = pd.read_csv(io.StringIO(data), sep=r"\s+")

sums = []
for year in table["Year"].unique():
    # Select the G and H rows of the current year
    mask = table["Category"].isin(["G", "H"]) & (table["Year"] == year)
    row = table.loc[mask, ["A", "B", "C", "D"]].sum()
    row["Category"], row["Year"] = "sum", year
    sums.append(row)
# DataFrame.append was removed in pandas 2.0, so add the new rows with pd.concat
table = pd.concat([table, pd.DataFrame(sums)], ignore_index=True)

If you are only interested in G/H, you can slice with isin combined with boolean indexing, then sum:
df[df['Category'].isin(['G', 'H'])].sum()
output:
Category GH
A 162
B 1556
C 262
D 134
Year 3980
dtype: object
NB. note here the side effect of sum that combines the two "G"/"H" strings into one "GH".
Or, better, set Category as index and slice with loc:
df.set_index('Category').loc[['G', 'H']].sum()
output:
A 162
B 1556
C 262
D 134
Year 3980
dtype: int64
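Since the question asks for the G/H sum per year, the isin filter can be combined with a groupby on Year. The following is a small sketch of that idea; the 1990 rows come from the question, while the 1991 rows are made-up placeholder values added only so that there is more than one year to group.

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["E", "F", "G", "H", "I"] * 2,
    # First five values per column are the 1990 rows from the question;
    # the last five are invented 1991 placeholders.
    "A": [147, 914, 117, 45, 20, 1, 2, 3, 4, 5],
    "B": [78, 356, 874, 682, 255, 10, 20, 30, 40, 50],
    "C": [476, 337, 15, 247, 465, 100, 200, 300, 400, 500],
    "D": [531, 781, 69, 65, 19, 6, 7, 8, 9, 10],
    "Year": [1990] * 5 + [1991] * 5,
})

# Keep only the G and H rows, then sum the value columns within each year
gh = df[df["Category"].isin(["G", "H"])]
per_year = gh.groupby("Year")[["A", "B", "C", "D"]].sum()

print(per_year)
```

Selecting the value columns explicitly avoids summing Year and concatenating the Category strings.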

Create new Pandas.DataFrame with .groupby(...).agg(sum) then recover unsummed columns

I'm starting with a dataframe of baseball seasons, a section of which looks similar to this:
Name Season AB H SB playerid
13047 A.J. Pierzynski 2013 503 137 1 746
6891 A.J. Pierzynski 2006 509 150 1 746
1374 Rod Carew 1977 616 239 23 1001942
1422 Stan Musial 1948 611 230 7 1009405
1507 Todd Helton 2000 580 216 5 432
1508 Nomar Garciaparra 2000 529 197 5 190
1509 Ichiro Suzuki 2004 704 262 36 1101
From these seasons, I want to create a dataframe of career stats; that is, one row for each player which is a sum of their AB, H, etc. This dataframe should still include the names of the players. The playerid in the above is a unique key for each player and should either be an index or an unchanged value in a column after creating the career stats dataframe.
My hypothetical starting point is df_careers = df_seasons.groupby('playerid').agg(sum) but this leaves out all the non-numeric data. With numeric_only = False I can get some sort of mess in the names columns like 'Ichiro SuzukiIchiro SuzukiIchiro Suzuki' from concatenation, but that just requires a bunch of cleaning. This is something I'd like to be able to do with other data sets and the actual data I have is more like 25 columns, so I'd rather understand a specific routine for getting the Name data back or preserving it from the outset rather than write a specific function and use groupby('playerid').agg(func) (or a similar process) to do it, if possible.
I'm guessing there's a fairly simple way to do this, but I only started learning Pandas a week ago, so there are gaps in my knowledge.

You can write your own rule for how the non-summed columns should be handled. Here, object (string) columns keep their first value and every other column is summed:
col = df.columns.tolist()
col.remove('playerid')  # the grouping key itself should not be aggregated
df.groupby('playerid').agg({i : lambda x: x.iloc[0] if x.dtype == 'object' else x.sum() for i in col})
Output (note that Season is numeric, so it is summed as well):
Name Season AB H SB
playerid
190 Nomar_Garciaparra 2000 529 197 5
432 Todd_Helton 2000 580 216 5
746 A.J._Pierzynski 4019 1012 287 2
1101 Ichiro_Suzuki 2004 704 262 36
1001942 Rod_Carew 1977 616 239 23
1009405 Stan_Musial 1948 611 230 7

If there is a one-to-one relationship between 'playerid' and 'Name', as appears to be the case, you can just include 'Name' in the groupby columns:
stat_cols = ['AB', 'H', 'SB']
groupby_cols = ['playerid', 'Name']
results = df.groupby(groupby_cols)[stat_cols].sum()
Results:
AB H SB
playerid Name
190 Nomar Garciaparra 529 197 5
432 Todd Helton 580 216 5
746 A.J. Pierzynski 1012 287 2
1101 Ichiro Suzuki 704 262 36
1001942 Rod Carew 616 239 23
1009405 Stan Musial 611 230 7
If you'd prefer to group only by 'playerid' and add the 'Name' data back in afterwards, you can instead create a 'playerid' to 'Name' mapping as a dictionary, and look it up using map:
results = df.groupby('playerid')[stat_cols].sum()
name_map = pd.Series(df.Name.to_numpy(), df.playerid).to_dict()
results['Name'] = results.index.map(name_map)
Results:
AB H SB Name
playerid
190 529 197 5 Nomar Garciaparra
432 580 216 5 Todd Helton
746 1012 287 2 A.J. Pierzynski
1101 704 262 36 Ichiro Suzuki
1001942 616 239 23 Rod Carew
1009405 611 230 7 Stan Musial

groupby.agg() can accept a dictionary that maps column names to functions. So, one solution is to pass a dictionary to agg, specifying which function to apply to each column.
Using the sample data above, one might use
mapping = {'AB': 'sum', 'H': 'sum', 'SB': 'sum', 'Season': 'max', 'Name': 'max'}
df_1 = df.groupby('playerid').agg(mapping)
The choice to use 'max' for those that shouldn't be summed is arbitrary. You could define a lambda function to apply to a column if you want to handle it in a certain way. DataFrameGroupBy.agg can work with any function that will work with DataFrame.apply.
To expand this to larger data sets, you might use a dictionary comprehension, for example:
# sum every column except the grouping key, then override the ones that shouldn't be summed
dictionary = {x: 'sum' for x in df.columns if x != 'playerid'}
dont_sum = {'Name': 'max', 'Season': 'max'}
dictionary.update(dont_sum)
df_1 = df.groupby('playerid').agg(dictionary)
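Another way to spell the same per-column choice is pandas' named aggregation (available since pandas 0.25), where each output column is given as a (source column, function) pair. This is only a sketch using a few of the question's rows; the output column names such as first_season and last_season are illustrative choices, not something from the question.

```python
import pandas as pd

# A subset of the seasons shown in the question
df_seasons = pd.DataFrame({
    "Name": ["A.J. Pierzynski", "A.J. Pierzynski", "Rod Carew", "Ichiro Suzuki"],
    "Season": [2013, 2006, 1977, 2004],
    "AB": [503, 509, 616, 704],
    "H": [137, 150, 239, 262],
    "SB": [1, 1, 23, 36],
    "playerid": [746, 746, 1001942, 1101],
})

# One row per player: keep the first Name, record the season range, sum the stats
df_careers = df_seasons.groupby("playerid").agg(
    Name=("Name", "first"),
    first_season=("Season", "min"),
    last_season=("Season", "max"),
    AB=("AB", "sum"),
    H=("H", "sum"),
    SB=("SB", "sum"),
)

print(df_careers)
```

With around 25 columns, the keyword arguments could be built from a dictionary and unpacked with **, so that only the non-summed columns need to be listed by hand.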

How to unify (collapse) multiple columns into one assigning unique values

Edited my previous question:
I want to distinguish each Device (FOUR types) attached to a particular Building's particular Elevator (represented by height).
As there are no unique IDs for the devices, I want to identify them and assign a unique ID to each by grouping on ('BldgID', 'BldgHt', 'Device').
I then want to count their test results, i.e. how many times each device failed (NG) out of the total number of tests (NG + OK) on any particular date, over the entire period of a few months.
Original dataframe looks like this
BldgID BldgHt Device Date Time Result
1074 34.0 790 2018/11/20 10:30 OK
1072 31.0 780 2018/11/19 11:10 NG
1072 36.0 780 2018/11/17 05:30 OK
1074 10.0 790 2018/11/19 06:10 OK
1074 10.0 790 2018/12/20 11:50 NG
1076 17.0 760 2018/08/15 09:20 NG
1076 17.0 760 2018/09/20 13:40 OK
As 'Time' is irrelevant, I dropped it. I want to find the number of [NG] per day for each set (consisting of 'BldgID', 'BldgHt', 'Device').
#aggregate both functions only once by groupby
df1 = mel_df.groupby(['BldgID','BldgHt','Device','Date'])\
['Result'].agg([('NG', lambda x :(x=='NG').sum()), \
('ALL','count')]).round(2).reset_index()
#create New_ID by insert with Series with zero fill 3 values
s = pd.Series(np.arange(1, len(mel_df2) + 1),
index=mel_df2.index).astype(str).str.zfill(3)
mel_df2.insert(0, 'New_ID', s)
Now the filtered DataFrame looks like:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
1 001 1072 31.0 780 2018/11/19 1 2
8 002 1076 17.0 760 2018/11/20 1 1
If I groupby ['BldgID', 'BldgHt', 'Device', 'Date'] then I get per day 'NG'.
But it would consider every day differently and if I assign 'unique' IDs I can plot how the unique Devices behave in every other single day.
If I groupby ['BldgID', 'BldgHt', 'Device'] then I get the overall 'NG' for that set (or unique Device), which is not my goal.
What I want to achieve is:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
001 1072 31.0 780 2018/11/19 1 2
1072 31.0 780 2018/12/30 3 4
002 1076 17.0 760 2018/11/20 1 1
1076 17.0 760 2018/09/20 2 4
003 1072 36.0 780 2018/08/15 1 3
Any tips would be very much appreciated.

Use:
#compute both aggregations in a single groupby
df1 = mel_df.groupby(['BldgID','BldgHt','Device','Date'])\
['Result'].agg([('NG', lambda x :(x=='NG').sum()), ('ALL','count')]).round(2).reset_index()
#filter non 0 rows
mel_df2 = df1[df1.NG != 0]
#filter first rows by Date
mel_df2 = mel_df2.drop_duplicates('Date')
#create New_ID by insert with Series with zero fill 3 values
s = pd.Series(np.arange(1, len(mel_df2) + 1), index=mel_df2.index).astype(str).str.zfill(3)
mel_df2.insert(0, 'New_ID', s)
Output from data from question:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
1 001 1072 31.0 780 2018/11/19 1 1
8 002 1076 17.0 780 2018/11/20 1 1
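If the goal is one ID per ('BldgID', 'BldgHt', 'Device') set that is repeated on every date row of that set, one possible sketch is to number the groups with GroupBy.ngroup() after the per-day aggregation. This is a different technique from the drop_duplicates approach above, shown here on the question's sample rows; the variable names keys and daily are illustrative.

```python
import pandas as pd

# Sample rows from the question (Time already dropped)
mel_df = pd.DataFrame({
    "BldgID": [1074, 1072, 1072, 1074, 1074, 1076, 1076],
    "BldgHt": [34.0, 31.0, 36.0, 10.0, 10.0, 17.0, 17.0],
    "Device": [790, 780, 780, 790, 790, 760, 760],
    "Date": ["2018/11/20", "2018/11/19", "2018/11/17", "2018/11/19",
             "2018/12/20", "2018/08/15", "2018/09/20"],
    "Result": ["OK", "NG", "OK", "OK", "NG", "NG", "OK"],
})

keys = ["BldgID", "BldgHt", "Device"]

# Per device set and per day: number of NG results and total number of tests
daily = (
    mel_df.groupby(keys + ["Date"])
    .agg(NG=("Result", lambda s: (s == "NG").sum()), ALL=("Result", "count"))
    .reset_index()
)

# ngroup() numbers each (BldgID, BldgHt, Device) combination from 0 in sorted key order;
# add 1 and zero-pad to three digits to get IDs such as 001, 002, ...
new_id = (daily.groupby(keys).ngroup() + 1).astype(str).str.zfill(3)
daily.insert(0, "New_ID", new_id)

print(daily)
```

Rows with NG equal to 0 can still be filtered out afterwards with daily[daily.NG != 0] without changing the IDs already assigned.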

How to merge some data in dataframe

I need to merge some data in a dataframe because I want to code [sequential association rule] mining in Python.
How can I merge the data, and which algorithm should I use in Python?
Apriori? FP-growth?
I can't find an implementation of [sequential association rule] using Apriori in Python.
The examples I find use R.
There are 250 visit places, 116807 unique id numbers and 1.7 million rows in total. Each id also has a country_code (111 countries, but I will classify them into 10 groups), so I will need to merge them once more.
Previous Data
index date_ymd id visit_nm country
1 20170801 123123 seoul 460
2 20170801 123123 tokyo 460
3 20170801 124567 seoul 440
4 20170802 123123 osaka 460
5 20170802 123123 seoul 460
... ... ... ...
What I need
index Transaction visit_nm country
1 20170801123123 {seoul,tokyo} 460
2 20170802123123 {osaka,seoul} 460

From what I understand from the data, you can use groupby with agg:
s=pd.Series(df.date_ymd.astype(str)+df.id.astype(str),name='Transaction')
(df.groupby(s)
.agg({'visit_nm':lambda x: set(x),'country':'first'}).reset_index())
Transaction visit_nm country
0 20170801123123 {seoul, tokyo} 460
1 20170801124567 {seoul} 440
2 20170802123123 {osaka, seoul} 460

Also you could use:
df['Transaction'] = df['date_ymd'].map(str)+df['id'].map(str)
df.groupby('Transaction').agg({'visit_nm': lambda x: set(x), 'country': 'first'}).reset_index()
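One thing to keep in mind is that a set discards both visit order and repeated visits. If the sequential rules depend on the order of visits, a list may be a better container. The sketch below is a self-contained variant under that assumption, using the question's sample rows and assuming the rows are already in visit order within each day.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "date_ymd": [20170801, 20170801, 20170801, 20170802, 20170802],
    "id": [123123, 123123, 124567, 123123, 123123],
    "visit_nm": ["seoul", "tokyo", "seoul", "osaka", "seoul"],
    "country": [460, 460, 440, 460, 460],
})

# One row per (date, id); lists keep the original row order of the visits
out = df.groupby(["date_ymd", "id"], as_index=False).agg(
    visit_nm=("visit_nm", list),
    country=("country", "first"),
)

# Build the transaction key by concatenating date and id as strings
out["Transaction"] = out["date_ymd"].astype(str) + out["id"].astype(str)
out = out[["Transaction", "visit_nm", "country"]]

print(out)
```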

Scipy Stats ttest_1samp Hypothesis Testing For Comparing Previous Performance To Sample

My Problem I'm Trying To Solve
I have 11 months worth of performance data:
Month Branded Non-Branded Shopping Grand Total
0 2/1/2015 1330 334 161 1825
1 3/1/2015 1344 293 197 1834
2 4/1/2015 899 181 190 1270
3 5/1/2015 939 208 154 1301
4 6/1/2015 1119 238 179 1536
5 7/1/2015 859 238 170 1267
6 8/1/2015 996 340 183 1519
7 9/1/2015 1138 381 172 1691
8 10/1/2015 1093 395 176 1664
9 11/1/2015 1491 426 199 2116
10 12/1/2015 1539 530 156 2225
Let's say it's February, 1 2016 and I'm asking "are the results in January statistically different from the past 11 months?"
Month Branded Non-Branded Shopping Grand Total
11 1/1/2016 1064 408 106 1578
I came across a blog...
I came across iaingallagher's blog. I will reproduce here (in case the blog goes down).
1-sample t-test
The 1-sample t-test is used when we want to compare a sample mean to a
population mean (which we already know). The average British man is
175.3 cm tall. A survey recorded the heights of 10 UK men and we want to know whether the mean of the sample is different from the
population mean.
# 1-sample t-test
from scipy import stats
one_sample_data = [177.3, 182.7, 169.6, 176.3, 180.3, 179.4, 178.5, 177.2, 181.8, 176.5]
one_sample = stats.ttest_1samp(one_sample_data, 175.3)
print("The t-statistic is %.3f and the p-value is %.3f." % (one_sample.statistic, one_sample.pvalue))
Result:
The t-statistic is 2.296 and the p-value is 0.047.
Finally, to my question...
In iaingallagher's example, he knows the population mean and is comparing a sample (one_sample_data). In MY example, I want to see if 1/1/2016 is statistically different from the previous 11 months. So in my case, the previous 11 months is an array (instead of a single population mean value) and my sample is one data point (instead of an array)... so it's kind of backwards.
QUESTION
If I was focused on the Shopping column data:
Will scipy.stats.ttest_1samp([161,197,190,154,179,170,183,172,176,199,156], 106) produce a valid result even though my sample (first parameters) is a list of previous results and I'm comparing it to a popmean that's not the population mean but instead one sample.
If this is not the correct stats function, any recommendation on what to use for this hypothesis test situation?

If you are only interested in the "Shopping" column, you could create a .xlsx or .csv file containing only the data from the "Shopping" column.
This way you could import the data and use pandas to perform the same t-test for each column individually.
import pandas as pd
from scipy import stats
data = pd.read_excel("datafile.xlsx")
one_sample_data = data["Shopping"]
# compare the 11 months of history against the January 2016 Shopping value from the question
one_sample = stats.ttest_1samp(one_sample_data, 106)
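Note that ttest_1samp treats the second argument as a fixed, known mean, so it ignores the fact that the January figure is itself a single noisy observation. One commonly used alternative, sketched here as an assumption rather than as the definitive answer, is to test the new observation against the historical sample with the statistic t = (x_new - mean) / (s * sqrt(1 + 1/n)) on n - 1 degrees of freedom, which is equivalent to checking whether the new value falls outside a prediction interval. It assumes the monthly values are independent and roughly normal, which may not hold for seasonal data.

```python
import math
from scipy import stats

# Shopping column for the previous 11 months (from the question)
history = [161, 197, 190, 154, 179, 170, 183, 172, 176, 199, 156]
new_value = 106  # January 2016 Shopping value

n = len(history)
mean = sum(history) / n
# Sample standard deviation (n - 1 in the denominator)
sd = math.sqrt(sum((x - mean) ** 2 for x in history) / (n - 1))

# The sqrt(1 + 1/n) factor accounts for the new value being a single draw,
# on top of the uncertainty in the estimated mean
t_stat = (new_value - mean) / (sd * math.sqrt(1 + 1 / n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print("t = %.3f, p = %.4f" % (t_stat, p_value))
```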
