This is my dataset, where I have different countries, different models for each country, years, and the price and volume.
data_dic = {
"Country" : [1,1,1,1,2,2,2,2],
"Model" : ["A","B","B","A","A","B","B","A"],
"Year": [2005,2005,2020,2020,2005,2005,2020,2020],
"Price" : [100,172,852,953,350,452,658,896],
"Volume" : [4,8,9,10,12,6,8,9]
}
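To reproduce the frame shown below, something like this should work (the table appears to be the DataFrame sorted by Model and Year, which is why the index is out of order):
import pandas as pd

# Build the DataFrame from the dictionary above and sort it the way it is displayed.
df = pd.DataFrame(data_dic)
print(df.sort_values(["Model", "Year"]))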
Country Model Year Price Volume
0 1 A 2005 100 4
4 2 A 2005 350 12
3 1 A 2020 953 10
7 2 A 2020 896 9
1 1 B 2005 172 8
5 2 B 2005 452 6
2 1 B 2020 852 9
6 2 B 2020 658 8
I would like to obtain the following, where 1) column "Division_Price" is the division of Price between the years 2005 and 2020 for each Country and Model (e.g., Country 1, Model A), and 2) column "Division_Volume" is the same division applied to Volume.
data_dic2 = {
"Country" : [1,1,1,1,2,2,2,2],
"Model" : ["A","B","B","A","A","B","B","A"],
"Year": [2005,2005,2020,2020,2005,2005,2020,2020],
"Price" : [100,172,852,953,350,452,658,896],
"Volume" : [4,8,9,10,12,6,8,9],
"Division_Price": [0.953,4.95,4.95,0.953,2.56,1.45,1.45,2.56],
"Division_Volume": [2.5,1.125,1.125,2.5,1,1.33,1.33,1],
}
print(pd.DataFrame(data_dic2))
Country Model Year Price Volume Division_Price Division_Volume
0 1 A 2005 100 4 0.953 2.500
4 2 A 2005 350 12 2.560 1.000
3 1 A 2020 953 10 0.953 2.500
7 2 A 2020 896 9 2.560 1.000
1 1 B 2005 172 8 4.950 1.125
5 2 B 2005 452 6 1.450 1.330
2 1 B 2020 852 9 4.950 1.125
6 2 B 2020 658 8 1.450 1.330
My whole dataset has up to 50 countries and up to 10 models, with years ranging from 1990 to 2030.
I am still unsure how to account for the multiple conditions across the three columns so that I can automatically divide the Price and Volume columns based on those three conditions (i.e., Country, Year and Model).
Thanks!
You can try the following, using df.pivot, df.stack() and df.merge:
>>> df2 = ( df.pivot(index=['Year'], columns=['Model', 'Country'], values=['Price', 'Volume'])
            .diff().bfill(downcast='infer').abs().stack().stack()
            .sort_index(level=-1).add_prefix('Difference_')
          )
>>> df2
Difference_Price Difference_Volume
Year Country Model
2005 1 A 853 6
2 A 546 3
2020 1 A 853 6
2 A 546 3
2005 1 B 680 1
2 B 206 2
2020 1 B 680 1
2 B 206 2
>>> df.merge(df2, on=['Country', 'Model', 'Year'], how='right')
Country Model Year Price Volume Difference_Price Difference_Volume
0 1 A 2005 100 4 853 6
1 2 A 2005 350 12 546 3
2 1 A 2020 953 10 853 6
3 2 A 2020 896 9 546 3
4 1 B 2005 172 8 680 1
5 2 B 2005 452 6 206 2
6 1 B 2020 852 9 680 1
7 2 B 2020 658 8 206 2
EDIT:
For your expected dataframe, I think the 0.953 should be 9.530; if so, you can use pct_change and add 1:
>>> df2 = ( df.pivot(index=['Year'], columns=['Model', 'Country'], values=['Price', 'Volume'])
            .pct_change(1).add(1).bfill(downcast='infer').abs().stack().stack()
            .sort_index(level=-1).add_prefix('Division_').round(3)
          )
>>> df2
Division_Price Division_Volume
Year Country Model
2005 1 A 9.530 2.500
2 A 2.560 0.750
2020 1 A 9.530 2.500
2 A 2.560 0.750
2005 1 B 4.953 1.125
2 B 1.456 1.333
2020 1 B 4.953 1.125
2 B 1.456 1.333
>>> df.merge(df2, on=['Country', 'Model', 'Year'], how='right')
Country Model Year Price Volume Division_Price Division_Volume
0 1 A 2005 100 4 9.530 2.500
1 2 A 2005 350 12 2.560 0.750
2 1 A 2020 953 10 9.530 2.500
3 2 A 2020 896 9 2.560 0.750
4 1 B 2005 172 8 4.953 1.125
5 2 B 2005 452 6 1.456 1.333
6 1 B 2020 852 9 4.953 1.125
7 2 B 2020 658 8 1.456 1.333
I am using Spyder. I have imported pandas and did
df = pd.read_csv(...)
This is my CSV file (shown here after loading it into df):
Year Public Specialist Public Non-Specialist Private Specialist Private Non-Specialist
0 2010 1996 3184 1151 2159
1 2011 2165 3456 1229 2220
2 2012 2342 3789 1293 2222
3 2013 2511 4150 1351 2327
4 2014 2829 4501 1411 2379
5 2015 3052 4857 1470 2444
6 2016 3299 5059 1485 2494
7 2017 3523 5050 1528 2579
8 2018 3741 5078 1565 2660
9 2019 3864 5166 1682 2757
Here is the code:
print(f"Highest percentage change is {df.iloc[:,4].pct_change().max(axis=0):.2f}% in "
f"{df.loc[df['Private Non-Specialist'] == df.iloc[:,4].pct_change().max(axis=0), 'Year'].values[0]}")
With this, I'm trying to find the year which has the highest percentage change, but the output shows:
index 0 is out of bounds for axis 0 with size 0
Here's a solution; I've added a couple of intermediate steps for clarity:
max_change = df["Private Non-Specialist"].pct_change().max()
max_inx = df["Private Non-Specialist"].pct_change().idxmax()
max_year = df.iloc[max_inx]["Year"]
print(f"Highest percentage change is {max_change:.2f}% in {max_year}")
This prints:
Highest percentage change is 0.05% in 2013
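Note that pct_change returns a fraction (about 0.047 here), so formatting it with :.2f and a literal % sign understates the change. If an actual percentage is wanted, the :.2% format specifier does the scaling:
# .2% multiplies by 100 and appends a percent sign.
print(f"Highest percentage change is {max_change:.2%} in {max_year}")
# Highest percentage change is 4.73% in 2013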
I have a dataframe df which I need to group by multiple columns based on a condition.
df
user_id area_id group_id key year value new
10835 48299 1 5 2011 0 ?
10835 48299 1 2 2010 0
10835 48299 2 102 2013 13100
10835 48299 2 5 2016 0
10836 48299 1 78 2017 67100
10836 48299 1 1 2012 54000
10836 48299 1 12 2018 0
10836 48752 1 7 2014 0
10836 48752 2 103 2015 5000
10837 48752 2 102 2016 5000
10837 48752 1 3 2017 0
10837 48752 1 103 2017 0
10837 49226 1 2 2011 4000
10837 49226 1 83 2011 4000
10838 49226 2 16 2011 0
10838 49226 1 75 2012 0
10838 49226 1 2 2012 4000
10838 49226 1 12 2013 1000
10839 49226 1 3 2015 6500
10839 49226 1 102 2016 7900
10839 49226 1 16 2017 0
10839 49226 2 6 2017 5500
22489 49226 2 89 2017 5000
22489 49226 1 102 2017 5000
My goal is to create a new column df['new'].
Current solution:
df['new'] =df['user_id'].map(df[df['key'].eq(102)].groupby(['user_id', 'area_id', 'group_id', 'year'])['value'].sum())
I get NaN for all df['new'] values. I'm guessing it's not possible to use the map function with a multi-column groupby this way. Is there a proper way to accomplish this? Thanks in advance for a tip in the right direction.
You can add as_index=False to get a new DataFrame:
df1 = (df[df['key'].eq(102)]
         .groupby(['user_id', 'area_id', 'group_id', 'year'], as_index=False)['value']
         .sum())
print (df1)
user_id area_id group_id year value
0 10835 48299 2 2013 13100
1 10837 48752 2 2016 5000
2 10839 49226 1 2016 7900
3 22489 49226 1 2017 5000
Then, if duplicated user_id values are possible, first get the unique rows with DataFrame.drop_duplicates, create a Series with DataFrame.set_index, and map:
df['new'] = df['user_id'].map(df1.drop_duplicates('user_id').set_index('user_id')['value'])
#if never duplicates
#df['new'] = df['user_id'].map(df1.set_index('user_id')['value'])
print (df)
user_id area_id group_id key year value new
0 10835 48299 1 5 2011 0 13100.0
1 10835 48299 1 2 2010 0 13100.0
2 10835 48299 2 102 2013 13100 13100.0
3 10835 48299 2 5 2016 0 13100.0
4 10836 48299 1 78 2017 67100 NaN
5 10836 48299 1 1 2012 54000 NaN
6 10836 48299 1 12 2018 0 NaN
7 10836 48752 1 7 2014 0 NaN
8 10836 48752 2 103 2015 5000 NaN
9 10837 48752 2 102 2016 5000 5000.0
10 10837 48752 1 3 2017 0 5000.0
11 10837 48752 1 103 2017 0 5000.0
12 10837 49226 1 2 2011 4000 5000.0
13 10837 49226 1 83 2011 4000 5000.0
14 10838 49226 2 16 2011 0 NaN
15 10838 49226 1 75 2012 0 NaN
16 10838 49226 1 2 2012 4000 NaN
17 10838 49226 1 12 2013 1000 NaN
18 10839 49226 1 3 2015 6500 7900.0
19 10839 49226 1 102 2016 7900 7900.0
20 10839 49226 1 16 2017 0 7900.0
21 10839 49226 2 6 2017 5500 7900.0
22 22489 49226 2 89 2017 5000 5000.0
23 22489 49226 1 102 2017 5000 5000.0
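If instead you need the value filled in only where all four key columns match a key == 102 group exactly (a different requirement from the output above), a left merge on those columns is one possible sketch; new_exact is just an illustrative column name:
# Sketch: match on the full composite key instead of user_id alone.
df2 = (df[df['key'].eq(102)]
         .groupby(['user_id', 'area_id', 'group_id', 'year'], as_index=False)['value']
         .sum()
         .rename(columns={'value': 'new_exact'}))
df = df.merge(df2, on=['user_id', 'area_id', 'group_id', 'year'], how='left')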
Below is the Sales table, which has the following data:
Sales:
S_ID S_QTY S_PRD S_ST_DT S_END_DT
1 223 AA 2018-06-02 2018-06-04
2 442 FO 2018-05-10 2018-05-12
3 771 WA 2018-07-07 2018-07-10
4 663 AAG 2018-03-02 2018-03-03
I am trying to get all the dates between S_ST_DT and S_END_DT (inclusive).
I am expecting the following output in both DB2 SQL and pandas:
Expected O/P:
S_ID S_QTY S_PRD S_DT
1 223 AA 2018-06-02
1 223 AA 2018-06-03
1 223 AA 2018-06-04
2 442 FO 2018-05-10
2 442 FO 2018-05-11
2 442 FO 2018-05-12
3 771 WA 2018-07-07
3 771 WA 2018-07-08
3 771 WA 2018-07-09
3 771 WA 2018-07-10
4 663 AAG 2018-03-02
4 663 AAG 2018-03-03
Any suggestions here?
Use pop to extract the last two columns
Compute the per-row date ranges using pd.date_range
Repeat the remaining rows with ndarray.repeat, once per date in each range
Create the DataFrame, flatten the list of date ranges and assign it as the new S_DT column
from itertools import chain
v = [pd.date_range(x, y)
     for x, y in zip(df.pop('S_ST_DT'), df.pop('S_END_DT'))]

df = (pd.DataFrame(df.values.repeat([len(u) for u in v], axis=0),
                   columns=df.columns)
        .assign(S_DT=list(chain.from_iterable(v))))
print(df)
S_ID S_QTY S_PRD S_DT
0 1 223 AA 2018-06-02
1 1 223 AA 2018-06-03
2 1 223 AA 2018-06-04
3 2 442 FO 2018-05-10
4 2 442 FO 2018-05-11
5 2 442 FO 2018-05-12
6 3 771 WA 2018-07-07
7 3 771 WA 2018-07-08
8 3 771 WA 2018-07-09
9 3 771 WA 2018-07-10
10 4 663 AAG 2018-03-02
11 4 663 AAG 2018-03-03
Comprehension
pd.DataFrame(
    [t + [d] for *t, s, e in df.itertuples(index=False)
             for d in pd.date_range(s, e)],
    columns=df.columns[:-2].tolist() + ['S_DT']
)
S_ID S_QTY S_PRD S_DT
0 1 223 AA 2018-06-02
1 1 223 AA 2018-06-03
2 1 223 AA 2018-06-04
3 2 442 FO 2018-05-10
4 2 442 FO 2018-05-11
5 2 442 FO 2018-05-12
6 3 771 WA 2018-07-07
7 3 771 WA 2018-07-08
8 3 771 WA 2018-07-09
9 3 771 WA 2018-07-10
10 4 663 AAG 2018-03-02
11 4 663 AAG 2018-03-03
Alternate tuple iteration
pd.DataFrame(
    [t + [d] for *t, s, e in zip(*map(df.get, df))
             for d in pd.date_range(s, e)],
    columns=df.columns[:-2].tolist() + ['S_DT']
)
If the two date columns aren't at the end, do this ahead of time:
cols = ['S_ST_DT', 'S_END_DT']
df = df.drop(columns=cols).join(df[cols])
For legacy Python (<= 2.7):
pd.DataFrame(
    [t[:-2] + (d,) for t in zip(*map(df.get, df))
                   for d in pd.date_range(*t[-2:])],
    columns=df.columns[:-2].tolist() + ['S_DT']
)
Borrowing the setup for v from the first answer above:
from collections import ChainMap

# Map each date to the index of its originating row, then reindex and attach the dates.
d = dict(ChainMap(*map(dict.fromkeys, v, df.index)))
# or assign back: df = df.reindex(d.values()).assign(DT=d.keys()).sort_index()
df.reindex(d.values()).assign(DT=d.keys()).sort_index()
Out[281]:
S_ID S_QTY S_PRD DT
0 1 223 AA 2018-06-03
0 1 223 AA 2018-06-04
0 1 223 AA 2018-06-02
1 2 442 FO 2018-05-10
1 2 442 FO 2018-05-12
1 2 442 FO 2018-05-11
2 3 771 WA 2018-07-09
2 3 771 WA 2018-07-08
2 3 771 WA 2018-07-07
2 3 771 WA 2018-07-10
3 4 663 AAG 2018-03-02
3 4 663 AAG 2018-03-03
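With a more recent pandas (0.25 or later), a shorter sketch is to build one date range per row and use DataFrame.explode; this is an alternative to the answers above, not taken from them, and it starts again from the original frame that still has the S_ST_DT and S_END_DT columns:
# Sketch assuming pandas >= 0.25: one date_range per row, then explode to long form.
out = (df.assign(S_DT=[pd.date_range(s, e)
                       for s, e in zip(df['S_ST_DT'], df['S_END_DT'])])
         .drop(columns=['S_ST_DT', 'S_END_DT'])
         .explode('S_DT')
         .reset_index(drop=True))
print(out)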
DB2:
with a (S_ID, S_QTY, S_PRD, S_DT, S_END_DT) as (
select S_ID, S_QTY, S_PRD, S_ST_DT, S_END_DT from sales
union all
select S_ID, S_QTY, S_PRD, S_DT + 1 day, S_END_DT from a where S_DT<S_END_DT
)
select S_ID, S_QTY, S_PRD, S_DT
from a
order by S_ID, S_DT;
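This is a recursive common table expression: the first SELECT seeds each sale with its start date, and the recursive SELECT adds one day at a time until S_DT reaches S_END_DT, so the final SELECT returns one row per date in each range.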
Word ControversialPost TopPost
0 to 5756 4169
1 I 5717 4360
2 the 5416 4298
3 a 4929 3467
4 and 4071 2679
5 in 2814 1988
6 of 2771 1835
7 my 2325 1883
8 for 1989 1487
9 is 1961 1364
10 have 1713 1291
11 that 1552 1042
12 it 1452 1059
13 on 1404 1021
14 be 1302 1104
Above is my DataFrame. I want to sort by the difference between ControversialPost and TopPost. How would I do this?
I'm trying to do a sentiment analysis and see what words are most common and what words are not. Thanks!
You could just add an extra column to your DataFrame for the difference and sort by that column. Something like this: df['Score_diff'] = df['ControversialPost'] - df['TopPost']
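Spelled out, that suggestion looks roughly like this (keeping the Score_diff name from above; use ascending=False to put the largest gaps first):
# Add the difference column, then sort by it.
df['Score_diff'] = df['ControversialPost'] - df['TopPost']
df = df.sort_values('Score_diff', ascending=False)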
print(
df.assign(tmp=df["ControversialPost"] - df["TopPost"])
.sort_values(by="tmp")
.drop(columns="tmp")
)
Prints:
Word ControversialPost TopPost
14 be 1302 1104
13 on 1404 1021
12 it 1452 1059
10 have 1713 1291
7 my 2325 1883
8 for 1989 1487
11 that 1552 1042
9 is 1961 1364
5 in 2814 1988
6 of 2771 1835
2 the 5416 4298
1 I 5717 4360
4 and 4071 2679
3 a 4929 3467
0 to 5756 4169
Add a column for the difference, then sort_values will do it.
df = pd.DataFrame([[1000, 950], [400, 300], [100, 80]], columns=['ControversialPost', 'TopPost'])
df['difference'] = df['ControversialPost'] - df['TopPost']
df = df.sort_values('difference', ascending=False)
print(df)
ControversialPost TopPost difference
1 400 300 100
0 1000 950 50
2 100 80 20