This is my dataset: for each country I have different models, years, and the corresponding price and volume.
data_dic = {
    "Country": [1, 1, 1, 1, 2, 2, 2, 2],
    "Model": ["A", "B", "B", "A", "A", "B", "B", "A"],
    "Year": [2005, 2005, 2020, 2020, 2005, 2005, 2020, 2020],
    "Price": [100, 172, 852, 953, 350, 452, 658, 896],
    "Volume": [4, 8, 9, 10, 12, 6, 8, 9]
}
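For reference, a minimal sketch of how the frame below can be built and displayed (the sort keys are my inference from the row order shown):
import pandas as pd

df = pd.DataFrame(data_dic)
print(df.sort_values(['Model', 'Year', 'Country']))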
Country Model Year Price Volume
0 1 A 2005 100 4
4 2 A 2005 350 12
3 1 A 2020 953 10
7 2 A 2020 896 9
1 1 B 2005 172 8
5 2 B 2005 452 6
2 1 B 2020 852 9
6 2 B 2020 658 8
I would like to obtain the following, where 1) the column "Division_Price" is the division of Price for Country 1, Model A between the years 2005 and 2020, and 2) the column "Division_Volume" is the division of Volume for Country 1, Model A between the years 2005 and 2020.
data_dic2 = {
    "Country": [1, 1, 1, 1, 2, 2, 2, 2],
    "Model": ["A", "B", "B", "A", "A", "B", "B", "A"],
    "Year": [2005, 2005, 2020, 2020, 2005, 2005, 2020, 2020],
    "Price": [100, 172, 852, 953, 350, 452, 658, 896],
    "Volume": [4, 8, 9, 10, 12, 6, 8, 9],
    "Division_Price": [0.953, 4.95, 4.95, 0.953, 2.56, 1.45, 1.45, 2.56],
    "Division_Volume": [2.5, 1.125, 1.125, 2.5, 1, 1.33, 1.33, 1],
}
print(pd.DataFrame(data_dic2).sort_values(['Model', 'Year', 'Country']))
Country Model Year Price Volume Division_Price Division_Volume
0 1 A 2005 100 4 0.953 2.500
4 2 A 2005 350 12 2.560 1.000
3 1 A 2020 953 10 0.953 2.500
7 2 A 2020 896 9 2.560 1.000
1 1 B 2005 172 8 4.950 1.125
5 2 B 2005 452 6 1.450 1.330
2 1 B 2020 852 9 4.950 1.125
6 2 B 2020 658 8 1.450 1.330
My whole dataset has up to 50 countries and up to 10 models, with years ranging from 1990 to 2030.
I am still unsure how to account for conditions on three columns at once, so that I can automatically divide the Price and Volume columns based on Country, Year and Model.
Thanks!
You can try the following, using df.pivot, df.stack() and df.merge:
>>> df2 = (df.pivot(index='Year', columns=['Model', 'Country'], values=['Price', 'Volume'])
           .diff().bfill(downcast='infer').abs().stack().stack()
           .sort_index(level=-1).add_prefix('Difference_'))
>>> df2
Difference_Price Difference_Volume
Year Country Model
2005 1 A 853 6
2 A 546 3
2020 1 A 853 6
2 A 546 3
2005 1 B 680 1
2 B 206 2
2020 1 B 680 1
2 B 206 2
>>> df.merge(df2, on=['Country', 'Model', 'Year'], how='right')
Country Model Year Price Volume Difference_Price Difference_Volume
0 1 A 2005 100 4 853 6
1 2 A 2005 350 12 546 3
2 1 A 2020 953 10 853 6
3 2 A 2020 896 9 546 3
4 1 B 2005 172 8 680 1
5 2 B 2005 452 6 206 2
6 1 B 2020 852 9 680 1
7 2 B 2020 658 8 206 2
EDIT:
For your new dataframe, I think the 0.953 should be 9.530 (i.e. 953/100). If so, you can use pct_change and add 1:
>>> df2 = (df.pivot(index='Year', columns=['Model', 'Country'], values=['Price', 'Volume'])
           .pct_change(1).add(1).bfill(downcast='infer').abs().stack().stack()
           .sort_index(level=-1).add_prefix('Division_').round(3))
>>> df2
Division_Price Division_Volume
Year Country Model
2005 1 A 9.530 2.500
2 A 2.560 0.750
2020 1 A 9.530 2.500
2 A 2.560 0.750
2005 1 B 4.953 1.125
2 B 1.456 1.333
2020 1 B 4.953 1.125
2 B 1.456 1.333
>>> df.merge(df2, on=['Country', 'Model', 'Year'], how='right')
Country Model Year Price Volume Division_Price Division_Volume
0 1 A 2005 100 4 9.530 2.500
1 2 A 2005 350 12 2.560 0.750
2 1 A 2020 953 10 9.530 2.500
3 2 A 2020 896 9 2.560 0.750
4 1 B 2005 172 8 4.953 1.125
5 2 B 2005 452 6 1.456 1.333
6 1 B 2020 852 9 4.953 1.125
7 2 B 2020 658 8 1.456 1.333
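If you prefer to stay in long format without pivoting, here is a minimal groupby/transform sketch. It assumes, as in your example, that each (Country, Model) pair has exactly one row per year, and it follows the 9.530 reading above (latest year divided by earliest year):
import pandas as pd

df = pd.DataFrame(data_dic).sort_values('Year')
g = df.groupby(['Country', 'Model'])
# divide each group's latest-year value by its earliest-year value
# and broadcast the ratio back to every row of the group
df['Division_Price'] = g['Price'].transform(lambda s: s.iloc[-1] / s.iloc[0])
df['Division_Volume'] = g['Volume'].transform(lambda s: s.iloc[-1] / s.iloc[0])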
I have performed a groupby on my dataframe.
grouped = data_df.groupby(['Cluster','Visit Number Final'])['Visitor_ID'].count()
I am getting the output below:
data_df.groupby(['Cluster','Visit Number Final'])['Visitor_ID'].count()
Out[81]:
Cluster Visit Number Final
0 1 21846
2 1485
3 299
4 95
5 24
6 8
7 3
1 1 33600
2 2283
3 404
4 117
5 34
6 7
2 1 5858
2 311
3 55
4 14
5 6
6 3
7 1
3 1 19699
2 1101
3 214
4 78
5 14
6 8
7 3
4 1 10086
2 344
3 59
4 14
5 3
6 1
Name: Visitor_ID, dtype: int64
Now I want to compress the rows whose Visit Number Final is greater than 3 (i.e., add a single row that holds the sum for Visit Number Final 4, 5, 6, ...). I am trying groupby.filter but I am not getting the expected output.
My final output should look like
Cluster Visit Number Final
0 1 21846
2 1485
3 299
>=4 130
1 1 33600
2 2283
3 404
>=4 158
2 1 5858
2 311
3 55
>=4 24
3 1 19699
2 1101
3 214
>=4 103
4 1 10086
2 344
3 59
>=4 18
The easiest way is to replace the 'Visit Number Final' values bigger than 3 before you group the dataframe:
df.loc[df['Visit Number Final'] > 3, 'Visit Number Final'] = '>=4'
df.groupby(['Cluster','Visit Number Final'])['Visitor_ID'].count()
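If you would rather not overwrite the column in place, a sketch using pd.cut achieves the same grouping (the bin edges and labels below are my own choice):
import pandas as pd

# bin visit numbers 1, 2, 3 individually; lump everything above 3 together
bins = [0, 1, 2, 3, float('inf')]
labels = ['1', '2', '3', '>=4']
visit_bin = pd.cut(data_df['Visit Number Final'], bins=bins, labels=labels)
data_df.groupby(['Cluster', visit_bin])['Visitor_ID'].count()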
Try this (here df is assumed to be the grouped result above as a dataframe, e.g. df = grouped.to_frame('Number Final')):
import numpy as np

visit_val = df.index.get_level_values(1)
# relabel every visit number above 3 as '>=4'
grp = np.where(visit_val > 3, '>=4', visit_val)
(df.groupby(['Cluster', grp])['Number Final'].sum()
   .reset_index().rename(columns={'level_1': 'Visit'}))
Output:
Cluster Visit Number Final
0 0 1 21846
1 0 2 1485
2 0 3 299
3 0 >=4 130
4 1 1 33600
5 1 2 2283
6 1 3 404
7 1 >=4 158
8 2 1 5858
9 2 2 311
10 2 3 55
11 2 >=4 24
12 3 1 19699
13 3 2 1101
14 3 3 214
15 3 >=4 103
16 4 1 10086
17 4 2 344
18 4 3 59
19 4 >=4 18
Or, to keep the result as a dataframe with a MultiIndex:
(df.groupby(['Cluster',grp])['Number Final'].sum()
.rename_axis(['Cluster','Visit']).to_frame())
Output:
Number Final
Cluster Visit
0 1 21846
2 1485
3 299
>=4 130
1 1 33600
2 2283
3 404
>=4 158
2 1 5858
2 311
3 55
>=4 24
3 1 19699
2 1101
3 214
>=4 103
4 1 10086
2 344
3 59
>=4 18
I have a dataframe like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'dir': [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
                   'price': np.random.randint(100, 200, 10)})
dir price
0 1 100
1 1 150
2 1 190
3 1 194
4 0 152
5 0 151
6 1 131
7 1 168
8 1 112
9 0 193
and I want a new column that shows the maximum price over each consecutive run where dir is 1, and resets (to NaN) where dir is 0.
My desired outcome looks like this:
dir price max
0 1 100 194
1 1 150 194
2 1 190 194
3 1 194 194
4 0 152 NaN
5 0 151 NaN
6 1 131 168
7 1 168 168
8 1 112 168
9 0 193 NaN
Use transform with max on the filtered rows:
# label each consecutive run of equal dir values with its own group id
g = df['dir'].ne(df['dir'].shift()).cumsum()
# mask for the rows where dir is 1
m = df['dir'] == 1
df['max'] = df[m].groupby(g)['price'].transform('max')
print(df)
dir price max
0 1 100 194.0
1 1 150 194.0
2 1 190 194.0
3 1 194 194.0
4 0 152 NaN
5 0 151 NaN
6 1 131 168.0
7 1 168 168.0
8 1 112 168.0
9 0 193 NaN
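As a side note, if you wanted the maximum so far within each run (growing row by row) rather than the run's overall maximum, swapping transform('max') for cummax on the same filtered groupby should work (a sketch; 'run_max' is a made-up column name):
# running maximum within each consecutive run of dir == 1
df['run_max'] = df[m].groupby(g)['price'].cummax()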
I'm trying to create a new column, let's call it "HomeForm", that is the sum of the last 5 values of "FTHG" for each of the entries in the "HomeTeam" column.
Say for Team 0, the idea would be to populate the cell in the new column with the sum of the last 5 values of "FTHG" that correspond to Team 0. The table is ordered by date.
How can this be done in Python?
HomeTeam FTHG HomeForm
Date
136 0 4
135 2 0
135 4 2
135 5 0
135 6 1
135 13 0
135 17 3
135 18 1
134 11 4
134 12 0
128 1 0
128 3 0
128 8 2
128 9 1
128 13 3
128 14 1
128 15 0
127 7 1
127 16 1
126 10 1
Thanks.
You'll group by HomeTeam and perform a rolling sum, summing over a minimum of 1 period and a maximum of 5.
First, define a function -
def f(x):
return x.shift().rolling(window=5, min_periods=1).sum()
This function computes the rolling sum of the previous 5 games (the shift excludes the current game). Pass it to GroupBy.transform -
df['HomeForm'] = df.groupby('HomeTeam', sort=False).FTHG.transform(f)
df
HomeTeam FTHG HomeForm
Date
136 0 4 NaN
135 2 0 NaN
135 4 2 NaN
135 5 0 NaN
135 6 1 NaN
135 13 0 NaN
135 17 3 NaN
135 18 1 NaN
134 11 4 NaN
134 12 0 NaN
128 1 0 NaN
128 3 0 NaN
128 8 2 NaN
128 9 1 NaN
128 13 3 0.0
128 14 1 NaN
128 15 0 NaN
127 7 1 NaN
127 16 1 NaN
126 10 1 NaN
If needed, fill the NaNs with zeros and convert to integer -
df['HomeForm'] = df['HomeForm'].fillna(0).astype(int)
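Equivalently, the helper and the fill step can be folded into one statement with a lambda; this is the same logic, just more compact:
df['HomeForm'] = (df.groupby('HomeTeam', sort=False).FTHG
                    .transform(lambda x: x.shift().rolling(window=5, min_periods=1).sum())
                    .fillna(0).astype(int))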
I have this file
0 0 716
0 1 851
0 2 900
1 0 724
1 1 857
1 2 903
2 0 812
2 1 858
2 2 902
3 0 799
3 1 852
3 2 905
4 0 833
4 1 871
4 2 907
5 0 940
5 1 955
5 2 995
6 0 941
6 1 956
6 2 996
7 0 942
7 1 957
7 2 999
8 0 944
8 1 958
8 2 992
9 0 946
9 1 952
9 2 998
I want to write the rows out in this order (first values 0-4, then 5-9, sorted by the second column within each half):
0 0 716
1 0 724
2 0 812
3 0 799
4 0 833
0 1 851
1 1 857
2 1 858
3 1 852
4 1 871
0 2 900
1 2 903
2 2 902
3 2 905
4 2 907
5 0 940
6 0 941
7 0 942
8 0 944
9 0 946
5 1 955
6 1 956
7 1 957
8 1 958
9 1 952
5 2 995
6 2 996
7 2 999
8 2 992
9 2 998
I have read the file:
l = [line.rstrip('\n') for line in open('test.txt')]
Now I am stuck: how do I read this as a 3D array? Using enumerate does not work, because it adds its own index as the first value, and I do not need that.
This works:
with open('input.txt') as infile:
    rows = [list(map(int, line.split())) for line in infile]

def part(minval, maxval):
    # rows whose first value lies between minval and maxval
    return [r for r in rows if minval <= r[0] <= maxval]

with open('output.txt', 'w') as outfile:
    for half in [part(0, 4), part(5, 9)]:
        # sort each half by the second column, then the first
        half.sort(key=lambda r: (r[1], r[0], r[2]))
        for row in half:
            outfile.write('%s %s %s\n' % tuple(row))
Let me know if you have questions.
It would be very simple if you could use the pandas module:
import pandas as pd
fn = r'D:\temp\.data\37146154.txt'
df = pd.read_csv(fn, delim_whitespace=True, header=None, names=['col1','col2','col3'])
df.sort_values(['col2','col1','col3'])
If you want to write it back to a new file:
df.sort_values(['col2','col1','col3']).to_csv('new_file', sep='\t', index=False, header=False)
Test:
In [15]: df.sort_values(['col2','col1','col3'])
Out[15]:
col1 col2 col3
0 0 0 716
3 1 0 724
6 2 0 812
9 3 0 799
12 4 0 833
15 5 0 940
18 6 0 941
21 7 0 942
24 8 0 944
27 9 0 946
1 0 1 851
4 1 1 857
7 2 1 858
10 3 1 852
13 4 1 871
16 5 1 955
19 6 1 956
22 7 1 957
25 8 1 958
28 9 1 952
2 0 2 900
5 1 2 903
8 2 2 902
11 3 2 905
14 4 2 907
17 5 2 995
20 6 2 996
23 7 2 999
26 8 2 992
29 9 2 998
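Note that the three-key sort interleaves all ten col1 values, whereas the desired output in the question keeps rows with col1 0-4 ahead of rows with col1 5-9. If you need that exact order, a sketch using a temporary half indicator (the 'half' column name is made up):
df['half'] = (df['col1'] >= 5).astype(int)
df.sort_values(['half', 'col2', 'col1']).drop(columns='half')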