This is my dataset: for each country I have different models, years, and the corresponding price and volume.
data_dic = {
    "Country": [1, 1, 1, 1, 2, 2, 2, 2],
    "Model": ["A", "B", "B", "A", "A", "B", "B", "A"],
    "Year": [2005, 2005, 2020, 2020, 2005, 2005, 2020, 2020],
    "Price": [100, 172, 852, 953, 350, 452, 658, 896],
    "Volume": [4, 8, 9, 10, 12, 6, 8, 9]
}
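For reference, a minimal sketch of how the frame below can be built and displayed (the sort keys are my inference from the row order shown):
import pandas as pd

df = pd.DataFrame(data_dic)
print(df.sort_values(['Model', 'Year', 'Country']))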
Country Model Year Price Volume
0 1 A 2005 100 4
4 2 A 2005 350 12
3 1 A 2020 953 10
7 2 A 2020 896 9
1 1 B 2005 172 8
5 2 B 2005 452 6
2 1 B 2020 852 9
6 2 B 2020 658 8
I would like to obtain the following, where 1) the column "Division_Price" is the division of Price for Country 1, Model A between the years 2005 and 2020, and 2) the column "Division_Volume" is the division of Volume for Country 1, Model A between the years 2005 and 2020.
data_dic2 = {
    "Country": [1, 1, 1, 1, 2, 2, 2, 2],
    "Model": ["A", "B", "B", "A", "A", "B", "B", "A"],
    "Year": [2005, 2005, 2020, 2020, 2005, 2005, 2020, 2020],
    "Price": [100, 172, 852, 953, 350, 452, 658, 896],
    "Volume": [4, 8, 9, 10, 12, 6, 8, 9],
    "Division_Price": [0.953, 4.95, 4.95, 0.953, 2.56, 1.45, 1.45, 2.56],
    "Division_Volume": [2.5, 1.125, 1.125, 2.5, 1, 1.33, 1.33, 1],
}
print(pd.DataFrame(data_dic2).sort_values(['Model', 'Year', 'Country']))
Country Model Year Price Volume Division_Price Division_Volume
0 1 A 2005 100 4 0.953 2.500
4 2 A 2005 350 12 2.560 1.000
3 1 A 2020 953 10 0.953 2.500
7 2 A 2020 896 9 2.560 1.000
1 1 B 2005 172 8 4.950 1.125
5 2 B 2005 452 6 1.450 1.330
2 1 B 2020 852 9 4.950 1.125
6 2 B 2020 658 8 1.450 1.330
My whole dataset has up to 50 countries and up to 10 models, with years ranging from 1990 to 2030.
I am still unsure how to account for conditions on three columns at once, so that I can automatically divide the Price and Volume columns based on Country, Year and Model.
Thanks!
You can try the following, using df.pivot, df.stack() and df.merge:
>>> df2 = (df.pivot(index='Year', columns=['Model', 'Country'], values=['Price', 'Volume'])
           .diff().bfill(downcast='infer').abs().stack().stack()
           .sort_index(level=-1).add_prefix('Difference_'))
>>> df2
Difference_Price Difference_Volume
Year Country Model
2005 1 A 853 6
2 A 546 3
2020 1 A 853 6
2 A 546 3
2005 1 B 680 1
2 B 206 2
2020 1 B 680 1
2 B 206 2
>>> df.merge(df2, on=['Country', 'Model', 'Year'], how='right')
Country Model Year Price Volume Difference_Price Difference_Volume
0 1 A 2005 100 4 853 6
1 2 A 2005 350 12 546 3
2 1 A 2020 953 10 853 6
3 2 A 2020 896 9 546 3
4 1 B 2005 172 8 680 1
5 2 B 2005 452 6 206 2
6 1 B 2020 852 9 680 1
7 2 B 2020 658 8 206 2
EDIT:
For your new dataframe, I think the 0.953 should be 9.530 (i.e. 953/100). If so, you can use pct_change and add 1:
>>> df2 = (df.pivot(index='Year', columns=['Model', 'Country'], values=['Price', 'Volume'])
           .pct_change(1).add(1).bfill(downcast='infer').abs().stack().stack()
           .sort_index(level=-1).add_prefix('Division_').round(3))
>>> df2
Division_Price Division_Volume
Year Country Model
2005 1 A 9.530 2.500
2 A 2.560 0.750
2020 1 A 9.530 2.500
2 A 2.560 0.750
2005 1 B 4.953 1.125
2 B 1.456 1.333
2020 1 B 4.953 1.125
2 B 1.456 1.333
>>> df.merge(df2, on=['Country', 'Model', 'Year'], how='right')
Country Model Year Price Volume Division_Price Division_Volume
0 1 A 2005 100 4 9.530 2.500
1 2 A 2005 350 12 2.560 0.750
2 1 A 2020 953 10 9.530 2.500
3 2 A 2020 896 9 2.560 0.750
4 1 B 2005 172 8 4.953 1.125
5 2 B 2005 452 6 1.456 1.333
6 1 B 2020 852 9 4.953 1.125
7 2 B 2020 658 8 1.456 1.333
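If you prefer to stay in long format without pivoting, here is a minimal groupby/transform sketch. It assumes, as in your example, that each (Country, Model) pair has exactly one row per year, and it follows the 9.530 reading above (latest year divided by earliest year):
import pandas as pd

df = pd.DataFrame(data_dic).sort_values('Year')
g = df.groupby(['Country', 'Model'])
# divide each group's latest-year value by its earliest-year value
# and broadcast the ratio back to every row of the group
df['Division_Price'] = g['Price'].transform(lambda s: s.iloc[-1] / s.iloc[0])
df['Division_Volume'] = g['Volume'].transform(lambda s: s.iloc[-1] / s.iloc[0])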
I have performed a groupby on my dataframe.
grouped = data_df.groupby(['Cluster','Visit Number Final'])['Visitor_ID'].count()
I am getting the output below:
data_df.groupby(['Cluster','Visit Number Final'])['Visitor_ID'].count()
Out[81]:
Cluster Visit Number Final
0 1 21846
2 1485
3 299
4 95
5 24
6 8
7 3
1 1 33600
2 2283
3 404
4 117
5 34
6 7
2 1 5858
2 311
3 55
4 14
5 6
6 3
7 1
3 1 19699
2 1101
3 214
4 78
5 14
6 8
7 3
4 1 10086
2 344
3 59
4 14
5 3
6 1
Name: Visitor_ID, dtype: int64
Now I want to compress the rows whose Visit Number Final is greater than 3 (i.e., add a single row that holds the sum for Visit Number Final 4, 5, 6, ...). I am trying groupby.filter but I am not getting the expected output.
My final output should look like
Cluster Visit Number Final
0 1 21846
2 1485
3 299
>=4 130
1 1 33600
2 2283
3 404
>=4 158
2 1 5858
2 311
3 55
>=4 24
3 1 19699
2 1101
3 214
>=4 103
4 1 10086
2 344
3 59
>=4 18
The easiest way is to replace the 'Visit Number Final' values bigger than 3 before you group the dataframe:
df.loc[df['Visit Number Final'] > 3, 'Visit Number Final'] = '>=4'
df.groupby(['Cluster','Visit Number Final'])['Visitor_ID'].count()
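If you would rather not overwrite the column in place, a sketch using pd.cut achieves the same grouping (the bin edges and labels below are my own choice):
import pandas as pd

# bin visit numbers 1, 2, 3 individually; lump everything above 3 together
bins = [0, 1, 2, 3, float('inf')]
labels = ['1', '2', '3', '>=4']
visit_bin = pd.cut(data_df['Visit Number Final'], bins=bins, labels=labels)
data_df.groupby(['Cluster', visit_bin])['Visitor_ID'].count()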
Try this (here df is assumed to be the grouped result above as a dataframe, e.g. df = grouped.to_frame('Number Final')):
import numpy as np

visit_val = df.index.get_level_values(1)
# relabel every visit number above 3 as '>=4'
grp = np.where(visit_val > 3, '>=4', visit_val)
(df.groupby(['Cluster', grp])['Number Final'].sum()
   .reset_index().rename(columns={'level_1': 'Visit'}))
Output:
Cluster Visit Number Final
0 0 1 21846
1 0 2 1485
2 0 3 299
3 0 >=4 130
4 1 1 33600
5 1 2 2283
6 1 3 404
7 1 >=4 158
8 2 1 5858
9 2 2 311
10 2 3 55
11 2 >=4 24
12 3 1 19699
13 3 2 1101
14 3 3 214
15 3 >=4 103
16 4 1 10086
17 4 2 344
18 4 3 59
19 4 >=4 18
Or, to keep the result as a dataframe with a MultiIndex:
(df.groupby(['Cluster',grp])['Number Final'].sum()
.rename_axis(['Cluster','Visit']).to_frame())
Output:
Number Final
Cluster Visit
0 1 21846
2 1485
3 299
>=4 130
1 1 33600
2 2283
3 404
>=4 158
2 1 5858
2 311
3 55
>=4 24
3 1 19699
2 1101
3 214
>=4 103
4 1 10086
2 344
3 59
>=4 18
I have a dataframe like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'dir': [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
                   'price': np.random.randint(100, 200, 10)})
dir price
0 1 100
1 1 150
2 1 190
3 1 194
4 0 152
5 0 151
6 1 131
7 1 168
8 1 112
9 0 193
and I want a new column that shows the maximum price over each consecutive run where dir is 1, and resets (to NaN) where dir is 0.
My desired outcome looks like this:
dir price max
0 1 100 194
1 1 150 194
2 1 190 194
3 1 194 194
4 0 152 NaN
5 0 151 NaN
6 1 131 168
7 1 168 168
8 1 112 168
9 0 193 NaN
Use transform with max on the filtered rows:
# label each consecutive run of equal dir values with its own group id
g = df['dir'].ne(df['dir'].shift()).cumsum()
# mask for the rows where dir is 1
m = df['dir'] == 1
df['max'] = df[m].groupby(g)['price'].transform('max')
print(df)
dir price max
0 1 100 194.0
1 1 150 194.0
2 1 190 194.0
3 1 194 194.0
4 0 152 NaN
5 0 151 NaN
6 1 131 168.0
7 1 168 168.0
8 1 112 168.0
9 0 193 NaN
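As a side note, if you wanted the maximum so far within each run (growing row by row) rather than the run's overall maximum, swapping transform('max') for cummax on the same filtered groupby should work (a sketch; 'run_max' is a made-up column name):
# running maximum within each consecutive run of dir == 1
df['run_max'] = df[m].groupby(g)['price'].cummax()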
I'm trying to create a new column, let's call it "HomeForm", that is the sum of the last 5 values of "FTHG" for each of the entries in the "HomeTeam" column.
Say for Team 0, the idea would be to populate the cell in the new column with the sum of the last 5 values of "FTHG" that correspond to Team 0. The table is ordered by date.
How can this be done in Python?
HomeTeam FTHG HomeForm
Date
136 0 4
135 2 0
135 4 2
135 5 0
135 6 1
135 13 0
135 17 3
135 18 1
134 11 4
134 12 0
128 1 0
128 3 0
128 8 2
128 9 1
128 13 3
128 14 1
128 15 0
127 7 1
127 16 1
126 10 1
Thanks.
You'll group by HomeTeam and perform a rolling sum, summing over a minimum of 1 period and a maximum of 5.
First, define a function -
def f(x):
return x.shift().rolling(window=5, min_periods=1).sum()
This function computes the rolling sum of the previous 5 games (the shift excludes the current game). Pass it to GroupBy.transform -
df['HomeForm'] = df.groupby('HomeTeam', sort=False).FTHG.transform(f)
df
HomeTeam FTHG HomeForm
Date
136 0 4 NaN
135 2 0 NaN
135 4 2 NaN
135 5 0 NaN
135 6 1 NaN
135 13 0 NaN
135 17 3 NaN
135 18 1 NaN
134 11 4 NaN
134 12 0 NaN
128 1 0 NaN
128 3 0 NaN
128 8 2 NaN
128 9 1 NaN
128 13 3 0.0
128 14 1 NaN
128 15 0 NaN
127 7 1 NaN
127 16 1 NaN
126 10 1 NaN
If needed, fill the NaNs with zeros and convert to integer -
df['HomeForm'] = df['HomeForm'].fillna(0).astype(int)
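Equivalently, the helper and the fill step can be folded into one statement with a lambda; this is the same logic, just more compact:
df['HomeForm'] = (df.groupby('HomeTeam', sort=False).FTHG
                    .transform(lambda x: x.shift().rolling(window=5, min_periods=1).sum())
                    .fillna(0).astype(int))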
I have this file
0 0 716
0 1 851
0 2 900
1 0 724
1 1 857
1 2 903
2 0 812
2 1 858
2 2 902
3 0 799
3 1 852
3 2 905
4 0 833
4 1 871
4 2 907
5 0 940
5 1 955
5 2 995
6 0 941
6 1 956
6 2 996
7 0 942
7 1 957
7 2 999
8 0 944
8 1 958
8 2 992
9 0 946
9 1 952
9 2 998
I want to write the rows out in this order (first values 0-4, then 5-9, sorted by the second column within each half):
0 0 716
1 0 724
2 0 812
3 0 799
4 0 833
0 1 851
1 1 857
2 1 858
3 1 852
4 1 871
0 2 900
1 2 903
2 2 902
3 2 905
4 2 907
5 0 940
6 0 941
7 0 942
8 0 944
9 0 946
5 1 955
6 1 956
7 1 957
8 1 958
9 1 952
5 2 995
6 2 996
7 2 999
8 2 992
9 2 998
I have read the file:
l = [line.rstrip('\n') for line in open('test.txt')]
Now I am stuck: how do I read this as a 3D array? Using enumerate does not work, because it adds its own index as the first value, and I do not need that.
This works:
with open('input.txt') as infile:
    rows = [list(map(int, line.split())) for line in infile]

def part(minval, maxval):
    # rows whose first value lies between minval and maxval
    return [r for r in rows if minval <= r[0] <= maxval]

with open('output.txt', 'w') as outfile:
    for half in [part(0, 4), part(5, 9)]:
        # sort each half by the second column, then the first
        half.sort(key=lambda r: (r[1], r[0], r[2]))
        for row in half:
            outfile.write('%s %s %s\n' % tuple(row))
Let me know if you have questions.
It would be very simple if you could use the pandas module:
import pandas as pd
fn = r'D:\temp\.data\37146154.txt'
df = pd.read_csv(fn, delim_whitespace=True, header=None, names=['col1','col2','col3'])
df.sort_values(['col2','col1','col3'])
If you want to write it back to a new file:
df.sort_values(['col2','col1','col3']).to_csv('new_file', sep='\t', index=False, header=False)
Test:
In [15]: df.sort_values(['col2','col1','col3'])
Out[15]:
col1 col2 col3
0 0 0 716
3 1 0 724
6 2 0 812
9 3 0 799
12 4 0 833
15 5 0 940
18 6 0 941
21 7 0 942
24 8 0 944
27 9 0 946
1 0 1 851
4 1 1 857
7 2 1 858
10 3 1 852
13 4 1 871
16 5 1 955
19 6 1 956
22 7 1 957
25 8 1 958
28 9 1 952
2 0 2 900
5 1 2 903
8 2 2 902
11 3 2 905
14 4 2 907
17 5 2 995
20 6 2 996
23 7 2 999
26 8 2 992
29 9 2 998
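Note that the three-key sort interleaves all ten col1 values, whereas the desired output in the question keeps rows with col1 0-4 ahead of rows with col1 5-9. If you need that exact order, a sketch using a temporary half indicator (the 'half' column name is made up):
df['half'] = (df['col1'] >= 5).astype(int)
df.sort_values(['half', 'col2', 'col1']).drop(columns='half')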