As shown in the picture above, there are two datasets that do not have the same row count. The task is to compare the distance between each pair of cities against the range each vehicle can travel, i.e. compare the distance between city1 & city2 with every vehicle type's range.
Let's assume that we have two dictionaries (if you have a DataFrame, you can use the to_dict() method to convert it to a dictionary):
vehicles = {'A320': 5000, 'A330': 8000, 'B737': 5000, 'B747': 10000, 'Q400': 1500, 'ATR72': 1000}
city_distances = {'AA-BB': 3000, 'BB-CC': 6500, 'CC-AA': 400, 'AA-DD': 1000}
You can simply create a nested for loop and check whatever condition you want. I, for example, checked whether the vehicle could travel the city route:
for city_route in city_distances.keys():
    for vehicle in vehicles.keys():
        if vehicles[vehicle] >= city_distances[city_route]:
            print(f'Vehicle {vehicle} can travel the {city_route} route')
        else:
            print(f"Vehicle {vehicle} can't travel the {city_route} route")
Well, you didn't tell/show us the expected output, so I'll give you the code to .merge() both DataFrames; from there you can do pretty much everything you want:
df3 = df2.merge(df1, how='cross')
df3
Truncated result:
index  city_a  city_b  vehicles  range
0      AA-BB     3000  A320       5000
1      AA-BB     3000  A330       8000
2      AA-BB     3000  B737       5000
3      AA-BB     3000  B747      10000
4      AA-BB     3000  Q400       1500
5      AA-BB     3000  ATR72      1000
6      BB-CC     6500  A320       5000
7      BB-CC     6500  A330       8000
...
16     CC-AA      400  Q400       1500
17     CC-AA      400  ATR72      1000
18     AA-DD     1000  A320       5000
19     AA-DD     1000  A330       8000
20     AA-DD     1000  B737       5000
21     AA-DD     1000  B747      10000
22     AA-DD     1000  Q400       1500
23     AA-DD     1000  ATR72      1000
I want to extract from a dataset the amount accumulated by each strategy, according to the transactions between the strategies (from or to):
import pandas as pd
df = pd.DataFrame({"value": [1000, 4000, 2000, 3000],
"out": ["cash", "cash", "lending", "DCA"],
"in": ["DCA", "lending", "cash", "lending"]})
value out in
0 1000 cash DCA
1 4000 cash lending
2 2000 lending cash
3 3000 DCA lending
What I expect to do:
value out in cash lending DCA
0 1000 cash DCA -1000 0 1000
1 4000 cash lending -5000 4000 1000
2 2000 lending cash -3000 2000 1000
3 3000 DCA lending -3000 5000 -2000
I don't know how to approach the problem.
Any help would be appreciated.
You can try it like this:
import pandas as pd
df = pd.DataFrame({"value": [1000, 4000, 2000, 3000],
"out": ["cash", "cash", "lending", "DCA"],
"in": ["DCA", "lending", "cash", "lending"]})
# get strategies from data source and create an account for each
accounts = {strat: 0 for strat in list(df["out"]) + list(df["in"])}
# add new columns for each strategy to dataframe
for strat in accounts.keys():
    df[strat] = 0

# loop through transactions and enter values to accounts
for i, t in df.iterrows():
    accounts[t["out"]] -= t["value"]
    accounts[t["in"]] += t["value"]
    for strat, v in accounts.items():
        df.loc[i, strat] = v
print(df)
Output:
value out in cash lending DCA
0 1000 cash DCA -1000 0 1000
1 4000 cash lending -5000 4000 1000
2 2000 lending cash -3000 2000 1000
3 3000 DCA lending -3000 5000 -2000
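For larger frames, a vectorized variant is possible too. A minimal sketch (not part of the original answer), building signed amounts per strategy with get_dummies and taking a cumulative sum:

import pandas as pd

df = pd.DataFrame({"value": [1000, 4000, 2000, 3000],
                   "out": ["cash", "cash", "lending", "DCA"],
                   "in": ["DCA", "lending", "cash", "lending"]})

# signed contribution of each transaction to every strategy account
out_amt = pd.get_dummies(df["out"], dtype=int).mul(-df["value"], axis=0)
in_amt = pd.get_dummies(df["in"], dtype=int).mul(df["value"], axis=0)

# running balance per strategy after each transaction
running = out_amt.add(in_amt, fill_value=0).cumsum()
print(pd.concat([df, running], axis=1))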
I'm using pandas.qcut and pandas.cut to distribute items uniformly across deciles based on a calculated probability. However, the items do not get distributed uniformly across deciles in either case. Below is my code for each case:
pd.qcut(df['prob'], q=10, labels=False, duplicates='drop')
pd.cut(df['prob'], bins=10, labels=False)
Below is what I get in each case:
for pd.qcut:
Decile Count of Items
0 20300
1 7000
2 13800
3 14000
4 13000
5 13800
6 13700
7 14600
8 19000
9 70000
for pd.cut:
Decile Count of Items
0 1700
1 19000
2 39000
3 39000
4 32000
5 3100
6 3000
7 100
8 20
9 25
I didn't put the exact numbers, but the magnitude should give an idea. The probability ranges from 0.01 to 0.15.
How can I distribute the items evenly across deciles?
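For reference (not part of the thread), one common workaround is to rank the values first so that tied probabilities cannot pile up in a single quantile; qcut on the ranks then yields equal-count deciles:

import numpy as np
import pandas as pd

# Hypothetical data standing in for the real 'prob' column.
df = pd.DataFrame({'prob': np.random.uniform(0.01, 0.15, 10_000)})

# rank(method='first') makes every value unique, so qcut can always
# cut the data into ten equally sized bins.
df['decile'] = pd.qcut(df['prob'].rank(method='first'), q=10, labels=False)
print(df['decile'].value_counts().sort_index())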
Consider the following dataframe created from a dictionary
d = { 'p_symbol': ['A','B','C','D','E']
    , 'p_volume': [0,0,0,0,0]
    , 'p_exchange': ['IEXG', 'ASE', 'PSE', 'NAS', 'NYS']
    , 'p_volume_rh': [1000,1000,1000,1000,1000]
    , 'p_volume.1': [2000,2000,2000,2000,2000]
    , 'p_volume.2': [3000,3000,3000,3000,3000]
    , 'p_volume.3': [4000,4000,4000,4000,4000]
    , 'p_volume.4': [5000,5000,5000,5000,5000]
    }
snapshot = pd.DataFrame(d)
I need to set the value in p_volume to be the value in one of the last 5 p_volume* columns based on the value in p_exchange. I need to do it this way due to the way data is being returned from a third party vendor API over which I have no control.
I have tried setting up a dictionary that, given the value in p_exchange, gives me the column name, and tried the following code:
us_primary_exchange_map = {
      "NYS": "xp_volume_rh"
    , "NAS": "xp_volume.1"
    , "PSE": "xp_volume.2"
    , "ASE": "xp_volume.3"
    , "IEXG": "xp_volume.4"
}
snapshot["p_volume"] = snapshot[us_primary_exchange_map[snapshot["p_exchange"]]])
But this does not work...
Can someone help me out here? Is there a straightforward way to do this without having to iterate over the rows?
I hope I've understood your question right (and xp_volume_* is a typo, should be p_volume_* without x?):
snapshot['p_volume'] = snapshot.lookup(snapshot.index, snapshot['p_exchange'].map(us_primary_exchange_map))
print(snapshot)
Prints:
p_symbol p_volume p_exchange ... p_volume.2 p_volume.3 p_volume.4
0 A 5000 IEXG ... 3000 4000 5000
1 B 4000 ASE ... 3000 4000 5000
2 C 3000 PSE ... 3000 4000 5000
3 D 2000 NAS ... 3000 4000 5000
4 E 1000 NYS ... 3000 4000 5000
[5 rows x 8 columns]
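Note that DataFrame.lookup has since been deprecated and removed in newer pandas versions; a minimal replacement sketch (still assuming the map points at the real p_volume_* columns, i.e. without the leading x) could be:

# Per-row column lookup without DataFrame.lookup (removed in newer pandas).
cols = snapshot['p_exchange'].map(us_primary_exchange_map)
snapshot['p_volume'] = [snapshot.at[i, c] for i, c in zip(snapshot.index, cols)]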
You can use pandas.DataFrame.apply with the argument axis=1 to apply a function to the dataframe rows:
snapshot['p_volume'] = snapshot.apply(lambda row: snapshot.loc[row.name,
                                      us_primary_exchange_map[row['p_exchange']]], axis=1)
And your dataframe will look like:
p_symbol p_volume p_exchange p_volume_rh p_volume.1 p_volume.2 \
0 A 5000 IEXG 1000 2000 3000
1 B 4000 ASE 1000 2000 3000
2 C 3000 PSE 1000 2000 3000
3 D 2000 NAS 1000 2000 3000
4 E 1000 NYS 1000 2000 3000
p_volume.3 p_volume.4
0 4000 5000
1 4000 5000
2 4000 5000
3 4000 5000
4 4000 5000
I'm not sure it's more efficient than iterating over the rows, but I think it's prettier.
I have three data frames with different timestamps and frequencies. I want to combine them into one dataframe.
The first dataframe collects sunlight data, as given below:
df1 =
index light_data
05/01/2019 06:54:00.000 10
05/01/2019 06:55:00.000 20
05/01/2019 06:56:00.000 30
05/01/2019 06:57:00.000 40
05/01/2019 06:59:00.000 50
05/01/2019 07:01:00.000 60
05/01/2019 07:03:00.000 70
05/01/2019 07:04:00.000 80
05/01/2019 07:06:00.000 90
The second dataframe collects solar power from unit-A:
df2 =
index P1
05/01/2019 06:54:24.000 100
05/01/2019 06:59:32.000 200
05/01/2019 07:04:56.000 300
The third dataframe collects solar power from unit-B:
df3 =
index P2
05/01/2019 06:56:45.000 400
05/01/2019 07:01:21.000 500
05/01/2019 07:06:34.000 600
The above three are measurements coming from the field, and all three have different timestamps. Now I want to combine all three into one dataframe with a single timestamp.
df1 data occurs every minute
df2 and df3 occur every five minutes at different times.
Combine the three data frames using the df2 timestamp (with the seconds dropped) as the reference index.
Finally, I want the output something like as given below:
df_combine =
combine_index P1 light_data1 P2 light_data2
05/01/2019 06:54:00 100 10 400 30
05/01/2019 06:59:00 200 50 500 60
05/01/2019 07:04:00 300 80 600 90
# Note: combine_index is df2 index with no seconds
Nice question. I am using reindex with method='nearest' as method 1:
df1['row'] = df1.index
s1 = df1.reindex(df2.index, method='nearest')
s2 = df1.reindex(df3.index, method='nearest')
s1 = s1.join(df2).set_index('row')
s2 = s2.join(df3).set_index('row')
pd.concat([s1, s2.reindex(s1.index, method='nearest')], axis=1)
Out[67]:
light_data A light_data B
row
2019-05-01 06:54:00 10 100 40 400
2019-05-01 06:59:00 50 200 60 500
2019-05-01 07:04:00 80 300 90 600
Or, for the last line, using merge_asof:
pd.merge_asof(s1, s2, left_index=True, right_index=True, direction='nearest')
Out[81]:
light_data_x A light_data_y B
row
2019-05-01 06:54:00 10 100 40 400
2019-05-01 06:59:00 50 200 40 400
2019-05-01 07:04:00 80 300 90 600
Make it extendable
df1['row'] = df1.index
l = []
for i, x in enumerate([df2, df3]):
    s1 = df1.reindex(x.index, method='nearest')
    if i == 0:
        l.append(s1.join(x).set_index('row').add_suffix(x.columns[0][-1]))
    else:
        l.append(s1.join(x).set_index('row').reindex(l[0].index, method='nearest').add_suffix(x.columns[0][-1]))
pd.concat(l, axis=1)
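For completeness, a sketch of the same idea using merge_asof only, snapping everything to the df2 timestamps with the seconds dropped (this assumes df1, df2 and df3 have sorted DatetimeIndexes as in the question):

import pandas as pd

# Use df2's timestamps, floored to the minute, as the reference index,
# then pull the nearest df1 and df3 readings for each reference time.
ref = df2.copy()
ref.index = ref.index.floor('min')

combined = pd.merge_asof(ref, df1, left_index=True, right_index=True,
                         direction='nearest')
combined = pd.merge_asof(combined, df3, left_index=True, right_index=True,
                         direction='nearest')
print(combined)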
I have a dataframe with multiple columns
df = pd.DataFrame({"cylinders":[2,2,1,1],
"horsepower":[120,100,89,70],
"weight":[5400,6200,7200,1200]})
cylinders horsepower weight
0 2 120 5400
1 2 100 6200
2 1 89 7200
3 1 70 1200
I would like to create a new dataframe and make two subcolumns of weight with the median and mean while grouping it by cylinders.
example:
weight
cylinders horsepower median mean
0 1 100 5299 5000
1 1 120 5100 5200
2 2 70 7200 6500
3 2 80 1200 1000
For my example tables I have used random values. I can't manage to achieve that.
I know how to get the median and mean; it's described in this Stack Overflow question:
df.weight.median()
df.weight.mean()
df.groupby('cylinders') #groupby cylinders
But how do I create this subcolumn?
The following code fragment adds the two requested columns. It groups the rows by cylinders, calculates the mean and median of weight, and joins the result back onto the original dataframe by cylinders:
result = (df.join(df.groupby('cylinders')['weight']
                    .agg(['mean', 'median']),
                  on='cylinders')
            .sort_values(['cylinders', 'mean']))
#    cylinders  horsepower  weight    mean  median
# 2          1          89    7200  4200.0  4200.0
# 3          1          70    1200  4200.0  4200.0
# 0          2         120    5400  5800.0  5800.0
# 1          2         100    6200  5800.0  5800.0
You cannot have "subcolumns" for select columns in pandas. If a column has "subcolumns," all other columns must have "subcolumns," too. It is called multiindexing.
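If you really want the MultiIndex look from your example, a minimal sketch (my own, not from the answer above) is to give every column two levels, with an empty second level for the plain columns:

import pandas as pd

df = pd.DataFrame({"cylinders": [2, 2, 1, 1],
                   "horsepower": [120, 100, 89, 70],
                   "weight": [5400, 6200, 7200, 1200]})

stats = df.groupby('cylinders')['weight'].agg(['median', 'mean'])
result = df.join(stats, on='cylinders')

# every column gets two levels; 'median' and 'mean' become subcolumns of 'weight'
result.columns = pd.MultiIndex.from_tuples(
    [('cylinders', ''), ('horsepower', ''), ('weight', ''),
     ('weight', 'median'), ('weight', 'mean')])
print(result)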